git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / Atom feed
* [PATCH 0/6] [GSoC] Implement Corrected Commit Date
@ 2020-07-28  9:13 Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Here's what left for the PR (which I intend to take on with the second
version of pull request):

 1. Add an option to skip writing generation data chunk (to test whether new
    Git works without GDAT as intended).
 2. Handle writing to commit-graph for mismatched version (that is, merging
    all graphs into a new graph with a GDAT chunk).
 3. Update technical documentation.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
unrelated to my code changes. Still need to investigate further.

Abhishek Kumar (6):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date offset

 blame.c                       |   2 +-
 commit-graph.c                | 181 +++++++++++++++++++++-------------
 commit-graph.h                |   7 +-
 commit-reach.c                |  47 +++------
 commit-reach.h                |   2 +-
 commit.c                      |   9 +-
 commit.h                      |   3 +
 revision.c                    |  17 ++--
 t/helper/test-read-graph.c    |   2 +
 t/t4216-log-bloom.sh          |   4 +-
 t/t5000-tar-tree.sh           |   4 +-
 t/t5318-commit-graph.sh       |  21 ++--
 t/t5324-split-commit-graph.sh |  12 +--
 upload-pack.c                 |   2 +-
 14 files changed, 178 insertions(+), 135 deletions(-)


base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/676
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
  2020-08-04  0:46   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With 3d112755 (commit-graph: examine commits by generation number), Git
knew to sort by generation number before examining the diff when not
using pack order. c49c82aa (commit: move members graph_pos, generation
to a slab, 2020-06-17) moved generation number into a slab and
introduced a helper which returns GENERATION_NUMBER_INFINITY when
writing the graph. Sorting is no longer useful and essentially reverts
the earlier commit.

Let's fix this by accessing generation number directly through the slab.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1af68c297d..5d3c9bd23c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
+
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
  2020-08-05 23:16   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 6aa7f4f567..23287d26c3 100644
--- a/revision.c
+++ b/revision.c
@@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return ;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number and has undefined behavior. The test used to pass as
fill_commit_in_graph() did not read commit time from commit graph,
reading commit date from odb instead.

Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 31 ++++++++++++-------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5d3c9bd23c..204eb454b2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
+	fill_commit_graph_info(item, g, pos);
+
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
-
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:03   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 204eb454b2..1c98f38d69 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..98cc5a3b9d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,4 +145,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 7128895c3a..bed63b41fb 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:12   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

One of the essential pre-requisites before implementing generation
number as to distinguish between generation numbers v1 and v2 while
still being compatible with old Git.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. Newer versions of Git can parse GDAT and
take advantage of newer generation numbers, falling back to topological
levels when GDAT chunk is missing (as it would happen with a commit
graph written by old Git).

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 33 +++++++++++++++++++++++++++++----
 commit-graph.h                |  1 +
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 ++--
 t/t5318-commit-graph.sh       | 19 +++++++++++--------
 t/t5324-split-commit-graph.sh | 12 ++++++------
 6 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1c98f38d69..ab714f4a76 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 	}
 }
 
+static void write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	struct commit **list = ctx->commits.list;
+	int count;
+	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+	}
+}
+
 static void write_graph_chunk_extra_edges(struct hashfile *f,
 					  struct write_commit_graph_context *ctx)
 {
@@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
+	int num_chunks = 4;
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
@@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
 	if (ctx->num_extra_edges) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
@@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
 
-	num_chunks = 3;
+	num_chunks = 4;
 	if (ctx->num_extra_edges) {
 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
 						4 * ctx->num_extra_edges;
@@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_fanout(f, ctx);
 	write_graph_chunk_oids(f, hashsz, ctx);
 	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_generation_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
diff --git a/commit-graph.h b/commit-graph.h
index 98cc5a3b9d..e3d4ba96f4 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -67,6 +67,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c855bcd3e7..780855e691 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 26f332d6a3..3ec5248d70 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 
 graph_read_expect() {
 	OPTIONAL=""
-	NUM_CHUNKS=3
+	NUM_CHUNKS=4
 	if test ! -z $2
 	then
 		OPTIONAL=" $2"
-		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
 	fi
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
 GRAPH_CHUNK_LOOKUP_OFFSET=8
 GRAPH_CHUNK_LOOKUP_WIDTH=12
-GRAPH_CHUNK_LOOKUP_ROWS=5
+GRAPH_CHUNK_LOOKUP_ROWS=6
 GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
 GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
 			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
@@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
 GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
-GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
 GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
 GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
-GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
-			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
+				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_WIDTH=4
+GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
+GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
+			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
 GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
 GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
 
@@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
 '
 
 test_expect_success 'detect incorrect generation number' '
-	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 269d0964a3..096a96ec41 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_init &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -29,9 +29,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With preparations done, let's implement corrected commit date offset.
We add a new commit-slab to store topological levels while writing
commit graph and upgrade number of struct commit_graph_data to 64-bits.

We have to touch many files, upgrading generation number from uint32_t
to timestamp_t.

We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
which tests if verify can detect if a commit graph have
GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
one of dates is Unix epoch zero.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 blame.c                 |   2 +-
 commit-graph.c          | 109 ++++++++++++++++++++++------------------
 commit-graph.h          |   4 +-
 commit-reach.c          |  32 ++++++------
 commit-reach.h          |   2 +-
 commit.h                |   3 ++
 revision.c              |  14 +++---
 t/t5318-commit-graph.sh |   2 +-
 upload-pack.c           |   2 +-
 9 files changed, 93 insertions(+), 77 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..48aa632461 100644
--- a/blame.c
+++ b/blame.c
@@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
 	if (!bd)
 		return 1;
 
-	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
 		return 1;
 
 	filter = get_bloom_filter(r, origin->commit, 0);
diff --git a/commit-graph.c b/commit-graph.c
index ab714f4a76..9647d9f0df 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,6 +65,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -100,15 +102,15 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
 
 	if (!data)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 	else if (data->graph_pos == COMMIT_NOT_FROM_GRAPH)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 
 	return data->generation;
 }
@@ -116,8 +118,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -160,8 +162,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	timestamp_t generation_b = commit_graph_data_at(b)->generation;
 
 	/* lower generation commits first */
 	if (generation_a < generation_b)
@@ -169,11 +171,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -777,8 +774,13 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->chunk_generation_data)
-		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	{
+		/* Read corrected commit date offset from GDAT */
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	}
 	else
+		/* Read topological level from CDAT */
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -950,6 +952,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	struct topo_level_slab *topo_levels;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1102,7 +1105,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1117,8 +1120,13 @@ static void write_graph_chunk_generation_data(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		timestamp_t offset = commit_graph_data_at(*list)->generation - (*list)->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+
+		hashwrite_be32(f, offset);
 	}
 }
 
@@ -1316,7 +1324,7 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static void compute_generation_numbers(struct write_commit_graph_context *ctx)
+static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit_list *list = NULL;
@@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (topo_level != GENERATION_NUMBER_INFINITY &&
+		    topo_level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = current->date;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (topo_level == GENERATION_NUMBER_INFINITY ||
+				    topo_level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else {
+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
+
+					if (topo_level > max_level)
+						max_level = topo_level;
+
+					if (data->generation > max_corrected_commit_date)
+						max_corrected_commit_date = data->generation;
 				}
 			}
 
 			if (all_parents_computed) {
 				struct commit_graph_data *data = commit_graph_data_at(current);
 
-				data->generation = max_generation + 1;
-				pop_commit(&list);
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+				data->generation = max_corrected_commit_date + 1;
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				pop_commit(&list);
 			}
 		}
 	}
@@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
@@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
-	compute_generation_numbers(ctx);
+	compute_corrected_commit_date_offsets(ctx);
 
 	if (ctx->changed_paths)
 		compute_bloom_filters(ctx);
@@ -2325,8 +2346,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_parent_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2365,9 +2386,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_parent_corrected_commit_date)
+				max_parent_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2389,20 +2410,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_parent_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
diff --git a/commit-graph.h b/commit-graph.h
index e3d4ba96f4..20c5848587 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..2ce9867ff3 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_V2_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_V2_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..dd17a81672 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,9 @@
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
+#define GENERATION_NUMBER_V2_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
diff --git a/revision.c b/revision.c
index 23287d26c3..b978e79601 100644
--- a/revision.c
+++ b/revision.c
@@ -725,7 +725,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (!revs->repo->objects->commit_graph)
 		return -1;
 
-	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(commit) == GENERATION_NUMBER_V2_INFINITY)
 		return -1;
 
 	filter = get_bloom_filter(revs->repo, commit, 0);
@@ -3270,7 +3270,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3316,7 +3316,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3359,7 +3359,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3414,10 +3414,10 @@ static void init_topo_walk(struct rev_info *revs)
 	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
 	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
 
-	info->min_generation = GENERATION_NUMBER_INFINITY;
+	info->min_generation = GENERATION_NUMBER_V2_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3478,7 +3478,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3ec5248d70..43801f07a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -596,7 +596,7 @@ test_expect_success 'detect incorrect generation number' '
 		"generation for commit"
 '
 
-test_expect_success 'detect incorrect generation number' '
+test_expect_failure 'detect incorrect generation number' '
 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
diff --git a/upload-pack.c b/upload-pack.c
index 951a2b23aa..db2332e687 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -489,7 +489,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:00   ` Derrick Stolee
  2020-07-28 15:30     ` Taylor Blau
  2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:00 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

You mentioned this in your blog post. I'm sorry that such a small
issue caused you pain. Perhaps you could summarize a little bit of
how that investigation led you to find this issue?

Question: is this something that is only necessary when we change
the generation number, or is it something that is only _exposed_
by the test suite when we change the generation number? It seems that
it is likely to be an existing bug, but it might be hard to expose
in a test case.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>  
> +		if (parse_commit_gently(parent, 1) < 0)
> +			return ;

Drop the extra space.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
                       ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
> 
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> generation number and has undefined behavior. The test used to pass as
> fill_commit_in_graph() did not read commit time from commit graph,
> reading commit date from odb instead.

I was first confused as to why fill_commit_graph_info() did not
load the timestamp, but the reason is that it is only used by
two methods:

1. fill_commit_in_graph(): this actually leaves the commit in a
   "parsed" state, so the date must be correct. Thus, it parses
   the date out of the commit-graph.

2. load_commit_graph_info(): this only helps to guarantee we
   know the graph_pos and generation number values.

Perhaps add this extra context: you will _need_ the commit date
from the commit-graph in order to populate the generation number
v2 in fill_commit_graph_info().

> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

The timestamp limit placed in the commit-graph is more restrictive
than 64-bit timestamps, but as your test points out, the maximum
timestamp allowed takes place in the year 2514. That is far enough
away for all real data.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 31 ++++++++++++-------------------
>  t/t5000-tar-tree.sh |  4 ++--
>  2 files changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d3c9bd23c..204eb454b2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>  
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>  
> @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>  
> +	fill_commit_graph_info(item, g, pos);
> +
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;

This 'while' loop happens in both implementations, so you could
save a miniscule amount of time by placing the call to
fill_commit_graph_info() after the while loop.

> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));

> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;
>  	lex_index = pos - g->num_commits_in_base;
> -
> -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;

I was about to complain about this change, but GRAPH_DATA_WIDTH
is a macro that does an equivalent thing (except the_hash_algo->rawsz
instead of g->hash_len).

>  
>  	item->object.parsed = 1;
>  
>  	set_commit_tree(item, NULL);
>  
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 37655a237c..1986354fc3 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
>  		git commit -m "tempori parendum"
>  '
>  
> @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
>  '
>  
>  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> -	echo 4147 >expect &&
> +	echo 2514 >expect &&
>  	tar_info future.tar | cut -d" " -f2 >actual &&
>  	test_cmp expect actual
>  '
> 

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 14:54 ` Taylor Blau
  2020-07-30  7:47   ` Abhishek Kumar
  2020-07-28 16:35 ` Derrick Stolee
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  8 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 14:54 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

Hi Abhishek,

On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Very exciting. I have been eagerly following your blog and asking
Stolee about your progress, so I am excited to read these patches.

> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

Interestingly, we use a very similar metric at GitHub to sort commits in
various UI views which have lots of existing machinery that sorts
an abstract collection by each element's "date". Since that sort is
stable, and we want to respect the order that Git delivered, we take the
pairwise max of each successive pair of commits.

> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

I'm sure that I'll learn more when I get to this point, but I would like
to hear more about why you want to store the offset rather than the
corrected commit date itself. It seems that the offset could be either
positive or negative, so you'd only have the range of a signed integer
(rather than storing 8 bytes of a time_t for the full breadth of
possibilities).

I know also that Peff is working on negative timestamp support, so I
would want to hear about what he thinks of this, too.

> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
>
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This will be good to gradually roll-out the new chunk. Another thought
is to control whether or not the commit-graph machinery _reads_ this
chunk if it's present. That can be useful for debugging too (eg., I have
a commit-graph with a GDAT chunk that is broken in some way, what
happens if I don't read that chunk?)

Maybe something like `commitgraph.readsGenerationData`? Incidentally,
I'm preparing a `commitgraph.readsChangedPaths` to control whether or
not we read the Bloom index and data chunks. I'll send that to the list
shortly (it's in my fork somewhere if you want an earlier look), but
that may be a useful reference for you.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).
>  3. Update technical documentation.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
> unrelated to my code changes. Still need to investigate further.
>
> Abhishek Kumar (6):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: implement generation data chunk
>   commit-graph: implement corrected commit date offset
>
>  blame.c                       |   2 +-
>  commit-graph.c                | 181 +++++++++++++++++++++-------------
>  commit-graph.h                |   7 +-
>  commit-reach.c                |  47 +++------
>  commit-reach.h                |   2 +-
>  commit.c                      |   9 +-
>  commit.h                      |   3 +
>  revision.c                    |  17 ++--
>  t/helper/test-read-graph.c    |   2 +
>  t/t4216-log-bloom.sh          |   4 +-
>  t/t5000-tar-tree.sh           |   4 +-
>  t/t5318-commit-graph.sh       |  21 ++--
>  t/t5324-split-commit-graph.sh |  12 +--
>  upload-pack.c                 |   2 +-
>  14 files changed, 178 insertions(+), 135 deletions(-)
>
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
@ 2020-07-28 15:19     ` René Scharfe
  2020-07-28 15:58       ` Derrick Stolee
  2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 1 reply; 107+ messages in thread
From: René Scharfe @ 2020-07-28 15:19 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git, Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

[Had to remove stolee@gmail.com because with it my mail provider
 rejected this email with the following error message:

   Requested action not taken: mailbox unavailable
   invalid DNS MX or A/AAAA resource record.]

Am 28.07.20 um 15:14 schrieb Derrick Stolee:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>
>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>> information present in commit data chunk. Let's simplify the
>> implementation by calling fill_commit_graph_info() within
>> fill_commit_in_graph().
>>
>> The test 'generate tar with future mtime' creates a commit with commit
>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>> generation number and has undefined behavior. The test used to pass as
>> fill_commit_in_graph() did not read commit time from commit graph,
>> reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.

We all may feel like the end of the world is imminent, but do we really
need to set such an arbitrary limit?  OK, that limit was already set two
years ago, and I'm really late.  But still: It's sad to see anything
else than signed 64-bit timestamps to be used in fresh code (after Y2K).
The extra four bytes would fatten up the structures less than the
transition from SHA-1 to SHA-256 will, and no bit twiddling would be
required.  *sigh*

René

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:28   ` Taylor Blau
  2020-07-30  5:24     ` Abhishek Kumar
  2020-08-04  0:46   ` Jakub Narębski
  1 sibling, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 15:28 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.

This last sentence is slightly confusing. Do you think it would be more
clear if you said elaborated a bit? Perhaps something like:

  [...]

  commit_gen_cmp is used when writing a commit-graph to sort commits in
  generation order before computing Bloom filters. Since c49c82aa made
  it so that 'commit_graph_generation()' returns
  'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
  this function. Instead, access the generation number directly through
  the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
  order to access it while writing.

I think the above would be a good extra paragraph in the commit message
provided that you remove the sentence beginning with "Sorting is no
longer useful..."

> Let's fix this by accessing generation number directly through the slab.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +

Nit; this whitespace diff is extraneous, but it's not hurting anything
either. Since it looks like you're rerolling anyway, it would be good to
just get rid of it.

Otherwise this fix makes sense to me.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-07-28 15:30     ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 15:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 09:00:51AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In indegree_walk_step(), we add unvisited parents to the indegree queue.
> > However, parents are not guaranteed to be parsed. As the indegree queue
> > sorts by generation number, let's parse parents before inserting them to
> > ensure the correct priority order.
>
> You mentioned this in your blog post. I'm sorry that such a small
> issue caused you pain. Perhaps you could summarize a little bit of
> how that investigation led you to find this issue?

Indeed ;-). I feel like forgetting to call 'parse_commit_gently()' is a
rite of passage for this part of the code in some sense.

> Question: is this something that is only necessary when we change
> the generation number, or is it something that is only _exposed_
> by the test suite when we change the generation number? It seems that
> it is likely to be an existing bug, but it might be hard to expose
> in a test case.

I tend to agree that this bug probably existed before Abhishek's
changes, but that it's probably more trouble than it's worth to tickle
with a test case. So, I'd be fine with this fix as it is (provided that
the style nit is addressed below, too).

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  revision.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/revision.c b/revision.c
> > index 6aa7f4f567..23287d26c3 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
> >  		struct commit *parent = p->item;
> >  		int *pi = indegree_slab_at(&info->indegree, parent);
> >
> > +		if (parse_commit_gently(parent, 1) < 0)
> > +			return ;
>
> Drop the extra space.
>
> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  0 siblings, 2 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git; +Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With preparations done,...

I feel like this commit could have been made smaller by doing the
uint32_t -> timestamp_t conversion in a separate patch. That would
make it easier to focus on the changes to the generation number v2
logic.

> let's implement corrected commit date offset.
> We add a new commit-slab to store topological levels while writing

It's important to add: we store topological levels to ensure that older
versions of Git will still have the performance benefits from generation
number v1.

> commit graph and upgrade number of struct commit_graph_data to 64-bits.

Do you mean "update the generation member in struct commit_graph_data
to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
member.

> We have to touch many files, upgrading generation number from uint32_t
> to timestamp_t.

Yes, that's why I recommend doing that in a different step.

> We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> which tests if verify can detect if a commit graph have
> GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> one of dates is Unix epoch zero.

What about the topological levels? Are we caring about verifying the data
that we start to ignore in this new version? I'm hesitant to drop this
right now, but I'm open to it if we really don't see it as a valuable test.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  blame.c                 |   2 +-
>  commit-graph.c          | 109 ++++++++++++++++++++++------------------
>  commit-graph.h          |   4 +-
>  commit-reach.c          |  32 ++++++------
>  commit-reach.h          |   2 +-
>  commit.h                |   3 ++
>  revision.c              |  14 +++---
>  t/t5318-commit-graph.sh |   2 +-
>  upload-pack.c           |   2 +-
>  9 files changed, 93 insertions(+), 77 deletions(-)
> 
> diff --git a/blame.c b/blame.c
> index 82fa16d658..48aa632461 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
>  	if (!bd)
>  		return 1;
>  
> -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
>  		return 1;

I don't see value in changing the name of this macro. It
is only used as the default value for a commit not in the
commit-graph. Changing its value to 0xFFFFFFFF works for
both versions when the type is updated to timestamp_t.

The actually-important change in this patch (not just the
type change) is here:

> -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
>  {
>  	int i;
>  	struct commit_list *list = NULL;
> @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>  
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> +		    topo_level != GENERATION_NUMBER_ZERO)
>  			continue;

Here, our "skip" condition is that the topo_level has been computed.
This should be fine, as we are never reading that out of the commit-graph.
We will never be in a mode where topo_level is computed but corrected
commit-date is not.

>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = current->date;

Later you assign data->generation to be "max_corrected_commit_date + 1",
which made me think this should be "current->date - 1". Is that so? Or,
do we want most offsets to be one instead of zero? Is there value there?

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>  
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> +				    topo_level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else {
> +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> +
> +					if (topo_level > max_level)
> +						max_level = topo_level;
> +
> +					if (data->generation > max_corrected_commit_date)
> +						max_corrected_commit_date = data->generation;
>  				}
>  			}
>  
>  			if (all_parents_computed) {
>  				struct commit_graph_data *data = commit_graph_data_at(current);
>  
> -				data->generation = max_generation + 1;
> -				pop_commit(&list);
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +				data->generation = max_corrected_commit_date + 1;
>  
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				pop_commit(&list);
>  			}
>  		}
>  	}

This looks correct, and I've done a tiny bit of perf tests locally.

> @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>  
>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>  	ctx->total_bloom_filter_data_size = 0;
>  
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
>  	if (ctx->split) {
>  		struct commit_graph *g;
>  		prepare_commit_graph(ctx->r);
> @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
>  	} else
>  		ctx->num_commit_graphs_after = 1;
>  
> -	compute_generation_numbers(ctx);
> +	compute_corrected_commit_date_offsets(ctx);

This rename might not be necessary. You are computing both
versions (v1 and v2) so the name change is actually less
accurate than the old name.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 15:58       ` Derrick Stolee
  0 siblings, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:58 UTC (permalink / raw)
  To: René Scharfe, Abhishek Kumar via GitGitGadget, git, Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 11:19 AM, René Scharfe wrote:
> [Had to remove stolee@gmail.com because with it my mail provider
>  rejected this email with the following error message:
> 
>    Requested action not taken: mailbox unavailable
>    invalid DNS MX or A/AAAA resource record.]
> 
> Am 28.07.20 um 15:14 schrieb Derrick Stolee:
>> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number and has undefined behavior. The test used to pass as
>>> fill_commit_in_graph() did not read commit time from commit graph,
>>> reading commit date from odb instead.
>>
>> I was first confused as to why fill_commit_graph_info() did not
>> load the timestamp, but the reason is that it is only used by
>> two methods:
>>
>> 1. fill_commit_in_graph(): this actually leaves the commit in a
>>    "parsed" state, so the date must be correct. Thus, it parses
>>    the date out of the commit-graph.
>>
>> 2. load_commit_graph_info(): this only helps to guarantee we
>>    know the graph_pos and generation number values.
>>
>> Perhaps add this extra context: you will _need_ the commit date
>> from the commit-graph in order to populate the generation number
>> v2 in fill_commit_graph_info().
>>
>>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>>
>> The timestamp limit placed in the commit-graph is more restrictive
>> than 64-bit timestamps, but as your test points out, the maximum
>> timestamp allowed takes place in the year 2514. That is far enough
>> away for all real data.
> 
> We all may feel like the end of the world is imminent, but do we really
> need to set such an arbitrary limit?  OK, that limit was already set two
> years ago, and I'm really late.  But still: It's sad to see anything
> else than signed 64-bit timestamps to be used in fresh code (after Y2K).
> The extra four bytes would fatten up the structures less than the
> transition from SHA-1 to SHA-256 will, and no bit twiddling would be
> required.  *sigh*

One thing to consider after generation number v2 is out long enough
is if we could drop the topo-levels and write zeroes for the topo-
level portion. This was valid data in the first version of the
commit-graph, so it would still be valid. Then, we could allow
full 64-bit timestamps again.

This is something to think about again in a year, maybe.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 16:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
>
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
>
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
>
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
>
> >
> >  	item->object.parsed = 1;
> >
> >  	set_commit_tree(item, NULL);
> >
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> >
>
> Thanks,
> -Stolee

Agreed with Stolee's review, but otherwise this looks like a faithful
transformation.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:03   ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 16:03 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:49AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 15 +++++++++++++++
>  commit-graph.h |  2 ++
>  commit-reach.c | 15 ---------------
>  commit.c       |  9 +++------
>  4 files changed, 20 insertions(+), 21 deletions(-)

All looks good to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:12   ` Taylor Blau
  2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 16:12 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> One of the essential pre-requisites before implementing generation
> number as to distinguish between generation numbers v1 and v2 while

s/as/is

> still being compatible with old Git.

Maybe you could add a section here to talk about why this is needed
specifically? That is, you mention it's a prerequisite, but a reader in
a year or two may not remember why. Adding that information here would
be good.

> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.
>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. Newer versions of Git can parse GDAT and
> take advantage of newer generation numbers, falling back to topological
> levels when GDAT chunk is missing (as it would happen with a commit
> graph written by old Git).

...this is exactly the paragraph that I was looking for above. Could you
swap the order of these last two paragraphs? I think that it would make
the patch message far clearer.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
>  commit-graph.h                |  1 +
>  t/helper/test-read-graph.c    |  2 ++
>  t/t4216-log-bloom.sh          |  4 ++--
>  t/t5318-commit-graph.sh       | 19 +++++++++++--------
>  t/t5324-split-commit-graph.sh | 12 ++++++------
>  6 files changed, 51 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1c98f38d69..ab714f4a76 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8

Ugh. I am simultaneously working on a new chunk myself (so a bad
conflict resolution would look at both of us incrementing this number
to the same value without generating a conflict.)

I think the right thing to do here would be to define an enum over chunk
names, and then index an array by that enum (where the value at each
index is the chunk identifier). Then, the last value of that enum would
be a '__COUNT' which you could use to initialize the array (as well as
within the commit-graph writing routines).

Anyway, I think that it's probably not worth it in the meantime, but it
is something that Junio should look out for when merging (if yours and
my topic happen to get merged around the same time, which they may not).

>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +
>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data)
> +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  	}
>  }
>
> +static void write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	int count;
> +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> +	}
> +}
> +

This pointer arithmetic is not necessary. Why not like:

  int i;
  for (i = 0; i < ctx->commits.nr; i++) {
    struct commit *c = ctx->commits.list[i];
    display_progress(ctx->progress, ++ctx->progress_cnt);
    hashwrite_be32(f, commit_graph_data_at(c)->generation);
  }

instead?

>  static void write_graph_chunk_extra_edges(struct hashfile *f,
>  					  struct write_commit_graph_context *ctx)
>  {
> @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
> -	int num_chunks = 3;
> +	int num_chunks = 4;
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
>  	if (ctx->num_extra_edges) {
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
>  		num_chunks++;
> @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
>  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
>
> -	num_chunks = 3;
> +	num_chunks = 4;
>  	if (ctx->num_extra_edges) {
>  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
>  						4 * ctx->num_extra_edges;
> @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	write_graph_chunk_fanout(f, ctx);
>  	write_graph_chunk_oids(f, hashsz, ctx);
>  	write_graph_chunk_data(f, hashsz, ctx);
> +	write_graph_chunk_generation_data(f, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
>  	if (ctx->changed_paths) {
> diff --git a/commit-graph.h b/commit-graph.h
> index 98cc5a3b9d..e3d4ba96f4 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -67,6 +67,7 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..1c2a5366c7 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c855bcd3e7..780855e691 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 26f332d6a3..3ec5248d70 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>
>  graph_read_expect() {
>  	OPTIONAL=""
> -	NUM_CHUNKS=3
> +	NUM_CHUNKS=4
>  	if test ! -z $2
>  	then
>  		OPTIONAL=" $2"
> -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
>  	fi
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
>  GRAPH_BYTE_CHUNK_COUNT=6
>  GRAPH_CHUNK_LOOKUP_OFFSET=8
>  GRAPH_CHUNK_LOOKUP_WIDTH=12
> -GRAPH_CHUNK_LOOKUP_ROWS=5
> +GRAPH_CHUNK_LOOKUP_ROWS=6
>  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
>  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
>  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
>  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
>  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
>  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
>  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_WIDTH=4
> +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
>  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
>  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
>
> @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
>  '
>
>  test_expect_success 'detect incorrect generation number' '
> -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
>  		"non-zero generation number"
>  '
>
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 269d0964a3..096a96ec41 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_init &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  	EOM
>  '
>
> @@ -29,9 +29,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> --
> gitgitgadget
>

All of this looks good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
@ 2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2020-07-28 16:23 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With preparations done,...
>
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.

Yep, agreed.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
>
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
>
> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
>
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
>
> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
>
> Yes, that's why I recommend doing that in a different step.
>
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
>
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  blame.c                 |   2 +-
> >  commit-graph.c          | 109 ++++++++++++++++++++++------------------
> >  commit-graph.h          |   4 +-
> >  commit-reach.c          |  32 ++++++------
> >  commit-reach.h          |   2 +-
> >  commit.h                |   3 ++
> >  revision.c              |  14 +++---
> >  t/t5318-commit-graph.sh |   2 +-
> >  upload-pack.c           |   2 +-
> >  9 files changed, 93 insertions(+), 77 deletions(-)
> >
> > diff --git a/blame.c b/blame.c
> > index 82fa16d658..48aa632461 100644
> > --- a/blame.c
> > +++ b/blame.c
> > @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
> >  	if (!bd)
> >  		return 1;
> >
> > -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> > +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
> >  		return 1;
>
> I don't see value in changing the name of this macro. It
> is only used as the default value for a commit not in the
> commit-graph. Changing its value to 0xFFFFFFFF works for
> both versions when the type is updated to timestamp_t.
>
> The actually-important change in this patch (not just the
> type change) is here:
>
> > -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
> >  {
> >  	int i;
> >  	struct commit_list *list = NULL;
> > @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> > +		    topo_level != GENERATION_NUMBER_ZERO)
> >  			continue;
>
> Here, our "skip" condition is that the topo_level has been computed.
> This should be fine, as we are never reading that out of the commit-graph.
> We will never be in a mode where topo_level is computed but corrected
> commit-date is not.
>
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = current->date;
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
>
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> > +				    topo_level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else {
> > +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> > +
> > +					if (topo_level > max_level)
> > +						max_level = topo_level;
> > +
> > +					if (data->generation > max_corrected_commit_date)
> > +						max_corrected_commit_date = data->generation;
> >  				}
> >  			}
> >
> >  			if (all_parents_computed) {
> >  				struct commit_graph_data *data = commit_graph_data_at(current);
> >
> > -				data->generation = max_generation + 1;
> > -				pop_commit(&list);
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +				data->generation = max_corrected_commit_date + 1;
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				pop_commit(&list);
> >  			}
> >  		}
> >  	}
>
> This looks correct, and I've done a tiny bit of perf tests locally.
>
> > @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
> >  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> >  	ctx->total_bloom_filter_data_size = 0;
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> >  	if (ctx->split) {
> >  		struct commit_graph *g;
> >  		prepare_commit_graph(ctx->r);
> > @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	} else
> >  		ctx->num_commit_graphs_after = 1;
> >
> > -	compute_generation_numbers(ctx);
> > +	compute_corrected_commit_date_offsets(ctx);
>
> This rename might not be necessary. You are computing both
> versions (v1 and v2) so the name change is actually less
> accurate than the old name.
>
> Thanks,
> -Stolee

I don't have anything to add that Stolee hasn't already pointed out.
Thanks for your work on this series, and I'm looking forward to another
reroll.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-28 16:35 ` Derrick Stolee
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  8 siblings, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-07-28 16:35 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Abhishek Kumar, Taylor Blau

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
> 
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This would be a good idea, if only as a GIT_TEST_* environment variable.
I think it important we have a test for the compatibility scenario where
we have an "old" commit-graph with the new code and test that reading and
writing still works properly.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).

This is an excellent thing to do. There are a few options when writing an
incremental commit-graph when the base graphs do not have the GDAT chunk:

   i. Do not write the GDAT chunk unless we are merging all levels
      (based on the merging strategy).

  ii. Merge all levels, then write the GDAT chunk.

>  3. Update technical documentation.

Yes, I was going to ask for a patch that updates
Documentation/technical/commit-graph-format.txt.

This is an excellent v1. A lot of small things, but no
really big issues.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28 15:28   ` Taylor Blau
@ 2020-07-30  5:24     ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-07-30  5:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, abhishekkumar8222, gitgitgadget, jnareb, stolee

On Tue, Jul 28, 2020 at 11:28:44AM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> 
> This last sentence is slightly confusing. Do you think it would be more
> clear if you said elaborated a bit? Perhaps something like:
> 
>   [...]
> 
>   commit_gen_cmp is used when writing a commit-graph to sort commits in
>   generation order before computing Bloom filters. Since c49c82aa made
>   it so that 'commit_graph_generation()' returns
>   'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
>   this function. Instead, access the generation number directly through
>   the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
>   order to access it while writing.
> 

Thanks! That is clearer. Will change.

> I think the above would be a good extra paragraph in the commit message
> provided that you remove the sentence beginning with "Sorting is no
> longer useful..."
> 
> > Let's fix this by accessing generation number directly through the slab.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> 
> Nit; this whitespace diff is extraneous, but it's not hurting anything
> either. Since it looks like you're rerolling anyway, it would be good to
> just get rid of it.
> 
> Otherwise this fix makes sense to me.
> 
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > --
> > gitgitgadget
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
  2020-07-28 16:01     ` Taylor Blau
@ 2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> > 
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
> 
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
> 
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
> 
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
> 
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().

Thanks, that makes sense. I have revised the commit message to:

commit-graph: consolidate fill_commit_graph_info
    
    Both fill_commit_graph_info() and fill_commit_in_graph() parse
    information present in commit data chunk. Let's simplify the
    implementation by calling fill_commit_graph_info() within
    fill_commit_in_graph().
    
    The test 'generate tar with future mtime' creates a commit with commit
    time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
    generation number (within CDAT chunk) and has undefined behavior.
    
    The test used to pass as fill_commit_in_graph() guarantees the values of
    graph position and generation number, and did not load timestamp.
    However, with corrected commit date we will need load the timestamp as
    well to populate the generation number.
> 
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
> 
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
> 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >  
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >  
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >  
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >  
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> 
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
> 
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
> 
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> 
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
> 
> >  
> >  	item->object.parsed = 1;
> >  
> >  	set_commit_tree(item, NULL);
> >  
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >  
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >  
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >  
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> > 
> 
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28 16:12   ` Taylor Blau
@ 2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: dstolee, jnareb, git, abhishekkumar8222, gitgitgadget

On Tue, Jul 28, 2020 at 12:12:50PM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > One of the essential pre-requisites before implementing generation
> > number as to distinguish between generation numbers v1 and v2 while
> 
> s/as/is
> 
> > still being compatible with old Git.
> 
> Maybe you could add a section here to talk about why this is needed
> specifically? That is, you mention it's a prerequisite, but a reader in
> a year or two may not remember why. Adding that information here would
> be good.
> 
> > We are going to introduce a new chunk called Generation Data chunk (or
> > GDAT). GDAT stores generation number v2 (and any subsequent versions),
> > whereas CDAT will still store topological level.
> >
> > Old Git does not understand GDAT chunk and would ignore it, reading
> > topological levels from CDAT. Newer versions of Git can parse GDAT and
> > take advantage of newer generation numbers, falling back to topological
> > levels when GDAT chunk is missing (as it would happen with a commit
> > graph written by old Git).
> 
> ...this is exactly the paragraph that I was looking for above. Could you
> swap the order of these last two paragraphs? I think that it would make
> the patch message far clearer.

Here's revised commit message:

  commit-graph: implement generation data chunk
    
  As discovered by Ævar, we cannot increment graph version to
  distinguish between generation numbers v1 and v2 [1]. Thus, one of
  pre-requistes before implementing generation number v2 was to
  distinguish generation numbers in a backwards compatible manner
  without increment graph version.
  
  We are going to introduce a new chunk called Generation Data chunk (or
  GDAT). GDAT stores generation number v2 (and any subsequent versions),
  whereas CDAT will still store topological level.
  
  Old Git does not understand GDAT chunk and would ignore it, reading
  topological levels from CDAT. New Git can parse GDAT and take advantage
  of newer generation numbers, falling back to topological levels when
  GDAT chunk is missing (as it would happen with a commit graph written
  by old Git).
 
  [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

First paragraph explains why we need this patch (cannot increment graph
version) second explains what this patch does (introduce a new chunk)
and third proves why it works (Old Git ignores GDAT, New Git parses GDAT).

Can we improve this commit message further? 

> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
> >  commit-graph.h                |  1 +
> >  t/helper/test-read-graph.c    |  2 ++
> >  t/t4216-log-bloom.sh          |  4 ++--
> >  t/t5318-commit-graph.sh       | 19 +++++++++++--------
> >  t/t5324-split-commit-graph.sh | 12 ++++++------
> >  6 files changed, 51 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1c98f38d69..ab714f4a76 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
> >  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> >  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> >  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> >  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> >  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> >  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> >  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> > -#define MAX_NUM_CHUNKS 7
> > +#define MAX_NUM_CHUNKS 8
> 
> Ugh. I am simultaneously working on a new chunk myself (so a bad
> conflict resolution would look at both of us incrementing this number
> to the same value without generating a conflict.)
> 
> I think the right thing to do here would be to define an enum over chunk
> names, and then index an array by that enum (where the value at each
> index is the chunk identifier). Then, the last value of that enum would
> be a '__COUNT' which you could use to initialize the array (as well as
> within the commit-graph writing routines).
> 
> Anyway, I think that it's probably not worth it in the meantime, but it
> is something that Junio should look out for when merging (if yours and
> my topic happen to get merged around the same time, which they may not).
> 
> >  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
> >
> > @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
> >  				graph->chunk_commit_data = data + chunk_offset;
> >  			break;
> >
> > +		case GRAPH_CHUNKID_GENERATION_DATA:
> > +			if (graph->chunk_generation_data)
> > +				chunk_repeated = 1;
> > +			else
> > +				graph->chunk_generation_data = data + chunk_offset;
> > +			break;
> > +
> >  		case GRAPH_CHUNKID_EXTRAEDGES:
> >  			if (graph->chunk_extra_edges)
> >  				chunk_repeated = 1;
> > @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	date_low = get_be32(commit_data + g->hash_len + 12);
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +	if (g->chunk_generation_data)
> > +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> > +	else
> > +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> >  	}
> >  }
> >
> > +static void write_graph_chunk_generation_data(struct hashfile *f,
> > +					      struct write_commit_graph_context *ctx)
> > +{
> > +	struct commit **list = ctx->commits.list;
> > +	int count;
> > +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> > +		display_progress(ctx->progress, ++ctx->progress_cnt);
> > +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> > +	}
> > +}
> > +
> 
> This pointer arithmetic is not necessary. Why not like:
> 
>   int i;
>   for (i = 0; i < ctx->commits.nr; i++) {
>     struct commit *c = ctx->commits.list[i];
>     display_progress(ctx->progress, ++ctx->progress_cnt);
>     hashwrite_be32(f, commit_graph_data_at(c)->generation);
>   }
> 
> instead?
> 
> >  static void write_graph_chunk_extra_edges(struct hashfile *f,
> >  					  struct write_commit_graph_context *ctx)
> >  {
> > @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
> >  	const unsigned hashsz = the_hash_algo->rawsz;
> >  	struct strbuf progress_title = STRBUF_INIT;
> > -	int num_chunks = 3;
> > +	int num_chunks = 4;
> >  	struct object_id file_hash;
> >  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> >
> > @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> >  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> >  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> > +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
> >  		num_chunks++;
> > @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> >  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
> >  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> > +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
> >
> > -	num_chunks = 3;
> > +	num_chunks = 4;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> >  						4 * ctx->num_extra_edges;
> > @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	write_graph_chunk_fanout(f, ctx);
> >  	write_graph_chunk_oids(f, hashsz, ctx);
> >  	write_graph_chunk_data(f, hashsz, ctx);
> > +	write_graph_chunk_generation_data(f, ctx);
> >  	if (ctx->num_extra_edges)
> >  		write_graph_chunk_extra_edges(f, ctx);
> >  	if (ctx->changed_paths) {
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 98cc5a3b9d..e3d4ba96f4 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -67,6 +67,7 @@ struct commit_graph {
> >  	const uint32_t *chunk_oid_fanout;
> >  	const unsigned char *chunk_oid_lookup;
> >  	const unsigned char *chunk_commit_data;
> > +	const unsigned char *chunk_generation_data;
> >  	const unsigned char *chunk_extra_edges;
> >  	const unsigned char *chunk_base_graphs;
> >  	const unsigned char *chunk_bloom_indexes;
> > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> > index 6d0c962438..1c2a5366c7 100644
> > --- a/t/helper/test-read-graph.c
> > +++ b/t/helper/test-read-graph.c
> > @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
> >  		printf(" oid_lookup");
> >  	if (graph->chunk_commit_data)
> >  		printf(" commit_metadata");
> > +	if (graph->chunk_generation_data)
> > +		printf(" generation_data");
> >  	if (graph->chunk_extra_edges)
> >  		printf(" extra_edges");
> >  	if (graph->chunk_bloom_indexes)
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c855bcd3e7..780855e691 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> >  	git commit-graph write --reachable --changed-paths
> >  '
> >  graph_read_expect () {
> > -	NUM_CHUNKS=5
> > +	NUM_CHUNKS=6
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
> >  	EOF
> >  	test-tool read-graph >actual &&
> >  	test_cmp expect actual
> > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> > index 26f332d6a3..3ec5248d70 100755
> > --- a/t/t5318-commit-graph.sh
> > +++ b/t/t5318-commit-graph.sh
> > @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
> >
> >  graph_read_expect() {
> >  	OPTIONAL=""
> > -	NUM_CHUNKS=3
> > +	NUM_CHUNKS=4
> >  	if test ! -z $2
> >  	then
> >  		OPTIONAL=" $2"
> > -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> > +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
> >  	fi
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
> >  GRAPH_BYTE_CHUNK_COUNT=6
> >  GRAPH_CHUNK_LOOKUP_OFFSET=8
> >  GRAPH_CHUNK_LOOKUP_WIDTH=12
> > -GRAPH_CHUNK_LOOKUP_ROWS=5
> > +GRAPH_CHUNK_LOOKUP_ROWS=6
> >  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
> >  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
> >  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> > @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
> >  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
> >  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
> >  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> > -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
> >  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
> >  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> > -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_WIDTH=4
> > +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> > +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> > +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
> >  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
> >  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
> >
> > @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
> >  '
> >
> >  test_expect_success 'detect incorrect generation number' '
> > -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> > +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
> >  		"non-zero generation number"
> >  '
> >
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 269d0964a3..096a96ec41 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
> >  	graphdir="$infodir/commit-graphs" &&
> >  	test_oid_init &&
> >  	test_oid_cache <<-EOM
> > -	shallow sha1:1760
> > -	shallow sha256:2064
> > +	shallow sha1:2132
> > +	shallow sha256:2436
> >
> > -	base sha1:1376
> > -	base sha256:1496
> > +	base sha1:1408
> > +	base sha256:1528
> >  	EOM
> >  '
> >
> > @@ -29,9 +29,9 @@ graph_read_expect() {
> >  		NUM_BASE=$2
> >  	fi
> >  	cat >expect <<- EOF
> > -	header: 43475048 1 1 3 $NUM_BASE
> > +	header: 43475048 1 1 4 $NUM_BASE
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > --
> > gitgitgadget
> >
> 
> All of this looks good to me.
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
@ 2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, jnareb, gitgitgadget

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > With preparations done,...
> 
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.
> 

Sure, would seperate into two patches.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
> 
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
> 

Will do.

> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
> 
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
> 

Yes, "update the generation number".

> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
> 
> Yes, that's why I recommend doing that in a different step.
> 
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
> 
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
> 

We haven't tested the scenario "New Git reads a commit graph without
GDAT chunk" yet. Verifying topological levels (along with many of the
changed offsets) would be a part of the scenario.

Now that I think about it, those tests should have been included with
this patch.

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> [...]
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
> 

Does it? 

I had hoped most of the offsets could have been zero, as we could take
advantage of the fact that commit-slab zero initializes values and avoid
a commit-slab access.

Right, What I meant to do was:

        /*
         * max_parent_corrected_commit_date is initialized with zero and
         * takes the maximum of
         * (parent->item->date + commit_graph_data_at(parent->item)->generation)
        */

        if (max_parent_corrected_commit_date >= current->date)
        {
                struct commit_graph_data *data = commit_graph_data_at(current);
                data->generation = max_parent_corrected_commit_date + 1;
        }

Thanks for pointing this out!

> [...]

- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-30  7:47   ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitgitgadget, abhishekkumar8222

On Tue, Jul 28, 2020 at 10:54:58AM -0400, Taylor Blau wrote:
> Hi Abhishek,
> 
> On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Very exciting. I have been eagerly following your blog and asking
> Stolee about your progress, so I am excited to read these patches.
> 

I am so glad to hear that!

> 
> [...]
> 
> I'm sure that I'll learn more when I get to this point, but I would like
> to hear more about why you want to store the offset rather than the
> corrected commit date itself. It seems that the offset could be either
> positive or negative, so you'd only have the range of a signed integer
> (rather than storing 8 bytes of a time_t for the full breadth of
> possibilities).
> 

Corrected commit dates are at least as big as the committer date, so the
offset (i.e. corrected date - committer date) would never be negative.

We store offsets instead of corrected commit dates because:
- We save 4 bytes for each commit, which amounts to 7-8% of the size of
  commit graph file (of course, dependent on the other chunks used).
- We save some time while writing the commit-graph file too, around
  ~200ms for the Linux repo.

While the savings are modest, writing corrected dates does not offer any
advantage that we could think of, at the time.

> I know also that Peff is working on negative timestamp support, so I
> would want to hear about what he thinks of this, too.

I have read up on Peff's work with negative timestamp support and it's
pretty exciting.

> [...]

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
@ 2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 2 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-04  0:46 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.
>
> Let's fix this by accessing generation number directly through the slab.

It looks like unfortunate and unforeseen consequence of putting together
graph position and generation number in the commit_graph_data struct.
During writing of the commit-graph file generation number is computed,
but graph position is undefined (yet), and commit_graph_generation()
uses graph_pos field to find if the data for commit is initialized;
in this case wrongly.

Anyway, when writing the commit graph we first compute generation
number, then (if requested) the changed-paths Bloom filter.  Skipping
the unnecessary check is a good thing... assuming that commit_gen_cmp()
is used only when writing the commit graph, and not when traversing it
(because then some commits may not have generation number set, and maybe
even do not have any data on the commit slab) - which is the case.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

We might want to add function comment either here or in the header that
this comparisonn function is to be used only for `git commit-graph
write`, and not for graph traversal (even if similar funnction exists in
other modules).

> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
@ 2020-08-04  0:56     ` Taylor Blau
  2020-08-04 10:10       ` Jakub Narębski
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2020-08-04  0:56 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Abhishek Kumar, Taylor Blau

On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> >
> > Let's fix this by accessing generation number directly through the slab.
>
> It looks like unfortunate and unforeseen consequence of putting together
> graph position and generation number in the commit_graph_data struct.
> During writing of the commit-graph file generation number is computed,
> but graph position is undefined (yet), and commit_graph_generation()
> uses graph_pos field to find if the data for commit is initialized;
> in this case wrongly.
>
> Anyway, when writing the commit graph we first compute generation
> number, then (if requested) the changed-paths Bloom filter.  Skipping
> the unnecessary check is a good thing... assuming that commit_gen_cmp()
> is used only when writing the commit graph, and not when traversing it
> (because then some commits may not have generation number set, and maybe
> even do not have any data on the commit slab) - which is the case.
>
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
>
> We might want to add function comment either here or in the header that
> this comparisonn function is to be used only for `git commit-graph
> write`, and not for graph traversal (even if similar funnction exists in
> other modules).

I think that probably within the function is just fine, and that we can
avoid touching commit-graph.h here.

>
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;

Maybe something like:

  /*
   * Access the generation number directly with
   * 'commit_graph_data_at(...)->generation' instead of going through
   * the slab as usual to avoid accessing a yet-uncomputed value.
   */

Folks that are curious for more can blame this commit and read there.
I'd err on the side of being brief in the code comment and verbose in
the commit message than the other way around ;).

> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-04  7:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

Jakub Narębski <jnareb@gmail.com> writes:

[...]
>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>  	const struct commit *a = *(const struct commit **)va;
>>  	const struct commit *b = *(const struct commit **)vb;
>>
>> -	uint32_t generation_a = commit_graph_generation(a);
>> -	uint32_t generation_b = commit_graph_generation(b);
>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>> +
>>  	/* lower generation commits first */
>>  	if (generation_a < generation_b)
>>  		return -1;

NOTE: One more thing: we would want to check if corrected commit date
(generation number v2) or topological level (generation number v1) is
better for this purpose, that is gives better performance.

The commit 3d11275505 (commit-graph: examine commits by generation
number) which introduced using commit_gen_cmp when writing commit graph
when finding commits via `--reachable` flags describes the following
performance improvement:

    On the Linux kernel repository, this change reduced the computation
    time for 'git commit-graph write --reachable --changed-paths' from
    3m00s to 1m37s.

We would probably want time for no sorting, for sorting by generation
number v2, and for sorting by topological level (generation number v1)
for the same or similar case.

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04 10:10       ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-04 10:10 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee, Abhishek Kumar

Taylor Blau <me@ttaylorr.com> writes:
> On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index 1af68c297d..5d3c9bd23c 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>
>> We might want to add function comment either here or in the header that
>> this comparisonn function is to be used only for `git commit-graph
>> write`, and not for graph traversal (even if similar funnction exists in
>> other modules).
>
> I think that probably within the function is just fine, and that we can
> avoid touching commit-graph.h here.
>
>>
>>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	const struct commit *a = *(const struct commit **)va;
>>>  	const struct commit *b = *(const struct commit **)vb;
>
> Maybe something like:
>
>   /*
>    * Access the generation number directly with
>    * 'commit_graph_data_at(...)->generation' instead of going through
>    * the slab as usual to avoid accessing a yet-uncomputed value.
>    */

I think the last part of this comment should read:

[...]
     * 'commit_graph_data_at(...)->generation' instead of going through
     * the commit_graph_generation() helper function to access just
     * computed data [during `git commit-graph write --reachable --changed-paths`].
     */

Or something like that (the part in square brackets is optional; I am
not sure if adding it helps or not).

>
> Folks that are curious for more can blame this commit and read there.
> I'd err on the side of being brief in the code comment and verbose in
> the commit message than the other way around ;).

I agree.

>>>
>>> -	uint32_t generation_a = commit_graph_generation(a);
>>> -	uint32_t generation_b = commit_graph_generation(b);
>>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>>> +
>>>  	/* lower generation commits first */
>>>  	if (generation_a < generation_b)
>>>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-05 23:16 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget; +Cc: git, Derrick Stolee, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>
> +		if (parse_commit_gently(parent, 1) < 0)

All right, parse_commit_gently() avoids re-parsing objects, and makes
use of the commit-graph data.  If parents are not guaranteed to be
parsed, this is a correct thing to do.

Though I do wonder how this issue got missed by the test suite, just
like other reviewers...

> +			return ;

Why this need to be 'return' and not 'continue'?

> +
>  		if (*pi)
>  			(*pi)++;
>  		else

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-07-28 16:35 ` Derrick Stolee
@ 2020-08-09  2:53 ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                     ` (11 more replies)
  8 siblings, 12 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
first version.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed graph version commit chains.
 * Improve commit messages for
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (10):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: return 64-bit generation number
  commit-graph: implement corrected commit date
  commit-graph: handle mixed generation commit chains
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  12 +-
 Documentation/technical/commit-graph.txt      |  45 ++--
 commit-graph.c                                | 203 ++++++++++++------
 commit-graph.h                                |  14 +-
 commit-reach.c                                |  49 ++---
 commit-reach.h                                |   2 +-
 commit.c                                      |   9 +-
 commit.h                                      |   4 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   2 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |   4 +-
 t/t5318-commit-graph.sh                       |  27 +--
 t/t5324-split-commit-graph.sh                 |  78 ++++++-
 t/t6024-recursive-merge.sh                    |   4 +-
 t/t6600-test-reach.sh                         |  62 +++---
 upload-pack.c                                 |   2 +-
 18 files changed, 354 insertions(+), 183 deletions(-)


base-commit: dc04167d378fb29d30e1647ff6ff51dd182bc9a3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v1:

  1:  91e6e97a66 !  1:  a962b9ae4b commit-graph: fix regression when computing bloom filter
     @@ Metadata
       ## Commit message ##
          commit-graph: fix regression when computing bloom filter
      
     -    With 3d112755 (commit-graph: examine commits by generation number), Git
     -    knew to sort by generation number before examining the diff when not
     -    using pack order. c49c82aa (commit: move members graph_pos, generation
     -    to a slab, 2020-06-17) moved generation number into a slab and
     -    introduced a helper which returns GENERATION_NUMBER_INFINITY when
     -    writing the graph. Sorting is no longer useful and essentially reverts
     -    the earlier commit.
     -
     -    Let's fix this by accessing generation number directly through the slab.
     +    commit_gen_cmp is used when writing a commit-graph to sort commits in
     +    generation order before computing Bloom filters. Since c49c82aa (commit:
     +    move members graph_pos, generation to a slab, 2020-06-17) made it so
     +    that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
     +    during writing, we cannot call it within this function. Instead, access
     +    the generation number directly through the slab (i.e., by calling
     +    'commit_graph_data_at(c)->generation') in order to access it while
     +    writing.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
      -	uint32_t generation_b = commit_graph_generation(b);
      +	uint32_t generation_a = commit_graph_data_at(a)->generation;
      +	uint32_t generation_b = commit_graph_data_at(b)->generation;
     -+
       	/* lower generation commits first */
       	if (generation_a < generation_b)
       		return -1;
  2:  d23f67dc80 !  2:  cf61239f93 revision: parse parent in indegree_walk_step()
     @@ revision.c: static void indegree_walk_step(struct rev_info *revs)
       		int *pi = indegree_slab_at(&info->indegree, parent);
       
      +		if (parse_commit_gently(parent, 1) < 0)
     -+			return ;
     ++			return;
      +
       		if (*pi)
       			(*pi)++;
  3:  701f591236 !  3:  32da955e31 commit-graph: consolidate fill_commit_graph_info
     @@ Commit message
      
          The test 'generate tar with future mtime' creates a commit with commit
          time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
     -    generation number and has undefined behavior. The test used to pass as
     -    fill_commit_in_graph() did not read commit time from commit graph,
     -    reading commit date from odb instead.
     +    generation number (within CDAT chunk) and has undefined behavior.
      
     -    Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
     +    The test used to pass as fill_commit_in_graph() guarantees the values of
     +    graph position and generation number, and did not load timestamp.
     +    However, with corrected commit date we will need load the timestamp as
     +    well to populate the generation number.
     +
     +    Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
       	const unsigned char *commit_data;
       	uint32_t lex_index;
       
     -+	fill_commit_graph_info(item, g, pos);
     -+
       	while (pos < g->num_commits_in_base)
       		g = g->base_graph;
       
      -	if (pos >= g->num_commits + g->num_commits_in_base)
      -		die(_("invalid commit position. commit-graph is likely corrupt"));
     --
     ++	fill_commit_graph_info(item, g, pos);
     + 
      -	/*
      -	 * Store the "full" position, but then use the
      -	 * "local" position for the rest of the calculation.
  4:  812fe75fc7 !  4:  b254782858 commit-graph: consolidate compare_commits_by_gen
     @@ Commit message
          compare_commits_by_gen() to commit-graph.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
     +    Reviewed-by: Taylor Blau <me@ttaylorr.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
  5:  80ea7da343 !  5:  cb797e20d7 commit-graph: implement generation data chunk
     @@ Metadata
       ## Commit message ##
          commit-graph: implement generation data chunk
      
     -    One of the essential pre-requisites before implementing generation
     -    number as to distinguish between generation numbers v1 and v2 while
     -    still being compatible with old Git.
     +    As discovered by Ævar, we cannot increment graph version to
     +    distinguish between generation numbers v1 and v2 [1]. Thus, one of
     +    pre-requistes before implementing generation number was to distinguish
     +    between graph versions in a backwards compatible manner.
      
          We are going to introduce a new chunk called Generation Data chunk (or
          GDAT). GDAT stores generation number v2 (and any subsequent versions),
          whereas CDAT will still store topological level.
      
          Old Git does not understand GDAT chunk and would ignore it, reading
     -    topological levels from CDAT. Newer versions of Git can parse GDAT and
     -    take advantage of newer generation numbers, falling back to topological
     -    levels when GDAT chunk is missing (as it would happen with a commit
     -    graph written by old Git).
     +    topological levels from CDAT. New Git can parse GDAT and take advantage
     +    of newer generation numbers, falling back to topological levels when
     +    GDAT chunk is missing (as it would happen with a commit graph written
     +    by old Git).
     +
     +    We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
     +    which forces commit-graph file to be written without generation data
     +    chunk to emulate a commit-graph file written by old Git.
     +
     +    [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       }
       
       static inline void set_commit_tree(struct commit *c, struct tree *t)
     -@@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
     - 	}
     +@@ commit-graph.c: struct write_commit_graph_context {
     + 		 report_progress:1,
     + 		 split:1,
     + 		 changed_paths:1,
     +-		 order_by_pack:1;
     ++		 order_by_pack:1,
     ++		 write_generation_data:1;
     + 
     + 	const struct split_commit_graph_opts *split_opts;
     + 	size_t total_bloom_filter_data_size;
     +@@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
     + 	return 0;
       }
       
     -+static void write_graph_chunk_generation_data(struct hashfile *f,
     ++static int write_graph_chunk_generation_data(struct hashfile *f,
      +					      struct write_commit_graph_context *ctx)
      +{
     -+	struct commit **list = ctx->commits.list;
     -+	int count;
     -+	for (count = 0; count < ctx->commits.nr; count++, list++) {
     ++	int i;
     ++	for (i = 0; i < ctx->commits.nr; i++) {
     ++		struct commit *c = ctx->commits.list[i];
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
     -+		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
     ++		hashwrite_be32(f, commit_graph_data_at(c)->generation);
      +	}
     ++
     ++	return 0;
      +}
      +
     - static void write_graph_chunk_extra_edges(struct hashfile *f,
     - 					  struct write_commit_graph_context *ctx)
     + static int write_graph_chunk_extra_edges(struct hashfile *f,
     +-					 struct write_commit_graph_context *ctx)
     ++					  struct write_commit_graph_context *ctx)
       {
     + 	struct commit **list = ctx->commits.list;
     + 	struct commit **last = ctx->commits.list + ctx->commits.nr;
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
     - 	const unsigned hashsz = the_hash_algo->rawsz;
     - 	struct strbuf progress_title = STRBUF_INIT;
     --	int num_chunks = 3;
     -+	int num_chunks = 4;
     - 	struct object_id file_hash;
     - 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     - 
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
     - 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
     - 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
     -+	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
     + 	chunks[2].id = GRAPH_CHUNKID_DATA;
     + 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
     + 	chunks[2].write_fn = write_graph_chunk_data;
     ++	if (ctx->write_generation_data) {
     ++		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
     ++		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     ++		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
     ++		num_chunks++;
     ++	}
       	if (ctx->num_extra_edges) {
     - 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
     - 		num_chunks++;
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
     - 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
     - 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
     -+	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
     + 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
     + 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
     +@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     + 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
     + 	ctx->split_opts = split_opts;
     + 	ctx->total_bloom_filter_data_size = 0;
     ++	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
       
     --	num_chunks = 3;
     -+	num_chunks = 4;
     - 	if (ctx->num_extra_edges) {
     - 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
     - 						4 * ctx->num_extra_edges;
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	write_graph_chunk_fanout(f, ctx);
     - 	write_graph_chunk_oids(f, hashsz, ctx);
     - 	write_graph_chunk_data(f, hashsz, ctx);
     -+	write_graph_chunk_generation_data(f, ctx);
     - 	if (ctx->num_extra_edges)
     - 		write_graph_chunk_extra_edges(f, ctx);
     - 	if (ctx->changed_paths) {
     + 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     + 		ctx->changed_paths = 1;
      
       ## commit-graph.h ##
     +@@
     + #include "oidset.h"
     + 
     + #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
     ++#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
     + #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
     + #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
     + 
      @@ commit-graph.h: struct commit_graph {
       	const uint32_t *chunk_oid_fanout;
       	const unsigned char *chunk_oid_lookup;
     @@ commit-graph.h: struct commit_graph {
       	const unsigned char *chunk_base_graphs;
       	const unsigned char *chunk_bloom_indexes;
      
     + ## t/README ##
     +@@ t/README: GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
     + be written after every 'git commit' command, and overrides the
     + 'core.commitGraph' setting to true.
     + 
     ++GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
     ++commit-graph to be written without generation data chunk.
     ++
     + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
     + commit-graph write to compute and write changed path Bloom filters for
     + every 'git commit-graph write', as if the `--changed-paths` option was
     +
       ## t/helper/test-read-graph.c ##
      @@ t/helper/test-read-graph.c: int cmd__read_graph(int argc, const char **argv)
       		printf(" oid_lookup");
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup test - repo, commits, commit gr
      
       ## t/t5318-commit-graph.sh ##
      @@ t/t5318-commit-graph.sh: graph_git_behavior 'no graph' full commits/3 commits/1
     - 
       graph_read_expect() {
       	OPTIONAL=""
     --	NUM_CHUNKS=3
     -+	NUM_CHUNKS=4
     - 	if test ! -z $2
     + 	NUM_CHUNKS=3
     +-	if test ! -z $2
     ++	if test ! -z "$2"
       	then
       		OPTIONAL=" $2"
     --		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
     -+		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
     - 	fi
     - 	cat >expect <<- EOF
     - 	header: 43475048 1 1 $NUM_CHUNKS 0
     - 	num_commits: $1
     --	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
     -+	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
     - 	EOF
     - 	test-tool read-graph >output &&
     - 	test_cmp expect output
     -@@ t/t5318-commit-graph.sh: GRAPH_BYTE_HASH=5
     - GRAPH_BYTE_CHUNK_COUNT=6
     - GRAPH_CHUNK_LOOKUP_OFFSET=8
     - GRAPH_CHUNK_LOOKUP_WIDTH=12
     --GRAPH_CHUNK_LOOKUP_ROWS=5
     -+GRAPH_CHUNK_LOOKUP_ROWS=6
     - GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
     - GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
     - 			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
     -@@ t/t5318-commit-graph.sh: GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
     - GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
     - GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
     - GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
     --GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
     - GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
     - GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
     --GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
     --			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
     -+GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
     -+				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
     -+GRAPH_GENERATION_DATA_WIDTH=4
     -+GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
     -+GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
     -+			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
     - GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
     - GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
     - 
     -@@ t/t5318-commit-graph.sh: test_expect_success 'detect incorrect generation number' '
     - '
     - 
     - test_expect_success 'detect incorrect generation number' '
     --	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
     -+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
     - 		"non-zero generation number"
     + 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
     +@@ t/t5318-commit-graph.sh: test_expect_success 'exit with correct error on bad input to --stdin-commits' '
     + 	# valid commit and tree OID
     + 	git rev-parse HEAD HEAD^{tree} >in &&
     + 	git commit-graph write --stdin-commits <in &&
     +-	graph_read_expect 3
     ++	graph_read_expect 3 generation_data
     + '
     + 
     + test_expect_success 'write graph' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "3"
     ++	graph_read_expect "3" generation_data
       '
       
     + test_expect_success POSIXPERM 'write graph has correct permissions' '
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with merges' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "10" "extra_edges"
     ++	graph_read_expect "10" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with new commit' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with nothing new' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from latest pack with closure' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	cat new-idx | git commit-graph write --stdin-packs &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "9" "extra_edges"
     ++	graph_read_expect "9" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from commits with closure' '
     + 	git rev-parse merge/1 >>commits-in &&
     + 	cat commits-in | git commit-graph write --stdin-commits &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "6"
     ++	graph_read_expect "6" "generation_data"
     + '
     + 
     + graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from commits with append' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "10" "extra_edges"
     ++	graph_read_expect "10" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph using --reachable' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write --reachable &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph in bare repo' '
     + 	cd "$TRASH_DIRECTORY/bare" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $baredir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit-graph' '
     + 
     + test_expect_success 'git commit-graph verify' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     +-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
     +-	git commit-graph verify >output
     ++	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
     ++	git commit-graph verify >output &&
     ++	graph_read_expect 9 extra_edges
     + '
     + 
     + NUM_COMMITS=9
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     @@ t/t5324-split-commit-graph.sh: graph_read_expect() {
       	EOF
       	test-tool read-graph >output &&
       	test_cmp expect output
     +
     + ## t/t6600-test-reach.sh ##
     +@@ t/t6600-test-reach.sh: test_expect_success 'setup' '
     + 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
     + 	mv .git/objects/info/commit-graph commit-graph-half &&
     + 	chmod u+w commit-graph-half &&
     ++	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
     ++	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
     ++	chmod u+w commit-graph-no-gdat &&
     + 	git config core.commitGraph true
     + '
     + 
     +-run_three_modes () {
     ++run_all_modes () {
     + 	test_when_finished rm -rf .git/objects/info/commit-graph &&
     + 	"$@" <input >actual &&
     + 	test_cmp expect actual &&
     +@@ t/t6600-test-reach.sh: run_three_modes () {
     + 	test_cmp expect actual &&
     + 	cp commit-graph-half .git/objects/info/commit-graph &&
     + 	"$@" <input >actual &&
     ++	test_cmp expect actual &&
     ++	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
     ++	"$@" <input >actual &&
     + 	test_cmp expect actual
     + }
     + 
     +-test_three_modes () {
     +-	run_three_modes test-tool reach "$@"
     ++test_all_modes () {
     ++	run_all_modes test-tool reach "$@"
     + }
     + 
     + test_expect_success 'ref_newer:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:miss' '
     + 	B:commit-4-9
     + 	EOF
     + 	echo "ref_newer(A,B):0" >expect &&
     +-	test_three_modes ref_newer
     ++	test_all_modes ref_newer
     + '
     + 
     + test_expect_success 'ref_newer:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:hit' '
     + 	B:commit-2-3
     + 	EOF
     + 	echo "ref_newer(A,B):1" >expect &&
     +-	test_three_modes ref_newer
     ++	test_all_modes ref_newer
     + '
     + 
     + test_expect_success 'in_merge_bases:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:hit' '
     + 	B:commit-8-8
     + 	EOF
     + 	echo "in_merge_bases(A,B):1" >expect &&
     +-	test_three_modes in_merge_bases
     ++	test_all_modes in_merge_bases
     + '
     + 
     + test_expect_success 'in_merge_bases:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:miss' '
     + 	B:commit-5-9
     + 	EOF
     + 	echo "in_merge_bases(A,B):0" >expect &&
     +-	test_three_modes in_merge_bases
     ++	test_all_modes in_merge_bases
     + '
     + 
     + test_expect_success 'is_descendant_of:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:hit' '
     + 	X:commit-1-1
     + 	EOF
     + 	echo "is_descendant_of(A,X):1" >expect &&
     +-	test_three_modes is_descendant_of
     ++	test_all_modes is_descendant_of
     + '
     + 
     + test_expect_success 'is_descendant_of:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:miss' '
     + 	X:commit-7-6
     + 	EOF
     + 	echo "is_descendant_of(A,X):0" >expect &&
     +-	test_three_modes is_descendant_of
     ++	test_all_modes is_descendant_of
     + '
     + 
     + test_expect_success 'get_merge_bases_many' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_merge_bases_many' '
     + 		git rev-parse commit-5-6 \
     + 			      commit-4-7 | sort
     + 	} >expect &&
     +-	test_three_modes get_merge_bases_many
     ++	test_all_modes get_merge_bases_many
     + '
     + 
     + test_expect_success 'reduce_heads' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'reduce_heads' '
     + 			      commit-2-8 \
     + 			      commit-1-10 | sort
     + 	} >expect &&
     +-	test_three_modes reduce_heads
     ++	test_all_modes reduce_heads
     + '
     + 
     + test_expect_success 'can_all_from_reach:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:hit' '
     + 	Y:commit-8-1
     + 	EOF
     + 	echo "can_all_from_reach(X,Y):1" >expect &&
     +-	test_three_modes can_all_from_reach
     ++	test_all_modes can_all_from_reach
     + '
     + 
     + test_expect_success 'can_all_from_reach:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:miss' '
     + 	Y:commit-8-5
     + 	EOF
     + 	echo "can_all_from_reach(X,Y):0" >expect &&
     +-	test_three_modes can_all_from_reach
     ++	test_all_modes can_all_from_reach
     + '
     + 
     + test_expect_success 'can_all_from_reach_with_flag: tags case' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach_with_flag: tags case' '
     + 	Y:commit-8-1
     + 	EOF
     + 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
     +-	test_three_modes can_all_from_reach_with_flag
     ++	test_all_modes can_all_from_reach_with_flag
     + '
     + 
     + test_expect_success 'commit_contains:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:hit' '
     + 	X:commit-9-3
     + 	EOF
     + 	echo "commit_contains(_,A,X,_):1" >expect &&
     +-	test_three_modes commit_contains &&
     +-	test_three_modes commit_contains --tag
     ++	test_all_modes commit_contains &&
     ++	test_all_modes commit_contains --tag
     + '
     + 
     + test_expect_success 'commit_contains:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:miss' '
     + 	X:commit-9-3
     + 	EOF
     + 	echo "commit_contains(_,A,X,_):0" >expect &&
     +-	test_three_modes commit_contains &&
     +-	test_three_modes commit_contains --tag
     ++	test_all_modes commit_contains &&
     ++	test_all_modes commit_contains --tag
     + '
     + 
     + test_expect_success 'rev-list: basic topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: basic topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: first-parent topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent topo-order' '
     + 		commit-6-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
     ++	run_all_modes git rev-list --first-parent --topo-order commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: first-parent range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent range topo-order' '
     + 		commit-6-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     ++	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: ancestry-path topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: ancestry-path topo-order' '
     + 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
     + 		commit-6-3 commit-5-3 commit-4-3 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     ++	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: symmetric difference topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: symmetric difference topo-order' '
     + 		commit-3-8 commit-2-8 commit-1-8 \
     + 		commit-3-7 commit-2-7 commit-1-7 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
     + '
     + 
     + test_expect_success 'get_reachable_subset:all' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:all' '
     + 			      commit-1-7 \
     + 			      commit-5-6 | sort
     + 	) >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_expect_success 'get_reachable_subset:some' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:some' '
     + 		git rev-parse commit-3-3 \
     + 			      commit-1-7 | sort
     + 	) >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_expect_success 'get_reachable_subset:none' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:none' '
     + 	Y:commit-2-8
     + 	EOF
     + 	echo "get_reachable_subset(X,Y)" >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_done
  6:  647290d036 !  6:  1aa2a00a7a commit-graph: implement corrected commit date offset
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: implement corrected commit date offset
     +    commit-graph: return 64-bit generation number
      
     -    With preparations done, let's implement corrected commit date offset.
     -    We add a new commit-slab to store topological levels while writing
     -    commit graph and upgrade number of struct commit_graph_data to 64-bits.
     -
     -    We have to touch many files, upgrading generation number from uint32_t
     -    to timestamp_t.
     -
     -    We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
     -    which tests if verify can detect if a commit graph have
     -    GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
     -    With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
     -    one of dates is Unix epoch zero.
     +    In a preparatory step, let's return timestamp_t values from
     +    commit_graph_generation(), use timestamp_t for local variables and
     +    define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     - ## blame.c ##
     -@@ blame.c: static int maybe_changed_path(struct repository *r,
     - 	if (!bd)
     - 		return 1;
     - 
     --	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
     -+	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
     - 		return 1;
     - 
     - 	filter = get_bloom_filter(r, origin->commit, 0);
     -
       ## commit-graph.c ##
     -@@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
     - /* Remember to update object flag allocation in object.h */
     - #define REACHABLE       (1u<<15)
     - 
     -+define_commit_slab(topo_level_slab, uint32_t);
     -+
     - /* Keep track of the order in which commits are added to our list. */
     - define_commit_slab(commit_pos, int);
     - static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
      @@ commit-graph.c: uint32_t commit_graph_position(const struct commit *c)
       	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
       }
     @@ commit-graph.c: uint32_t commit_graph_position(const struct commit *c)
       {
       	struct commit_graph_data *data =
       		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
     - 
     - 	if (!data)
     --		return GENERATION_NUMBER_INFINITY;
     -+		return GENERATION_NUMBER_V2_INFINITY;
     - 	else if (data->graph_pos == COMMIT_NOT_FROM_GRAPH)
     --		return GENERATION_NUMBER_INFINITY;
     -+		return GENERATION_NUMBER_V2_INFINITY;
     - 
     - 	return data->generation;
     - }
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
       int compare_commits_by_gen(const void *_a, const void *_b)
       {
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
       
      -	uint32_t generation_a = commit_graph_data_at(a)->generation;
      -	uint32_t generation_b = commit_graph_data_at(b)->generation;
     -+	timestamp_t generation_a = commit_graph_data_at(a)->generation;
     -+	timestamp_t generation_b = commit_graph_data_at(b)->generation;
     - 
     ++	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
     ++	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
       	/* lower generation commits first */
       	if (generation_a < generation_b)
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
     -@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     - 	item->date = (timestamp_t)((date_high << 32) | date_low);
     - 
     - 	if (g->chunk_generation_data)
     --		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	{
     -+		/* Read corrected commit date offset from GDAT */
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	}
     - 	else
     -+		/* Read topological level from CDAT */
     - 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     - }
     - 
     -@@ commit-graph.c: struct write_commit_graph_context {
     - 	struct progress *progress;
     - 	int progress_done;
     - 	uint64_t progress_cnt;
     -+	struct topo_level_slab *topo_levels;
     - 
     - 	char *base_graph_name;
     - 	int num_commit_graphs_before;
     -@@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
     - 		else
     - 			packedDate[0] = 0;
     - 
     --		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
     -+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
     - 
     - 		packedDate[1] = htonl((*list)->date);
     - 		hashwrite(f, packedDate, 8);
     -@@ commit-graph.c: static void write_graph_chunk_generation_data(struct hashfile *f,
     - 	struct commit **list = ctx->commits.list;
     - 	int count;
     - 	for (count = 0; count < ctx->commits.nr; count++, list++) {
     -+		timestamp_t offset = commit_graph_data_at(*list)->generation - (*list)->date;
     - 		display_progress(ctx->progress, ++ctx->progress_cnt);
     --		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
     -+
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     -+
     -+		hashwrite_be32(f, offset);
     - 	}
     - }
     - 
     -@@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *ctx)
     - 	stop_progress(&ctx->progress);
     - }
     - 
     --static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     -+static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
     - {
     - 	int i;
     - 	struct commit_list *list = NULL;
     + 		return -1;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 					_("Computing commit graph generation numbers"),
     - 					ctx->commits.nr);
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     --		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     + 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_INFINITY &&
     --		    generation != GENERATION_NUMBER_ZERO)
     -+		if (topo_level != GENERATION_NUMBER_INFINITY &&
     -+		    topo_level != GENERATION_NUMBER_ZERO)
     ++		if (generation != GENERATION_NUMBER_V1_INFINITY &&
     + 		    generation != GENERATION_NUMBER_ZERO)
       			continue;
       
     - 		commit_list_insert(ctx->commits.list[i], &list);
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 			struct commit *current = list->item;
     - 			struct commit_list *parent;
     - 			int all_parents_computed = 1;
     --			uint32_t max_generation = 0;
     -+			uint32_t max_level = 0;
     -+			timestamp_t max_corrected_commit_date = current->date;
     - 
       			for (parent = current->parents; parent; parent = parent->next) {
     --				generation = commit_graph_data_at(parent->item)->generation;
     -+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     + 				generation = commit_graph_data_at(parent->item)->generation;
       
      -				if (generation == GENERATION_NUMBER_INFINITY ||
     --				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (topo_level == GENERATION_NUMBER_INFINITY ||
     -+				    topo_level == GENERATION_NUMBER_ZERO) {
     ++				if (generation == GENERATION_NUMBER_V1_INFINITY ||
     + 				    generation == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
     - 					break;
     --				} else if (generation > max_generation) {
     --					max_generation = generation;
     -+				} else {
     -+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
     -+
     -+					if (topo_level > max_level)
     -+						max_level = topo_level;
     -+
     -+					if (data->generation > max_corrected_commit_date)
     -+						max_corrected_commit_date = data->generation;
     - 				}
     - 			}
     - 
     - 			if (all_parents_computed) {
     - 				struct commit_graph_data *data = commit_graph_data_at(current);
     - 
     --				data->generation = max_generation + 1;
     --				pop_commit(&list);
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     -+
     -+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
     -+				data->generation = max_corrected_commit_date + 1;
     - 
     --				if (data->generation > GENERATION_NUMBER_MAX)
     --					data->generation = GENERATION_NUMBER_MAX;
     -+				pop_commit(&list);
     - 			}
     - 		}
     - 	}
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	uint32_t i, count_distinct = 0;
     - 	int res = 0;
     - 	int replace = 0;
     -+	struct topo_level_slab topo_levels;
     - 
     - 	if (!commit_graph_compatible(the_repository))
     - 		return 0;
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
     - 	ctx->total_bloom_filter_data_size = 0;
     - 
     -+	init_topo_level_slab(&topo_levels);
     -+	ctx->topo_levels = &topo_levels;
     -+
     - 	if (ctx->split) {
     - 		struct commit_graph *g;
     - 		prepare_commit_graph(ctx->r);
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	} else
     - 		ctx->num_commit_graphs_after = 1;
     - 
     --	compute_generation_numbers(ctx);
     -+	compute_corrected_commit_date_offsets(ctx);
     - 
     - 	if (ctx->changed_paths)
     - 		compute_bloom_filters(ctx);
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       	for (i = 0; i < g->num_commits; i++) {
       		struct commit *graph_commit, *odb_commit;
       		struct commit_list *graph_parents, *odb_parents;
      -		uint32_t max_generation = 0;
      -		uint32_t generation;
     -+		timestamp_t max_parent_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     ++		timestamp_t max_generation = 0;
     ++		timestamp_t generation;
       
       		display_progress(progress, i + 1);
       		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_parent_corrected_commit_date)
     -+				max_parent_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     - 
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     --			max_generation--;
     --
     --		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_parent_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
      
       ## commit-graph.h ##
      @@ commit-graph.h: void disable_commit_graph(struct repository *r);
     @@ commit-reach.c: static int queue_has_nonstale(struct prio_queue *queue)
       	struct commit_list *result = NULL;
       	int i;
      -	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t last_gen = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
       
       	if (!min_generation)
       		queue.compare = compare_commits_by_commit_date;
     @@ commit-reach.c: int repo_in_merge_bases_many(struct repository *r, struct commit
       	struct commit_list *bases;
       	int ret = 0, i;
      -	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t generation, min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
       
       	if (repo_parse_commit(r, commit))
       		return ret;
     @@ commit-reach.c: static enum contains_result contains_tag_algo(struct commit *can
       	struct contains_stack contains_stack = { 0, 0, NULL };
       	enum contains_result result;
      -	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t cutoff = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
       	const struct commit_list *p;
       
       	for (p = want; p; p = p->next) {
     @@ commit-reach.c: int can_all_from_reach(struct commit_list *from, struct commit_l
       	struct commit_list *from_iter = from, *to_iter = to;
       	int result;
      -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
       
       	while (from_iter) {
       		add_object_array(&from_iter->item->object, NULL, &from_objs);
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
       	struct commit **to_last = to + nr_to;
       	struct commit **from_last = from + nr_from;
      -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
       	int num_to_find = 0;
       
       	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
     @@ commit-reach.h: int can_all_from_reach_with_flag(struct object_array *from,
      
       ## commit.h ##
      @@
     + #include "commit-slab.h"
     + 
     + #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
     +-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
     ++#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
     ++#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
       #define GENERATION_NUMBER_MAX 0x3FFFFFFF
       #define GENERATION_NUMBER_ZERO 0
       
     -+#define GENERATION_NUMBER_V2_INFINITY ((1ULL << 63) - 1)
     -+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
     -+
     - struct commit_list {
     - 	struct commit *item;
     - 	struct commit_list *next;
      
       ## revision.c ##
     -@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
     - 	if (!revs->repo->objects->commit_graph)
     - 		return -1;
     - 
     --	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
     -+	if (commit_graph_generation(commit) == GENERATION_NUMBER_V2_INFINITY)
     - 		return -1;
     - 
     - 	filter = get_bloom_filter(revs->repo, commit, 0);
      @@ revision.c: define_commit_slab(indegree_slab, int);
       define_commit_slab(author_date_slab, timestamp_t);
       
     @@ revision.c: static void indegree_walk_step(struct rev_info *revs)
       	struct topo_walk_info *info = revs->topo_walk_info;
       	struct commit *c;
      @@ revision.c: static void init_topo_walk(struct rev_info *revs)
     - 	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
     - 	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
     - 
     --	info->min_generation = GENERATION_NUMBER_INFINITY;
     -+	info->min_generation = GENERATION_NUMBER_V2_INFINITY;
     + 	info->min_generation = GENERATION_NUMBER_INFINITY;
       	for (list = revs->commits; list; list = list->next) {
       		struct commit *c = list->item;
      -		uint32_t generation;
     @@ revision.c: static void expand_topo_walk(struct rev_info *revs, struct commit *c
       		if (parent->object.flags & UNINTERESTING)
       			continue;
      
     - ## t/t5318-commit-graph.sh ##
     -@@ t/t5318-commit-graph.sh: test_expect_success 'detect incorrect generation number' '
     - 		"generation for commit"
     - '
     - 
     --test_expect_success 'detect incorrect generation number' '
     -+test_expect_failure 'detect incorrect generation number' '
     - 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
     - 		"non-zero generation number"
     - '
     -
       ## upload-pack.c ##
      @@ upload-pack.c: static int got_oid(struct upload_pack_data *data,
       
  -:  ---------- >  7:  bfe1473201 commit-graph: implement corrected commit date
  -:  ---------- >  8:  833779ad53 commit-graph: handle mixed generation commit chains
  -:  ---------- >  9:  58a2d5da01 commit-reach: use corrected commit dates in paint_down_to_common()
  -:  ---------- > 10:  4c34294602 doc: add corrected commit date info

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..ace7400a1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 02/10] revision: parse parent in indegree_walk_step()
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 6de29cdf7a..4ec82ed5ab 100644
--- a/revision.c
+++ b/revision.c
@@ -3365,6 +3365,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_in_graph() guarantees the values of
graph position and generation number, and did not load timestamp.
However, with corrected commit date we will need load the timestamp as
well to populate the generation number.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 29 +++++++++++------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ace7400a1a..af8d9cc45e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index af8d9cc45e..fb6e2bf18f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..701e3d41aa 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -146,4 +146,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 7128895c3a..bed63b41fb 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:28     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number was to distinguish
between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 40 +++++++++++++++++++---
 commit-graph.h                |  2 ++
 t/README                      |  3 ++
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 +--
 t/t5318-commit-graph.sh       | 27 +++++++--------
 t/t5324-split-commit-graph.sh | 12 +++----
 t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
 8 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb6e2bf18f..d5da1e8028 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -392,6 +393,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -758,7 +766,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -951,7 +962,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
@@ -1105,8 +1117,21 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite_be32(f, commit_graph_data_at(c)->generation);
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1710,6 +1735,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2113,6 +2144,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
diff --git a/commit-graph.h b/commit-graph.h
index 701e3d41aa..cc232e0678 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -67,6 +68,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/README b/t/README
index 70ec61cf88..6647ef132e 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..55c94e9ebd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2804b0dd45..fef05c33d7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -72,7 +72,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -99,14 +99,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -215,7 +215,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -250,7 +250,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -260,7 +260,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -270,7 +270,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -283,7 +283,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -293,7 +293,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -303,7 +303,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -324,7 +324,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -421,8 +421,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..6b25c3d9ce 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_init &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -29,9 +29,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 475564bee7..d14b129f06 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 06/10] commit-graph: return 64-bit generation number
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 18 +++++++++---------
 commit-graph.h |  4 ++--
 commit-reach.c | 32 ++++++++++++++++----------------
 commit-reach.h |  2 +-
 commit.h       |  3 ++-
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 7 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d5da1e8028..42f3ec5460 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -100,7 +100,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -116,8 +116,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -160,8 +160,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1363,7 +1363,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
+		if (generation != GENERATION_NUMBER_V1_INFINITY &&
 		    generation != GENERATION_NUMBER_ZERO)
 			continue;
 
@@ -1377,7 +1377,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			for (parent = current->parents; parent; parent = parent->next) {
 				generation = commit_graph_data_at(parent->item)->generation;
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
+				if (generation == GENERATION_NUMBER_V1_INFINITY ||
 				    generation == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
@@ -2387,8 +2387,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
diff --git a/commit-graph.h b/commit-graph.h
index cc232e0678..f89614ecd5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -140,13 +140,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..470bc80139 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..bc0732a4fe 100644
--- a/commit.h
+++ b/commit.h
@@ -11,7 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
diff --git a/revision.c b/revision.c
index 4ec82ed5ab..bd7b39c806 100644
--- a/revision.c
+++ b/revision.c
@@ -3292,7 +3292,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3338,7 +3338,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3381,7 +3381,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3439,7 +3439,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3500,7 +3500,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 8673741070..18ee29db67 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -490,7 +490,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 14:23     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date
offset. We add a new commit-slab to store topogical levels while
writing commit graph and upgrade the generation member in struct
commit_graph_data to a 64-bit timestamp. We store topological levels to
ensure that older versions of Git will still have the performance
benefits from generation number v2.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 89 ++++++++++++++++++++++++++++----------------------
 commit.h       |  1 +
 2 files changed, 51 insertions(+), 39 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 42f3ec5460..d0f977852b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,6 +65,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -168,11 +170,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -767,7 +764,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->chunk_generation_data)
-		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	{
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -948,6 +948,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	struct topo_level_slab *topo_levels;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1106,7 +1107,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1123,8 +1124,13 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, commit_graph_data_at(c)->generation);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+
+		hashwrite_be32(f, offset);
 	}
 
 	return 0;
@@ -1360,11 +1366,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_V1_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (topo_level != GENERATION_NUMBER_V1_INFINITY &&
+		    topo_level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1372,29 +1378,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = current->date - 1;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_V1_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (topo_level == GENERATION_NUMBER_V1_INFINITY ||
+				    topo_level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else {
+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
+
+					if (topo_level > max_level)
+						max_level = topo_level;
+
+					if (data->generation > max_corrected_commit_date)
+						max_corrected_commit_date = data->generation;
 				}
 			}
 
 			if (all_parents_computed) {
 				struct commit_graph_data *data = commit_graph_data_at(current);
 
-				data->generation = max_generation + 1;
-				pop_commit(&list);
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+				data->generation = max_corrected_commit_date + 1;
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				pop_commit(&list);
 			}
 		}
 	}
@@ -2132,6 +2147,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2146,6 +2162,9 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->total_bloom_filter_data_size = 0;
 	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
@@ -2387,8 +2406,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		timestamp_t max_generation = 0;
-		timestamp_t generation;
+		timestamp_t max_parent_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2427,9 +2446,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_parent_corrected_commit_date)
+				max_parent_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2451,20 +2470,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_parent_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
diff --git a/commit.h b/commit.h
index bc0732a4fe..bb846e0025 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,7 @@
 #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
 
 struct commit_list {
 	struct commit *item;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:42     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As corrected commit dates and topological levels cannot be compared
directly, we must handle commit graph chains with mixed generation
number definitions.

While reading a commit graph file, we disable generation numbers if the
chain contains mixed generation numbers.

While writing to commit graph chain, we write generation data chunk only
if the previous tip of chain had a generation data chunk. Using
`--split=replace` overwrites the existing chain and writes generation
data chunk regardless of previous tip.

In t5324-split-commit-graph, we set up a repo with twelve commits and
write a base commit graph file with no generation data chunk. When add
three commits and write to chain again, Git does not write generation
data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
as we replace the existing chain, Git writes a commit graph file with
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 14 ++++++++
 t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d0f977852b..c6b6111adf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,6 +674,14 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
+	/* We cannot compare topological levels and corrected commit dates */
+	while (g->base_graph) {
+		warning(_("commit-graph-chain contains mixed generation versions"));
+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
+			return 0;
+		g = g->base_graph;
+	}
+
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -2186,6 +2194,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2204,6 +2215,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 6b25c3d9ce..1a9be5e656 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,70 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 3 0
+	num_commits: 12
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split='replace' &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

t6024-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base. As this has
already caused problems (as noted in 859fdc0 (commit-graph: define
GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
test script.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6024-recursive-merge.sh |  4 +++-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c6b6111adf..eb78af3dad 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -688,6 +688,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return !!g->chunk_generation_data;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index f89614ecd5..d3a485faa6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -89,6 +89,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/commit-reach.c b/commit-reach.c
index 470bc80139..3a1b925274 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 332cfc53fd..d3def66e7d 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v2 10/10] doc: add corrected commit date info
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 12 ++---
 Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
 2 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..71c43884ec 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -88,6 +84,12 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * This list can be later modified to store future generation number related
+      data.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 808fa30b99..f27145328c 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates
+2. Topological levels
+
+Define "corrected committer date" of a commit recursively as follows:
+
+  * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
+
+  * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+Define the "topological level" of a commit recursively as follows:
 
  * A commit with no parents (a root commit) has generation number one.
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
 
-Equivalently, the generation number of a commit A is one more than the
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
-violate topological relationships due to clock skew (such as "git log"
-with default order), but is not used when the topological order is
-required (such as merge base calculations, "git log --graph").
-
 In practice, we expect some commits to be created recently and not stored
 in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
+are computed to be at least this value. We limit at this value since it is
+the largest value that can be stored in the commit-graph file using the
+available to generation numbers. This presents another case where a
+commit can have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+We also merge commit-graph chains when we try to write a commit graph with
+two different generation number definitions as they cannot be compared directly.
+We overwrite the existing chain and create a commit-graph with the newer or more
+efficient defintion. For example, overwriting topological levels commit graph
+chain to create a corrected commit dates commit graph chain.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-10 14:23     ` Derrick Stolee
  2020-08-14  4:59       ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-08-10 14:23 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With most of preparations done, let's implement corrected commit date
> offset. We add a new commit-slab to store topogical levels while
> writing commit graph and upgrade the generation member in struct
> commit_graph_data to a 64-bit timestamp. We store topological levels to
> ensure that older versions of Git will still have the performance
> benefits from generation number v2.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

> @@ -767,7 +764,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
>  	if (g->chunk_generation_data)
> -		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	{
> +		graph_data->generation = item->date +
> +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	}

You don't need curly braces here, since this is only one line
in the block. Even if you did, these braces are in the wrong
location.

There is a subtle issue with this interpretation, and it
involves the case where the following happens:

1. A new version of Git writes a commit-graph using the
   GDAT chunk.

2. An older version of Git adds a new layer without the
   GDAT chunk.

At that point, the tip commit-graph does not have GDAT,
so the commits in that layer will get "generation" set
with the topological level, which is likely to be much
lower than the corrected commit dates set in the
"generation" field for commits in the lower layer.

The crux of the issue is that we are only considering
the current layer when interpreting the generation number
value.

The patch below inserts a flag into fill_commit_graph_info()
corresponding to the "global" state of whether the top
commit-graph layer has a GDAT chunk. By your later protection
to not write GDAT chunks on top of commit-graphs without
a GDAT chunk, this top commit-graph has all of the information
we need for this check.

Thanks,
-Stolee

--- >8 ---

From 62189709fad3b051cedbd36193f5244fcce17e1f Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 10 Aug 2020 10:06:47 -0400
Subject: [PATCH] commit-graph: use generation v2 only if entire chain does

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

 1. "New" Git writes a commit-graph with the GDAT chunk.
 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
generation_data_chunk pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this
case, especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index eb78af3dad..17623274d9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -762,7 +762,9 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	return &commit_list_insert(c, pptr)->next;
 }
 
-static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
+
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos, int flags)
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
@@ -785,11 +787,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
-	{
+	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
-	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -799,6 +799,10 @@ static inline void set_commit_tree(struct commit *c, struct tree *t)
 	c->maybe_tree = t;
 }
 
+/*
+ * In the case of a split commit-graph, this method expects the given
+ * commit-graph 'g' to be the top layer.
+ */
 static int fill_commit_in_graph(struct repository *r,
 				struct commit *item,
 				struct commit_graph *g, uint32_t pos)
@@ -808,11 +812,15 @@ static int fill_commit_in_graph(struct repository *r,
 	struct commit_list **pptr;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
+	int flags = 0;
+
+	if (g->chunk_generation_data)
+		flags |= COMMIT_GRAPH_GENERATION_V2;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	fill_commit_graph_info(item, g, pos);
+	fill_commit_graph_info(item, g, pos, flags);
 
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
@@ -904,10 +912,14 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
 void load_commit_graph_info(struct repository *r, struct commit *item)
 {
 	uint32_t pos;
+	int flags = 0;
+
 	if (!prepare_commit_graph(r))
 		return;
+	if (r->objects->commit_graph->chunk_generation_data)
+		flags |= COMMIT_GRAPH_GENERATION_V2;
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
-		fill_commit_graph_info(item, r->objects->commit_graph, pos);
+		fill_commit_graph_info(item, r->objects->commit_graph, pos, flags);
 }
 
 static struct tree *load_tree_for_commit(struct repository *r,
-- 
2.28.0.38.gc6f546511c1


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:28     ` Derrick Stolee
  2020-08-11 11:03       ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:28 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.
> 
> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.
> 
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).

There is a philosophical problem with this patch, and I'm not sure
about the right way to fix it, or if there really is a problem at all.
At minimum, the commit message needs to be improved to make the issue
clear:

This version of the chunk does not store corrected commit date offsets!

This commit add a chunk named "GDAT" and fills it with topological
levels. This is _different_ than the intended final format. For that
reason, the commit-graph-format.txt document is not updated.

The reason I say this is a "philosophical" problem is that this patch
introduces a version of Git that has a different interpretation of the
GDAT chunk than the version presented two patches later. While this
version would never be released, it still exists in history and could
present difficulty if someone were to bisect on an issue with the GDAT
chunk (using external data, not data produced by the compiled binary
at that version).

The justification for this commit the way you did it is clear: there
is a lot of test fallout to just including a new chunk. The question
is whether it is enough to justify this "dummy" implementation for
now?

The tricky bit is the series of three patches starting with this
one.

1. The next patch "commit-graph: return 64-bit generation number" can
   be reordered to be before this patch, no problem. I don't think
   there will be any text conflicts _except_ inside the
   write_graph_chunk_generation_data() method introduced here.

2. The patch after that, "commit-graph: implement corrected commit date"
   only has a small dependence: it writes to the GDAT chunk and parses
   it out. If you remove the interaction with the GDAT chunk, then you
   still have the computation as part of compute_generation_numbers()
   that is valuable. You will need to be careful about the exit
   condition, though, since you also introduce the topo_level chunk.

Patches 5-7 could perhaps be reorganized as follows:

  i. commit-graph: return 64-bit generation number, as-is.

 ii. Add a topo_level slab that is parsed from CDAT. Modify
     compute_generation_numbers() to populate this value and modify
     write_graph_chunk_data() to read this value. Simultaneously
     populate the "generation" member with the same value.

iii. "commit-graph: implement corrected commit date" without any GDAT
     chunk interaction. Make sure the algorithm in
     compute_generation_numbers() walks commits if either topo_level or
     generation are unset. There is a trick here: the generation value
     _is_ set if the commit is parsed from the existing commit-graph!
     Is this case covered by the existing logic to not write GDAT when
     writing a split commit-graph file with a base that does not have
     GDAT? Note that the non-split case does not load the commit-graph
     for parsing, so the interesting case is "--split-replace". Worth
     a test (after we write the GDAT chunk), which you have in "commit-graph:
     handle mixed generation commit chains".

 iv. This patch, introducing the chunk and the read/write logic.

  v. Add the remaining patches.

Again, this is a complicated patch-reorganization. The hope is that
the end result is something that is easy to review as well as something
that produces an as-sane-as-possible history for future bisecters.

Perhaps other reviewers have similar feelings, or can say that I am
being too picky.

> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.

Thank you for introducing this. It really makes it clear what the
benefit is when looking at the t6600-test-reach.sh changes. However,
the changes to that script are more "here is an opportunity for extra
coverage" as opposed to a necessary change immediately upon creating
the GDAT chunk. That could be separated out and justified on its own.
Recall that the justification is that the new version of Git will
continue to work with commit-graph files without a GDAT chunk.

> +static int write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +		hashwrite_be32(f, commit_graph_data_at(c)->generation);

Here is the "incorrect" data being written.

> +	}
> +
> +	return 0;
> +}
> +

> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -72,7 +72,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>  graph_read_expect() {
>  	OPTIONAL=""
>  	NUM_CHUNKS=3
> -	if test ! -z $2
> +	if test ! -z "$2"

A subtle change, but important because we now have multiple "extra"
chunks possible here. Good.

>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -421,8 +421,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
>  
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> -	git commit-graph verify >output
> +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> +	git commit-graph verify >output &&
> +	graph_read_expect 9 extra_edges
>  '

And it is this case as to why we don't just add "generation_data" to our
list of expected chunks.

> @@ -29,9 +29,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data

In this script, you _do_ add it to the default chunk list, which
saves some extra work in the rest of the tests. Good.


Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:42     ` Derrick Stolee
  2020-08-11 11:36       ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:42 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> As corrected commit dates and topological levels cannot be compared
> directly, we must handle commit graph chains with mixed generation
> number definitions.
> 
> While reading a commit graph file, we disable generation numbers if the
> chain contains mixed generation numbers.
> 
> While writing to commit graph chain, we write generation data chunk only
> if the previous tip of chain had a generation data chunk. Using
> `--split=replace` overwrites the existing chain and writes generation
> data chunk regardless of previous tip.
> 
> In t5324-split-commit-graph, we set up a repo with twelve commits and
> write a base commit graph file with no generation data chunk. When add
> three commits and write to chain again, Git does not write generation
> data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
> as we replace the existing chain, Git writes a commit graph file with
> generation data chunk.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 14 ++++++++
>  t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index d0f977852b..c6b6111adf 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -674,6 +674,14 @@ int generation_numbers_enabled(struct repository *r)
>  	if (!g->num_commits)
>  		return 0;
>  
> +	/* We cannot compare topological levels and corrected commit dates */
> +	while (g->base_graph) {
> +		warning(_("commit-graph-chain contains mixed generation versions"));

This warning is premature. It will add a warning whenever we have
a split commit-graph, regardless of an incorrect chain.

> +		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))

Hm. A bit-wise XOR here? That seems unfortunate. I think that it
is easier to focus on the 



> +			return 0;
> +		g = g->base_graph;
> +	}
> +

Hm. So this scenario actually disables generation numbers completely
in the event that anything in the chain disagrees. I think this is
not the right way to approach the situation, as it will significantly
punish users in this state with slow performance.

The patch I sent [1] is probably better: it uses generation number
v1 if the tip of the chain does not have a GDAT chunk.

[1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/

>  	first_generation = get_be32(g->chunk_commit_data +
>  				    g->hash_len + 8) >> 2;
>  
> @@ -2186,6 +2194,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2204,6 +2215,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}

Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
from its current location to after this line. We want to make sure that the
environment variable is checked _last_. The best location is likely the start
of the implementation of compute_generation_numbers(), or immediately before
the call to the method.

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 3 0
> +	num_commits: 12
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split='replace' &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15
> +'

It would be valuable to double-check here that the values in the GDAT chunk
are correct. I'm concerned about the possibility that the 'generation'
member of struct commit_graph_data gets filled with topological level during
parsing and then that is written as an offset into the CDAT chunk.

Perhaps this is best left for a follow-up series that updates the 'verify'
subcommand to check the GDAT chunk.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:47   ` Derrick Stolee
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:47 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
> 
> I look forward to everyone's reviews!
> 
> Thanks
> 
>  * Abhishek
> 
> 
> ----------------------------------------------------------------------------
> 
> Changes in version 2:
> 
>  * Add tests for generation data chunk.
>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.
>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().
>  * Update technical documentation.
>  * Handle mixed graph version commit chains.
>  * Improve commit messages for
>  * Revert unnecessary whitespace changes.
>  * Split uint_32 -> timestamp_t change into a new commit.

This version looks to be in really good shape, for the most part. I feel
the need to point out that this is a very mature submission, and it covers
many of the bases we expect from a quality series, even at only the second
version.

My comments deal with some high-level "patch series strategy" considerations,
mostly. There are a few concerns about correctness in the complicated cases
of mixed commit-graphs with or without the GDAT chunk, and how we could
perhaps test those cases. The best way would be to update the 'verify'
subcommand to check the GDAT chunk. That's a good feature to have, but would
be unnecessary bloat to this series. Depending on timing, we might want to
hold this series in 'next' until such an implementation is submitted. (I
expect that Abhishek's GSoC internship would be over at that point, so I
volunteer to send those patches after this series stabilizes.)

Thank you for your careful work on such a complicated subject!

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-10 16:28     ` Derrick Stolee
@ 2020-08-11 11:03       ` Abhishek Kumar
  2020-08-11 12:27         ` Derrick Stolee
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-11 11:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: jnareb, abhishekkumar8222, git, me, gitgitgadget

On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > As discovered by Ævar, we cannot increment graph version to
> > distinguish between generation numbers v1 and v2 [1]. Thus, one of
> > pre-requistes before implementing generation number was to distinguish
> > between graph versions in a backwards compatible manner.
> > 
> > We are going to introduce a new chunk called Generation Data chunk (or
> > GDAT). GDAT stores generation number v2 (and any subsequent versions),
> > whereas CDAT will still store topological level.
> > 
> > Old Git does not understand GDAT chunk and would ignore it, reading
> > topological levels from CDAT. New Git can parse GDAT and take advantage
> > of newer generation numbers, falling back to topological levels when
> > GDAT chunk is missing (as it would happen with a commit graph written
> > by old Git).
> 
> There is a philosophical problem with this patch, and I'm not sure
> about the right way to fix it, or if there really is a problem at all.
> At minimum, the commit message needs to be improved to make the issue
> clear:
> 
> This version of the chunk does not store corrected commit date offsets!
> 
> This commit add a chunk named "GDAT" and fills it with topological
> levels. This is _different_ than the intended final format. For that
> reason, the commit-graph-format.txt document is not updated.
> 
> The reason I say this is a "philosophical" problem is that this patch
> introduces a version of Git that has a different interpretation of the
> GDAT chunk than the version presented two patches later. While this
> version would never be released, it still exists in history and could
> present difficulty if someone were to bisect on an issue with the GDAT
> chunk (using external data, not data produced by the compiled binary
> at that version).
> 

Yes, that is correct. I did often wonder that our inference that "commit
graph has a generation data chunk implies commit graph stores corrected
commit date offsets" is not always true because of this "dummy"
implementation. 

> The justification for this commit the way you did it is clear: there
> is a lot of test fallout to just including a new chunk. The question
> is whether it is enough to justify this "dummy" implementation for
> now?
> 
> The tricky bit is the series of three patches starting with this
> one.
> 
> 1. The next patch "commit-graph: return 64-bit generation number" can
>    be reordered to be before this patch, no problem. I don't think
>    there will be any text conflicts _except_ inside the
>    write_graph_chunk_generation_data() method introduced here.
> 
> 2. The patch after that, "commit-graph: implement corrected commit date"
>    only has a small dependence: it writes to the GDAT chunk and parses
>    it out. If you remove the interaction with the GDAT chunk, then you
>    still have the computation as part of compute_generation_numbers()
>    that is valuable. You will need to be careful about the exit
>    condition, though, since you also introduce the topo_level chunk.
> 
> Patches 5-7 could perhaps be reorganized as follows:
> 
>   i. commit-graph: return 64-bit generation number, as-is.
> 
>  ii. Add a topo_level slab that is parsed from CDAT. Modify
>      compute_generation_numbers() to populate this value and modify
>      write_graph_chunk_data() to read this value. Simultaneously
>      populate the "generation" member with the same value.
> 
> iii. "commit-graph: implement corrected commit date" without any GDAT
>      chunk interaction. Make sure the algorithm in
>      compute_generation_numbers() walks commits if either topo_level or
>      generation are unset. There is a trick here: the generation value
>      _is_ set if the commit is parsed from the existing commit-graph!
>      Is this case covered by the existing logic to not write GDAT when
>      writing a split commit-graph file with a base that does not have
>      GDAT? Note that the non-split case does not load the commit-graph
>      for parsing, so the interesting case is "--split-replace". Worth
>      a test (after we write the GDAT chunk), which you have in "commit-graph:
>      handle mixed generation commit chains".
> 

Right, so at the end of this patch we compute corrected commit dates but
don't write them to graph file.

Although, writing ii. and iii. together in the same patch makes more
sense to me. Would it be hard to follow for someone who has no context
of this discussion?

>  iv. This patch, introducing the chunk and the read/write logic.
> 
>   v. Add the remaining patches.
> 
> Again, this is a complicated patch-reorganization. The hope is that
> the end result is something that is easy to review as well as something
> that produces an as-sane-as-possible history for future bisecters.
> 
> Perhaps other reviewers have similar feelings, or can say that I am
> being too picky.
> 

I can see how the reorganization helps us avoid a rather nasty
situation to be in. Should not be too hard to reorganize.

> > We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> > which forces commit-graph file to be written without generation data
> > chunk to emulate a commit-graph file written by old Git.
> 
> ...
> 
> Thanks,
> -Stolee

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-10 16:42     ` Derrick Stolee
@ 2020-08-11 11:36       ` Abhishek Kumar
  2020-08-11 12:43         ` Derrick Stolee
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-11 11:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, me, jnareb, gitgitgadget

On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> 
> ...
>
> Hm. So this scenario actually disables generation numbers completely
> in the event that anything in the chain disagrees. I think this is
> not the right way to approach the situation, as it will significantly
> punish users in this state with slow performance.
> 
> The patch I sent [1] is probably better: it uses generation number
> v1 if the tip of the chain does not have a GDAT chunk.
> 
> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
> 

Yes, the patch is an clear improvement over my (convoluted and incorrect)
logic. Will add.

>
> ...
> 
> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
> from its current location to after this line. We want to make sure that the
> environment variable is checked _last_. The best location is likely the start
> of the implementation of compute_generation_numbers(), or immediately before
> the call to the method.
> 

Sure, will do.

>
> ...
> 
> It would be valuable to double-check here that the values in the GDAT chunk
> are correct. I'm concerned about the possibility that the 'generation'
> member of struct commit_graph_data gets filled with topological level during
> parsing and then that is written as an offset into the CDAT chunk.
> 
> Perhaps this is best left for a follow-up series that updates the 'verify'
> subcommand to check the GDAT chunk.

If I can understand it correctly, one of ways to update 'verify'
subcommand to check the GDAT chunk as well would to be make use of the
flag variable introduced in your patch. We can isolate generation number
related checks and run checks once with flag = 1 (checking corrected
commit dates) and once with flag = 0 (checking topological levels).

This has the unfortunate effect of filling all commits twice, but as we
cannot change the commit_graph_data->generation any other way, I see no
alternatives without changing how commit_graph_generation() works.

Would it make more sense if we add the flag to struct commit_graph
instead of making it depend solely on g->chunk_generation_data and set
it within parse_commit_graph()?

We would be able to control the behavior of fill_commit_graph_info() and
we will not need to check g->chunk_generation_data before filling every
commit.

> 
> Thanks,
> -Stolee
> 

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-11 11:03       ` Abhishek Kumar
@ 2020-08-11 12:27         ` Derrick Stolee
  2020-08-11 18:58           ` Taylor Blau
  0 siblings, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-08-11 12:27 UTC (permalink / raw)
  To: aee0ae56-3395-6848-d573-27a318d72755
  Cc: jnareb, abhishekkumar8222, git, me, gitgitgadget

On 8/11/2020 7:03 AM, Abhishek Kumar wrote:
> On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
>> Patches 5-7 could perhaps be reorganized as follows:
>>
>>   i. commit-graph: return 64-bit generation number, as-is.
>>
>>  ii. Add a topo_level slab that is parsed from CDAT. Modify
>>      compute_generation_numbers() to populate this value and modify
>>      write_graph_chunk_data() to read this value. Simultaneously
>>      populate the "generation" member with the same value.
>>
>> iii. "commit-graph: implement corrected commit date" without any GDAT
>>      chunk interaction. Make sure the algorithm in
>>      compute_generation_numbers() walks commits if either topo_level or
>>      generation are unset. There is a trick here: the generation value
>>      _is_ set if the commit is parsed from the existing commit-graph!
>>      Is this case covered by the existing logic to not write GDAT when
>>      writing a split commit-graph file with a base that does not have
>>      GDAT? Note that the non-split case does not load the commit-graph
>>      for parsing, so the interesting case is "--split-replace". Worth
>>      a test (after we write the GDAT chunk), which you have in "commit-graph:
>>      handle mixed generation commit chains".
>>
> 
> Right, so at the end of this patch we compute corrected commit dates but
> don't write them to graph file.
> 
> Although, writing ii. and iii. together in the same patch makes more
> sense to me. Would it be hard to follow for someone who has no context
> of this discussion?

It is always easier to combine two patches than to split one into two.

With that in mind, I recommend starting with a split version and then
seeing how each patch looks. I think that these are "independent enough"
ideas that justify the separate patches.

>>  iv. This patch, introducing the chunk and the read/write logic.
>>
>>   v. Add the remaining patches.
>>
>> Again, this is a complicated patch-reorganization. The hope is that
>> the end result is something that is easy to review as well as something
>> that produces an as-sane-as-possible history for future bisecters.
>>
>> Perhaps other reviewers have similar feelings, or can say that I am
>> being too picky.
>>
> 
> I can see how the reorganization helps us avoid a rather nasty
> situation to be in. Should not be too hard to reorganize.

I hope not. Hopefully you get some more review on this version
before jumping in on such a big reorg (in case someone else has
a different opinion).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-11 11:36       ` Abhishek Kumar
@ 2020-08-11 12:43         ` Derrick Stolee
  0 siblings, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-08-11 12:43 UTC (permalink / raw)
  To: 0d741fb2-e25a-be05-9f2b-81ba2b4ced3f
  Cc: abhishekkumar8222, git, me, jnareb, gitgitgadget

On 8/11/2020 7:36 AM, Abhishek Kumar wrote:
> On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
>> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
>>
>> ...
>>
>> Hm. So this scenario actually disables generation numbers completely
>> in the event that anything in the chain disagrees. I think this is
>> not the right way to approach the situation, as it will significantly
>> punish users in this state with slow performance.
>>
>> The patch I sent [1] is probably better: it uses generation number
>> v1 if the tip of the chain does not have a GDAT chunk.
>>
>> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
>>
> 
> Yes, the patch is an clear improvement over my (convoluted and incorrect)
> logic. Will add.
> 
>>
>> ...
>>
>> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> from its current location to after this line. We want to make sure that the
>> environment variable is checked _last_. The best location is likely the start
>> of the implementation of compute_generation_numbers(), or immediately before
>> the call to the method.
>>
> 
> Sure, will do.
> 
>>
>> ...
>>
>> It would be valuable to double-check here that the values in the GDAT chunk
>> are correct. I'm concerned about the possibility that the 'generation'
>> member of struct commit_graph_data gets filled with topological level during
>> parsing and then that is written as an offset into the CDAT chunk.
>>
>> Perhaps this is best left for a follow-up series that updates the 'verify'
>> subcommand to check the GDAT chunk.
> 
> If I can understand it correctly, one of ways to update 'verify'
> subcommand to check the GDAT chunk as well would to be make use of the
> flag variable introduced in your patch. We can isolate generation number
> related checks and run checks once with flag = 1 (checking corrected
> commit dates) and once with flag = 0 (checking topological levels).
> 
> This has the unfortunate effect of filling all commits twice, but as we
> cannot change the commit_graph_data->generation any other way, I see no
> alternatives without changing how commit_graph_generation() works.
> 
> Would it make more sense if we add the flag to struct commit_graph
> instead of making it depend solely on g->chunk_generation_data and set
> it within parse_commit_graph()?
> 
> We would be able to control the behavior of fill_commit_graph_info() and
> we will not need to check g->chunk_generation_data before filling every
> commit.

I missed that you _already_ updated the logic in verify_commit_graph()
based on the generation. That logic should catch the problem, so it
might be enough to just add some "git commit-graph verify" commands into
your multi-level tests.

Specifically, the end result is this check:

	corrected_commit_date = commit_graph_generation(graph_commit);
	if (corrected_commit_date < max_parent_corrected_commit_date + 1)
		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
			     oid_to_hex(&cur_oid),
			     corrected_commit_date,
			     max_parent_corrected_commit_date + 1);

This will catch the order violations I was proposing could happen. It
doesn't go the extra mile to ensure that the commit-graph stores the
exact correct value or that the two bits of data are correct (both
topo-level and corrected commit date). That is fine for now, and we
can revisit if necessary.

The diff below makes some tweaks to your split-level test to show the
logic _was_ incorrect without my patch. Please incorporate the test
changes into your series. Note in particular that I added a base
layer that includes the GDAT chunk and _then_ adds a layer without
the GDAT chunk. That is an important case!

Thanks,
-Stolee

--- >8 ---

diff --git a/commit-graph.c b/commit-graph.c
index 17623274d9..d891a8ba3a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,14 +674,6 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
-	/* We cannot compare topological levels and corrected commit dates */
-	while (g->base_graph) {
-		warning(_("commit-graph-chain contains mixed generation versions"));
-		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
-			return 0;
-		g = g->base_graph;
-	}
-
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -787,7 +779,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
+	if (g->chunk_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 1a9be5e656..721515cc23 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -443,6 +443,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		test_commit $i &&
 		git branch commits/$i || return 1
 	done &&
+	git commit-graph write --reachable --split &&
 	git reset --hard commits/2 &&
 	for i in $(test_seq 6 10)
 	do
@@ -455,14 +456,15 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 	git reset --hard commits/4 &&
 	git merge commits/6 &&
 	git branch merge/2 &&
-	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
-	header: 43475048 1 1 3 0
-	num_commits: 12
+	header: 43475048 1 1 4 1
+	num_commits: 7
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'does not write generation data chunk if not present on existing tip' '
@@ -472,23 +474,25 @@ test_expect_success 'does not write generation data chunk if not present on exis
 	git merge commits/5 &&
 	git merge merge/2 &&
 	git branch merge/3 &&
-	git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
 	header: 43475048 1 1 4 1
 	num_commits: 3
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
 	cd "$TRASH_DIRECTORY/mixed" &&
-	git commit-graph write --reachable --split='replace' &&
+	git commit-graph write --reachable --split=replace &&
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 1 $graphdir/commit-graph-chain &&
 	verify_chain_files_exist $graphdir &&
-	graph_read_expect 15
+	graph_read_expect 15 &&
+	git commit-graph verify
 '
 
 test_done

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-11 12:27         ` Derrick Stolee
@ 2020-08-11 18:58           ` Taylor Blau
  0 siblings, 0 replies; 107+ messages in thread
From: Taylor Blau @ 2020-08-11 18:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: aee0ae56-3395-6848-d573-27a318d72755, jnareb, abhishekkumar8222,
	git, me, gitgitgadget

On Tue, Aug 11, 2020 at 08:27:41AM -0400, Derrick Stolee wrote:
> On 8/11/2020 7:03 AM, Abhishek Kumar wrote:
> > On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
> >> Patches 5-7 could perhaps be reorganized as follows:
> >>
> >>   i. commit-graph: return 64-bit generation number, as-is.
> >>
> >>  ii. Add a topo_level slab that is parsed from CDAT. Modify
> >>      compute_generation_numbers() to populate this value and modify
> >>      write_graph_chunk_data() to read this value. Simultaneously
> >>      populate the "generation" member with the same value.
> >>
> >> iii. "commit-graph: implement corrected commit date" without any GDAT
> >>      chunk interaction. Make sure the algorithm in
> >>      compute_generation_numbers() walks commits if either topo_level or
> >>      generation are unset. There is a trick here: the generation value
> >>      _is_ set if the commit is parsed from the existing commit-graph!
> >>      Is this case covered by the existing logic to not write GDAT when
> >>      writing a split commit-graph file with a base that does not have
> >>      GDAT? Note that the non-split case does not load the commit-graph
> >>      for parsing, so the interesting case is "--split-replace". Worth
> >>      a test (after we write the GDAT chunk), which you have in "commit-graph:
> >>      handle mixed generation commit chains".
> >>
> >
> > Right, so at the end of this patch we compute corrected commit dates but
> > don't write them to graph file.
> >
> > Although, writing ii. and iii. together in the same patch makes more
> > sense to me. Would it be hard to follow for someone who has no context
> > of this discussion?
>
> It is always easier to combine two patches than to split one into two.
>
> With that in mind, I recommend starting with a split version and then
> seeing how each patch looks. I think that these are "independent enough"
> ideas that justify the separate patches.
>
> >>  iv. This patch, introducing the chunk and the read/write logic.
> >>
> >>   v. Add the remaining patches.
> >>
> >> Again, this is a complicated patch-reorganization. The hope is that
> >> the end result is something that is easy to review as well as something
> >> that produces an as-sane-as-possible history for future bisecters.
> >>
> >> Perhaps other reviewers have similar feelings, or can say that I am
> >> being too picky.
> >>
> >
> > I can see how the reorganization helps us avoid a rather nasty
> > situation to be in. Should not be too hard to reorganize.
>
> I hope not. Hopefully you get some more review on this version
> before jumping in on such a big reorg (in case someone else has
> a different opinion).

I think the direction makes sense. We should avoid having the dummy
implementation of the GDAT chunk in the interim if at all possible (and
it seems like it is). What Stolee is proposing is what I'd suggest, too.

Please let us know if you need any help restructuring these patches.
Please make sure to give them a careful review since it is easy to move
a hunk into the wrong commit, or let a detail in the patch text become
out-of-date. Looking over "git log -p origin/master..HEAD" and "git
rebase -x make -j8 DEVELOPER=1 test' origin/master" never hurts ;).

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-10 14:23     ` Derrick Stolee
@ 2020-08-14  4:59       ` Abhishek Kumar
  2020-08-14 12:24         ` Derrick Stolee
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-14  4:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, jnareb, me, gitgitgadget

On Mon, Aug 10, 2020 at 10:23:32AM -0400, Derrick Stolee wrote:
> ....
> --- >8 ---
> 
> From 62189709fad3b051cedbd36193f5244fcce17e1f Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Mon, 10 Aug 2020 10:06:47 -0400
> Subject: [PATCH] commit-graph: use generation v2 only if entire chain does
> 
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
> 
>  1. "New" Git writes a commit-graph with the GDAT chunk.
>  2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> Because of the current use of inspecting the current layer for a
> generation_data_chunk pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.
> 
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
> 
> This issue would manifest itself as a performance problem in this
> case, especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index eb78af3dad..17623274d9 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -762,7 +762,9 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
>  	return &commit_list_insert(c, pptr)->next;
>  }
>  
> -static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
> +
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos, int flags)
>  {
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
> @@ -785,11 +787,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	if (g->chunk_generation_data)
> -	{
> +	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
>  		graph_data->generation = item->date +
>  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> -	}
>  	else
>  		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
> @@ -799,6 +799,10 @@ static inline void set_commit_tree(struct commit *c, struct tree *t)
>  	c->maybe_tree = t;
>  }
>  
> +/*
> + * In the case of a split commit-graph, this method expects the given
> + * commit-graph 'g' to be the top layer.
> + */

Unfortunately, this turns out to be an optimistic assumption. After
adding in changes from this patch and extended tests for
split-commit-graph [1], the tests fail with `git commit-graph verify` as
Git tries to compare topological level with corrected commit dates.

The problem lies in assuming that `g` is always the top layer, without
any way to assert if that's true. In case of `commit-graph verify`, we
update `g = g->base_graph` and verify recursively.

If we can assume such behavior is only the part of verify subcommand, I
can update the tests to no longer verify at the end.

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

>  static int fill_commit_in_graph(struct repository *r,
>  				struct commit *item,
>  				struct commit_graph *g, uint32_t pos)
> @@ -808,11 +812,15 @@ static int fill_commit_in_graph(struct repository *r,
>  	struct commit_list **pptr;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
> +	int flags = 0;
> +
> +	if (g->chunk_generation_data)
> +		flags |= COMMIT_GRAPH_GENERATION_V2;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> -	fill_commit_graph_info(item, g, pos);
> +	fill_commit_graph_info(item, g, pos, flags);
>  
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> @@ -904,10 +912,14 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
>  void load_commit_graph_info(struct repository *r, struct commit *item)
>  {
>  	uint32_t pos;
> +	int flags = 0;
> +
>  	if (!prepare_commit_graph(r))
>  		return;
> +	if (r->objects->commit_graph->chunk_generation_data)
> +		flags |= COMMIT_GRAPH_GENERATION_V2;
>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> -		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> +		fill_commit_graph_info(item, r->objects->commit_graph, pos, flags);
>  }
>  
>  static struct tree *load_tree_for_commit(struct repository *r,
> -- 
> 2.28.0.38.gc6f546511c1
> 

I solved the issue by adding a new member to struct commit_graph
`read_generation_data` to maintain the "global" state of the entire
commit-graph chain instead.

The relevant changes are in validate_mixed_generation_chain(),
read_commit_graph_one() and fill_commit_graph_info().

Thanks
- Abhishek

--- >8 ---
From aaf8c27bfec6e110a8bb12173c2dd612a8c6b8b9 Mon Sep 17 00:00:00 2001
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Date: Thu, 6 Aug 2020 19:08:52 +0530
Subject: [PATCH] commit-graph: use generation v2 only if entire chain does

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 36 ++++++++++++++++--
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d0f977852b..10309f870f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	int read_generation_data = 1;
+
+	while (g) {
+		if (!g->chunk_generation_data) {
+			read_generation_data = 0;
+			break;
+		}
+		g = g->base_graph;
+	}
+
+	g = r->objects->commit_graph;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(r);
+
 	return g;
 }
 
@@ -740,6 +763,8 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
+
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	const unsigned char *commit_data;
@@ -763,11 +788,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
-	{
+	if (g->chunk_generation_data && g->read_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
-	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -884,6 +907,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
 	uint32_t pos;
 	if (!prepare_commit_graph(r))
 		return;
+
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
 }
@@ -2186,6 +2210,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2204,6 +2231,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index f89614ecd5..305c332b7e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	uint32_t read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 531016f405..ac5e7783fb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -424,4 +424,74 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
 test_done
-- 
2.28.0

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-14  4:59       ` Abhishek Kumar
@ 2020-08-14 12:24         ` Derrick Stolee
  0 siblings, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-08-14 12:24 UTC (permalink / raw)
  To: a3910f82-ab2e-bf35-ac43-c30d77f3c96b
  Cc: abhishekkumar8222, git, jnareb, me, gitgitgadget

On 8/14/2020 12:59 AM, Abhishek Kumar wrote:
> I solved the issue by adding a new member to struct commit_graph
> `read_generation_data` to maintain the "global" state of the entire
> commit-graph chain instead.
> 
> The relevant changes are in validate_mixed_generation_chain(),
> read_commit_graph_one() and fill_commit_graph_info().

I think this is a good way to go. Adding that restriction about
the tip commit-graph was short-sighted of me and was likely to
break in the future.

I think your solution here to store extra state from the entire
chain into each layer makes a lot of sense.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2020-08-15 16:39   ` Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                       ` (11 more replies)
  11 siblings, 12 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
first version.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 3:

 * Reordered patches as discussed in 1
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed graph version commit chains.
 * Improve commit messages for
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  12 +-
 Documentation/technical/commit-graph.txt      |  45 ++--
 commit-graph.c                                | 241 +++++++++++++-----
 commit-graph.h                                |  16 +-
 commit-reach.c                                |  49 ++--
 commit-reach.h                                |   2 +-
 commit.c                                      |   9 +-
 commit.h                                      |   4 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   2 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |   4 +-
 t/t5318-commit-graph.sh                       |  27 +-
 t/t5324-split-commit-graph.sh                 |  82 +++++-
 t/t6024-recursive-merge.sh                    |   4 +-
 t/t6600-test-reach.sh                         |  62 +++--
 upload-pack.c                                 |   2 +-
 18 files changed, 396 insertions(+), 185 deletions(-)


base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v2:

  1:  a962b9ae4b =  1:  c6b7ade7af commit-graph: fix regression when computing bloom filter
  2:  cf61239f93 =  2:  e673867234 revision: parse parent in indegree_walk_step()
  3:  32da955e31 =  3:  18d5864f81 commit-graph: consolidate fill_commit_graph_info
  4:  b254782858 !  4:  6a0cde983d commit-graph: consolidate compare_commits_by_gen
     @@ Commit message
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
          Reviewed-by: Taylor Blau <me@ttaylorr.com>
     +    Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
  6:  1aa2a00a7a =  5:  6be759a954 commit-graph: return 64-bit generation number
  7:  bfe1473201 !  6:  b347dbb01b commit-graph: implement corrected commit date
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: implement corrected commit date
     +    commit-graph: add a slab to store topological levels
      
     -    With most of preparations done, let's implement corrected commit date
     -    offset. We add a new commit-slab to store topogical levels while
     -    writing commit graph and upgrade the generation member in struct
     -    commit_graph_data to a 64-bit timestamp. We store topological levels to
     -    ensure that older versions of Git will still have the performance
     -    benefits from generation number v2.
     +    As we are writing topological levels to commit data chunk to ensure
     +    backwards compatibility with "Old" Git and the member `generation` of
     +    struct commit_graph_data will store corrected commit date in a later
     +    commit, let's introduce a commit-slab to store topological levels while
     +    writing commit-graph.
     +
     +    When Git creates a split commit-graph, it takes advantage of the
     +    generation values that have been computed already and present in
     +    existing commit-graph files.
     +
     +    So, let's add a pointer to struct commit_graph to the topological level
     +    commit-slab and populate it with topological levels while writing a
     +    split commit-graph.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
       /* Keep track of the order in which commits are added to our list. */
       define_commit_slab(commit_pos, int);
       static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
      @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     - 	if (g->chunk_generation_data)
     --		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	{
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	}
     - 	else
     - 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     + 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     ++
     ++	if (g->topo_levels)
     ++		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
       }
     + 
     + static inline void set_commit_tree(struct commit *c, struct tree *t)
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct progress *progress;
     - 	int progress_done;
     - 	uint64_t progress_cnt;
     -+	struct topo_level_slab *topo_levels;
     + 		 changed_paths:1,
     + 		 order_by_pack:1;
       
     - 	char *base_graph_name;
     - 	int num_commit_graphs_before;
     ++	struct topo_level_slab *topo_levels;
     + 	const struct split_commit_graph_opts *split_opts;
     + 	size_t total_bloom_filter_data_size;
     + 	const struct bloom_filter_settings *bloom_settings;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       		else
       			packedDate[0] = 0;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       
       		packedDate[1] = htonl((*list)->date);
       		hashwrite(f, packedDate, 8);
     -@@ commit-graph.c: static int write_graph_chunk_generation_data(struct hashfile *f,
     - 	int i;
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     - 		struct commit *c = ctx->commits.list[i];
     -+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
     - 		display_progress(ctx->progress, ++ctx->progress_cnt);
     --		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     -+
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     -+
     -+		hashwrite_be32(f, offset);
     - 	}
     - 
     - 	return 0;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       					_("Computing commit graph generation numbers"),
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
      -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     ++		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
      -		    generation != GENERATION_NUMBER_ZERO)
     -+		if (topo_level != GENERATION_NUMBER_V1_INFINITY &&
     -+		    topo_level != GENERATION_NUMBER_ZERO)
     ++		if (level != GENERATION_NUMBER_V1_INFINITY &&
     ++		    level != GENERATION_NUMBER_ZERO)
       			continue;
       
       		commit_list_insert(ctx->commits.list[i], &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       			int all_parents_computed = 1;
      -			uint32_t max_generation = 0;
      +			uint32_t max_level = 0;
     -+			timestamp_t max_corrected_commit_date = current->date - 1;
       
       			for (parent = current->parents; parent; parent = parent->next) {
      -				generation = commit_graph_data_at(parent->item)->generation;
     -+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     ++				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
       
      -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
      -				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (topo_level == GENERATION_NUMBER_V1_INFINITY ||
     -+				    topo_level == GENERATION_NUMBER_ZERO) {
     ++				if (level == GENERATION_NUMBER_V1_INFINITY ||
     ++				    level == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
       					break;
      -				} else if (generation > max_generation) {
      -					max_generation = generation;
     -+				} else {
     -+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
     -+
     -+					if (topo_level > max_level)
     -+						max_level = topo_level;
     -+
     -+					if (data->generation > max_corrected_commit_date)
     -+						max_corrected_commit_date = data->generation;
     ++				} else if (level > max_level) {
     ++					max_level = level;
       				}
       			}
       
       			if (all_parents_computed) {
     - 				struct commit_graph_data *data = commit_graph_data_at(current);
     - 
     +-				struct commit_graph_data *data = commit_graph_data_at(current);
     +-
      -				data->generation = max_generation + 1;
     --				pop_commit(&list);
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     -+
     -+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
     -+				data->generation = max_corrected_commit_date + 1;
     + 				pop_commit(&list);
       
      -				if (data->generation > GENERATION_NUMBER_MAX)
      -					data->generation = GENERATION_NUMBER_MAX;
     -+				pop_commit(&list);
     ++				if (max_level > GENERATION_NUMBER_MAX - 1)
     ++					max_level = GENERATION_NUMBER_MAX - 1;
     ++				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
       	}
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	if (!commit_graph_compatible(the_repository))
       		return 0;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	ctx->total_bloom_filter_data_size = 0;
     - 	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     + 		}
     + 	}
       
      +	init_topo_level_slab(&topo_levels);
      +	ctx->topo_levels = &topo_levels;
      +
     - 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     - 		ctx->changed_paths = 1;
     - 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 	for (i = 0; i < g->num_commits; i++) {
     - 		struct commit *graph_commit, *odb_commit;
     - 		struct commit_list *graph_parents, *odb_parents;
     --		timestamp_t max_generation = 0;
     --		timestamp_t generation;
     -+		timestamp_t max_parent_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     - 
     - 		display_progress(progress, i + 1);
     - 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_parent_corrected_commit_date)
     -+				max_parent_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     ++	if (ctx->r->objects->commit_graph) {
     ++		struct commit_graph *g = ctx->r->objects->commit_graph;
     ++
     ++		while (g) {
     ++			g->topo_levels = &topo_levels;
     ++			g = g->base_graph;
     ++		}
     ++	}
     ++
     + 	if (pack_indexes) {
     + 		ctx->order_by_pack = 1;
     + 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
     +
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	const unsigned char *chunk_bloom_indexes;
     + 	const unsigned char *chunk_bloom_data;
     + 
     ++	struct topo_level_slab *topo_levels;
     + 	struct bloom_filter_settings *bloom_filter_settings;
     + };
       
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     --			max_generation--;
     --
     --		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_parent_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
      
       ## commit.h ##
      @@
  -:  ---------- >  7:  4074ace65b commit-graph: implement corrected commit date
  5:  cb797e20d7 !  8:  4e746628ac commit-graph: implement generation data chunk
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       
      -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
      +	if (g->chunk_generation_data)
     -+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++		graph_data->generation = item->date +
     ++			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
      +	else
      +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     - }
       
     - static inline void set_commit_tree(struct commit *c, struct tree *t)
     + 	if (g->topo_levels)
     + 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
      @@ commit-graph.c: struct write_commit_graph_context {
       		 report_progress:1,
       		 split:1,
     @@ commit-graph.c: struct write_commit_graph_context {
      +		 order_by_pack:1,
      +		 write_generation_data:1;
       
     + 	struct topo_level_slab *topo_levels;
       	const struct split_commit_graph_opts *split_opts;
     - 	size_t total_bloom_filter_data_size;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       	return 0;
       }
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      +	int i;
      +	for (i = 0; i < ctx->commits.nr; i++) {
      +		struct commit *c = ctx->commits.list[i];
     ++		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
     -+		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     ++
     ++		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     ++			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     ++		hashwrite_be32(f, offset);
      +	}
      +
      +	return 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	chunks[2].id = GRAPH_CHUNKID_DATA;
       	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
       	chunks[2].write_fn = write_graph_chunk_data;
     ++
     ++	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
     ++		ctx->write_generation_data = 0;
      +	if (ctx->write_generation_data) {
      +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
      +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->split_opts = split_opts;
       	ctx->total_bloom_filter_data_size = 0;
     -+	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     ++	ctx->write_generation_data = 1;
       
       	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
       		ctx->changed_paths = 1;
     @@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     + 	infodir=".git/objects/info" &&
       	graphdir="$infodir/commit-graphs" &&
     - 	test_oid_init &&
       	test_oid_cache <<-EOM
      -	shallow sha1:1760
      -	shallow sha256:2064
  8:  833779ad53 !  9:  5a147a9704 commit-graph: handle mixed generation commit chains
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: handle mixed generation commit chains
     +    commit-graph: use generation v2 only if entire chain does
      
     -    As corrected commit dates and topological levels cannot be compared
     -    directly, we must handle commit graph chains with mixed generation
     -    number definitions.
     +    Since there are released versions of Git that understand generation
     +    numbers in the commit-graph's CDAT chunk but do not understand the GDAT
     +    chunk, the following scenario is possible:
      
     -    While reading a commit graph file, we disable generation numbers if the
     -    chain contains mixed generation numbers.
     +    1. "New" Git writes a commit-graph with the GDAT chunk.
     +    2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
      
     -    While writing to commit graph chain, we write generation data chunk only
     -    if the previous tip of chain had a generation data chunk. Using
     -    `--split=replace` overwrites the existing chain and writes generation
     -    data chunk regardless of previous tip.
     +    Because of the current use of inspecting the current layer for a
     +    chunk_generation_data pointer, the commits in the lower layer will be
     +    interpreted as having very large generation values (commit date plus
     +    offset) compared to the generation numbers in the top layer (topological
     +    level). This violates the expectation that the generation of a parent is
     +    strictly smaller than the generation of a child.
      
     -    In t5324-split-commit-graph, we set up a repo with twelve commits and
     -    write a base commit graph file with no generation data chunk. When add
     -    three commits and write to chain again, Git does not write generation
     -    data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
     -    as we replace the existing chain, Git writes a commit graph file with
     -    generation data chunk.
     +    It is difficult to expose this issue in a test. Since we _start_ with
     +    artificially low generation numbers, any commit walk that prioritizes
     +    generation numbers will walk all of the commits with high generation
     +    number before walking the commits with low generation number. In all the
     +    cases I tried, the commit-graph layers themselves "protect" any
     +    incorrect behavior since none of the commits in the lower layer can
     +    reach the commits in the upper layer.
      
     +    This issue would manifest itself as a performance problem in this case,
     +    especially with something like "git log --graph" since the low
     +    generation numbers would cause the in-degree queue to walk all of the
     +    commits in the lower layer before allowing the topo-order queue to write
     +    anything to output (depending on the size of the upper layer).
     +
     +    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
     -@@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
     - 	if (!g->num_commits)
     - 		return 0;
     +@@ commit-graph.c: static struct commit_graph *load_commit_graph_chain(struct repository *r,
     + 	return graph_chain;
     + }
       
     -+	/* We cannot compare topological levels and corrected commit dates */
     -+	while (g->base_graph) {
     -+		warning(_("commit-graph-chain contains mixed generation versions"));
     -+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
     -+			return 0;
     ++static void validate_mixed_generation_chain(struct repository *r)
     ++{
     ++	struct commit_graph *g = r->objects->commit_graph;
     ++	int read_generation_data = 1;
     ++
     ++	while (g) {
     ++		if (!g->chunk_generation_data) {
     ++			read_generation_data = 0;
     ++			break;
     ++		}
      +		g = g->base_graph;
      +	}
      +
     - 	first_generation = get_be32(g->chunk_commit_data +
     - 				    g->hash_len + 8) >> 2;
     ++	g = r->objects->commit_graph;
     ++
     ++	while (g) {
     ++		g->read_generation_data = read_generation_data;
     ++		g = g->base_graph;
     ++	}
     ++}
     ++
     + struct commit_graph *read_commit_graph_one(struct repository *r,
     + 					   struct object_directory *odb)
     + {
     +@@ commit-graph.c: struct commit_graph *read_commit_graph_one(struct repository *r,
     + 	if (!g)
     + 		g = load_commit_graph_chain(r, odb);
     + 
     ++	validate_mixed_generation_chain(r);
     ++
     + 	return g;
     + }
     + 
     +@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     + 	date_low = get_be32(commit_data + g->hash_len + 12);
     + 	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     +-	if (g->chunk_generation_data)
     ++	if (g->chunk_generation_data && g->read_generation_data)
     + 		graph_data->generation = item->date +
     + 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     + 	else
     +@@ commit-graph.c: void load_commit_graph_info(struct repository *r, struct commit *item)
     + 	uint32_t pos;
     + 	if (!prepare_commit_graph(r))
     + 		return;
     ++
     + 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
     + 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
     + }
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       		g = ctx->r->objects->commit_graph;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       	ctx->approx_nr_objects = approximate_object_count();
      
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	struct object_directory *odb;
     + 
     + 	uint32_t num_commits_in_base;
     ++	uint32_t read_generation_data;
     + 	struct commit_graph *base_graph;
     + 
     + 	const uint32_t *chunk_oid_fanout;
     +
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: done <<\EOF
       0600 -r--------
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +		test_commit $i &&
      +		git branch commits/$i || return 1
      +	done &&
     ++	git commit-graph write --reachable --split &&
      +	git reset --hard commits/2 &&
      +	git merge commits/4 &&
      +	git branch merge/1 &&
      +	git reset --hard commits/4 &&
      +	git merge commits/6 &&
      +	git branch merge/2 &&
     -+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
     ++	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 3 0
     -+	num_commits: 12
     ++	header: 43475048 1 1 4 1
     ++	num_commits: 2
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'does not write generation data chunk if not present on existing tip' '
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	git merge commits/5 &&
      +	git merge merge/2 &&
      +	git branch merge/3 &&
     -+	git commit-graph write --reachable --split &&
     ++	git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 4 1
     ++	header: 43475048 1 1 4 2
      +	num_commits: 3
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
      +	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git commit-graph write --reachable --split='replace' &&
     ++	git commit-graph write --reachable --split=replace &&
      +	test_path_is_file $graphdir/commit-graph-chain &&
      +	test_line_count = 1 $graphdir/commit-graph-chain &&
      +	verify_chain_files_exist $graphdir &&
     -+	graph_read_expect 15
     ++	graph_read_expect 15 &&
     ++	git commit-graph verify
      +'
      +
       test_done
  9:  58a2d5da01 = 10:  439adc1718 commit-reach: use corrected commit dates in paint_down_to_common()
 10:  4c34294602 = 11:  f6f91af305 doc: add corrected commit date info

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-17 22:30       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                       ` (10 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..ace7400a1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 02/11] revision: parse parent in indegree_walk_step()
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-18 14:18       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                       ` (9 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 3dcf689341..ecf757c327 100644
--- a/revision.c
+++ b/revision.c
@@ -3363,6 +3363,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-19 17:54       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                       ` (8 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_in_graph() guarantees the values of
graph position and generation number, and did not load timestamp.
However, with corrected commit date we will need load the timestamp as
well to populate the generation number.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 29 +++++++++++------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ace7400a1a..af8d9cc45e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (2 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-17 13:22       ` Derrick Stolee
  2020-08-21 11:05       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                       ` (7 subsequent siblings)
  11 siblings, 2 replies; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index af8d9cc45e..fb6e2bf18f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..701e3d41aa 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -146,4 +146,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 4ce8cb38d5..bd6d5e587f 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (3 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-21 13:14       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                       ` (6 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 18 +++++++++---------
 commit-graph.h |  4 ++--
 commit-reach.c | 32 ++++++++++++++++----------------
 commit-reach.h |  2 +-
 commit.h       |  3 ++-
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 7 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb6e2bf18f..7f9f858577 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -115,8 +115,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -159,8 +159,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1338,7 +1338,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
+		if (generation != GENERATION_NUMBER_V1_INFINITY &&
 		    generation != GENERATION_NUMBER_ZERO)
 			continue;
 
@@ -1352,7 +1352,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			for (parent = current->parents; parent; parent = parent->next) {
 				generation = commit_graph_data_at(parent->item)->generation;
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
+				if (generation == GENERATION_NUMBER_V1_INFINITY ||
 				    generation == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
@@ -2355,8 +2355,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
diff --git a/commit-graph.h b/commit-graph.h
index 701e3d41aa..430bc830bb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..470bc80139 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..bc0732a4fe 100644
--- a/commit.h
+++ b/commit.h
@@ -11,7 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
diff --git a/revision.c b/revision.c
index ecf757c327..411852468b 100644
--- a/revision.c
+++ b/revision.c
@@ -3290,7 +3290,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3336,7 +3336,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3379,7 +3379,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3437,7 +3437,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3498,7 +3498,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 80ad9a38d8..bcb8b5dfda 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (4 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-21 18:43       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                       ` (5 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As we are writing topological levels to commit data chunk to ensure
backwards compatibility with "Old" Git and the member `generation` of
struct commit_graph_data will store corrected commit date in a later
commit, let's introduce a commit-slab to store topological levels while
writing commit-graph.

When Git creates a split commit-graph, it takes advantage of the
generation values that have been computed already and present in
existing commit-graph files.

So, let's add a pointer to struct commit_graph to the topological level
commit-slab and populate it with topological levels while writing a
split commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 commit.h       |  1 +
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7f9f858577..a2f15b2825 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -953,6 +958,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_V1_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_V1_INFINITY &&
+		    level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_V1_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_V1_INFINITY ||
+				    level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
 		}
 	}
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (pack_indexes) {
 		ctx->order_by_pack = 1;
 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
diff --git a/commit-graph.h b/commit-graph.h
index 430bc830bb..1152a9642e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -72,6 +72,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
diff --git a/commit.h b/commit.h
index bc0732a4fe..bb846e0025 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,7 @@
 #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
 
 struct commit_list {
 	struct commit *item;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (5 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22  0:05       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                       ` (4 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a2f15b2825..fd69534dd5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (level != GENERATION_NUMBER_V1_INFINITY &&
-		    level != GENERATION_NUMBER_ZERO)
+		    level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO
+		    )
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 
 				if (level == GENERATION_NUMBER_V1_INFINITY ||
-				    level == GENERATION_NUMBER_ZERO) {
+				    level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO
+				    ) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_MAX - 1)
 					max_level = GENERATION_NUMBER_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
@@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		timestamp_t max_generation = 0;
-		timestamp_t generation;
+		timestamp_t max_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_corrected_commit_date)
+				max_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 08/11] commit-graph: implement generation data chunk
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (6 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 13:09       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                       ` (3 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number was to distinguish
between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 48 ++++++++++++++++++++++++---
 commit-graph.h                |  2 ++
 t/README                      |  3 ++
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 +--
 t/t5318-commit-graph.sh       | 27 +++++++--------
 t/t5324-split-commit-graph.sh | 12 +++----
 t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
 8 files changed, 107 insertions(+), 53 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fd69534dd5..b7a72b40db 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -951,7 +963,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct split_commit_graph_opts *split_opts;
@@ -1106,8 +1119,25 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1726,6 +1756,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2130,6 +2169,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
diff --git a/commit-graph.h b/commit-graph.h
index 1152a9642e..f78c892fc0 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -67,6 +68,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/README b/t/README
index 70ec61cf88..6647ef132e 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..55c94e9ebd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 044cf8a3de..b41b2160c6 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -71,7 +71,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -98,14 +98,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -214,7 +214,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -249,7 +249,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -259,7 +259,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -269,7 +269,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -282,7 +282,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -292,7 +292,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -302,7 +302,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -323,7 +323,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -420,8 +420,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index ea28d522b8..531016f405 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -28,9 +28,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 475564bee7..d14b129f06 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (7 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 17:14       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                       ` (2 subsequent siblings)
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 32 +++++++++++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b7a72b40db..c1292f8e08 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	int read_generation_data = 1;
+
+	while (g) {
+		if (!g->chunk_generation_data) {
+			read_generation_data = 0;
+			break;
+		}
+		g = g->base_graph;
+	}
+
+	g = r->objects->commit_graph;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(r);
+
 	return g;
 }
 
@@ -763,7 +786,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
+	if (g->chunk_generation_data && g->read_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
@@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
 	uint32_t pos;
 	if (!prepare_commit_graph(r))
 		return;
+
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
 }
@@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2210,6 +2237,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index f78c892fc0..3cf89d895d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	uint32_t read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 531016f405..ac5e7783fb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -424,4 +424,74 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (8 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 19:09       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

t6024-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base. As this has
already caused problems (as noted in 859fdc0 (commit-graph: define
GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
test script.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6024-recursive-merge.sh |  4 +++-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c1292f8e08..6411068411 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return !!g->chunk_generation_data;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 3cf89d895d..e22ec1e626 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/commit-reach.c b/commit-reach.c
index 470bc80139..3a1b925274 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 332cfc53fd..d3def66e7d 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (9 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 22:20       ` Jakub Narębski
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
  11 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 12 ++---
 Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
 2 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..71c43884ec 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -88,6 +84,12 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * This list can be later modified to store future generation number related
+      data.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 808fa30b99..f27145328c 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates
+2. Topological levels
+
+Define "corrected committer date" of a commit recursively as follows:
+
+  * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
+
+  * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+Define the "topological level" of a commit recursively as follows:
 
  * A commit with no parents (a root commit) has generation number one.
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
 
-Equivalently, the generation number of a commit A is one more than the
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
-violate topological relationships due to clock skew (such as "git log"
-with default order), but is not used when the topological order is
-required (such as merge base calculations, "git log --graph").
-
 In practice, we expect some commits to be created recently and not stored
 in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
+are computed to be at least this value. We limit at this value since it is
+the largest value that can be stored in the commit-graph file using the
+available to generation numbers. This presents another case where a
+commit can have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+We also merge commit-graph chains when we try to write a commit graph with
+two different generation number definitions as they cannot be compared directly.
+We overwrite the existing chain and create a commit-graph with the newer or more
+efficient defintion. For example, overwriting topological levels commit graph
+chain to create a corrected commit dates commit graph chain.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (10 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-17  0:13     ` Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
                         ` (2 more replies)
  11 siblings, 3 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-17  0:13 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

I'm not sure if this level of detail is required in the cover letter for
the series, but generation number v2 is corrected commit date; corrected
commit date offsets is how we store this value in the commit-graph file.

>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.

I would say "committer date heuristics" instead of just "committer
date", to be more exact.

Is this data generated using https://github.com/derrickstolee/gen-test
scripts?

>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.

It would be nice to have reference email (or other place with details)
for "unfortunate problem".

>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2.

This is a bit of simplification, but good enough for a cover letter.

To be more exact, from various candidates the Corrected Commit Date was
chosen.  Then it turned out that old Git crashes on changed commit-graph
format version value, so if the generation number v2 was to replace v1
it needed to be backward-compatibile: hence the idea of Corrected Commit
Date with Monotonically Increasing Offsets.  But with GDAT chunk to
store generation number v2 (and for the time being leaving generation
number v1, i.e. Topological Levels, in CDAT), we are no longer
constrained by the requirement of backward-compatibility to make old Git
work with commit-graph file created by new Git.  So we could go back to
Corrected Commit Date, and as you wrote above the backward-compatibile
variant performs worse.

> The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents.

Actually it needs to be "corrected commit dates of its parents plus 1"
to fulfill the reachability condition for a generation number for a
commit:

      A can reach B   =>  gen(A) < gen(B)

Of course it can be computed in simpler way, because

  max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1


>                                                           Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

All right.

>
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

All right.

However I think the cover letter should also describe what should happen
in a mixed version environment (for example new Git on command line,
copy of old Git used by GUI client), and in particular what should
happen in a mixed-chain case - both for reading and for writing the
commit-graph file.

For *writing*: because old Git would create commit-graph layers without
the GDAT chunk, to simplify the behavior and make easy to reason about
commit-graph data (the situation should be not that common, and
transient -- it should get more rare as the time goes), we want the
following behavior from new Git:

- If top layer contains the GDAT chunk, or we are rewriting commit-graph
  file (--split=replace), or we are merging layers and there are no
  layers without GDAT chunk below set of layers that are merged, then

     write commit-graph file or commit-graph layer with GDAT chunk,

  otherwise

     write commit-graph layer without GDAT chunk.

  This means that there are commit-graph layers without GDAT chunk if
  and only if the top layer is also without GDAT chunk.


For *reading* we want to use generation number v2 (corrected commit
date) if possible, and fall back to generation number v1 (topological
levels).

- If the top layer contains the GDAT chunk (or maybe even if the topmost
  layer that involves all commits in question, not necessarily the top
  layer in the full commit-graph chain), then use generation number v2

  - commit_graph_data->generation stores corrected commit date,
    computed as sum of committer date (from CDAT) and offset (from GDAT)

  - A can reach B   =>  gen(A) < gen(B)

  - there is no need for committer date heuristics, and no need for
    limiting use of generation number to where there is a cutoff (to not
    hamper performance).

- If there are layers without GDAT chunks, which thanks to the write
  behavior means simply top layer without GDAT chunk, we need to turn
  off use of generation numbers or fall back to using topological levels

  - commit_graph_data->generation stores topological levels,
    taken from CDAT chunk (30-bits)

  - A can reach B   =>  gen(A) < gen(B)

  - we probably want to keep tie-breaking of sorting by generation
    number via committer date, and limit use of generation number as
    opposed to using committer date heuristics (with slop) to not make
    performance worse.

>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 3:
>
>  * Reordered patches as discussed in 1
>    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]

If I remember it correctly this was done to always store in GDAT chunk
corrected commit date offsets, isn't it?

>  * Split "implement corrected commit date" into two patches - one
>    introducing the topo level slab and other implementing corrected commit
>    dates.

All right.

I think it might be good idea to split off the change to tar file tests
(as a preparatory patch), to make reviews and bisecting easier.

>  * Extended split-commit-graph tests to verify at the end of test.

Do we also test for proper merging of split commit-graph layers, not
only adding a new layer and a full rewrite (--split=replace)?

>  * Use topological levels as generation number if any of split commit-graph
>    files do not have generation data chunk.

That is good for performance.

>
> Changes in version 2:
>
>  * Add tests for generation data chunk.

Good.

>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.

Good, that is needed for testing mixed-version behavior.

>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().

All right, but see the caveat.

>  * Update technical documentation.

Always a good thing.

>  * Handle mixed graph version commit chains.

Where by "version" you mean generation number version - the commit-graph
version number unfortunately needs to stay the same...

>  * Improve commit messages for
                                ^^^^^^
Something missing in this point, the sentence ends abruptly.

>  * Revert unnecessary whitespace changes.

Thanks.

>  * Split uint_32 -> timestamp_t change into a new commit.

It is usually better to keep the commits small.  Good.


Good work!

>
> Abhishek Kumar (11):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: return 64-bit generation number
>   commit-graph: add a slab to store topological levels
>   commit-graph: implement corrected commit date
>   commit-graph: implement generation data chunk
>   commit-graph: use generation v2 only if entire chain does
>   commit-reach: use corrected commit dates in paint_down_to_common()
>   doc: add corrected commit date info
>
>  .../technical/commit-graph-format.txt         |  12 +-
>  Documentation/technical/commit-graph.txt      |  45 ++--
>  commit-graph.c                                | 241 +++++++++++++-----
>  commit-graph.h                                |  16 +-
>  commit-reach.c                                |  49 ++--
>  commit-reach.h                                |   2 +-
>  commit.c                                      |   9 +-
>  commit.h                                      |   4 +-
>  revision.c                                    |  13 +-
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   2 +
>  t/t4216-log-bloom.sh                          |   4 +-
>  t/t5000-tar-tree.sh                           |   4 +-
>  t/t5318-commit-graph.sh                       |  27 +-
>  t/t5324-split-commit-graph.sh                 |  82 +++++-
>  t/t6024-recursive-merge.sh                    |   4 +-
>  t/t6600-test-reach.sh                         |  62 +++--
>  upload-pack.c                                 |   2 +-
>  18 files changed, 396 insertions(+), 185 deletions(-)
>
>
> base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
@ 2020-08-17  1:32         ` Taylor Blau
  2020-08-17  7:56           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Taylor Blau @ 2020-08-17  1:32 UTC (permalink / raw)
  To: git; +Cc: Jakub Narębski, Derrick Stolee, Abhishek Kumar

On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
>
> All right.
>
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,
>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This seems very sane to me, and I'd be glad to see it spelled out in
more specific detail. I was wondering this myself, and had to double
check with Stolee off-list that my interpretation of Abhishek's code was
correct.

But yes, only writing GDAT chunks when all layers in the chain have GDAT
chunks makes sense, since we can't interoperate between corrected dates
and topological levels. Since we can't fill in the GDAT data of layers
generated in pre-GDAT versions of Git without invalidating the GDAT
layers on-disk, there's no point to speculatively computing both chunks.

Merging rules are obviously correct, which is good. For what it's worth,
the '--split=replace' case is what we'll really care about at GitHub,
since it's unlikely we'd drop all existing commit-graph chains and
rebuild them from scratch. More likely is that we'll let the new GDAT
chunks trickle in over time when we run 'git commit-graph write' with
'--split=replace', which happens "every so often".

> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2

I don't follow this. If we have a multi-layer chain, either all or none
of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
chunk" makes sense, since it implies that all layers have the GDAT
chunk. I don't see how "even if the topmost layer that involves all
commits in question" would be possible, since (if I'm understanding your
description correctly), we can't have *some* of the layers having a GDAT
chunk with others only having a CDAT chunk.

I'm a little confused here.

>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels

Good, I'm glad that this can be a quick check (that we can cache for
future reads, but I'm not even sure the caching would be necessary
without measuring).
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

All makes very good sense, except for the one point I raised above.

> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.

Thanks, Abhishek for your great work on this. I was feeling bad that I
wasn't more involved in the early discussions about the transition plan,
but what you, Stolee, and Jakub came up with all seems like what I would
have suggested, anyway ;-).

> Jakub Narebski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  1:32         ` Fwd: " Taylor Blau
@ 2020-08-17  7:56           ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-17  7:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Derrick Stolee, Abhishek Kumar

On Mon, 17 Aug 2020 at 03:32, Taylor Blau <me@ttaylorr.com> wrote:
> On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >
> > > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > > and store corrected commit date offsets in GDAT chunk while storing
> > > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > > of Git would use corrected commit dates, falling back to topological level
> > > if the generation data chunk is absent in the commit-graph file.
> >
> > All right.
> >
> > However I think the cover letter should also describe what should happen
> > in a mixed version environment (for example new Git on command line,
> > copy of old Git used by GUI client), and in particular what should
> > happen in a mixed-chain case - both for reading and for writing the
> > commit-graph file.
> >
> > For *writing*: because old Git would create commit-graph layers without
> > the GDAT chunk, to simplify the behavior and make easy to reason about
> > commit-graph data (the situation should be not that common, and
> > transient -- it should get more rare as the time goes), we want the
> > following behavior from new Git:
> >
> > - If top layer contains the GDAT chunk, or we are rewriting commit-graph
> >   file (--split=replace), or we are merging layers and there are no
> >   layers without GDAT chunk below set of layers that are merged, then
> >
> >      write commit-graph file or commit-graph layer with GDAT chunk,
> >
> >   otherwise
> >
> >      write commit-graph layer without GDAT chunk.
> >
> >   This means that there are commit-graph layers without GDAT chunk if
> >   and only if the top layer is also without GDAT chunk.
>
> This seems very sane to me, and I'd be glad to see it spelled out in
> more specific detail. I was wondering this myself, and had to double
> check with Stolee off-list that my interpretation of Abhishek's code was
> correct.
>
> But yes, only writing GDAT chunks when all layers in the chain have GDAT
> chunks makes sense, since we can't interoperate between corrected dates
> and topological levels. Since we can't fill in the GDAT data of layers
> generated in pre-GDAT versions of Git without invalidating the GDAT
> layers on-disk, there's no point to speculatively computing both chunks.
>
> Merging rules are obviously correct, which is good. For what it's worth,
> the '--split=replace' case is what we'll really care about at GitHub,
> since it's unlikely we'd drop all existing commit-graph chains and
> rebuild them from scratch. More likely is that we'll let the new GDAT
> chunks trickle in over time when we run 'git commit-graph write' with
> '--split=replace', which happens "every so often".

To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]

In the split commit-graph chain above, merging two topmost layers
should create a layer without GDAT; merging three topmost layers
(and any other layers, e.g. two middle ones) should create a layer
 with GDAT.

> > For *reading* we want to use generation number v2 (corrected commit
> > date) if possible, and fall back to generation number v1 (topological
> > levels).
> >
> > - If the top layer contains the GDAT chunk (or maybe even if the topmost
> >   layer that involves all commits in question, not necessarily the top
> >   layer in the full commit-graph chain), then use generation number v2
>
> I don't follow this. If we have a multi-layer chain, either all or none
> of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
> chunk" makes sense, since it implies that all layers have the GDAT
> chunk. I don't see how "even if the topmost layer that involves all
> commits in question" would be possible, since (if I'm understanding your
> description correctly), we can't have *some* of the layers having a GDAT
> chunk with others only having a CDAT chunk.
>
> I'm a little confused here.

This is only speculative, and most probably totally unnecessary
complication (either that, or something that we would get for free).
Assume that the command in question operates only on
historical data; for example `git log --topo-order HEAD~1000`.
If all commits (or, what's equivalent, most recent commits
i.e. HEAD~1000) have their data in split commit-graph layers
with GDAT, we can theoretically use generation number v2,
even if there are some newer commits that have their data
in layers without GDAT (and some even newer ones outside
commit-graph files).

I hope that this explains my (possibly harebrained) idea.

> >   - commit_graph_data->generation stores corrected commit date,
> >     computed as sum of committer date (from CDAT) and offset (from GDAT)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - there is no need for committer date heuristics, and no need for
> >     limiting use of generation number to where there is a cutoff (to not
> >     hamper performance).
> >
> > - If there are layers without GDAT chunks, which thanks to the write
> >   behavior means simply top layer without GDAT chunk, we need to turn
> >   off use of generation numbers or fall back to using topological levels
>
> Good, I'm glad that this can be a quick check (that we can cache for
> future reads, but I'm not even sure the caching would be necessary
> without measuring).

There is a question where to store the information that we cannot
use generation number v2 (that 'generation' contains topological
levels and not corrected commit date):
- create new global variable
- store it in `struct split_commit_graph_opts`
- set `chunk_generation_data` to NULL for all graphs
  in chain (it is in `struct commit_graph`)?

> >
> >   - commit_graph_data->generation stores topological levels,
> >     taken from CDAT chunk (30-bits)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - we probably want to keep tie-breaking of sorting by generation
> >     number via committer date, and limit use of generation number as
> >     opposed to using committer date heuristics (with slop) to not make
> >     performance worse.
>
> All makes very good sense, except for the one point I raised above.
>
> > >
> > > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > > first version.
>
> Thanks, Abhishek for your great work on this. I was feeling bad that I
> wasn't more involved in the early discussions about the transition plan,
> but what you, Stolee, and Jakub came up with all seems like what I would
> have suggested, anyway ;-).

Thank you for your work on improving this feature.

Best,
-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-17 13:22       ` Derrick Stolee
  2020-08-21 11:05       ` Jakub Narębski
  1 sibling, 0 replies; 107+ messages in thread
From: Derrick Stolee @ 2020-08-17 13:22 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/15/2020 12:39 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Whoops! Be sure to add the "Reviewed-by: above the "Signed-off-by" line(s)
so re-signing off doesn't add a duplicate like this.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-17 22:30       ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-17 22:30 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Subject: [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter

s/bloom filter/Bloom filters/

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> commit_gen_cmp is used when writing a commit-graph to sort commits in
> generation order before computing Bloom filters. Since c49c82aa (commit:
> move members graph_pos, generation to a slab, 2020-06-17) made it so
> that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
> during writing, we cannot call it within this function. Instead, access
> the generation number directly through the slab (i.e., by calling
> 'commit_graph_data_at(c)->generation') in order to access it while
> writing.

Two things that might not be obvious from the commit message:

- Is commit_gen_cmp in commit-graph.c used by anything but writing
  Bloom filters for changed paths?

- That the generation number is computed during `commit-graph write`
  before computing Bloom filters.

Also, after this series 'generation' would be generation number v2, that
is corrected commit date, and not v1, that is topological levels.  We
should check, just in case, that it does not lead to significant
performance regression for `git commit-graph write --reachable <...>`
case (the one that uses commit_gen_cmp sort).

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e51c91dd5b..ace7400a1a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;

Nice and easy.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
@ 2020-08-18  6:12       ` Abhishek Kumar
  2020-08-23 15:27       ` Jakub Narębski
  2 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Mon, Aug 17, 2020 at 02:13:18AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> I'm not sure if this level of detail is required in the cover letter for
> the series, but generation number v2 is corrected commit date; corrected
> commit date offsets is how we store this value in the commit-graph file.
> 
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
> 
> I would say "committer date heuristics" instead of just "committer
> date", to be more exact.
> 
> Is this data generated using https://github.com/derrickstolee/gen-test
> scripts?
> 

Yes, it is.

> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem, we also needed a way to
> > distinguish between the old and new generation number without incrementing
> > graph version.
> 
> It would be nice to have reference email (or other place with details)
> for "unfortunate problem".
> 

Will add.

> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test,
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Mononotically Increasing Offsets
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> > number v2.
> 
> This is a bit of simplification, but good enough for a cover letter.
> 
> To be more exact, from various candidates the Corrected Commit Date was
> chosen.  Then it turned out that old Git crashes on changed commit-graph
> format version value, so if the generation number v2 was to replace v1
> it needed to be backward-compatibile: hence the idea of Corrected Commit
> Date with Monotonically Increasing Offsets.  But with GDAT chunk to
> store generation number v2 (and for the time being leaving generation
> number v1, i.e. Topological Levels, in CDAT), we are no longer
> constrained by the requirement of backward-compatibility to make old Git
> work with commit-graph file created by new Git.  So we could go back to
> Corrected Commit Date, and as you wrote above the backward-compatibile
> variant performs worse.
> 
> > The Corrected Commit Date is defined as:
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents.
> 
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
> 
>       A can reach B   =>  gen(A) < gen(B)
> 
> Of course it can be computed in simpler way, because
> 
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1
> 
> 
> >                                                           Then corrected
> > commit date offset is the difference between corrected commit date of C and
> > commit date of C.
> 
> All right.
> 
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
> 
> All right.
> 
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
> 

Yes, definitely. Will add

> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
> 
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
> 
>      write commit-graph file or commit-graph layer with GDAT chunk,
> 
>   otherwise
> 
>      write commit-graph layer without GDAT chunk.
> 
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.
> 
> 
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
> 
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
> 

The current implementation checks the entire chain for GDAT, rather than
just the topmost layer as we cannot assert that `g` would be the topmost
layer of the chain.

See the discussion here: https://lore.kernel.org/git/20200814045957.GA1380@Abhishek-Arch/

It's one of drawbacks of having a single member 64-bit `generation`
instead of two 32-bit members `level` and `odate`.

>
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
> 
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
> 
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.
> 
> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.
> >
> > I look forward to everyone's reviews!
> >
> > Thanks
> >
> >  * Abhishek
> >
> >
> > ----------------------------------------------------------------------------
> >
> > Changes in version 3:
> >
> >  * Reordered patches as discussed in 1
> >    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
> 
> If I remember it correctly this was done to always store in GDAT chunk
> corrected commit date offsets, isn't it?
> 

Yes.

> >  * Split "implement corrected commit date" into two patches - one
> >    introducing the topo level slab and other implementing corrected commit
> >    dates.
> 
> All right.
> 
> I think it might be good idea to split off the change to tar file tests
> (as a preparatory patch), to make reviews and bisecting easier.
> 
> >  * Extended split-commit-graph tests to verify at the end of test.
> 
> Do we also test for proper merging of split commit-graph layers, not
> only adding a new layer and a full rewrite (--split=replace)?
> 

We do not, will add a test at end. Thanks for pointing this out.

> >  * Use topological levels as generation number if any of split commit-graph
> >    files do not have generation data chunk.
> 
> That is good for performance.
> 
> >
> > Changes in version 2:
> >
> >  * Add tests for generation data chunk.
> 
> Good.
> 
> >  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
> >    generation data chunk.
> 
> Good, that is needed for testing mixed-version behavior.
> 
> >  * Compare commits with corrected commit dates if present in
> >    paint_down_to_common().
> 
> All right, but see the caveat.
> 
> >  * Update technical documentation.
> 
> Always a good thing.
> 
> >  * Handle mixed graph version commit chains.
> 
> Where by "version" you mean generation number version - the commit-graph
> version number unfortunately needs to stay the same...
> 

Yes, clarified.

> >  * Improve commit messages for
>                                 ^^^^^^
> Something missing in this point, the sentence ends abruptly.

I didn't finish the sentence. Meant to say:

- Improve commit messages for "commit-graph: fix regression when computing bloom filter", "commit-graph: consolidate fill_commit_graph_info",
> 
> >  * Revert unnecessary whitespace changes.
> 
> Thanks.
> 
> >  * Split uint_32 -> timestamp_t change into a new commit.
> 
> It is usually better to keep the commits small.  Good.
> 
> 
> Good work!
> 
> ...
> --
> Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 02/11] revision: parse parent in indegree_walk_step()
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-18 14:18       ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-18 14:18 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

All right, we need to have commit parsed to have correct value for its
generation number.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 3dcf689341..ecf757c327 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3363,6 +3363,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>
> +		if (parse_commit_gently(parent, 1) < 0)
> +			return;
> +

All right, this is exactly what is done in this function for commit 'c'
taken from indegree_queue, whose parents we process here:

	if (parse_commit_gently(c, 1) < 0)
		return;

>  		if (*pi)
>  			(*pi)++;
>  		else

Looks good to me.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-19 17:54       ` Jakub Narębski
  2020-08-21  4:11         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-19 17:54 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
>
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> generation number (within CDAT chunk) and has undefined behavior.
>
> The test used to pass

Could you please tell us how does this test starts to fail without the
change to the test described there?  What is the error message, etc.?

>                       as fill_commit_in_graph() guarantees the values of
                           ^^^^^^^^^^^^^^^^^^^^^^

s/fill_commit_in_graph()/fill_commit_graph_info()/

It is fill_commit_graph_info() that changes its behavior in this patch.

> graph position and generation number, and did not load timestamp.
> However, with corrected commit date we will need load the timestamp as
> well to populate the generation number.
>
> Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

I think this commit should be split into two commits:
- fix to the 'generate tar with future mtime' test
- simplify implementation of fill_commit_in_graph()

The test 'generate tar with future mtime' in t/t5000-tar-tree.sh creates
a commit with commit time of (2 ^ 36 - 1) seconds since EPOCH
(68719476737). However, the commit-graph file format version 1 provides
only 34-bits for storing committer date (32 + 2 bits), not 64-bits.
Therefore maximum far in the future commit time can only be at most
(2 ^ 34 - 1) seconds since EPOCH, as Stolee said in commet for v1
of this series.

This "limitation" is not a problem in practice, because the maximum
timestamp allowed takes place in the year 2514. I hope at that time
there would be no Git version in use that still crashes on changing the
version field in the commit-graph format -- then we can simply get rid
of storing topological levels (generation number v1) in those 30 bits of
CDAT chunk and use full 64 bits for committer date.

Git does not perform any bounds checking for committer date value in
write_graph_chunk_data():

	uint32_t packedDate[2];

	/* ... */

	if (sizeof((*list)->date) > 4)
		packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
	else
		packedDate[0] = 0;

	packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);

	packedDate[1] = htonl((*list)->date);
	hashwrite(f, packedDate, 8);

This means that the date is trimmed to 34 bits on save discarding most
significant bits, assuming that unsigned overflow simply discards most
significant bits truncating the signed (?) value.

In this case running the test with GIT_TEST_COMMIT_GRAPH=1 would lead to
errors, as the committer date read from the commit graph would be
incorrect, and therefore generation number v2 would also be incorrect.


I don't quite understand however how second part of this patch in its
current iteration, namely simplifing the implementation of
fill_commit_in_graph() makes this bug / error shows...

Do I understand it correctly that before this change the committer date
would always be parsed out of the commit object, instead of reading it
from the commit-graph file?  However the only user of static
fill_commit_in_graph() is the parse_commit_in_graph(), which in turn is
used by parse_commit_gently(); but fill_commit_in_graph() read commit
date from commit-graph before this change... color me confused.

Ah, after the change fill_commit_graph_info() changes its behavior, not
fill_commit_in_graph() as said in the commit message. Before this commit
it used to only load graph position and generation number, and did not
load the timestamp. The function fill_commit_graph_info() is used in
turn by public-facing load_commit_graph_info():

  /*
   * It is possible that we loaded commit contents from the commit buffer,
   * but we also want to ensure the commit-graph content is correctly
   * checked and filled. Fill the graph_pos and generation members of
   * the given commit.
   */
  void load_commit_graph_info(struct repository *r, struct commit *item);

This function is used in turn by get_bloom_filter(), contains_tag_algo()
and parse_commit_buffer(), change in any of which behavior can lead to
failing 'generate tar with future mtime' test.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 29 +++++++++++------------------
>  t/t5000-tar-tree.sh |  4 ++--
>  2 files changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index ace7400a1a..af8d9cc45e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +

All right, if we want to use fill_commit_graph_info() function to load
the graph data (graph position and generation number) in the
fill_commit_in_graph() we need to perform this check.

I'd think that this check should be here from the beginning, just in
case.


Sidenote: I wonder if it would be good idea to print more information in
the above error message, for example:

	die(_("invalid commit position %ld. commit-graph '%s' is likely corrupt"),
        pos, g->filename);

But this is unrelated thing, tangential to this change, and it might not
add anything useful.

>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +

I think this change, moving loading of commit date from commit-graph out
of fill_commit_in_graph() and into fill_commit_graph_info(), is in my
opinion a bit inadequatly described in the commit message. As I
understand it this change prepares fill_commit_graph_info() for
generation number v2, that is Corrected Commit Date, where loading
commit date from CDAT together with loading offset from GDAT would be
necessary to correctly set the 'generation' field of 'struct
commit_graph_data' (on the commit_graph_data_slab).

I'm not sure if it would be worth it splitting this refactoring change
(Move Statements into Function) into a separate patch -- it would split
this commit into three, changing 11 part series into 13 part series.


Note that we might want to update the description of
load_commit_graph_info() in commit-graph.h to include that it
incidentally loads commit date from the commit-graph.  Butthis might be
not worth it -- it is a side effect, not the major goal of this
function.

>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>
> @@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>
> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));
> +	fill_commit_graph_info(item, g, pos);

All right, the check got moved into fill_commit_graph_info().

>
> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;

All right, 'graph_pos' field in the graph data (on commit slab) got
filled by just called load_commit_graph_info().

>  	lex_index = pos - g->num_commits_in_base;
> -
> -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;

All right, unrelated cleanup in the neighbourhood.

>
>  	item->object.parsed = 1;
>
>  	set_commit_tree(item, NULL);
>
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -

All right, this code got moved down the call chain into just called
load_commit_graph_info().

> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -


All right, 'generation' field in the graph data (on commit slab) got
filled by load_commit_graph_info() called at the beginning of the
function.



>  	pptr = &item->parents;
>
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 37655a237c..1986354fc3 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
>  		git commit -m "tempori parendum"
>  '
>
> @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
>  '
>
>  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> -	echo 4147 >expect &&
> +	echo 2514 >expect &&
>  	tar_info future.tar | cut -d" " -f2 >actual &&
>  	test_cmp expect actual
>  '

Looks good to me.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-19 17:54       ` Jakub Narębski
@ 2020-08-21  4:11         ` Abhishek Kumar
  2020-08-25 11:11           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-21  4:11 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number (within CDAT chunk) and has undefined behavior.
> >
> > The test used to pass
> 
> Could you please tell us how does this test starts to fail without the
> change to the test described there?  What is the error message, etc.?
> 

Here's what I revised the commit message to:

commit-graph: consolidate fill_commit_graph_info

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
34-bits for storing commiter date, thus committer time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and loads committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits) to populate the generation
number. Thus, fill_commit_graph_info() sets date and generates tar file
with the truncated mtime and the test fails.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
will not be truncated.

> >                       as fill_commit_in_graph() guarantees the values of
>                            ^^^^^^^^^^^^^^^^^^^^^^
> 
> s/fill_commit_in_graph()/fill_commit_graph_info()/
> 
> ...
> 
> Ah, after the change fill_commit_graph_info() changes its behavior, not
> fill_commit_in_graph() as said in the commit message. Before this commit
> it used to only load graph position and generation number, and did not
> load the timestamp. The function fill_commit_graph_info() is used in
> turn by public-facing load_commit_graph_info():
> 

That's exactly it. I should have elaborated better in the commit
message. Thanks for the through investigation.

>   /*
>    * It is possible that we loaded commit contents from the commit buffer,
>    * but we also want to ensure the commit-graph content is correctly
>    * checked and filled. Fill the graph_pos and generation members of
>    * the given commit.
>    */
>   void load_commit_graph_info(struct repository *r, struct commit *item);
> 
> ...
> 
> Looks good to me.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
  2020-08-17 13:22       ` Derrick Stolee
@ 2020-08-21 11:05       ` Jakub Narębski
  1 sibling, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-21 11:05 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.

All right, seems reasonable.

Though it might be not obvious that the second repetition of code
comparing commits by generation is part of commit.c's
compare_commits_by_gen_then_commit_date().

Is't it micro-pessimization though, or can the compiler inline function
across different files?  On the other hand it reduces code duplication...

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 15 +++++++++++++++
>  commit-graph.h |  2 ++
>  commit-reach.c | 15 ---------------
>  commit.c       |  9 +++------
>  4 files changed, 20 insertions(+), 21 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index af8d9cc45e..fb6e2bf18f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
>  	return data->generation;
>  }
>
> +int compare_commits_by_gen(const void *_a, const void *_b)
> +{
> +	const struct commit *a = _a, *b = _b;
> +	const uint32_t generation_a = commit_graph_generation(a);
> +	const uint32_t generation_b = commit_graph_generation(b);

All right, this function used protected access to generation number of a
commit, that is it correctly handles the case where commit '_a' and/or
'_b' are new enough to be not present in the commit graph.

That is why we cannot simply use commit_gen_cmp(), that is the function
used for sorting during `git commit-graph write --reachable --changed-paths`,
because after 1st patch it access the slab directly.

> +
> +	/* older commits first */

Nice!  Thanks for adding this comment.

Though it might be good idea to add this comment also to the header
file, commit-graph.h, because the fact that compare_commits_by_gen()
and compare_commits_by_gen_then_commit_date() sort in different
order is not something that we can see from their names.  Well,
they have slightly different sigatures...

> +	if (generation_a < generation_b)
> +		return -1;
> +	else if (generation_a > generation_b)
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
>  {
>  	unsigned int i, nth_slab;
> diff --git a/commit-graph.h b/commit-graph.h
> index 09a97030dc..701e3d41aa 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -146,4 +146,6 @@ struct commit_graph_data {
>   */
>  uint32_t commit_graph_generation(const struct commit *);
>  uint32_t commit_graph_position(const struct commit *);
> +
> +int compare_commits_by_gen(const void *_a, const void *_b);

All right.

>  #endif
> diff --git a/commit-reach.c b/commit-reach.c
> index efd5925cbb..c83cc291e7 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
>  	return repo_is_descendant_of(the_repository, commit, list);
>  }
>
> -static int compare_commits_by_gen(const void *_a, const void *_b)
> -{
> -	const struct commit *a = *(const struct commit * const *)_a;
> -	const struct commit *b = *(const struct commit * const *)_b;
> -
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> -
> -	if (generation_a < generation_b)
> -		return -1;
> -	if (generation_a > generation_b)
> -		return 1;
> -	return 0;
> -}

All right, commit-reach.c includes commit-graph.h, so now it simply uses
compare_commits_by_gen() that was copied to commit-graph.c.

> -
>  int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
> diff --git a/commit.c b/commit.c
> index 4ce8cb38d5..bd6d5e587f 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
>  int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> -	const uint32_t generation_a = commit_graph_generation(a),
> -		       generation_b = commit_graph_generation(b);
> +	int ret_val = compare_commits_by_gen(a_, b_);
>
>  	/* newer commits first */

Maybe this comment should be put in the header file, near this functionn
declaration?

> -	if (generation_a < generation_b)
> -		return 1;
> -	else if (generation_a > generation_b)
> -		return -1;
> +	if (ret_val)
> +		return -ret_val;

All right, this handles reversed sorting order of compare_commits_by_gen().

>
>  	/* use date as a heuristic when generations are equal */
>  	if (a->date < b->date)

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-21 13:14       ` Jakub Narębski
  2020-08-25  5:04         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-21 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a preparatory step, let's return timestamp_t values from
> commit_graph_generation(), use timestamp_t for local variables

All right, this is all good.

> and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

This needs more detailed examination.  There are two similar constants,
GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
used for newest commits outside the commit-graph, while the latter is
maximum number that commits in the commit-graph can have (because of the
storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
to be larger than GENERATION_NUMBER_MAX, and it is (and was).

The GENERATION_NUMBER_INFINITY is because of the above requirement
traditionally taken as maximum value that can be represented in the data
type used to store commit's generation number _in memory_, but it can be
less.  For timestamp_t the maximum value that can be represented
is (2 ^ 63 - 1).

All right then.

>

The commit message says nothing about the new symbolic constant
GENERATION_NUMBER_V1_INFINITY, though.

I'm not sure it is even needed (see comments below).

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 18 +++++++++---------
>  commit-graph.h |  4 ++--
>  commit-reach.c | 32 ++++++++++++++++----------------
>  commit-reach.h |  2 +-
>  commit.h       |  3 ++-
>  revision.c     | 10 +++++-----
>  upload-pack.c  |  2 +-
>  7 files changed, 36 insertions(+), 35 deletions(-)

I hope that changing the type returned by commit_graph_generation() and
stored in 'generation' field of `struct commit_graph_data` would mean
that the compiler or at least the linter would catch all the places that
need updating the type.

Just in case, I have performed a simple code search and it agrees with
the above list (one search result missing, in commit.c, was handled by
previous patch).

>
> diff --git a/commit-graph.c b/commit-graph.c
> index fb6e2bf18f..7f9f858577 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
>  	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
>  }
>
> -uint32_t commit_graph_generation(const struct commit *c)
> +timestamp_t commit_graph_generation(const struct commit *c)
>  {
>  	struct commit_graph_data *data =
>  		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
> @@ -115,8 +115,8 @@ uint32_t commit_graph_generation(const struct commit *c)
>  int compare_commits_by_gen(const void *_a, const void *_b)
>  {
>  	const struct commit *a = _a, *b = _b;
> -	const uint32_t generation_a = commit_graph_generation(a);
> -	const uint32_t generation_b = commit_graph_generation(b);
> +	const timestamp_t generation_a = commit_graph_generation(a);
> +	const timestamp_t generation_b = commit_graph_generation(b);
>
>  	/* older commits first */
>  	if (generation_a < generation_b)
> @@ -159,8 +159,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_data_at(a)->generation;
> -	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
> +	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

All right.

> @@ -1338,7 +1338,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;

Shouldn't this be

-  		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+  		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;


>
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> +		if (generation != GENERATION_NUMBER_V1_INFINITY &&

Then there would be no need for this change, isn't it?

>  		    generation != GENERATION_NUMBER_ZERO)
>  			continue;
>
> @@ -1352,7 +1352,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				generation = commit_graph_data_at(parent->item)->generation;
>
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> +				if (generation == GENERATION_NUMBER_V1_INFINITY ||

And this one either.

>  				    generation == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
> @@ -2355,8 +2355,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		uint32_t max_generation = 0;
> -		uint32_t generation;
> +		timestamp_t max_generation = 0;
> +		timestamp_t generation;

All right.

>
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> diff --git a/commit-graph.h b/commit-graph.h
> index 701e3d41aa..430bc830bb 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
>
>  struct commit_graph_data {
>  	uint32_t graph_pos;
> -	uint32_t generation;
> +	timestamp_t generation;
>  };

All right; this is the main part of this change.

>
>  /*
>   * Commits should be parsed before accessing generation, graph positions.
>   */
> -uint32_t commit_graph_generation(const struct commit *);
> +timestamp_t commit_graph_generation(const struct commit *);
>  uint32_t commit_graph_position(const struct commit *);

As is this one.

>
>  int compare_commits_by_gen(const void *_a, const void *_b);
> diff --git a/commit-reach.c b/commit-reach.c
> index c83cc291e7..470bc80139 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  static struct commit_list *paint_down_to_common(struct repository *r,
>  						struct commit *one, int n,
>  						struct commit **twos,
> -						int min_generation)
> +						timestamp_t min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> -	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
> +	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>
>  	if (!min_generation)
>  		queue.compare = compare_commits_by_commit_date;

All right.

> @@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  		struct commit *commit = prio_queue_get(&queue);
>  		struct commit_list *parents;
>  		int flags;
> -		uint32_t generation = commit_graph_generation(commit);
> +		timestamp_t generation = commit_graph_generation(commit);
>
>  		if (min_generation && generation > last_gen)
> -			BUG("bad generation skip %8x > %8x at %s",
> +			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
>  			    generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));
>  		last_gen = generation;

All right.

> @@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  		repo_parse_commit(r, array[i]);
>  	for (i = 0; i < cnt; i++) {
>  		struct commit_list *common;
> -		uint32_t min_generation = commit_graph_generation(array[i]);
> +		timestamp_t min_generation = commit_graph_generation(array[i]);
>
>  		if (redundant[i])
>  			continue;
>  		for (j = filled = 0; j < cnt; j++) {
> -			uint32_t curr_generation;
> +			timestamp_t curr_generation;
>  			if (i == j || redundant[j])
>  				continue;
>  			filled_index[filled] = j;

All right.

> @@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> -	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
>
>  	if (repo_parse_commit(r, commit))
>  		return ret;

All right,

> @@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
>  					  struct contains_cache *cache,
> -					  uint32_t cutoff)
> +					  timestamp_t cutoff)

All right.

(Sidenote: this one I have missed in my simple search.)

>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>
> @@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
>  	enum contains_result result;
> -	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
>  	const struct commit_list *p;
>
>  	for (p = want; p; p = p->next) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = p->item;
>  		load_commit_graph_info(the_repository, c);
>  		generation = commit_graph_generation(c);

All right.

> @@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation)
> +				 timestamp_t min_generation)
>  {
>  	struct commit **list = NULL;
>  	int i;
> @@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
>  	struct commit_list *from_iter = from, *to_iter = to;
>  	int result;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>
>  	while (from_iter) {
>  		add_object_array(&from_iter->item->object, NULL, &from_objs);
>
>  		if (!parse_commit(from_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (from_iter->item->date < min_commit_date)
>  				min_commit_date = from_iter->item->date;
>
> @@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>
>  	while (to_iter) {
>  		if (!parse_commit(to_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (to_iter->item->date < min_commit_date)
>  				min_commit_date = to_iter->item->date;
>

All right.

> @@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>  	struct commit_list *found_commits = NULL;
>  	struct commit **to_last = to + nr_to;
>  	struct commit **from_last = from + nr_from;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>  	int num_to_find = 0;
>
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>
>  	for (item = to; item < to_last; item++) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = *item;
>
>  		parse_commit(c);

All right.

> diff --git a/commit-reach.h b/commit-reach.h
> index b49ad71a31..148b56fea5 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation);
> +				 timestamp_t min_generation);
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  		       int commit_date_cutoff);
>

All right.

> diff --git a/commit.h b/commit.h
> index e901538909..bc0732a4fe 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -11,7 +11,8 @@
>  #include "commit-slab.h"
>
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> -#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> +#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
> +#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
>  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
>  #define GENERATION_NUMBER_ZERO 0
>

Why do we even need GENERATION_NUMBER_V1_INFINITY?  It is about marking
out-of-graph commits, and it is about in-memory storage.

We would need separate GENERATION_NUMBER_V1_MAX and GENERATION_NUMBER_V2_MAX
because of different _on-disk_ storage, or in other words file format
limitations.  But that is for the future commit.

> diff --git a/revision.c b/revision.c
> index ecf757c327..411852468b 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3290,7 +3290,7 @@ define_commit_slab(indegree_slab, int);
>  define_commit_slab(author_date_slab, timestamp_t);
>
>  struct topo_walk_info {
> -	uint32_t min_generation;
> +	timestamp_t min_generation;
>  	struct prio_queue explore_queue;
>  	struct prio_queue indegree_queue;
>  	struct prio_queue topo_queue;

All right.

> @@ -3336,7 +3336,7 @@ static void explore_walk_step(struct rev_info *revs)
>  }
>
>  static void explore_to_depth(struct rev_info *revs,
> -			     uint32_t gen_cutoff)
> +			     timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;

All right.

> @@ -3379,7 +3379,7 @@ static void indegree_walk_step(struct rev_info *revs)
>  }
>
>  static void compute_indegrees_to_depth(struct rev_info *revs,
> -				       uint32_t gen_cutoff)
> +				       timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;
> @@ -3437,7 +3437,7 @@ static void init_topo_walk(struct rev_info *revs)
>  	info->min_generation = GENERATION_NUMBER_INFINITY;
>  	for (list = revs->commits; list; list = list->next) {
>  		struct commit *c = list->item;
> -		uint32_t generation;
> +		timestamp_t generation;
>
>  		if (parse_commit_gently(c, 1))
>  			continue;

All right.

> @@ -3498,7 +3498,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  	for (p = commit->parents; p; p = p->next) {
>  		struct commit *parent = p->item;
>  		int *pi;
> -		uint32_t generation;
> +		timestamp_t generation;
>
>  		if (parent->object.flags & UNINTERESTING)
>  			continue;

All right.

> diff --git a/upload-pack.c b/upload-pack.c
> index 80ad9a38d8..bcb8b5dfda 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
>
>  static int ok_to_give_up(struct upload_pack_data *data)
>  {
> -	uint32_t min_generation = GENERATION_NUMBER_ZERO;
> +	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
>
>  	if (!data->have_obj.nr)
>  		return 0;

All right.


Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-08-21 18:43       ` Jakub Narębski
  2020-08-25  6:14         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-21 18:43 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As we are writing topological levels to commit data chunk to ensure
> backwards compatibility with "Old" Git and the member `generation` of
> struct commit_graph_data will store corrected commit date in a later
> commit, let's introduce a commit-slab to store topological levels while
> writing commit-graph.

In my opinion the above it would be easier to follow if rephrased in the
following way:

  In a later commit we will introduce corrected commit date as the
  generation number v2.  This value will be stored in the new separate
  GDAT chunk.  However to ensure backwards compatibility with "Old" Git
  we need to continue to write generation number v1, which is
  topological level, to the commit data chunk (CDAT).  This means that
  we need to compute both versions of generation numbers when writing
  the commit-graph file.  Let's therefore introduce a commit-slab
  to store topological levels; corrected commit date will be stored
  in the member `generation` of struct commit_graph_data.

What do you think?


By the way, do I understand it correctly that in backward-compatibility
mode (that is, in mixed-version environment where at least some
commit-graph files were written by "Old" Git and are lacking GDAT chunk
and generation number v2 data) the `generation` member of commit graph
data chunk will be populated and will store generation number v1, that
is topological level? And that the commit-slab for topological levels is
only there for writing and re-writing?

>
> When Git creates a split commit-graph, it takes advantage of the
> generation values that have been computed already and present in
> existing commit-graph files.
>
> So, let's add a pointer to struct commit_graph to the topological level
> commit-slab and populate it with topological levels while writing a
> split commit-graph.

All right, looks sensible.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
>  commit-graph.h |  1 +
>  commit.h       |  1 +
>  3 files changed, 34 insertions(+), 15 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7f9f858577..a2f15b2825 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>
> +define_commit_slab(topo_level_slab, uint32_t);
> +

All right.

Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
think it would be necessary.

>  /* Keep track of the order in which commits are added to our list. */
>  define_commit_slab(commit_pos, int);
>  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
> +	if (g->topo_levels)
> +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }

All right, here we store topological levels on commit-slab to avoid
recomputing them.

Do I understand it correctly that the `topo_levels` member of the `struct
commit_graph` would be non-null only when we are updating the
commit-graph?

>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -953,6 +958,7 @@ struct write_commit_graph_context {
>  		 changed_paths:1,
>  		 order_by_pack:1;
>
> +	struct topo_level_slab *topo_levels;
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
>  	const struct bloom_filter_settings *bloom_settings;

Why do we need `topo_levels` member *both* in `struct commit_graph` and
in `struct write_commit_graph_context`?

[After examining the change further I have realized why both are needed,
 and written about the reasoning later in this email.]


Note that the commit message talks only about `struct commit_graph`...

> @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
>  		else
>  			packedDate[0] = 0;
>
> -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);

All right, here we prepare for writing to the CDAT chunk using data that
is now stored on newly introduced topo_levels slab (either computed, or
taken from commit-graph file being rewritten).

Assuming that ctx->topo_levels is not-null, and that the values are
properly calculated before this -- and we did compute topological levels
before writing the commit-graph.

>
>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

All right, so that is why this 'generation' variable was not converted
to timestamp_t type.

>
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> +		    level != GENERATION_NUMBER_ZERO)
>  			continue;

Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
outside commit-graph files, and therefore we would need its topological
level computed.

However, I don't understand how it works.  We have had created the
commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
to provide default values for `struct commit_graph`... but only for
`graph_pos` member.  It is commit_graph_generation() that returns
GENERATION_NUMBER_INFINITY for commits not in graph.

But neither commit_graph_data_at()->generation nor topo_level_slab_at()
handles this special case, so I don't see how 'generation' variable can
*ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
GENERATION_NUMBER_V1_INFINITY for commits not in graph.

Does it work *accidentally*, because the default value for uninitialized
data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
of commit-graph feature development history, namely the short time where
Git didn't use any generation numbers and stored 0 in the place set for
it in the commit-graph format...  On the other hand this is not the case
for corrected commit date (generation number v2), as it could
"legitimately" be 0 if some root commit (without any parents) had
committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
caused by malformed but valid commit object.

Ugh...

>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>
> -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> +				    level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else if (level > max_level) {
> +					max_level = level;
>  				}
>  			}

This is the same case as for previous chunk; see the comment above.

This code checks if parents have generation number / topological level
computed, and tracks maximum value of it among all parents.

>
>  			if (all_parents_computed) {
> -				struct commit_graph_data *data = commit_graph_data_at(current);
> -
> -				data->generation = max_generation + 1;
>  				pop_commit(&list);
>
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;

OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
this value can be maximum value that can be safely stored in a given
data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
value that can be safely stored in uint32_t, so generation+1 had no
chance to overflow.  This is no longer the case; the reorganization done
here leads to more defensive code (safer).

All good.  However I think that we should clamp the value of topological
level to the maximum value that can be safely stored *on disk*, in the
30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
the code to write topological level would get more complicated.

In my opinion the symbolic constant used here should be named
GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
it should be the current value of GENERATION_NUMBER_MAX, that is
0x3FFFFFFF.

>  			}
>  		}
>  	}
> @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>

All right, we will be using topo_level slab for writing the
commit-graph, and only for this purpose, so it is good to put it here.

>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
>  		}
>  	}
>
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
> +	if (ctx->r->objects->commit_graph) {
> +		struct commit_graph *g = ctx->r->objects->commit_graph;
> +
> +		while (g) {
> +			g->topo_levels = &topo_levels;
> +			g = g->base_graph;
> +		}
> +	}

All right, now I see why we need `topo_levels` member both in the
`struct write_commit_graph_context` and in `struct commit_graph`.
The former is for functions that write the commit-graph, the latter for
fill_commit_graph_info() functions that is deep in the callstack, but it
needs to know whether to load topological level to commit-slab, or maybe
put it as generation number (and in the future -- discard it, if not
needed).


Sidenote: this fragment of code, that fills with a given value some
member of the `struct commit_graph` throughout the split commit-graph
chain, will be repeated as similar code in patches later in series.
However without resorting to preprocessor macros I have no idea how to
generalize it to avoid code duplication (well, almost).

> +
>  	if (pack_indexes) {
>  		ctx->order_by_pack = 1;
>  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> diff --git a/commit-graph.h b/commit-graph.h
> index 430bc830bb..1152a9642e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -72,6 +72,7 @@ struct commit_graph {
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
>
> +	struct topo_level_slab *topo_levels;
>  	struct bloom_filter_settings *bloom_filter_settings;
>  };

All right: `struct commit_graph` is public, `struct
write_commit_graph_context` is not.

>
> diff --git a/commit.h b/commit.h
> index bc0732a4fe..bb846e0025 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -15,6 +15,7 @@
>  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
>  #define GENERATION_NUMBER_MAX 0x3FFFFFFF

The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.

>  #define GENERATION_NUMBER_ZERO 0
> +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF

This value is never used, so why it is defined in this commit.

>
>  struct commit_list {
>  	struct commit *item;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-22  0:05       ` Jakub Narębski
  2020-08-25  6:49         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-22  0:05 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

Good.

>
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.

Perhaps we should add more details about data type sizes in question.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t).  However corrected commit date
offsets can be safely stored^* using only 32 bits.  This halves the size
of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
2*H + 16 + 4 bytes, which is reduction of around 6%, not including
header, fanout table (OIDF) and extra edges list (EDGE).

Which might mean that the extra complication is not worth it, and we
should store corrected commit date directly instead.

*) unless for example one of commits is malformed but valid,
   and has committerdate of 0 Unix time, 1 January 1970.

>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

OK, so the agenda for introducing geeration number v2 is as follows:
- compute generation numbers v2, i.e. corrected commit date
- store corrected commit date [offsets] in new GDAT chunk,
  unless backward-compatibility concerns require us to not to
- load [and compute] corrected commit date from commit-graph
  storing it as 'generation' field of `struct commit_graph_data`,
  unless backward-compatibility concerns require us to store
  topological levels (generation number v1) in there instead

Because the reachability condition for corrected commit date and for
topological level is exactly the same, we don't need to do anything to
take advantage of generation number v2.

Though we can use generation number v2 in more cases, where we turned
off use of generation numbers because v1 gave worse performance than
date heuristics.

Did I got this right?

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index a2f15b2825..fd69534dd5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

At first I was wondering why this tie-breaking is beig removed; wouldn't
be needed for backward-compatibility?  But then I remembered that this
comparison function is used _only_ for sorting commits when writing
Bloom filters, for `git commit-graph write --reachable --changed-paths ...`

Assuming that when writing the commit graph we always compute geeration
number v2 and 'generation' field stores corrected commit date, we don't
need to use date as a heuristic when generations are equal, and it would
not help in tie-breaking anyway.

All right.

>  	return 0;
>  }
>
> @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;

All right, so the pattern is to add 'corrected_commit_date' stuff after
'topological_level' stuff.

>
>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> +		    )
>  			continue;
>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
>  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}
>  			}
>
> @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_MAX - 1)
>  					max_level = GENERATION_NUMBER_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>  			}
>  		}
>  	}

All right.  Looks good to me.

> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		timestamp_t max_generation = 0;
> -		timestamp_t generation;
> +		timestamp_t max_corrected_commit_date = 0;
> +		timestamp_t corrected_commit_date;

This is simple, and perhaps unnecessary, rename of variables.
Shouldn't we however verify *both* topological level, and
(if exists) corrected commit date?

>
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  					     oid_to_hex(&graph_parents->item->object.oid),
>  					     oid_to_hex(&odb_parents->item->object.oid));
>
> -			generation = commit_graph_generation(graph_parents->item);
> -			if (generation > max_generation)
> -				max_generation = generation;
> +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> +			if (corrected_commit_date > max_corrected_commit_date)
> +				max_corrected_commit_date = corrected_commit_date;

Actually, commit_graph_generation(<commit>) can return either corrected
commit date, or topological level, the latter in backward-compatibility
case (if at least one commit-graph file is lacking GDAT chunk, because
[some of] it was created by the "Old" Git).

>
>  			graph_parents = graph_parents->next;
>  			odb_parents = odb_parents->next;
> @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_MAX)
> -			max_generation--;

All right, this was needed for checking the correctness of topological
levels (generation number v1) because we were checking not that it
fullfills the reachability condition, but more strict one: namely that
topological level of commit is equal to maximum of topological levels of
its parents plus one.

The comment about checking both generation number v1 and v2 still
applies.

> -
> -		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		corrected_commit_date = commit_graph_generation(graph_commit);
> +		if (corrected_commit_date < max_corrected_commit_date + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>  				     oid_to_hex(&cur_oid),
> -				     generation,
> -				     max_generation + 1);
> +				     corrected_commit_date,
> +				     max_corrected_commit_date + 1);

All right, we check less strict condition for corrected commit date.

>
>  		if (graph_commit->date != odb_commit->date)
>  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 08/11] commit-graph: implement generation data chunk
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-22 13:09       ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-22 13:09 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

All right, it looks like this patch implements first part of step 2.)
and step 3.) in the following plan of adding support for geeration
number v2:

1. compute generation numbers v2, i.e. corrected commit date
2. store corrected commit date [offsets] in new GDAT chunk,
   unless backward-compatibility concerns require us to not to
3. load [and compute] corrected commit date from commit-graph
   storing it as 'generation' field of `struct commit_graph_data`,
   unless backward-compatibility concerns require us to store
   topological levels (generation number v1) in there instead
4. use generation number v2 in more places, where we had to turn
   off using v1 for performance reasons


"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.

Fortunately, we have fixed this issue, and Git does no longer die when
it encounters commit-graph format version that it does not understand.

>
> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.

Should we say anything about storing 64 bit corrected commit date
(geeration number v2) as 32 bit corrected commit date offset?

>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).

Note that the fact that we do not have special code for handling
mixed-version layers in split commit-graph is not [that] dangerous, as
we don't read this new data yet.  Splitting it to patch 09/11 (next
patch) makes this patch simpler.

>
> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.

All right.

>
> [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 48 ++++++++++++++++++++++++---
>  commit-graph.h                |  2 ++
>  t/README                      |  3 ++
>  t/helper/test-read-graph.c    |  2 ++
>  t/t4216-log-bloom.sh          |  4 +--
>  t/t5318-commit-graph.sh       | 27 +++++++--------
>  t/t5324-split-commit-graph.sh | 12 +++----
>  t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
>  8 files changed, 107 insertions(+), 53 deletions(-)

It might be a good idea to add documentation of this chunk (and only
about this chunk) to Documentation/technical/commit-graph-format.txt

>
> diff --git a/commit-graph.c b/commit-graph.c
> index fd69534dd5..b7a72b40db 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8

All right, define new chunk and increase the maximum number of chunks
commit-graph file can have.

>
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +

All right.  The size of GDAT chunk is defined by the number of commits,
so nothink more is needed.

>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data)
> +		graph_data->generation = item->date +
> +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

WARNING: The above does not properly handle clamped data.

If the offset is maximum value that can be stored in 32 bit field, that
is GENERATION_NUMBER_V2_OFFSET_MAX for more that one commit, we wouldn't
know which commit has greater generation number v2 because of this clamping.
The 'generation' data needs to be set to GENERATION_NUMBER_V2_MAX (which
in turn needs to be smaller than GENERATION_NUMBER_INFINITY).  This is
not done here!

All the above complication would not be an issue if we stored 64 bit
corrected commit date directly, instead of storing 32 bit corrected
commit date offsets. Storing offsets saves at most 4/(2*H + 16 + 4) = 7%
of commit-graph file size (OIDL + CDAT + GDAT with offsets), when
160-bit/20-byte SHA-1 hash is used.

> +	else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

Should we convert GENERATION_NUMBER_V1_MAX into GENERATION_NUMBER_MAX
here, for easier handling of clamped values of generation numbers later
on in backward-compatibile way, without special-casing for v1 and v2?


Here we load and perhaps compute generation number, using v2 if possible
(from GDAT + commit date), with fallback to v1 (from CDAT).

Almost all right... but should this reading be a part of this patch, or
split off into separate patch?

>
>  	if (g->topo_levels)
>  		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> @@ -951,7 +963,8 @@ struct write_commit_graph_context {
>  		 report_progress:1,
>  		 split:1,
>  		 changed_paths:1,
> -		 order_by_pack:1;
> +		 order_by_pack:1,
> +		 write_generation_data:1;

All right, here we store the iformation if we should write the GDAT
chunk, taking into account among other things state of the
GIT_TEST_COMMIT_GRAPH_NO_GDAT enviroment variable.

>
>  	struct topo_level_slab *topo_levels;
>  	const struct split_commit_graph_opts *split_opts;
> @@ -1106,8 +1119,25 @@ static int write_graph_chunk_data(struct hashfile *f,
>  	return 0;
>  }
>
> +static int write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	for (i = 0; i < ctx->commits.nr; i++) {

Side note: it is a bit funny that some of write_graph_chunk_*()
functions use `for` loop, and some `while` loop to process commits.

> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
> +			offset = GENERATION_NUMBER_V2_OFFSET_MAX;

This GENERATION_NUMBER_V2_OFFSET_MAX symbolic constant (or equivalet)
should be defined in this patch and not in previous one, in my opinion.

> +		hashwrite_be32(f, offset);
> +	}
> +
> +	return 0;
> +}
> +
>  static int write_graph_chunk_extra_edges(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx)
> +					  struct write_commit_graph_context *ctx)

This change in whitespace is, I think, incorrect.

>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1726,6 +1756,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
>  	chunks[2].write_fn = write_graph_chunk_data;
> +
> +	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
> +		ctx->write_generation_data = 0;

All right, here we handle newly introduced GIT_TEST_COMMIT_GRAPH_NO_GDAT
environment variable.

> +	if (ctx->write_generation_data) {
> +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
> +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
> +		num_chunks++;
> +	}

Hmmm... so the GDAT chunk does not have a header, and is not versioned.
Does this mean that to move to generation number v3, or add some
additional reachability labeling we would need to either rename the
chunk or add new one?

>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> @@ -2130,6 +2169,7 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->split_opts = split_opts;
>  	ctx->total_bloom_filter_data_size = 0;
> +	ctx->write_generation_data = 1;

But by default we do write the GDAT chunk.

>
>  	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
>  		ctx->changed_paths = 1;
> diff --git a/commit-graph.h b/commit-graph.h
> index 1152a9642e..f78c892fc0 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -6,6 +6,7 @@
>  #include "oidset.h"
>
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
> +#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
>  #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>

All right.

(Though I wonder about the ordering -- but better not to start "bike
shed" discussion).

> @@ -67,6 +68,7 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;

All right, we need to store position of new GDAT chunk.

> diff --git a/t/README b/t/README
> index 70ec61cf88..6647ef132e 100644
> --- a/t/README
> +++ b/t/README
> @@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
>  be written after every 'git commit' command, and overrides the
>  'core.commitGraph' setting to true.
>
> +GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
> +commit-graph to be written without generation data chunk.
> +

All right.

This description could have been more detailed, but I think it is good
enough.

>  GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
>  commit-graph write to compute and write changed path Bloom filters for
>  every 'git commit-graph write', as if the `--changed-paths` option was
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..1c2a5366c7 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)

All right, we examine if GDAT chunk is present.

Many commit-graph tests would probably need to be updated; at least
those that make use if `git test-tool read-graph`.

> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c21cc160f3..55c94e9ebd 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual

All right.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 044cf8a3de..b41b2160c6 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -71,7 +71,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>  graph_read_expect() {
>  	OPTIONAL=""
>  	NUM_CHUNKS=3
> -	if test ! -z $2
> +	if test ! -z "$2"
>  	then
>  		OPTIONAL=" $2"
>  		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))

All right, a fix for issue that is relevant only after this change, now
that there can be more than one extra chunk.

Side note: how I wish that helper function in tests were documented...

> @@ -98,14 +98,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
>  	# valid commit and tree OID
>  	git rev-parse HEAD HEAD^{tree} >in &&
>  	git commit-graph write --stdin-commits <in &&
> -	graph_read_expect 3
> +	graph_read_expect 3 generation_data

All right, we need to treat generation_data as extra, because it can be
not there (it's existence is conditional).

>  '
>
>  test_expect_success 'write graph' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "3"
> +	graph_read_expect "3" generation_data

Side note: I wonder why here we have

  	graph_read_expect "3" generation_data

but one test earlier we have

  	graph_read_expect 3 generation_data

without quotes.

>  '
>
>  test_expect_success POSIXPERM 'write graph has correct permissions' '
> @@ -214,7 +214,7 @@ test_expect_success 'write graph with merges' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"

All right.  It is why we needed to fix graph_read_expect().

>  '
>
>  graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
> @@ -249,7 +249,7 @@ test_expect_success 'write graph with new commit' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -259,7 +259,7 @@ test_expect_success 'write graph with nothing new' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -269,7 +269,7 @@ test_expect_success 'build graph from latest pack with closure' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	cat new-idx | git commit-graph write --stdin-packs &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "9" "extra_edges"
> +	graph_read_expect "9" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
> @@ -282,7 +282,7 @@ test_expect_success 'build graph from commits with closure' '
>  	git rev-parse merge/1 >>commits-in &&
>  	cat commits-in | git commit-graph write --stdin-commits &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "6"
> +	graph_read_expect "6" "generation_data"
>  '
>
>  graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
> @@ -292,7 +292,7 @@ test_expect_success 'build graph from commits with append' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -302,7 +302,7 @@ test_expect_success 'build graph using --reachable' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write --reachable &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -323,7 +323,7 @@ test_expect_success 'write graph in bare repo' '
>  	cd "$TRASH_DIRECTORY/bare" &&
>  	git commit-graph write &&
>  	test_path_is_file $baredir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '

All right, those were the straightforward changes.

>
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -420,8 +420,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
>
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> -	git commit-graph verify >output
> +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> +	git commit-graph verify >output &&
> +	graph_read_expect 9 extra_edges
>  '

What this change is about?  Is it about the fact that we have not added
support for checking correctness of GDAT chunk to `git commit-graph
verify`?  But in previous commit we did modify verify_commit_graph()...

>
>  NUM_COMMITS=9
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index ea28d522b8..531016f405 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
>  	infodir=".git/objects/info" &&
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  	EOM
>  '

I guess this change is because with GDAT chunk present the positions of
relevant bits that we want to corrupt change (I guess because we have
extra chunk in chunk lookup section).

Someone else would have to verify if this change is in fact correct, if
it was not done already.

>
> @@ -28,9 +28,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data

All right, we now have 4 chunks not 3 (old ones + generation_data).

>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index 475564bee7..d14b129f06 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -55,10 +55,13 @@ test_expect_success 'setup' '
>  	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
>  	mv .git/objects/info/commit-graph commit-graph-half &&
>  	chmod u+w commit-graph-half &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> +	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> +	chmod u+w commit-graph-no-gdat &&
>  	git config core.commitGraph true
>  '

All right, we add setup for testing no GDAT case (as if the commit-graph
file was written by "Old" Git).

>
> -run_three_modes () {
> +run_all_modes () {

All right, we compare more modes, among others the no-GDAT case.

Nice futureproofing!

>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
>  	test_cmp expect actual &&
> @@ -67,11 +70,14 @@ run_three_modes () {
>  	test_cmp expect actual &&
>  	cp commit-graph-half .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
> +	test_cmp expect actual &&
> +	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> +	"$@" <input >actual &&
>  	test_cmp expect actual
>  }

OK, now we test yet another variant of commit-graph file: one without
the GDAT chunk (testing for backward compatibility with "Old" Git).

>
> -test_three_modes () {
> -	run_three_modes test-tool reach "$@"
> +test_all_modes () {
> +	run_all_modes test-tool reach "$@"
>  }

All right.

>
>  test_expect_success 'ref_newer:miss' '
> @@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
>  	B:commit-4-9
>  	EOF
>  	echo "ref_newer(A,B):0" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>
>  test_expect_success 'ref_newer:hit' '
> @@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
>  	B:commit-2-3
>  	EOF
>  	echo "ref_newer(A,B):1" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>
>  test_expect_success 'in_merge_bases:hit' '
> @@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
>  	B:commit-8-8
>  	EOF
>  	echo "in_merge_bases(A,B):1" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>
>  test_expect_success 'in_merge_bases:miss' '
> @@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
>  	B:commit-5-9
>  	EOF
>  	echo "in_merge_bases(A,B):0" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>
>  test_expect_success 'is_descendant_of:hit' '
> @@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
>  	X:commit-1-1
>  	EOF
>  	echo "is_descendant_of(A,X):1" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>
>  test_expect_success 'is_descendant_of:miss' '
> @@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
>  	X:commit-7-6
>  	EOF
>  	echo "is_descendant_of(A,X):0" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>
>  test_expect_success 'get_merge_bases_many' '
> @@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
>  		git rev-parse commit-5-6 \
>  			      commit-4-7 | sort
>  	} >expect &&
> -	test_three_modes get_merge_bases_many
> +	test_all_modes get_merge_bases_many
>  '
>
>  test_expect_success 'reduce_heads' '
> @@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
>  			      commit-2-8 \
>  			      commit-1-10 | sort
>  	} >expect &&
> -	test_three_modes reduce_heads
> +	test_all_modes reduce_heads
>  '
>
>  test_expect_success 'can_all_from_reach:hit' '
> @@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach(X,Y):1" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>
>  test_expect_success 'can_all_from_reach:miss' '
> @@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
>  	Y:commit-8-5
>  	EOF
>  	echo "can_all_from_reach(X,Y):0" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>
>  test_expect_success 'can_all_from_reach_with_flag: tags case' '
> @@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
> -	test_three_modes can_all_from_reach_with_flag
> +	test_all_modes can_all_from_reach_with_flag
>  '
>
>  test_expect_success 'commit_contains:hit' '
> @@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):1" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>
>  test_expect_success 'commit_contains:miss' '
> @@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):0" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>
>  test_expect_success 'rev-list: basic topo-order' '
> @@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-6-6
> +	run_all_modes git rev-list --topo-order commit-6-6
>  '
>
>  test_expect_success 'rev-list: first-parent topo-order' '
> @@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-6-6
>  '
>
>  test_expect_success 'rev-list: range topo-order' '
> @@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
>  '
>
>  test_expect_success 'rev-list: range topo-order' '
> @@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
>  '
>
>  test_expect_success 'rev-list: first-parent range topo-order' '
> @@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
>  '
>
>  test_expect_success 'rev-list: ancestry-path topo-order' '
> @@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
>  		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
>  		commit-6-3 commit-5-3 commit-4-3 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
>  '
>
>  test_expect_success 'rev-list: symmetric difference topo-order' '
> @@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
>  		commit-3-8 commit-2-8 commit-1-8 \
>  		commit-3-7 commit-2-7 commit-1-7 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
>  '
>
>  test_expect_success 'get_reachable_subset:all' '
> @@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
>  			      commit-1-7 \
>  			      commit-5-6 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>
>  test_expect_success 'get_reachable_subset:some' '
> @@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
>  		git rev-parse commit-3-3 \
>  			      commit-1-7 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>
>  test_expect_success 'get_reachable_subset:none' '
> @@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
>  	Y:commit-2-8
>  	EOF
>  	echo "get_reachable_subset(X,Y)" >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '

All right, s/test_three_modes/test_all_modes/... which admitedly could
have been separate pure refactoring patch, but it is not necessary.

>
>  test_done

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-08-22 17:14       ` Jakub Narębski
  2020-08-26  7:15         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-22 17:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

All right, this explains changes to the *reading* side.  If there is
split commit-graph layer without GDAT chunk (or to be more exact,
current code checks if there is any layer without GDAT chunk), then we
fall back to using topological levels, as if all layers / all graphs
were without GDAT.  This allows to avoid the issue explained above,
where for some commits 'generation' holds corrected commit date, and for
some it holds topological levels, breaking the reachability condition
guarantee.


However the commit message do not say anything about the *writing* side.

We have decided to not write the GDAT chunk when writing the new layer
in split commit-graph, and top layer doesn't itself have GDAT chunks.
That makes for easier reasoning and safer handling: in mixed-version
environment the only possible arrangement is for the lower layers
(possibly zero) have GDAT chunk, and higher layers (possibly zero) not
have GDAT chunks.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite should
not have GDAT chunks either.


To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
           1              2           3             4            5

In the split commit-graph chain above, merging two topmost layers
(layers 4 and 5) should create a layer without GDAT; merging three
topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
4) should create a new layer with GDAT.

   [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
           1              2           3               merged

   [layer with GDAT][with GDAT][-------------with GDAT------------------]
           1              2                    merged

I hope those ASCII-art pictures help understanding it

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

Wouldn't breaking the reachability condition promise make some Git
commands to return *incorrect* results if they short-circuit, stop
walking if generation number shows that A cannot reach B?

I am talking here about commands that return boolean, or select subset
from given set of revisions:
- git merge-base --is-ancestor <B> <A>
- git branch branch-A <A> && git branch --contains <B>
- git branch branch-B <B> && git branch --merged <A>

Git assumes that generation numbers fulfill the following condition:

  if A can reach B, then gen(A) > gen(B)

Notably this includes commits not in commit-graph, and clamped values.

However, in the following case

* if commit A is from higher layer without GDAT
  and uses topological levels for 'generation', e.g. 115 (in a small repo)
* and commit B is from lower layer with GDAT
  and uses corrected commit date as 'generation', for example 1598112896,

it may happen that A (later commit) can reach B (earlier commit), but
gen(B) > gen(A).  The reachability condition promise for generation
numbers is broken.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

I have reordered files in the patch itself to make it easier to review
the proposed changes.

>  commit-graph.h                |  1 +
>  commit-graph.c                | 32 +++++++++++++++-
>  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
>  3 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.h b/commit-graph.h
> index f78c892fc0..3cf89d895d 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -63,6 +63,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>
>  	uint32_t num_commits_in_base;
> +	uint32_t read_generation_data;
>  	struct commit_graph *base_graph;
>

First, why `read_generation_data` is of uint32_t type, when it stores
(as far as I understand it), a "boolean" value of either 0 or 1?

Second, couldn't we simply set chunk_generation_data to NULL?  Or would
that interfere with the case of rewriting, where we want to use existing
GDAT data when writing new commit-graph with GDAT chunk?

>  	const uint32_t *chunk_oid_fanout;
> diff --git a/commit-graph.c b/commit-graph.c
> index b7a72b40db..c1292f8e08 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>
> +static void validate_mixed_generation_chain(struct repository *r)
> +{
> +	struct commit_graph *g = r->objects->commit_graph;
> +	int read_generation_data = 1;
> +
> +	while (g) {
> +		if (!g->chunk_generation_data) {
> +			read_generation_data = 0;
> +			break;
> +		}
> +		g = g->base_graph;
> +	}

This loop checks whole split commit-graph chain for existence of layers
without GDAT chunk.

On one hand it is more than needed _if_ we assume that the fact that
only topmost layers can be without GDAT holds true. On the other hand it
is safer (an example of defensive coding), and as the length of chain is
limited it should be not much of a performance penalty.

> +
> +	g = r->objects->commit_graph;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}

All right... though one of earlier commits introduced similar loop, but
it set chunk_generation_data to NULL instead.  Or did I remember it wrong?

> +}
> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>
> +	validate_mixed_generation_chain(r);
> +

All right, when reading the commit-graph, check if we are in forced
backward-compatibile mode, and we need to use topological levels for
generation numbers.

>  	return g;
>  }
>
> @@ -763,7 +786,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	if (g->chunk_generation_data)
> +	if (g->chunk_generation_data && g->read_generation_data)

All right, when deciding whether to use corrected commit date
(generation number v2 from GDAT chunk), or fall back to using
topological levels (generation number v1 from CDAT chunk), we need to
take into accout other layers, to not mix v1 and v2.

We have earlier checked whether we can use generation number v2, now we
use the result of this check, propagated down the commit-graph chain.

>  		graph_data->generation = item->date +
>  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>  	else
> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>  	uint32_t pos;
>  	if (!prepare_commit_graph(r))
>  		return;
> +
>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>  }

This is unrelated whitespace fix, a "while at it" in neighbourhood of
changes.  All right then.

> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		g = ctx->r->objects->commit_graph;
>
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2210,6 +2237,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}


The previous commit introduced `write_generation_data` member in
`struct write_commit_graph_context`, then used to handle support for
GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable.

Those two hunks of changes above are both inside

   if (ctx->split) {
      ...
   }

Here we examine the topmost layer of split commit-graph chain, and if it
does not contain GDAT chunk, then we do not store the GDAT chunk, unless
`git commit-graph write` is ru with `--split=replace` option.

However this is overly strict condition.  If we merge layer without GDAT
with layer with GDAT below, then we surely can write GDAT; the condition
for GDAT-less layers would be still fulfilled (met).  However we can
consider it 'good enough' for now, and relax this condition in later
commits.


Note that it is the first time in this patch were we make use of
assumption that if there are layers without GDAT then topmost layer is
without GDAT.

>
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 531016f405..ac5e7783fb 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -424,4 +424,74 @@ done <<\EOF
>  0600 -r--------
>  EOF
>

It would be nice to have an ASCII-art graph of commits, but earlier
tests do not have it either...

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'

Looks all right to me.

> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15 &&
> +	git commit-graph verify
> +'

It would be nice to have an example with merging layers (whether we
would handle it in strict or relaxed way).

> +
>  test_done

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-22 19:09       ` Jakub Narębski
  2020-09-01 10:08         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-22 19:09 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

Hello Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With corrected commit dates implemented, we no longer have to rely on
> commit date as a heuristic in paint_down_to_common().

All right, but it would be nice to have some benchmark data: what were
performance when using topological levels, what was performance when
using commit date heuristics (before this patch), what is performace now
when using corrected commit date.

>
> t6024-recursive-merge setups a unique repository where all commits have
> the same committer date without well-defined merge-base. As this has
> already caused problems (as noted in 859fdc0 (commit-graph: define
> GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
> test script.

OK?

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c             | 14 ++++++++++++++
>  commit-graph.h             |  6 ++++++
>  commit-reach.c             |  2 +-
>  t/t6024-recursive-merge.sh |  4 +++-
>  4 files changed, 24 insertions(+), 2 deletions(-)
>

I have reorderd files for easier review.

> diff --git a/commit-graph.h b/commit-graph.h
> index 3cf89d895d..e22ec1e626 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
>   */
>  int generation_numbers_enabled(struct repository *r);
>
> +/*
> + * Return 1 if and only if the repository has a commit-graph
> + * file and generation data chunk has been written for the file.
> + */
> +int corrected_commit_dates_enabled(struct repository *r);
> +
>  enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
>  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),

All right.

> diff --git a/commit-graph.c b/commit-graph.c
> index c1292f8e08..6411068411 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
>  	return !!first_generation;
>  }
>
> +int corrected_commit_dates_enabled(struct repository *r)
> +{
> +	struct commit_graph *g;
> +	if (!prepare_commit_graph(r))
> +		return 0;
> +
> +	g = r->objects->commit_graph;
> +
> +	if (!g->num_commits)
> +		return 0;
> +
> +	return !!g->chunk_generation_data;
> +}

The previous commit introduced validate_mixed_generation_chain(), which
walked whole split commit-graph chain, and set `read_generation_data`
field in `struct commit_graph` for all layers in the chain.

This function examines only the top layer, so it follows the assumption
that Git would behave in such way that oly topmost layers in the chai
can be GDAT-less.

Why the difference?  Couldn't validate_mixed_generation_chain() simply
call corrected_commit_dates_enabled()?

> +
>  static void close_commit_graph_one(struct commit_graph *g)
>  {
>  	if (!g)
> diff --git a/commit-reach.c b/commit-reach.c
> index 470bc80139..3a1b925274 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  	int i;
>  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>
> -	if (!min_generation)

This check was added in 091f4cf (commit: don't use generation numbers if
not needed, 2018-08-30) by Derrick Stolee, and its commit message
includes benchmark results for running 'git merge-base v4.8 v4.9' in
Linux kernel repository:

      v2.18.0: 0.122s    167,468 walked
  v2.19.0-rc1: 0.547s    635,579 walked
         HEAD: 0.127s

> +	if (!min_generation && !corrected_commit_dates_enabled(r))
>  		queue.compare = compare_commits_by_commit_date;

It would be nice to have similar benchmark for this change... unless of
course there is no change in performance, but I think then it needs to
be stated explicitly.  I think.

>
>  	one->object.flags |= PARENT1;
> diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
> index 332cfc53fd..d3def66e7d 100755
> --- a/t/t6024-recursive-merge.sh
> +++ b/t/t6024-recursive-merge.sh
> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
>  export GIT_COMMITTER_DATE
>
>  test_expect_success 'setup tests' '
> +	GIT_TEST_COMMIT_GRAPH=0 &&
> +	export GIT_TEST_COMMIT_GRAPH &&
>  	echo 1 >a1 &&
>  	git add a1 &&
>  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
>  '
>
>  test_expect_success 'combined merge conflicts' '
> -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> +	test_must_fail git merge -m final G
>  '
>
>  test_expect_success 'result contains a conflict' '

OK, so instead of disabling commit-graph for this test, now we disable
it for the whole script.

Maybe this change should be in a separate patch?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-22 22:20       ` Jakub Narębski
  2020-08-27  6:39         ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-22 22:20 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

All right.

> ---
>  .../technical/commit-graph-format.txt         | 12 ++---
>  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
>  2 files changed, 36 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..71c43884ec 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.

All right, that was duplicated information.  Now that we need to talk
about two of them, it would not make sense to duplicate that.

>
>  - The root tree OID.
>
> @@ -88,6 +84,12 @@ CHUNK DATA:

Shouldn't we also replace 'generation number' occurences in description
of the Commit Data (CDAT) chunk with either 'topological level' or
'generation number v1'?

>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]

It is not exactly 'optional', as it implies that we need to turn it on
(or that we can turn it off).  It is more 'conditional', as it can be
not present due to outside influences (mixed-version environment).

> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.

I have just realized purely theoretical, but possible, problem with
storing non-monotinic generation number related values like corrected
commit date offset in constrained space.  There are problems with
clamping them.

Say that somewhere in the ancestry chain there is a commit A with commit
date far in the future by mistake, for example 2120-08-22; it is
important for that date to be not able to be represented using uint32_t.
Say that a later descendant commit B is malformed, and has committer
date of 0, that is 1970-01-01. This means that the corrected commit date
for B must be larger than 2120-08-22 - which for this commit means that
corrected commit date offset do not fit in 32 bits, and must be clamped
(replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.

Say that we have commit C that is child of B, and it has correct commit
date.  Because of mistake in commit A, it has corrected commit date of
more than 2120-08-22 (corrected commit date degenerated into topological
level plus constant).

Now C can reach B, and B can reach A.  However, if we recover corrected
commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
we get a number that is smaller than correct corrected commit date.  We
will have

   gen(A) > date(B) + offset(B) < gen(C)

Which breaks reachability condition guarantee.

If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
we would get

  gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)

And again reachability condition is broken.

This is a very contrived but possible example.  This shouldn't happen,
but ufortunately it can happen.


The question is how to deal with this issue.  Ignore it as unlikely?
Switch to storing corrected commit date, which is monotonic, so if there
is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
larger commit-graph file?

> +    * This list can be later modified to store future generation number related
> +      data.

How can it be later modified?  There is no header, no version number.
How would we add another generation number data?

> +
>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 808fa30b99..f27145328c 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
>
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates
> +2. Topological levels

Should we add versioning info, that is:

  +1. Corrected committer dates  (generation number v2)
  +2. Topological levels  (generation number v1)

> +
> +Define "corrected committer date" of a commit recursively as follows:
> +
> +  * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.
> +
> +  * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> +Define the "topological level" of a commit recursively as follows:
>
>   * A commit with no parents (a root commit) has generation number one.

Shouldn't this be

    * A commit with no parents (a root commit) has topological level of one.

>
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
>
> -Equivalently, the generation number of a commit A is one more than the
> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:

We should probably explicitly state that the property state applies to
both versions of generation number, not only to topological level.

>
> @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>
> -This heuristic is currently used whenever the computation is allowed to
> -violate topological relationships due to clock skew (such as "git log"
> -with default order), but is not used when the topological order is
> -required (such as merge base calculations, "git log --graph").
> -

To be overly pedantic, this heuristic is still used, but now in much
more rare case.  In addition to what is stated above, at least one layer
in the split commit-graph chain must have been generated by "Old" Git,
for the date heuristic to be used.

But that might be unnecessary level of detail.

>  In practice, we expect some commits to be created recently and not stored
>  in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not

All right.

>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> -generation numbers are computed to be at least this value. We limit at
> -this value since it is the largest value that can be stored in the
> -commit-graph file using the 30 bits available to generation numbers. This
> -presents another case where a commit can have generation number equal to
> -that of a parent.
> +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> +are computed to be at least this value. We limit at this value since it is
> +the largest value that can be stored in the commit-graph file using the
> +available to generation numbers. This presents another case where a
> +commit can have generation number equal to that of a parent.

All right, though it could have been done without re-wrapping, so that
only first line would be marked as changed.

As I wrote, there is theoretical problem with this for offsets.

>
>  Design Details
>  --------------
> @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>
> +We also merge commit-graph chains when we try to write a commit graph with
> +two different generation number definitions as they cannot be compared directly.
> +We overwrite the existing chain and create a commit-graph with the newer or more
> +efficient defintion. For example, overwriting topological levels commit graph
> +chain to create a corrected commit dates commit graph chain.
> +

This is more complicated than that.

I think we should explicitly state that Git ensures that in split
commit-graph chain, if there are layers without the GDAT chunk (that
force Git to use topological levels for generation numbers), then they
are top layers.  So if there is commit-graph file created by "Old" Git,
then when addig new layer it would also be GDAT-less.

Now how to write this...

>  ## Deleting graph-{hash} files
>
>  After a new tip file is written, some `graph-{hash}` files may no longer

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
  2020-08-18  6:12       ` Abhishek Kumar
@ 2020-08-23 15:27       ` Jakub Narębski
  2020-08-24  2:49         ` Abhishek Kumar
  2 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-23 15:27 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Jakub Narębski

Hello,

Here is a summary of my comments and thoughts after carefully reviewing
all patches in the series.

Jakub Narębski <jnareb@gmail.com> writes:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>> The Corrected Commit Date is defined as:
>>
>> For a commit C, let its corrected commit date be the maximum of the commit
>> date of C and the corrected commit dates of its parents.
>
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
>
>       A can reach B   =>  gen(A) < gen(B)
>
> Of course it can be computed in simpler way, because
>
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1

I see that it is defined correctly in the documentation, which is more
important than cover letter (which would not be stored in the repository
for memory, even in the commit message for the merge).

[...]
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,

Actually we can simplify handling of merging layers, while still
retaining the property that in mixed-version chain all GDAT-full layers
are before / below GDAT-less layers.

Namely, if merging layers, and at least one layer being merged doesn't
have GDAT chunk, then the result of merge wouldn't have either.

We can always switch to slightly more complicated behavior proposed
above in quoted part later, perhaps as a followup commit.

>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This might be not necessary if Git would always check the whole chain of
split commit-graph layers for presence of GDAT-less layers.

But I still think it is a good idea to avoid having GDAT-less "holes".

>
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)

Or stored directly in GDAT, at the cost of increasing the file size by
at most 7% (if I have done my math correctly).

See also the issue with clamping offsets (GENERATION_NUMBER_V2_OFFSET_MAX).

>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

And this is being done in this patch series.  Good!

The thing I was worrying about turned out to be non-issue, as the
comparison function in question is used only when writing, and in this
case we have corrected commit date computer - though perhaps not being
written (as far as I understand it, but I might be mistaken).

[...]
>>
>> Abhishek Kumar (11):
>>   commit-graph: fix regression when computing bloom filter

No problems, maybe just expanding a commit message and/or adding a comment.

>>   revision: parse parent in indegree_walk_step()

Looks good to me.

>>   commit-graph: consolidate fill_commit_graph_info

I think this commit could be split into three:
- fix to the 'generate tar with future mtime' test
  as it is a hidden bug (when using commit-graph)
- simplify implementation of fill_commit_in_graph()
  by using fill_commit_graph_info()
- move loading date into fill_commit_graph_info()
  that uncovers the issue with 'generate tar with future mtime'

On the other hand because they are inter-related, those changes might be
kept in a single commit.

In commit message greater care needs to be taken with
fill_commit_graph_info() and fill_commit_in_graph(), when to use one and
when to use the other. For example it is fill_commit_graph_info() that
changes its behavior, and it is fill_commit_in_graph() that is getting
simplified.

>>   commit-graph: consolidate compare_commits_by_gen

Looks good to me, though it might be good idea to add comments about the
sorting order (inside comment) to appropriate header files.

>>   commit-graph: return 64-bit generation number

This conversion misses one place, though it would be changed to use
topological levels slab in next patch.

This patch also unnecessary introduces GENERATION_NUMBER_V1_INFINITY.
There is no need for it: `generation` field can always simply use
GENERATION_NUMBER_INFINITY for commits not in commit-graph.

>>   commit-graph: add a slab to store topological levels

This is the patch that needs GENERATION_NUMBER_V1_INFINITY (or
TOPOLOGICAL_LEVEL_INFINITY), not the previous patch.

Detailed analysis uncovered hidden bug in the code of
compute_generation_numbers() that works only because of historical
reasons (that topological levels in Git start from 1, not from 0). The
problem is that the 'level' / 'generation' variable for commits not in
graph, and therefore ones that needs to have its generation number
computed, is 0 (default value on commit slab) and not
GENERATION_NUMBER*_INFINITY as it should.

This issue is present since moving commit graph info data to
commit-slab.  We can simply document it and ignore (it works, if by
accident), or try to fix it.

We need to handle GENERATION_NUMBER*_MAX clamping carefully
in the future patches.

I think also that this patch needs a bit more detailed commit message.

>>   commit-graph: implement corrected commit date

Looks good, though verify_commit_graph() now verifies *a* generation
number used, be it v1 (topological level) or v2 (corrected commit date),
so the variable rename is unnecessary.  We verify that they fulfill the
reachability condition promise, that is gen(parent) <= gen(commit),
(the '=' is to include GENERATION_NUMBER*_MAX case), as it is what is
used to speed up graph traversal.

We probably want to verify both topological level values in CDAT, and if
they exist also corrected commit date values in GDAT.  But that might be
left for the future commit.

>>   commit-graph: implement generation data chunk

To save up to 6-7% on commit-graph file size we store 32-bits corrected
commit date offsets, instead of storing 64-bits corrected commit date.

However, as far as I understand it, using non-monotonic values for
on-disk storage with limited field size, like 32-bits corrected
commit date offset, leads to problems with GENERATION_NUMBER*_MAX.
Namely, as I have written in detail in my reply to patch 11/11 in this
series, there is no way to fulfill the reachability condition promise if
we have to store offset which true value do not fit in 32-bits reserved
for it.

This is extremly unlikely to happen in practice, but we need to be able
to handle it somehow.  We can store 64-bit corrected commit date, which
has graph-monotonic values, and the problem goes away in theory and in
practice (we would never have values that do not fit).  We can keep
storing 32-bit offsets, and simply do not use GDAT chunk if there is
offset value that do not fit.

All this of course, provided that I am not wrong about this issue...

>>   commit-graph: use generation v2 only if entire chain does

The commit message do not say anything about the *writing* side.

However, if we want to keep the requirement that GDAT-less layers in the
split commit-graph chain must all be above any GDAT-containing layers,
we need to consider how we want layer merging to behave.  We could
either opt for using GDAT whenever possible, or similify code and skip
using GDAT if we are unsure.

The first approach would mean that if the topmost layer below set of
layers being rewritten (in the split commit-graph chain) exists, and it
does not contain GDAT chunk, then and only then the result of rewrite
should not have GDAT chunk either.

The second approach is, I think, simpler: if any of layers that is being
rewritten is GDAT-less (we need to check only the top layer, though),
and we are not running full rewrite ('--split=replace'), then the result
of rewrite should not have GDAT chunk either.  We can switch to the
first algorithm in later commit.

Whether we choose one or the other, we need test that doing layer
merging do not break GDAT-inness requirement stated above.


Also, we can probably test that we are not using v1 and v2 at the same
time with tests involving --is-ancestor, or --contains / --merged.

>>   commit-reach: use corrected commit dates in paint_down_to_common()

I think this commit could be split into two:
- disable commit graph for entire t6024-recursive-merge test
- use corrected commit dates in paint_down_to_common()

On the other hand because they are inter-related, those changes might be
better kept in a single commit.

It would be nice to have some benchmark data, or at least stating that
performance does not change (within the error bounds), using for example
'git merge-base v4.8 v4.9' in Linux kernel repository.

>>   doc: add corrected commit date info

Looks good, there are a few places where 'generation number' (referring
to the v1 version) should have been replaced with 'topological level'.

I am also unsure how the GDAT chunk can be "later modified to store
future generation number related data.".  I'd like to have an example,
or for this statement to be removed (if it turns out to not be true, not
without introducing yet another chunk type).

>>  18 files changed, 396 insertions(+), 185 deletions(-)

Thanks for all your work.

Best regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-23 15:27       ` Jakub Narębski
@ 2020-08-24  2:49         ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-24  2:49 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

Hello,

On Sun, Aug 23, 2020 at 05:27:46PM +0200, Jakub Narębski wrote:
> Hello,
> 
> Here is a summary of my comments and thoughts after carefully reviewing
> all patches in the series.
> 

Thanks for the detailed review. It must have taken you a lot of time and
focus to go through the patches, so I have really appreciate the effort.

I have been going over the comments and they are reasonable and very
much needed. I will respond to the mails in the specific threads (to
keep the discussion locally scoped and thus manageable), with doubts
and comments that I have as I try to follow through the suggestions.

> Best regards,
> --
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-21 13:14       ` Jakub Narębski
@ 2020-08-25  5:04         ` Abhishek Kumar
  2020-08-25 12:18           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-25  5:04 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee

On Fri, Aug 21, 2020 at 03:14:34PM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In a preparatory step, let's return timestamp_t values from
> > commit_graph_generation(), use timestamp_t for local variables
> 
> All right, this is all good.
> 
> > and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
> 
> This needs more detailed examination.  There are two similar constants,
> GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
> used for newest commits outside the commit-graph, while the latter is
> maximum number that commits in the commit-graph can have (because of the
> storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
> to be larger than GENERATION_NUMBER_MAX, and it is (and was).
> 
> The GENERATION_NUMBER_INFINITY is because of the above requirement
> traditionally taken as maximum value that can be represented in the data
> type used to store commit's generation number _in memory_, but it can be
> less.  For timestamp_t the maximum value that can be represented
> is (2 ^ 63 - 1).
> 
> All right then.
> 
> >

Related to this, by the end of this series we are using
GENERATION_NUMBER_MAX in just one place - compute_generation_numbers()
to make sure the topological levels fit within 30 bits.

Would it be more appropriate to rename GENERATION_NUMBER_MAX to
GENERATION_NUMBER_V1_MAX (along the lines of
GENERATION_NUMBER_V2_OFFSET_MAX)  to correctly describe that is a
limit on topological levels, rather than generation number value?

> 
> The commit message says nothing about the new symbolic constant
> GENERATION_NUMBER_V1_INFINITY, though.
> 
> I'm not sure it is even needed (see comments below).

Yes, you are correct. I tried it out with your suggestions and it wasn't
really needed.

Thanks for catching this!

> ...

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-21 18:43       ` Jakub Narębski
@ 2020-08-25  6:14         ` Abhishek Kumar
  2020-08-25  7:33           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-25  6:14 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > As we are writing topological levels to commit data chunk to ensure
> > backwards compatibility with "Old" Git and the member `generation` of
> > struct commit_graph_data will store corrected commit date in a later
> > commit, let's introduce a commit-slab to store topological levels while
> > writing commit-graph.
> 
> In my opinion the above it would be easier to follow if rephrased in the
> following way:
> 
>   In a later commit we will introduce corrected commit date as the
>   generation number v2.  This value will be stored in the new separate
>   GDAT chunk.  However to ensure backwards compatibility with "Old" Git
>   we need to continue to write generation number v1, which is
>   topological level, to the commit data chunk (CDAT).  This means that
>   we need to compute both versions of generation numbers when writing
>   the commit-graph file.  Let's therefore introduce a commit-slab
>   to store topological levels; corrected commit date will be stored
>   in the member `generation` of struct commit_graph_data.
> 
> What do you think?
> 

Yes, that's better.

> 
> By the way, do I understand it correctly that in backward-compatibility
> mode (that is, in mixed-version environment where at least some
> commit-graph files were written by "Old" Git and are lacking GDAT chunk
> and generation number v2 data) the `generation` member of commit graph
> data chunk will be populated and will store generation number v1, that
> is topological level? And that the commit-slab for topological levels is
> only there for writing and re-writing?
> 

No, the topo_levels commit-slab would be always populated when we write
a commit data chunk. The topo_level slab is a workaround for the fact
that we are trying to write two independent values (corrected commit
date offset, topological levels) but have one struct member to store them in
(data->generation).

If we are in a mixed-version environment, we could avoid initializing
the slab and fill the topological levels into data->generation instead,
but that's not how it is implemented right now.

> >
> > When Git creates a split commit-graph, it takes advantage of the
> > generation values that have been computed already and present in
> > existing commit-graph files.
> >
> > So, let's add a pointer to struct commit_graph to the topological level
> > commit-slab and populate it with topological levels while writing a
> > split commit-graph.
> 
> All right, looks sensible.

I have extend the last paragraph to include write_commit_graph_context
as well as:

  So, let's add a pointer to struct commit_graph as well as struct
  write_commit_graph_context to the topological level commit-slab and
  populate it with topological levels while writing a commit-graph file.

> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
> >  commit-graph.h |  1 +
> >  commit.h       |  1 +
> >  3 files changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7f9f858577..a2f15b2825 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
> >  /* Remember to update object flag allocation in object.h */
> >  #define REACHABLE       (1u<<15)
> >
> > +define_commit_slab(topo_level_slab, uint32_t);
> > +
> 
> All right.
> 
> Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
> think it would be necessary.
> 
> >  /* Keep track of the order in which commits are added to our list. */
> >  define_commit_slab(commit_pos, int);
> >  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> > @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +
> > +	if (g->topo_levels)
> > +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> 
> All right, here we store topological levels on commit-slab to avoid
> recomputing them.
> 
> Do I understand it correctly that the `topo_levels` member of the `struct
> commit_graph` would be non-null only when we are updating the
> commit-graph?
> 

Yes, that's correct.

> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -953,6 +958,7 @@ struct write_commit_graph_context {
> >  		 changed_paths:1,
> >  		 order_by_pack:1;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	const struct split_commit_graph_opts *split_opts;
> >  	size_t total_bloom_filter_data_size;
> >  	const struct bloom_filter_settings *bloom_settings;
> 
> Why do we need `topo_levels` member *both* in `struct commit_graph` and
> in `struct write_commit_graph_context`?
> 
> [After examining the change further I have realized why both are needed,
>  and written about the reasoning later in this email.]
> 
> 
> Note that the commit message talks only about `struct commit_graph`...
> 
> > @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
> >  		else
> >  			packedDate[0] = 0;
> >
> > -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> > +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
> 
> All right, here we prepare for writing to the CDAT chunk using data that
> is now stored on newly introduced topo_levels slab (either computed, or
> taken from commit-graph file being rewritten).
> 
> Assuming that ctx->topo_levels is not-null, and that the values are
> properly calculated before this -- and we did compute topological levels
> before writing the commit-graph.
> 
> >
> >  		packedDate[1] = htonl((*list)->date);
> >  		hashwrite(f, packedDate, 8);
> > @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> All right, so that is why this 'generation' variable was not converted
> to timestamp_t type.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > +		    level != GENERATION_NUMBER_ZERO)
> >  			continue;
> 
> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> outside commit-graph files, and therefore we would need its topological
> level computed.
> 
> However, I don't understand how it works.  We have had created the
> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> to provide default values for `struct commit_graph`... but only for
> `graph_pos` member.  It is commit_graph_generation() that returns
> GENERATION_NUMBER_INFINITY for commits not in graph.
> 
> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> handles this special case, so I don't see how 'generation' variable can
> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> 
> Does it work *accidentally*, because the default value for uninitialized
> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> of commit-graph feature development history, namely the short time where
> Git didn't use any generation numbers and stored 0 in the place set for
> it in the commit-graph format...  On the other hand this is not the case
> for corrected commit date (generation number v2), as it could
> "legitimately" be 0 if some root commit (without any parents) had
> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> caused by malformed but valid commit object.
> 
> Ugh...

It works accidentally.

Our decision to avoid the cost of initializing both
commit_graph_data->generation and commit_graph_data->graph_pos has
led to some unwieldy choices - the complexity of helper functions,
bypassing helper functions when writing a commit-graph file [1].

I want to re-visit how commit_graph_data slab is defined in a future series.

[1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

> 
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > +				    level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else if (level > max_level) {
> > +					max_level = level;
> >  				}
> >  			}
> 
> This is the same case as for previous chunk; see the comment above.
> 
> This code checks if parents have generation number / topological level
> computed, and tracks maximum value of it among all parents.
> 
> >
> >  			if (all_parents_computed) {
> > -				struct commit_graph_data *data = commit_graph_data_at(current);
> > -
> > -				data->generation = max_generation + 1;
> >  				pop_commit(&list);
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> 
> OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
> this value can be maximum value that can be safely stored in a given
> data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
> value that can be safely stored in uint32_t, so generation+1 had no
> chance to overflow.  This is no longer the case; the reorganization done
> here leads to more defensive code (safer).
> 
> All good.  However I think that we should clamp the value of topological
> level to the maximum value that can be safely stored *on disk*, in the
> 30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
> the code to write topological level would get more complicated.
> 
> In my opinion the symbolic constant used here should be named
> GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
> it should be the current value of GENERATION_NUMBER_MAX, that is
> 0x3FFFFFFF.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> 
> All right, we will be using topo_level slab for writing the
> commit-graph, and only for this purpose, so it is good to put it here.
> 
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
> >  		}
> >  	}
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> > +	if (ctx->r->objects->commit_graph) {
> > +		struct commit_graph *g = ctx->r->objects->commit_graph;
> > +
> > +		while (g) {
> > +			g->topo_levels = &topo_levels;
> > +			g = g->base_graph;
> > +		}
> > +	}
> 
> All right, now I see why we need `topo_levels` member both in the
> `struct write_commit_graph_context` and in `struct commit_graph`.
> The former is for functions that write the commit-graph, the latter for
> fill_commit_graph_info() functions that is deep in the callstack, but it
> needs to know whether to load topological level to commit-slab, or maybe
> put it as generation number (and in the future -- discard it, if not
> needed).
> 
> 
> Sidenote: this fragment of code, that fills with a given value some
> member of the `struct commit_graph` throughout the split commit-graph
> chain, will be repeated as similar code in patches later in series.
> However without resorting to preprocessor macros I have no idea how to
> generalize it to avoid code duplication (well, almost).
> 

The pattern is: iterate over the commit-graph chain and assign a member
(here, topo_level and in the other patch, read_generation_data) a value
(the address of topo_level slab, 1/0 depending on whether it is a mixed
generation chain).

We could generalize this in a future series but I don't think it is
worthwhile.

> > +
> >  	if (pack_indexes) {
> >  		ctx->order_by_pack = 1;
> >  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 430bc830bb..1152a9642e 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -72,6 +72,7 @@ struct commit_graph {
> >  	const unsigned char *chunk_bloom_indexes;
> >  	const unsigned char *chunk_bloom_data;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	struct bloom_filter_settings *bloom_filter_settings;
> >  };
> 
> All right: `struct commit_graph` is public, `struct
> write_commit_graph_context` is not.
> 
> >
> > diff --git a/commit.h b/commit.h
> > index bc0732a4fe..bb846e0025 100644
> > --- a/commit.h
> > +++ b/commit.h
> > @@ -15,6 +15,7 @@
> >  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
> >  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
> 
> The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
> GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.
> 
> >  #define GENERATION_NUMBER_ZERO 0
> > +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
> 
> This value is never used, so why it is defined in this commit.
> 

Moved down to the patch actually uses it.

> >
> >  struct commit_list {
> >  	struct commit *item;
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-22  0:05       ` Jakub Narębski
@ 2020-08-25  6:49         ` Abhishek Kumar
  2020-08-25 10:07           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-25  6:49 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With most of preparations done, let's implement corrected commit date.
> >
> > The corrected commit date for a commit is defined as:
> >
> > * A commit with no parents (a root commit) has corrected commit date
> >   equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> >   the maximum of its commit date and one more than the largest corrected
> >   commit date among its parents.
> 
> Good.
> 
> >
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> 
> Perhaps we should add more details about data type sizes in question.

Will add.

> 
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t).  However corrected commit date
> offsets can be safely stored^* using only 32 bits.  This halves the size
> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
> header, fanout table (OIDF) and extra edges list (EDGE).
> 
> Which might mean that the extra complication is not worth it, and we
> should store corrected commit date directly instead.
> 
> *) unless for example one of commits is malformed but valid,
>    and has committerdate of 0 Unix time, 1 January 1970.
> 
> >
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> 
> OK, so the agenda for introducing geeration number v2 is as follows:
> - compute generation numbers v2, i.e. corrected commit date
> - store corrected commit date [offsets] in new GDAT chunk,
>   unless backward-compatibility concerns require us to not to
> - load [and compute] corrected commit date from commit-graph
>   storing it as 'generation' field of `struct commit_graph_data`,
>   unless backward-compatibility concerns require us to store
>   topological levels (generation number v1) in there instead
> 

The last point is not correct. We always store topological levels into
the topo_levels slab introduced and always store corrected commit date
into data->generation, regardless of backward compatibility concerns.

We could avoid initializing topo_slab if we are not writing generation
data chunk (and thus don't need corrected commit dates) but that
wouldn't have an impact on run time while writing commit-graph because
computing corrected commit dates is cheap as the main cost is in walking
the graph and writing the file.

> Because the reachability condition for corrected commit date and for
> topological level is exactly the same, we don't need to do anything to
> take advantage of generation number v2.
> 
> Though we can use generation number v2 in more cases, where we turned
> off use of generation numbers because v1 gave worse performance than
> date heuristics.
> 
> Did I got this right?
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index a2f15b2825..fd69534dd5 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> At first I was wondering why this tie-breaking is beig removed; wouldn't
> be needed for backward-compatibility?  But then I remembered that this
> comparison function is used _only_ for sorting commits when writing
> Bloom filters, for `git commit-graph write --reachable --changed-paths ...`
> 
> Assuming that when writing the commit graph we always compute geeration
> number v2 and 'generation' field stores corrected commit date, we don't
> need to use date as a heuristic when generations are equal, and it would
> not help in tie-breaking anyway.
> 
> All right.
> 
> >  	return 0;
> >  }
> >
> > @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> 
> All right, so the pattern is to add 'corrected_commit_date' stuff after
> 'topological_level' stuff.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> > +		    )
> >  			continue;
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >
> >  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> >  			}
> >
> > @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_MAX - 1)
> >  					max_level = GENERATION_NUMBER_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> >  			}
> >  		}
> >  	}
> 
> All right.  Looks good to me.
> 
> > @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  	for (i = 0; i < g->num_commits; i++) {
> >  		struct commit *graph_commit, *odb_commit;
> >  		struct commit_list *graph_parents, *odb_parents;
> > -		timestamp_t max_generation = 0;
> > -		timestamp_t generation;
> > +		timestamp_t max_corrected_commit_date = 0;
> > +		timestamp_t corrected_commit_date;
> 
> This is simple, and perhaps unnecessary, rename of variables.
> Shouldn't we however verify *both* topological level, and
> (if exists) corrected commit date?

The problem with verifying both topological level and corrected commit
dates is that we would have to re-fill commit_graph_data slab with commit
data chunk as we cannot modify data->generation otherwise, essentially
repeating the whole verification process.

While it's okay for now, I might take this up in a future series [1].

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

> 
> >
> >  		display_progress(progress, i + 1);
> >  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> > @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  					     oid_to_hex(&graph_parents->item->object.oid),
> >  					     oid_to_hex(&odb_parents->item->object.oid));
> >
> > -			generation = commit_graph_generation(graph_parents->item);
> > -			if (generation > max_generation)
> > -				max_generation = generation;
> > +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> > +			if (corrected_commit_date > max_corrected_commit_date)
> > +				max_corrected_commit_date = corrected_commit_date;
> 
> Actually, commit_graph_generation(<commit>) can return either corrected
> commit date, or topological level, the latter in backward-compatibility
> case (if at least one commit-graph file is lacking GDAT chunk, because
> [some of] it was created by the "Old" Git).
> 
> >
> >  			graph_parents = graph_parents->next;
> >  			odb_parents = odb_parents->next;
> > @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_MAX)
> > -			max_generation--;
> 
> All right, this was needed for checking the correctness of topological
> levels (generation number v1) because we were checking not that it
> fullfills the reachability condition, but more strict one: namely that
> topological level of commit is equal to maximum of topological levels of
> its parents plus one.
> 
> The comment about checking both generation number v1 and v2 still
> applies.
> 
> > -
> > -		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		corrected_commit_date = commit_graph_generation(graph_commit);
> > +		if (corrected_commit_date < max_corrected_commit_date + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> >  				     oid_to_hex(&cur_oid),
> > -				     generation,
> > -				     max_generation + 1);
> > +				     corrected_commit_date,
> > +				     max_corrected_commit_date + 1);
> 
> All right, we check less strict condition for corrected commit date.
> 
> >
> >  		if (graph_commit->date != odb_commit->date)
> >  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  6:14         ` Abhishek Kumar
@ 2020-08-25  7:33           ` Jakub Narębski
  2020-08-25  7:56             ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-25  7:33 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  					_("Computing commit graph generation numbers"),
>>>  					ctx->commits.nr);
>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
>>> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>
>> All right, so that is why this 'generation' variable was not converted
>> to timestamp_t type.
>>
>>>
>>>  		display_progress(ctx->progress, i + 1);
>>> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
>>> -		    generation != GENERATION_NUMBER_ZERO)
>>> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
>>> +		    level != GENERATION_NUMBER_ZERO)
>>>  			continue;
>>
>> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
>> outside commit-graph files, and therefore we would need its topological
>> level computed.
>>
>> However, I don't understand how it works.  We have had created the
>> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
>> to provide default values for `struct commit_graph`... but only for
>> `graph_pos` member.  It is commit_graph_generation() that returns
>> GENERATION_NUMBER_INFINITY for commits not in graph.
>>
>> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
>> handles this special case, so I don't see how 'generation' variable can
>> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
>> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
>>
>> Does it work *accidentally*, because the default value for uninitialized
>> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
>> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
>> of commit-graph feature development history, namely the short time where
>> Git didn't use any generation numbers and stored 0 in the place set for
>> it in the commit-graph format...  On the other hand this is not the case
>> for corrected commit date (generation number v2), as it could
>> "legitimately" be 0 if some root commit (without any parents) had
>> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
>> caused by malformed but valid commit object.
>>
>> Ugh...
>
> It works accidentally.
>
> Our decision to avoid the cost of initializing both
> commit_graph_data->generation and commit_graph_data->graph_pos has
> led to some unwieldy choices - the complexity of helper functions,
> bypassing helper functions when writing a commit-graph file [1].
>
> I want to re-visit how commit_graph_data slab is defined in a future series.
>
> [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

All right, we might want to make use of the fact that the value of 0 for
topological level here always mean that its value for a commit needs to
be computed, that 0 is not a valid value for topological levels.
- if the value 0 came from commit-graph file, it means that it came
  from Git version that used commit-graph but didn't compute generation
  numbers; the value is GENERATION_NUMBER_ZERO
- the value 0 might came from the fact that commit is not in graph,
  and that commit-slab zero-initializes the values stored; let's
  call this value GENERATION_NUMBER_UNINITIALIZED

If we ensure that corrected commit date can never be zero (which is
extremely unlikely, as one of root commits would have to be malformed or
written on badly misconfigured computer, with value of 0 for committer
timestamp), then this "happy accident" can keep working.

  As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
  has corrected commit date of one, to be able to distinguish
  uninitialized values.

Or something like that.

Actually, it is not even necessary, as corrected commit date of 0 just
means that this single value (well, for every root commit with commit
date of 0) would be unnecessary recomputed in compute_generation_numbers().

Anyway, we would want to document this fact in the commit message.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  7:33           ` Jakub Narębski
@ 2020-08-25  7:56             ` Jakub Narębski
  2020-09-01 10:26               ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-25  7:56 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> > On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>
> >>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> [...]
> >>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >>>                                     _("Computing commit graph generation numbers"),
> >>>                                     ctx->commits.nr);
> >>>     for (i = 0; i < ctx->commits.nr; i++) {
> >>> -           uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> >>> +           uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >>
> >> All right, so that is why this 'generation' variable was not converted
> >> to timestamp_t type.
> >>
> >>>
> >>>             display_progress(ctx->progress, i + 1);
> >>> -           if (generation != GENERATION_NUMBER_V1_INFINITY &&
> >>> -               generation != GENERATION_NUMBER_ZERO)
> >>> +           if (level != GENERATION_NUMBER_V1_INFINITY &&
> >>> +               level != GENERATION_NUMBER_ZERO)
> >>>                     continue;
> >>
> >> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> >> outside commit-graph files, and therefore we would need its topological
> >> level computed.
> >>
> >> However, I don't understand how it works.  We have had created the
> >> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> >> to provide default values for `struct commit_graph`... but only for
> >> `graph_pos` member.  It is commit_graph_generation() that returns
> >> GENERATION_NUMBER_INFINITY for commits not in graph.
> >>
> >> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> >> handles this special case, so I don't see how 'generation' variable can
> >> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> >> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> >>
> >> Does it work *accidentally*, because the default value for uninitialized
> >> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> >> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> >> of commit-graph feature development history, namely the short time where
> >> Git didn't use any generation numbers and stored 0 in the place set for
> >> it in the commit-graph format...  On the other hand this is not the case
> >> for corrected commit date (generation number v2), as it could
> >> "legitimately" be 0 if some root commit (without any parents) had
> >> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> >> caused by malformed but valid commit object.
> >>
> >> Ugh...
> >
> > It works accidentally.
> >
> > Our decision to avoid the cost of initializing both
> > commit_graph_data->generation and commit_graph_data->graph_pos has
> > led to some unwieldy choices - the complexity of helper functions,
> > bypassing helper functions when writing a commit-graph file [1].
> >
> > I want to re-visit how commit_graph_data slab is defined in a future series.
> >
> > [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/
>
> All right, we might want to make use of the fact that the value of 0 for
> topological level here always mean that its value for a commit needs to
> be computed, that 0 is not a valid value for topological levels.
> - if the value 0 came from commit-graph file, it means that it came
>   from Git version that used commit-graph but didn't compute generation
>   numbers; the value is GENERATION_NUMBER_ZERO
> - the value 0 might came from the fact that commit is not in graph,
>   and that commit-slab zero-initializes the values stored; let's
>   call this value GENERATION_NUMBER_UNINITIALIZED
>
> If we ensure that corrected commit date can never be zero (which is
> extremely unlikely, as one of root commits would have to be malformed or
> written on badly misconfigured computer, with value of 0 for committer
> timestamp), then this "happy accident" can keep working.
>
>   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
>   has corrected commit date of one, to be able to distinguish
>   uninitialized values.
>
> Or something like that.
>
> Actually, it is not even necessary, as corrected commit date of 0 just
> means that this single value (well, for every root commit with commit
> date of 0) would be unnecessary recomputed in compute_generation_numbers().
>
> Anyway, we would want to document this fact in the commit message.

Alternatively, instead of comparing 'level' (and later in series also
'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
we could load at no extra cost `graph_pos` value and compare it
against COMMIT_NOT_FROM_GRAPH.

But with this solution we could never get rid of graph_pos, if we
think it is unnecessary. If we split commit_graph_data into separate
slabs (as it was in early versions of respective patch series), we
would have to pay additional cost.

But it is an alternative.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-25  6:49         ` Abhishek Kumar
@ 2020-08-25 10:07           ` Jakub Narębski
  2020-09-01 11:01             ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-25 10:07 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> To minimize the space required to store corrected commit date, Git
>>> stores corrected commit date offsets into the commit-graph file. The
>>> corrected commit date offset for a commit is defined as the difference
>>> between its corrected commit date and actual commit date.
>>
>> Perhaps we should add more details about data type sizes in question.
>
> Will add.

Note however that we need to solve the problem of storing values which
are not monotonic wrt. parent relation (partial order) in limited disk
space, that is GENERATION_NUMBER_V2_OFFSET_MAX vs GENERATION_NUMBER_MAX;
see comments in 11/11 and 00/11.

>>
>> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
>> in most cases is 64 bits (uintmax_t).  However corrected commit date
>> offsets can be safely stored^* using only 32 bits.  This halves the size
>> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
>> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
>> header, fanout table (OIDF) and extra edges list (EDGE).
>>
>> Which might mean that the extra complication is not worth it, and we
>> should store corrected commit date directly instead.
>>
>> *) unless for example one of commits is malformed but valid,
>>    and has committerdate of 0 Unix time, 1 January 1970.

See above.

>>> While Git does not write out offsets at this stage, Git stores the
>>> corrected commit dates in member generation of struct commit_graph_data.
>>> It will begin writing commit date offsets with the introduction of
>>> generation data chunk.
>>
>> OK, so the agenda for introducing geeration number v2 is as follows:
>> - compute generation numbers v2, i.e. corrected commit date
>> - store corrected commit date [offsets] in new GDAT chunk,
>>   unless backward-compatibility concerns require us to not to
>> - load [and compute] corrected commit date from commit-graph
>>   storing it as 'generation' field of `struct commit_graph_data`,
>>   unless backward-compatibility concerns require us to store
>>   topological levels (generation number v1) in there instead
>>
>
> The last point is not correct. We always store topological levels into
> the topo_levels slab introduced and always store corrected commit date
> into data->generation, regardless of backward compatibility concerns.

I think I was not clear enough (in trying to be brief).  I meant here
loading available generation numbers for use in graph traversal,
done in later patches in this series.

In _next_ commit we store topological levels in `generation` field:

  @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
   	date_low = get_be32(commit_data + g->hash_len + 12);
   	item->date = (timestamp_t)((date_high << 32) | date_low);

  -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
  +	if (g->chunk_generation_data)
  +		graph_data->generation = item->date +
  +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
  +	else
  +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;


We use topo_level slab only when writing the commit-graph file.

> We could avoid initializing topo_slab if we are not writing generation
> data chunk (and thus don't need corrected commit dates) but that
> wouldn't have an impact on run time while writing commit-graph because
> computing corrected commit dates is cheap as the main cost is in walking
> the graph and writing the file.

Right.

Though you need to add the cost of allocation and managing extra
commit slab, I think that amortized cost is negligible.

But what would be better is showing benchmark data: does writing the
commit graph without GDAT take not insigificant more time than without
this patch?

[...]
>>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  	for (i = 0; i < g->num_commits; i++) {
>>>  		struct commit *graph_commit, *odb_commit;
>>>  		struct commit_list *graph_parents, *odb_parents;
>>> -		timestamp_t max_generation = 0;
>>> -		timestamp_t generation;
>>> +		timestamp_t max_corrected_commit_date = 0;
>>> +		timestamp_t corrected_commit_date;
>>
>> This is simple, and perhaps unnecessary, rename of variables.
>> Shouldn't we however verify *both* topological level, and
>> (if exists) corrected commit date?
>
> The problem with verifying both topological level and corrected commit
> dates is that we would have to re-fill commit_graph_data slab with commit
> data chunk as we cannot modify data->generation otherwise, essentially
> repeating the whole verification process.
>
> While it's okay for now, I might take this up in a future series [1].
>
> [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

All right, I believe you that verifying both topological level and
corrected commit date would be more difficult.

That doesn't change the conclusion that this variable should remain to
be named `generation`, as when verifying GDAT-less commit-graph files it
would check topological levels (it uses commit_graph_generation(), which
in turn uses `generation` field in commit graph info, which as I have
show above in later patch could be v1 or v2 generation number).

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-21  4:11         ` Abhishek Kumar
@ 2020-08-25 11:11           ` Jakub Narębski
  2020-09-01 11:35             ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-25 11:11 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number (within CDAT chunk) and has undefined behavior.
>>>
>>> The test used to pass
>>
>> Could you please tell us how does this test starts to fail without the
>> change to the test described there?  What is the error message, etc.?
>>
>
> Here's what I revised the commit message to:
>
> commit-graph: consolidate fill_commit_graph_info
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().

All right.

We might want to add here the information that we also move loading the
commit date from the commit-graph file from fill_commit_in_graph() down
the [new] call chain into fill_commit_graph_info().  The commit date
would be needed in fill_commit_graph_info() in the next commit to
compute corrected commit date out of corrected commit date offset, and
store it as generation number.


NOTE that this means that if we switch to storing 64-bit corrected
commit date directly in the commit-graph file, instead of storing 32-bit
offsets, neither this Move Statement Into Function Out of Caller
refactoring nor change to the 'generate tar with future mtime' test
would be necessary.

>
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
> 34-bits for storing commiter date, thus committer time overflows into
> generation number (within CDAT chunk) and has undefined behavior.
>
> The test used to pass as fill_commit_graph_info() would not set struct
> member `date` of struct commit and loads committer date from the object
> database, generating a tar file with the expected mtime.

I guess that in the case of generating a tar file we would read the
commit out of 'object database', and then only add commit-graph specific
info with fill_commit_graph_info().  Possibly because we need more
information that commit-graph provides for a commit.

>
> However, with corrected commit date, we will load the committer date
> from CDAT chunk (truncated to lower 34-bits) to populate the generation
> number. Thus, fill_commit_graph_info() sets date and generates tar file
> with the truncated mtime and the test fails.
>
> Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
> will not be truncated.

Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH
was used.

The commit that introduced the 'generate tar with future mtime' test,
namely e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016), says:

	The ustar format only has room for 11 (or 12, depending on
	some implementations) octal digits for the size and mtime of
	each file. For values larger than this, we have to add pax
	extended headers to specify the real data, and git does not
	yet know how to do so.

	Before fixing that, let's start off with some test
	infrastructure [...]

The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12.
So we need the value of (2 ^ 36 + 1) for this test do do its job.
Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough
(if we skip testing "some implementations").

So I think to make this test more clear (for inquisitive minds) we
should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds
since EPOCH.  Maybe even add a variant of this test that uses the
origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off
use of serialized commit-graph.

I'm sorry for not checking this earlier.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-25  5:04         ` Abhishek Kumar
@ 2020-08-25 12:18           ` Jakub Narębski
  2020-09-01 12:06             ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-08-25 12:18 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: 85d03km98l.fsf, git, stolee

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Aug 21, 2020 at 03:14:34PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> In a preparatory step, let's return timestamp_t values from
>>> commit_graph_generation(), use timestamp_t for local variables
>>
>> All right, this is all good.
>>
>>> and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
>>
>> This needs more detailed examination.  There are two similar constants,
>> GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
>> used for newest commits outside the commit-graph, while the latter is
>> maximum number that commits in the commit-graph can have (because of the
>> storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
>> to be larger than GENERATION_NUMBER_MAX, and it is (and was).
>>
>> The GENERATION_NUMBER_INFINITY is because of the above requirement
>> traditionally taken as maximum value that can be represented in the data
>> type used to store commit's generation number _in memory_, but it can be
>> less.  For timestamp_t the maximum value that can be represented
>> is (2 ^ 63 - 1).
>>
>> All right then.
>
> Related to this, by the end of this series we are using
> GENERATION_NUMBER_MAX in just one place - compute_generation_numbers()
> to make sure the topological levels fit within 30 bits.
>
> Would it be more appropriate to rename GENERATION_NUMBER_MAX to
> GENERATION_NUMBER_V1_MAX (along the lines of
> GENERATION_NUMBER_V2_OFFSET_MAX)  to correctly describe that is a
> limit on topological levels, rather than generation number value?

Yes, I think that at the end of this patch series we should be using
GENERATION_NUMBER_V1_MAX and GENERATION_NUMBER_V2_OFFSET_MAX to describe
storage limits, and GENERATION_NUMBER_INFINITY (the latter as generation
number value for commits not in graph).

We need to ensure that both GENERATION_NUMBER_V1_MAX and
GENERATION_NUMBER_V2_OFFSET_MAX are smaller than
GENERATION_NUMBER_INFINITY.


However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
difficult.  As far as I can see, we can choose one of the *three*
solutions (the third one is _new_):

a. store 64-bit corrected commit date in the GDAT chunk
   all possible values are able to be stored, no need for
   GENERATION_NUMBER_V2_MAX,

b. store 32-bit corrected commit date offset in the GDAT chunk,
   if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
   do not write GDAT chunk at all (like for backward compatibility
   with mixed-version chains of split commit-graph layers),

c. store 32-bit corrected commit date offset in the GDAT chunk,
   using some kind of overflow handling scheme; for example if
   the most significant bit of 32-bit value is 1, then the
   rest 31-bits are position in GDOV chunk, which uses 64-bit
   to store those corrected commit date offsets that do not
   fit in 32 bits.

This type of schema is used in other places in Git code, if I remember
it correctly.

>> The commit message says nothing about the new symbolic constant
>> GENERATION_NUMBER_V1_INFINITY, though.
>>
>> I'm not sure it is even needed (see comments below).
>
> Yes, you are correct. I tried it out with your suggestions and it wasn't
> really needed.
>
> Thanks for catching this!

Mistakes can happen when changig how the series is split into commits.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-22 17:14       ` Jakub Narębski
@ 2020-08-26  7:15         ` Abhishek Kumar
  2020-08-26 10:38           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-26  7:15 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
> Hi Abhishek,
> 
> ... 
> 
> However the commit message do not say anything about the *writing* side.
> 

Revised the commit message to include the following at the end:

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite does not
have GDAT chunks either.

> 
> ...
> 
> To be more detailed, without '--split=replace' we would want the following
> layer merging behavior:
> 
>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>            1              2           3             4            5
> 
> In the split commit-graph chain above, merging two topmost layers
> (layers 4 and 5) should create a layer without GDAT; merging three
> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
> 4) should create a new layer with GDAT.
> 
>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>            1              2           3               merged
> 
>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>            1              2                    merged
> 
> I hope those ASCII-art pictures help understanding it
> 

Thanks! There were helpful.

While we work as expected in the first scenario i.e merging 4 and 5, we
would *still* write a layer without GDAT in the second scenario.

I have tweaked split_graph_merge_strategy() to fix this:

----------------------------------------------

diff --git a/commit-graph.c b/commit-graph.c
index 6d54d9a286..246fad030d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)

----------------------------------------------------

That is, if we were not writing generation data (because of mixed
generation concerns) but the new topmost layer has a generation data
chunk, we have merged all layers without GDAT chunk and can now write a
GDAT chunk safely.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> Wouldn't breaking the reachability condition promise make some Git
> commands to return *incorrect* results if they short-circuit, stop
> walking if generation number shows that A cannot reach B?
> 
> I am talking here about commands that return boolean, or select subset
> from given set of revisions:
> - git merge-base --is-ancestor <B> <A>
> - git branch branch-A <A> && git branch --contains <B>
> - git branch branch-B <B> && git branch --merged <A>
> 
> Git assumes that generation numbers fulfill the following condition:
> 
>   if A can reach B, then gen(A) > gen(B)
> 
> Notably this includes commits not in commit-graph, and clamped values.
> 
> However, in the following case
> 
> * if commit A is from higher layer without GDAT
>   and uses topological levels for 'generation', e.g. 115 (in a small repo)
> * and commit B is from lower layer with GDAT
>   and uses corrected commit date as 'generation', for example 1598112896,
> 
> it may happen that A (later commit) can reach B (earlier commit), but
> gen(B) > gen(A).  The reachability condition promise for generation
> numbers is broken.
> 
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> 
> I have reordered files in the patch itself to make it easier to review
> the proposed changes.
> 
> >  commit-graph.h                |  1 +
> >  commit-graph.c                | 32 +++++++++++++++-
> >  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/commit-graph.h b/commit-graph.h
> > index f78c892fc0..3cf89d895d 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -63,6 +63,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >
> >  	uint32_t num_commits_in_base;
> > +	uint32_t read_generation_data;
> >  	struct commit_graph *base_graph;
> >
> 
> First, why `read_generation_data` is of uint32_t type, when it stores
> (as far as I understand it), a "boolean" value of either 0 or 1?
> 

Yes, using unsigned int instead of uint32_t (although in most of cases
it would be same).  If commit_graph had other flags as well, we could
have used a bit field.

> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
> that interfere with the case of rewriting, where we want to use existing
> GDAT data when writing new commit-graph with GDAT chunk?

It interferes with rewriting the split commit-graph, as you might have
guessed from the above code snippet.

> 
> ...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> 
> >  		graph_data->generation = item->date +
> >  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> >  	else
> > @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
> >  	uint32_t pos;
> >  	if (!prepare_commit_graph(r))
> >  		return;
> > +
> >  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> >  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> >  }
> 
> This is unrelated whitespace fix, a "while at it" in neighbourhood of
> changes.  All right then.
> 

Reverted this change, as it's unimportant.

> > @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
> ...
> 
> It would be nice to have an example with merging layers (whether we
> would handle it in strict or relaxed way).
> 

Sure, will add.

> > +
> >  test_done
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-26  7:15         ` Abhishek Kumar
@ 2020-08-26 10:38           ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-26 10:38 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Hi Abhishek,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
>> Hi Abhishek,
>>
>> ...
>>
>> However the commit message do not say anything about the *writing* side.
>>
>
> Revised the commit message to include the following at the end:
>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>

All right.

> Rewriting layers follows similar approach: if the topmost layer below
> set of layers being rewritten (in the split commit-graph chain) exists,
> and it does not contain GDAT chunk, then the result of rewrite does not
> have GDAT chunks either.

All right.

I see that you went with proposed more complex (but better) solution...

>>
>> ...
>>
>> To be more detailed, without '--split=replace' we would want the following
>> layer merging behavior:
>>
>>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>>            1              2           3             4            5
>>
>> In the split commit-graph chain above, merging two topmost layers
>> (layers 4 and 5) should create a layer without GDAT; merging three
>> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
>> 4) should create a new layer with GDAT.

A simpler solution would be to create a new merged layer without GDAT if
any of the layers being merged do not have GDAT.

In this solution merging 3+4+5, 3+4, and even 2+3 would result with
layer without GDAT, and only merging 1+2 would result in layer with GDAT.

>>
>>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>>            1              2           3               merged
>>
>>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>>            1              2                    merged
>>
>> I hope those ASCII-art pictures help understanding it
>>
>
> Thanks! There were helpful.
>
> While we work as expected in the first scenario i.e merging 4 and 5, we
> would *still* write a layer without GDAT in the second scenario.
>
> I have tweaked split_graph_merge_strategy() to fix this:
>
> ----------------------------------------------
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 6d54d9a286..246fad030d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +
>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)

...which turned out to be not that complicated.  Nice work!

Though this needs tests that if fulfills the stated condition (because I
am not sure if it is entirely correct: we are not checking the layer
below current one, isn't it?... ah, you explain it below).

One possible solution would be to grep `test-tool read-graph` output for
"^chunks: ", then pass it through `uniq` (without `sort`!), check that
the number of lines is less or equal 2, and if there are two lines then
check that we get the following contents:

  chunks: oid_fanout oid_lookup commit_metadata generation_data
  chunks: oid_fanout oid_lookup commit_metadata

(assuming that information about layers is added in top-down order).

This test must be run with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0, which
I think is the default.

> ----------------------------------------------------
>
> That is, if we were not writing generation data (because of mixed
> generation concerns) but the new topmost layer has a generation data
> chunk, we have merged all layers without GDAT chunk and can now write a
> GDAT chunk safely.

All right.

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index f78c892fc0..3cf89d895d 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -63,6 +63,7 @@ struct commit_graph {
>>>  	struct object_directory *odb;
>>>
>>>  	uint32_t num_commits_in_base;
>>> +	uint32_t read_generation_data;
>>>  	struct commit_graph *base_graph;
>>>
>>
>> First, why `read_generation_data` is of uint32_t type, when it stores
>> (as far as I understand it), a "boolean" value of either 0 or 1?
>
> Yes, using unsigned int instead of uint32_t (although in most of cases
> it would be same).  If commit_graph had other flags as well, we could
> have used a bit field.

OK.

>> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
>> that interfere with the case of rewriting, where we want to use existing
>> GDAT data when writing new commit-graph with GDAT chunk?
>
> It interferes with rewriting the split commit-graph, as you might have
> guessed from the above code snippet.

All right.

[...]
>>> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>>>  	uint32_t pos;
>>>  	if (!prepare_commit_graph(r))
>>>  		return;
>>> +
>>>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>>>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>>>  }
>>
>> This is unrelated whitespace fix, a "while at it" in neighbourhood of
>> changes.  All right then.
>>
>
> Reverted this change, as it's unimportant.

Actually I am not against fixing the whitespace in the neighbourhood of
changes, so you can keep it or revert it (discard).

>>> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>>
>> ...
>>
>> It would be nice to have an example with merging layers (whether we
>> would handle it in strict or relaxed way).
>>
>
> Sure, will add.

Thanks.


Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-22 22:20       ` Jakub Narębski
@ 2020-08-27  6:39         ` Abhishek Kumar
  2020-08-27 12:43           ` Jakub Narębski
  2020-08-27 13:15           ` Derrick Stolee
  0 siblings, 2 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-08-27  6:39 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> All right.
> 
> > ---
> >  .../technical/commit-graph-format.txt         | 12 ++---
> >  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
> >  2 files changed, 36 insertions(+), 21 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..71c43884ec 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> 
> All right, that was duplicated information.  Now that we need to talk
> about two of them, it would not make sense to duplicate that.
> 
> >
> >  - The root tree OID.
> >
> > @@ -88,6 +84,12 @@ CHUNK DATA:
> 
> Shouldn't we also replace 'generation number' occurences in description
> of the Commit Data (CDAT) chunk with either 'topological level' or
> 'generation number v1'?

Yes, we should.

> 
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> 
> It is not exactly 'optional', as it implies that we need to turn it on
> (or that we can turn it off).  It is more 'conditional', as it can be
> not present due to outside influences (mixed-version environment).
> 
> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> 
> I have just realized purely theoretical, but possible, problem with
> storing non-monotinic generation number related values like corrected
> commit date offset in constrained space.  There are problems with
> clamping them.
> 
> Say that somewhere in the ancestry chain there is a commit A with commit
> date far in the future by mistake, for example 2120-08-22; it is
> important for that date to be not able to be represented using uint32_t.
> Say that a later descendant commit B is malformed, and has committer
> date of 0, that is 1970-01-01. This means that the corrected commit date
> for B must be larger than 2120-08-22 - which for this commit means that
> corrected commit date offset do not fit in 32 bits, and must be clamped
> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
> 
> Say that we have commit C that is child of B, and it has correct commit
> date.  Because of mistake in commit A, it has corrected commit date of
> more than 2120-08-22 (corrected commit date degenerated into topological
> level plus constant).
> 
> Now C can reach B, and B can reach A.  However, if we recover corrected
> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
> we get a number that is smaller than correct corrected commit date.  We
> will have
> 
>    gen(A) > date(B) + offset(B) < gen(C)
> 
> Which breaks reachability condition guarantee.
> 
> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
> we would get
> 
>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
> 
> And again reachability condition is broken.
> 
> This is a very contrived but possible example.  This shouldn't happen,
> but ufortunately it can happen.
> 

Yes, that's very unfortunate. 

Here's a much simpler example:

A commit P has an reasonable commit date (i.e. after release of Git to
present) D and has a child commit C with committer date 0. Now, the 
corrected commiter date of C would D + 1 and the offset would be same too,
as the committer date is zero. This overflows as reasonable dates are of
the order 2 ^ 34.

> 
> The question is how to deal with this issue.  Ignore it as unlikely?
> Switch to storing corrected commit date, which is monotonic, so if there
> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
> larger commit-graph file?
> 

To be honest, I would prefer storing corrected committer dates over
storing offsets.

While it is 7% of the size of commit-graph file, it is also *only* around
~3.5 MB for a repository of the size of linux kernel (and IIRC
correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

Minimizing space and memory requirements are a top priority, but
shouldn't making sure our program is correct and efficient to be a
greater priority?

I would love to hear your and Dr. Stolee's opinions on this.

> > +    * This list can be later modified to store future generation number related
> > +      data.
> 
> How can it be later modified?  There is no header, no version number.
> How would we add another generation number data?
> 

We could modify the graph version in future. Here's how I think it would
work:

Graph Version 1, No GDAT -> Topological level
Graph Version 2, GDAT    -> Corrected committer dates
Graph Version 3, GDAT    -> Generation number v3

and so on.

Of course, we do not have to update generation number definition for
each graph version.

However, my statement could still be wrong for things that we do not
foresee (similar to how we missed the hard die on different graph version),
so I am removing the statement.

> > +
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index 808fa30b99..f27145328c 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
> >
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates
> > +2. Topological levels
> 
> Should we add versioning info, that is:
> 
>   +1. Corrected committer dates  (generation number v2)
>   +2. Topological levels  (generation number v1)
> 

Yes, added.

> > +
> > +Define "corrected committer date" of a commit recursively as follows:
> > +
> > +  * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> > +
> > +  * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > +Define the "topological level" of a commit recursively as follows:
> >
> >   * A commit with no parents (a root commit) has generation number one.
> 
> Shouldn't this be
> 
>     * A commit with no parents (a root commit) has topological level of one.
> 

Thanks, fixed!

> >
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> >
> > -Equivalently, the generation number of a commit A is one more than the
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> 
> We should probably explicitly state that the property state applies to
> both versions of generation number, not only to topological level.
> 
> >
> > @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >
> > -This heuristic is currently used whenever the computation is allowed to
> > -violate topological relationships due to clock skew (such as "git log"
> > -with default order), but is not used when the topological order is
> > -required (such as merge base calculations, "git log --graph").
> > -
> 
> To be overly pedantic, this heuristic is still used, but now in much
> more rare case.  In addition to what is stated above, at least one layer
> in the split commit-graph chain must have been generated by "Old" Git,
> for the date heuristic to be used.
> 
> But that might be unnecessary level of detail.
> 
> >  In practice, we expect some commits to be created recently and not stored
> >  in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> 
> All right.
> 
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > -generation numbers are computed to be at least this value. We limit at
> > -this value since it is the largest value that can be stored in the
> > -commit-graph file using the 30 bits available to generation numbers. This
> > -presents another case where a commit can have generation number equal to
> > -that of a parent.
> > +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> > +are computed to be at least this value. We limit at this value since it is
> > +the largest value that can be stored in the commit-graph file using the
> > +available to generation numbers. This presents another case where a
> > +commit can have generation number equal to that of a parent.
> 
> All right, though it could have been done without re-wrapping, so that
> only first line would be marked as changed.
> 
> As I wrote, there is theoretical problem with this for offsets.
> 
> >
> >  Design Details
> >  --------------
> > @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >
> > +We also merge commit-graph chains when we try to write a commit graph with
> > +two different generation number definitions as they cannot be compared directly.
> > +We overwrite the existing chain and create a commit-graph with the newer or more
> > +efficient defintion. For example, overwriting topological levels commit graph
> > +chain to create a corrected commit dates commit graph chain.
> > +
> 
> This is more complicated than that.
> 
> I think we should explicitly state that Git ensures that in split
> commit-graph chain, if there are layers without the GDAT chunk (that
> force Git to use topological levels for generation numbers), then they
> are top layers.  So if there is commit-graph file created by "Old" Git,
> then when addig new layer it would also be GDAT-less.
> 
> Now how to write this...

Thinking about this, I feel creating a new section called "Handling
Mixed Generation Number Chains" made more sense:

  ## Handling Mixed Generation Number Chains

  With the introduction of generation number v2 and generation data chunk,
  the following scenario is possible:

  1. "New" Git writes a commit-graph with a GDAT chunk.
  2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

  The commits in the lower layer will be interpreted as having very large
  generation values (commit date plus offset) compared to the generation
  numbers in the top layer (toplogical level). This violates the
  expectation that the generation of a parent is strictly smaller than the
  generation of a child. In such cases, we revert to using topological
  levels for all layers to maintain backwards compatability.

  When writing a new layer in split commit-graph, we write a GDAT chunk
  only if the topmost layer has a GDAT chunk. This guarantees that if a
  lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

  Rewriting layers follows similar approach: if the topmost layer below
  set of layers being rewriteen (in the split commit-graph chain) exists,
  and it does not contain GDAT chunk, then the result of rewrite does not
  have GDAT chunks either.

> 
> >  ## Deleting graph-{hash} files
> >
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27  6:39         ` Abhishek Kumar
@ 2020-08-27 12:43           ` Jakub Narębski
  2020-08-27 13:15           ` Derrick Stolee
  1 sibling, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-08-27 12:43 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski, Junio C Hamano

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]

>>> +    * This list of 4-byte values store corrected commit date offsets for the
>>> +      commits, arranged in the same order as commit data chunk.
>>
>> I have just realized purely theoretical, but possible, problem with
>> storing non-monotinic generation number related values like corrected
>> commit date offset in constrained space.  There are problems with
>> clamping them.
>>
>> Say that somewhere in the ancestry chain there is a commit A with commit
>> date far in the future by mistake, for example 2120-08-22; it is
>> important for that date to be not able to be represented using uint32_t.
>> Say that a later descendant commit B is malformed, and has committer
>> date of 0, that is 1970-01-01. This means that the corrected commit date
>> for B must be larger than 2120-08-22 - which for this commit means that
>> corrected commit date offset do not fit in 32 bits, and must be clamped
>> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
>>
>> Say that we have commit C that is child of B, and it has correct commit
>> date.  Because of mistake in commit A, it has corrected commit date of
>> more than 2120-08-22 (corrected commit date degenerated into topological
>> level plus constant).
>>
>> Now C can reach B, and B can reach A.  However, if we recover corrected
>> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
>> we get a number that is smaller than correct corrected commit date.  We
>> will have
>>
>>    gen(A) > date(B) + offset(B) < gen(C)
>>
>> Which breaks reachability condition guarantee.
>>
>> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
>> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
>> we would get
>>
>>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
>>
>> And again reachability condition is broken.
>>
>> This is a very contrived but possible example.  This shouldn't happen,
>> but ufortunately it can happen.
>>
>
> Yes, that's very unfortunate.
>
> Here's a much simpler example:
>
> A commit P has an reasonable commit date (i.e. after release of Git to
> present) D and has a child commit C with committer date 0. Now, the
> corrected commiter date of C would D + 1 and the offset would be same too,
> as the committer date is zero. This overflows as reasonable dates are of
> the order 2 ^ 34.

No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value,
so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07),
which is related to storing date as a _signed_ 32-bit integer

The current-ish Unix epoch time is 1598524281 - let's use it for value
of D.  Then the offset for commit C would be 1598524282.  The current
proposal uses 32 bits to store commit date offsets (as unsigned value).
The maximum value of offset that we can store is therefore 2^32 - 1,
which is 4294967295.

   corrected commit date offset(C) = 1,598,524,282
   GENERATION_NUMBER_V2_MAX        = 4,294,967,295

As you can see there is no overflow in the simplified example.

>>
>> The question is how to deal with this issue.  Ignore it as unlikely?
>> Switch to storing corrected commit date, which is monotonic, so if there
>> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
>> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
>> larger commit-graph file?
>>
>
> To be honest, I would prefer storing corrected committer dates over
> storing offsets.
>
> While it is 7% of the size of commit-graph file, it is also *only* around
> ~3.5 MB for a repository of the size of linux kernel (and IIRC
> correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

It is up to 7% of per-commit data, and it doesn't take into account EDGE
chunk (for octopus merges), and it doesn't also take into account the
size of changed-paths Bloom filters data take in the commit-graph.

> Minimizing space and memory requirements are a top priority, but
> shouldn't making sure our program is correct and efficient to be a
> greater priority?

On the other hand the case where we would encounter offsets that do not
fit in uint32_t is extremply unlikely in sane repositories.

I can think of three solutions:

1. use 64-bit corrected commit dates
   - advantages:
     * simplest code,
     * no need for overflow handling, as we can store all possible values
       of timestamp_t
   - disadvantages:
     * commit-graph size increased by up to 7%

2. use 32-bit corrected commit date offsets,
   but simply do not store GDAT chunk if there is offset that would not
   fit in 32-bit wide field
   - advantages:
     * commit-graph is smaller
     * relatively simple overflow handling
   - disadvantages:
     * performance penalty (generation number v1 vs v2) for abnormal
       repositories (with overflow not fitting in uint32_t)
     * tests would be needed to exercise the overflow code

3. use 32-bit for corrected commit date offset,
   with oveflow handling, for example using most significant bit
   to denote that other bits store position into offset overflow
   with 64-bits for those offsets that do not fit in 31-bits
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
   - disadvantages:
     * most complex code of all proposed solutions
     * smaller overflow limit of 2^31 - 1
     * tests would be needed to exercise the overflow code

I think because the situation where we encounter overflow in 32-bit
corrected commit date offset is rare, we should go with either 1 or 2
solution.

> I would love to hear your and Dr. Stolee's opinions on this.

I have CC-ed Junio C Hamano to ask for his opinion.

>>> +    * This list can be later modified to store future generation number related
>>> +      data.
>>
>> How can it be later modified?  There is no header, no version number.
>> How would we add another generation number data?
>>
>
> We could modify the graph version in future. Here's how I think it would
> work:
>
> Graph Version 1, No GDAT -> Topological level
> Graph Version 2, GDAT    -> Corrected committer dates
> Graph Version 3, GDAT    -> Generation number v3
>
> and so on.
>
> Of course, we do not have to update generation number definition for
> each graph version.

So it was about generic mechanism, not something specific to the GDAT chunk.

> However, my statement could still be wrong for things that we do not
> foresee (similar to how we missed the hard die on different graph version),
> so I am removing the statement.

Good.

[...]
>>> +We also merge commit-graph chains when we try to write a commit graph with
>>> +two different generation number definitions as they cannot be compared directly.
>>> +We overwrite the existing chain and create a commit-graph with the newer or more
>>> +efficient defintion. For example, overwriting topological levels commit graph
>>> +chain to create a corrected commit dates commit graph chain.
>>> +
>>
>> This is more complicated than that.
>>
>> I think we should explicitly state that Git ensures that in split
>> commit-graph chain, if there are layers without the GDAT chunk (that
>> force Git to use topological levels for generation numbers), then they
>> are top layers.  So if there is commit-graph file created by "Old" Git,
>> then when addig new layer it would also be GDAT-less.
>>
>> Now how to write this...
>
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
>
>   ## Handling Mixed Generation Number Chains
>
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
>
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.
>
>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

Good idea, and nice writeup.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27  6:39         ` Abhishek Kumar
  2020-08-27 12:43           ` Jakub Narębski
@ 2020-08-27 13:15           ` Derrick Stolee
  2020-09-01 13:01             ` Abhishek Kumar
  1 sibling, 1 reply; 107+ messages in thread
From: Derrick Stolee @ 2020-08-27 13:15 UTC (permalink / raw)
  To: 85y2m6fhkm.fsf, Jakub Narębski
  Cc: abhishekkumar8222, git, gitgitgadget, me

On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
> 
>   ## Handling Mixed Generation Number Chains
> 
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
> 
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

I like the idea of this section, and this setup is good.

>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.

s/toplogical/topological

But also, we don't want to phrase this as "in this case, we do the wrong
thing" but instead

  A naive approach of using the newest available generation number from
  each layer would lead to violated expectations: the lower layer would
  use corrected commit dates which are much larger than the topological
  levels of the higher layer. For this reason, Git inspects each layer
  to see if any layer is missing corrected commit dates. In such a case,
  Git only uses topological levels.

>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

s/lyer/layer

Perhaps leaving this at a higher level than referencing "GDAT chunk" is
advisable. Perhaps use "we write corrected commit dates" or "all lower
layers must store corrected commit dates as well", for example.

>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

This could use more positive language to make it clear that sometimes
we _do_ want to write corrected commit dates when merging layers:

  When merging layers, we do not consider whether the merged layers had
  corrected commit dates. Instead, the new layer will have corrected
  commit dates if and only if all existing layers below the new layer
  have corrected commit dates.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-22 19:09       ` Jakub Narębski
@ 2020-09-01 10:08         ` Abhishek Kumar
  2020-09-03 19:11           ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 10:08 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Sat, Aug 22, 2020 at 09:09:21PM +0200, Jakub Narębski wrote:
> Hello Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With corrected commit dates implemented, we no longer have to rely on
> > commit date as a heuristic in paint_down_to_common().
> 
> All right, but it would be nice to have some benchmark data: what were
> performance when using topological levels, what was performance when
> using commit date heuristics (before this patch), what is performace now
> when using corrected commit date.
> 
> >
> > t6024-recursive-merge setups a unique repository where all commits have
> > the same committer date without well-defined merge-base. As this has
> > already caused problems (as noted in 859fdc0 (commit-graph: define
> > GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
> > test script.
> 
> OK?

In hindsight, that is a terrible explanation. Here's what I have revised
this to:

  With corrected commit dates implemented, we no longer have to rely on
  commit date as a heuristic in paint_down_to_common().

  While using corrected commit dates Git walks nearly the same number of
  commits as commit date, the process is slower as for each comparision we
  have to access the commit-slab (for corrected committer date) instead of
  accessing struct member (for committer date).

  For example, the command `git merge-base v4.8 v4.9` on the linux
  repository walks 167468 commits, taking 0.135s for committer date and
  167496 commits, taking 0.157s for corrected committer date respectively.

  t6404-recursive-merge setups a unique repository where all commits have
  the same committer date without well-defined merge-base. As this has
  already caused problems (as noted in 859fdc0 (commit-graph: define
  GIT_TEST_COMMIT_GRAPH, 2018-08-29)).

  While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
  date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
  conflicts' merges commits in the order:
  - Merge C with B to form a intermediate commit.
  - Merge the intermediate commit with A.

  With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
  use the corrected committer date, which changes the order in which
  commits are merged:
  - Merge A with B to form a intermediate commit.
  - Merge the intermediate commit with C.

  While resulting repositories are equivalent, 6404.4 'virtual trees were
  processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
  different merge-bases and thus have different object ids for the
  intermediate commits.

  As this has already causes problems (as noted in 859fdc0 (commit-graph:
  define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
  within t6404-recursive-merge.
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c             | 14 ++++++++++++++
> >  commit-graph.h             |  6 ++++++
> >  commit-reach.c             |  2 +-
> >  t/t6024-recursive-merge.sh |  4 +++-
> >  4 files changed, 24 insertions(+), 2 deletions(-)
> >
> 
> I have reorderd files for easier review.
> 
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 3cf89d895d..e22ec1e626 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
> >   */
> >  int generation_numbers_enabled(struct repository *r);
> >
> > +/*
> > + * Return 1 if and only if the repository has a commit-graph
> > + * file and generation data chunk has been written for the file.
> > + */
> > +int corrected_commit_dates_enabled(struct repository *r);
> > +
> >  enum commit_graph_write_flags {
> >  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
> >  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
> 
> All right.
> 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index c1292f8e08..6411068411 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
> >  	return !!first_generation;
> >  }
> >
> > +int corrected_commit_dates_enabled(struct repository *r)
> > +{
> > +	struct commit_graph *g;
> > +	if (!prepare_commit_graph(r))
> > +		return 0;
> > +
> > +	g = r->objects->commit_graph;
> > +
> > +	if (!g->num_commits)
> > +		return 0;
> > +
> > +	return !!g->chunk_generation_data;
> > +}
> 
> The previous commit introduced validate_mixed_generation_chain(), which
> walked whole split commit-graph chain, and set `read_generation_data`
> field in `struct commit_graph` for all layers in the chain.
> 
> This function examines only the top layer, so it follows the assumption
> that Git would behave in such way that oly topmost layers in the chai
> can be GDAT-less.
> 
> Why the difference?  Couldn't validate_mixed_generation_chain() simply
> call corrected_commit_dates_enabled()?

The previous commit didn't need to walk the whole split commit-graph
chain. Because of how we are handling writing in a mixed generation data
chunk, if a layer has generation data chunk, all layers below it have a
generation data chunk as well.

So, there are two cases at hand:

- Topmost layer has generation data chunk, so we know all layers below
  it has generation data chunk and we can read values from it.
- Topmost layer does not have generation data chunk, so we know we can't
  read from generation data chunk.

Just checking the topmost layer suffices - modified the previous commit.

Then, this function is more or less the same as
`g->read_generation_data` that is, if we are reading from generation
data chunk, we are using corrected commit dates.

> 
> > +
> >  static void close_commit_graph_one(struct commit_graph *g)
> >  {
> >  	if (!g)
> > diff --git a/commit-reach.c b/commit-reach.c
> > index 470bc80139..3a1b925274 100644
> > --- a/commit-reach.c
> > +++ b/commit-reach.c
> > @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
> >  	int i;
> >  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
> >
> > -	if (!min_generation)
> 
> This check was added in 091f4cf (commit: don't use generation numbers if
> not needed, 2018-08-30) by Derrick Stolee, and its commit message
> includes benchmark results for running 'git merge-base v4.8 v4.9' in
> Linux kernel repository:
> 
>       v2.18.0: 0.122s    167,468 walked
>   v2.19.0-rc1: 0.547s    635,579 walked
>          HEAD: 0.127s
> 
> > +	if (!min_generation && !corrected_commit_dates_enabled(r))
> >  		queue.compare = compare_commits_by_commit_date;
> 
> It would be nice to have similar benchmark for this change... unless of
> course there is no change in performance, but I think then it needs to
> be stated explicitly.  I think.
> 

Mentioned in the commit message - we walk (nearly) the same number of
commits but take somewhat longer.

> >
> >  	one->object.flags |= PARENT1;
> > diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
> > index 332cfc53fd..d3def66e7d 100755
> > --- a/t/t6024-recursive-merge.sh
> > +++ b/t/t6024-recursive-merge.sh
> > @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
> >  export GIT_COMMITTER_DATE
> >
> >  test_expect_success 'setup tests' '
> > +	GIT_TEST_COMMIT_GRAPH=0 &&
> > +	export GIT_TEST_COMMIT_GRAPH &&
> >  	echo 1 >a1 &&
> >  	git add a1 &&
> >  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
> > @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
> >  '
> >
> >  test_expect_success 'combined merge conflicts' '
> > -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> > +	test_must_fail git merge -m final G
> >  '
> >
> >  test_expect_success 'result contains a conflict' '
> 
> OK, so instead of disabling commit-graph for this test, now we disable
> it for the whole script.
> 
> Maybe this change should be in a separate patch?

With the explanation in commit message, it's clear to see how using
corrected commit dates leads to an (incorrectly) failing test. Does it
still make sense to seperate them?

> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  7:56             ` Jakub Narębski
@ 2020-09-01 10:26               ` Abhishek Kumar
  2020-09-03  9:25                 ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 10:26 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 09:56:44AM +0200, Jakub Narębski wrote:
> On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
>
> ...
>
> >
> > All right, we might want to make use of the fact that the value of 0 for
> > topological level here always mean that its value for a commit needs to
> > be computed, that 0 is not a valid value for topological levels.
> > - if the value 0 came from commit-graph file, it means that it came
> >   from Git version that used commit-graph but didn't compute generation
> >   numbers; the value is GENERATION_NUMBER_ZERO
> > - the value 0 might came from the fact that commit is not in graph,
> >   and that commit-slab zero-initializes the values stored; let's
> >   call this value GENERATION_NUMBER_UNINITIALIZED
> >
> > If we ensure that corrected commit date can never be zero (which is
> > extremely unlikely, as one of root commits would have to be malformed or
> > written on badly misconfigured computer, with value of 0 for committer
> > timestamp), then this "happy accident" can keep working.
> >
> >   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
> >   has corrected commit date of one, to be able to distinguish
> >   uninitialized values.
> >
> > Or something like that.
> >
> > Actually, it is not even necessary, as corrected commit date of 0 just
> > means that this single value (well, for every root commit with commit
> > date of 0) would be unnecessary recomputed in compute_generation_numbers().
> >
> > Anyway, we would want to document this fact in the commit message.
> 
> Alternatively, instead of comparing 'level' (and later in series also
> 'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
> we could load at no extra cost `graph_pos` value and compare it
> against COMMIT_NOT_FROM_GRAPH.
> 
> But with this solution we could never get rid of graph_pos, if we
> think it is unnecessary. If we split commit_graph_data into separate
> slabs (as it was in early versions of respective patch series), we
> would have to pay additional cost.
> 
> But it is an alternative.
> 
> Best,
> -- 
> Jakub Narębski

I think updating a commit date with timestampt of zero to use corrected
commit date of one would leave us more options down the line.

Changing this is easy enough.

For a root commit with timestamp zero, current->date would be zero and 
max_corrected_commit_date would be zero as well. So we can set 
corrected commit date as `max_corrected_commit_date + 1`, instead of the
earlier `(current->date - 1) + 1`.

----

diff --git a/commit-graph.c b/commit-graph.c
index 7ed0a33ad6..e3c5e30405 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1389,7 +1389,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 
-				if (current->date > max_corrected_commit_date)
+				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-25 10:07           ` Jakub Narębski
@ 2020-09-01 11:01             ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 11:01 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 12:07:17PM +0200, Jakub Narębski wrote:
> Hello,
> 
> ...
> 
> I think I was not clear enough (in trying to be brief).  I meant here
> loading available generation numbers for use in graph traversal,
> done in later patches in this series.
> 
> In _next_ commit we store topological levels in `generation` field:
> 
>   @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>    	date_low = get_be32(commit_data + g->hash_len + 12);
>    	item->date = (timestamp_t)((date_high << 32) | date_low);
> 
>   -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>   +	if (g->chunk_generation_data)
>   +		graph_data->generation = item->date +
>   +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>   +	else
>   +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 
> 
> We use topo_level slab only when writing the commit-graph file.
> 

Right, I thought the agenda outlined points in the process of writing
commit-graph file.

>
> > We could avoid initializing topo_slab if we are not writing generation
> > data chunk (and thus don't need corrected commit dates) but that
> > wouldn't have an impact on run time while writing commit-graph because
> > computing corrected commit dates is cheap as the main cost is in walking
> > the graph and writing the file.
> 
> Right.
> 
> Though you need to add the cost of allocation and managing extra
> commit slab, I think that amortized cost is negligible.
> 
> But what would be better is showing benchmark data: does writing the
> commit graph without GDAT take not insigificant more time than without
> this patch?

Right, we could compare time taken by master and series until (but not
including this patcth) to write a commit-graph file. Will add.

> 
> [...]
> >>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >>>  	for (i = 0; i < g->num_commits; i++) {
> >>>  		struct commit *graph_commit, *odb_commit;
> >>>  		struct commit_list *graph_parents, *odb_parents;
> >>> -		timestamp_t max_generation = 0;
> >>> -		timestamp_t generation;
> >>> +		timestamp_t max_corrected_commit_date = 0;
> >>> +		timestamp_t corrected_commit_date;
> >>
> >> This is simple, and perhaps unnecessary, rename of variables.
> >> Shouldn't we however verify *both* topological level, and
> >> (if exists) corrected commit date?
> >
> > The problem with verifying both topological level and corrected commit
> > dates is that we would have to re-fill commit_graph_data slab with commit
> > data chunk as we cannot modify data->generation otherwise, essentially
> > repeating the whole verification process.
> >
> > While it's okay for now, I might take this up in a future series [1].
> >
> > [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/
> 
> All right, I believe you that verifying both topological level and
> corrected commit date would be more difficult.
> 
> That doesn't change the conclusion that this variable should remain to
> be named `generation`, as when verifying GDAT-less commit-graph files it
> would check topological levels (it uses commit_graph_generation(), which
> in turn uses `generation` field in commit graph info, which as I have
> show above in later patch could be v1 or v2 generation number).
> 

Right, I completely misunderstood you initially. Reverted the variable
name changes.

> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-25 11:11           ` Jakub Narębski
@ 2020-09-01 11:35             ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 11:35 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 01:11:11PM +0200, Jakub Narębski wrote:
> Hello,
> 
> ...
> 
> All right.
> 
> We might want to add here the information that we also move loading the
> commit date from the commit-graph file from fill_commit_in_graph() down
> the [new] call chain into fill_commit_graph_info().  The commit date
> would be needed in fill_commit_graph_info() in the next commit to
> compute corrected commit date out of corrected commit date offset, and
> store it as generation number.
> 
> 
> NOTE that this means that if we switch to storing 64-bit corrected
> commit date directly in the commit-graph file, instead of storing 32-bit
> offsets, neither this Move Statement Into Function Out of Caller
> refactoring nor change to the 'generate tar with future mtime' test
> would be necessary.
> 
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
> > 34-bits for storing commiter date, thus committer time overflows into
> > generation number (within CDAT chunk) and has undefined behavior.
> >
> > The test used to pass as fill_commit_graph_info() would not set struct
> > member `date` of struct commit and loads committer date from the object
> > database, generating a tar file with the expected mtime.
> 
> I guess that in the case of generating a tar file we would read the
> commit out of 'object database', and then only add commit-graph specific
> info with fill_commit_graph_info().  Possibly because we need more
> information that commit-graph provides for a commit.
> 
> >
> > However, with corrected commit date, we will load the committer date
> > from CDAT chunk (truncated to lower 34-bits) to populate the generation
> > number. Thus, fill_commit_graph_info() sets date and generates tar file
> > with the truncated mtime and the test fails.
> >
> > Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
> > will not be truncated.
> 
> Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH
> was used.
> 
> The commit that introduced the 'generate tar with future mtime' test,
> namely e51217e15 (t5000: test tar files that overflow ustar headers,
> 30-06-2016), says:
> 
> 	The ustar format only has room for 11 (or 12, depending on
> 	some implementations) octal digits for the size and mtime of
> 	each file. For values larger than this, we have to add pax
> 	extended headers to specify the real data, and git does not
> 	yet know how to do so.
> 
> 	Before fixing that, let's start off with some test
> 	infrastructure [...]
> 
> The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12.
> So we need the value of (2 ^ 36 + 1) for this test do do its job.
> Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough
> (if we skip testing "some implementations").
> 
> So I think to make this test more clear (for inquisitive minds) we
> should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds
> since EPOCH.  Maybe even add a variant of this test that uses the
> origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off
> use of serialized commit-graph.

That's pretty interesting! I didn't look into this either, will modify
the existing test and add a new test for it.

Thanks for investigating this further.

> 
> I'm sorry for not checking this earlier.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-25 12:18           ` Jakub Narębski
@ 2020-09-01 12:06             ` Abhishek Kumar
  2020-09-03 13:42               ` Jakub Narębski
  0 siblings, 1 reply; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 12:06 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee, me

On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> ...
> 
> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
> difficult.  As far as I can see, we can choose one of the *three*
> solutions (the third one is _new_):
> 
> a. store 64-bit corrected commit date in the GDAT chunk
>    all possible values are able to be stored, no need for
>    GENERATION_NUMBER_V2_MAX,
> 
> b. store 32-bit corrected commit date offset in the GDAT chunk,
>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>    do not write GDAT chunk at all (like for backward compatibility
>    with mixed-version chains of split commit-graph layers),
> 
> c. store 32-bit corrected commit date offset in the GDAT chunk,
>    using some kind of overflow handling scheme; for example if
>    the most significant bit of 32-bit value is 1, then the
>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>    to store those corrected commit date offsets that do not
>    fit in 32 bits.
> 

Alright, so the third solution leverages the fact that in practice,
very few offsets would overflow the 32-bit limit. Using 64-bits for all
offsets would be wasteful, we can trade off a miniscule amount of
computation to save large amounts of disk space.

>
> This type of schema is used in other places in Git code, if I remember
> it correctly.
> 

Yes, it's a similar idea to the extra edge list chunk, where the most
significant bit of second parent indicates whether they are more than
two parents.

It's definitely feasible, albeit a little complex.

What's the overall consensus on the third solution?

>
> >> The commit message says nothing about the new symbolic constant
> >> GENERATION_NUMBER_V1_INFINITY, though.
> >>
> >> I'm not sure it is even needed (see comments below).
> >
> > Yes, you are correct. I tried it out with your suggestions and it wasn't
> > really needed.
> >
> > Thanks for catching this!
> 
> Mistakes can happen when changig how the series is split into commits.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27 13:15           ` Derrick Stolee
@ 2020-09-01 13:01             ` Abhishek Kumar
  0 siblings, 0 replies; 107+ messages in thread
From: Abhishek Kumar @ 2020-09-01 13:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb

On Thu, Aug 27, 2020 at 09:15:56AM -0400, Derrick Stolee wrote:
> On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> > Thinking about this, I feel creating a new section called "Handling
> > Mixed Generation Number Chains" made more sense:
> > 
> >   ## Handling Mixed Generation Number Chains
> > 
> >   With the introduction of generation number v2 and generation data chunk,
> >   the following scenario is possible:
> > 
> >   1. "New" Git writes a commit-graph with a GDAT chunk.
> >   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> I like the idea of this section, and this setup is good.
> 
> >   The commits in the lower layer will be interpreted as having very large
> >   generation values (commit date plus offset) compared to the generation
> >   numbers in the top layer (toplogical level). This violates the
> >   expectation that the generation of a parent is strictly smaller than the
> >   generation of a child. In such cases, we revert to using topological
> >   levels for all layers to maintain backwards compatability.
> 
> s/toplogical/topological
> 
> But also, we don't want to phrase this as "in this case, we do the wrong
> thing" but instead
> 
>   A naive approach of using the newest available generation number from
>   each layer would lead to violated expectations: the lower layer would
>   use corrected commit dates which are much larger than the topological
>   levels of the higher layer. For this reason, Git inspects each layer
>   to see if any layer is missing corrected commit dates. In such a case,
>   Git only uses topological levels.
> 
> >   When writing a new layer in split commit-graph, we write a GDAT chunk
> >   only if the topmost layer has a GDAT chunk. This guarantees that if a
> >   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
> 
> s/lyer/layer
> 
> Perhaps leaving this at a higher level than referencing "GDAT chunk" is
> advisable. Perhaps use "we write corrected commit dates" or "all lower
> layers must store corrected commit dates as well", for example.
> 
> >   Rewriting layers follows similar approach: if the topmost layer below
> >   set of layers being rewriteen (in the split commit-graph chain) exists,
> >   and it does not contain GDAT chunk, then the result of rewrite does not
> >   have GDAT chunks either.
> 
> This could use more positive language to make it clear that sometimes
> we _do_ want to write corrected commit dates when merging layers:
> 
>   When merging layers, we do not consider whether the merged layers had
>   corrected commit dates. Instead, the new layer will have corrected
>   commit dates if and only if all existing layers below the new layer
>   have corrected commit dates.

Thanks, that is a great suggestion! Using positive language is more
straightforward and easier to understand.

> 
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-09-01 10:26               ` Abhishek Kumar
@ 2020-09-03  9:25                 ` Jakub Narębski
  0 siblings, 0 replies; 107+ messages in thread
From: Jakub Narębski @ 2020-09-03  9:25 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Aug 25, 2020 at 09:56:44AM +0200, Jakub Narębski wrote:
>> On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> ...
>>
>>>
>>> All right, we might want to make use of the fact that the value of 0 for
>>> topological level here always mean that its value for a commit needs to
>>> be computed, that 0 is not a valid value for topological levels.
>>> - if the value 0 came from commit-graph file, it means that it came
>>>   from Git version that used commit-graph but didn't compute generation
>>>   numbers; the value is GENERATION_NUMBER_ZERO
>>> - the value 0 might came from the fact that commit is not in graph,
>>>   and that commit-slab zero-initializes the values stored; let's
>>>   call this value GENERATION_NUMBER_UNINITIALIZED
>>>
>>> If we ensure that corrected commit date can never be zero (which is
>>> extremely unlikely, as one of root commits would have to be malformed or
>>> written on badly misconfigured computer, with value of 0 for committer
>>> timestamp), then this "happy accident" can keep working.
>>>
>>>   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
>>>   has corrected commit date of one, to be able to distinguish
>>>   uninitialized values.
>>>
>>> Or something like that.
>>>
>>> Actually, it is not even necessary, as corrected commit date of 0 just
>>> means that this single value (well, for every root commit with commit
>>> date of 0) would be unnecessary recomputed in compute_generation_numbers().
>>>
>>> Anyway, we would want to document this fact in the commit message.
>> 
>> Alternatively, instead of comparing 'level' (and later in series also
>> 'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
>> we could load at no extra cost `graph_pos` value and compare it
>> against COMMIT_NOT_FROM_GRAPH.
>> 
>> But with this solution we could never get rid of graph_pos, if we
>> think it is unnecessary. If we split commit_graph_data into separate
>> slabs (as it was in early versions of respective patch series), we
>> would have to pay additional cost.
>> 
>> But it is an alternative.
>
> I think updating a commit date with timestampt of zero to use corrected
> commit date of one would leave us more options down the line.
>
> Changing this is easy enough.
>
> For a root commit with timestamp zero, current->date would be zero and 
> max_corrected_commit_date would be zero as well. So we can set 
> corrected commit date as `max_corrected_commit_date + 1`, instead of the
> earlier `(current->date - 1) + 1`.
>
> ----
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7ed0a33ad6..e3c5e30405 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1389,7 +1389,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
>  
> -				if (current->date > max_corrected_commit_date)
> +				if (current->date && current->date > max_corrected_commit_date)
>  					max_corrected_commit_date = current->date - 1;
>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>  			}

It turned out to be much easier than I have expected: a one-line change,
adding simply a new condition.  Good work!

Perhaps it would be better to write it as current->date == GENERATION_NUMBER_UNINITIALIZED
(or *_ZERO, or *_NO_DATA,...), but current version is quite idiomatic
and easy to read.

With this change we should, of course, also change the commit-graph
format docs.


On the other hand it is a bit unnecessary.  If `generation` is zero,
using it would still work, and it would just mean that it would be
unnecessarily recomputed - but corrected commit date equal zero is
possible only for root commits.

But the above solution is more consistent, using 0 to mark not
initialized values...  it is cleaner, at the cost of one more corner
case, single line change, and possibly an insignificant amount of
performance penalty due to adding unlikely true branch to the
conditional.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-01 12:06             ` Abhishek Kumar
@ 2020-09-03 13:42               ` Jakub Narębski
  2020-09-05 17:21                 ` Abhishek Kumar
  0 siblings, 1 reply; 107+ messages in thread
From: Jakub Narębski @ 2020-09-03 13:42 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee,
	Taylor Blau, Jakub Narębski

Hi,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> 
>> ...
>> 
>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>> difficult.  As far as I can see, we can choose one of the *three*
>> solutions (the third one is _new_):
>> 
>> a. store 64-bit corrected commit date in the GDAT chunk
>>    all possible values are able to be stored, no need for
>>    GENERATION_NUMBER_V2_MAX,
>> 
>> b. store 32-bit corrected commit date offset in the GDAT chunk,
>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>    do not write GDAT chunk at all (like for backward compatibility
>>    with mixed-version chains of split commit-graph layers),
>> 
>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>    using some kind of overflow handling scheme; for example if
>>