[PATCH 0/6] [GSoC] Implement Corrected Commit Date

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/6] [GSoC] Implement Corrected Commit Date
@ 2020-07-28  9:13 Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Here's what left for the PR (which I intend to take on with the second
version of pull request):

 1. Add an option to skip writing generation data chunk (to test whether new
    Git works without GDAT as intended).
 2. Handle writing to commit-graph for mismatched version (that is, merging
    all graphs into a new graph with a GDAT chunk).
 3. Update technical documentation.

I look forward to everyone's reviews!

Thanks

 * Abhishek

----------------------------------------------------------------------------

The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
unrelated to my code changes. Still need to investigate further.

Abhishek Kumar (6):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date offset

 blame.c                       |   2 +-
 commit-graph.c                | 181 +++++++++++++++++++++-------------
 commit-graph.h                |   7 +-
 commit-reach.c                |  47 +++------
 commit-reach.h                |   2 +-
 commit.c                      |   9 +-
 commit.h                      |   3 +
 revision.c                    |  17 ++--
 t/helper/test-read-graph.c    |   2 +
 t/t4216-log-bloom.sh          |   4 +-
 t/t5000-tar-tree.sh           |   4 +-
 t/t5318-commit-graph.sh       |  21 ++--
 t/t5324-split-commit-graph.sh |  12 +--
 upload-pack.c                 |   2 +-
 14 files changed, 178 insertions(+), 135 deletions(-)

base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/676
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
  2020-08-04  0:46   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With 3d112755 (commit-graph: examine commits by generation number), Git
knew to sort by generation number before examining the diff when not
using pack order. c49c82aa (commit: move members graph_pos, generation
to a slab, 2020-06-17) moved generation number into a slab and
introduced a helper which returns GENERATION_NUMBER_INFINITY when
writing the graph. Sorting is no longer useful and essentially reverts
the earlier commit.

Let's fix this by accessing generation number directly through the slab.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1af68c297d..5d3c9bd23c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
+
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
  2020-08-05 23:16   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 6aa7f4f567..23287d26c3 100644
--- a/revision.c
+++ b/revision.c
@@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return ;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number and has undefined behavior. The test used to pass as
fill_commit_in_graph() did not read commit time from commit graph,
reading commit date from odb instead.

Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 31 ++++++++++++-------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5d3c9bd23c..204eb454b2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
+	fill_commit_graph_info(item, g, pos);
+
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
-
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:03   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 204eb454b2..1c98f38d69 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..98cc5a3b9d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,4 +145,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 7128895c3a..bed63b41fb 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:12   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

One of the essential pre-requisites before implementing generation
number as to distinguish between generation numbers v1 and v2 while
still being compatible with old Git.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. Newer versions of Git can parse GDAT and
take advantage of newer generation numbers, falling back to topological
levels when GDAT chunk is missing (as it would happen with a commit
graph written by old Git).

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 33 +++++++++++++++++++++++++++++----
 commit-graph.h                |  1 +
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 ++--
 t/t5318-commit-graph.sh       | 19 +++++++++++--------
 t/t5324-split-commit-graph.sh | 12 ++++++------
 6 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1c98f38d69..ab714f4a76 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 	}
 }
 
+static void write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	struct commit **list = ctx->commits.list;
+	int count;
+	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+	}
+}
+
 static void write_graph_chunk_extra_edges(struct hashfile *f,
 					  struct write_commit_graph_context *ctx)
 {
@@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
+	int num_chunks = 4;
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
@@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
 	if (ctx->num_extra_edges) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
@@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
 
-	num_chunks = 3;
+	num_chunks = 4;
 	if (ctx->num_extra_edges) {
 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
 						4 * ctx->num_extra_edges;
@@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_fanout(f, ctx);
 	write_graph_chunk_oids(f, hashsz, ctx);
 	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_generation_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
diff --git a/commit-graph.h b/commit-graph.h
index 98cc5a3b9d..e3d4ba96f4 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -67,6 +67,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c855bcd3e7..780855e691 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 26f332d6a3..3ec5248d70 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 
 graph_read_expect() {
 	OPTIONAL=""
-	NUM_CHUNKS=3
+	NUM_CHUNKS=4
 	if test ! -z $2
 	then
 		OPTIONAL=" $2"
-		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
 	fi
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
 GRAPH_CHUNK_LOOKUP_OFFSET=8
 GRAPH_CHUNK_LOOKUP_WIDTH=12
-GRAPH_CHUNK_LOOKUP_ROWS=5
+GRAPH_CHUNK_LOOKUP_ROWS=6
 GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
 GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
 			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
@@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
 GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
-GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
 GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
 GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
-GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
-			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
+				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_WIDTH=4
+GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
+GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
+			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
 GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
 GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
 
@@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
 '
 
 test_expect_success 'detect incorrect generation number' '
-	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 269d0964a3..096a96ec41 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_init &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -29,9 +29,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With preparations done, let's implement corrected commit date offset.
We add a new commit-slab to store topological levels while writing
commit graph and upgrade number of struct commit_graph_data to 64-bits.

We have to touch many files, upgrading generation number from uint32_t
to timestamp_t.

We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
which tests if verify can detect if a commit graph have
GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
one of dates is Unix epoch zero.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 blame.c                 |   2 +-
 commit-graph.c          | 109 ++++++++++++++++++++++------------------
 commit-graph.h          |   4 +-
 commit-reach.c          |  32 ++++++------
 commit-reach.h          |   2 +-
 commit.h                |   3 ++
 revision.c              |  14 +++---
 t/t5318-commit-graph.sh |   2 +-
 upload-pack.c           |   2 +-
 9 files changed, 93 insertions(+), 77 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..48aa632461 100644
--- a/blame.c
+++ b/blame.c
@@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
 	if (!bd)
 		return 1;
 
-	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
 		return 1;
 
 	filter = get_bloom_filter(r, origin->commit, 0);
diff --git a/commit-graph.c b/commit-graph.c
index ab714f4a76..9647d9f0df 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,6 +65,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -100,15 +102,15 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
 
 	if (!data)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 	else if (data->graph_pos == COMMIT_NOT_FROM_GRAPH)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 
 	return data->generation;
 }
@@ -116,8 +118,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -160,8 +162,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	timestamp_t generation_b = commit_graph_data_at(b)->generation;
 
 	/* lower generation commits first */
 	if (generation_a < generation_b)
@@ -169,11 +171,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -777,8 +774,13 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->chunk_generation_data)
-		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	{
+		/* Read corrected commit date offset from GDAT */
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	}
 	else
+		/* Read topological level from CDAT */
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -950,6 +952,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	struct topo_level_slab *topo_levels;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1102,7 +1105,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1117,8 +1120,13 @@ static void write_graph_chunk_generation_data(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		timestamp_t offset = commit_graph_data_at(*list)->generation - (*list)->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+
+		hashwrite_be32(f, offset);
 	}
 }
 
@@ -1316,7 +1324,7 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static void compute_generation_numbers(struct write_commit_graph_context *ctx)
+static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit_list *list = NULL;
@@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (topo_level != GENERATION_NUMBER_INFINITY &&
+		    topo_level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = current->date;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (topo_level == GENERATION_NUMBER_INFINITY ||
+				    topo_level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else {
+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
+
+					if (topo_level > max_level)
+						max_level = topo_level;
+
+					if (data->generation > max_corrected_commit_date)
+						max_corrected_commit_date = data->generation;
 				}
 			}
 
 			if (all_parents_computed) {
 				struct commit_graph_data *data = commit_graph_data_at(current);
 
-				data->generation = max_generation + 1;
-				pop_commit(&list);
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+				data->generation = max_corrected_commit_date + 1;
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				pop_commit(&list);
 			}
 		}
 	}
@@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
@@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
-	compute_generation_numbers(ctx);
+	compute_corrected_commit_date_offsets(ctx);
 
 	if (ctx->changed_paths)
 		compute_bloom_filters(ctx);
@@ -2325,8 +2346,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_parent_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2365,9 +2386,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_parent_corrected_commit_date)
+				max_parent_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2389,20 +2410,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_parent_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
diff --git a/commit-graph.h b/commit-graph.h
index e3d4ba96f4..20c5848587 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..2ce9867ff3 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_V2_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_V2_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..dd17a81672 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,9 @@
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
+#define GENERATION_NUMBER_V2_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
diff --git a/revision.c b/revision.c
index 23287d26c3..b978e79601 100644
--- a/revision.c
+++ b/revision.c
@@ -725,7 +725,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (!revs->repo->objects->commit_graph)
 		return -1;
 
-	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(commit) == GENERATION_NUMBER_V2_INFINITY)
 		return -1;
 
 	filter = get_bloom_filter(revs->repo, commit, 0);
@@ -3270,7 +3270,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3316,7 +3316,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3359,7 +3359,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3414,10 +3414,10 @@ static void init_topo_walk(struct rev_info *revs)
 	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
 	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
 
-	info->min_generation = GENERATION_NUMBER_INFINITY;
+	info->min_generation = GENERATION_NUMBER_V2_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3478,7 +3478,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3ec5248d70..43801f07a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -596,7 +596,7 @@ test_expect_success 'detect incorrect generation number' '
 		"generation for commit"
 '
 
-test_expect_success 'detect incorrect generation number' '
+test_expect_failure 'detect incorrect generation number' '
 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
diff --git a/upload-pack.c b/upload-pack.c
index 951a2b23aa..db2332e687 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -489,7 +489,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:00   ` Derrick Stolee
  2020-07-28 15:30     ` Taylor Blau
  2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:00 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

You mentioned this in your blog post. I'm sorry that such a small
issue caused you pain. Perhaps you could summarize a little bit of
how that investigation led you to find this issue?

Question: is this something that is only necessary when we change
the generation number, or is it something that is only _exposed_
by the test suite when we change the generation number? It seems that
it is likely to be an existing bug, but it might be hard to expose
in a test case.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>  
> +		if (parse_commit_gently(parent, 1) < 0)
> +			return ;

Drop the extra space.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
                       ` (2 more replies)
  0 siblings, 3 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
> 
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> generation number and has undefined behavior. The test used to pass as
> fill_commit_in_graph() did not read commit time from commit graph,
> reading commit date from odb instead.

I was first confused as to why fill_commit_graph_info() did not
load the timestamp, but the reason is that it is only used by
two methods:

1. fill_commit_in_graph(): this actually leaves the commit in a
   "parsed" state, so the date must be correct. Thus, it parses
   the date out of the commit-graph.

2. load_commit_graph_info(): this only helps to guarantee we
   know the graph_pos and generation number values.

Perhaps add this extra context: you will _need_ the commit date
from the commit-graph in order to populate the generation number
v2 in fill_commit_graph_info().

> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

The timestamp limit placed in the commit-graph is more restrictive
than 64-bit timestamps, but as your test points out, the maximum
timestamp allowed takes place in the year 2514. That is far enough
away for all real data.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 31 ++++++++++++-------------------
>  t/t5000-tar-tree.sh |  4 ++--
>  2 files changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d3c9bd23c..204eb454b2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>  
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>  
> @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>  
> +	fill_commit_graph_info(item, g, pos);
> +
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;

This 'while' loop happens in both implementations, so you could
save a miniscule amount of time by placing the call to
fill_commit_graph_info() after the while loop.

> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));

> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;
>  	lex_index = pos - g->num_commits_in_base;
> -
> -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;

I was about to complain about this change, but GRAPH_DATA_WIDTH
is a macro that does an equivalent thing (except the_hash_algo->rawsz
instead of g->hash_len).

>  
>  	item->object.parsed = 1;
>  
>  	set_commit_tree(item, NULL);
>  
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 37655a237c..1986354fc3 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
>  		git commit -m "tempori parendum"
>  '
>  
> @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
>  '
>  
>  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> -	echo 4147 >expect &&
> +	echo 2514 >expect &&
>  	tar_info future.tar | cut -d" " -f2 >actual &&
>  	test_cmp expect actual
>  '
> 

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 14:54 ` Taylor Blau
  2020-07-30  7:47   ` Abhishek Kumar
  2020-07-28 16:35 ` Derrick Stolee
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  8 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 14:54 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

Hi Abhishek,

On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Very exciting. I have been eagerly following your blog and asking
Stolee about your progress, so I am excited to read these patches.

> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

Interestingly, we use a very similar metric at GitHub to sort commits in
various UI views which have lots of existing machinery that sorts
an abstract collection by each element's "date". Since that sort is
stable, and we want to respect the order that Git delivered, we take the
pairwise max of each successive pair of commits.

> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

I'm sure that I'll learn more when I get to this point, but I would like
to hear more about why you want to store the offset rather than the
corrected commit date itself. It seems that the offset could be either
positive or negative, so you'd only have the range of a signed integer
(rather than storing 8 bytes of a time_t for the full breadth of
possibilities).

I know also that Peff is working on negative timestamp support, so I
would want to hear about what he thinks of this, too.

> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
>
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This will be good to gradually roll-out the new chunk. Another thought
is to control whether or not the commit-graph machinery _reads_ this
chunk if it's present. That can be useful for debugging too (eg., I have
a commit-graph with a GDAT chunk that is broken in some way, what
happens if I don't read that chunk?)

Maybe something like `commitgraph.readsGenerationData`? Incidentally,
I'm preparing a `commitgraph.readsChangedPaths` to control whether or
not we read the Bloom index and data chunks. I'll send that to the list
shortly (it's in my fork somewhere if you want an earlier look), but
that may be a useful reference for you.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).
>  3. Update technical documentation.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
> unrelated to my code changes. Still need to investigate further.
>
> Abhishek Kumar (6):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: implement generation data chunk
>   commit-graph: implement corrected commit date offset
>
>  blame.c                       |   2 +-
>  commit-graph.c                | 181 +++++++++++++++++++++-------------
>  commit-graph.h                |   7 +-
>  commit-reach.c                |  47 +++------
>  commit-reach.h                |   2 +-
>  commit.c                      |   9 +-
>  commit.h                      |   3 +
>  revision.c                    |  17 ++--
>  t/helper/test-read-graph.c    |   2 +
>  t/t4216-log-bloom.sh          |   4 +-
>  t/t5000-tar-tree.sh           |   4 +-
>  t/t5318-commit-graph.sh       |  21 ++--
>  t/t5324-split-commit-graph.sh |  12 +--
>  upload-pack.c                 |   2 +-
>  14 files changed, 178 insertions(+), 135 deletions(-)
>
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
@ 2020-07-28 15:19     ` René Scharfe
  2020-07-28 15:58       ` Derrick Stolee
  2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 1 reply; 211+ messages in thread
From: René Scharfe @ 2020-07-28 15:19 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git, Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

[Had to remove stolee@gmail.com because with it my mail provider
 rejected this email with the following error message:

   Requested action not taken: mailbox unavailable
   invalid DNS MX or A/AAAA resource record.]

Am 28.07.20 um 15:14 schrieb Derrick Stolee:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>
>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>> information present in commit data chunk. Let's simplify the
>> implementation by calling fill_commit_graph_info() within
>> fill_commit_in_graph().
>>
>> The test 'generate tar with future mtime' creates a commit with commit
>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>> generation number and has undefined behavior. The test used to pass as
>> fill_commit_in_graph() did not read commit time from commit graph,
>> reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.

We all may feel like the end of the world is imminent, but do we really
need to set such an arbitrary limit?  OK, that limit was already set two
years ago, and I'm really late.  But still: It's sad to see anything
else than signed 64-bit timestamps to be used in fresh code (after Y2K).
The extra four bytes would fatten up the structures less than the
transition from SHA-1 to SHA-256 will, and no bit twiddling would be
required.  *sigh*

René

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:28   ` Taylor Blau
  2020-07-30  5:24     ` Abhishek Kumar
  2020-08-04  0:46   ` Jakub Narębski
  1 sibling, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 15:28 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.

This last sentence is slightly confusing. Do you think it would be more
clear if you said elaborated a bit? Perhaps something like:

  [...]

  commit_gen_cmp is used when writing a commit-graph to sort commits in
  generation order before computing Bloom filters. Since c49c82aa made
  it so that 'commit_graph_generation()' returns
  'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
  this function. Instead, access the generation number directly through
  the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
  order to access it while writing.

I think the above would be a good extra paragraph in the commit message
provided that you remove the sentence beginning with "Sorting is no
longer useful..."

> Let's fix this by accessing generation number directly through the slab.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +

Nit; this whitespace diff is extraneous, but it's not hurting anything
either. Since it looks like you're rerolling anyway, it would be good to
just get rid of it.

Otherwise this fix makes sense to me.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-07-28 15:30     ` Taylor Blau
  0 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 15:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 09:00:51AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In indegree_walk_step(), we add unvisited parents to the indegree queue.
> > However, parents are not guaranteed to be parsed. As the indegree queue
> > sorts by generation number, let's parse parents before inserting them to
> > ensure the correct priority order.
>
> You mentioned this in your blog post. I'm sorry that such a small
> issue caused you pain. Perhaps you could summarize a little bit of
> how that investigation led you to find this issue?

Indeed ;-). I feel like forgetting to call 'parse_commit_gently()' is a
rite of passage for this part of the code in some sense.

> Question: is this something that is only necessary when we change
> the generation number, or is it something that is only _exposed_
> by the test suite when we change the generation number? It seems that
> it is likely to be an existing bug, but it might be hard to expose
> in a test case.

I tend to agree that this bug probably existed before Abhishek's
changes, but that it's probably more trouble than it's worth to tickle
with a test case. So, I'd be fine with this fix as it is (provided that
the style nit is addressed below, too).

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  revision.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/revision.c b/revision.c
> > index 6aa7f4f567..23287d26c3 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
> >  		struct commit *parent = p->item;
> >  		int *pi = indegree_slab_at(&info->indegree, parent);
> >
> > +		if (parse_commit_gently(parent, 1) < 0)
> > +			return ;
>
> Drop the extra space.
>
> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  0 siblings, 2 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git; +Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With preparations done,...

I feel like this commit could have been made smaller by doing the
uint32_t -> timestamp_t conversion in a separate patch. That would
make it easier to focus on the changes to the generation number v2
logic.

> let's implement corrected commit date offset.
> We add a new commit-slab to store topological levels while writing

It's important to add: we store topological levels to ensure that older
versions of Git will still have the performance benefits from generation
number v1.

> commit graph and upgrade number of struct commit_graph_data to 64-bits.

Do you mean "update the generation member in struct commit_graph_data
to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
member.

> We have to touch many files, upgrading generation number from uint32_t
> to timestamp_t.

Yes, that's why I recommend doing that in a different step.

> We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> which tests if verify can detect if a commit graph have
> GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> one of dates is Unix epoch zero.

What about the topological levels? Are we caring about verifying the data
that we start to ignore in this new version? I'm hesitant to drop this
right now, but I'm open to it if we really don't see it as a valuable test.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  blame.c                 |   2 +-
>  commit-graph.c          | 109 ++++++++++++++++++++++------------------
>  commit-graph.h          |   4 +-
>  commit-reach.c          |  32 ++++++------
>  commit-reach.h          |   2 +-
>  commit.h                |   3 ++
>  revision.c              |  14 +++---
>  t/t5318-commit-graph.sh |   2 +-
>  upload-pack.c           |   2 +-
>  9 files changed, 93 insertions(+), 77 deletions(-)
> 
> diff --git a/blame.c b/blame.c
> index 82fa16d658..48aa632461 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
>  	if (!bd)
>  		return 1;
>  
> -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
>  		return 1;

I don't see value in changing the name of this macro. It
is only used as the default value for a commit not in the
commit-graph. Changing its value to 0xFFFFFFFF works for
both versions when the type is updated to timestamp_t.

The actually-important change in this patch (not just the
type change) is here:

> -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
>  {
>  	int i;
>  	struct commit_list *list = NULL;
> @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>  
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> +		    topo_level != GENERATION_NUMBER_ZERO)
>  			continue;

Here, our "skip" condition is that the topo_level has been computed.
This should be fine, as we are never reading that out of the commit-graph.
We will never be in a mode where topo_level is computed but corrected
commit-date is not.

>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = current->date;

Later you assign data->generation to be "max_corrected_commit_date + 1",
which made me think this should be "current->date - 1". Is that so? Or,
do we want most offsets to be one instead of zero? Is there value there?

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>  
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> +				    topo_level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else {
> +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> +
> +					if (topo_level > max_level)
> +						max_level = topo_level;
> +
> +					if (data->generation > max_corrected_commit_date)
> +						max_corrected_commit_date = data->generation;
>  				}
>  			}
>  
>  			if (all_parents_computed) {
>  				struct commit_graph_data *data = commit_graph_data_at(current);
>  
> -				data->generation = max_generation + 1;
> -				pop_commit(&list);
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +				data->generation = max_corrected_commit_date + 1;
>  
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				pop_commit(&list);
>  			}
>  		}
>  	}

This looks correct, and I've done a tiny bit of perf tests locally.

> @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>  
>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>  	ctx->total_bloom_filter_data_size = 0;
>  
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
>  	if (ctx->split) {
>  		struct commit_graph *g;
>  		prepare_commit_graph(ctx->r);
> @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
>  	} else
>  		ctx->num_commit_graphs_after = 1;
>  
> -	compute_generation_numbers(ctx);
> +	compute_corrected_commit_date_offsets(ctx);

This rename might not be necessary. You are computing both
versions (v1 and v2) so the name change is actually less
accurate than the old name.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 15:58       ` Derrick Stolee
  0 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:58 UTC (permalink / raw)
  To: René Scharfe, Abhishek Kumar via GitGitGadget, git,
	Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 11:19 AM, René Scharfe wrote:
> [Had to remove stolee@gmail.com because with it my mail provider
>  rejected this email with the following error message:
> 
>    Requested action not taken: mailbox unavailable
>    invalid DNS MX or A/AAAA resource record.]
> 
> Am 28.07.20 um 15:14 schrieb Derrick Stolee:
>> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number and has undefined behavior. The test used to pass as
>>> fill_commit_in_graph() did not read commit time from commit graph,
>>> reading commit date from odb instead.
>>
>> I was first confused as to why fill_commit_graph_info() did not
>> load the timestamp, but the reason is that it is only used by
>> two methods:
>>
>> 1. fill_commit_in_graph(): this actually leaves the commit in a
>>    "parsed" state, so the date must be correct. Thus, it parses
>>    the date out of the commit-graph.
>>
>> 2. load_commit_graph_info(): this only helps to guarantee we
>>    know the graph_pos and generation number values.
>>
>> Perhaps add this extra context: you will _need_ the commit date
>> from the commit-graph in order to populate the generation number
>> v2 in fill_commit_graph_info().
>>
>>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>>
>> The timestamp limit placed in the commit-graph is more restrictive
>> than 64-bit timestamps, but as your test points out, the maximum
>> timestamp allowed takes place in the year 2514. That is far enough
>> away for all real data.
> 
> We all may feel like the end of the world is imminent, but do we really
> need to set such an arbitrary limit?  OK, that limit was already set two
> years ago, and I'm really late.  But still: It's sad to see anything
> else than signed 64-bit timestamps to be used in fresh code (after Y2K).
> The extra four bytes would fatten up the structures less than the
> transition from SHA-1 to SHA-256 will, and no bit twiddling would be
> required.  *sigh*

One thing to consider after generation number v2 is out long enough
is if we could drop the topo-levels and write zeroes for the topo-
level portion. This was valid data in the first version of the
commit-graph, so it would still be valid. Then, we could allow
full 64-bit timestamps again.

This is something to think about again in a year, maybe.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 16:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
>
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
>
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
>
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
>
> >
> >  	item->object.parsed = 1;
> >
> >  	set_commit_tree(item, NULL);
> >
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> >
>
> Thanks,
> -Stolee

Agreed with Stolee's review, but otherwise this looks like a faithful
transformation.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:03   ` Taylor Blau
  0 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 16:03 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:49AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 15 +++++++++++++++
>  commit-graph.h |  2 ++
>  commit-reach.c | 15 ---------------
>  commit.c       |  9 +++------
>  4 files changed, 20 insertions(+), 21 deletions(-)

All looks good to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:12   ` Taylor Blau
  2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 16:12 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> One of the essential pre-requisites before implementing generation
> number as to distinguish between generation numbers v1 and v2 while

s/as/is

> still being compatible with old Git.

Maybe you could add a section here to talk about why this is needed
specifically? That is, you mention it's a prerequisite, but a reader in
a year or two may not remember why. Adding that information here would
be good.

> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.
>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. Newer versions of Git can parse GDAT and
> take advantage of newer generation numbers, falling back to topological
> levels when GDAT chunk is missing (as it would happen with a commit
> graph written by old Git).

...this is exactly the paragraph that I was looking for above. Could you
swap the order of these last two paragraphs? I think that it would make
the patch message far clearer.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
>  commit-graph.h                |  1 +
>  t/helper/test-read-graph.c    |  2 ++
>  t/t4216-log-bloom.sh          |  4 ++--
>  t/t5318-commit-graph.sh       | 19 +++++++++++--------
>  t/t5324-split-commit-graph.sh | 12 ++++++------
>  6 files changed, 51 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1c98f38d69..ab714f4a76 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8

Ugh. I am simultaneously working on a new chunk myself (so a bad
conflict resolution would look at both of us incrementing this number
to the same value without generating a conflict.)

I think the right thing to do here would be to define an enum over chunk
names, and then index an array by that enum (where the value at each
index is the chunk identifier). Then, the last value of that enum would
be a '__COUNT' which you could use to initialize the array (as well as
within the commit-graph writing routines).

Anyway, I think that it's probably not worth it in the meantime, but it
is something that Junio should look out for when merging (if yours and
my topic happen to get merged around the same time, which they may not).

>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +
>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data)
> +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  	}
>  }
>
> +static void write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	int count;
> +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> +	}
> +}
> +

This pointer arithmetic is not necessary. Why not like:

  int i;
  for (i = 0; i < ctx->commits.nr; i++) {
    struct commit *c = ctx->commits.list[i];
    display_progress(ctx->progress, ++ctx->progress_cnt);
    hashwrite_be32(f, commit_graph_data_at(c)->generation);
  }

instead?

>  static void write_graph_chunk_extra_edges(struct hashfile *f,
>  					  struct write_commit_graph_context *ctx)
>  {
> @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
> -	int num_chunks = 3;
> +	int num_chunks = 4;
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
>  	if (ctx->num_extra_edges) {
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
>  		num_chunks++;
> @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
>  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
>
> -	num_chunks = 3;
> +	num_chunks = 4;
>  	if (ctx->num_extra_edges) {
>  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
>  						4 * ctx->num_extra_edges;
> @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	write_graph_chunk_fanout(f, ctx);
>  	write_graph_chunk_oids(f, hashsz, ctx);
>  	write_graph_chunk_data(f, hashsz, ctx);
> +	write_graph_chunk_generation_data(f, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
>  	if (ctx->changed_paths) {
> diff --git a/commit-graph.h b/commit-graph.h
> index 98cc5a3b9d..e3d4ba96f4 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -67,6 +67,7 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..1c2a5366c7 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c855bcd3e7..780855e691 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 26f332d6a3..3ec5248d70 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>
>  graph_read_expect() {
>  	OPTIONAL=""
> -	NUM_CHUNKS=3
> +	NUM_CHUNKS=4
>  	if test ! -z $2
>  	then
>  		OPTIONAL=" $2"
> -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
>  	fi
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
>  GRAPH_BYTE_CHUNK_COUNT=6
>  GRAPH_CHUNK_LOOKUP_OFFSET=8
>  GRAPH_CHUNK_LOOKUP_WIDTH=12
> -GRAPH_CHUNK_LOOKUP_ROWS=5
> +GRAPH_CHUNK_LOOKUP_ROWS=6
>  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
>  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
>  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
>  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
>  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
>  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
>  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_WIDTH=4
> +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
>  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
>  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
>
> @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
>  '
>
>  test_expect_success 'detect incorrect generation number' '
> -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
>  		"non-zero generation number"
>  '
>
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 269d0964a3..096a96ec41 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_init &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  	EOM
>  '
>
> @@ -29,9 +29,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> --
> gitgitgadget
>

All of this looks good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
@ 2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2020-07-28 16:23 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With preparations done,...
>
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.

Yep, agreed.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
>
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
>
> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
>
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
>
> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
>
> Yes, that's why I recommend doing that in a different step.
>
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
>
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  blame.c                 |   2 +-
> >  commit-graph.c          | 109 ++++++++++++++++++++++------------------
> >  commit-graph.h          |   4 +-
> >  commit-reach.c          |  32 ++++++------
> >  commit-reach.h          |   2 +-
> >  commit.h                |   3 ++
> >  revision.c              |  14 +++---
> >  t/t5318-commit-graph.sh |   2 +-
> >  upload-pack.c           |   2 +-
> >  9 files changed, 93 insertions(+), 77 deletions(-)
> >
> > diff --git a/blame.c b/blame.c
> > index 82fa16d658..48aa632461 100644
> > --- a/blame.c
> > +++ b/blame.c
> > @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
> >  	if (!bd)
> >  		return 1;
> >
> > -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> > +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
> >  		return 1;
>
> I don't see value in changing the name of this macro. It
> is only used as the default value for a commit not in the
> commit-graph. Changing its value to 0xFFFFFFFF works for
> both versions when the type is updated to timestamp_t.
>
> The actually-important change in this patch (not just the
> type change) is here:
>
> > -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
> >  {
> >  	int i;
> >  	struct commit_list *list = NULL;
> > @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> > +		    topo_level != GENERATION_NUMBER_ZERO)
> >  			continue;
>
> Here, our "skip" condition is that the topo_level has been computed.
> This should be fine, as we are never reading that out of the commit-graph.
> We will never be in a mode where topo_level is computed but corrected
> commit-date is not.
>
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = current->date;
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
>
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> > +				    topo_level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else {
> > +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> > +
> > +					if (topo_level > max_level)
> > +						max_level = topo_level;
> > +
> > +					if (data->generation > max_corrected_commit_date)
> > +						max_corrected_commit_date = data->generation;
> >  				}
> >  			}
> >
> >  			if (all_parents_computed) {
> >  				struct commit_graph_data *data = commit_graph_data_at(current);
> >
> > -				data->generation = max_generation + 1;
> > -				pop_commit(&list);
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +				data->generation = max_corrected_commit_date + 1;
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				pop_commit(&list);
> >  			}
> >  		}
> >  	}
>
> This looks correct, and I've done a tiny bit of perf tests locally.
>
> > @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
> >  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> >  	ctx->total_bloom_filter_data_size = 0;
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> >  	if (ctx->split) {
> >  		struct commit_graph *g;
> >  		prepare_commit_graph(ctx->r);
> > @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	} else
> >  		ctx->num_commit_graphs_after = 1;
> >
> > -	compute_generation_numbers(ctx);
> > +	compute_corrected_commit_date_offsets(ctx);
>
> This rename might not be necessary. You are computing both
> versions (v1 and v2) so the name change is actually less
> accurate than the old name.
>
> Thanks,
> -Stolee

I don't have anything to add that Stolee hasn't already pointed out.
Thanks for your work on this series, and I'm looking forward to another
reroll.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-28 16:35 ` Derrick Stolee
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  8 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-07-28 16:35 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Abhishek Kumar, Taylor Blau

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
> 
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This would be a good idea, if only as a GIT_TEST_* environment variable.
I think it important we have a test for the compatibility scenario where
we have an "old" commit-graph with the new code and test that reading and
writing still works properly.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).

This is an excellent thing to do. There are a few options when writing an
incremental commit-graph when the base graphs do not have the GDAT chunk:

   i. Do not write the GDAT chunk unless we are merging all levels
      (based on the merging strategy).

  ii. Merge all levels, then write the GDAT chunk.

>  3. Update technical documentation.

Yes, I was going to ask for a patch that updates
Documentation/technical/commit-graph-format.txt.

This is an excellent v1. A lot of small things, but no
really big issues.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28 15:28   ` Taylor Blau
@ 2020-07-30  5:24     ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-07-30  5:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, abhishekkumar8222, gitgitgadget, jnareb, stolee

On Tue, Jul 28, 2020 at 11:28:44AM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> 
> This last sentence is slightly confusing. Do you think it would be more
> clear if you said elaborated a bit? Perhaps something like:
> 
>   [...]
> 
>   commit_gen_cmp is used when writing a commit-graph to sort commits in
>   generation order before computing Bloom filters. Since c49c82aa made
>   it so that 'commit_graph_generation()' returns
>   'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
>   this function. Instead, access the generation number directly through
>   the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
>   order to access it while writing.
> 

Thanks! That is clearer. Will change.

> I think the above would be a good extra paragraph in the commit message
> provided that you remove the sentence beginning with "Sorting is no
> longer useful..."
> 
> > Let's fix this by accessing generation number directly through the slab.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> 
> Nit; this whitespace diff is extraneous, but it's not hurting anything
> either. Since it looks like you're rerolling anyway, it would be good to
> just get rid of it.
> 
> Otherwise this fix makes sense to me.
> 
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > --
> > gitgitgadget
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
  2020-07-28 16:01     ` Taylor Blau
@ 2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> > 
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
> 
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
> 
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
> 
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
> 
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().

Thanks, that makes sense. I have revised the commit message to:

commit-graph: consolidate fill_commit_graph_info
    
    Both fill_commit_graph_info() and fill_commit_in_graph() parse
    information present in commit data chunk. Let's simplify the
    implementation by calling fill_commit_graph_info() within
    fill_commit_in_graph().
    
    The test 'generate tar with future mtime' creates a commit with commit
    time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
    generation number (within CDAT chunk) and has undefined behavior.
    
    The test used to pass as fill_commit_in_graph() guarantees the values of
    graph position and generation number, and did not load timestamp.
    However, with corrected commit date we will need load the timestamp as
    well to populate the generation number.
> 
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
> 
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
> 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >  
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >  
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >  
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >  
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> 
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
> 
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
> 
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> 
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
> 
> >  
> >  	item->object.parsed = 1;
> >  
> >  	set_commit_tree(item, NULL);
> >  
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >  
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >  
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >  
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> > 
> 
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28 16:12   ` Taylor Blau
@ 2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: dstolee, jnareb, git, abhishekkumar8222, gitgitgadget

On Tue, Jul 28, 2020 at 12:12:50PM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > One of the essential pre-requisites before implementing generation
> > number as to distinguish between generation numbers v1 and v2 while
> 
> s/as/is
> 
> > still being compatible with old Git.
> 
> Maybe you could add a section here to talk about why this is needed
> specifically? That is, you mention it's a prerequisite, but a reader in
> a year or two may not remember why. Adding that information here would
> be good.
> 
> > We are going to introduce a new chunk called Generation Data chunk (or
> > GDAT). GDAT stores generation number v2 (and any subsequent versions),
> > whereas CDAT will still store topological level.
> >
> > Old Git does not understand GDAT chunk and would ignore it, reading
> > topological levels from CDAT. Newer versions of Git can parse GDAT and
> > take advantage of newer generation numbers, falling back to topological
> > levels when GDAT chunk is missing (as it would happen with a commit
> > graph written by old Git).
> 
> ...this is exactly the paragraph that I was looking for above. Could you
> swap the order of these last two paragraphs? I think that it would make
> the patch message far clearer.

Here's revised commit message:

  commit-graph: implement generation data chunk
    
  As discovered by Ævar, we cannot increment graph version to
  distinguish between generation numbers v1 and v2 [1]. Thus, one of
  pre-requistes before implementing generation number v2 was to
  distinguish generation numbers in a backwards compatible manner
  without increment graph version.
  
  We are going to introduce a new chunk called Generation Data chunk (or
  GDAT). GDAT stores generation number v2 (and any subsequent versions),
  whereas CDAT will still store topological level.
  
  Old Git does not understand GDAT chunk and would ignore it, reading
  topological levels from CDAT. New Git can parse GDAT and take advantage
  of newer generation numbers, falling back to topological levels when
  GDAT chunk is missing (as it would happen with a commit graph written
  by old Git).
 
  [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

First paragraph explains why we need this patch (cannot increment graph
version) second explains what this patch does (introduce a new chunk)
and third proves why it works (Old Git ignores GDAT, New Git parses GDAT).

Can we improve this commit message further? 

> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
> >  commit-graph.h                |  1 +
> >  t/helper/test-read-graph.c    |  2 ++
> >  t/t4216-log-bloom.sh          |  4 ++--
> >  t/t5318-commit-graph.sh       | 19 +++++++++++--------
> >  t/t5324-split-commit-graph.sh | 12 ++++++------
> >  6 files changed, 51 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1c98f38d69..ab714f4a76 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
> >  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> >  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> >  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> >  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> >  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> >  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> >  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> > -#define MAX_NUM_CHUNKS 7
> > +#define MAX_NUM_CHUNKS 8
> 
> Ugh. I am simultaneously working on a new chunk myself (so a bad
> conflict resolution would look at both of us incrementing this number
> to the same value without generating a conflict.)
> 
> I think the right thing to do here would be to define an enum over chunk
> names, and then index an array by that enum (where the value at each
> index is the chunk identifier). Then, the last value of that enum would
> be a '__COUNT' which you could use to initialize the array (as well as
> within the commit-graph writing routines).
> 
> Anyway, I think that it's probably not worth it in the meantime, but it
> is something that Junio should look out for when merging (if yours and
> my topic happen to get merged around the same time, which they may not).
> 
> >  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
> >
> > @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
> >  				graph->chunk_commit_data = data + chunk_offset;
> >  			break;
> >
> > +		case GRAPH_CHUNKID_GENERATION_DATA:
> > +			if (graph->chunk_generation_data)
> > +				chunk_repeated = 1;
> > +			else
> > +				graph->chunk_generation_data = data + chunk_offset;
> > +			break;
> > +
> >  		case GRAPH_CHUNKID_EXTRAEDGES:
> >  			if (graph->chunk_extra_edges)
> >  				chunk_repeated = 1;
> > @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	date_low = get_be32(commit_data + g->hash_len + 12);
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +	if (g->chunk_generation_data)
> > +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> > +	else
> > +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> >  	}
> >  }
> >
> > +static void write_graph_chunk_generation_data(struct hashfile *f,
> > +					      struct write_commit_graph_context *ctx)
> > +{
> > +	struct commit **list = ctx->commits.list;
> > +	int count;
> > +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> > +		display_progress(ctx->progress, ++ctx->progress_cnt);
> > +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> > +	}
> > +}
> > +
> 
> This pointer arithmetic is not necessary. Why not like:
> 
>   int i;
>   for (i = 0; i < ctx->commits.nr; i++) {
>     struct commit *c = ctx->commits.list[i];
>     display_progress(ctx->progress, ++ctx->progress_cnt);
>     hashwrite_be32(f, commit_graph_data_at(c)->generation);
>   }
> 
> instead?
> 
> >  static void write_graph_chunk_extra_edges(struct hashfile *f,
> >  					  struct write_commit_graph_context *ctx)
> >  {
> > @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
> >  	const unsigned hashsz = the_hash_algo->rawsz;
> >  	struct strbuf progress_title = STRBUF_INIT;
> > -	int num_chunks = 3;
> > +	int num_chunks = 4;
> >  	struct object_id file_hash;
> >  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> >
> > @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> >  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> >  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> > +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
> >  		num_chunks++;
> > @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> >  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
> >  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> > +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
> >
> > -	num_chunks = 3;
> > +	num_chunks = 4;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> >  						4 * ctx->num_extra_edges;
> > @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	write_graph_chunk_fanout(f, ctx);
> >  	write_graph_chunk_oids(f, hashsz, ctx);
> >  	write_graph_chunk_data(f, hashsz, ctx);
> > +	write_graph_chunk_generation_data(f, ctx);
> >  	if (ctx->num_extra_edges)
> >  		write_graph_chunk_extra_edges(f, ctx);
> >  	if (ctx->changed_paths) {
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 98cc5a3b9d..e3d4ba96f4 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -67,6 +67,7 @@ struct commit_graph {
> >  	const uint32_t *chunk_oid_fanout;
> >  	const unsigned char *chunk_oid_lookup;
> >  	const unsigned char *chunk_commit_data;
> > +	const unsigned char *chunk_generation_data;
> >  	const unsigned char *chunk_extra_edges;
> >  	const unsigned char *chunk_base_graphs;
> >  	const unsigned char *chunk_bloom_indexes;
> > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> > index 6d0c962438..1c2a5366c7 100644
> > --- a/t/helper/test-read-graph.c
> > +++ b/t/helper/test-read-graph.c
> > @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
> >  		printf(" oid_lookup");
> >  	if (graph->chunk_commit_data)
> >  		printf(" commit_metadata");
> > +	if (graph->chunk_generation_data)
> > +		printf(" generation_data");
> >  	if (graph->chunk_extra_edges)
> >  		printf(" extra_edges");
> >  	if (graph->chunk_bloom_indexes)
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c855bcd3e7..780855e691 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> >  	git commit-graph write --reachable --changed-paths
> >  '
> >  graph_read_expect () {
> > -	NUM_CHUNKS=5
> > +	NUM_CHUNKS=6
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
> >  	EOF
> >  	test-tool read-graph >actual &&
> >  	test_cmp expect actual
> > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> > index 26f332d6a3..3ec5248d70 100755
> > --- a/t/t5318-commit-graph.sh
> > +++ b/t/t5318-commit-graph.sh
> > @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
> >
> >  graph_read_expect() {
> >  	OPTIONAL=""
> > -	NUM_CHUNKS=3
> > +	NUM_CHUNKS=4
> >  	if test ! -z $2
> >  	then
> >  		OPTIONAL=" $2"
> > -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> > +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
> >  	fi
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
> >  GRAPH_BYTE_CHUNK_COUNT=6
> >  GRAPH_CHUNK_LOOKUP_OFFSET=8
> >  GRAPH_CHUNK_LOOKUP_WIDTH=12
> > -GRAPH_CHUNK_LOOKUP_ROWS=5
> > +GRAPH_CHUNK_LOOKUP_ROWS=6
> >  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
> >  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
> >  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> > @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
> >  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
> >  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
> >  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> > -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
> >  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
> >  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> > -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_WIDTH=4
> > +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> > +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> > +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
> >  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
> >  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
> >
> > @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
> >  '
> >
> >  test_expect_success 'detect incorrect generation number' '
> > -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> > +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
> >  		"non-zero generation number"
> >  '
> >
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 269d0964a3..096a96ec41 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
> >  	graphdir="$infodir/commit-graphs" &&
> >  	test_oid_init &&
> >  	test_oid_cache <<-EOM
> > -	shallow sha1:1760
> > -	shallow sha256:2064
> > +	shallow sha1:2132
> > +	shallow sha256:2436
> >
> > -	base sha1:1376
> > -	base sha256:1496
> > +	base sha1:1408
> > +	base sha256:1528
> >  	EOM
> >  '
> >
> > @@ -29,9 +29,9 @@ graph_read_expect() {
> >  		NUM_BASE=$2
> >  	fi
> >  	cat >expect <<- EOF
> > -	header: 43475048 1 1 3 $NUM_BASE
> > +	header: 43475048 1 1 4 $NUM_BASE
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > --
> > gitgitgadget
> >
> 
> All of this looks good to me.
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
@ 2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, jnareb, gitgitgadget

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > With preparations done,...
> 
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.
> 

Sure, would seperate into two patches.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
> 
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
> 

Will do.

> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
> 
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
> 

Yes, "update the generation number".

> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
> 
> Yes, that's why I recommend doing that in a different step.
> 
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
> 
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
> 

We haven't tested the scenario "New Git reads a commit graph without
GDAT chunk" yet. Verifying topological levels (along with many of the
changed offsets) would be a part of the scenario.

Now that I think about it, those tests should have been included with
this patch.

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> [...]
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
> 

Does it? 

I had hoped most of the offsets could have been zero, as we could take
advantage of the fact that commit-slab zero initializes values and avoid
a commit-slab access.

Right, What I meant to do was:

        /*
         * max_parent_corrected_commit_date is initialized with zero and
         * takes the maximum of
         * (parent->item->date + commit_graph_data_at(parent->item)->generation)
        */

        if (max_parent_corrected_commit_date >= current->date)
        {
                struct commit_graph_data *data = commit_graph_data_at(current);
                data->generation = max_parent_corrected_commit_date + 1;
        }

Thanks for pointing this out!

> [...]

- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-30  7:47   ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitgitgadget, abhishekkumar8222

On Tue, Jul 28, 2020 at 10:54:58AM -0400, Taylor Blau wrote:
> Hi Abhishek,
> 
> On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Very exciting. I have been eagerly following your blog and asking
> Stolee about your progress, so I am excited to read these patches.
> 

I am so glad to hear that!

> 
> [...]
> 
> I'm sure that I'll learn more when I get to this point, but I would like
> to hear more about why you want to store the offset rather than the
> corrected commit date itself. It seems that the offset could be either
> positive or negative, so you'd only have the range of a signed integer
> (rather than storing 8 bytes of a time_t for the full breadth of
> possibilities).
> 

Corrected commit dates are at least as big as the committer date, so the
offset (i.e. corrected date - committer date) would never be negative.

We store offsets instead of corrected commit dates because:
- We save 4 bytes for each commit, which amounts to 7-8% of the size of
  commit graph file (of course, dependent on the other chunks used).
- We save some time while writing the commit-graph file too, around
  ~200ms for the Linux repo.

While the savings are modest, writing corrected dates does not offer any
advantage that we could think of, at the time.

> I know also that Peff is working on negative timestamp support, so I
> would want to hear about what he thinks of this, too.

I have read up on Peff's work with negative timestamp support and it's
pretty exciting.

> [...]

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
@ 2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 2 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-04  0:46 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.
>
> Let's fix this by accessing generation number directly through the slab.

It looks like unfortunate and unforeseen consequence of putting together
graph position and generation number in the commit_graph_data struct.
During writing of the commit-graph file generation number is computed,
but graph position is undefined (yet), and commit_graph_generation()
uses graph_pos field to find if the data for commit is initialized;
in this case wrongly.

Anyway, when writing the commit graph we first compute generation
number, then (if requested) the changed-paths Bloom filter.  Skipping
the unnecessary check is a good thing... assuming that commit_gen_cmp()
is used only when writing the commit graph, and not when traversing it
(because then some commits may not have generation number set, and maybe
even do not have any data on the commit slab) - which is the case.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

We might want to add function comment either here or in the header that
this comparisonn function is to be used only for `git commit-graph
write`, and not for graph traversal (even if similar funnction exists in
other modules).

> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
@ 2020-08-04  0:56     ` Taylor Blau
  2020-08-04 10:10       ` Jakub Narębski
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-08-04  0:56 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Abhishek Kumar, Taylor Blau

On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> >
> > Let's fix this by accessing generation number directly through the slab.
>
> It looks like unfortunate and unforeseen consequence of putting together
> graph position and generation number in the commit_graph_data struct.
> During writing of the commit-graph file generation number is computed,
> but graph position is undefined (yet), and commit_graph_generation()
> uses graph_pos field to find if the data for commit is initialized;
> in this case wrongly.
>
> Anyway, when writing the commit graph we first compute generation
> number, then (if requested) the changed-paths Bloom filter.  Skipping
> the unnecessary check is a good thing... assuming that commit_gen_cmp()
> is used only when writing the commit graph, and not when traversing it
> (because then some commits may not have generation number set, and maybe
> even do not have any data on the commit slab) - which is the case.
>
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
>
> We might want to add function comment either here or in the header that
> this comparisonn function is to be used only for `git commit-graph
> write`, and not for graph traversal (even if similar funnction exists in
> other modules).

I think that probably within the function is just fine, and that we can
avoid touching commit-graph.h here.

>
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;

Maybe something like:

  /*
   * Access the generation number directly with
   * 'commit_graph_data_at(...)->generation' instead of going through
   * the slab as usual to avoid accessing a yet-uncomputed value.
   */

Folks that are curious for more can blame this commit and read there.
I'd err on the side of being brief in the code comment and verbose in
the commit message than the other way around ;).

> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-04  7:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

Jakub Narębski <jnareb@gmail.com> writes:

[...]
>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>  	const struct commit *a = *(const struct commit **)va;
>>  	const struct commit *b = *(const struct commit **)vb;
>>
>> -	uint32_t generation_a = commit_graph_generation(a);
>> -	uint32_t generation_b = commit_graph_generation(b);
>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>> +
>>  	/* lower generation commits first */
>>  	if (generation_a < generation_b)
>>  		return -1;

NOTE: One more thing: we would want to check if corrected commit date
(generation number v2) or topological level (generation number v1) is
better for this purpose, that is gives better performance.

The commit 3d11275505 (commit-graph: examine commits by generation
number) which introduced using commit_gen_cmp when writing commit graph
when finding commits via `--reachable` flags describes the following
performance improvement:

    On the Linux kernel repository, this change reduced the computation
    time for 'git commit-graph write --reachable --changed-paths' from
    3m00s to 1m37s.

We would probably want time for no sorting, for sorting by generation
number v2, and for sorting by topological level (generation number v1)
for the same or similar case.

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04 10:10       ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-04 10:10 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Abhishek Kumar

Taylor Blau <me@ttaylorr.com> writes:
> On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index 1af68c297d..5d3c9bd23c 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>
>> We might want to add function comment either here or in the header that
>> this comparisonn function is to be used only for `git commit-graph
>> write`, and not for graph traversal (even if similar funnction exists in
>> other modules).
>
> I think that probably within the function is just fine, and that we can
> avoid touching commit-graph.h here.
>
>>
>>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	const struct commit *a = *(const struct commit **)va;
>>>  	const struct commit *b = *(const struct commit **)vb;
>
> Maybe something like:
>
>   /*
>    * Access the generation number directly with
>    * 'commit_graph_data_at(...)->generation' instead of going through
>    * the slab as usual to avoid accessing a yet-uncomputed value.
>    */

I think the last part of this comment should read:

[...]
     * 'commit_graph_data_at(...)->generation' instead of going through
     * the commit_graph_generation() helper function to access just
     * computed data [during `git commit-graph write --reachable --changed-paths`].
     */

Or something like that (the part in square brackets is optional; I am
not sure if adding it helps or not).

>
> Folks that are curious for more can blame this commit and read there.
> I'd err on the side of being brief in the code comment and verbose in
> the commit message than the other way around ;).

I agree.

>>>
>>> -	uint32_t generation_a = commit_graph_generation(a);
>>> -	uint32_t generation_b = commit_graph_generation(b);
>>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>>> +
>>>  	/* lower generation commits first */
>>>  	if (generation_a < generation_b)
>>>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-05 23:16 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget; +Cc: git, Derrick Stolee, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>
> +		if (parse_commit_gently(parent, 1) < 0)

All right, parse_commit_gently() avoids re-parsing objects, and makes
use of the commit-graph data.  If parents are not guaranteed to be
parsed, this is a correct thing to do.

Though I do wonder how this issue got missed by the test suite, just
like other reviewers...

> +			return ;

Why this need to be 'return' and not 'continue'?

> +
>  		if (*pi)
>  			(*pi)++;
>  		else

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-07-28 16:35 ` Derrick Stolee
@ 2020-08-09  2:53 ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                     ` (11 more replies)
  8 siblings, 12 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
first version.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed graph version commit chains.
 * Improve commit messages for
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (10):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: return 64-bit generation number
  commit-graph: implement corrected commit date
  commit-graph: handle mixed generation commit chains
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  12 +-
 Documentation/technical/commit-graph.txt      |  45 ++--
 commit-graph.c                                | 203 ++++++++++++------
 commit-graph.h                                |  14 +-
 commit-reach.c                                |  49 ++---
 commit-reach.h                                |   2 +-
 commit.c                                      |   9 +-
 commit.h                                      |   4 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   2 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |   4 +-
 t/t5318-commit-graph.sh                       |  27 +--
 t/t5324-split-commit-graph.sh                 |  78 ++++++-
 t/t6024-recursive-merge.sh                    |   4 +-
 t/t6600-test-reach.sh                         |  62 +++---
 upload-pack.c                                 |   2 +-
 18 files changed, 354 insertions(+), 183 deletions(-)


base-commit: dc04167d378fb29d30e1647ff6ff51dd182bc9a3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v1:

  1:  91e6e97a66 !  1:  a962b9ae4b commit-graph: fix regression when computing bloom filter
     @@ Metadata
       ## Commit message ##
          commit-graph: fix regression when computing bloom filter
      
     -    With 3d112755 (commit-graph: examine commits by generation number), Git
     -    knew to sort by generation number before examining the diff when not
     -    using pack order. c49c82aa (commit: move members graph_pos, generation
     -    to a slab, 2020-06-17) moved generation number into a slab and
     -    introduced a helper which returns GENERATION_NUMBER_INFINITY when
     -    writing the graph. Sorting is no longer useful and essentially reverts
     -    the earlier commit.
     -
     -    Let's fix this by accessing generation number directly through the slab.
     +    commit_gen_cmp is used when writing a commit-graph to sort commits in
     +    generation order before computing Bloom filters. Since c49c82aa (commit:
     +    move members graph_pos, generation to a slab, 2020-06-17) made it so
     +    that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
     +    during writing, we cannot call it within this function. Instead, access
     +    the generation number directly through the slab (i.e., by calling
     +    'commit_graph_data_at(c)->generation') in order to access it while
     +    writing.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
      -	uint32_t generation_b = commit_graph_generation(b);
      +	uint32_t generation_a = commit_graph_data_at(a)->generation;
      +	uint32_t generation_b = commit_graph_data_at(b)->generation;
     -+
       	/* lower generation commits first */
       	if (generation_a < generation_b)
       		return -1;
  2:  d23f67dc80 !  2:  cf61239f93 revision: parse parent in indegree_walk_step()
     @@ revision.c: static void indegree_walk_step(struct rev_info *revs)
       		int *pi = indegree_slab_at(&info->indegree, parent);
       
      +		if (parse_commit_gently(parent, 1) < 0)
     -+			return ;
     ++			return;
      +
       		if (*pi)
       			(*pi)++;
  3:  701f591236 !  3:  32da955e31 commit-graph: consolidate fill_commit_graph_info
     @@ Commit message
      
          The test 'generate tar with future mtime' creates a commit with commit
          time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
     -    generation number and has undefined behavior. The test used to pass as
     -    fill_commit_in_graph() did not read commit time from commit graph,
     -    reading commit date from odb instead.
     +    generation number (within CDAT chunk) and has undefined behavior.
      
     -    Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
     +    The test used to pass as fill_commit_in_graph() guarantees the values of
     +    graph position and generation number, and did not load timestamp.
     +    However, with corrected commit date we will need load the timestamp as
     +    well to populate the generation number.
     +
     +    Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
       	const unsigned char *commit_data;
       	uint32_t lex_index;
       
     -+	fill_commit_graph_info(item, g, pos);
     -+
       	while (pos < g->num_commits_in_base)
       		g = g->base_graph;
       
      -	if (pos >= g->num_commits + g->num_commits_in_base)
      -		die(_("invalid commit position. commit-graph is likely corrupt"));
     --
     ++	fill_commit_graph_info(item, g, pos);
     + 
      -	/*
      -	 * Store the "full" position, but then use the
      -	 * "local" position for the rest of the calculation.
  4:  812fe75fc7 !  4:  b254782858 commit-graph: consolidate compare_commits_by_gen
     @@ Commit message
          compare_commits_by_gen() to commit-graph.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
     +    Reviewed-by: Taylor Blau <me@ttaylorr.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
  5:  80ea7da343 !  5:  cb797e20d7 commit-graph: implement generation data chunk
     @@ Metadata
       ## Commit message ##
          commit-graph: implement generation data chunk
      
     -    One of the essential pre-requisites before implementing generation
     -    number as to distinguish between generation numbers v1 and v2 while
     -    still being compatible with old Git.
     +    As discovered by Ævar, we cannot increment graph version to
     +    distinguish between generation numbers v1 and v2 [1]. Thus, one of
     +    pre-requistes before implementing generation number was to distinguish
     +    between graph versions in a backwards compatible manner.
      
          We are going to introduce a new chunk called Generation Data chunk (or
          GDAT). GDAT stores generation number v2 (and any subsequent versions),
          whereas CDAT will still store topological level.
      
          Old Git does not understand GDAT chunk and would ignore it, reading
     -    topological levels from CDAT. Newer versions of Git can parse GDAT and
     -    take advantage of newer generation numbers, falling back to topological
     -    levels when GDAT chunk is missing (as it would happen with a commit
     -    graph written by old Git).
     +    topological levels from CDAT. New Git can parse GDAT and take advantage
     +    of newer generation numbers, falling back to topological levels when
     +    GDAT chunk is missing (as it would happen with a commit graph written
     +    by old Git).
     +
     +    We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
     +    which forces commit-graph file to be written without generation data
     +    chunk to emulate a commit-graph file written by old Git.
     +
     +    [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       }
       
       static inline void set_commit_tree(struct commit *c, struct tree *t)
     -@@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
     - 	}
     +@@ commit-graph.c: struct write_commit_graph_context {
     + 		 report_progress:1,
     + 		 split:1,
     + 		 changed_paths:1,
     +-		 order_by_pack:1;
     ++		 order_by_pack:1,
     ++		 write_generation_data:1;
     + 
     + 	const struct split_commit_graph_opts *split_opts;
     + 	size_t total_bloom_filter_data_size;
     +@@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
     + 	return 0;
       }
       
     -+static void write_graph_chunk_generation_data(struct hashfile *f,
     ++static int write_graph_chunk_generation_data(struct hashfile *f,
      +					      struct write_commit_graph_context *ctx)
      +{
     -+	struct commit **list = ctx->commits.list;
     -+	int count;
     -+	for (count = 0; count < ctx->commits.nr; count++, list++) {
     ++	int i;
     ++	for (i = 0; i < ctx->commits.nr; i++) {
     ++		struct commit *c = ctx->commits.list[i];
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
     -+		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
     ++		hashwrite_be32(f, commit_graph_data_at(c)->generation);
      +	}
     ++
     ++	return 0;
      +}
      +
     - static void write_graph_chunk_extra_edges(struct hashfile *f,
     - 					  struct write_commit_graph_context *ctx)
     + static int write_graph_chunk_extra_edges(struct hashfile *f,
     +-					 struct write_commit_graph_context *ctx)
     ++					  struct write_commit_graph_context *ctx)
       {
     + 	struct commit **list = ctx->commits.list;
     + 	struct commit **last = ctx->commits.list + ctx->commits.nr;
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
     - 	const unsigned hashsz = the_hash_algo->rawsz;
     - 	struct strbuf progress_title = STRBUF_INIT;
     --	int num_chunks = 3;
     -+	int num_chunks = 4;
     - 	struct object_id file_hash;
     - 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     - 
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
     - 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
     - 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
     -+	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
     + 	chunks[2].id = GRAPH_CHUNKID_DATA;
     + 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
     + 	chunks[2].write_fn = write_graph_chunk_data;
     ++	if (ctx->write_generation_data) {
     ++		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
     ++		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     ++		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
     ++		num_chunks++;
     ++	}
       	if (ctx->num_extra_edges) {
     - 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
     - 		num_chunks++;
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
     - 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
     - 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
     -+	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
     + 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
     + 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
     +@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     + 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
     + 	ctx->split_opts = split_opts;
     + 	ctx->total_bloom_filter_data_size = 0;
     ++	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
       
     --	num_chunks = 3;
     -+	num_chunks = 4;
     - 	if (ctx->num_extra_edges) {
     - 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
     - 						4 * ctx->num_extra_edges;
     -@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 	write_graph_chunk_fanout(f, ctx);
     - 	write_graph_chunk_oids(f, hashsz, ctx);
     - 	write_graph_chunk_data(f, hashsz, ctx);
     -+	write_graph_chunk_generation_data(f, ctx);
     - 	if (ctx->num_extra_edges)
     - 		write_graph_chunk_extra_edges(f, ctx);
     - 	if (ctx->changed_paths) {
     + 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     + 		ctx->changed_paths = 1;
      
       ## commit-graph.h ##
     +@@
     + #include "oidset.h"
     + 
     + #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
     ++#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
     + #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
     + #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
     + 
      @@ commit-graph.h: struct commit_graph {
       	const uint32_t *chunk_oid_fanout;
       	const unsigned char *chunk_oid_lookup;
     @@ commit-graph.h: struct commit_graph {
       	const unsigned char *chunk_base_graphs;
       	const unsigned char *chunk_bloom_indexes;
      
     + ## t/README ##
     +@@ t/README: GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
     + be written after every 'git commit' command, and overrides the
     + 'core.commitGraph' setting to true.
     + 
     ++GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
     ++commit-graph to be written without generation data chunk.
     ++
     + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
     + commit-graph write to compute and write changed path Bloom filters for
     + every 'git commit-graph write', as if the `--changed-paths` option was
     +
       ## t/helper/test-read-graph.c ##
      @@ t/helper/test-read-graph.c: int cmd__read_graph(int argc, const char **argv)
       		printf(" oid_lookup");
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup test - repo, commits, commit gr
      
       ## t/t5318-commit-graph.sh ##
      @@ t/t5318-commit-graph.sh: graph_git_behavior 'no graph' full commits/3 commits/1
     - 
       graph_read_expect() {
       	OPTIONAL=""
     --	NUM_CHUNKS=3
     -+	NUM_CHUNKS=4
     - 	if test ! -z $2
     + 	NUM_CHUNKS=3
     +-	if test ! -z $2
     ++	if test ! -z "$2"
       	then
       		OPTIONAL=" $2"
     --		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
     -+		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
     - 	fi
     - 	cat >expect <<- EOF
     - 	header: 43475048 1 1 $NUM_CHUNKS 0
     - 	num_commits: $1
     --	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
     -+	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
     - 	EOF
     - 	test-tool read-graph >output &&
     - 	test_cmp expect output
     -@@ t/t5318-commit-graph.sh: GRAPH_BYTE_HASH=5
     - GRAPH_BYTE_CHUNK_COUNT=6
     - GRAPH_CHUNK_LOOKUP_OFFSET=8
     - GRAPH_CHUNK_LOOKUP_WIDTH=12
     --GRAPH_CHUNK_LOOKUP_ROWS=5
     -+GRAPH_CHUNK_LOOKUP_ROWS=6
     - GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
     - GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
     - 			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
     -@@ t/t5318-commit-graph.sh: GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
     - GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
     - GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
     - GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
     --GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
     - GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
     - GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
     --GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
     --			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
     -+GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
     -+				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
     -+GRAPH_GENERATION_DATA_WIDTH=4
     -+GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
     -+GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
     -+			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
     - GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
     - GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
     - 
     -@@ t/t5318-commit-graph.sh: test_expect_success 'detect incorrect generation number' '
     - '
     - 
     - test_expect_success 'detect incorrect generation number' '
     --	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
     -+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
     - 		"non-zero generation number"
     + 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
     +@@ t/t5318-commit-graph.sh: test_expect_success 'exit with correct error on bad input to --stdin-commits' '
     + 	# valid commit and tree OID
     + 	git rev-parse HEAD HEAD^{tree} >in &&
     + 	git commit-graph write --stdin-commits <in &&
     +-	graph_read_expect 3
     ++	graph_read_expect 3 generation_data
     + '
     + 
     + test_expect_success 'write graph' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "3"
     ++	graph_read_expect "3" generation_data
       '
       
     + test_expect_success POSIXPERM 'write graph has correct permissions' '
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with merges' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "10" "extra_edges"
     ++	graph_read_expect "10" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with new commit' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph with nothing new' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from latest pack with closure' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	cat new-idx | git commit-graph write --stdin-packs &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "9" "extra_edges"
     ++	graph_read_expect "9" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from commits with closure' '
     + 	git rev-parse merge/1 >>commits-in &&
     + 	cat commits-in | git commit-graph write --stdin-commits &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "6"
     ++	graph_read_expect "6" "generation_data"
     + '
     + 
     + graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph from commits with append' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "10" "extra_edges"
     ++	graph_read_expect "10" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'build graph using --reachable' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     + 	git commit-graph write --reachable &&
     + 	test_path_is_file $objdir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'write graph in bare repo' '
     + 	cd "$TRASH_DIRECTORY/bare" &&
     + 	git commit-graph write &&
     + 	test_path_is_file $baredir/info/commit-graph &&
     +-	graph_read_expect "11" "extra_edges"
     ++	graph_read_expect "11" "generation_data extra_edges"
     + '
     + 
     + graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
     +@@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit-graph' '
     + 
     + test_expect_success 'git commit-graph verify' '
     + 	cd "$TRASH_DIRECTORY/full" &&
     +-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
     +-	git commit-graph verify >output
     ++	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
     ++	git commit-graph verify >output &&
     ++	graph_read_expect 9 extra_edges
     + '
     + 
     + NUM_COMMITS=9
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     @@ t/t5324-split-commit-graph.sh: graph_read_expect() {
       	EOF
       	test-tool read-graph >output &&
       	test_cmp expect output
     +
     + ## t/t6600-test-reach.sh ##
     +@@ t/t6600-test-reach.sh: test_expect_success 'setup' '
     + 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
     + 	mv .git/objects/info/commit-graph commit-graph-half &&
     + 	chmod u+w commit-graph-half &&
     ++	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
     ++	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
     ++	chmod u+w commit-graph-no-gdat &&
     + 	git config core.commitGraph true
     + '
     + 
     +-run_three_modes () {
     ++run_all_modes () {
     + 	test_when_finished rm -rf .git/objects/info/commit-graph &&
     + 	"$@" <input >actual &&
     + 	test_cmp expect actual &&
     +@@ t/t6600-test-reach.sh: run_three_modes () {
     + 	test_cmp expect actual &&
     + 	cp commit-graph-half .git/objects/info/commit-graph &&
     + 	"$@" <input >actual &&
     ++	test_cmp expect actual &&
     ++	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
     ++	"$@" <input >actual &&
     + 	test_cmp expect actual
     + }
     + 
     +-test_three_modes () {
     +-	run_three_modes test-tool reach "$@"
     ++test_all_modes () {
     ++	run_all_modes test-tool reach "$@"
     + }
     + 
     + test_expect_success 'ref_newer:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:miss' '
     + 	B:commit-4-9
     + 	EOF
     + 	echo "ref_newer(A,B):0" >expect &&
     +-	test_three_modes ref_newer
     ++	test_all_modes ref_newer
     + '
     + 
     + test_expect_success 'ref_newer:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:hit' '
     + 	B:commit-2-3
     + 	EOF
     + 	echo "ref_newer(A,B):1" >expect &&
     +-	test_three_modes ref_newer
     ++	test_all_modes ref_newer
     + '
     + 
     + test_expect_success 'in_merge_bases:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:hit' '
     + 	B:commit-8-8
     + 	EOF
     + 	echo "in_merge_bases(A,B):1" >expect &&
     +-	test_three_modes in_merge_bases
     ++	test_all_modes in_merge_bases
     + '
     + 
     + test_expect_success 'in_merge_bases:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:miss' '
     + 	B:commit-5-9
     + 	EOF
     + 	echo "in_merge_bases(A,B):0" >expect &&
     +-	test_three_modes in_merge_bases
     ++	test_all_modes in_merge_bases
     + '
     + 
     + test_expect_success 'is_descendant_of:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:hit' '
     + 	X:commit-1-1
     + 	EOF
     + 	echo "is_descendant_of(A,X):1" >expect &&
     +-	test_three_modes is_descendant_of
     ++	test_all_modes is_descendant_of
     + '
     + 
     + test_expect_success 'is_descendant_of:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:miss' '
     + 	X:commit-7-6
     + 	EOF
     + 	echo "is_descendant_of(A,X):0" >expect &&
     +-	test_three_modes is_descendant_of
     ++	test_all_modes is_descendant_of
     + '
     + 
     + test_expect_success 'get_merge_bases_many' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_merge_bases_many' '
     + 		git rev-parse commit-5-6 \
     + 			      commit-4-7 | sort
     + 	} >expect &&
     +-	test_three_modes get_merge_bases_many
     ++	test_all_modes get_merge_bases_many
     + '
     + 
     + test_expect_success 'reduce_heads' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'reduce_heads' '
     + 			      commit-2-8 \
     + 			      commit-1-10 | sort
     + 	} >expect &&
     +-	test_three_modes reduce_heads
     ++	test_all_modes reduce_heads
     + '
     + 
     + test_expect_success 'can_all_from_reach:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:hit' '
     + 	Y:commit-8-1
     + 	EOF
     + 	echo "can_all_from_reach(X,Y):1" >expect &&
     +-	test_three_modes can_all_from_reach
     ++	test_all_modes can_all_from_reach
     + '
     + 
     + test_expect_success 'can_all_from_reach:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:miss' '
     + 	Y:commit-8-5
     + 	EOF
     + 	echo "can_all_from_reach(X,Y):0" >expect &&
     +-	test_three_modes can_all_from_reach
     ++	test_all_modes can_all_from_reach
     + '
     + 
     + test_expect_success 'can_all_from_reach_with_flag: tags case' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach_with_flag: tags case' '
     + 	Y:commit-8-1
     + 	EOF
     + 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
     +-	test_three_modes can_all_from_reach_with_flag
     ++	test_all_modes can_all_from_reach_with_flag
     + '
     + 
     + test_expect_success 'commit_contains:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:hit' '
     + 	X:commit-9-3
     + 	EOF
     + 	echo "commit_contains(_,A,X,_):1" >expect &&
     +-	test_three_modes commit_contains &&
     +-	test_three_modes commit_contains --tag
     ++	test_all_modes commit_contains &&
     ++	test_all_modes commit_contains --tag
     + '
     + 
     + test_expect_success 'commit_contains:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:miss' '
     + 	X:commit-9-3
     + 	EOF
     + 	echo "commit_contains(_,A,X,_):0" >expect &&
     +-	test_three_modes commit_contains &&
     +-	test_three_modes commit_contains --tag
     ++	test_all_modes commit_contains &&
     ++	test_all_modes commit_contains --tag
     + '
     + 
     + test_expect_success 'rev-list: basic topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: basic topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: first-parent topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent topo-order' '
     + 		commit-6-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
     ++	run_all_modes git rev-list --first-parent --topo-order commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     + 		commit-6-2 commit-5-2 commit-4-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: first-parent range topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent range topo-order' '
     + 		commit-6-2 \
     + 		commit-6-1 commit-5-1 commit-4-1 \
     + 	>expect &&
     +-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     ++	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: ancestry-path topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: ancestry-path topo-order' '
     + 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
     + 		commit-6-3 commit-5-3 commit-4-3 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     ++	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     + '
     + 
     + test_expect_success 'rev-list: symmetric difference topo-order' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: symmetric difference topo-order' '
     + 		commit-3-8 commit-2-8 commit-1-8 \
     + 		commit-3-7 commit-2-7 commit-1-7 \
     + 	>expect &&
     +-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
     ++	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
     + '
     + 
     + test_expect_success 'get_reachable_subset:all' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:all' '
     + 			      commit-1-7 \
     + 			      commit-5-6 | sort
     + 	) >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_expect_success 'get_reachable_subset:some' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:some' '
     + 		git rev-parse commit-3-3 \
     + 			      commit-1-7 | sort
     + 	) >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_expect_success 'get_reachable_subset:none' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:none' '
     + 	Y:commit-2-8
     + 	EOF
     + 	echo "get_reachable_subset(X,Y)" >expect &&
     +-	test_three_modes get_reachable_subset
     ++	test_all_modes get_reachable_subset
     + '
     + 
     + test_done
  6:  647290d036 !  6:  1aa2a00a7a commit-graph: implement corrected commit date offset
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: implement corrected commit date offset
     +    commit-graph: return 64-bit generation number
      
     -    With preparations done, let's implement corrected commit date offset.
     -    We add a new commit-slab to store topological levels while writing
     -    commit graph and upgrade number of struct commit_graph_data to 64-bits.
     -
     -    We have to touch many files, upgrading generation number from uint32_t
     -    to timestamp_t.
     -
     -    We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
     -    which tests if verify can detect if a commit graph have
     -    GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
     -    With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
     -    one of dates is Unix epoch zero.
     +    In a preparatory step, let's return timestamp_t values from
     +    commit_graph_generation(), use timestamp_t for local variables and
     +    define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     - ## blame.c ##
     -@@ blame.c: static int maybe_changed_path(struct repository *r,
     - 	if (!bd)
     - 		return 1;
     - 
     --	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
     -+	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
     - 		return 1;
     - 
     - 	filter = get_bloom_filter(r, origin->commit, 0);
     -
       ## commit-graph.c ##
     -@@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
     - /* Remember to update object flag allocation in object.h */
     - #define REACHABLE       (1u<<15)
     - 
     -+define_commit_slab(topo_level_slab, uint32_t);
     -+
     - /* Keep track of the order in which commits are added to our list. */
     - define_commit_slab(commit_pos, int);
     - static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
      @@ commit-graph.c: uint32_t commit_graph_position(const struct commit *c)
       	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
       }
     @@ commit-graph.c: uint32_t commit_graph_position(const struct commit *c)
       {
       	struct commit_graph_data *data =
       		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
     - 
     - 	if (!data)
     --		return GENERATION_NUMBER_INFINITY;
     -+		return GENERATION_NUMBER_V2_INFINITY;
     - 	else if (data->graph_pos == COMMIT_NOT_FROM_GRAPH)
     --		return GENERATION_NUMBER_INFINITY;
     -+		return GENERATION_NUMBER_V2_INFINITY;
     - 
     - 	return data->generation;
     - }
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
       int compare_commits_by_gen(const void *_a, const void *_b)
       {
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
       
      -	uint32_t generation_a = commit_graph_data_at(a)->generation;
      -	uint32_t generation_b = commit_graph_data_at(b)->generation;
     -+	timestamp_t generation_a = commit_graph_data_at(a)->generation;
     -+	timestamp_t generation_b = commit_graph_data_at(b)->generation;
     - 
     ++	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
     ++	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
       	/* lower generation commits first */
       	if (generation_a < generation_b)
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
     -@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     - 	item->date = (timestamp_t)((date_high << 32) | date_low);
     - 
     - 	if (g->chunk_generation_data)
     --		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	{
     -+		/* Read corrected commit date offset from GDAT */
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	}
     - 	else
     -+		/* Read topological level from CDAT */
     - 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     - }
     - 
     -@@ commit-graph.c: struct write_commit_graph_context {
     - 	struct progress *progress;
     - 	int progress_done;
     - 	uint64_t progress_cnt;
     -+	struct topo_level_slab *topo_levels;
     - 
     - 	char *base_graph_name;
     - 	int num_commit_graphs_before;
     -@@ commit-graph.c: static void write_graph_chunk_data(struct hashfile *f, int hash_len,
     - 		else
     - 			packedDate[0] = 0;
     - 
     --		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
     -+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
     - 
     - 		packedDate[1] = htonl((*list)->date);
     - 		hashwrite(f, packedDate, 8);
     -@@ commit-graph.c: static void write_graph_chunk_generation_data(struct hashfile *f,
     - 	struct commit **list = ctx->commits.list;
     - 	int count;
     - 	for (count = 0; count < ctx->commits.nr; count++, list++) {
     -+		timestamp_t offset = commit_graph_data_at(*list)->generation - (*list)->date;
     - 		display_progress(ctx->progress, ++ctx->progress_cnt);
     --		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
     -+
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     -+
     -+		hashwrite_be32(f, offset);
     - 	}
     - }
     - 
     -@@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *ctx)
     - 	stop_progress(&ctx->progress);
     - }
     - 
     --static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     -+static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
     - {
     - 	int i;
     - 	struct commit_list *list = NULL;
     + 		return -1;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 					_("Computing commit graph generation numbers"),
     - 					ctx->commits.nr);
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     --		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     + 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_INFINITY &&
     --		    generation != GENERATION_NUMBER_ZERO)
     -+		if (topo_level != GENERATION_NUMBER_INFINITY &&
     -+		    topo_level != GENERATION_NUMBER_ZERO)
     ++		if (generation != GENERATION_NUMBER_V1_INFINITY &&
     + 		    generation != GENERATION_NUMBER_ZERO)
       			continue;
       
     - 		commit_list_insert(ctx->commits.list[i], &list);
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 			struct commit *current = list->item;
     - 			struct commit_list *parent;
     - 			int all_parents_computed = 1;
     --			uint32_t max_generation = 0;
     -+			uint32_t max_level = 0;
     -+			timestamp_t max_corrected_commit_date = current->date;
     - 
       			for (parent = current->parents; parent; parent = parent->next) {
     --				generation = commit_graph_data_at(parent->item)->generation;
     -+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     + 				generation = commit_graph_data_at(parent->item)->generation;
       
      -				if (generation == GENERATION_NUMBER_INFINITY ||
     --				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (topo_level == GENERATION_NUMBER_INFINITY ||
     -+				    topo_level == GENERATION_NUMBER_ZERO) {
     ++				if (generation == GENERATION_NUMBER_V1_INFINITY ||
     + 				    generation == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
     - 					break;
     --				} else if (generation > max_generation) {
     --					max_generation = generation;
     -+				} else {
     -+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
     -+
     -+					if (topo_level > max_level)
     -+						max_level = topo_level;
     -+
     -+					if (data->generation > max_corrected_commit_date)
     -+						max_corrected_commit_date = data->generation;
     - 				}
     - 			}
     - 
     - 			if (all_parents_computed) {
     - 				struct commit_graph_data *data = commit_graph_data_at(current);
     - 
     --				data->generation = max_generation + 1;
     --				pop_commit(&list);
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     -+
     -+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
     -+				data->generation = max_corrected_commit_date + 1;
     - 
     --				if (data->generation > GENERATION_NUMBER_MAX)
     --					data->generation = GENERATION_NUMBER_MAX;
     -+				pop_commit(&list);
     - 			}
     - 		}
     - 	}
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	uint32_t i, count_distinct = 0;
     - 	int res = 0;
     - 	int replace = 0;
     -+	struct topo_level_slab topo_levels;
     - 
     - 	if (!commit_graph_compatible(the_repository))
     - 		return 0;
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
     - 	ctx->total_bloom_filter_data_size = 0;
     - 
     -+	init_topo_level_slab(&topo_levels);
     -+	ctx->topo_levels = &topo_levels;
     -+
     - 	if (ctx->split) {
     - 		struct commit_graph *g;
     - 		prepare_commit_graph(ctx->r);
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	} else
     - 		ctx->num_commit_graphs_after = 1;
     - 
     --	compute_generation_numbers(ctx);
     -+	compute_corrected_commit_date_offsets(ctx);
     - 
     - 	if (ctx->changed_paths)
     - 		compute_bloom_filters(ctx);
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       	for (i = 0; i < g->num_commits; i++) {
       		struct commit *graph_commit, *odb_commit;
       		struct commit_list *graph_parents, *odb_parents;
      -		uint32_t max_generation = 0;
      -		uint32_t generation;
     -+		timestamp_t max_parent_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     ++		timestamp_t max_generation = 0;
     ++		timestamp_t generation;
       
       		display_progress(progress, i + 1);
       		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_parent_corrected_commit_date)
     -+				max_parent_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     - 
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     --			max_generation--;
     --
     --		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_parent_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
      
       ## commit-graph.h ##
      @@ commit-graph.h: void disable_commit_graph(struct repository *r);
     @@ commit-reach.c: static int queue_has_nonstale(struct prio_queue *queue)
       	struct commit_list *result = NULL;
       	int i;
      -	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t last_gen = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
       
       	if (!min_generation)
       		queue.compare = compare_commits_by_commit_date;
     @@ commit-reach.c: int repo_in_merge_bases_many(struct repository *r, struct commit
       	struct commit_list *bases;
       	int ret = 0, i;
      -	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t generation, min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
       
       	if (repo_parse_commit(r, commit))
       		return ret;
     @@ commit-reach.c: static enum contains_result contains_tag_algo(struct commit *can
       	struct contains_stack contains_stack = { 0, 0, NULL };
       	enum contains_result result;
      -	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t cutoff = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
       	const struct commit_list *p;
       
       	for (p = want; p; p = p->next) {
     @@ commit-reach.c: int can_all_from_reach(struct commit_list *from, struct commit_l
       	struct commit_list *from_iter = from, *to_iter = to;
       	int result;
      -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
       
       	while (from_iter) {
       		add_object_array(&from_iter->item->object, NULL, &from_objs);
     @@ commit-reach.c: struct commit_list *get_reachable_subset(struct commit **from, i
       	struct commit **to_last = to + nr_to;
       	struct commit **from_last = from + nr_from;
      -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
     ++	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
       	int num_to_find = 0;
       
       	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
     @@ commit-reach.h: int can_all_from_reach_with_flag(struct object_array *from,
      
       ## commit.h ##
      @@
     + #include "commit-slab.h"
     + 
     + #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
     +-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
     ++#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
     ++#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
       #define GENERATION_NUMBER_MAX 0x3FFFFFFF
       #define GENERATION_NUMBER_ZERO 0
       
     -+#define GENERATION_NUMBER_V2_INFINITY ((1ULL << 63) - 1)
     -+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
     -+
     - struct commit_list {
     - 	struct commit *item;
     - 	struct commit_list *next;
      
       ## revision.c ##
     -@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
     - 	if (!revs->repo->objects->commit_graph)
     - 		return -1;
     - 
     --	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
     -+	if (commit_graph_generation(commit) == GENERATION_NUMBER_V2_INFINITY)
     - 		return -1;
     - 
     - 	filter = get_bloom_filter(revs->repo, commit, 0);
      @@ revision.c: define_commit_slab(indegree_slab, int);
       define_commit_slab(author_date_slab, timestamp_t);
       
     @@ revision.c: static void indegree_walk_step(struct rev_info *revs)
       	struct topo_walk_info *info = revs->topo_walk_info;
       	struct commit *c;
      @@ revision.c: static void init_topo_walk(struct rev_info *revs)
     - 	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
     - 	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
     - 
     --	info->min_generation = GENERATION_NUMBER_INFINITY;
     -+	info->min_generation = GENERATION_NUMBER_V2_INFINITY;
     + 	info->min_generation = GENERATION_NUMBER_INFINITY;
       	for (list = revs->commits; list; list = list->next) {
       		struct commit *c = list->item;
      -		uint32_t generation;
     @@ revision.c: static void expand_topo_walk(struct rev_info *revs, struct commit *c
       		if (parent->object.flags & UNINTERESTING)
       			continue;
      
     - ## t/t5318-commit-graph.sh ##
     -@@ t/t5318-commit-graph.sh: test_expect_success 'detect incorrect generation number' '
     - 		"generation for commit"
     - '
     - 
     --test_expect_success 'detect incorrect generation number' '
     -+test_expect_failure 'detect incorrect generation number' '
     - 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
     - 		"non-zero generation number"
     - '
     -
       ## upload-pack.c ##
      @@ upload-pack.c: static int got_oid(struct upload_pack_data *data,
       
  -:  ---------- >  7:  bfe1473201 commit-graph: implement corrected commit date
  -:  ---------- >  8:  833779ad53 commit-graph: handle mixed generation commit chains
  -:  ---------- >  9:  58a2d5da01 commit-reach: use corrected commit dates in paint_down_to_common()
  -:  ---------- > 10:  4c34294602 doc: add corrected commit date info

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..ace7400a1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 02/10] revision: parse parent in indegree_walk_step()
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 6de29cdf7a..4ec82ed5ab 100644
--- a/revision.c
+++ b/revision.c
@@ -3365,6 +3365,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_in_graph() guarantees the values of
graph position and generation number, and did not load timestamp.
However, with corrected commit date we will need load the timestamp as
well to populate the generation number.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 29 +++++++++++------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ace7400a1a..af8d9cc45e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index af8d9cc45e..fb6e2bf18f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..701e3d41aa 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -146,4 +146,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 7128895c3a..bed63b41fb 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:28     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number was to distinguish
between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 40 +++++++++++++++++++---
 commit-graph.h                |  2 ++
 t/README                      |  3 ++
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 +--
 t/t5318-commit-graph.sh       | 27 +++++++--------
 t/t5324-split-commit-graph.sh | 12 +++----
 t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
 8 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb6e2bf18f..d5da1e8028 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -392,6 +393,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -758,7 +766,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -951,7 +962,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
@@ -1105,8 +1117,21 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite_be32(f, commit_graph_data_at(c)->generation);
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1710,6 +1735,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2113,6 +2144,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
diff --git a/commit-graph.h b/commit-graph.h
index 701e3d41aa..cc232e0678 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -67,6 +68,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/README b/t/README
index 70ec61cf88..6647ef132e 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..55c94e9ebd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2804b0dd45..fef05c33d7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -72,7 +72,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -99,14 +99,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -215,7 +215,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -250,7 +250,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -260,7 +260,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -270,7 +270,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -283,7 +283,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -293,7 +293,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -303,7 +303,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -324,7 +324,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -421,8 +421,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..6b25c3d9ce 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_init &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -29,9 +29,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 475564bee7..d14b129f06 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 06/10] commit-graph: return 64-bit generation number
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 18 +++++++++---------
 commit-graph.h |  4 ++--
 commit-reach.c | 32 ++++++++++++++++----------------
 commit-reach.h |  2 +-
 commit.h       |  3 ++-
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 7 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d5da1e8028..42f3ec5460 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -100,7 +100,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -116,8 +116,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -160,8 +160,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1363,7 +1363,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
+		if (generation != GENERATION_NUMBER_V1_INFINITY &&
 		    generation != GENERATION_NUMBER_ZERO)
 			continue;
 
@@ -1377,7 +1377,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			for (parent = current->parents; parent; parent = parent->next) {
 				generation = commit_graph_data_at(parent->item)->generation;
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
+				if (generation == GENERATION_NUMBER_V1_INFINITY ||
 				    generation == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
@@ -2387,8 +2387,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
diff --git a/commit-graph.h b/commit-graph.h
index cc232e0678..f89614ecd5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -140,13 +140,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..470bc80139 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..bc0732a4fe 100644
--- a/commit.h
+++ b/commit.h
@@ -11,7 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
diff --git a/revision.c b/revision.c
index 4ec82ed5ab..bd7b39c806 100644
--- a/revision.c
+++ b/revision.c
@@ -3292,7 +3292,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3338,7 +3338,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3381,7 +3381,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3439,7 +3439,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3500,7 +3500,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 8673741070..18ee29db67 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -490,7 +490,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 14:23     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date
offset. We add a new commit-slab to store topogical levels while
writing commit graph and upgrade the generation member in struct
commit_graph_data to a 64-bit timestamp. We store topological levels to
ensure that older versions of Git will still have the performance
benefits from generation number v2.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 89 ++++++++++++++++++++++++++++----------------------
 commit.h       |  1 +
 2 files changed, 51 insertions(+), 39 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 42f3ec5460..d0f977852b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,6 +65,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -168,11 +170,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -767,7 +764,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->chunk_generation_data)
-		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	{
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -948,6 +948,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	struct topo_level_slab *topo_levels;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1106,7 +1107,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1123,8 +1124,13 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, commit_graph_data_at(c)->generation);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+
+		hashwrite_be32(f, offset);
 	}
 
 	return 0;
@@ -1360,11 +1366,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_V1_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (topo_level != GENERATION_NUMBER_V1_INFINITY &&
+		    topo_level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1372,29 +1378,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = current->date - 1;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_V1_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (topo_level == GENERATION_NUMBER_V1_INFINITY ||
+				    topo_level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else {
+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
+
+					if (topo_level > max_level)
+						max_level = topo_level;
+
+					if (data->generation > max_corrected_commit_date)
+						max_corrected_commit_date = data->generation;
 				}
 			}
 
 			if (all_parents_computed) {
 				struct commit_graph_data *data = commit_graph_data_at(current);
 
-				data->generation = max_generation + 1;
-				pop_commit(&list);
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+				data->generation = max_corrected_commit_date + 1;
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				pop_commit(&list);
 			}
 		}
 	}
@@ -2132,6 +2147,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2146,6 +2162,9 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->total_bloom_filter_data_size = 0;
 	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
@@ -2387,8 +2406,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		timestamp_t max_generation = 0;
-		timestamp_t generation;
+		timestamp_t max_parent_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2427,9 +2446,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_parent_corrected_commit_date)
+				max_parent_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2451,20 +2470,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_parent_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
diff --git a/commit.h b/commit.h
index bc0732a4fe..bb846e0025 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,7 @@
 #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
 
 struct commit_list {
 	struct commit *item;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:42     ` Derrick Stolee
  2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As corrected commit dates and topological levels cannot be compared
directly, we must handle commit graph chains with mixed generation
number definitions.

While reading a commit graph file, we disable generation numbers if the
chain contains mixed generation numbers.

While writing to commit graph chain, we write generation data chunk only
if the previous tip of chain had a generation data chunk. Using
`--split=replace` overwrites the existing chain and writes generation
data chunk regardless of previous tip.

In t5324-split-commit-graph, we set up a repo with twelve commits and
write a base commit graph file with no generation data chunk. When add
three commits and write to chain again, Git does not write generation
data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
as we replace the existing chain, Git writes a commit graph file with
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 14 ++++++++
 t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d0f977852b..c6b6111adf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,6 +674,14 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
+	/* We cannot compare topological levels and corrected commit dates */
+	while (g->base_graph) {
+		warning(_("commit-graph-chain contains mixed generation versions"));
+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
+			return 0;
+		g = g->base_graph;
+	}
+
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -2186,6 +2194,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2204,6 +2215,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 6b25c3d9ce..1a9be5e656 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,70 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 3 0
+	num_commits: 12
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split='replace' &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

t6024-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base. As this has
already caused problems (as noted in 859fdc0 (commit-graph: define
GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
test script.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6024-recursive-merge.sh |  4 +++-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c6b6111adf..eb78af3dad 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -688,6 +688,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return !!g->chunk_generation_data;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index f89614ecd5..d3a485faa6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -89,6 +89,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/commit-reach.c b/commit-reach.c
index 470bc80139..3a1b925274 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 332cfc53fd..d3def66e7d 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v2 10/10] doc: add corrected commit date info
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-09  2:53   ` Abhishek Kumar via GitGitGadget
  2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-09  2:53 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 12 ++---
 Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
 2 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..71c43884ec 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -88,6 +84,12 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * This list can be later modified to store future generation number related
+      data.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 808fa30b99..f27145328c 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates
+2. Topological levels
+
+Define "corrected committer date" of a commit recursively as follows:
+
+  * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
+
+  * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+Define the "topological level" of a commit recursively as follows:
 
  * A commit with no parents (a root commit) has generation number one.
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
 
-Equivalently, the generation number of a commit A is one more than the
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
-violate topological relationships due to clock skew (such as "git log"
-with default order), but is not used when the topological order is
-required (such as merge base calculations, "git log --graph").
-
 In practice, we expect some commits to be created recently and not stored
 in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
+are computed to be at least this value. We limit at this value since it is
+the largest value that can be stored in the commit-graph file using the
+available to generation numbers. This presents another case where a
+commit can have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+We also merge commit-graph chains when we try to write a commit graph with
+two different generation number definitions as they cannot be compared directly.
+We overwrite the existing chain and create a commit-graph with the newer or more
+efficient defintion. For example, overwriting topological levels commit graph
+chain to create a corrected commit dates commit graph chain.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-10 14:23     ` Derrick Stolee
  2020-08-14  4:59       ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-08-10 14:23 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With most of preparations done, let's implement corrected commit date
> offset. We add a new commit-slab to store topogical levels while
> writing commit graph and upgrade the generation member in struct
> commit_graph_data to a 64-bit timestamp. We store topological levels to
> ensure that older versions of Git will still have the performance
> benefits from generation number v2.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

> @@ -767,7 +764,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
>  	if (g->chunk_generation_data)
> -		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	{
> +		graph_data->generation = item->date +
> +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	}

You don't need curly braces here, since this is only one line
in the block. Even if you did, these braces are in the wrong
location.

There is a subtle issue with this interpretation, and it
involves the case where the following happens:

1. A new version of Git writes a commit-graph using the
   GDAT chunk.

2. An older version of Git adds a new layer without the
   GDAT chunk.

At that point, the tip commit-graph does not have GDAT,
so the commits in that layer will get "generation" set
with the topological level, which is likely to be much
lower than the corrected commit dates set in the
"generation" field for commits in the lower layer.

The crux of the issue is that we are only considering
the current layer when interpreting the generation number
value.

The patch below inserts a flag into fill_commit_graph_info()
corresponding to the "global" state of whether the top
commit-graph layer has a GDAT chunk. By your later protection
to not write GDAT chunks on top of commit-graphs without
a GDAT chunk, this top commit-graph has all of the information
we need for this check.

Thanks,
-Stolee

--- >8 ---

From 62189709fad3b051cedbd36193f5244fcce17e1f Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 10 Aug 2020 10:06:47 -0400
Subject: [PATCH] commit-graph: use generation v2 only if entire chain does

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

 1. "New" Git writes a commit-graph with the GDAT chunk.
 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
generation_data_chunk pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this
case, especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index eb78af3dad..17623274d9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -762,7 +762,9 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	return &commit_list_insert(c, pptr)->next;
 }

-static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
+
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos, int flags)
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
@@ -785,11 +787,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);

-	if (g->chunk_generation_data)
-	{
+	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
-	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -799,6 +799,10 @@ static inline void set_commit_tree(struct commit *c, struct tree *t)
 	c->maybe_tree = t;
 }

+/*
+ * In the case of a split commit-graph, this method expects the given
+ * commit-graph 'g' to be the top layer.
+ */
 static int fill_commit_in_graph(struct repository *r,
 				struct commit *item,
 				struct commit_graph *g, uint32_t pos)
@@ -808,11 +812,15 @@ static int fill_commit_in_graph(struct repository *r,
 	struct commit_list **pptr;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
+	int flags = 0;
+
+	if (g->chunk_generation_data)
+		flags |= COMMIT_GRAPH_GENERATION_V2;

 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;

-	fill_commit_graph_info(item, g, pos);
+	fill_commit_graph_info(item, g, pos, flags);

 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
@@ -904,10 +912,14 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
 void load_commit_graph_info(struct repository *r, struct commit *item)
 {
 	uint32_t pos;
+	int flags = 0;
+
 	if (!prepare_commit_graph(r))
 		return;
+	if (r->objects->commit_graph->chunk_generation_data)
+		flags |= COMMIT_GRAPH_GENERATION_V2;
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
-		fill_commit_graph_info(item, r->objects->commit_graph, pos);
+		fill_commit_graph_info(item, r->objects->commit_graph, pos, flags);
 }

 static struct tree *load_tree_for_commit(struct repository *r,
-- 
2.28.0.38.gc6f546511c1

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:28     ` Derrick Stolee
  2020-08-11 11:03       ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:28 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.
> 
> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.
> 
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).

There is a philosophical problem with this patch, and I'm not sure
about the right way to fix it, or if there really is a problem at all.
At minimum, the commit message needs to be improved to make the issue
clear:

This version of the chunk does not store corrected commit date offsets!

This commit add a chunk named "GDAT" and fills it with topological
levels. This is _different_ than the intended final format. For that
reason, the commit-graph-format.txt document is not updated.

The reason I say this is a "philosophical" problem is that this patch
introduces a version of Git that has a different interpretation of the
GDAT chunk than the version presented two patches later. While this
version would never be released, it still exists in history and could
present difficulty if someone were to bisect on an issue with the GDAT
chunk (using external data, not data produced by the compiled binary
at that version).

The justification for this commit the way you did it is clear: there
is a lot of test fallout to just including a new chunk. The question
is whether it is enough to justify this "dummy" implementation for
now?

The tricky bit is the series of three patches starting with this
one.

1. The next patch "commit-graph: return 64-bit generation number" can
   be reordered to be before this patch, no problem. I don't think
   there will be any text conflicts _except_ inside the
   write_graph_chunk_generation_data() method introduced here.

2. The patch after that, "commit-graph: implement corrected commit date"
   only has a small dependence: it writes to the GDAT chunk and parses
   it out. If you remove the interaction with the GDAT chunk, then you
   still have the computation as part of compute_generation_numbers()
   that is valuable. You will need to be careful about the exit
   condition, though, since you also introduce the topo_level chunk.

Patches 5-7 could perhaps be reorganized as follows:

  i. commit-graph: return 64-bit generation number, as-is.

 ii. Add a topo_level slab that is parsed from CDAT. Modify
     compute_generation_numbers() to populate this value and modify
     write_graph_chunk_data() to read this value. Simultaneously
     populate the "generation" member with the same value.

iii. "commit-graph: implement corrected commit date" without any GDAT
     chunk interaction. Make sure the algorithm in
     compute_generation_numbers() walks commits if either topo_level or
     generation are unset. There is a trick here: the generation value
     _is_ set if the commit is parsed from the existing commit-graph!
     Is this case covered by the existing logic to not write GDAT when
     writing a split commit-graph file with a base that does not have
     GDAT? Note that the non-split case does not load the commit-graph
     for parsing, so the interesting case is "--split-replace". Worth
     a test (after we write the GDAT chunk), which you have in "commit-graph:
     handle mixed generation commit chains".

 iv. This patch, introducing the chunk and the read/write logic.

  v. Add the remaining patches.

Again, this is a complicated patch-reorganization. The hope is that
the end result is something that is easy to review as well as something
that produces an as-sane-as-possible history for future bisecters.

Perhaps other reviewers have similar feelings, or can say that I am
being too picky.

> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.

Thank you for introducing this. It really makes it clear what the
benefit is when looking at the t6600-test-reach.sh changes. However,
the changes to that script are more "here is an opportunity for extra
coverage" as opposed to a necessary change immediately upon creating
the GDAT chunk. That could be separated out and justified on its own.
Recall that the justification is that the new version of Git will
continue to work with commit-graph files without a GDAT chunk.

> +static int write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +		hashwrite_be32(f, commit_graph_data_at(c)->generation);

Here is the "incorrect" data being written.

> +	}
> +
> +	return 0;
> +}
> +

> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -72,7 +72,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>  graph_read_expect() {
>  	OPTIONAL=""
>  	NUM_CHUNKS=3
> -	if test ! -z $2
> +	if test ! -z "$2"

A subtle change, but important because we now have multiple "extra"
chunks possible here. Good.

>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -421,8 +421,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
>  
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> -	git commit-graph verify >output
> +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> +	git commit-graph verify >output &&
> +	graph_read_expect 9 extra_edges
>  '

And it is this case as to why we don't just add "generation_data" to our
list of expected chunks.

> @@ -29,9 +29,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data

In this script, you _do_ add it to the default chunk list, which
saves some extra work in the rest of the tests. Good.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:42     ` Derrick Stolee
  2020-08-11 11:36       ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:42 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> As corrected commit dates and topological levels cannot be compared
> directly, we must handle commit graph chains with mixed generation
> number definitions.
> 
> While reading a commit graph file, we disable generation numbers if the
> chain contains mixed generation numbers.
> 
> While writing to commit graph chain, we write generation data chunk only
> if the previous tip of chain had a generation data chunk. Using
> `--split=replace` overwrites the existing chain and writes generation
> data chunk regardless of previous tip.
> 
> In t5324-split-commit-graph, we set up a repo with twelve commits and
> write a base commit graph file with no generation data chunk. When add
> three commits and write to chain again, Git does not write generation
> data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
> as we replace the existing chain, Git writes a commit graph file with
> generation data chunk.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 14 ++++++++
>  t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index d0f977852b..c6b6111adf 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -674,6 +674,14 @@ int generation_numbers_enabled(struct repository *r)
>  	if (!g->num_commits)
>  		return 0;
>  
> +	/* We cannot compare topological levels and corrected commit dates */
> +	while (g->base_graph) {
> +		warning(_("commit-graph-chain contains mixed generation versions"));

This warning is premature. It will add a warning whenever we have
a split commit-graph, regardless of an incorrect chain.

> +		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))

Hm. A bit-wise XOR here? That seems unfortunate. I think that it
is easier to focus on the 



> +			return 0;
> +		g = g->base_graph;
> +	}
> +

Hm. So this scenario actually disables generation numbers completely
in the event that anything in the chain disagrees. I think this is
not the right way to approach the situation, as it will significantly
punish users in this state with slow performance.

The patch I sent [1] is probably better: it uses generation number
v1 if the tip of the chain does not have a GDAT chunk.

[1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/

>  	first_generation = get_be32(g->chunk_commit_data +
>  				    g->hash_len + 8) >> 2;
>  
> @@ -2186,6 +2194,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2204,6 +2215,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}

Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
from its current location to after this line. We want to make sure that the
environment variable is checked _last_. The best location is likely the start
of the implementation of compute_generation_numbers(), or immediately before
the call to the method.

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 3 0
> +	num_commits: 12
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split='replace' &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15
> +'

It would be valuable to double-check here that the values in the GDAT chunk
are correct. I'm concerned about the possibility that the 'generation'
member of struct commit_graph_data gets filled with topological level during
parsing and then that is written as an offset into the CDAT chunk.

Perhaps this is best left for a follow-up series that updates the 'verify'
subcommand to check the GDAT chunk.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-10 16:47   ` Derrick Stolee
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-08-10 16:47 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
> 
> I look forward to everyone's reviews!
> 
> Thanks
> 
>  * Abhishek
> 
> 
> ----------------------------------------------------------------------------
> 
> Changes in version 2:
> 
>  * Add tests for generation data chunk.
>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.
>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().
>  * Update technical documentation.
>  * Handle mixed graph version commit chains.
>  * Improve commit messages for
>  * Revert unnecessary whitespace changes.
>  * Split uint_32 -> timestamp_t change into a new commit.

This version looks to be in really good shape, for the most part. I feel
the need to point out that this is a very mature submission, and it covers
many of the bases we expect from a quality series, even at only the second
version.

My comments deal with some high-level "patch series strategy" considerations,
mostly. There are a few concerns about correctness in the complicated cases
of mixed commit-graphs with or without the GDAT chunk, and how we could
perhaps test those cases. The best way would be to update the 'verify'
subcommand to check the GDAT chunk. That's a good feature to have, but would
be unnecessary bloat to this series. Depending on timing, we might want to
hold this series in 'next' until such an implementation is submitted. (I
expect that Abhishek's GSoC internship would be over at that point, so I
volunteer to send those patches after this series stabilizes.)

Thank you for your careful work on such a complicated subject!

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-10 16:28     ` Derrick Stolee
@ 2020-08-11 11:03       ` Abhishek Kumar
  2020-08-11 12:27         ` Derrick Stolee
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-11 11:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: jnareb, abhishekkumar8222, git, me, gitgitgadget

On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > As discovered by Ævar, we cannot increment graph version to
> > distinguish between generation numbers v1 and v2 [1]. Thus, one of
> > pre-requistes before implementing generation number was to distinguish
> > between graph versions in a backwards compatible manner.
> > 
> > We are going to introduce a new chunk called Generation Data chunk (or
> > GDAT). GDAT stores generation number v2 (and any subsequent versions),
> > whereas CDAT will still store topological level.
> > 
> > Old Git does not understand GDAT chunk and would ignore it, reading
> > topological levels from CDAT. New Git can parse GDAT and take advantage
> > of newer generation numbers, falling back to topological levels when
> > GDAT chunk is missing (as it would happen with a commit graph written
> > by old Git).
> 
> There is a philosophical problem with this patch, and I'm not sure
> about the right way to fix it, or if there really is a problem at all.
> At minimum, the commit message needs to be improved to make the issue
> clear:
> 
> This version of the chunk does not store corrected commit date offsets!
> 
> This commit add a chunk named "GDAT" and fills it with topological
> levels. This is _different_ than the intended final format. For that
> reason, the commit-graph-format.txt document is not updated.
> 
> The reason I say this is a "philosophical" problem is that this patch
> introduces a version of Git that has a different interpretation of the
> GDAT chunk than the version presented two patches later. While this
> version would never be released, it still exists in history and could
> present difficulty if someone were to bisect on an issue with the GDAT
> chunk (using external data, not data produced by the compiled binary
> at that version).
> 

Yes, that is correct. I did often wonder that our inference that "commit
graph has a generation data chunk implies commit graph stores corrected
commit date offsets" is not always true because of this "dummy"
implementation. 

> The justification for this commit the way you did it is clear: there
> is a lot of test fallout to just including a new chunk. The question
> is whether it is enough to justify this "dummy" implementation for
> now?
> 
> The tricky bit is the series of three patches starting with this
> one.
> 
> 1. The next patch "commit-graph: return 64-bit generation number" can
>    be reordered to be before this patch, no problem. I don't think
>    there will be any text conflicts _except_ inside the
>    write_graph_chunk_generation_data() method introduced here.
> 
> 2. The patch after that, "commit-graph: implement corrected commit date"
>    only has a small dependence: it writes to the GDAT chunk and parses
>    it out. If you remove the interaction with the GDAT chunk, then you
>    still have the computation as part of compute_generation_numbers()
>    that is valuable. You will need to be careful about the exit
>    condition, though, since you also introduce the topo_level chunk.
> 
> Patches 5-7 could perhaps be reorganized as follows:
> 
>   i. commit-graph: return 64-bit generation number, as-is.
> 
>  ii. Add a topo_level slab that is parsed from CDAT. Modify
>      compute_generation_numbers() to populate this value and modify
>      write_graph_chunk_data() to read this value. Simultaneously
>      populate the "generation" member with the same value.
> 
> iii. "commit-graph: implement corrected commit date" without any GDAT
>      chunk interaction. Make sure the algorithm in
>      compute_generation_numbers() walks commits if either topo_level or
>      generation are unset. There is a trick here: the generation value
>      _is_ set if the commit is parsed from the existing commit-graph!
>      Is this case covered by the existing logic to not write GDAT when
>      writing a split commit-graph file with a base that does not have
>      GDAT? Note that the non-split case does not load the commit-graph
>      for parsing, so the interesting case is "--split-replace". Worth
>      a test (after we write the GDAT chunk), which you have in "commit-graph:
>      handle mixed generation commit chains".
> 

Right, so at the end of this patch we compute corrected commit dates but
don't write them to graph file.

Although, writing ii. and iii. together in the same patch makes more
sense to me. Would it be hard to follow for someone who has no context
of this discussion?

>  iv. This patch, introducing the chunk and the read/write logic.
> 
>   v. Add the remaining patches.
> 
> Again, this is a complicated patch-reorganization. The hope is that
> the end result is something that is easy to review as well as something
> that produces an as-sane-as-possible history for future bisecters.
> 
> Perhaps other reviewers have similar feelings, or can say that I am
> being too picky.
> 

I can see how the reorganization helps us avoid a rather nasty
situation to be in. Should not be too hard to reorganize.

> > We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> > which forces commit-graph file to be written without generation data
> > chunk to emulate a commit-graph file written by old Git.
> 
> ...
> 
> Thanks,
> -Stolee

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-10 16:42     ` Derrick Stolee
@ 2020-08-11 11:36       ` Abhishek Kumar
  2020-08-11 12:43         ` Derrick Stolee
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-11 11:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, me, jnareb, gitgitgadget

On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> 
> ...
>
> Hm. So this scenario actually disables generation numbers completely
> in the event that anything in the chain disagrees. I think this is
> not the right way to approach the situation, as it will significantly
> punish users in this state with slow performance.
> 
> The patch I sent [1] is probably better: it uses generation number
> v1 if the tip of the chain does not have a GDAT chunk.
> 
> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
> 

Yes, the patch is an clear improvement over my (convoluted and incorrect)
logic. Will add.

>
> ...
> 
> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
> from its current location to after this line. We want to make sure that the
> environment variable is checked _last_. The best location is likely the start
> of the implementation of compute_generation_numbers(), or immediately before
> the call to the method.
> 

Sure, will do.

>
> ...
> 
> It would be valuable to double-check here that the values in the GDAT chunk
> are correct. I'm concerned about the possibility that the 'generation'
> member of struct commit_graph_data gets filled with topological level during
> parsing and then that is written as an offset into the CDAT chunk.
> 
> Perhaps this is best left for a follow-up series that updates the 'verify'
> subcommand to check the GDAT chunk.

If I can understand it correctly, one of ways to update 'verify'
subcommand to check the GDAT chunk as well would to be make use of the
flag variable introduced in your patch. We can isolate generation number
related checks and run checks once with flag = 1 (checking corrected
commit dates) and once with flag = 0 (checking topological levels).

This has the unfortunate effect of filling all commits twice, but as we
cannot change the commit_graph_data->generation any other way, I see no
alternatives without changing how commit_graph_generation() works.

Would it make more sense if we add the flag to struct commit_graph
instead of making it depend solely on g->chunk_generation_data and set
it within parse_commit_graph()?

We would be able to control the behavior of fill_commit_graph_info() and
we will not need to check g->chunk_generation_data before filling every
commit.

> 
> Thanks,
> -Stolee
> 

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-11 11:03       ` Abhishek Kumar
@ 2020-08-11 12:27         ` Derrick Stolee
  2020-08-11 18:58           ` Taylor Blau
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-08-11 12:27 UTC (permalink / raw)
  To: aee0ae56-3395-6848-d573-27a318d72755
  Cc: jnareb, abhishekkumar8222, git, me, gitgitgadget

On 8/11/2020 7:03 AM, Abhishek Kumar wrote:
> On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
>> Patches 5-7 could perhaps be reorganized as follows:
>>
>>   i. commit-graph: return 64-bit generation number, as-is.
>>
>>  ii. Add a topo_level slab that is parsed from CDAT. Modify
>>      compute_generation_numbers() to populate this value and modify
>>      write_graph_chunk_data() to read this value. Simultaneously
>>      populate the "generation" member with the same value.
>>
>> iii. "commit-graph: implement corrected commit date" without any GDAT
>>      chunk interaction. Make sure the algorithm in
>>      compute_generation_numbers() walks commits if either topo_level or
>>      generation are unset. There is a trick here: the generation value
>>      _is_ set if the commit is parsed from the existing commit-graph!
>>      Is this case covered by the existing logic to not write GDAT when
>>      writing a split commit-graph file with a base that does not have
>>      GDAT? Note that the non-split case does not load the commit-graph
>>      for parsing, so the interesting case is "--split-replace". Worth
>>      a test (after we write the GDAT chunk), which you have in "commit-graph:
>>      handle mixed generation commit chains".
>>
> 
> Right, so at the end of this patch we compute corrected commit dates but
> don't write them to graph file.
> 
> Although, writing ii. and iii. together in the same patch makes more
> sense to me. Would it be hard to follow for someone who has no context
> of this discussion?

It is always easier to combine two patches than to split one into two.

With that in mind, I recommend starting with a split version and then
seeing how each patch looks. I think that these are "independent enough"
ideas that justify the separate patches.

>>  iv. This patch, introducing the chunk and the read/write logic.
>>
>>   v. Add the remaining patches.
>>
>> Again, this is a complicated patch-reorganization. The hope is that
>> the end result is something that is easy to review as well as something
>> that produces an as-sane-as-possible history for future bisecters.
>>
>> Perhaps other reviewers have similar feelings, or can say that I am
>> being too picky.
>>
> 
> I can see how the reorganization helps us avoid a rather nasty
> situation to be in. Should not be too hard to reorganize.

I hope not. Hopefully you get some more review on this version
before jumping in on such a big reorg (in case someone else has
a different opinion).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 08/10] commit-graph: handle mixed generation commit chains
  2020-08-11 11:36       ` Abhishek Kumar
@ 2020-08-11 12:43         ` Derrick Stolee
  0 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-08-11 12:43 UTC (permalink / raw)
  To: 0d741fb2-e25a-be05-9f2b-81ba2b4ced3f
  Cc: abhishekkumar8222, git, me, jnareb, gitgitgadget

On 8/11/2020 7:36 AM, Abhishek Kumar wrote:
> On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
>> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
>>
>> ...
>>
>> Hm. So this scenario actually disables generation numbers completely
>> in the event that anything in the chain disagrees. I think this is
>> not the right way to approach the situation, as it will significantly
>> punish users in this state with slow performance.
>>
>> The patch I sent [1] is probably better: it uses generation number
>> v1 if the tip of the chain does not have a GDAT chunk.
>>
>> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
>>
> 
> Yes, the patch is an clear improvement over my (convoluted and incorrect)
> logic. Will add.
> 
>>
>> ...
>>
>> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> from its current location to after this line. We want to make sure that the
>> environment variable is checked _last_. The best location is likely the start
>> of the implementation of compute_generation_numbers(), or immediately before
>> the call to the method.
>>
> 
> Sure, will do.
> 
>>
>> ...
>>
>> It would be valuable to double-check here that the values in the GDAT chunk
>> are correct. I'm concerned about the possibility that the 'generation'
>> member of struct commit_graph_data gets filled with topological level during
>> parsing and then that is written as an offset into the CDAT chunk.
>>
>> Perhaps this is best left for a follow-up series that updates the 'verify'
>> subcommand to check the GDAT chunk.
> 
> If I can understand it correctly, one of ways to update 'verify'
> subcommand to check the GDAT chunk as well would to be make use of the
> flag variable introduced in your patch. We can isolate generation number
> related checks and run checks once with flag = 1 (checking corrected
> commit dates) and once with flag = 0 (checking topological levels).
> 
> This has the unfortunate effect of filling all commits twice, but as we
> cannot change the commit_graph_data->generation any other way, I see no
> alternatives without changing how commit_graph_generation() works.
> 
> Would it make more sense if we add the flag to struct commit_graph
> instead of making it depend solely on g->chunk_generation_data and set
> it within parse_commit_graph()?
> 
> We would be able to control the behavior of fill_commit_graph_info() and
> we will not need to check g->chunk_generation_data before filling every
> commit.

I missed that you _already_ updated the logic in verify_commit_graph()
based on the generation. That logic should catch the problem, so it
might be enough to just add some "git commit-graph verify" commands into
your multi-level tests.

Specifically, the end result is this check:

	corrected_commit_date = commit_graph_generation(graph_commit);
	if (corrected_commit_date < max_parent_corrected_commit_date + 1)
		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
			     oid_to_hex(&cur_oid),
			     corrected_commit_date,
			     max_parent_corrected_commit_date + 1);

This will catch the order violations I was proposing could happen. It
doesn't go the extra mile to ensure that the commit-graph stores the
exact correct value or that the two bits of data are correct (both
topo-level and corrected commit date). That is fine for now, and we
can revisit if necessary.

The diff below makes some tweaks to your split-level test to show the
logic _was_ incorrect without my patch. Please incorporate the test
changes into your series. Note in particular that I added a base
layer that includes the GDAT chunk and _then_ adds a layer without
the GDAT chunk. That is an important case!

Thanks,
-Stolee

--- >8 ---

diff --git a/commit-graph.c b/commit-graph.c
index 17623274d9..d891a8ba3a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,14 +674,6 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
-	/* We cannot compare topological levels and corrected commit dates */
-	while (g->base_graph) {
-		warning(_("commit-graph-chain contains mixed generation versions"));
-		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
-			return 0;
-		g = g->base_graph;
-	}
-
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -787,7 +779,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
+	if (g->chunk_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 1a9be5e656..721515cc23 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -443,6 +443,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		test_commit $i &&
 		git branch commits/$i || return 1
 	done &&
+	git commit-graph write --reachable --split &&
 	git reset --hard commits/2 &&
 	for i in $(test_seq 6 10)
 	do
@@ -455,14 +456,15 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 	git reset --hard commits/4 &&
 	git merge commits/6 &&
 	git branch merge/2 &&
-	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
-	header: 43475048 1 1 3 0
-	num_commits: 12
+	header: 43475048 1 1 4 1
+	num_commits: 7
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'does not write generation data chunk if not present on existing tip' '
@@ -472,23 +474,25 @@ test_expect_success 'does not write generation data chunk if not present on exis
 	git merge commits/5 &&
 	git merge merge/2 &&
 	git branch merge/3 &&
-	git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
 	header: 43475048 1 1 4 1
 	num_commits: 3
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
 	cd "$TRASH_DIRECTORY/mixed" &&
-	git commit-graph write --reachable --split='replace' &&
+	git commit-graph write --reachable --split=replace &&
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 1 $graphdir/commit-graph-chain &&
 	verify_chain_files_exist $graphdir &&
-	graph_read_expect 15
+	graph_read_expect 15 &&
+	git commit-graph verify
 '
 
 test_done

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 05/10] commit-graph: implement generation data chunk
  2020-08-11 12:27         ` Derrick Stolee
@ 2020-08-11 18:58           ` Taylor Blau
  0 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2020-08-11 18:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: aee0ae56-3395-6848-d573-27a318d72755, jnareb, abhishekkumar8222,
	git, me, gitgitgadget

On Tue, Aug 11, 2020 at 08:27:41AM -0400, Derrick Stolee wrote:
> On 8/11/2020 7:03 AM, Abhishek Kumar wrote:
> > On Mon, Aug 10, 2020 at 12:28:10PM -0400, Derrick Stolee wrote:
> >> Patches 5-7 could perhaps be reorganized as follows:
> >>
> >>   i. commit-graph: return 64-bit generation number, as-is.
> >>
> >>  ii. Add a topo_level slab that is parsed from CDAT. Modify
> >>      compute_generation_numbers() to populate this value and modify
> >>      write_graph_chunk_data() to read this value. Simultaneously
> >>      populate the "generation" member with the same value.
> >>
> >> iii. "commit-graph: implement corrected commit date" without any GDAT
> >>      chunk interaction. Make sure the algorithm in
> >>      compute_generation_numbers() walks commits if either topo_level or
> >>      generation are unset. There is a trick here: the generation value
> >>      _is_ set if the commit is parsed from the existing commit-graph!
> >>      Is this case covered by the existing logic to not write GDAT when
> >>      writing a split commit-graph file with a base that does not have
> >>      GDAT? Note that the non-split case does not load the commit-graph
> >>      for parsing, so the interesting case is "--split-replace". Worth
> >>      a test (after we write the GDAT chunk), which you have in "commit-graph:
> >>      handle mixed generation commit chains".
> >>
> >
> > Right, so at the end of this patch we compute corrected commit dates but
> > don't write them to graph file.
> >
> > Although, writing ii. and iii. together in the same patch makes more
> > sense to me. Would it be hard to follow for someone who has no context
> > of this discussion?
>
> It is always easier to combine two patches than to split one into two.
>
> With that in mind, I recommend starting with a split version and then
> seeing how each patch looks. I think that these are "independent enough"
> ideas that justify the separate patches.
>
> >>  iv. This patch, introducing the chunk and the read/write logic.
> >>
> >>   v. Add the remaining patches.
> >>
> >> Again, this is a complicated patch-reorganization. The hope is that
> >> the end result is something that is easy to review as well as something
> >> that produces an as-sane-as-possible history for future bisecters.
> >>
> >> Perhaps other reviewers have similar feelings, or can say that I am
> >> being too picky.
> >>
> >
> > I can see how the reorganization helps us avoid a rather nasty
> > situation to be in. Should not be too hard to reorganize.
>
> I hope not. Hopefully you get some more review on this version
> before jumping in on such a big reorg (in case someone else has
> a different opinion).

I think the direction makes sense. We should avoid having the dummy
implementation of the GDAT chunk in the interim if at all possible (and
it seems like it is). What Stolee is proposing is what I'd suggest, too.

Please let us know if you need any help restructuring these patches.
Please make sure to give them a careful review since it is easy to move
a hunk into the wrong commit, or let a detail in the patch text become
out-of-date. Looking over "git log -p origin/master..HEAD" and "git
rebase -x make -j8 DEVELOPER=1 test' origin/master" never hurts ;).

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-10 14:23     ` Derrick Stolee
@ 2020-08-14  4:59       ` Abhishek Kumar
  2020-08-14 12:24         ` Derrick Stolee
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-14  4:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, jnareb, me, gitgitgadget

On Mon, Aug 10, 2020 at 10:23:32AM -0400, Derrick Stolee wrote:
> ....
> --- >8 ---
> 
> From 62189709fad3b051cedbd36193f5244fcce17e1f Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Mon, 10 Aug 2020 10:06:47 -0400
> Subject: [PATCH] commit-graph: use generation v2 only if entire chain does
> 
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
> 
>  1. "New" Git writes a commit-graph with the GDAT chunk.
>  2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> Because of the current use of inspecting the current layer for a
> generation_data_chunk pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.
> 
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
> 
> This issue would manifest itself as a performance problem in this
> case, especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index eb78af3dad..17623274d9 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -762,7 +762,9 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
>  	return &commit_list_insert(c, pptr)->next;
>  }
>  
> -static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
> +
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos, int flags)
>  {
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
> @@ -785,11 +787,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	if (g->chunk_generation_data)
> -	{
> +	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
>  		graph_data->generation = item->date +
>  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> -	}
>  	else
>  		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
> @@ -799,6 +799,10 @@ static inline void set_commit_tree(struct commit *c, struct tree *t)
>  	c->maybe_tree = t;
>  }
>  
> +/*
> + * In the case of a split commit-graph, this method expects the given
> + * commit-graph 'g' to be the top layer.
> + */

Unfortunately, this turns out to be an optimistic assumption. After
adding in changes from this patch and extended tests for
split-commit-graph [1], the tests fail with `git commit-graph verify` as
Git tries to compare topological level with corrected commit dates.

The problem lies in assuming that `g` is always the top layer, without
any way to assert if that's true. In case of `commit-graph verify`, we
update `g = g->base_graph` and verify recursively.

If we can assume such behavior is only the part of verify subcommand, I
can update the tests to no longer verify at the end.

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

>  static int fill_commit_in_graph(struct repository *r,
>  				struct commit *item,
>  				struct commit_graph *g, uint32_t pos)
> @@ -808,11 +812,15 @@ static int fill_commit_in_graph(struct repository *r,
>  	struct commit_list **pptr;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
> +	int flags = 0;
> +
> +	if (g->chunk_generation_data)
> +		flags |= COMMIT_GRAPH_GENERATION_V2;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> -	fill_commit_graph_info(item, g, pos);
> +	fill_commit_graph_info(item, g, pos, flags);
>  
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> @@ -904,10 +912,14 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
>  void load_commit_graph_info(struct repository *r, struct commit *item)
>  {
>  	uint32_t pos;
> +	int flags = 0;
> +
>  	if (!prepare_commit_graph(r))
>  		return;
> +	if (r->objects->commit_graph->chunk_generation_data)
> +		flags |= COMMIT_GRAPH_GENERATION_V2;
>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> -		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> +		fill_commit_graph_info(item, r->objects->commit_graph, pos, flags);
>  }
>  
>  static struct tree *load_tree_for_commit(struct repository *r,
> -- 
> 2.28.0.38.gc6f546511c1
> 

I solved the issue by adding a new member to struct commit_graph
`read_generation_data` to maintain the "global" state of the entire
commit-graph chain instead.

The relevant changes are in validate_mixed_generation_chain(),
read_commit_graph_one() and fill_commit_graph_info().

Thanks
- Abhishek

--- >8 ---
From aaf8c27bfec6e110a8bb12173c2dd612a8c6b8b9 Mon Sep 17 00:00:00 2001
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Date: Thu, 6 Aug 2020 19:08:52 +0530
Subject: [PATCH] commit-graph: use generation v2 only if entire chain does

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 36 ++++++++++++++++--
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d0f977852b..10309f870f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	int read_generation_data = 1;
+
+	while (g) {
+		if (!g->chunk_generation_data) {
+			read_generation_data = 0;
+			break;
+		}
+		g = g->base_graph;
+	}
+
+	g = r->objects->commit_graph;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(r);
+
 	return g;
 }
 
@@ -740,6 +763,8 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+#define COMMIT_GRAPH_GENERATION_V2 (1 << 0)
+
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	const unsigned char *commit_data;
@@ -763,11 +788,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
-	{
+	if (g->chunk_generation_data && g->read_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
-	}
 	else
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -884,6 +907,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
 	uint32_t pos;
 	if (!prepare_commit_graph(r))
 		return;
+
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
 }
@@ -2186,6 +2210,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2204,6 +2231,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index f89614ecd5..305c332b7e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	uint32_t read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 531016f405..ac5e7783fb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -424,4 +424,74 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
 test_done
-- 
2.28.0

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v2 07/10] commit-graph: implement corrected commit date
  2020-08-14  4:59       ` Abhishek Kumar
@ 2020-08-14 12:24         ` Derrick Stolee
  0 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-08-14 12:24 UTC (permalink / raw)
  To: a3910f82-ab2e-bf35-ac43-c30d77f3c96b
  Cc: abhishekkumar8222, git, jnareb, me, gitgitgadget

On 8/14/2020 12:59 AM, Abhishek Kumar wrote:
> I solved the issue by adding a new member to struct commit_graph
> `read_generation_data` to maintain the "global" state of the entire
> commit-graph chain instead.
> 
> The relevant changes are in validate_mixed_generation_chain(),
> read_commit_graph_one() and fill_commit_graph_info().

I think this is a good way to go. Adding that restriction about
the tip commit-graph was short-sighted of me and was likely to
break in the future.

I think your solution here to store extra state from the entire
chain into each layer makes a lot of sense.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2020-08-15 16:39   ` Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                       ` (12 more replies)
  11 siblings, 13 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
first version.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 3:

 * Reordered patches as discussed in 1
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed graph version commit chains.
 * Improve commit messages for
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  12 +-
 Documentation/technical/commit-graph.txt      |  45 ++--
 commit-graph.c                                | 241 +++++++++++++-----
 commit-graph.h                                |  16 +-
 commit-reach.c                                |  49 ++--
 commit-reach.h                                |   2 +-
 commit.c                                      |   9 +-
 commit.h                                      |   4 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   2 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |   4 +-
 t/t5318-commit-graph.sh                       |  27 +-
 t/t5324-split-commit-graph.sh                 |  82 +++++-
 t/t6024-recursive-merge.sh                    |   4 +-
 t/t6600-test-reach.sh                         |  62 +++--
 upload-pack.c                                 |   2 +-
 18 files changed, 396 insertions(+), 185 deletions(-)


base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v2:

  1:  a962b9ae4b =  1:  c6b7ade7af commit-graph: fix regression when computing bloom filter
  2:  cf61239f93 =  2:  e673867234 revision: parse parent in indegree_walk_step()
  3:  32da955e31 =  3:  18d5864f81 commit-graph: consolidate fill_commit_graph_info
  4:  b254782858 !  4:  6a0cde983d commit-graph: consolidate compare_commits_by_gen
     @@ Commit message
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
          Reviewed-by: Taylor Blau <me@ttaylorr.com>
     +    Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
  6:  1aa2a00a7a =  5:  6be759a954 commit-graph: return 64-bit generation number
  7:  bfe1473201 !  6:  b347dbb01b commit-graph: implement corrected commit date
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: implement corrected commit date
     +    commit-graph: add a slab to store topological levels
      
     -    With most of preparations done, let's implement corrected commit date
     -    offset. We add a new commit-slab to store topogical levels while
     -    writing commit graph and upgrade the generation member in struct
     -    commit_graph_data to a 64-bit timestamp. We store topological levels to
     -    ensure that older versions of Git will still have the performance
     -    benefits from generation number v2.
     +    As we are writing topological levels to commit data chunk to ensure
     +    backwards compatibility with "Old" Git and the member `generation` of
     +    struct commit_graph_data will store corrected commit date in a later
     +    commit, let's introduce a commit-slab to store topological levels while
     +    writing commit-graph.
     +
     +    When Git creates a split commit-graph, it takes advantage of the
     +    generation values that have been computed already and present in
     +    existing commit-graph files.
     +
     +    So, let's add a pointer to struct commit_graph to the topological level
     +    commit-slab and populate it with topological levels while writing a
     +    split commit-graph.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
       /* Keep track of the order in which commits are added to our list. */
       define_commit_slab(commit_pos, int);
       static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
      @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     - 	if (g->chunk_generation_data)
     --		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	{
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	}
     - 	else
     - 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     + 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     ++
     ++	if (g->topo_levels)
     ++		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
       }
     + 
     + static inline void set_commit_tree(struct commit *c, struct tree *t)
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct progress *progress;
     - 	int progress_done;
     - 	uint64_t progress_cnt;
     -+	struct topo_level_slab *topo_levels;
     + 		 changed_paths:1,
     + 		 order_by_pack:1;
       
     - 	char *base_graph_name;
     - 	int num_commit_graphs_before;
     ++	struct topo_level_slab *topo_levels;
     + 	const struct split_commit_graph_opts *split_opts;
     + 	size_t total_bloom_filter_data_size;
     + 	const struct bloom_filter_settings *bloom_settings;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       		else
       			packedDate[0] = 0;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       
       		packedDate[1] = htonl((*list)->date);
       		hashwrite(f, packedDate, 8);
     -@@ commit-graph.c: static int write_graph_chunk_generation_data(struct hashfile *f,
     - 	int i;
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     - 		struct commit *c = ctx->commits.list[i];
     -+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
     - 		display_progress(ctx->progress, ++ctx->progress_cnt);
     --		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     -+
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     -+
     -+		hashwrite_be32(f, offset);
     - 	}
     - 
     - 	return 0;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       					_("Computing commit graph generation numbers"),
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
      -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     ++		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
      -		    generation != GENERATION_NUMBER_ZERO)
     -+		if (topo_level != GENERATION_NUMBER_V1_INFINITY &&
     -+		    topo_level != GENERATION_NUMBER_ZERO)
     ++		if (level != GENERATION_NUMBER_V1_INFINITY &&
     ++		    level != GENERATION_NUMBER_ZERO)
       			continue;
       
       		commit_list_insert(ctx->commits.list[i], &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       			int all_parents_computed = 1;
      -			uint32_t max_generation = 0;
      +			uint32_t max_level = 0;
     -+			timestamp_t max_corrected_commit_date = current->date - 1;
       
       			for (parent = current->parents; parent; parent = parent->next) {
      -				generation = commit_graph_data_at(parent->item)->generation;
     -+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     ++				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
       
      -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
      -				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (topo_level == GENERATION_NUMBER_V1_INFINITY ||
     -+				    topo_level == GENERATION_NUMBER_ZERO) {
     ++				if (level == GENERATION_NUMBER_V1_INFINITY ||
     ++				    level == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
       					break;
      -				} else if (generation > max_generation) {
      -					max_generation = generation;
     -+				} else {
     -+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
     -+
     -+					if (topo_level > max_level)
     -+						max_level = topo_level;
     -+
     -+					if (data->generation > max_corrected_commit_date)
     -+						max_corrected_commit_date = data->generation;
     ++				} else if (level > max_level) {
     ++					max_level = level;
       				}
       			}
       
       			if (all_parents_computed) {
     - 				struct commit_graph_data *data = commit_graph_data_at(current);
     - 
     +-				struct commit_graph_data *data = commit_graph_data_at(current);
     +-
      -				data->generation = max_generation + 1;
     --				pop_commit(&list);
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     -+
     -+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
     -+				data->generation = max_corrected_commit_date + 1;
     + 				pop_commit(&list);
       
      -				if (data->generation > GENERATION_NUMBER_MAX)
      -					data->generation = GENERATION_NUMBER_MAX;
     -+				pop_commit(&list);
     ++				if (max_level > GENERATION_NUMBER_MAX - 1)
     ++					max_level = GENERATION_NUMBER_MAX - 1;
     ++				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
       	}
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	if (!commit_graph_compatible(the_repository))
       		return 0;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	ctx->total_bloom_filter_data_size = 0;
     - 	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     + 		}
     + 	}
       
      +	init_topo_level_slab(&topo_levels);
      +	ctx->topo_levels = &topo_levels;
      +
     - 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     - 		ctx->changed_paths = 1;
     - 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 	for (i = 0; i < g->num_commits; i++) {
     - 		struct commit *graph_commit, *odb_commit;
     - 		struct commit_list *graph_parents, *odb_parents;
     --		timestamp_t max_generation = 0;
     --		timestamp_t generation;
     -+		timestamp_t max_parent_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     - 
     - 		display_progress(progress, i + 1);
     - 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_parent_corrected_commit_date)
     -+				max_parent_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     ++	if (ctx->r->objects->commit_graph) {
     ++		struct commit_graph *g = ctx->r->objects->commit_graph;
     ++
     ++		while (g) {
     ++			g->topo_levels = &topo_levels;
     ++			g = g->base_graph;
     ++		}
     ++	}
     ++
     + 	if (pack_indexes) {
     + 		ctx->order_by_pack = 1;
     + 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
     +
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	const unsigned char *chunk_bloom_indexes;
     + 	const unsigned char *chunk_bloom_data;
     + 
     ++	struct topo_level_slab *topo_levels;
     + 	struct bloom_filter_settings *bloom_filter_settings;
     + };
       
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     --			max_generation--;
     --
     --		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_parent_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
      
       ## commit.h ##
      @@
  -:  ---------- >  7:  4074ace65b commit-graph: implement corrected commit date
  5:  cb797e20d7 !  8:  4e746628ac commit-graph: implement generation data chunk
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       
      -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
      +	if (g->chunk_generation_data)
     -+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++		graph_data->generation = item->date +
     ++			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
      +	else
      +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     - }
       
     - static inline void set_commit_tree(struct commit *c, struct tree *t)
     + 	if (g->topo_levels)
     + 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
      @@ commit-graph.c: struct write_commit_graph_context {
       		 report_progress:1,
       		 split:1,
     @@ commit-graph.c: struct write_commit_graph_context {
      +		 order_by_pack:1,
      +		 write_generation_data:1;
       
     + 	struct topo_level_slab *topo_levels;
       	const struct split_commit_graph_opts *split_opts;
     - 	size_t total_bloom_filter_data_size;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       	return 0;
       }
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      +	int i;
      +	for (i = 0; i < ctx->commits.nr; i++) {
      +		struct commit *c = ctx->commits.list[i];
     ++		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
     -+		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     ++
     ++		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     ++			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     ++		hashwrite_be32(f, offset);
      +	}
      +
      +	return 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	chunks[2].id = GRAPH_CHUNKID_DATA;
       	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
       	chunks[2].write_fn = write_graph_chunk_data;
     ++
     ++	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
     ++		ctx->write_generation_data = 0;
      +	if (ctx->write_generation_data) {
      +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
      +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->split_opts = split_opts;
       	ctx->total_bloom_filter_data_size = 0;
     -+	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     ++	ctx->write_generation_data = 1;
       
       	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
       		ctx->changed_paths = 1;
     @@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     + 	infodir=".git/objects/info" &&
       	graphdir="$infodir/commit-graphs" &&
     - 	test_oid_init &&
       	test_oid_cache <<-EOM
      -	shallow sha1:1760
      -	shallow sha256:2064
  8:  833779ad53 !  9:  5a147a9704 commit-graph: handle mixed generation commit chains
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: handle mixed generation commit chains
     +    commit-graph: use generation v2 only if entire chain does
      
     -    As corrected commit dates and topological levels cannot be compared
     -    directly, we must handle commit graph chains with mixed generation
     -    number definitions.
     +    Since there are released versions of Git that understand generation
     +    numbers in the commit-graph's CDAT chunk but do not understand the GDAT
     +    chunk, the following scenario is possible:
      
     -    While reading a commit graph file, we disable generation numbers if the
     -    chain contains mixed generation numbers.
     +    1. "New" Git writes a commit-graph with the GDAT chunk.
     +    2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
      
     -    While writing to commit graph chain, we write generation data chunk only
     -    if the previous tip of chain had a generation data chunk. Using
     -    `--split=replace` overwrites the existing chain and writes generation
     -    data chunk regardless of previous tip.
     +    Because of the current use of inspecting the current layer for a
     +    chunk_generation_data pointer, the commits in the lower layer will be
     +    interpreted as having very large generation values (commit date plus
     +    offset) compared to the generation numbers in the top layer (topological
     +    level). This violates the expectation that the generation of a parent is
     +    strictly smaller than the generation of a child.
      
     -    In t5324-split-commit-graph, we set up a repo with twelve commits and
     -    write a base commit graph file with no generation data chunk. When add
     -    three commits and write to chain again, Git does not write generation
     -    data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
     -    as we replace the existing chain, Git writes a commit graph file with
     -    generation data chunk.
     +    It is difficult to expose this issue in a test. Since we _start_ with
     +    artificially low generation numbers, any commit walk that prioritizes
     +    generation numbers will walk all of the commits with high generation
     +    number before walking the commits with low generation number. In all the
     +    cases I tried, the commit-graph layers themselves "protect" any
     +    incorrect behavior since none of the commits in the lower layer can
     +    reach the commits in the upper layer.
      
     +    This issue would manifest itself as a performance problem in this case,
     +    especially with something like "git log --graph" since the low
     +    generation numbers would cause the in-degree queue to walk all of the
     +    commits in the lower layer before allowing the topo-order queue to write
     +    anything to output (depending on the size of the upper layer).
     +
     +    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
     -@@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
     - 	if (!g->num_commits)
     - 		return 0;
     +@@ commit-graph.c: static struct commit_graph *load_commit_graph_chain(struct repository *r,
     + 	return graph_chain;
     + }
       
     -+	/* We cannot compare topological levels and corrected commit dates */
     -+	while (g->base_graph) {
     -+		warning(_("commit-graph-chain contains mixed generation versions"));
     -+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
     -+			return 0;
     ++static void validate_mixed_generation_chain(struct repository *r)
     ++{
     ++	struct commit_graph *g = r->objects->commit_graph;
     ++	int read_generation_data = 1;
     ++
     ++	while (g) {
     ++		if (!g->chunk_generation_data) {
     ++			read_generation_data = 0;
     ++			break;
     ++		}
      +		g = g->base_graph;
      +	}
      +
     - 	first_generation = get_be32(g->chunk_commit_data +
     - 				    g->hash_len + 8) >> 2;
     ++	g = r->objects->commit_graph;
     ++
     ++	while (g) {
     ++		g->read_generation_data = read_generation_data;
     ++		g = g->base_graph;
     ++	}
     ++}
     ++
     + struct commit_graph *read_commit_graph_one(struct repository *r,
     + 					   struct object_directory *odb)
     + {
     +@@ commit-graph.c: struct commit_graph *read_commit_graph_one(struct repository *r,
     + 	if (!g)
     + 		g = load_commit_graph_chain(r, odb);
     + 
     ++	validate_mixed_generation_chain(r);
     ++
     + 	return g;
     + }
     + 
     +@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     + 	date_low = get_be32(commit_data + g->hash_len + 12);
     + 	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     +-	if (g->chunk_generation_data)
     ++	if (g->chunk_generation_data && g->read_generation_data)
     + 		graph_data->generation = item->date +
     + 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     + 	else
     +@@ commit-graph.c: void load_commit_graph_info(struct repository *r, struct commit *item)
     + 	uint32_t pos;
     + 	if (!prepare_commit_graph(r))
     + 		return;
     ++
     + 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
     + 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
     + }
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       		g = ctx->r->objects->commit_graph;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       	ctx->approx_nr_objects = approximate_object_count();
      
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	struct object_directory *odb;
     + 
     + 	uint32_t num_commits_in_base;
     ++	uint32_t read_generation_data;
     + 	struct commit_graph *base_graph;
     + 
     + 	const uint32_t *chunk_oid_fanout;
     +
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: done <<\EOF
       0600 -r--------
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +		test_commit $i &&
      +		git branch commits/$i || return 1
      +	done &&
     ++	git commit-graph write --reachable --split &&
      +	git reset --hard commits/2 &&
      +	git merge commits/4 &&
      +	git branch merge/1 &&
      +	git reset --hard commits/4 &&
      +	git merge commits/6 &&
      +	git branch merge/2 &&
     -+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
     ++	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 3 0
     -+	num_commits: 12
     ++	header: 43475048 1 1 4 1
     ++	num_commits: 2
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'does not write generation data chunk if not present on existing tip' '
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	git merge commits/5 &&
      +	git merge merge/2 &&
      +	git branch merge/3 &&
     -+	git commit-graph write --reachable --split &&
     ++	git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 4 1
     ++	header: 43475048 1 1 4 2
      +	num_commits: 3
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
      +	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git commit-graph write --reachable --split='replace' &&
     ++	git commit-graph write --reachable --split=replace &&
      +	test_path_is_file $graphdir/commit-graph-chain &&
      +	test_line_count = 1 $graphdir/commit-graph-chain &&
      +	verify_chain_files_exist $graphdir &&
     -+	graph_read_expect 15
     ++	graph_read_expect 15 &&
     ++	git commit-graph verify
      +'
      +
       test_done
  9:  58a2d5da01 = 10:  439adc1718 commit-reach: use corrected commit dates in paint_down_to_common()
 10:  4c34294602 = 11:  f6f91af305 doc: add corrected commit date info

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-17 22:30       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                       ` (11 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..ace7400a1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 02/11] revision: parse parent in indegree_walk_step()
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-18 14:18       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                       ` (10 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 3dcf689341..ecf757c327 100644
--- a/revision.c
+++ b/revision.c
@@ -3363,6 +3363,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-19 17:54       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                       ` (9 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_in_graph() guarantees the values of
graph position and generation number, and did not load timestamp.
However, with corrected commit date we will need load the timestamp as
well to populate the generation number.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 29 +++++++++++------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ace7400a1a..af8d9cc45e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (2 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-17 13:22       ` Derrick Stolee
  2020-08-21 11:05       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                       ` (8 subsequent siblings)
  12 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index af8d9cc45e..fb6e2bf18f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..701e3d41aa 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -146,4 +146,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 4ce8cb38d5..bd6d5e587f 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (3 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-21 13:14       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                       ` (7 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 18 +++++++++---------
 commit-graph.h |  4 ++--
 commit-reach.c | 32 ++++++++++++++++----------------
 commit-reach.h |  2 +-
 commit.h       |  3 ++-
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 7 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb6e2bf18f..7f9f858577 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -115,8 +115,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -159,8 +159,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1338,7 +1338,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
+		if (generation != GENERATION_NUMBER_V1_INFINITY &&
 		    generation != GENERATION_NUMBER_ZERO)
 			continue;
 
@@ -1352,7 +1352,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			for (parent = current->parents; parent; parent = parent->next) {
 				generation = commit_graph_data_at(parent->item)->generation;
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
+				if (generation == GENERATION_NUMBER_V1_INFINITY ||
 				    generation == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
@@ -2355,8 +2355,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
diff --git a/commit-graph.h b/commit-graph.h
index 701e3d41aa..430bc830bb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..470bc80139 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..bc0732a4fe 100644
--- a/commit.h
+++ b/commit.h
@@ -11,7 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
diff --git a/revision.c b/revision.c
index ecf757c327..411852468b 100644
--- a/revision.c
+++ b/revision.c
@@ -3290,7 +3290,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3336,7 +3336,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3379,7 +3379,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3437,7 +3437,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3498,7 +3498,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 80ad9a38d8..bcb8b5dfda 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (4 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-21 18:43       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                       ` (6 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As we are writing topological levels to commit data chunk to ensure
backwards compatibility with "Old" Git and the member `generation` of
struct commit_graph_data will store corrected commit date in a later
commit, let's introduce a commit-slab to store topological levels while
writing commit-graph.

When Git creates a split commit-graph, it takes advantage of the
generation values that have been computed already and present in
existing commit-graph files.

So, let's add a pointer to struct commit_graph to the topological level
commit-slab and populate it with topological levels while writing a
split commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 commit.h       |  1 +
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7f9f858577..a2f15b2825 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -953,6 +958,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_V1_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_V1_INFINITY &&
+		    level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_V1_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_V1_INFINITY ||
+				    level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
 		}
 	}
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (pack_indexes) {
 		ctx->order_by_pack = 1;
 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
diff --git a/commit-graph.h b/commit-graph.h
index 430bc830bb..1152a9642e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -72,6 +72,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
diff --git a/commit.h b/commit.h
index bc0732a4fe..bb846e0025 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,7 @@
 #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
 
 struct commit_list {
 	struct commit *item;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (5 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22  0:05       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                       ` (5 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a2f15b2825..fd69534dd5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (level != GENERATION_NUMBER_V1_INFINITY &&
-		    level != GENERATION_NUMBER_ZERO)
+		    level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO
+		    )
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 
 				if (level == GENERATION_NUMBER_V1_INFINITY ||
-				    level == GENERATION_NUMBER_ZERO) {
+				    level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO
+				    ) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_MAX - 1)
 					max_level = GENERATION_NUMBER_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
@@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		timestamp_t max_generation = 0;
-		timestamp_t generation;
+		timestamp_t max_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_corrected_commit_date)
+				max_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 08/11] commit-graph: implement generation data chunk
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (6 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 13:09       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                       ` (4 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number was to distinguish
between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 48 ++++++++++++++++++++++++---
 commit-graph.h                |  2 ++
 t/README                      |  3 ++
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 +--
 t/t5318-commit-graph.sh       | 27 +++++++--------
 t/t5324-split-commit-graph.sh | 12 +++----
 t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
 8 files changed, 107 insertions(+), 53 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fd69534dd5..b7a72b40db 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -951,7 +963,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct split_commit_graph_opts *split_opts;
@@ -1106,8 +1119,25 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					  struct write_commit_graph_context *ctx)
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
@@ -1726,6 +1756,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2130,6 +2169,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
diff --git a/commit-graph.h b/commit-graph.h
index 1152a9642e..f78c892fc0 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -67,6 +68,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/README b/t/README
index 70ec61cf88..6647ef132e 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..55c94e9ebd 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 044cf8a3de..b41b2160c6 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -71,7 +71,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -98,14 +98,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -214,7 +214,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -249,7 +249,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -259,7 +259,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -269,7 +269,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -282,7 +282,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -292,7 +292,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -302,7 +302,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -323,7 +323,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -420,8 +420,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index ea28d522b8..531016f405 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -28,9 +28,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 475564bee7..d14b129f06 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (7 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 17:14       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                       ` (3 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 32 +++++++++++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b7a72b40db..c1292f8e08 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	int read_generation_data = 1;
+
+	while (g) {
+		if (!g->chunk_generation_data) {
+			read_generation_data = 0;
+			break;
+		}
+		g = g->base_graph;
+	}
+
+	g = r->objects->commit_graph;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(r);
+
 	return g;
 }
 
@@ -763,7 +786,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
+	if (g->chunk_generation_data && g->read_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
@@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
 	uint32_t pos;
 	if (!prepare_commit_graph(r))
 		return;
+
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
 }
@@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2210,6 +2237,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index f78c892fc0..3cf89d895d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	uint32_t read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 531016f405..ac5e7783fb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -424,4 +424,74 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (8 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 19:09       ` Jakub Narębski
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                       ` (2 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

t6024-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base. As this has
already caused problems (as noted in 859fdc0 (commit-graph: define
GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
test script.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6024-recursive-merge.sh |  4 +++-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c1292f8e08..6411068411 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return !!g->chunk_generation_data;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 3cf89d895d..e22ec1e626 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/commit-reach.c b/commit-reach.c
index 470bc80139..3a1b925274 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 332cfc53fd..d3def66e7d 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (9 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-15 16:39     ` Abhishek Kumar via GitGitGadget
  2020-08-22 22:20       ` Jakub Narębski
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-08-15 16:39 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 12 ++---
 Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
 2 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..71c43884ec 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -88,6 +84,12 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * This list can be later modified to store future generation number related
+      data.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 808fa30b99..f27145328c 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates
+2. Topological levels
+
+Define "corrected committer date" of a commit recursively as follows:
+
+  * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
+
+  * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+Define the "topological level" of a commit recursively as follows:
 
  * A commit with no parents (a root commit) has generation number one.
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
 
-Equivalently, the generation number of a commit A is one more than the
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
-violate topological relationships due to clock skew (such as "git log"
-with default order), but is not used when the topological order is
-required (such as merge base calculations, "git log --graph").
-
 In practice, we expect some commits to be created recently and not stored
 in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
+are computed to be at least this value. We limit at this value since it is
+the largest value that can be stored in the commit-graph file using the
+available to generation numbers. This presents another case where a
+commit can have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+We also merge commit-graph chains when we try to write a commit graph with
+two different generation number definitions as they cannot be compared directly.
+We overwrite the existing chain and create a commit-graph with the newer or more
+efficient defintion. For example, overwriting topological levels commit graph
+chain to create a corrected commit dates commit graph chain.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (10 preceding siblings ...)
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-17  0:13     ` Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
                         ` (2 more replies)
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
  12 siblings, 3 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-17  0:13 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

I'm not sure if this level of detail is required in the cover letter for
the series, but generation number v2 is corrected commit date; corrected
commit date offsets is how we store this value in the commit-graph file.

>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.

I would say "committer date heuristics" instead of just "committer
date", to be more exact.

Is this data generated using https://github.com/derrickstolee/gen-test
scripts?

>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.

It would be nice to have reference email (or other place with details)
for "unfortunate problem".

>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2.

This is a bit of simplification, but good enough for a cover letter.

To be more exact, from various candidates the Corrected Commit Date was
chosen.  Then it turned out that old Git crashes on changed commit-graph
format version value, so if the generation number v2 was to replace v1
it needed to be backward-compatibile: hence the idea of Corrected Commit
Date with Monotonically Increasing Offsets.  But with GDAT chunk to
store generation number v2 (and for the time being leaving generation
number v1, i.e. Topological Levels, in CDAT), we are no longer
constrained by the requirement of backward-compatibility to make old Git
work with commit-graph file created by new Git.  So we could go back to
Corrected Commit Date, and as you wrote above the backward-compatibile
variant performs worse.

> The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents.

Actually it needs to be "corrected commit dates of its parents plus 1"
to fulfill the reachability condition for a generation number for a
commit:

      A can reach B   =>  gen(A) < gen(B)

Of course it can be computed in simpler way, because

  max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1


>                                                           Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

All right.

>
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

All right.

However I think the cover letter should also describe what should happen
in a mixed version environment (for example new Git on command line,
copy of old Git used by GUI client), and in particular what should
happen in a mixed-chain case - both for reading and for writing the
commit-graph file.

For *writing*: because old Git would create commit-graph layers without
the GDAT chunk, to simplify the behavior and make easy to reason about
commit-graph data (the situation should be not that common, and
transient -- it should get more rare as the time goes), we want the
following behavior from new Git:

- If top layer contains the GDAT chunk, or we are rewriting commit-graph
  file (--split=replace), or we are merging layers and there are no
  layers without GDAT chunk below set of layers that are merged, then

     write commit-graph file or commit-graph layer with GDAT chunk,

  otherwise

     write commit-graph layer without GDAT chunk.

  This means that there are commit-graph layers without GDAT chunk if
  and only if the top layer is also without GDAT chunk.


For *reading* we want to use generation number v2 (corrected commit
date) if possible, and fall back to generation number v1 (topological
levels).

- If the top layer contains the GDAT chunk (or maybe even if the topmost
  layer that involves all commits in question, not necessarily the top
  layer in the full commit-graph chain), then use generation number v2

  - commit_graph_data->generation stores corrected commit date,
    computed as sum of committer date (from CDAT) and offset (from GDAT)

  - A can reach B   =>  gen(A) < gen(B)

  - there is no need for committer date heuristics, and no need for
    limiting use of generation number to where there is a cutoff (to not
    hamper performance).

- If there are layers without GDAT chunks, which thanks to the write
  behavior means simply top layer without GDAT chunk, we need to turn
  off use of generation numbers or fall back to using topological levels

  - commit_graph_data->generation stores topological levels,
    taken from CDAT chunk (30-bits)

  - A can reach B   =>  gen(A) < gen(B)

  - we probably want to keep tie-breaking of sorting by generation
    number via committer date, and limit use of generation number as
    opposed to using committer date heuristics (with slop) to not make
    performance worse.

>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 3:
>
>  * Reordered patches as discussed in 1
>    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]

If I remember it correctly this was done to always store in GDAT chunk
corrected commit date offsets, isn't it?

>  * Split "implement corrected commit date" into two patches - one
>    introducing the topo level slab and other implementing corrected commit
>    dates.

All right.

I think it might be good idea to split off the change to tar file tests
(as a preparatory patch), to make reviews and bisecting easier.

>  * Extended split-commit-graph tests to verify at the end of test.

Do we also test for proper merging of split commit-graph layers, not
only adding a new layer and a full rewrite (--split=replace)?

>  * Use topological levels as generation number if any of split commit-graph
>    files do not have generation data chunk.

That is good for performance.

>
> Changes in version 2:
>
>  * Add tests for generation data chunk.

Good.

>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.

Good, that is needed for testing mixed-version behavior.

>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().

All right, but see the caveat.

>  * Update technical documentation.

Always a good thing.

>  * Handle mixed graph version commit chains.

Where by "version" you mean generation number version - the commit-graph
version number unfortunately needs to stay the same...

>  * Improve commit messages for
                                ^^^^^^
Something missing in this point, the sentence ends abruptly.

>  * Revert unnecessary whitespace changes.

Thanks.

>  * Split uint_32 -> timestamp_t change into a new commit.

It is usually better to keep the commits small.  Good.


Good work!

>
> Abhishek Kumar (11):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: return 64-bit generation number
>   commit-graph: add a slab to store topological levels
>   commit-graph: implement corrected commit date
>   commit-graph: implement generation data chunk
>   commit-graph: use generation v2 only if entire chain does
>   commit-reach: use corrected commit dates in paint_down_to_common()
>   doc: add corrected commit date info
>
>  .../technical/commit-graph-format.txt         |  12 +-
>  Documentation/technical/commit-graph.txt      |  45 ++--
>  commit-graph.c                                | 241 +++++++++++++-----
>  commit-graph.h                                |  16 +-
>  commit-reach.c                                |  49 ++--
>  commit-reach.h                                |   2 +-
>  commit.c                                      |   9 +-
>  commit.h                                      |   4 +-
>  revision.c                                    |  13 +-
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   2 +
>  t/t4216-log-bloom.sh                          |   4 +-
>  t/t5000-tar-tree.sh                           |   4 +-
>  t/t5318-commit-graph.sh                       |  27 +-
>  t/t5324-split-commit-graph.sh                 |  82 +++++-
>  t/t6024-recursive-merge.sh                    |   4 +-
>  t/t6600-test-reach.sh                         |  62 +++--
>  upload-pack.c                                 |   2 +-
>  18 files changed, 396 insertions(+), 185 deletions(-)
>
>
> base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Fwd: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
@ 2020-08-17  1:32         ` Taylor Blau
  2020-08-17  7:56           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-08-17  1:32 UTC (permalink / raw)
  To: git; +Cc: Jakub Narębski, Derrick Stolee, Abhishek Kumar

On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
>
> All right.
>
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,
>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This seems very sane to me, and I'd be glad to see it spelled out in
more specific detail. I was wondering this myself, and had to double
check with Stolee off-list that my interpretation of Abhishek's code was
correct.

But yes, only writing GDAT chunks when all layers in the chain have GDAT
chunks makes sense, since we can't interoperate between corrected dates
and topological levels. Since we can't fill in the GDAT data of layers
generated in pre-GDAT versions of Git without invalidating the GDAT
layers on-disk, there's no point to speculatively computing both chunks.

Merging rules are obviously correct, which is good. For what it's worth,
the '--split=replace' case is what we'll really care about at GitHub,
since it's unlikely we'd drop all existing commit-graph chains and
rebuild them from scratch. More likely is that we'll let the new GDAT
chunks trickle in over time when we run 'git commit-graph write' with
'--split=replace', which happens "every so often".

> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2

I don't follow this. If we have a multi-layer chain, either all or none
of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
chunk" makes sense, since it implies that all layers have the GDAT
chunk. I don't see how "even if the topmost layer that involves all
commits in question" would be possible, since (if I'm understanding your
description correctly), we can't have *some* of the layers having a GDAT
chunk with others only having a CDAT chunk.

I'm a little confused here.

>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels

Good, I'm glad that this can be a quick check (that we can cache for
future reads, but I'm not even sure the caching would be necessary
without measuring).
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

All makes very good sense, except for the one point I raised above.

> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.

Thanks, Abhishek for your great work on this. I was feeling bad that I
wasn't more involved in the early discussions about the transition plan,
but what you, Stolee, and Jakub came up with all seems like what I would
have suggested, anyway ;-).

> Jakub Narebski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Fwd: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  1:32         ` Fwd: " Taylor Blau
@ 2020-08-17  7:56           ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-17  7:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Derrick Stolee, Abhishek Kumar

On Mon, 17 Aug 2020 at 03:32, Taylor Blau <me@ttaylorr.com> wrote:
> On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >
> > > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > > and store corrected commit date offsets in GDAT chunk while storing
> > > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > > of Git would use corrected commit dates, falling back to topological level
> > > if the generation data chunk is absent in the commit-graph file.
> >
> > All right.
> >
> > However I think the cover letter should also describe what should happen
> > in a mixed version environment (for example new Git on command line,
> > copy of old Git used by GUI client), and in particular what should
> > happen in a mixed-chain case - both for reading and for writing the
> > commit-graph file.
> >
> > For *writing*: because old Git would create commit-graph layers without
> > the GDAT chunk, to simplify the behavior and make easy to reason about
> > commit-graph data (the situation should be not that common, and
> > transient -- it should get more rare as the time goes), we want the
> > following behavior from new Git:
> >
> > - If top layer contains the GDAT chunk, or we are rewriting commit-graph
> >   file (--split=replace), or we are merging layers and there are no
> >   layers without GDAT chunk below set of layers that are merged, then
> >
> >      write commit-graph file or commit-graph layer with GDAT chunk,
> >
> >   otherwise
> >
> >      write commit-graph layer without GDAT chunk.
> >
> >   This means that there are commit-graph layers without GDAT chunk if
> >   and only if the top layer is also without GDAT chunk.
>
> This seems very sane to me, and I'd be glad to see it spelled out in
> more specific detail. I was wondering this myself, and had to double
> check with Stolee off-list that my interpretation of Abhishek's code was
> correct.
>
> But yes, only writing GDAT chunks when all layers in the chain have GDAT
> chunks makes sense, since we can't interoperate between corrected dates
> and topological levels. Since we can't fill in the GDAT data of layers
> generated in pre-GDAT versions of Git without invalidating the GDAT
> layers on-disk, there's no point to speculatively computing both chunks.
>
> Merging rules are obviously correct, which is good. For what it's worth,
> the '--split=replace' case is what we'll really care about at GitHub,
> since it's unlikely we'd drop all existing commit-graph chains and
> rebuild them from scratch. More likely is that we'll let the new GDAT
> chunks trickle in over time when we run 'git commit-graph write' with
> '--split=replace', which happens "every so often".

To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]

In the split commit-graph chain above, merging two topmost layers
should create a layer without GDAT; merging three topmost layers
(and any other layers, e.g. two middle ones) should create a layer
 with GDAT.

> > For *reading* we want to use generation number v2 (corrected commit
> > date) if possible, and fall back to generation number v1 (topological
> > levels).
> >
> > - If the top layer contains the GDAT chunk (or maybe even if the topmost
> >   layer that involves all commits in question, not necessarily the top
> >   layer in the full commit-graph chain), then use generation number v2
>
> I don't follow this. If we have a multi-layer chain, either all or none
> of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
> chunk" makes sense, since it implies that all layers have the GDAT
> chunk. I don't see how "even if the topmost layer that involves all
> commits in question" would be possible, since (if I'm understanding your
> description correctly), we can't have *some* of the layers having a GDAT
> chunk with others only having a CDAT chunk.
>
> I'm a little confused here.

This is only speculative, and most probably totally unnecessary
complication (either that, or something that we would get for free).
Assume that the command in question operates only on
historical data; for example `git log --topo-order HEAD~1000`.
If all commits (or, what's equivalent, most recent commits
i.e. HEAD~1000) have their data in split commit-graph layers
with GDAT, we can theoretically use generation number v2,
even if there are some newer commits that have their data
in layers without GDAT (and some even newer ones outside
commit-graph files).

I hope that this explains my (possibly harebrained) idea.

> >   - commit_graph_data->generation stores corrected commit date,
> >     computed as sum of committer date (from CDAT) and offset (from GDAT)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - there is no need for committer date heuristics, and no need for
> >     limiting use of generation number to where there is a cutoff (to not
> >     hamper performance).
> >
> > - If there are layers without GDAT chunks, which thanks to the write
> >   behavior means simply top layer without GDAT chunk, we need to turn
> >   off use of generation numbers or fall back to using topological levels
>
> Good, I'm glad that this can be a quick check (that we can cache for
> future reads, but I'm not even sure the caching would be necessary
> without measuring).

There is a question where to store the information that we cannot
use generation number v2 (that 'generation' contains topological
levels and not corrected commit date):
- create new global variable
- store it in `struct split_commit_graph_opts`
- set `chunk_generation_data` to NULL for all graphs
  in chain (it is in `struct commit_graph`)?

> >
> >   - commit_graph_data->generation stores topological levels,
> >     taken from CDAT chunk (30-bits)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - we probably want to keep tie-breaking of sorting by generation
> >     number via committer date, and limit use of generation number as
> >     opposed to using committer date heuristics (with slop) to not make
> >     performance worse.
>
> All makes very good sense, except for the one point I raised above.
>
> > >
> > > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > > first version.
>
> Thanks, Abhishek for your great work on this. I was feeling bad that I
> wasn't more involved in the early discussions about the transition plan,
> but what you, Stolee, and Jakub came up with all seems like what I would
> have suggested, anyway ;-).

Thank you for your work on improving this feature.

Best,
-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-08-17 13:22       ` Derrick Stolee
  2020-08-21 11:05       ` Jakub Narębski
  1 sibling, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2020-08-17 13:22 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 8/15/2020 12:39 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Whoops! Be sure to add the "Reviewed-by: above the "Signed-off-by" line(s)
so re-signing off doesn't add a duplicate like this.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter
  2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-08-17 22:30       ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-17 22:30 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Subject: [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter

s/bloom filter/Bloom filters/

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> commit_gen_cmp is used when writing a commit-graph to sort commits in
> generation order before computing Bloom filters. Since c49c82aa (commit:
> move members graph_pos, generation to a slab, 2020-06-17) made it so
> that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
> during writing, we cannot call it within this function. Instead, access
> the generation number directly through the slab (i.e., by calling
> 'commit_graph_data_at(c)->generation') in order to access it while
> writing.

Two things that might not be obvious from the commit message:

- Is commit_gen_cmp in commit-graph.c used by anything but writing
  Bloom filters for changed paths?

- That the generation number is computed during `commit-graph write`
  before computing Bloom filters.

Also, after this series 'generation' would be generation number v2, that
is corrected commit date, and not v1, that is topological levels.  We
should check, just in case, that it does not lead to significant
performance regression for `git commit-graph write --reachable <...>`
case (the one that uses commit_gen_cmp sort).

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e51c91dd5b..ace7400a1a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;

Nice and easy.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
@ 2020-08-18  6:12       ` Abhishek Kumar
  2020-08-23 15:27       ` Jakub Narębski
  2 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Mon, Aug 17, 2020 at 02:13:18AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> I'm not sure if this level of detail is required in the cover letter for
> the series, but generation number v2 is corrected commit date; corrected
> commit date offsets is how we store this value in the commit-graph file.
> 
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
> 
> I would say "committer date heuristics" instead of just "committer
> date", to be more exact.
> 
> Is this data generated using https://github.com/derrickstolee/gen-test
> scripts?
> 

Yes, it is.

> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem, we also needed a way to
> > distinguish between the old and new generation number without incrementing
> > graph version.
> 
> It would be nice to have reference email (or other place with details)
> for "unfortunate problem".
> 

Will add.

> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test,
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Mononotically Increasing Offsets
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> > number v2.
> 
> This is a bit of simplification, but good enough for a cover letter.
> 
> To be more exact, from various candidates the Corrected Commit Date was
> chosen.  Then it turned out that old Git crashes on changed commit-graph
> format version value, so if the generation number v2 was to replace v1
> it needed to be backward-compatibile: hence the idea of Corrected Commit
> Date with Monotonically Increasing Offsets.  But with GDAT chunk to
> store generation number v2 (and for the time being leaving generation
> number v1, i.e. Topological Levels, in CDAT), we are no longer
> constrained by the requirement of backward-compatibility to make old Git
> work with commit-graph file created by new Git.  So we could go back to
> Corrected Commit Date, and as you wrote above the backward-compatibile
> variant performs worse.
> 
> > The Corrected Commit Date is defined as:
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents.
> 
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
> 
>       A can reach B   =>  gen(A) < gen(B)
> 
> Of course it can be computed in simpler way, because
> 
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1
> 
> 
> >                                                           Then corrected
> > commit date offset is the difference between corrected commit date of C and
> > commit date of C.
> 
> All right.
> 
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
> 
> All right.
> 
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
> 

Yes, definitely. Will add

> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
> 
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
> 
>      write commit-graph file or commit-graph layer with GDAT chunk,
> 
>   otherwise
> 
>      write commit-graph layer without GDAT chunk.
> 
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.
> 
> 
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
> 
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
> 

The current implementation checks the entire chain for GDAT, rather than
just the topmost layer as we cannot assert that `g` would be the topmost
layer of the chain.

See the discussion here: https://lore.kernel.org/git/20200814045957.GA1380@Abhishek-Arch/

It's one of drawbacks of having a single member 64-bit `generation`
instead of two 32-bit members `level` and `odate`.

>
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
> 
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
> 
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.
> 
> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.
> >
> > I look forward to everyone's reviews!
> >
> > Thanks
> >
> >  * Abhishek
> >
> >
> > ----------------------------------------------------------------------------
> >
> > Changes in version 3:
> >
> >  * Reordered patches as discussed in 1
> >    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
> 
> If I remember it correctly this was done to always store in GDAT chunk
> corrected commit date offsets, isn't it?
> 

Yes.

> >  * Split "implement corrected commit date" into two patches - one
> >    introducing the topo level slab and other implementing corrected commit
> >    dates.
> 
> All right.
> 
> I think it might be good idea to split off the change to tar file tests
> (as a preparatory patch), to make reviews and bisecting easier.
> 
> >  * Extended split-commit-graph tests to verify at the end of test.
> 
> Do we also test for proper merging of split commit-graph layers, not
> only adding a new layer and a full rewrite (--split=replace)?
> 

We do not, will add a test at end. Thanks for pointing this out.

> >  * Use topological levels as generation number if any of split commit-graph
> >    files do not have generation data chunk.
> 
> That is good for performance.
> 
> >
> > Changes in version 2:
> >
> >  * Add tests for generation data chunk.
> 
> Good.
> 
> >  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
> >    generation data chunk.
> 
> Good, that is needed for testing mixed-version behavior.
> 
> >  * Compare commits with corrected commit dates if present in
> >    paint_down_to_common().
> 
> All right, but see the caveat.
> 
> >  * Update technical documentation.
> 
> Always a good thing.
> 
> >  * Handle mixed graph version commit chains.
> 
> Where by "version" you mean generation number version - the commit-graph
> version number unfortunately needs to stay the same...
> 

Yes, clarified.

> >  * Improve commit messages for
>                                 ^^^^^^
> Something missing in this point, the sentence ends abruptly.

I didn't finish the sentence. Meant to say:

- Improve commit messages for "commit-graph: fix regression when computing bloom filter", "commit-graph: consolidate fill_commit_graph_info",
> 
> >  * Revert unnecessary whitespace changes.
> 
> Thanks.
> 
> >  * Split uint_32 -> timestamp_t change into a new commit.
> 
> It is usually better to keep the commits small.  Good.
> 
> 
> Good work!
> 
> ...
> --
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 02/11] revision: parse parent in indegree_walk_step()
  2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-08-18 14:18       ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-18 14:18 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

All right, we need to have commit parsed to have correct value for its
generation number.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 3dcf689341..ecf757c327 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3363,6 +3363,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>
> +		if (parse_commit_gently(parent, 1) < 0)
> +			return;
> +

All right, this is exactly what is done in this function for commit 'c'
taken from indegree_queue, whose parents we process here:

	if (parse_commit_gently(c, 1) < 0)
		return;

>  		if (*pi)
>  			(*pi)++;
>  		else

Looks good to me.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-08-19 17:54       ` Jakub Narębski
  2020-08-21  4:11         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-19 17:54 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
>
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> generation number (within CDAT chunk) and has undefined behavior.
>
> The test used to pass

Could you please tell us how does this test starts to fail without the
change to the test described there?  What is the error message, etc.?

>                       as fill_commit_in_graph() guarantees the values of
                           ^^^^^^^^^^^^^^^^^^^^^^

s/fill_commit_in_graph()/fill_commit_graph_info()/

It is fill_commit_graph_info() that changes its behavior in this patch.

> graph position and generation number, and did not load timestamp.
> However, with corrected commit date we will need load the timestamp as
> well to populate the generation number.
>
> Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.

I think this commit should be split into two commits:
- fix to the 'generate tar with future mtime' test
- simplify implementation of fill_commit_in_graph()

The test 'generate tar with future mtime' in t/t5000-tar-tree.sh creates
a commit with commit time of (2 ^ 36 - 1) seconds since EPOCH
(68719476737). However, the commit-graph file format version 1 provides
only 34-bits for storing committer date (32 + 2 bits), not 64-bits.
Therefore maximum far in the future commit time can only be at most
(2 ^ 34 - 1) seconds since EPOCH, as Stolee said in commet for v1
of this series.

This "limitation" is not a problem in practice, because the maximum
timestamp allowed takes place in the year 2514. I hope at that time
there would be no Git version in use that still crashes on changing the
version field in the commit-graph format -- then we can simply get rid
of storing topological levels (generation number v1) in those 30 bits of
CDAT chunk and use full 64 bits for committer date.

Git does not perform any bounds checking for committer date value in
write_graph_chunk_data():

	uint32_t packedDate[2];

	/* ... */

	if (sizeof((*list)->date) > 4)
		packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
	else
		packedDate[0] = 0;

	packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);

	packedDate[1] = htonl((*list)->date);
	hashwrite(f, packedDate, 8);

This means that the date is trimmed to 34 bits on save discarding most
significant bits, assuming that unsigned overflow simply discards most
significant bits truncating the signed (?) value.

In this case running the test with GIT_TEST_COMMIT_GRAPH=1 would lead to
errors, as the committer date read from the commit graph would be
incorrect, and therefore generation number v2 would also be incorrect.

I don't quite understand however how second part of this patch in its
current iteration, namely simplifing the implementation of
fill_commit_in_graph() makes this bug / error shows...

Do I understand it correctly that before this change the committer date
would always be parsed out of the commit object, instead of reading it
from the commit-graph file?  However the only user of static
fill_commit_in_graph() is the parse_commit_in_graph(), which in turn is
used by parse_commit_gently(); but fill_commit_in_graph() read commit
date from commit-graph before this change... color me confused.

Ah, after the change fill_commit_graph_info() changes its behavior, not
fill_commit_in_graph() as said in the commit message. Before this commit
it used to only load graph position and generation number, and did not
load the timestamp. The function fill_commit_graph_info() is used in
turn by public-facing load_commit_graph_info():

  /*
   * It is possible that we loaded commit contents from the commit buffer,
   * but we also want to ensure the commit-graph content is correctly
   * checked and filled. Fill the graph_pos and generation members of
   * the given commit.
   */
  void load_commit_graph_info(struct repository *r, struct commit *item);

This function is used in turn by get_bloom_filter(), contains_tag_algo()
and parse_commit_buffer(), change in any of which behavior can lead to
failing 'generate tar with future mtime' test.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 29 +++++++++++------------------
>  t/t5000-tar-tree.sh |  4 ++--
>  2 files changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index ace7400a1a..af8d9cc45e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -725,15 +725,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +

All right, if we want to use fill_commit_graph_info() function to load
the graph data (graph position and generation number) in the
fill_commit_in_graph() we need to perform this check.

I'd think that this check should be here from the beginning, just in
case.

Sidenote: I wonder if it would be good idea to print more information in
the above error message, for example:

	die(_("invalid commit position %ld. commit-graph '%s' is likely corrupt"),
        pos, g->filename);

But this is unrelated thing, tangential to this change, and it might not
add anything useful.

>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +

I think this change, moving loading of commit date from commit-graph out
of fill_commit_in_graph() and into fill_commit_graph_info(), is in my
opinion a bit inadequatly described in the commit message. As I
understand it this change prepares fill_commit_graph_info() for
generation number v2, that is Corrected Commit Date, where loading
commit date from CDAT together with loading offset from GDAT would be
necessary to correctly set the 'generation' field of 'struct
commit_graph_data' (on the commit_graph_data_slab).

I'm not sure if it would be worth it splitting this refactoring change
(Move Statements into Function) into a separate patch -- it would split
this commit into three, changing 11 part series into 13 part series.

Note that we might want to update the description of
load_commit_graph_info() in commit-graph.h to include that it
incidentally loads commit date from the commit-graph.  Butthis might be
not worth it -- it is a side effect, not the major goal of this
function.

>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>
> @@ -748,38 +757,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>
> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));
> +	fill_commit_graph_info(item, g, pos);

All right, the check got moved into fill_commit_graph_info().

>
> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;

All right, 'graph_pos' field in the graph data (on commit slab) got
filled by just called load_commit_graph_info().

>  	lex_index = pos - g->num_commits_in_base;
> -
> -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;

All right, unrelated cleanup in the neighbourhood.

>
>  	item->object.parsed = 1;
>
>  	set_commit_tree(item, NULL);
>
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -

All right, this code got moved down the call chain into just called
load_commit_graph_info().

> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -

All right, 'generation' field in the graph data (on commit slab) got
filled by load_commit_graph_info() called at the beginning of the
function.

>  	pptr = &item->parents;
>
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 37655a237c..1986354fc3 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
>  		git commit -m "tempori parendum"
>  '
>
> @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
>  '
>
>  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> -	echo 4147 >expect &&
> +	echo 2514 >expect &&
>  	tar_info future.tar | cut -d" " -f2 >actual &&
>  	test_cmp expect actual
>  '

Looks good to me.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-19 17:54       ` Jakub Narębski
@ 2020-08-21  4:11         ` Abhishek Kumar
  2020-08-25 11:11           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-21  4:11 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number (within CDAT chunk) and has undefined behavior.
> >
> > The test used to pass
> 
> Could you please tell us how does this test starts to fail without the
> change to the test described there?  What is the error message, etc.?
> 

Here's what I revised the commit message to:

commit-graph: consolidate fill_commit_graph_info

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
34-bits for storing commiter date, thus committer time overflows into
generation number (within CDAT chunk) and has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and loads committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits) to populate the generation
number. Thus, fill_commit_graph_info() sets date and generates tar file
with the truncated mtime and the test fails.

Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
will not be truncated.

> >                       as fill_commit_in_graph() guarantees the values of
>                            ^^^^^^^^^^^^^^^^^^^^^^
> 
> s/fill_commit_in_graph()/fill_commit_graph_info()/
> 
> ...
> 
> Ah, after the change fill_commit_graph_info() changes its behavior, not
> fill_commit_in_graph() as said in the commit message. Before this commit
> it used to only load graph position and generation number, and did not
> load the timestamp. The function fill_commit_graph_info() is used in
> turn by public-facing load_commit_graph_info():
> 

That's exactly it. I should have elaborated better in the commit
message. Thanks for the through investigation.

>   /*
>    * It is possible that we loaded commit contents from the commit buffer,
>    * but we also want to ensure the commit-graph content is correctly
>    * checked and filled. Fill the graph_pos and generation members of
>    * the given commit.
>    */
>   void load_commit_graph_info(struct repository *r, struct commit *item);
> 
> ...
> 
> Looks good to me.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen
  2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
  2020-08-17 13:22       ` Derrick Stolee
@ 2020-08-21 11:05       ` Jakub Narębski
  1 sibling, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-21 11:05 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.

All right, seems reasonable.

Though it might be not obvious that the second repetition of code
comparing commits by generation is part of commit.c's
compare_commits_by_gen_then_commit_date().

Is't it micro-pessimization though, or can the compiler inline function
across different files?  On the other hand it reduces code duplication...

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 15 +++++++++++++++
>  commit-graph.h |  2 ++
>  commit-reach.c | 15 ---------------
>  commit.c       |  9 +++------
>  4 files changed, 20 insertions(+), 21 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index af8d9cc45e..fb6e2bf18f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
>  	return data->generation;
>  }
>
> +int compare_commits_by_gen(const void *_a, const void *_b)
> +{
> +	const struct commit *a = _a, *b = _b;
> +	const uint32_t generation_a = commit_graph_generation(a);
> +	const uint32_t generation_b = commit_graph_generation(b);

All right, this function used protected access to generation number of a
commit, that is it correctly handles the case where commit '_a' and/or
'_b' are new enough to be not present in the commit graph.

That is why we cannot simply use commit_gen_cmp(), that is the function
used for sorting during `git commit-graph write --reachable --changed-paths`,
because after 1st patch it access the slab directly.

> +
> +	/* older commits first */

Nice!  Thanks for adding this comment.

Though it might be good idea to add this comment also to the header
file, commit-graph.h, because the fact that compare_commits_by_gen()
and compare_commits_by_gen_then_commit_date() sort in different
order is not something that we can see from their names.  Well,
they have slightly different sigatures...

> +	if (generation_a < generation_b)
> +		return -1;
> +	else if (generation_a > generation_b)
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
>  {
>  	unsigned int i, nth_slab;
> diff --git a/commit-graph.h b/commit-graph.h
> index 09a97030dc..701e3d41aa 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -146,4 +146,6 @@ struct commit_graph_data {
>   */
>  uint32_t commit_graph_generation(const struct commit *);
>  uint32_t commit_graph_position(const struct commit *);
> +
> +int compare_commits_by_gen(const void *_a, const void *_b);

All right.

>  #endif
> diff --git a/commit-reach.c b/commit-reach.c
> index efd5925cbb..c83cc291e7 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
>  	return repo_is_descendant_of(the_repository, commit, list);
>  }
>
> -static int compare_commits_by_gen(const void *_a, const void *_b)
> -{
> -	const struct commit *a = *(const struct commit * const *)_a;
> -	const struct commit *b = *(const struct commit * const *)_b;
> -
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> -
> -	if (generation_a < generation_b)
> -		return -1;
> -	if (generation_a > generation_b)
> -		return 1;
> -	return 0;
> -}

All right, commit-reach.c includes commit-graph.h, so now it simply uses
compare_commits_by_gen() that was copied to commit-graph.c.

> -
>  int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
> diff --git a/commit.c b/commit.c
> index 4ce8cb38d5..bd6d5e587f 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
>  int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> -	const uint32_t generation_a = commit_graph_generation(a),
> -		       generation_b = commit_graph_generation(b);
> +	int ret_val = compare_commits_by_gen(a_, b_);
>
>  	/* newer commits first */

Maybe this comment should be put in the header file, near this functionn
declaration?

> -	if (generation_a < generation_b)
> -		return 1;
> -	else if (generation_a > generation_b)
> -		return -1;
> +	if (ret_val)
> +		return -ret_val;

All right, this handles reversed sorting order of compare_commits_by_gen().

>
>  	/* use date as a heuristic when generations are equal */
>  	if (a->date < b->date)

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-08-21 13:14       ` Jakub Narębski
  2020-08-25  5:04         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-21 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a preparatory step, let's return timestamp_t values from
> commit_graph_generation(), use timestamp_t for local variables

All right, this is all good.

> and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

This needs more detailed examination.  There are two similar constants,
GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
used for newest commits outside the commit-graph, while the latter is
maximum number that commits in the commit-graph can have (because of the
storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
to be larger than GENERATION_NUMBER_MAX, and it is (and was).

The GENERATION_NUMBER_INFINITY is because of the above requirement
traditionally taken as maximum value that can be represented in the data
type used to store commit's generation number _in memory_, but it can be
less.  For timestamp_t the maximum value that can be represented
is (2 ^ 63 - 1).

All right then.

>

The commit message says nothing about the new symbolic constant
GENERATION_NUMBER_V1_INFINITY, though.

I'm not sure it is even needed (see comments below).

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 18 +++++++++---------
>  commit-graph.h |  4 ++--
>  commit-reach.c | 32 ++++++++++++++++----------------
>  commit-reach.h |  2 +-
>  commit.h       |  3 ++-
>  revision.c     | 10 +++++-----
>  upload-pack.c  |  2 +-
>  7 files changed, 36 insertions(+), 35 deletions(-)

I hope that changing the type returned by commit_graph_generation() and
stored in 'generation' field of `struct commit_graph_data` would mean
that the compiler or at least the linter would catch all the places that
need updating the type.

Just in case, I have performed a simple code search and it agrees with
the above list (one search result missing, in commit.c, was handled by
previous patch).

>
> diff --git a/commit-graph.c b/commit-graph.c
> index fb6e2bf18f..7f9f858577 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
>  	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
>  }
>
> -uint32_t commit_graph_generation(const struct commit *c)
> +timestamp_t commit_graph_generation(const struct commit *c)
>  {
>  	struct commit_graph_data *data =
>  		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
> @@ -115,8 +115,8 @@ uint32_t commit_graph_generation(const struct commit *c)
>  int compare_commits_by_gen(const void *_a, const void *_b)
>  {
>  	const struct commit *a = _a, *b = _b;
> -	const uint32_t generation_a = commit_graph_generation(a);
> -	const uint32_t generation_b = commit_graph_generation(b);
> +	const timestamp_t generation_a = commit_graph_generation(a);
> +	const timestamp_t generation_b = commit_graph_generation(b);
>
>  	/* older commits first */
>  	if (generation_a < generation_b)
> @@ -159,8 +159,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_data_at(a)->generation;
> -	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
> +	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

All right.

> @@ -1338,7 +1338,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;

Shouldn't this be

-  		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+  		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;


>
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> +		if (generation != GENERATION_NUMBER_V1_INFINITY &&

Then there would be no need for this change, isn't it?

>  		    generation != GENERATION_NUMBER_ZERO)
>  			continue;
>
> @@ -1352,7 +1352,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				generation = commit_graph_data_at(parent->item)->generation;
>
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> +				if (generation == GENERATION_NUMBER_V1_INFINITY ||

And this one either.

>  				    generation == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
> @@ -2355,8 +2355,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		uint32_t max_generation = 0;
> -		uint32_t generation;
> +		timestamp_t max_generation = 0;
> +		timestamp_t generation;

All right.

>
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> diff --git a/commit-graph.h b/commit-graph.h
> index 701e3d41aa..430bc830bb 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
>
>  struct commit_graph_data {
>  	uint32_t graph_pos;
> -	uint32_t generation;
> +	timestamp_t generation;
>  };

All right; this is the main part of this change.

>
>  /*
>   * Commits should be parsed before accessing generation, graph positions.
>   */
> -uint32_t commit_graph_generation(const struct commit *);
> +timestamp_t commit_graph_generation(const struct commit *);
>  uint32_t commit_graph_position(const struct commit *);

As is this one.

>
>  int compare_commits_by_gen(const void *_a, const void *_b);
> diff --git a/commit-reach.c b/commit-reach.c
> index c83cc291e7..470bc80139 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  static struct commit_list *paint_down_to_common(struct repository *r,
>  						struct commit *one, int n,
>  						struct commit **twos,
> -						int min_generation)
> +						timestamp_t min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> -	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
> +	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>
>  	if (!min_generation)
>  		queue.compare = compare_commits_by_commit_date;

All right.

> @@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  		struct commit *commit = prio_queue_get(&queue);
>  		struct commit_list *parents;
>  		int flags;
> -		uint32_t generation = commit_graph_generation(commit);
> +		timestamp_t generation = commit_graph_generation(commit);
>
>  		if (min_generation && generation > last_gen)
> -			BUG("bad generation skip %8x > %8x at %s",
> +			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
>  			    generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));
>  		last_gen = generation;

All right.

> @@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  		repo_parse_commit(r, array[i]);
>  	for (i = 0; i < cnt; i++) {
>  		struct commit_list *common;
> -		uint32_t min_generation = commit_graph_generation(array[i]);
> +		timestamp_t min_generation = commit_graph_generation(array[i]);
>
>  		if (redundant[i])
>  			continue;
>  		for (j = filled = 0; j < cnt; j++) {
> -			uint32_t curr_generation;
> +			timestamp_t curr_generation;
>  			if (i == j || redundant[j])
>  				continue;
>  			filled_index[filled] = j;

All right.

> @@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> -	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
>
>  	if (repo_parse_commit(r, commit))
>  		return ret;

All right,

> @@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
>  					  struct contains_cache *cache,
> -					  uint32_t cutoff)
> +					  timestamp_t cutoff)

All right.

(Sidenote: this one I have missed in my simple search.)

>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>
> @@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
>  	enum contains_result result;
> -	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
>  	const struct commit_list *p;
>
>  	for (p = want; p; p = p->next) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = p->item;
>  		load_commit_graph_info(the_repository, c);
>  		generation = commit_graph_generation(c);

All right.

> @@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation)
> +				 timestamp_t min_generation)
>  {
>  	struct commit **list = NULL;
>  	int i;
> @@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
>  	struct commit_list *from_iter = from, *to_iter = to;
>  	int result;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>
>  	while (from_iter) {
>  		add_object_array(&from_iter->item->object, NULL, &from_objs);
>
>  		if (!parse_commit(from_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (from_iter->item->date < min_commit_date)
>  				min_commit_date = from_iter->item->date;
>
> @@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>
>  	while (to_iter) {
>  		if (!parse_commit(to_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (to_iter->item->date < min_commit_date)
>  				min_commit_date = to_iter->item->date;
>

All right.

> @@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>  	struct commit_list *found_commits = NULL;
>  	struct commit **to_last = to + nr_to;
>  	struct commit **from_last = from + nr_from;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>  	int num_to_find = 0;
>
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>
>  	for (item = to; item < to_last; item++) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = *item;
>
>  		parse_commit(c);

All right.

> diff --git a/commit-reach.h b/commit-reach.h
> index b49ad71a31..148b56fea5 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation);
> +				 timestamp_t min_generation);
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  		       int commit_date_cutoff);
>

All right.

> diff --git a/commit.h b/commit.h
> index e901538909..bc0732a4fe 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -11,7 +11,8 @@
>  #include "commit-slab.h"
>
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> -#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> +#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
> +#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
>  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
>  #define GENERATION_NUMBER_ZERO 0
>

Why do we even need GENERATION_NUMBER_V1_INFINITY?  It is about marking
out-of-graph commits, and it is about in-memory storage.

We would need separate GENERATION_NUMBER_V1_MAX and GENERATION_NUMBER_V2_MAX
because of different _on-disk_ storage, or in other words file format
limitations.  But that is for the future commit.

> diff --git a/revision.c b/revision.c
> index ecf757c327..411852468b 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3290,7 +3290,7 @@ define_commit_slab(indegree_slab, int);
>  define_commit_slab(author_date_slab, timestamp_t);
>
>  struct topo_walk_info {
> -	uint32_t min_generation;
> +	timestamp_t min_generation;
>  	struct prio_queue explore_queue;
>  	struct prio_queue indegree_queue;
>  	struct prio_queue topo_queue;

All right.

> @@ -3336,7 +3336,7 @@ static void explore_walk_step(struct rev_info *revs)
>  }
>
>  static void explore_to_depth(struct rev_info *revs,
> -			     uint32_t gen_cutoff)
> +			     timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;

All right.

> @@ -3379,7 +3379,7 @@ static void indegree_walk_step(struct rev_info *revs)
>  }
>
>  static void compute_indegrees_to_depth(struct rev_info *revs,
> -				       uint32_t gen_cutoff)
> +				       timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;
> @@ -3437,7 +3437,7 @@ static void init_topo_walk(struct rev_info *revs)
>  	info->min_generation = GENERATION_NUMBER_INFINITY;
>  	for (list = revs->commits; list; list = list->next) {
>  		struct commit *c = list->item;
> -		uint32_t generation;
> +		timestamp_t generation;
>
>  		if (parse_commit_gently(c, 1))
>  			continue;

All right.

> @@ -3498,7 +3498,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  	for (p = commit->parents; p; p = p->next) {
>  		struct commit *parent = p->item;
>  		int *pi;
> -		uint32_t generation;
> +		timestamp_t generation;
>
>  		if (parent->object.flags & UNINTERESTING)
>  			continue;

All right.

> diff --git a/upload-pack.c b/upload-pack.c
> index 80ad9a38d8..bcb8b5dfda 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
>
>  static int ok_to_give_up(struct upload_pack_data *data)
>  {
> -	uint32_t min_generation = GENERATION_NUMBER_ZERO;
> +	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
>
>  	if (!data->have_obj.nr)
>  		return 0;

All right.


Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-08-21 18:43       ` Jakub Narębski
  2020-08-25  6:14         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-21 18:43 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As we are writing topological levels to commit data chunk to ensure
> backwards compatibility with "Old" Git and the member `generation` of
> struct commit_graph_data will store corrected commit date in a later
> commit, let's introduce a commit-slab to store topological levels while
> writing commit-graph.

In my opinion the above it would be easier to follow if rephrased in the
following way:

  In a later commit we will introduce corrected commit date as the
  generation number v2.  This value will be stored in the new separate
  GDAT chunk.  However to ensure backwards compatibility with "Old" Git
  we need to continue to write generation number v1, which is
  topological level, to the commit data chunk (CDAT).  This means that
  we need to compute both versions of generation numbers when writing
  the commit-graph file.  Let's therefore introduce a commit-slab
  to store topological levels; corrected commit date will be stored
  in the member `generation` of struct commit_graph_data.

What do you think?

By the way, do I understand it correctly that in backward-compatibility
mode (that is, in mixed-version environment where at least some
commit-graph files were written by "Old" Git and are lacking GDAT chunk
and generation number v2 data) the `generation` member of commit graph
data chunk will be populated and will store generation number v1, that
is topological level? And that the commit-slab for topological levels is
only there for writing and re-writing?

>
> When Git creates a split commit-graph, it takes advantage of the
> generation values that have been computed already and present in
> existing commit-graph files.
>
> So, let's add a pointer to struct commit_graph to the topological level
> commit-slab and populate it with topological levels while writing a
> split commit-graph.

All right, looks sensible.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
>  commit-graph.h |  1 +
>  commit.h       |  1 +
>  3 files changed, 34 insertions(+), 15 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7f9f858577..a2f15b2825 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>
> +define_commit_slab(topo_level_slab, uint32_t);
> +

All right.

Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
think it would be necessary.

>  /* Keep track of the order in which commits are added to our list. */
>  define_commit_slab(commit_pos, int);
>  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
> +	if (g->topo_levels)
> +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }

All right, here we store topological levels on commit-slab to avoid
recomputing them.

Do I understand it correctly that the `topo_levels` member of the `struct
commit_graph` would be non-null only when we are updating the
commit-graph?

>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -953,6 +958,7 @@ struct write_commit_graph_context {
>  		 changed_paths:1,
>  		 order_by_pack:1;
>
> +	struct topo_level_slab *topo_levels;
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
>  	const struct bloom_filter_settings *bloom_settings;

Why do we need `topo_levels` member *both* in `struct commit_graph` and
in `struct write_commit_graph_context`?

[After examining the change further I have realized why both are needed,
 and written about the reasoning later in this email.]

Note that the commit message talks only about `struct commit_graph`...

> @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
>  		else
>  			packedDate[0] = 0;
>
> -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);

All right, here we prepare for writing to the CDAT chunk using data that
is now stored on newly introduced topo_levels slab (either computed, or
taken from commit-graph file being rewritten).

Assuming that ctx->topo_levels is not-null, and that the values are
properly calculated before this -- and we did compute topological levels
before writing the commit-graph.

>
>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

All right, so that is why this 'generation' variable was not converted
to timestamp_t type.

>
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> +		    level != GENERATION_NUMBER_ZERO)
>  			continue;

Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
outside commit-graph files, and therefore we would need its topological
level computed.

However, I don't understand how it works.  We have had created the
commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
to provide default values for `struct commit_graph`... but only for
`graph_pos` member.  It is commit_graph_generation() that returns
GENERATION_NUMBER_INFINITY for commits not in graph.

But neither commit_graph_data_at()->generation nor topo_level_slab_at()
handles this special case, so I don't see how 'generation' variable can
*ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
GENERATION_NUMBER_V1_INFINITY for commits not in graph.

Does it work *accidentally*, because the default value for uninitialized
data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
of commit-graph feature development history, namely the short time where
Git didn't use any generation numbers and stored 0 in the place set for
it in the commit-graph format...  On the other hand this is not the case
for corrected commit date (generation number v2), as it could
"legitimately" be 0 if some root commit (without any parents) had
committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
caused by malformed but valid commit object.

Ugh...

>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>
> -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> +				    level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else if (level > max_level) {
> +					max_level = level;
>  				}
>  			}

This is the same case as for previous chunk; see the comment above.

This code checks if parents have generation number / topological level
computed, and tracks maximum value of it among all parents.

>
>  			if (all_parents_computed) {
> -				struct commit_graph_data *data = commit_graph_data_at(current);
> -
> -				data->generation = max_generation + 1;
>  				pop_commit(&list);
>
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;

OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
this value can be maximum value that can be safely stored in a given
data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
value that can be safely stored in uint32_t, so generation+1 had no
chance to overflow.  This is no longer the case; the reorganization done
here leads to more defensive code (safer).

All good.  However I think that we should clamp the value of topological
level to the maximum value that can be safely stored *on disk*, in the
30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
the code to write topological level would get more complicated.

In my opinion the symbolic constant used here should be named
GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
it should be the current value of GENERATION_NUMBER_MAX, that is
0x3FFFFFFF.

>  			}
>  		}
>  	}
> @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>

All right, we will be using topo_level slab for writing the
commit-graph, and only for this purpose, so it is good to put it here.

>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
>  		}
>  	}
>
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
> +	if (ctx->r->objects->commit_graph) {
> +		struct commit_graph *g = ctx->r->objects->commit_graph;
> +
> +		while (g) {
> +			g->topo_levels = &topo_levels;
> +			g = g->base_graph;
> +		}
> +	}

All right, now I see why we need `topo_levels` member both in the
`struct write_commit_graph_context` and in `struct commit_graph`.
The former is for functions that write the commit-graph, the latter for
fill_commit_graph_info() functions that is deep in the callstack, but it
needs to know whether to load topological level to commit-slab, or maybe
put it as generation number (and in the future -- discard it, if not
needed).

Sidenote: this fragment of code, that fills with a given value some
member of the `struct commit_graph` throughout the split commit-graph
chain, will be repeated as similar code in patches later in series.
However without resorting to preprocessor macros I have no idea how to
generalize it to avoid code duplication (well, almost).

> +
>  	if (pack_indexes) {
>  		ctx->order_by_pack = 1;
>  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> diff --git a/commit-graph.h b/commit-graph.h
> index 430bc830bb..1152a9642e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -72,6 +72,7 @@ struct commit_graph {
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
>
> +	struct topo_level_slab *topo_levels;
>  	struct bloom_filter_settings *bloom_filter_settings;
>  };

All right: `struct commit_graph` is public, `struct
write_commit_graph_context` is not.

>
> diff --git a/commit.h b/commit.h
> index bc0732a4fe..bb846e0025 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -15,6 +15,7 @@
>  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
>  #define GENERATION_NUMBER_MAX 0x3FFFFFFF

The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.

>  #define GENERATION_NUMBER_ZERO 0
> +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF

This value is never used, so why it is defined in this commit.

>
>  struct commit_list {
>  	struct commit *item;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-08-22  0:05       ` Jakub Narębski
  2020-08-25  6:49         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-22  0:05 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

Good.

>
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.

Perhaps we should add more details about data type sizes in question.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t).  However corrected commit date
offsets can be safely stored^* using only 32 bits.  This halves the size
of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
2*H + 16 + 4 bytes, which is reduction of around 6%, not including
header, fanout table (OIDF) and extra edges list (EDGE).

Which might mean that the extra complication is not worth it, and we
should store corrected commit date directly instead.

*) unless for example one of commits is malformed but valid,
   and has committerdate of 0 Unix time, 1 January 1970.

>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

OK, so the agenda for introducing geeration number v2 is as follows:
- compute generation numbers v2, i.e. corrected commit date
- store corrected commit date [offsets] in new GDAT chunk,
  unless backward-compatibility concerns require us to not to
- load [and compute] corrected commit date from commit-graph
  storing it as 'generation' field of `struct commit_graph_data`,
  unless backward-compatibility concerns require us to store
  topological levels (generation number v1) in there instead

Because the reachability condition for corrected commit date and for
topological level is exactly the same, we don't need to do anything to
take advantage of generation number v2.

Though we can use generation number v2 in more cases, where we turned
off use of generation numbers because v1 gave worse performance than
date heuristics.

Did I got this right?

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index a2f15b2825..fd69534dd5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

At first I was wondering why this tie-breaking is beig removed; wouldn't
be needed for backward-compatibility?  But then I remembered that this
comparison function is used _only_ for sorting commits when writing
Bloom filters, for `git commit-graph write --reachable --changed-paths ...`

Assuming that when writing the commit graph we always compute geeration
number v2 and 'generation' field stores corrected commit date, we don't
need to use date as a heuristic when generations are equal, and it would
not help in tie-breaking anyway.

All right.

>  	return 0;
>  }
>
> @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;

All right, so the pattern is to add 'corrected_commit_date' stuff after
'topological_level' stuff.

>
>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> +		    )
>  			continue;
>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
>  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}
>  			}
>
> @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_MAX - 1)
>  					max_level = GENERATION_NUMBER_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>  			}
>  		}
>  	}

All right.  Looks good to me.

> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		timestamp_t max_generation = 0;
> -		timestamp_t generation;
> +		timestamp_t max_corrected_commit_date = 0;
> +		timestamp_t corrected_commit_date;

This is simple, and perhaps unnecessary, rename of variables.
Shouldn't we however verify *both* topological level, and
(if exists) corrected commit date?

>
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  					     oid_to_hex(&graph_parents->item->object.oid),
>  					     oid_to_hex(&odb_parents->item->object.oid));
>
> -			generation = commit_graph_generation(graph_parents->item);
> -			if (generation > max_generation)
> -				max_generation = generation;
> +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> +			if (corrected_commit_date > max_corrected_commit_date)
> +				max_corrected_commit_date = corrected_commit_date;

Actually, commit_graph_generation(<commit>) can return either corrected
commit date, or topological level, the latter in backward-compatibility
case (if at least one commit-graph file is lacking GDAT chunk, because
[some of] it was created by the "Old" Git).

>
>  			graph_parents = graph_parents->next;
>  			odb_parents = odb_parents->next;
> @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_MAX)
> -			max_generation--;

All right, this was needed for checking the correctness of topological
levels (generation number v1) because we were checking not that it
fullfills the reachability condition, but more strict one: namely that
topological level of commit is equal to maximum of topological levels of
its parents plus one.

The comment about checking both generation number v1 and v2 still
applies.

> -
> -		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		corrected_commit_date = commit_graph_generation(graph_commit);
> +		if (corrected_commit_date < max_corrected_commit_date + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>  				     oid_to_hex(&cur_oid),
> -				     generation,
> -				     max_generation + 1);
> +				     corrected_commit_date,
> +				     max_corrected_commit_date + 1);

All right, we check less strict condition for corrected commit date.

>
>  		if (graph_commit->date != odb_commit->date)
>  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 08/11] commit-graph: implement generation data chunk
  2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-08-22 13:09       ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-22 13:09 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

All right, it looks like this patch implements first part of step 2.)
and step 3.) in the following plan of adding support for geeration
number v2:

1. compute generation numbers v2, i.e. corrected commit date
2. store corrected commit date [offsets] in new GDAT chunk,
   unless backward-compatibility concerns require us to not to
3. load [and compute] corrected commit date from commit-graph
   storing it as 'generation' field of `struct commit_graph_data`,
   unless backward-compatibility concerns require us to store
   topological levels (generation number v1) in there instead
4. use generation number v2 in more places, where we had to turn
   off using v1 for performance reasons


"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.

Fortunately, we have fixed this issue, and Git does no longer die when
it encounters commit-graph format version that it does not understand.

>
> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.

Should we say anything about storing 64 bit corrected commit date
(geeration number v2) as 32 bit corrected commit date offset?

>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).

Note that the fact that we do not have special code for handling
mixed-version layers in split commit-graph is not [that] dangerous, as
we don't read this new data yet.  Splitting it to patch 09/11 (next
patch) makes this patch simpler.

>
> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.

All right.

>
> [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 48 ++++++++++++++++++++++++---
>  commit-graph.h                |  2 ++
>  t/README                      |  3 ++
>  t/helper/test-read-graph.c    |  2 ++
>  t/t4216-log-bloom.sh          |  4 +--
>  t/t5318-commit-graph.sh       | 27 +++++++--------
>  t/t5324-split-commit-graph.sh | 12 +++----
>  t/t6600-test-reach.sh         | 62 +++++++++++++++++++----------------
>  8 files changed, 107 insertions(+), 53 deletions(-)

It might be a good idea to add documentation of this chunk (and only
about this chunk) to Documentation/technical/commit-graph-format.txt

>
> diff --git a/commit-graph.c b/commit-graph.c
> index fd69534dd5..b7a72b40db 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8

All right, define new chunk and increase the maximum number of chunks
commit-graph file can have.

>
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +

All right.  The size of GDAT chunk is defined by the number of commits,
so nothink more is needed.

>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data)
> +		graph_data->generation = item->date +
> +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

WARNING: The above does not properly handle clamped data.

If the offset is maximum value that can be stored in 32 bit field, that
is GENERATION_NUMBER_V2_OFFSET_MAX for more that one commit, we wouldn't
know which commit has greater generation number v2 because of this clamping.
The 'generation' data needs to be set to GENERATION_NUMBER_V2_MAX (which
in turn needs to be smaller than GENERATION_NUMBER_INFINITY).  This is
not done here!

All the above complication would not be an issue if we stored 64 bit
corrected commit date directly, instead of storing 32 bit corrected
commit date offsets. Storing offsets saves at most 4/(2*H + 16 + 4) = 7%
of commit-graph file size (OIDL + CDAT + GDAT with offsets), when
160-bit/20-byte SHA-1 hash is used.

> +	else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

Should we convert GENERATION_NUMBER_V1_MAX into GENERATION_NUMBER_MAX
here, for easier handling of clamped values of generation numbers later
on in backward-compatibile way, without special-casing for v1 and v2?


Here we load and perhaps compute generation number, using v2 if possible
(from GDAT + commit date), with fallback to v1 (from CDAT).

Almost all right... but should this reading be a part of this patch, or
split off into separate patch?

>
>  	if (g->topo_levels)
>  		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> @@ -951,7 +963,8 @@ struct write_commit_graph_context {
>  		 report_progress:1,
>  		 split:1,
>  		 changed_paths:1,
> -		 order_by_pack:1;
> +		 order_by_pack:1,
> +		 write_generation_data:1;

All right, here we store the iformation if we should write the GDAT
chunk, taking into account among other things state of the
GIT_TEST_COMMIT_GRAPH_NO_GDAT enviroment variable.

>
>  	struct topo_level_slab *topo_levels;
>  	const struct split_commit_graph_opts *split_opts;
> @@ -1106,8 +1119,25 @@ static int write_graph_chunk_data(struct hashfile *f,
>  	return 0;
>  }
>
> +static int write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	for (i = 0; i < ctx->commits.nr; i++) {

Side note: it is a bit funny that some of write_graph_chunk_*()
functions use `for` loop, and some `while` loop to process commits.

> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
> +			offset = GENERATION_NUMBER_V2_OFFSET_MAX;

This GENERATION_NUMBER_V2_OFFSET_MAX symbolic constant (or equivalet)
should be defined in this patch and not in previous one, in my opinion.

> +		hashwrite_be32(f, offset);
> +	}
> +
> +	return 0;
> +}
> +
>  static int write_graph_chunk_extra_edges(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx)
> +					  struct write_commit_graph_context *ctx)

This change in whitespace is, I think, incorrect.

>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1726,6 +1756,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
>  	chunks[2].write_fn = write_graph_chunk_data;
> +
> +	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
> +		ctx->write_generation_data = 0;

All right, here we handle newly introduced GIT_TEST_COMMIT_GRAPH_NO_GDAT
environment variable.

> +	if (ctx->write_generation_data) {
> +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
> +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
> +		num_chunks++;
> +	}

Hmmm... so the GDAT chunk does not have a header, and is not versioned.
Does this mean that to move to generation number v3, or add some
additional reachability labeling we would need to either rename the
chunk or add new one?

>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> @@ -2130,6 +2169,7 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->split_opts = split_opts;
>  	ctx->total_bloom_filter_data_size = 0;
> +	ctx->write_generation_data = 1;

But by default we do write the GDAT chunk.

>
>  	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
>  		ctx->changed_paths = 1;
> diff --git a/commit-graph.h b/commit-graph.h
> index 1152a9642e..f78c892fc0 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -6,6 +6,7 @@
>  #include "oidset.h"
>
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
> +#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
>  #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>

All right.

(Though I wonder about the ordering -- but better not to start "bike
shed" discussion).

> @@ -67,6 +68,7 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;

All right, we need to store position of new GDAT chunk.

> diff --git a/t/README b/t/README
> index 70ec61cf88..6647ef132e 100644
> --- a/t/README
> +++ b/t/README
> @@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
>  be written after every 'git commit' command, and overrides the
>  'core.commitGraph' setting to true.
>
> +GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
> +commit-graph to be written without generation data chunk.
> +

All right.

This description could have been more detailed, but I think it is good
enough.

>  GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
>  commit-graph write to compute and write changed path Bloom filters for
>  every 'git commit-graph write', as if the `--changed-paths` option was
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..1c2a5366c7 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)

All right, we examine if GDAT chunk is present.

Many commit-graph tests would probably need to be updated; at least
those that make use if `git test-tool read-graph`.

> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c21cc160f3..55c94e9ebd 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual

All right.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 044cf8a3de..b41b2160c6 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -71,7 +71,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>  graph_read_expect() {
>  	OPTIONAL=""
>  	NUM_CHUNKS=3
> -	if test ! -z $2
> +	if test ! -z "$2"
>  	then
>  		OPTIONAL=" $2"
>  		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))

All right, a fix for issue that is relevant only after this change, now
that there can be more than one extra chunk.

Side note: how I wish that helper function in tests were documented...

> @@ -98,14 +98,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
>  	# valid commit and tree OID
>  	git rev-parse HEAD HEAD^{tree} >in &&
>  	git commit-graph write --stdin-commits <in &&
> -	graph_read_expect 3
> +	graph_read_expect 3 generation_data

All right, we need to treat generation_data as extra, because it can be
not there (it's existence is conditional).

>  '
>
>  test_expect_success 'write graph' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "3"
> +	graph_read_expect "3" generation_data

Side note: I wonder why here we have

  	graph_read_expect "3" generation_data

but one test earlier we have

  	graph_read_expect 3 generation_data

without quotes.

>  '
>
>  test_expect_success POSIXPERM 'write graph has correct permissions' '
> @@ -214,7 +214,7 @@ test_expect_success 'write graph with merges' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"

All right.  It is why we needed to fix graph_read_expect().

>  '
>
>  graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
> @@ -249,7 +249,7 @@ test_expect_success 'write graph with new commit' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -259,7 +259,7 @@ test_expect_success 'write graph with nothing new' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -269,7 +269,7 @@ test_expect_success 'build graph from latest pack with closure' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	cat new-idx | git commit-graph write --stdin-packs &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "9" "extra_edges"
> +	graph_read_expect "9" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
> @@ -282,7 +282,7 @@ test_expect_success 'build graph from commits with closure' '
>  	git rev-parse merge/1 >>commits-in &&
>  	cat commits-in | git commit-graph write --stdin-commits &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "6"
> +	graph_read_expect "6" "generation_data"
>  '
>
>  graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
> @@ -292,7 +292,7 @@ test_expect_success 'build graph from commits with append' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -302,7 +302,7 @@ test_expect_success 'build graph using --reachable' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write --reachable &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -323,7 +323,7 @@ test_expect_success 'write graph in bare repo' '
>  	cd "$TRASH_DIRECTORY/bare" &&
>  	git commit-graph write &&
>  	test_path_is_file $baredir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '

All right, those were the straightforward changes.

>
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -420,8 +420,9 @@ test_expect_success 'replace-objects invalidates commit-graph' '
>
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> -	git commit-graph verify >output
> +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> +	git commit-graph verify >output &&
> +	graph_read_expect 9 extra_edges
>  '

What this change is about?  Is it about the fact that we have not added
support for checking correctness of GDAT chunk to `git commit-graph
verify`?  But in previous commit we did modify verify_commit_graph()...

>
>  NUM_COMMITS=9
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index ea28d522b8..531016f405 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
>  	infodir=".git/objects/info" &&
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  	EOM
>  '

I guess this change is because with GDAT chunk present the positions of
relevant bits that we want to corrupt change (I guess because we have
extra chunk in chunk lookup section).

Someone else would have to verify if this change is in fact correct, if
it was not done already.

>
> @@ -28,9 +28,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data

All right, we now have 4 chunks not 3 (old ones + generation_data).

>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index 475564bee7..d14b129f06 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -55,10 +55,13 @@ test_expect_success 'setup' '
>  	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
>  	mv .git/objects/info/commit-graph commit-graph-half &&
>  	chmod u+w commit-graph-half &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> +	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> +	chmod u+w commit-graph-no-gdat &&
>  	git config core.commitGraph true
>  '

All right, we add setup for testing no GDAT case (as if the commit-graph
file was written by "Old" Git).

>
> -run_three_modes () {
> +run_all_modes () {

All right, we compare more modes, among others the no-GDAT case.

Nice futureproofing!

>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
>  	test_cmp expect actual &&
> @@ -67,11 +70,14 @@ run_three_modes () {
>  	test_cmp expect actual &&
>  	cp commit-graph-half .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
> +	test_cmp expect actual &&
> +	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> +	"$@" <input >actual &&
>  	test_cmp expect actual
>  }

OK, now we test yet another variant of commit-graph file: one without
the GDAT chunk (testing for backward compatibility with "Old" Git).

>
> -test_three_modes () {
> -	run_three_modes test-tool reach "$@"
> +test_all_modes () {
> +	run_all_modes test-tool reach "$@"
>  }

All right.

>
>  test_expect_success 'ref_newer:miss' '
> @@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
>  	B:commit-4-9
>  	EOF
>  	echo "ref_newer(A,B):0" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>
>  test_expect_success 'ref_newer:hit' '
> @@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
>  	B:commit-2-3
>  	EOF
>  	echo "ref_newer(A,B):1" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>
>  test_expect_success 'in_merge_bases:hit' '
> @@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
>  	B:commit-8-8
>  	EOF
>  	echo "in_merge_bases(A,B):1" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>
>  test_expect_success 'in_merge_bases:miss' '
> @@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
>  	B:commit-5-9
>  	EOF
>  	echo "in_merge_bases(A,B):0" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>
>  test_expect_success 'is_descendant_of:hit' '
> @@ -118,7 +124,7 @@ test_expect_success 'is_descendant_of:hit' '
>  	X:commit-1-1
>  	EOF
>  	echo "is_descendant_of(A,X):1" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>
>  test_expect_success 'is_descendant_of:miss' '
> @@ -129,7 +135,7 @@ test_expect_success 'is_descendant_of:miss' '
>  	X:commit-7-6
>  	EOF
>  	echo "is_descendant_of(A,X):0" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>
>  test_expect_success 'get_merge_bases_many' '
> @@ -144,7 +150,7 @@ test_expect_success 'get_merge_bases_many' '
>  		git rev-parse commit-5-6 \
>  			      commit-4-7 | sort
>  	} >expect &&
> -	test_three_modes get_merge_bases_many
> +	test_all_modes get_merge_bases_many
>  '
>
>  test_expect_success 'reduce_heads' '
> @@ -166,7 +172,7 @@ test_expect_success 'reduce_heads' '
>  			      commit-2-8 \
>  			      commit-1-10 | sort
>  	} >expect &&
> -	test_three_modes reduce_heads
> +	test_all_modes reduce_heads
>  '
>
>  test_expect_success 'can_all_from_reach:hit' '
> @@ -189,7 +195,7 @@ test_expect_success 'can_all_from_reach:hit' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach(X,Y):1" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>
>  test_expect_success 'can_all_from_reach:miss' '
> @@ -211,7 +217,7 @@ test_expect_success 'can_all_from_reach:miss' '
>  	Y:commit-8-5
>  	EOF
>  	echo "can_all_from_reach(X,Y):0" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>
>  test_expect_success 'can_all_from_reach_with_flag: tags case' '
> @@ -234,7 +240,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
> -	test_three_modes can_all_from_reach_with_flag
> +	test_all_modes can_all_from_reach_with_flag
>  '
>
>  test_expect_success 'commit_contains:hit' '
> @@ -250,8 +256,8 @@ test_expect_success 'commit_contains:hit' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):1" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>
>  test_expect_success 'commit_contains:miss' '
> @@ -267,8 +273,8 @@ test_expect_success 'commit_contains:miss' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):0" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>
>  test_expect_success 'rev-list: basic topo-order' '
> @@ -280,7 +286,7 @@ test_expect_success 'rev-list: basic topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-6-6
> +	run_all_modes git rev-list --topo-order commit-6-6
>  '
>
>  test_expect_success 'rev-list: first-parent topo-order' '
> @@ -292,7 +298,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-6-6
>  '
>
>  test_expect_success 'rev-list: range topo-order' '
> @@ -304,7 +310,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
>  '
>
>  test_expect_success 'rev-list: range topo-order' '
> @@ -316,7 +322,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
>  '
>
>  test_expect_success 'rev-list: first-parent range topo-order' '
> @@ -328,7 +334,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
>  '
>
>  test_expect_success 'rev-list: ancestry-path topo-order' '
> @@ -338,7 +344,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
>  		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
>  		commit-6-3 commit-5-3 commit-4-3 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
>  '
>
>  test_expect_success 'rev-list: symmetric difference topo-order' '
> @@ -352,7 +358,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
>  		commit-3-8 commit-2-8 commit-1-8 \
>  		commit-3-7 commit-2-7 commit-1-7 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
>  '
>
>  test_expect_success 'get_reachable_subset:all' '
> @@ -372,7 +378,7 @@ test_expect_success 'get_reachable_subset:all' '
>  			      commit-1-7 \
>  			      commit-5-6 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>
>  test_expect_success 'get_reachable_subset:some' '
> @@ -390,7 +396,7 @@ test_expect_success 'get_reachable_subset:some' '
>  		git rev-parse commit-3-3 \
>  			      commit-1-7 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>
>  test_expect_success 'get_reachable_subset:none' '
> @@ -404,7 +410,7 @@ test_expect_success 'get_reachable_subset:none' '
>  	Y:commit-2-8
>  	EOF
>  	echo "get_reachable_subset(X,Y)" >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '

All right, s/test_three_modes/test_all_modes/... which admitedly could
have been separate pure refactoring patch, but it is not necessary.

>
>  test_done

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-08-22 17:14       ` Jakub Narębski
  2020-08-26  7:15         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-22 17:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

All right, this explains changes to the *reading* side.  If there is
split commit-graph layer without GDAT chunk (or to be more exact,
current code checks if there is any layer without GDAT chunk), then we
fall back to using topological levels, as if all layers / all graphs
were without GDAT.  This allows to avoid the issue explained above,
where for some commits 'generation' holds corrected commit date, and for
some it holds topological levels, breaking the reachability condition
guarantee.

However the commit message do not say anything about the *writing* side.

We have decided to not write the GDAT chunk when writing the new layer
in split commit-graph, and top layer doesn't itself have GDAT chunks.
That makes for easier reasoning and safer handling: in mixed-version
environment the only possible arrangement is for the lower layers
(possibly zero) have GDAT chunk, and higher layers (possibly zero) not
have GDAT chunks.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite should
not have GDAT chunks either.

To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
           1              2           3             4            5

In the split commit-graph chain above, merging two topmost layers
(layers 4 and 5) should create a layer without GDAT; merging three
topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
4) should create a new layer with GDAT.

   [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
           1              2           3               merged

   [layer with GDAT][with GDAT][-------------with GDAT------------------]
           1              2                    merged

I hope those ASCII-art pictures help understanding it

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

Wouldn't breaking the reachability condition promise make some Git
commands to return *incorrect* results if they short-circuit, stop
walking if generation number shows that A cannot reach B?

I am talking here about commands that return boolean, or select subset
from given set of revisions:
- git merge-base --is-ancestor <B> <A>
- git branch branch-A <A> && git branch --contains <B>
- git branch branch-B <B> && git branch --merged <A>

Git assumes that generation numbers fulfill the following condition:

  if A can reach B, then gen(A) > gen(B)

Notably this includes commits not in commit-graph, and clamped values.

However, in the following case

* if commit A is from higher layer without GDAT
  and uses topological levels for 'generation', e.g. 115 (in a small repo)
* and commit B is from lower layer with GDAT
  and uses corrected commit date as 'generation', for example 1598112896,

it may happen that A (later commit) can reach B (earlier commit), but
gen(B) > gen(A).  The reachability condition promise for generation
numbers is broken.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

I have reordered files in the patch itself to make it easier to review
the proposed changes.

>  commit-graph.h                |  1 +
>  commit-graph.c                | 32 +++++++++++++++-
>  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
>  3 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.h b/commit-graph.h
> index f78c892fc0..3cf89d895d 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -63,6 +63,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>
>  	uint32_t num_commits_in_base;
> +	uint32_t read_generation_data;
>  	struct commit_graph *base_graph;
>

First, why `read_generation_data` is of uint32_t type, when it stores
(as far as I understand it), a "boolean" value of either 0 or 1?

Second, couldn't we simply set chunk_generation_data to NULL?  Or would
that interfere with the case of rewriting, where we want to use existing
GDAT data when writing new commit-graph with GDAT chunk?

>  	const uint32_t *chunk_oid_fanout;
> diff --git a/commit-graph.c b/commit-graph.c
> index b7a72b40db..c1292f8e08 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>
> +static void validate_mixed_generation_chain(struct repository *r)
> +{
> +	struct commit_graph *g = r->objects->commit_graph;
> +	int read_generation_data = 1;
> +
> +	while (g) {
> +		if (!g->chunk_generation_data) {
> +			read_generation_data = 0;
> +			break;
> +		}
> +		g = g->base_graph;
> +	}

This loop checks whole split commit-graph chain for existence of layers
without GDAT chunk.

On one hand it is more than needed _if_ we assume that the fact that
only topmost layers can be without GDAT holds true. On the other hand it
is safer (an example of defensive coding), and as the length of chain is
limited it should be not much of a performance penalty.

> +
> +	g = r->objects->commit_graph;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}

All right... though one of earlier commits introduced similar loop, but
it set chunk_generation_data to NULL instead.  Or did I remember it wrong?

> +}
> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>
> +	validate_mixed_generation_chain(r);
> +

All right, when reading the commit-graph, check if we are in forced
backward-compatibile mode, and we need to use topological levels for
generation numbers.

>  	return g;
>  }
>
> @@ -763,7 +786,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	if (g->chunk_generation_data)
> +	if (g->chunk_generation_data && g->read_generation_data)

All right, when deciding whether to use corrected commit date
(generation number v2 from GDAT chunk), or fall back to using
topological levels (generation number v1 from CDAT chunk), we need to
take into accout other layers, to not mix v1 and v2.

We have earlier checked whether we can use generation number v2, now we
use the result of this check, propagated down the commit-graph chain.

>  		graph_data->generation = item->date +
>  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>  	else
> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>  	uint32_t pos;
>  	if (!prepare_commit_graph(r))
>  		return;
> +
>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>  }

This is unrelated whitespace fix, a "while at it" in neighbourhood of
changes.  All right then.

> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		g = ctx->r->objects->commit_graph;
>
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2210,6 +2237,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}

The previous commit introduced `write_generation_data` member in
`struct write_commit_graph_context`, then used to handle support for
GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable.

Those two hunks of changes above are both inside

   if (ctx->split) {
      ...
   }

Here we examine the topmost layer of split commit-graph chain, and if it
does not contain GDAT chunk, then we do not store the GDAT chunk, unless
`git commit-graph write` is ru with `--split=replace` option.

However this is overly strict condition.  If we merge layer without GDAT
with layer with GDAT below, then we surely can write GDAT; the condition
for GDAT-less layers would be still fulfilled (met).  However we can
consider it 'good enough' for now, and relax this condition in later
commits.

Note that it is the first time in this patch were we make use of
assumption that if there are layers without GDAT then topmost layer is
without GDAT.

>
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 531016f405..ac5e7783fb 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -424,4 +424,74 @@ done <<\EOF
>  0600 -r--------
>  EOF
>

It would be nice to have an ASCII-art graph of commits, but earlier
tests do not have it either...

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'

Looks all right to me.

> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15 &&
> +	git commit-graph verify
> +'

It would be nice to have an example with merging layers (whether we
would handle it in strict or relaxed way).

> +
>  test_done

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-08-22 19:09       ` Jakub Narębski
  2020-09-01 10:08         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-22 19:09 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

Hello Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With corrected commit dates implemented, we no longer have to rely on
> commit date as a heuristic in paint_down_to_common().

All right, but it would be nice to have some benchmark data: what were
performance when using topological levels, what was performance when
using commit date heuristics (before this patch), what is performace now
when using corrected commit date.

>
> t6024-recursive-merge setups a unique repository where all commits have
> the same committer date without well-defined merge-base. As this has
> already caused problems (as noted in 859fdc0 (commit-graph: define
> GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
> test script.

OK?

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c             | 14 ++++++++++++++
>  commit-graph.h             |  6 ++++++
>  commit-reach.c             |  2 +-
>  t/t6024-recursive-merge.sh |  4 +++-
>  4 files changed, 24 insertions(+), 2 deletions(-)
>

I have reorderd files for easier review.

> diff --git a/commit-graph.h b/commit-graph.h
> index 3cf89d895d..e22ec1e626 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
>   */
>  int generation_numbers_enabled(struct repository *r);
>
> +/*
> + * Return 1 if and only if the repository has a commit-graph
> + * file and generation data chunk has been written for the file.
> + */
> +int corrected_commit_dates_enabled(struct repository *r);
> +
>  enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
>  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),

All right.

> diff --git a/commit-graph.c b/commit-graph.c
> index c1292f8e08..6411068411 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
>  	return !!first_generation;
>  }
>
> +int corrected_commit_dates_enabled(struct repository *r)
> +{
> +	struct commit_graph *g;
> +	if (!prepare_commit_graph(r))
> +		return 0;
> +
> +	g = r->objects->commit_graph;
> +
> +	if (!g->num_commits)
> +		return 0;
> +
> +	return !!g->chunk_generation_data;
> +}

The previous commit introduced validate_mixed_generation_chain(), which
walked whole split commit-graph chain, and set `read_generation_data`
field in `struct commit_graph` for all layers in the chain.

This function examines only the top layer, so it follows the assumption
that Git would behave in such way that oly topmost layers in the chai
can be GDAT-less.

Why the difference?  Couldn't validate_mixed_generation_chain() simply
call corrected_commit_dates_enabled()?

> +
>  static void close_commit_graph_one(struct commit_graph *g)
>  {
>  	if (!g)
> diff --git a/commit-reach.c b/commit-reach.c
> index 470bc80139..3a1b925274 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  	int i;
>  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>
> -	if (!min_generation)

This check was added in 091f4cf (commit: don't use generation numbers if
not needed, 2018-08-30) by Derrick Stolee, and its commit message
includes benchmark results for running 'git merge-base v4.8 v4.9' in
Linux kernel repository:

      v2.18.0: 0.122s    167,468 walked
  v2.19.0-rc1: 0.547s    635,579 walked
         HEAD: 0.127s

> +	if (!min_generation && !corrected_commit_dates_enabled(r))
>  		queue.compare = compare_commits_by_commit_date;

It would be nice to have similar benchmark for this change... unless of
course there is no change in performance, but I think then it needs to
be stated explicitly.  I think.

>
>  	one->object.flags |= PARENT1;
> diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
> index 332cfc53fd..d3def66e7d 100755
> --- a/t/t6024-recursive-merge.sh
> +++ b/t/t6024-recursive-merge.sh
> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
>  export GIT_COMMITTER_DATE
>
>  test_expect_success 'setup tests' '
> +	GIT_TEST_COMMIT_GRAPH=0 &&
> +	export GIT_TEST_COMMIT_GRAPH &&
>  	echo 1 >a1 &&
>  	git add a1 &&
>  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
>  '
>
>  test_expect_success 'combined merge conflicts' '
> -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> +	test_must_fail git merge -m final G
>  '
>
>  test_expect_success 'result contains a conflict' '

OK, so instead of disabling commit-graph for this test, now we disable
it for the whole script.

Maybe this change should be in a separate patch?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-08-22 22:20       ` Jakub Narębski
  2020-08-27  6:39         ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-22 22:20 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

All right.

> ---
>  .../technical/commit-graph-format.txt         | 12 ++---
>  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
>  2 files changed, 36 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..71c43884ec 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.

All right, that was duplicated information.  Now that we need to talk
about two of them, it would not make sense to duplicate that.

>
>  - The root tree OID.
>
> @@ -88,6 +84,12 @@ CHUNK DATA:

Shouldn't we also replace 'generation number' occurences in description
of the Commit Data (CDAT) chunk with either 'topological level' or
'generation number v1'?

>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]

It is not exactly 'optional', as it implies that we need to turn it on
(or that we can turn it off).  It is more 'conditional', as it can be
not present due to outside influences (mixed-version environment).

> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.

I have just realized purely theoretical, but possible, problem with
storing non-monotinic generation number related values like corrected
commit date offset in constrained space.  There are problems with
clamping them.

Say that somewhere in the ancestry chain there is a commit A with commit
date far in the future by mistake, for example 2120-08-22; it is
important for that date to be not able to be represented using uint32_t.
Say that a later descendant commit B is malformed, and has committer
date of 0, that is 1970-01-01. This means that the corrected commit date
for B must be larger than 2120-08-22 - which for this commit means that
corrected commit date offset do not fit in 32 bits, and must be clamped
(replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.

Say that we have commit C that is child of B, and it has correct commit
date.  Because of mistake in commit A, it has corrected commit date of
more than 2120-08-22 (corrected commit date degenerated into topological
level plus constant).

Now C can reach B, and B can reach A.  However, if we recover corrected
commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
we get a number that is smaller than correct corrected commit date.  We
will have

   gen(A) > date(B) + offset(B) < gen(C)

Which breaks reachability condition guarantee.

If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
we would get

  gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)

And again reachability condition is broken.

This is a very contrived but possible example.  This shouldn't happen,
but ufortunately it can happen.

The question is how to deal with this issue.  Ignore it as unlikely?
Switch to storing corrected commit date, which is monotonic, so if there
is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
larger commit-graph file?

> +    * This list can be later modified to store future generation number related
> +      data.

How can it be later modified?  There is no header, no version number.
How would we add another generation number data?

> +
>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 808fa30b99..f27145328c 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
>
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates
> +2. Topological levels

Should we add versioning info, that is:

  +1. Corrected committer dates  (generation number v2)
  +2. Topological levels  (generation number v1)

> +
> +Define "corrected committer date" of a commit recursively as follows:
> +
> +  * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.
> +
> +  * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> +Define the "topological level" of a commit recursively as follows:
>
>   * A commit with no parents (a root commit) has generation number one.

Shouldn't this be

    * A commit with no parents (a root commit) has topological level of one.

>
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
>
> -Equivalently, the generation number of a commit A is one more than the
> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:

We should probably explicitly state that the property state applies to
both versions of generation number, not only to topological level.

>
> @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>
> -This heuristic is currently used whenever the computation is allowed to
> -violate topological relationships due to clock skew (such as "git log"
> -with default order), but is not used when the topological order is
> -required (such as merge base calculations, "git log --graph").
> -

To be overly pedantic, this heuristic is still used, but now in much
more rare case.  In addition to what is stated above, at least one layer
in the split commit-graph chain must have been generated by "Old" Git,
for the date heuristic to be used.

But that might be unnecessary level of detail.

>  In practice, we expect some commits to be created recently and not stored
>  in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not

All right.

>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> -generation numbers are computed to be at least this value. We limit at
> -this value since it is the largest value that can be stored in the
> -commit-graph file using the 30 bits available to generation numbers. This
> -presents another case where a commit can have generation number equal to
> -that of a parent.
> +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> +are computed to be at least this value. We limit at this value since it is
> +the largest value that can be stored in the commit-graph file using the
> +available to generation numbers. This presents another case where a
> +commit can have generation number equal to that of a parent.

All right, though it could have been done without re-wrapping, so that
only first line would be marked as changed.

As I wrote, there is theoretical problem with this for offsets.

>
>  Design Details
>  --------------
> @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>
> +We also merge commit-graph chains when we try to write a commit graph with
> +two different generation number definitions as they cannot be compared directly.
> +We overwrite the existing chain and create a commit-graph with the newer or more
> +efficient defintion. For example, overwriting topological levels commit graph
> +chain to create a corrected commit dates commit graph chain.
> +

This is more complicated than that.

I think we should explicitly state that Git ensures that in split
commit-graph chain, if there are layers without the GDAT chunk (that
force Git to use topological levels for generation numbers), then they
are top layers.  So if there is commit-graph file created by "Old" Git,
then when addig new layer it would also be GDAT-less.

Now how to write this...

>  ## Deleting graph-{hash} files
>
>  After a new tip file is written, some `graph-{hash}` files may no longer

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
       [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
  2020-08-18  6:12       ` Abhishek Kumar
@ 2020-08-23 15:27       ` Jakub Narębski
  2020-08-24  2:49         ` Abhishek Kumar
  2 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-23 15:27 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar,
	Jakub Narębski

Hello,

Here is a summary of my comments and thoughts after carefully reviewing
all patches in the series.

Jakub Narębski <jnareb@gmail.com> writes:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>> The Corrected Commit Date is defined as:
>>
>> For a commit C, let its corrected commit date be the maximum of the commit
>> date of C and the corrected commit dates of its parents.
>
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
>
>       A can reach B   =>  gen(A) < gen(B)
>
> Of course it can be computed in simpler way, because
>
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1

I see that it is defined correctly in the documentation, which is more
important than cover letter (which would not be stored in the repository
for memory, even in the commit message for the merge).

[...]
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,

Actually we can simplify handling of merging layers, while still
retaining the property that in mixed-version chain all GDAT-full layers
are before / below GDAT-less layers.

Namely, if merging layers, and at least one layer being merged doesn't
have GDAT chunk, then the result of merge wouldn't have either.

We can always switch to slightly more complicated behavior proposed
above in quoted part later, perhaps as a followup commit.

>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This might be not necessary if Git would always check the whole chain of
split commit-graph layers for presence of GDAT-less layers.

But I still think it is a good idea to avoid having GDAT-less "holes".

>
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)

Or stored directly in GDAT, at the cost of increasing the file size by
at most 7% (if I have done my math correctly).

See also the issue with clamping offsets (GENERATION_NUMBER_V2_OFFSET_MAX).

>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

And this is being done in this patch series.  Good!

The thing I was worrying about turned out to be non-issue, as the
comparison function in question is used only when writing, and in this
case we have corrected commit date computer - though perhaps not being
written (as far as I understand it, but I might be mistaken).

[...]
>>
>> Abhishek Kumar (11):
>>   commit-graph: fix regression when computing bloom filter

No problems, maybe just expanding a commit message and/or adding a comment.

>>   revision: parse parent in indegree_walk_step()

Looks good to me.

>>   commit-graph: consolidate fill_commit_graph_info

I think this commit could be split into three:
- fix to the 'generate tar with future mtime' test
  as it is a hidden bug (when using commit-graph)
- simplify implementation of fill_commit_in_graph()
  by using fill_commit_graph_info()
- move loading date into fill_commit_graph_info()
  that uncovers the issue with 'generate tar with future mtime'

On the other hand because they are inter-related, those changes might be
kept in a single commit.

In commit message greater care needs to be taken with
fill_commit_graph_info() and fill_commit_in_graph(), when to use one and
when to use the other. For example it is fill_commit_graph_info() that
changes its behavior, and it is fill_commit_in_graph() that is getting
simplified.

>>   commit-graph: consolidate compare_commits_by_gen

Looks good to me, though it might be good idea to add comments about the
sorting order (inside comment) to appropriate header files.

>>   commit-graph: return 64-bit generation number

This conversion misses one place, though it would be changed to use
topological levels slab in next patch.

This patch also unnecessary introduces GENERATION_NUMBER_V1_INFINITY.
There is no need for it: `generation` field can always simply use
GENERATION_NUMBER_INFINITY for commits not in commit-graph.

>>   commit-graph: add a slab to store topological levels

This is the patch that needs GENERATION_NUMBER_V1_INFINITY (or
TOPOLOGICAL_LEVEL_INFINITY), not the previous patch.

Detailed analysis uncovered hidden bug in the code of
compute_generation_numbers() that works only because of historical
reasons (that topological levels in Git start from 1, not from 0). The
problem is that the 'level' / 'generation' variable for commits not in
graph, and therefore ones that needs to have its generation number
computed, is 0 (default value on commit slab) and not
GENERATION_NUMBER*_INFINITY as it should.

This issue is present since moving commit graph info data to
commit-slab.  We can simply document it and ignore (it works, if by
accident), or try to fix it.

We need to handle GENERATION_NUMBER*_MAX clamping carefully
in the future patches.

I think also that this patch needs a bit more detailed commit message.

>>   commit-graph: implement corrected commit date

Looks good, though verify_commit_graph() now verifies *a* generation
number used, be it v1 (topological level) or v2 (corrected commit date),
so the variable rename is unnecessary.  We verify that they fulfill the
reachability condition promise, that is gen(parent) <= gen(commit),
(the '=' is to include GENERATION_NUMBER*_MAX case), as it is what is
used to speed up graph traversal.

We probably want to verify both topological level values in CDAT, and if
they exist also corrected commit date values in GDAT.  But that might be
left for the future commit.

>>   commit-graph: implement generation data chunk

To save up to 6-7% on commit-graph file size we store 32-bits corrected
commit date offsets, instead of storing 64-bits corrected commit date.

However, as far as I understand it, using non-monotonic values for
on-disk storage with limited field size, like 32-bits corrected
commit date offset, leads to problems with GENERATION_NUMBER*_MAX.
Namely, as I have written in detail in my reply to patch 11/11 in this
series, there is no way to fulfill the reachability condition promise if
we have to store offset which true value do not fit in 32-bits reserved
for it.

This is extremly unlikely to happen in practice, but we need to be able
to handle it somehow.  We can store 64-bit corrected commit date, which
has graph-monotonic values, and the problem goes away in theory and in
practice (we would never have values that do not fit).  We can keep
storing 32-bit offsets, and simply do not use GDAT chunk if there is
offset value that do not fit.

All this of course, provided that I am not wrong about this issue...

>>   commit-graph: use generation v2 only if entire chain does

The commit message do not say anything about the *writing* side.

However, if we want to keep the requirement that GDAT-less layers in the
split commit-graph chain must all be above any GDAT-containing layers,
we need to consider how we want layer merging to behave.  We could
either opt for using GDAT whenever possible, or similify code and skip
using GDAT if we are unsure.

The first approach would mean that if the topmost layer below set of
layers being rewritten (in the split commit-graph chain) exists, and it
does not contain GDAT chunk, then and only then the result of rewrite
should not have GDAT chunk either.

The second approach is, I think, simpler: if any of layers that is being
rewritten is GDAT-less (we need to check only the top layer, though),
and we are not running full rewrite ('--split=replace'), then the result
of rewrite should not have GDAT chunk either.  We can switch to the
first algorithm in later commit.

Whether we choose one or the other, we need test that doing layer
merging do not break GDAT-inness requirement stated above.

Also, we can probably test that we are not using v1 and v2 at the same
time with tests involving --is-ancestor, or --contains / --merged.

>>   commit-reach: use corrected commit dates in paint_down_to_common()

I think this commit could be split into two:
- disable commit graph for entire t6024-recursive-merge test
- use corrected commit dates in paint_down_to_common()

On the other hand because they are inter-related, those changes might be
better kept in a single commit.

It would be nice to have some benchmark data, or at least stating that
performance does not change (within the error bounds), using for example
'git merge-base v4.8 v4.9' in Linux kernel repository.

>>   doc: add corrected commit date info

Looks good, there are a few places where 'generation number' (referring
to the v1 version) should have been replaced with 'topological level'.

I am also unsure how the GDAT chunk can be "later modified to store
future generation number related data.".  I'd like to have an example,
or for this statement to be removed (if it turns out to not be true, not
without introducing yet another chunk type).

>>  18 files changed, 396 insertions(+), 185 deletions(-)

Thanks for all your work.

Best regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
  2020-08-23 15:27       ` Jakub Narębski
@ 2020-08-24  2:49         ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-24  2:49 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

Hello,

On Sun, Aug 23, 2020 at 05:27:46PM +0200, Jakub Narębski wrote:
> Hello,
> 
> Here is a summary of my comments and thoughts after carefully reviewing
> all patches in the series.
> 

Thanks for the detailed review. It must have taken you a lot of time and
focus to go through the patches, so I have really appreciate the effort.

I have been going over the comments and they are reasonable and very
much needed. I will respond to the mails in the specific threads (to
keep the discussion locally scoped and thus manageable), with doubts
and comments that I have as I try to follow through the suggestions.

> Best regards,
> --
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-21 13:14       ` Jakub Narębski
@ 2020-08-25  5:04         ` Abhishek Kumar
  2020-08-25 12:18           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-25  5:04 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee

On Fri, Aug 21, 2020 at 03:14:34PM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In a preparatory step, let's return timestamp_t values from
> > commit_graph_generation(), use timestamp_t for local variables
> 
> All right, this is all good.
> 
> > and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
> 
> This needs more detailed examination.  There are two similar constants,
> GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
> used for newest commits outside the commit-graph, while the latter is
> maximum number that commits in the commit-graph can have (because of the
> storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
> to be larger than GENERATION_NUMBER_MAX, and it is (and was).
> 
> The GENERATION_NUMBER_INFINITY is because of the above requirement
> traditionally taken as maximum value that can be represented in the data
> type used to store commit's generation number _in memory_, but it can be
> less.  For timestamp_t the maximum value that can be represented
> is (2 ^ 63 - 1).
> 
> All right then.
> 
> >

Related to this, by the end of this series we are using
GENERATION_NUMBER_MAX in just one place - compute_generation_numbers()
to make sure the topological levels fit within 30 bits.

Would it be more appropriate to rename GENERATION_NUMBER_MAX to
GENERATION_NUMBER_V1_MAX (along the lines of
GENERATION_NUMBER_V2_OFFSET_MAX)  to correctly describe that is a
limit on topological levels, rather than generation number value?

> 
> The commit message says nothing about the new symbolic constant
> GENERATION_NUMBER_V1_INFINITY, though.
> 
> I'm not sure it is even needed (see comments below).

Yes, you are correct. I tried it out with your suggestions and it wasn't
really needed.

Thanks for catching this!

> ...

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-21 18:43       ` Jakub Narębski
@ 2020-08-25  6:14         ` Abhishek Kumar
  2020-08-25  7:33           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-25  6:14 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > As we are writing topological levels to commit data chunk to ensure
> > backwards compatibility with "Old" Git and the member `generation` of
> > struct commit_graph_data will store corrected commit date in a later
> > commit, let's introduce a commit-slab to store topological levels while
> > writing commit-graph.
> 
> In my opinion the above it would be easier to follow if rephrased in the
> following way:
> 
>   In a later commit we will introduce corrected commit date as the
>   generation number v2.  This value will be stored in the new separate
>   GDAT chunk.  However to ensure backwards compatibility with "Old" Git
>   we need to continue to write generation number v1, which is
>   topological level, to the commit data chunk (CDAT).  This means that
>   we need to compute both versions of generation numbers when writing
>   the commit-graph file.  Let's therefore introduce a commit-slab
>   to store topological levels; corrected commit date will be stored
>   in the member `generation` of struct commit_graph_data.
> 
> What do you think?
> 

Yes, that's better.

> 
> By the way, do I understand it correctly that in backward-compatibility
> mode (that is, in mixed-version environment where at least some
> commit-graph files were written by "Old" Git and are lacking GDAT chunk
> and generation number v2 data) the `generation` member of commit graph
> data chunk will be populated and will store generation number v1, that
> is topological level? And that the commit-slab for topological levels is
> only there for writing and re-writing?
> 

No, the topo_levels commit-slab would be always populated when we write
a commit data chunk. The topo_level slab is a workaround for the fact
that we are trying to write two independent values (corrected commit
date offset, topological levels) but have one struct member to store them in
(data->generation).

If we are in a mixed-version environment, we could avoid initializing
the slab and fill the topological levels into data->generation instead,
but that's not how it is implemented right now.

> >
> > When Git creates a split commit-graph, it takes advantage of the
> > generation values that have been computed already and present in
> > existing commit-graph files.
> >
> > So, let's add a pointer to struct commit_graph to the topological level
> > commit-slab and populate it with topological levels while writing a
> > split commit-graph.
> 
> All right, looks sensible.

I have extend the last paragraph to include write_commit_graph_context
as well as:

  So, let's add a pointer to struct commit_graph as well as struct
  write_commit_graph_context to the topological level commit-slab and
  populate it with topological levels while writing a commit-graph file.

> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
> >  commit-graph.h |  1 +
> >  commit.h       |  1 +
> >  3 files changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7f9f858577..a2f15b2825 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
> >  /* Remember to update object flag allocation in object.h */
> >  #define REACHABLE       (1u<<15)
> >
> > +define_commit_slab(topo_level_slab, uint32_t);
> > +
> 
> All right.
> 
> Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
> think it would be necessary.
> 
> >  /* Keep track of the order in which commits are added to our list. */
> >  define_commit_slab(commit_pos, int);
> >  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> > @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +
> > +	if (g->topo_levels)
> > +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> 
> All right, here we store topological levels on commit-slab to avoid
> recomputing them.
> 
> Do I understand it correctly that the `topo_levels` member of the `struct
> commit_graph` would be non-null only when we are updating the
> commit-graph?
> 

Yes, that's correct.

> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -953,6 +958,7 @@ struct write_commit_graph_context {
> >  		 changed_paths:1,
> >  		 order_by_pack:1;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	const struct split_commit_graph_opts *split_opts;
> >  	size_t total_bloom_filter_data_size;
> >  	const struct bloom_filter_settings *bloom_settings;
> 
> Why do we need `topo_levels` member *both* in `struct commit_graph` and
> in `struct write_commit_graph_context`?
> 
> [After examining the change further I have realized why both are needed,
>  and written about the reasoning later in this email.]
> 
> 
> Note that the commit message talks only about `struct commit_graph`...
> 
> > @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
> >  		else
> >  			packedDate[0] = 0;
> >
> > -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> > +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
> 
> All right, here we prepare for writing to the CDAT chunk using data that
> is now stored on newly introduced topo_levels slab (either computed, or
> taken from commit-graph file being rewritten).
> 
> Assuming that ctx->topo_levels is not-null, and that the values are
> properly calculated before this -- and we did compute topological levels
> before writing the commit-graph.
> 
> >
> >  		packedDate[1] = htonl((*list)->date);
> >  		hashwrite(f, packedDate, 8);
> > @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> All right, so that is why this 'generation' variable was not converted
> to timestamp_t type.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > +		    level != GENERATION_NUMBER_ZERO)
> >  			continue;
> 
> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> outside commit-graph files, and therefore we would need its topological
> level computed.
> 
> However, I don't understand how it works.  We have had created the
> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> to provide default values for `struct commit_graph`... but only for
> `graph_pos` member.  It is commit_graph_generation() that returns
> GENERATION_NUMBER_INFINITY for commits not in graph.
> 
> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> handles this special case, so I don't see how 'generation' variable can
> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> 
> Does it work *accidentally*, because the default value for uninitialized
> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> of commit-graph feature development history, namely the short time where
> Git didn't use any generation numbers and stored 0 in the place set for
> it in the commit-graph format...  On the other hand this is not the case
> for corrected commit date (generation number v2), as it could
> "legitimately" be 0 if some root commit (without any parents) had
> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> caused by malformed but valid commit object.
> 
> Ugh...

It works accidentally.

Our decision to avoid the cost of initializing both
commit_graph_data->generation and commit_graph_data->graph_pos has
led to some unwieldy choices - the complexity of helper functions,
bypassing helper functions when writing a commit-graph file [1].

I want to re-visit how commit_graph_data slab is defined in a future series.

[1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

> 
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > +				    level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else if (level > max_level) {
> > +					max_level = level;
> >  				}
> >  			}
> 
> This is the same case as for previous chunk; see the comment above.
> 
> This code checks if parents have generation number / topological level
> computed, and tracks maximum value of it among all parents.
> 
> >
> >  			if (all_parents_computed) {
> > -				struct commit_graph_data *data = commit_graph_data_at(current);
> > -
> > -				data->generation = max_generation + 1;
> >  				pop_commit(&list);
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> 
> OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
> this value can be maximum value that can be safely stored in a given
> data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
> value that can be safely stored in uint32_t, so generation+1 had no
> chance to overflow.  This is no longer the case; the reorganization done
> here leads to more defensive code (safer).
> 
> All good.  However I think that we should clamp the value of topological
> level to the maximum value that can be safely stored *on disk*, in the
> 30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
> the code to write topological level would get more complicated.
> 
> In my opinion the symbolic constant used here should be named
> GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
> it should be the current value of GENERATION_NUMBER_MAX, that is
> 0x3FFFFFFF.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> 
> All right, we will be using topo_level slab for writing the
> commit-graph, and only for this purpose, so it is good to put it here.
> 
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
> >  		}
> >  	}
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> > +	if (ctx->r->objects->commit_graph) {
> > +		struct commit_graph *g = ctx->r->objects->commit_graph;
> > +
> > +		while (g) {
> > +			g->topo_levels = &topo_levels;
> > +			g = g->base_graph;
> > +		}
> > +	}
> 
> All right, now I see why we need `topo_levels` member both in the
> `struct write_commit_graph_context` and in `struct commit_graph`.
> The former is for functions that write the commit-graph, the latter for
> fill_commit_graph_info() functions that is deep in the callstack, but it
> needs to know whether to load topological level to commit-slab, or maybe
> put it as generation number (and in the future -- discard it, if not
> needed).
> 
> 
> Sidenote: this fragment of code, that fills with a given value some
> member of the `struct commit_graph` throughout the split commit-graph
> chain, will be repeated as similar code in patches later in series.
> However without resorting to preprocessor macros I have no idea how to
> generalize it to avoid code duplication (well, almost).
> 

The pattern is: iterate over the commit-graph chain and assign a member
(here, topo_level and in the other patch, read_generation_data) a value
(the address of topo_level slab, 1/0 depending on whether it is a mixed
generation chain).

We could generalize this in a future series but I don't think it is
worthwhile.

> > +
> >  	if (pack_indexes) {
> >  		ctx->order_by_pack = 1;
> >  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 430bc830bb..1152a9642e 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -72,6 +72,7 @@ struct commit_graph {
> >  	const unsigned char *chunk_bloom_indexes;
> >  	const unsigned char *chunk_bloom_data;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	struct bloom_filter_settings *bloom_filter_settings;
> >  };
> 
> All right: `struct commit_graph` is public, `struct
> write_commit_graph_context` is not.
> 
> >
> > diff --git a/commit.h b/commit.h
> > index bc0732a4fe..bb846e0025 100644
> > --- a/commit.h
> > +++ b/commit.h
> > @@ -15,6 +15,7 @@
> >  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
> >  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
> 
> The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
> GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.
> 
> >  #define GENERATION_NUMBER_ZERO 0
> > +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
> 
> This value is never used, so why it is defined in this commit.
> 

Moved down to the patch actually uses it.

> >
> >  struct commit_list {
> >  	struct commit *item;
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-22  0:05       ` Jakub Narębski
@ 2020-08-25  6:49         ` Abhishek Kumar
  2020-08-25 10:07           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-25  6:49 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With most of preparations done, let's implement corrected commit date.
> >
> > The corrected commit date for a commit is defined as:
> >
> > * A commit with no parents (a root commit) has corrected commit date
> >   equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> >   the maximum of its commit date and one more than the largest corrected
> >   commit date among its parents.
> 
> Good.
> 
> >
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> 
> Perhaps we should add more details about data type sizes in question.

Will add.

> 
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t).  However corrected commit date
> offsets can be safely stored^* using only 32 bits.  This halves the size
> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
> header, fanout table (OIDF) and extra edges list (EDGE).
> 
> Which might mean that the extra complication is not worth it, and we
> should store corrected commit date directly instead.
> 
> *) unless for example one of commits is malformed but valid,
>    and has committerdate of 0 Unix time, 1 January 1970.
> 
> >
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> 
> OK, so the agenda for introducing geeration number v2 is as follows:
> - compute generation numbers v2, i.e. corrected commit date
> - store corrected commit date [offsets] in new GDAT chunk,
>   unless backward-compatibility concerns require us to not to
> - load [and compute] corrected commit date from commit-graph
>   storing it as 'generation' field of `struct commit_graph_data`,
>   unless backward-compatibility concerns require us to store
>   topological levels (generation number v1) in there instead
> 

The last point is not correct. We always store topological levels into
the topo_levels slab introduced and always store corrected commit date
into data->generation, regardless of backward compatibility concerns.

We could avoid initializing topo_slab if we are not writing generation
data chunk (and thus don't need corrected commit dates) but that
wouldn't have an impact on run time while writing commit-graph because
computing corrected commit dates is cheap as the main cost is in walking
the graph and writing the file.

> Because the reachability condition for corrected commit date and for
> topological level is exactly the same, we don't need to do anything to
> take advantage of generation number v2.
> 
> Though we can use generation number v2 in more cases, where we turned
> off use of generation numbers because v1 gave worse performance than
> date heuristics.
> 
> Did I got this right?
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index a2f15b2825..fd69534dd5 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> At first I was wondering why this tie-breaking is beig removed; wouldn't
> be needed for backward-compatibility?  But then I remembered that this
> comparison function is used _only_ for sorting commits when writing
> Bloom filters, for `git commit-graph write --reachable --changed-paths ...`
> 
> Assuming that when writing the commit graph we always compute geeration
> number v2 and 'generation' field stores corrected commit date, we don't
> need to use date as a heuristic when generations are equal, and it would
> not help in tie-breaking anyway.
> 
> All right.
> 
> >  	return 0;
> >  }
> >
> > @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> 
> All right, so the pattern is to add 'corrected_commit_date' stuff after
> 'topological_level' stuff.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> > +		    )
> >  			continue;
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >
> >  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> >  			}
> >
> > @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_MAX - 1)
> >  					max_level = GENERATION_NUMBER_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> >  			}
> >  		}
> >  	}
> 
> All right.  Looks good to me.
> 
> > @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  	for (i = 0; i < g->num_commits; i++) {
> >  		struct commit *graph_commit, *odb_commit;
> >  		struct commit_list *graph_parents, *odb_parents;
> > -		timestamp_t max_generation = 0;
> > -		timestamp_t generation;
> > +		timestamp_t max_corrected_commit_date = 0;
> > +		timestamp_t corrected_commit_date;
> 
> This is simple, and perhaps unnecessary, rename of variables.
> Shouldn't we however verify *both* topological level, and
> (if exists) corrected commit date?

The problem with verifying both topological level and corrected commit
dates is that we would have to re-fill commit_graph_data slab with commit
data chunk as we cannot modify data->generation otherwise, essentially
repeating the whole verification process.

While it's okay for now, I might take this up in a future series [1].

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

> 
> >
> >  		display_progress(progress, i + 1);
> >  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> > @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  					     oid_to_hex(&graph_parents->item->object.oid),
> >  					     oid_to_hex(&odb_parents->item->object.oid));
> >
> > -			generation = commit_graph_generation(graph_parents->item);
> > -			if (generation > max_generation)
> > -				max_generation = generation;
> > +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> > +			if (corrected_commit_date > max_corrected_commit_date)
> > +				max_corrected_commit_date = corrected_commit_date;
> 
> Actually, commit_graph_generation(<commit>) can return either corrected
> commit date, or topological level, the latter in backward-compatibility
> case (if at least one commit-graph file is lacking GDAT chunk, because
> [some of] it was created by the "Old" Git).
> 
> >
> >  			graph_parents = graph_parents->next;
> >  			odb_parents = odb_parents->next;
> > @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_MAX)
> > -			max_generation--;
> 
> All right, this was needed for checking the correctness of topological
> levels (generation number v1) because we were checking not that it
> fullfills the reachability condition, but more strict one: namely that
> topological level of commit is equal to maximum of topological levels of
> its parents plus one.
> 
> The comment about checking both generation number v1 and v2 still
> applies.
> 
> > -
> > -		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		corrected_commit_date = commit_graph_generation(graph_commit);
> > +		if (corrected_commit_date < max_corrected_commit_date + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> >  				     oid_to_hex(&cur_oid),
> > -				     generation,
> > -				     max_generation + 1);
> > +				     corrected_commit_date,
> > +				     max_corrected_commit_date + 1);
> 
> All right, we check less strict condition for corrected commit date.
> 
> >
> >  		if (graph_commit->date != odb_commit->date)
> >  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  6:14         ` Abhishek Kumar
@ 2020-08-25  7:33           ` Jakub Narębski
  2020-08-25  7:56             ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-25  7:33 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  					_("Computing commit graph generation numbers"),
>>>  					ctx->commits.nr);
>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
>>> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>
>> All right, so that is why this 'generation' variable was not converted
>> to timestamp_t type.
>>
>>>
>>>  		display_progress(ctx->progress, i + 1);
>>> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
>>> -		    generation != GENERATION_NUMBER_ZERO)
>>> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
>>> +		    level != GENERATION_NUMBER_ZERO)
>>>  			continue;
>>
>> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
>> outside commit-graph files, and therefore we would need its topological
>> level computed.
>>
>> However, I don't understand how it works.  We have had created the
>> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
>> to provide default values for `struct commit_graph`... but only for
>> `graph_pos` member.  It is commit_graph_generation() that returns
>> GENERATION_NUMBER_INFINITY for commits not in graph.
>>
>> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
>> handles this special case, so I don't see how 'generation' variable can
>> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
>> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
>>
>> Does it work *accidentally*, because the default value for uninitialized
>> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
>> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
>> of commit-graph feature development history, namely the short time where
>> Git didn't use any generation numbers and stored 0 in the place set for
>> it in the commit-graph format...  On the other hand this is not the case
>> for corrected commit date (generation number v2), as it could
>> "legitimately" be 0 if some root commit (without any parents) had
>> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
>> caused by malformed but valid commit object.
>>
>> Ugh...
>
> It works accidentally.
>
> Our decision to avoid the cost of initializing both
> commit_graph_data->generation and commit_graph_data->graph_pos has
> led to some unwieldy choices - the complexity of helper functions,
> bypassing helper functions when writing a commit-graph file [1].
>
> I want to re-visit how commit_graph_data slab is defined in a future series.
>
> [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

All right, we might want to make use of the fact that the value of 0 for
topological level here always mean that its value for a commit needs to
be computed, that 0 is not a valid value for topological levels.
- if the value 0 came from commit-graph file, it means that it came
  from Git version that used commit-graph but didn't compute generation
  numbers; the value is GENERATION_NUMBER_ZERO
- the value 0 might came from the fact that commit is not in graph,
  and that commit-slab zero-initializes the values stored; let's
  call this value GENERATION_NUMBER_UNINITIALIZED

If we ensure that corrected commit date can never be zero (which is
extremely unlikely, as one of root commits would have to be malformed or
written on badly misconfigured computer, with value of 0 for committer
timestamp), then this "happy accident" can keep working.

  As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
  has corrected commit date of one, to be able to distinguish
  uninitialized values.

Or something like that.

Actually, it is not even necessary, as corrected commit date of 0 just
means that this single value (well, for every root commit with commit
date of 0) would be unnecessary recomputed in compute_generation_numbers().

Anyway, we would want to document this fact in the commit message.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  7:33           ` Jakub Narębski
@ 2020-08-25  7:56             ` Jakub Narębski
  2020-09-01 10:26               ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-25  7:56 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> > On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>
> >>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> [...]
> >>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >>>                                     _("Computing commit graph generation numbers"),
> >>>                                     ctx->commits.nr);
> >>>     for (i = 0; i < ctx->commits.nr; i++) {
> >>> -           uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> >>> +           uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >>
> >> All right, so that is why this 'generation' variable was not converted
> >> to timestamp_t type.
> >>
> >>>
> >>>             display_progress(ctx->progress, i + 1);
> >>> -           if (generation != GENERATION_NUMBER_V1_INFINITY &&
> >>> -               generation != GENERATION_NUMBER_ZERO)
> >>> +           if (level != GENERATION_NUMBER_V1_INFINITY &&
> >>> +               level != GENERATION_NUMBER_ZERO)
> >>>                     continue;
> >>
> >> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> >> outside commit-graph files, and therefore we would need its topological
> >> level computed.
> >>
> >> However, I don't understand how it works.  We have had created the
> >> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> >> to provide default values for `struct commit_graph`... but only for
> >> `graph_pos` member.  It is commit_graph_generation() that returns
> >> GENERATION_NUMBER_INFINITY for commits not in graph.
> >>
> >> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> >> handles this special case, so I don't see how 'generation' variable can
> >> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> >> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> >>
> >> Does it work *accidentally*, because the default value for uninitialized
> >> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> >> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> >> of commit-graph feature development history, namely the short time where
> >> Git didn't use any generation numbers and stored 0 in the place set for
> >> it in the commit-graph format...  On the other hand this is not the case
> >> for corrected commit date (generation number v2), as it could
> >> "legitimately" be 0 if some root commit (without any parents) had
> >> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> >> caused by malformed but valid commit object.
> >>
> >> Ugh...
> >
> > It works accidentally.
> >
> > Our decision to avoid the cost of initializing both
> > commit_graph_data->generation and commit_graph_data->graph_pos has
> > led to some unwieldy choices - the complexity of helper functions,
> > bypassing helper functions when writing a commit-graph file [1].
> >
> > I want to re-visit how commit_graph_data slab is defined in a future series.
> >
> > [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/
>
> All right, we might want to make use of the fact that the value of 0 for
> topological level here always mean that its value for a commit needs to
> be computed, that 0 is not a valid value for topological levels.
> - if the value 0 came from commit-graph file, it means that it came
>   from Git version that used commit-graph but didn't compute generation
>   numbers; the value is GENERATION_NUMBER_ZERO
> - the value 0 might came from the fact that commit is not in graph,
>   and that commit-slab zero-initializes the values stored; let's
>   call this value GENERATION_NUMBER_UNINITIALIZED
>
> If we ensure that corrected commit date can never be zero (which is
> extremely unlikely, as one of root commits would have to be malformed or
> written on badly misconfigured computer, with value of 0 for committer
> timestamp), then this "happy accident" can keep working.
>
>   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
>   has corrected commit date of one, to be able to distinguish
>   uninitialized values.
>
> Or something like that.
>
> Actually, it is not even necessary, as corrected commit date of 0 just
> means that this single value (well, for every root commit with commit
> date of 0) would be unnecessary recomputed in compute_generation_numbers().
>
> Anyway, we would want to document this fact in the commit message.

Alternatively, instead of comparing 'level' (and later in series also
'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
we could load at no extra cost `graph_pos` value and compare it
against COMMIT_NOT_FROM_GRAPH.

But with this solution we could never get rid of graph_pos, if we
think it is unnecessary. If we split commit_graph_data into separate
slabs (as it was in early versions of respective patch series), we
would have to pay additional cost.

But it is an alternative.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-25  6:49         ` Abhishek Kumar
@ 2020-08-25 10:07           ` Jakub Narębski
  2020-09-01 11:01             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-25 10:07 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> To minimize the space required to store corrected commit date, Git
>>> stores corrected commit date offsets into the commit-graph file. The
>>> corrected commit date offset for a commit is defined as the difference
>>> between its corrected commit date and actual commit date.
>>
>> Perhaps we should add more details about data type sizes in question.
>
> Will add.

Note however that we need to solve the problem of storing values which
are not monotonic wrt. parent relation (partial order) in limited disk
space, that is GENERATION_NUMBER_V2_OFFSET_MAX vs GENERATION_NUMBER_MAX;
see comments in 11/11 and 00/11.

>>
>> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
>> in most cases is 64 bits (uintmax_t).  However corrected commit date
>> offsets can be safely stored^* using only 32 bits.  This halves the size
>> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
>> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
>> header, fanout table (OIDF) and extra edges list (EDGE).
>>
>> Which might mean that the extra complication is not worth it, and we
>> should store corrected commit date directly instead.
>>
>> *) unless for example one of commits is malformed but valid,
>>    and has committerdate of 0 Unix time, 1 January 1970.

See above.

>>> While Git does not write out offsets at this stage, Git stores the
>>> corrected commit dates in member generation of struct commit_graph_data.
>>> It will begin writing commit date offsets with the introduction of
>>> generation data chunk.
>>
>> OK, so the agenda for introducing geeration number v2 is as follows:
>> - compute generation numbers v2, i.e. corrected commit date
>> - store corrected commit date [offsets] in new GDAT chunk,
>>   unless backward-compatibility concerns require us to not to
>> - load [and compute] corrected commit date from commit-graph
>>   storing it as 'generation' field of `struct commit_graph_data`,
>>   unless backward-compatibility concerns require us to store
>>   topological levels (generation number v1) in there instead
>>
>
> The last point is not correct. We always store topological levels into
> the topo_levels slab introduced and always store corrected commit date
> into data->generation, regardless of backward compatibility concerns.

I think I was not clear enough (in trying to be brief).  I meant here
loading available generation numbers for use in graph traversal,
done in later patches in this series.

In _next_ commit we store topological levels in `generation` field:

  @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
   	date_low = get_be32(commit_data + g->hash_len + 12);
   	item->date = (timestamp_t)((date_high << 32) | date_low);

  -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
  +	if (g->chunk_generation_data)
  +		graph_data->generation = item->date +
  +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
  +	else
  +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;


We use topo_level slab only when writing the commit-graph file.

> We could avoid initializing topo_slab if we are not writing generation
> data chunk (and thus don't need corrected commit dates) but that
> wouldn't have an impact on run time while writing commit-graph because
> computing corrected commit dates is cheap as the main cost is in walking
> the graph and writing the file.

Right.

Though you need to add the cost of allocation and managing extra
commit slab, I think that amortized cost is negligible.

But what would be better is showing benchmark data: does writing the
commit graph without GDAT take not insigificant more time than without
this patch?

[...]
>>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  	for (i = 0; i < g->num_commits; i++) {
>>>  		struct commit *graph_commit, *odb_commit;
>>>  		struct commit_list *graph_parents, *odb_parents;
>>> -		timestamp_t max_generation = 0;
>>> -		timestamp_t generation;
>>> +		timestamp_t max_corrected_commit_date = 0;
>>> +		timestamp_t corrected_commit_date;
>>
>> This is simple, and perhaps unnecessary, rename of variables.
>> Shouldn't we however verify *both* topological level, and
>> (if exists) corrected commit date?
>
> The problem with verifying both topological level and corrected commit
> dates is that we would have to re-fill commit_graph_data slab with commit
> data chunk as we cannot modify data->generation otherwise, essentially
> repeating the whole verification process.
>
> While it's okay for now, I might take this up in a future series [1].
>
> [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

All right, I believe you that verifying both topological level and
corrected commit date would be more difficult.

That doesn't change the conclusion that this variable should remain to
be named `generation`, as when verifying GDAT-less commit-graph files it
would check topological levels (it uses commit_graph_generation(), which
in turn uses `generation` field in commit graph info, which as I have
show above in later patch could be v1 or v2 generation number).

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-21  4:11         ` Abhishek Kumar
@ 2020-08-25 11:11           ` Jakub Narębski
  2020-09-01 11:35             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-25 11:11 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number (within CDAT chunk) and has undefined behavior.
>>>
>>> The test used to pass
>>
>> Could you please tell us how does this test starts to fail without the
>> change to the test described there?  What is the error message, etc.?
>>
>
> Here's what I revised the commit message to:
>
> commit-graph: consolidate fill_commit_graph_info
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().

All right.

We might want to add here the information that we also move loading the
commit date from the commit-graph file from fill_commit_in_graph() down
the [new] call chain into fill_commit_graph_info().  The commit date
would be needed in fill_commit_graph_info() in the next commit to
compute corrected commit date out of corrected commit date offset, and
store it as generation number.


NOTE that this means that if we switch to storing 64-bit corrected
commit date directly in the commit-graph file, instead of storing 32-bit
offsets, neither this Move Statement Into Function Out of Caller
refactoring nor change to the 'generate tar with future mtime' test
would be necessary.

>
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
> 34-bits for storing commiter date, thus committer time overflows into
> generation number (within CDAT chunk) and has undefined behavior.
>
> The test used to pass as fill_commit_graph_info() would not set struct
> member `date` of struct commit and loads committer date from the object
> database, generating a tar file with the expected mtime.

I guess that in the case of generating a tar file we would read the
commit out of 'object database', and then only add commit-graph specific
info with fill_commit_graph_info().  Possibly because we need more
information that commit-graph provides for a commit.

>
> However, with corrected commit date, we will load the committer date
> from CDAT chunk (truncated to lower 34-bits) to populate the generation
> number. Thus, fill_commit_graph_info() sets date and generates tar file
> with the truncated mtime and the test fails.
>
> Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
> will not be truncated.

Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH
was used.

The commit that introduced the 'generate tar with future mtime' test,
namely e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016), says:

	The ustar format only has room for 11 (or 12, depending on
	some implementations) octal digits for the size and mtime of
	each file. For values larger than this, we have to add pax
	extended headers to specify the real data, and git does not
	yet know how to do so.

	Before fixing that, let's start off with some test
	infrastructure [...]

The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12.
So we need the value of (2 ^ 36 + 1) for this test do do its job.
Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough
(if we skip testing "some implementations").

So I think to make this test more clear (for inquisitive minds) we
should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds
since EPOCH.  Maybe even add a variant of this test that uses the
origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off
use of serialized commit-graph.

I'm sorry for not checking this earlier.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-25  5:04         ` Abhishek Kumar
@ 2020-08-25 12:18           ` Jakub Narębski
  2020-09-01 12:06             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-08-25 12:18 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: 85d03km98l.fsf, git, stolee

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Aug 21, 2020 at 03:14:34PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> In a preparatory step, let's return timestamp_t values from
>>> commit_graph_generation(), use timestamp_t for local variables
>>
>> All right, this is all good.
>>
>>> and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
>>
>> This needs more detailed examination.  There are two similar constants,
>> GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
>> used for newest commits outside the commit-graph, while the latter is
>> maximum number that commits in the commit-graph can have (because of the
>> storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
>> to be larger than GENERATION_NUMBER_MAX, and it is (and was).
>>
>> The GENERATION_NUMBER_INFINITY is because of the above requirement
>> traditionally taken as maximum value that can be represented in the data
>> type used to store commit's generation number _in memory_, but it can be
>> less.  For timestamp_t the maximum value that can be represented
>> is (2 ^ 63 - 1).
>>
>> All right then.
>
> Related to this, by the end of this series we are using
> GENERATION_NUMBER_MAX in just one place - compute_generation_numbers()
> to make sure the topological levels fit within 30 bits.
>
> Would it be more appropriate to rename GENERATION_NUMBER_MAX to
> GENERATION_NUMBER_V1_MAX (along the lines of
> GENERATION_NUMBER_V2_OFFSET_MAX)  to correctly describe that is a
> limit on topological levels, rather than generation number value?

Yes, I think that at the end of this patch series we should be using
GENERATION_NUMBER_V1_MAX and GENERATION_NUMBER_V2_OFFSET_MAX to describe
storage limits, and GENERATION_NUMBER_INFINITY (the latter as generation
number value for commits not in graph).

We need to ensure that both GENERATION_NUMBER_V1_MAX and
GENERATION_NUMBER_V2_OFFSET_MAX are smaller than
GENERATION_NUMBER_INFINITY.

However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
difficult.  As far as I can see, we can choose one of the *three*
solutions (the third one is _new_):

a. store 64-bit corrected commit date in the GDAT chunk
   all possible values are able to be stored, no need for
   GENERATION_NUMBER_V2_MAX,

b. store 32-bit corrected commit date offset in the GDAT chunk,
   if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
   do not write GDAT chunk at all (like for backward compatibility
   with mixed-version chains of split commit-graph layers),

c. store 32-bit corrected commit date offset in the GDAT chunk,
   using some kind of overflow handling scheme; for example if
   the most significant bit of 32-bit value is 1, then the
   rest 31-bits are position in GDOV chunk, which uses 64-bit
   to store those corrected commit date offsets that do not
   fit in 32 bits.

This type of schema is used in other places in Git code, if I remember
it correctly.

>> The commit message says nothing about the new symbolic constant
>> GENERATION_NUMBER_V1_INFINITY, though.
>>
>> I'm not sure it is even needed (see comments below).
>
> Yes, you are correct. I tried it out with your suggestions and it wasn't
> really needed.
>
> Thanks for catching this!

Mistakes can happen when changig how the series is split into commits.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-22 17:14       ` Jakub Narębski
@ 2020-08-26  7:15         ` Abhishek Kumar
  2020-08-26 10:38           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-26  7:15 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
> Hi Abhishek,
> 
> ... 
> 
> However the commit message do not say anything about the *writing* side.
> 

Revised the commit message to include the following at the end:

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite does not
have GDAT chunks either.

> 
> ...
> 
> To be more detailed, without '--split=replace' we would want the following
> layer merging behavior:
> 
>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>            1              2           3             4            5
> 
> In the split commit-graph chain above, merging two topmost layers
> (layers 4 and 5) should create a layer without GDAT; merging three
> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
> 4) should create a new layer with GDAT.
> 
>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>            1              2           3               merged
> 
>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>            1              2                    merged
> 
> I hope those ASCII-art pictures help understanding it
> 

Thanks! There were helpful.

While we work as expected in the first scenario i.e merging 4 and 5, we
would *still* write a layer without GDAT in the second scenario.

I have tweaked split_graph_merge_strategy() to fix this:

----------------------------------------------

diff --git a/commit-graph.c b/commit-graph.c
index 6d54d9a286..246fad030d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)

----------------------------------------------------

That is, if we were not writing generation data (because of mixed
generation concerns) but the new topmost layer has a generation data
chunk, we have merged all layers without GDAT chunk and can now write a
GDAT chunk safely.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> Wouldn't breaking the reachability condition promise make some Git
> commands to return *incorrect* results if they short-circuit, stop
> walking if generation number shows that A cannot reach B?
> 
> I am talking here about commands that return boolean, or select subset
> from given set of revisions:
> - git merge-base --is-ancestor <B> <A>
> - git branch branch-A <A> && git branch --contains <B>
> - git branch branch-B <B> && git branch --merged <A>
> 
> Git assumes that generation numbers fulfill the following condition:
> 
>   if A can reach B, then gen(A) > gen(B)
> 
> Notably this includes commits not in commit-graph, and clamped values.
> 
> However, in the following case
> 
> * if commit A is from higher layer without GDAT
>   and uses topological levels for 'generation', e.g. 115 (in a small repo)
> * and commit B is from lower layer with GDAT
>   and uses corrected commit date as 'generation', for example 1598112896,
> 
> it may happen that A (later commit) can reach B (earlier commit), but
> gen(B) > gen(A).  The reachability condition promise for generation
> numbers is broken.
> 
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> 
> I have reordered files in the patch itself to make it easier to review
> the proposed changes.
> 
> >  commit-graph.h                |  1 +
> >  commit-graph.c                | 32 +++++++++++++++-
> >  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/commit-graph.h b/commit-graph.h
> > index f78c892fc0..3cf89d895d 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -63,6 +63,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >
> >  	uint32_t num_commits_in_base;
> > +	uint32_t read_generation_data;
> >  	struct commit_graph *base_graph;
> >
> 
> First, why `read_generation_data` is of uint32_t type, when it stores
> (as far as I understand it), a "boolean" value of either 0 or 1?
> 

Yes, using unsigned int instead of uint32_t (although in most of cases
it would be same).  If commit_graph had other flags as well, we could
have used a bit field.

> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
> that interfere with the case of rewriting, where we want to use existing
> GDAT data when writing new commit-graph with GDAT chunk?

It interferes with rewriting the split commit-graph, as you might have
guessed from the above code snippet.

> 
> ...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> 
> >  		graph_data->generation = item->date +
> >  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> >  	else
> > @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
> >  	uint32_t pos;
> >  	if (!prepare_commit_graph(r))
> >  		return;
> > +
> >  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> >  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> >  }
> 
> This is unrelated whitespace fix, a "while at it" in neighbourhood of
> changes.  All right then.
> 

Reverted this change, as it's unimportant.

> > @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
> ...
> 
> It would be nice to have an example with merging layers (whether we
> would handle it in strict or relaxed way).
> 

Sure, will add.

> > +
> >  test_done
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does
  2020-08-26  7:15         ` Abhishek Kumar
@ 2020-08-26 10:38           ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-26 10:38 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hi Abhishek,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
>> Hi Abhishek,
>>
>> ...
>>
>> However the commit message do not say anything about the *writing* side.
>>
>
> Revised the commit message to include the following at the end:
>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>

All right.

> Rewriting layers follows similar approach: if the topmost layer below
> set of layers being rewritten (in the split commit-graph chain) exists,
> and it does not contain GDAT chunk, then the result of rewrite does not
> have GDAT chunks either.

All right.

I see that you went with proposed more complex (but better) solution...

>>
>> ...
>>
>> To be more detailed, without '--split=replace' we would want the following
>> layer merging behavior:
>>
>>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>>            1              2           3             4            5
>>
>> In the split commit-graph chain above, merging two topmost layers
>> (layers 4 and 5) should create a layer without GDAT; merging three
>> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
>> 4) should create a new layer with GDAT.

A simpler solution would be to create a new merged layer without GDAT if
any of the layers being merged do not have GDAT.

In this solution merging 3+4+5, 3+4, and even 2+3 would result with
layer without GDAT, and only merging 1+2 would result in layer with GDAT.

>>
>>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>>            1              2           3               merged
>>
>>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>>            1              2                    merged
>>
>> I hope those ASCII-art pictures help understanding it
>>
>
> Thanks! There were helpful.
>
> While we work as expected in the first scenario i.e merging 4 and 5, we
> would *still* write a layer without GDAT in the second scenario.
>
> I have tweaked split_graph_merge_strategy() to fix this:
>
> ----------------------------------------------
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 6d54d9a286..246fad030d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +
>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)

...which turned out to be not that complicated.  Nice work!

Though this needs tests that if fulfills the stated condition (because I
am not sure if it is entirely correct: we are not checking the layer
below current one, isn't it?... ah, you explain it below).

One possible solution would be to grep `test-tool read-graph` output for
"^chunks: ", then pass it through `uniq` (without `sort`!), check that
the number of lines is less or equal 2, and if there are two lines then
check that we get the following contents:

  chunks: oid_fanout oid_lookup commit_metadata generation_data
  chunks: oid_fanout oid_lookup commit_metadata

(assuming that information about layers is added in top-down order).

This test must be run with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0, which
I think is the default.

> ----------------------------------------------------
>
> That is, if we were not writing generation data (because of mixed
> generation concerns) but the new topmost layer has a generation data
> chunk, we have merged all layers without GDAT chunk and can now write a
> GDAT chunk safely.

All right.

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index f78c892fc0..3cf89d895d 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -63,6 +63,7 @@ struct commit_graph {
>>>  	struct object_directory *odb;
>>>
>>>  	uint32_t num_commits_in_base;
>>> +	uint32_t read_generation_data;
>>>  	struct commit_graph *base_graph;
>>>
>>
>> First, why `read_generation_data` is of uint32_t type, when it stores
>> (as far as I understand it), a "boolean" value of either 0 or 1?
>
> Yes, using unsigned int instead of uint32_t (although in most of cases
> it would be same).  If commit_graph had other flags as well, we could
> have used a bit field.

OK.

>> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
>> that interfere with the case of rewriting, where we want to use existing
>> GDAT data when writing new commit-graph with GDAT chunk?
>
> It interferes with rewriting the split commit-graph, as you might have
> guessed from the above code snippet.

All right.

[...]
>>> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>>>  	uint32_t pos;
>>>  	if (!prepare_commit_graph(r))
>>>  		return;
>>> +
>>>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>>>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>>>  }
>>
>> This is unrelated whitespace fix, a "while at it" in neighbourhood of
>> changes.  All right then.
>>
>
> Reverted this change, as it's unimportant.

Actually I am not against fixing the whitespace in the neighbourhood of
changes, so you can keep it or revert it (discard).

>>> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>>
>> ...
>>
>> It would be nice to have an example with merging layers (whether we
>> would handle it in strict or relaxed way).
>>
>
> Sure, will add.

Thanks.


Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-22 22:20       ` Jakub Narębski
@ 2020-08-27  6:39         ` Abhishek Kumar
  2020-08-27 12:43           ` Jakub Narębski
  2020-08-27 13:15           ` Derrick Stolee
  0 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-08-27  6:39 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee

On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> All right.
> 
> > ---
> >  .../technical/commit-graph-format.txt         | 12 ++---
> >  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
> >  2 files changed, 36 insertions(+), 21 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..71c43884ec 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> 
> All right, that was duplicated information.  Now that we need to talk
> about two of them, it would not make sense to duplicate that.
> 
> >
> >  - The root tree OID.
> >
> > @@ -88,6 +84,12 @@ CHUNK DATA:
> 
> Shouldn't we also replace 'generation number' occurences in description
> of the Commit Data (CDAT) chunk with either 'topological level' or
> 'generation number v1'?

Yes, we should.

> 
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> 
> It is not exactly 'optional', as it implies that we need to turn it on
> (or that we can turn it off).  It is more 'conditional', as it can be
> not present due to outside influences (mixed-version environment).
> 
> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> 
> I have just realized purely theoretical, but possible, problem with
> storing non-monotinic generation number related values like corrected
> commit date offset in constrained space.  There are problems with
> clamping them.
> 
> Say that somewhere in the ancestry chain there is a commit A with commit
> date far in the future by mistake, for example 2120-08-22; it is
> important for that date to be not able to be represented using uint32_t.
> Say that a later descendant commit B is malformed, and has committer
> date of 0, that is 1970-01-01. This means that the corrected commit date
> for B must be larger than 2120-08-22 - which for this commit means that
> corrected commit date offset do not fit in 32 bits, and must be clamped
> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
> 
> Say that we have commit C that is child of B, and it has correct commit
> date.  Because of mistake in commit A, it has corrected commit date of
> more than 2120-08-22 (corrected commit date degenerated into topological
> level plus constant).
> 
> Now C can reach B, and B can reach A.  However, if we recover corrected
> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
> we get a number that is smaller than correct corrected commit date.  We
> will have
> 
>    gen(A) > date(B) + offset(B) < gen(C)
> 
> Which breaks reachability condition guarantee.
> 
> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
> we would get
> 
>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
> 
> And again reachability condition is broken.
> 
> This is a very contrived but possible example.  This shouldn't happen,
> but ufortunately it can happen.
> 

Yes, that's very unfortunate. 

Here's a much simpler example:

A commit P has an reasonable commit date (i.e. after release of Git to
present) D and has a child commit C with committer date 0. Now, the 
corrected commiter date of C would D + 1 and the offset would be same too,
as the committer date is zero. This overflows as reasonable dates are of
the order 2 ^ 34.

> 
> The question is how to deal with this issue.  Ignore it as unlikely?
> Switch to storing corrected commit date, which is monotonic, so if there
> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
> larger commit-graph file?
> 

To be honest, I would prefer storing corrected committer dates over
storing offsets.

While it is 7% of the size of commit-graph file, it is also *only* around
~3.5 MB for a repository of the size of linux kernel (and IIRC
correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

Minimizing space and memory requirements are a top priority, but
shouldn't making sure our program is correct and efficient to be a
greater priority?

I would love to hear your and Dr. Stolee's opinions on this.

> > +    * This list can be later modified to store future generation number related
> > +      data.
> 
> How can it be later modified?  There is no header, no version number.
> How would we add another generation number data?
> 

We could modify the graph version in future. Here's how I think it would
work:

Graph Version 1, No GDAT -> Topological level
Graph Version 2, GDAT    -> Corrected committer dates
Graph Version 3, GDAT    -> Generation number v3

and so on.

Of course, we do not have to update generation number definition for
each graph version.

However, my statement could still be wrong for things that we do not
foresee (similar to how we missed the hard die on different graph version),
so I am removing the statement.

> > +
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index 808fa30b99..f27145328c 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
> >
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates
> > +2. Topological levels
> 
> Should we add versioning info, that is:
> 
>   +1. Corrected committer dates  (generation number v2)
>   +2. Topological levels  (generation number v1)
> 

Yes, added.

> > +
> > +Define "corrected committer date" of a commit recursively as follows:
> > +
> > +  * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> > +
> > +  * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > +Define the "topological level" of a commit recursively as follows:
> >
> >   * A commit with no parents (a root commit) has generation number one.
> 
> Shouldn't this be
> 
>     * A commit with no parents (a root commit) has topological level of one.
> 

Thanks, fixed!

> >
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> >
> > -Equivalently, the generation number of a commit A is one more than the
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> 
> We should probably explicitly state that the property state applies to
> both versions of generation number, not only to topological level.
> 
> >
> > @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >
> > -This heuristic is currently used whenever the computation is allowed to
> > -violate topological relationships due to clock skew (such as "git log"
> > -with default order), but is not used when the topological order is
> > -required (such as merge base calculations, "git log --graph").
> > -
> 
> To be overly pedantic, this heuristic is still used, but now in much
> more rare case.  In addition to what is stated above, at least one layer
> in the split commit-graph chain must have been generated by "Old" Git,
> for the date heuristic to be used.
> 
> But that might be unnecessary level of detail.
> 
> >  In practice, we expect some commits to be created recently and not stored
> >  in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> 
> All right.
> 
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > -generation numbers are computed to be at least this value. We limit at
> > -this value since it is the largest value that can be stored in the
> > -commit-graph file using the 30 bits available to generation numbers. This
> > -presents another case where a commit can have generation number equal to
> > -that of a parent.
> > +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> > +are computed to be at least this value. We limit at this value since it is
> > +the largest value that can be stored in the commit-graph file using the
> > +available to generation numbers. This presents another case where a
> > +commit can have generation number equal to that of a parent.
> 
> All right, though it could have been done without re-wrapping, so that
> only first line would be marked as changed.
> 
> As I wrote, there is theoretical problem with this for offsets.
> 
> >
> >  Design Details
> >  --------------
> > @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >
> > +We also merge commit-graph chains when we try to write a commit graph with
> > +two different generation number definitions as they cannot be compared directly.
> > +We overwrite the existing chain and create a commit-graph with the newer or more
> > +efficient defintion. For example, overwriting topological levels commit graph
> > +chain to create a corrected commit dates commit graph chain.
> > +
> 
> This is more complicated than that.
> 
> I think we should explicitly state that Git ensures that in split
> commit-graph chain, if there are layers without the GDAT chunk (that
> force Git to use topological levels for generation numbers), then they
> are top layers.  So if there is commit-graph file created by "Old" Git,
> then when addig new layer it would also be GDAT-less.
> 
> Now how to write this...

Thinking about this, I feel creating a new section called "Handling
Mixed Generation Number Chains" made more sense:

  ## Handling Mixed Generation Number Chains

  With the introduction of generation number v2 and generation data chunk,
  the following scenario is possible:

  1. "New" Git writes a commit-graph with a GDAT chunk.
  2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

  The commits in the lower layer will be interpreted as having very large
  generation values (commit date plus offset) compared to the generation
  numbers in the top layer (toplogical level). This violates the
  expectation that the generation of a parent is strictly smaller than the
  generation of a child. In such cases, we revert to using topological
  levels for all layers to maintain backwards compatability.

  When writing a new layer in split commit-graph, we write a GDAT chunk
  only if the topmost layer has a GDAT chunk. This guarantees that if a
  lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

  Rewriting layers follows similar approach: if the topmost layer below
  set of layers being rewriteen (in the split commit-graph chain) exists,
  and it does not contain GDAT chunk, then the result of rewrite does not
  have GDAT chunks either.

> 
> >  ## Deleting graph-{hash} files
> >
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27  6:39         ` Abhishek Kumar
@ 2020-08-27 12:43           ` Jakub Narębski
  2020-08-27 13:15           ` Derrick Stolee
  1 sibling, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-08-27 12:43 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski, Junio C Hamano

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]

>>> +    * This list of 4-byte values store corrected commit date offsets for the
>>> +      commits, arranged in the same order as commit data chunk.
>>
>> I have just realized purely theoretical, but possible, problem with
>> storing non-monotinic generation number related values like corrected
>> commit date offset in constrained space.  There are problems with
>> clamping them.
>>
>> Say that somewhere in the ancestry chain there is a commit A with commit
>> date far in the future by mistake, for example 2120-08-22; it is
>> important for that date to be not able to be represented using uint32_t.
>> Say that a later descendant commit B is malformed, and has committer
>> date of 0, that is 1970-01-01. This means that the corrected commit date
>> for B must be larger than 2120-08-22 - which for this commit means that
>> corrected commit date offset do not fit in 32 bits, and must be clamped
>> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
>>
>> Say that we have commit C that is child of B, and it has correct commit
>> date.  Because of mistake in commit A, it has corrected commit date of
>> more than 2120-08-22 (corrected commit date degenerated into topological
>> level plus constant).
>>
>> Now C can reach B, and B can reach A.  However, if we recover corrected
>> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
>> we get a number that is smaller than correct corrected commit date.  We
>> will have
>>
>>    gen(A) > date(B) + offset(B) < gen(C)
>>
>> Which breaks reachability condition guarantee.
>>
>> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
>> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
>> we would get
>>
>>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
>>
>> And again reachability condition is broken.
>>
>> This is a very contrived but possible example.  This shouldn't happen,
>> but ufortunately it can happen.
>>
>
> Yes, that's very unfortunate.
>
> Here's a much simpler example:
>
> A commit P has an reasonable commit date (i.e. after release of Git to
> present) D and has a child commit C with committer date 0. Now, the
> corrected commiter date of C would D + 1 and the offset would be same too,
> as the committer date is zero. This overflows as reasonable dates are of
> the order 2 ^ 34.

No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value,
so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07),
which is related to storing date as a _signed_ 32-bit integer

The current-ish Unix epoch time is 1598524281 - let's use it for value
of D.  Then the offset for commit C would be 1598524282.  The current
proposal uses 32 bits to store commit date offsets (as unsigned value).
The maximum value of offset that we can store is therefore 2^32 - 1,
which is 4294967295.

   corrected commit date offset(C) = 1,598,524,282
   GENERATION_NUMBER_V2_MAX        = 4,294,967,295

As you can see there is no overflow in the simplified example.

>>
>> The question is how to deal with this issue.  Ignore it as unlikely?
>> Switch to storing corrected commit date, which is monotonic, so if there
>> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
>> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
>> larger commit-graph file?
>>
>
> To be honest, I would prefer storing corrected committer dates over
> storing offsets.
>
> While it is 7% of the size of commit-graph file, it is also *only* around
> ~3.5 MB for a repository of the size of linux kernel (and IIRC
> correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

It is up to 7% of per-commit data, and it doesn't take into account EDGE
chunk (for octopus merges), and it doesn't also take into account the
size of changed-paths Bloom filters data take in the commit-graph.

> Minimizing space and memory requirements are a top priority, but
> shouldn't making sure our program is correct and efficient to be a
> greater priority?

On the other hand the case where we would encounter offsets that do not
fit in uint32_t is extremply unlikely in sane repositories.

I can think of three solutions:

1. use 64-bit corrected commit dates
   - advantages:
     * simplest code,
     * no need for overflow handling, as we can store all possible values
       of timestamp_t
   - disadvantages:
     * commit-graph size increased by up to 7%

2. use 32-bit corrected commit date offsets,
   but simply do not store GDAT chunk if there is offset that would not
   fit in 32-bit wide field
   - advantages:
     * commit-graph is smaller
     * relatively simple overflow handling
   - disadvantages:
     * performance penalty (generation number v1 vs v2) for abnormal
       repositories (with overflow not fitting in uint32_t)
     * tests would be needed to exercise the overflow code

3. use 32-bit for corrected commit date offset,
   with oveflow handling, for example using most significant bit
   to denote that other bits store position into offset overflow
   with 64-bits for those offsets that do not fit in 31-bits
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
   - disadvantages:
     * most complex code of all proposed solutions
     * smaller overflow limit of 2^31 - 1
     * tests would be needed to exercise the overflow code

I think because the situation where we encounter overflow in 32-bit
corrected commit date offset is rare, we should go with either 1 or 2
solution.

> I would love to hear your and Dr. Stolee's opinions on this.

I have CC-ed Junio C Hamano to ask for his opinion.

>>> +    * This list can be later modified to store future generation number related
>>> +      data.
>>
>> How can it be later modified?  There is no header, no version number.
>> How would we add another generation number data?
>>
>
> We could modify the graph version in future. Here's how I think it would
> work:
>
> Graph Version 1, No GDAT -> Topological level
> Graph Version 2, GDAT    -> Corrected committer dates
> Graph Version 3, GDAT    -> Generation number v3
>
> and so on.
>
> Of course, we do not have to update generation number definition for
> each graph version.

So it was about generic mechanism, not something specific to the GDAT chunk.

> However, my statement could still be wrong for things that we do not
> foresee (similar to how we missed the hard die on different graph version),
> so I am removing the statement.

Good.

[...]
>>> +We also merge commit-graph chains when we try to write a commit graph with
>>> +two different generation number definitions as they cannot be compared directly.
>>> +We overwrite the existing chain and create a commit-graph with the newer or more
>>> +efficient defintion. For example, overwriting topological levels commit graph
>>> +chain to create a corrected commit dates commit graph chain.
>>> +
>>
>> This is more complicated than that.
>>
>> I think we should explicitly state that Git ensures that in split
>> commit-graph chain, if there are layers without the GDAT chunk (that
>> force Git to use topological levels for generation numbers), then they
>> are top layers.  So if there is commit-graph file created by "Old" Git,
>> then when addig new layer it would also be GDAT-less.
>>
>> Now how to write this...
>
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
>
>   ## Handling Mixed Generation Number Chains
>
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
>
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.
>
>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

Good idea, and nice writeup.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27  6:39         ` Abhishek Kumar
  2020-08-27 12:43           ` Jakub Narębski
@ 2020-08-27 13:15           ` Derrick Stolee
  2020-09-01 13:01             ` Abhishek Kumar
  1 sibling, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-08-27 13:15 UTC (permalink / raw)
  To: 85y2m6fhkm.fsf, Jakub Narębski
  Cc: abhishekkumar8222, git, gitgitgadget, me

On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
> 
>   ## Handling Mixed Generation Number Chains
> 
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
> 
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

I like the idea of this section, and this setup is good.

>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.

s/toplogical/topological

But also, we don't want to phrase this as "in this case, we do the wrong
thing" but instead

  A naive approach of using the newest available generation number from
  each layer would lead to violated expectations: the lower layer would
  use corrected commit dates which are much larger than the topological
  levels of the higher layer. For this reason, Git inspects each layer
  to see if any layer is missing corrected commit dates. In such a case,
  Git only uses topological levels.

>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

s/lyer/layer

Perhaps leaving this at a higher level than referencing "GDAT chunk" is
advisable. Perhaps use "we write corrected commit dates" or "all lower
layers must store corrected commit dates as well", for example.

>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

This could use more positive language to make it clear that sometimes
we _do_ want to write corrected commit dates when merging layers:

  When merging layers, we do not consider whether the merged layers had
  corrected commit dates. Instead, the new layer will have corrected
  commit dates if and only if all existing layers below the new layer
  have corrected commit dates.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-08-22 19:09       ` Jakub Narębski
@ 2020-09-01 10:08         ` Abhishek Kumar
  2020-09-03 19:11           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 10:08 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Sat, Aug 22, 2020 at 09:09:21PM +0200, Jakub Narębski wrote:
> Hello Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With corrected commit dates implemented, we no longer have to rely on
> > commit date as a heuristic in paint_down_to_common().
> 
> All right, but it would be nice to have some benchmark data: what were
> performance when using topological levels, what was performance when
> using commit date heuristics (before this patch), what is performace now
> when using corrected commit date.
> 
> >
> > t6024-recursive-merge setups a unique repository where all commits have
> > the same committer date without well-defined merge-base. As this has
> > already caused problems (as noted in 859fdc0 (commit-graph: define
> > GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
> > test script.
> 
> OK?

In hindsight, that is a terrible explanation. Here's what I have revised
this to:

  With corrected commit dates implemented, we no longer have to rely on
  commit date as a heuristic in paint_down_to_common().

  While using corrected commit dates Git walks nearly the same number of
  commits as commit date, the process is slower as for each comparision we
  have to access the commit-slab (for corrected committer date) instead of
  accessing struct member (for committer date).

  For example, the command `git merge-base v4.8 v4.9` on the linux
  repository walks 167468 commits, taking 0.135s for committer date and
  167496 commits, taking 0.157s for corrected committer date respectively.

  t6404-recursive-merge setups a unique repository where all commits have
  the same committer date without well-defined merge-base. As this has
  already caused problems (as noted in 859fdc0 (commit-graph: define
  GIT_TEST_COMMIT_GRAPH, 2018-08-29)).

  While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
  date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
  conflicts' merges commits in the order:
  - Merge C with B to form a intermediate commit.
  - Merge the intermediate commit with A.

  With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
  use the corrected committer date, which changes the order in which
  commits are merged:
  - Merge A with B to form a intermediate commit.
  - Merge the intermediate commit with C.

  While resulting repositories are equivalent, 6404.4 'virtual trees were
  processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
  different merge-bases and thus have different object ids for the
  intermediate commits.

  As this has already causes problems (as noted in 859fdc0 (commit-graph:
  define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
  within t6404-recursive-merge.
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c             | 14 ++++++++++++++
> >  commit-graph.h             |  6 ++++++
> >  commit-reach.c             |  2 +-
> >  t/t6024-recursive-merge.sh |  4 +++-
> >  4 files changed, 24 insertions(+), 2 deletions(-)
> >
> 
> I have reorderd files for easier review.
> 
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 3cf89d895d..e22ec1e626 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -91,6 +91,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
> >   */
> >  int generation_numbers_enabled(struct repository *r);
> >
> > +/*
> > + * Return 1 if and only if the repository has a commit-graph
> > + * file and generation data chunk has been written for the file.
> > + */
> > +int corrected_commit_dates_enabled(struct repository *r);
> > +
> >  enum commit_graph_write_flags {
> >  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
> >  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
> 
> All right.
> 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index c1292f8e08..6411068411 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
> >  	return !!first_generation;
> >  }
> >
> > +int corrected_commit_dates_enabled(struct repository *r)
> > +{
> > +	struct commit_graph *g;
> > +	if (!prepare_commit_graph(r))
> > +		return 0;
> > +
> > +	g = r->objects->commit_graph;
> > +
> > +	if (!g->num_commits)
> > +		return 0;
> > +
> > +	return !!g->chunk_generation_data;
> > +}
> 
> The previous commit introduced validate_mixed_generation_chain(), which
> walked whole split commit-graph chain, and set `read_generation_data`
> field in `struct commit_graph` for all layers in the chain.
> 
> This function examines only the top layer, so it follows the assumption
> that Git would behave in such way that oly topmost layers in the chai
> can be GDAT-less.
> 
> Why the difference?  Couldn't validate_mixed_generation_chain() simply
> call corrected_commit_dates_enabled()?

The previous commit didn't need to walk the whole split commit-graph
chain. Because of how we are handling writing in a mixed generation data
chunk, if a layer has generation data chunk, all layers below it have a
generation data chunk as well.

So, there are two cases at hand:

- Topmost layer has generation data chunk, so we know all layers below
  it has generation data chunk and we can read values from it.
- Topmost layer does not have generation data chunk, so we know we can't
  read from generation data chunk.

Just checking the topmost layer suffices - modified the previous commit.

Then, this function is more or less the same as
`g->read_generation_data` that is, if we are reading from generation
data chunk, we are using corrected commit dates.

> 
> > +
> >  static void close_commit_graph_one(struct commit_graph *g)
> >  {
> >  	if (!g)
> > diff --git a/commit-reach.c b/commit-reach.c
> > index 470bc80139..3a1b925274 100644
> > --- a/commit-reach.c
> > +++ b/commit-reach.c
> > @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
> >  	int i;
> >  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
> >
> > -	if (!min_generation)
> 
> This check was added in 091f4cf (commit: don't use generation numbers if
> not needed, 2018-08-30) by Derrick Stolee, and its commit message
> includes benchmark results for running 'git merge-base v4.8 v4.9' in
> Linux kernel repository:
> 
>       v2.18.0: 0.122s    167,468 walked
>   v2.19.0-rc1: 0.547s    635,579 walked
>          HEAD: 0.127s
> 
> > +	if (!min_generation && !corrected_commit_dates_enabled(r))
> >  		queue.compare = compare_commits_by_commit_date;
> 
> It would be nice to have similar benchmark for this change... unless of
> course there is no change in performance, but I think then it needs to
> be stated explicitly.  I think.
> 

Mentioned in the commit message - we walk (nearly) the same number of
commits but take somewhat longer.

> >
> >  	one->object.flags |= PARENT1;
> > diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
> > index 332cfc53fd..d3def66e7d 100755
> > --- a/t/t6024-recursive-merge.sh
> > +++ b/t/t6024-recursive-merge.sh
> > @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
> >  export GIT_COMMITTER_DATE
> >
> >  test_expect_success 'setup tests' '
> > +	GIT_TEST_COMMIT_GRAPH=0 &&
> > +	export GIT_TEST_COMMIT_GRAPH &&
> >  	echo 1 >a1 &&
> >  	git add a1 &&
> >  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
> > @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
> >  '
> >
> >  test_expect_success 'combined merge conflicts' '
> > -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> > +	test_must_fail git merge -m final G
> >  '
> >
> >  test_expect_success 'result contains a conflict' '
> 
> OK, so instead of disabling commit-graph for this test, now we disable
> it for the whole script.
> 
> Maybe this change should be in a separate patch?

With the explanation in commit message, it's clear to see how using
corrected commit dates leads to an (incorrectly) failing test. Does it
still make sense to seperate them?

> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-08-25  7:56             ` Jakub Narębski
@ 2020-09-01 10:26               ` Abhishek Kumar
  2020-09-03  9:25                 ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 10:26 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 09:56:44AM +0200, Jakub Narębski wrote:
> On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
>
> ...
>
> >
> > All right, we might want to make use of the fact that the value of 0 for
> > topological level here always mean that its value for a commit needs to
> > be computed, that 0 is not a valid value for topological levels.
> > - if the value 0 came from commit-graph file, it means that it came
> >   from Git version that used commit-graph but didn't compute generation
> >   numbers; the value is GENERATION_NUMBER_ZERO
> > - the value 0 might came from the fact that commit is not in graph,
> >   and that commit-slab zero-initializes the values stored; let's
> >   call this value GENERATION_NUMBER_UNINITIALIZED
> >
> > If we ensure that corrected commit date can never be zero (which is
> > extremely unlikely, as one of root commits would have to be malformed or
> > written on badly misconfigured computer, with value of 0 for committer
> > timestamp), then this "happy accident" can keep working.
> >
> >   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
> >   has corrected commit date of one, to be able to distinguish
> >   uninitialized values.
> >
> > Or something like that.
> >
> > Actually, it is not even necessary, as corrected commit date of 0 just
> > means that this single value (well, for every root commit with commit
> > date of 0) would be unnecessary recomputed in compute_generation_numbers().
> >
> > Anyway, we would want to document this fact in the commit message.
> 
> Alternatively, instead of comparing 'level' (and later in series also
> 'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
> we could load at no extra cost `graph_pos` value and compare it
> against COMMIT_NOT_FROM_GRAPH.
> 
> But with this solution we could never get rid of graph_pos, if we
> think it is unnecessary. If we split commit_graph_data into separate
> slabs (as it was in early versions of respective patch series), we
> would have to pay additional cost.
> 
> But it is an alternative.
> 
> Best,
> -- 
> Jakub Narębski

I think updating a commit date with timestampt of zero to use corrected
commit date of one would leave us more options down the line.

Changing this is easy enough.

For a root commit with timestamp zero, current->date would be zero and 
max_corrected_commit_date would be zero as well. So we can set 
corrected commit date as `max_corrected_commit_date + 1`, instead of the
earlier `(current->date - 1) + 1`.

----

diff --git a/commit-graph.c b/commit-graph.c
index 7ed0a33ad6..e3c5e30405 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1389,7 +1389,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 
-				if (current->date > max_corrected_commit_date)
+				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
  2020-08-25 10:07           ` Jakub Narębski
@ 2020-09-01 11:01             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 11:01 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 12:07:17PM +0200, Jakub Narębski wrote:
> Hello,
> 
> ...
> 
> I think I was not clear enough (in trying to be brief).  I meant here
> loading available generation numbers for use in graph traversal,
> done in later patches in this series.
> 
> In _next_ commit we store topological levels in `generation` field:
> 
>   @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>    	date_low = get_be32(commit_data + g->hash_len + 12);
>    	item->date = (timestamp_t)((date_high << 32) | date_low);
> 
>   -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>   +	if (g->chunk_generation_data)
>   +		graph_data->generation = item->date +
>   +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>   +	else
>   +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 
> 
> We use topo_level slab only when writing the commit-graph file.
> 

Right, I thought the agenda outlined points in the process of writing
commit-graph file.

>
> > We could avoid initializing topo_slab if we are not writing generation
> > data chunk (and thus don't need corrected commit dates) but that
> > wouldn't have an impact on run time while writing commit-graph because
> > computing corrected commit dates is cheap as the main cost is in walking
> > the graph and writing the file.
> 
> Right.
> 
> Though you need to add the cost of allocation and managing extra
> commit slab, I think that amortized cost is negligible.
> 
> But what would be better is showing benchmark data: does writing the
> commit graph without GDAT take not insigificant more time than without
> this patch?

Right, we could compare time taken by master and series until (but not
including this patcth) to write a commit-graph file. Will add.

> 
> [...]
> >>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >>>  	for (i = 0; i < g->num_commits; i++) {
> >>>  		struct commit *graph_commit, *odb_commit;
> >>>  		struct commit_list *graph_parents, *odb_parents;
> >>> -		timestamp_t max_generation = 0;
> >>> -		timestamp_t generation;
> >>> +		timestamp_t max_corrected_commit_date = 0;
> >>> +		timestamp_t corrected_commit_date;
> >>
> >> This is simple, and perhaps unnecessary, rename of variables.
> >> Shouldn't we however verify *both* topological level, and
> >> (if exists) corrected commit date?
> >
> > The problem with verifying both topological level and corrected commit
> > dates is that we would have to re-fill commit_graph_data slab with commit
> > data chunk as we cannot modify data->generation otherwise, essentially
> > repeating the whole verification process.
> >
> > While it's okay for now, I might take this up in a future series [1].
> >
> > [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/
> 
> All right, I believe you that verifying both topological level and
> corrected commit date would be more difficult.
> 
> That doesn't change the conclusion that this variable should remain to
> be named `generation`, as when verifying GDAT-less commit-graph files it
> would check topological levels (it uses commit_graph_generation(), which
> in turn uses `generation` field in commit graph info, which as I have
> show above in later patch could be v1 or v2 generation number).
> 

Right, I completely misunderstood you initially. Reverted the variable
name changes.

> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-08-25 11:11           ` Jakub Narębski
@ 2020-09-01 11:35             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 11:35 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Aug 25, 2020 at 01:11:11PM +0200, Jakub Narębski wrote:
> Hello,
> 
> ...
> 
> All right.
> 
> We might want to add here the information that we also move loading the
> commit date from the commit-graph file from fill_commit_in_graph() down
> the [new] call chain into fill_commit_graph_info().  The commit date
> would be needed in fill_commit_graph_info() in the next commit to
> compute corrected commit date out of corrected commit date offset, and
> store it as generation number.
> 
> 
> NOTE that this means that if we switch to storing 64-bit corrected
> commit date directly in the commit-graph file, instead of storing 32-bit
> offsets, neither this Move Statement Into Function Out of Caller
> refactoring nor change to the 'generate tar with future mtime' test
> would be necessary.
> 
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
> > 34-bits for storing commiter date, thus committer time overflows into
> > generation number (within CDAT chunk) and has undefined behavior.
> >
> > The test used to pass as fill_commit_graph_info() would not set struct
> > member `date` of struct commit and loads committer date from the object
> > database, generating a tar file with the expected mtime.
> 
> I guess that in the case of generating a tar file we would read the
> commit out of 'object database', and then only add commit-graph specific
> info with fill_commit_graph_info().  Possibly because we need more
> information that commit-graph provides for a commit.
> 
> >
> > However, with corrected commit date, we will load the committer date
> > from CDAT chunk (truncated to lower 34-bits) to populate the generation
> > number. Thus, fill_commit_graph_info() sets date and generates tar file
> > with the truncated mtime and the test fails.
> >
> > Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
> > will not be truncated.
> 
> Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH
> was used.
> 
> The commit that introduced the 'generate tar with future mtime' test,
> namely e51217e15 (t5000: test tar files that overflow ustar headers,
> 30-06-2016), says:
> 
> 	The ustar format only has room for 11 (or 12, depending on
> 	some implementations) octal digits for the size and mtime of
> 	each file. For values larger than this, we have to add pax
> 	extended headers to specify the real data, and git does not
> 	yet know how to do so.
> 
> 	Before fixing that, let's start off with some test
> 	infrastructure [...]
> 
> The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12.
> So we need the value of (2 ^ 36 + 1) for this test do do its job.
> Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough
> (if we skip testing "some implementations").
> 
> So I think to make this test more clear (for inquisitive minds) we
> should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds
> since EPOCH.  Maybe even add a variant of this test that uses the
> origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off
> use of serialized commit-graph.

That's pretty interesting! I didn't look into this either, will modify
the existing test and add a new test for it.

Thanks for investigating this further.

> 
> I'm sorry for not checking this earlier.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-08-25 12:18           ` Jakub Narębski
@ 2020-09-01 12:06             ` Abhishek Kumar
  2020-09-03 13:42               ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 12:06 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee, me

On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> ...
> 
> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
> difficult.  As far as I can see, we can choose one of the *three*
> solutions (the third one is _new_):
> 
> a. store 64-bit corrected commit date in the GDAT chunk
>    all possible values are able to be stored, no need for
>    GENERATION_NUMBER_V2_MAX,
> 
> b. store 32-bit corrected commit date offset in the GDAT chunk,
>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>    do not write GDAT chunk at all (like for backward compatibility
>    with mixed-version chains of split commit-graph layers),
> 
> c. store 32-bit corrected commit date offset in the GDAT chunk,
>    using some kind of overflow handling scheme; for example if
>    the most significant bit of 32-bit value is 1, then the
>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>    to store those corrected commit date offsets that do not
>    fit in 32 bits.
> 

Alright, so the third solution leverages the fact that in practice,
very few offsets would overflow the 32-bit limit. Using 64-bits for all
offsets would be wasteful, we can trade off a miniscule amount of
computation to save large amounts of disk space.

>
> This type of schema is used in other places in Git code, if I remember
> it correctly.
> 

Yes, it's a similar idea to the extra edge list chunk, where the most
significant bit of second parent indicates whether they are more than
two parents.

It's definitely feasible, albeit a little complex.

What's the overall consensus on the third solution?

>
> >> The commit message says nothing about the new symbolic constant
> >> GENERATION_NUMBER_V1_INFINITY, though.
> >>
> >> I'm not sure it is even needed (see comments below).
> >
> > Yes, you are correct. I tried it out with your suggestions and it wasn't
> > really needed.
> >
> > Thanks for catching this!
> 
> Mistakes can happen when changig how the series is split into commits.
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 11/11] doc: add corrected commit date info
  2020-08-27 13:15           ` Derrick Stolee
@ 2020-09-01 13:01             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-01 13:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb

On Thu, Aug 27, 2020 at 09:15:56AM -0400, Derrick Stolee wrote:
> On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> > Thinking about this, I feel creating a new section called "Handling
> > Mixed Generation Number Chains" made more sense:
> > 
> >   ## Handling Mixed Generation Number Chains
> > 
> >   With the introduction of generation number v2 and generation data chunk,
> >   the following scenario is possible:
> > 
> >   1. "New" Git writes a commit-graph with a GDAT chunk.
> >   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> I like the idea of this section, and this setup is good.
> 
> >   The commits in the lower layer will be interpreted as having very large
> >   generation values (commit date plus offset) compared to the generation
> >   numbers in the top layer (toplogical level). This violates the
> >   expectation that the generation of a parent is strictly smaller than the
> >   generation of a child. In such cases, we revert to using topological
> >   levels for all layers to maintain backwards compatability.
> 
> s/toplogical/topological
> 
> But also, we don't want to phrase this as "in this case, we do the wrong
> thing" but instead
> 
>   A naive approach of using the newest available generation number from
>   each layer would lead to violated expectations: the lower layer would
>   use corrected commit dates which are much larger than the topological
>   levels of the higher layer. For this reason, Git inspects each layer
>   to see if any layer is missing corrected commit dates. In such a case,
>   Git only uses topological levels.
> 
> >   When writing a new layer in split commit-graph, we write a GDAT chunk
> >   only if the topmost layer has a GDAT chunk. This guarantees that if a
> >   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
> 
> s/lyer/layer
> 
> Perhaps leaving this at a higher level than referencing "GDAT chunk" is
> advisable. Perhaps use "we write corrected commit dates" or "all lower
> layers must store corrected commit dates as well", for example.
> 
> >   Rewriting layers follows similar approach: if the topmost layer below
> >   set of layers being rewriteen (in the split commit-graph chain) exists,
> >   and it does not contain GDAT chunk, then the result of rewrite does not
> >   have GDAT chunks either.
> 
> This could use more positive language to make it clear that sometimes
> we _do_ want to write corrected commit dates when merging layers:
> 
>   When merging layers, we do not consider whether the merged layers had
>   corrected commit dates. Instead, the new layer will have corrected
>   commit dates if and only if all existing layers below the new layer
>   have corrected commit dates.

Thanks, that is a great suggestion! Using positive language is more
straightforward and easier to understand.

> 
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
  2020-09-01 10:26               ` Abhishek Kumar
@ 2020-09-03  9:25                 ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-09-03  9:25 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Aug 25, 2020 at 09:56:44AM +0200, Jakub Narębski wrote:
>> On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> ...
>>
>>>
>>> All right, we might want to make use of the fact that the value of 0 for
>>> topological level here always mean that its value for a commit needs to
>>> be computed, that 0 is not a valid value for topological levels.
>>> - if the value 0 came from commit-graph file, it means that it came
>>>   from Git version that used commit-graph but didn't compute generation
>>>   numbers; the value is GENERATION_NUMBER_ZERO
>>> - the value 0 might came from the fact that commit is not in graph,
>>>   and that commit-slab zero-initializes the values stored; let's
>>>   call this value GENERATION_NUMBER_UNINITIALIZED
>>>
>>> If we ensure that corrected commit date can never be zero (which is
>>> extremely unlikely, as one of root commits would have to be malformed or
>>> written on badly misconfigured computer, with value of 0 for committer
>>> timestamp), then this "happy accident" can keep working.
>>>
>>>   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
>>>   has corrected commit date of one, to be able to distinguish
>>>   uninitialized values.
>>>
>>> Or something like that.
>>>
>>> Actually, it is not even necessary, as corrected commit date of 0 just
>>> means that this single value (well, for every root commit with commit
>>> date of 0) would be unnecessary recomputed in compute_generation_numbers().
>>>
>>> Anyway, we would want to document this fact in the commit message.
>> 
>> Alternatively, instead of comparing 'level' (and later in series also
>> 'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
>> we could load at no extra cost `graph_pos` value and compare it
>> against COMMIT_NOT_FROM_GRAPH.
>> 
>> But with this solution we could never get rid of graph_pos, if we
>> think it is unnecessary. If we split commit_graph_data into separate
>> slabs (as it was in early versions of respective patch series), we
>> would have to pay additional cost.
>> 
>> But it is an alternative.
>
> I think updating a commit date with timestampt of zero to use corrected
> commit date of one would leave us more options down the line.
>
> Changing this is easy enough.
>
> For a root commit with timestamp zero, current->date would be zero and 
> max_corrected_commit_date would be zero as well. So we can set 
> corrected commit date as `max_corrected_commit_date + 1`, instead of the
> earlier `(current->date - 1) + 1`.
>
> ----
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7ed0a33ad6..e3c5e30405 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1389,7 +1389,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
>  
> -				if (current->date > max_corrected_commit_date)
> +				if (current->date && current->date > max_corrected_commit_date)
>  					max_corrected_commit_date = current->date - 1;
>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>  			}

It turned out to be much easier than I have expected: a one-line change,
adding simply a new condition.  Good work!

Perhaps it would be better to write it as current->date == GENERATION_NUMBER_UNINITIALIZED
(or *_ZERO, or *_NO_DATA,...), but current version is quite idiomatic
and easy to read.

With this change we should, of course, also change the commit-graph
format docs.


On the other hand it is a bit unnecessary.  If `generation` is zero,
using it would still work, and it would just mean that it would be
unnecessarily recomputed - but corrected commit date equal zero is
possible only for root commits.

But the above solution is more consistent, using 0 to mark not
initialized values...  it is cleaner, at the cost of one more corner
case, single line change, and possibly an insignificant amount of
performance penalty due to adding unlikely true branch to the
conditional.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-01 12:06             ` Abhishek Kumar
@ 2020-09-03 13:42               ` Jakub Narębski
  2020-09-05 17:21                 ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-09-03 13:42 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hi,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> 
>> ...
>> 
>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>> difficult.  As far as I can see, we can choose one of the *three*
>> solutions (the third one is _new_):
>> 
>> a. store 64-bit corrected commit date in the GDAT chunk
>>    all possible values are able to be stored, no need for
>>    GENERATION_NUMBER_V2_MAX,
>> 
>> b. store 32-bit corrected commit date offset in the GDAT chunk,
>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>    do not write GDAT chunk at all (like for backward compatibility
>>    with mixed-version chains of split commit-graph layers),
>> 
>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>    using some kind of overflow handling scheme; for example if
>>    the most significant bit of 32-bit value is 1, then the
>>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>>    to store those corrected commit date offsets that do not
>>    fit in 32 bits.

Note that I have posted more detailed analysis of advantages and
disadvantages of each of the above solutions in response to 11/11
https://public-inbox.org/git/85o8mwb6nq.fsf@gmail.com/

I can think of yet another solution, a variant of approach 'c' with
different overflow handling scheme:

c'.  Store 32-bit corrected commit date offset in the GDAT chunk,
     using the following overflow handling scheme: if the value
     is 0xFFFFFFFF (all bits set to 1, the maximum possible value
     for uint32_t), then the corrected commit date or corrected
     commit date offset can be found in GDOV chunk (Generation
     Data OVerflow handling).

     The GDOV chunk is composed of:
     - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
     - 8 bytes (64 bits) of corrected commit date or its offset
     
     Commits in GDOV chunk are sorted; as we expect for the number
     of commits that require GDOV to be zero or a very small number
     there is no need for GDO Fanout chunk.
     
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
     * overflow limit reduced only by 1 (a single value)
   - disadvantages:
     * most complex code of all proposed solutions
       even more complicated than for solution 'c',
       different from EDGE chunk handling
     * tests would be needed to exercise the overflow code

Or we can split overflow handling into two chunks: GDOI (Generation Data
Overflow Index) and GDOV, where GDOI would be composed of H bytes of
commit OID or 4 bytes of commit graph position (sorted), and GDOV would
be composed oly of 8 bytes (64 bits) of corrected commit date data.

This c'') variant has the same advantages and disadvantages as c'), with
negligibly slightly larger disk size and possibly slightly better
performance because of better data locality.

>
> Alright, so the third solution leverages the fact that in practice,
> very few offsets would overflow the 32-bit limit. Using 64-bits for all
> offsets would be wasteful, we can trade off a miniscule amount of
> computation to save large amounts of disk space.

On the other hand we can say that we can trade negligible increase of
commit-graph disk space size (less than 7% in worst case: no octopus
merges, no changed-path Bloom filter data, using SHA-1 for object ids,
large repository so header size + OIFD is negligible) for simpler code
with no need for overflow handling at all (and a minuscule amount of
less computations).

>>
>> This type of schema is used in other places in Git code, if I remember
>> it correctly.
>> 
>
> Yes, it's a similar idea to the extra edge list chunk, where the most
> significant bit of second parent indicates whether they are more than
> two parents.

Yes and no.  Yes, the solution 'c' uses exactly the same mechanism as
the pointer from Commid Data chunk into Extra Edges List chunk:

      [...]  If there are more than two parents, the second value
      has its most-significant bit on and the other bits store an array
      position into the Extra Edge List chunk.

On the other hand we need to have some kind of overflow handling for the
list of parents, as the number of parents is not limited in Git (there
is no technical upper limit on the number of parents a commits can
have), as opposed to for example Mercurial.  This is not the case for
storing corrected commit date (or corrected commit date offset), as 64
bits is all we would ever need.

> It's definitely feasible, albeit a little complex.
>
> What's the overall consensus on the third solution?

Still waiting for others to weight in.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-09-01 10:08         ` Abhishek Kumar
@ 2020-09-03 19:11           ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-09-03 19:11 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 09:09:21PM +0200, Jakub Narębski wrote:
>> 
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> With corrected commit dates implemented, we no longer have to rely on
>>> commit date as a heuristic in paint_down_to_common().
>> 
>> All right, but it would be nice to have some benchmark data: what were
>> performance when using topological levels, what was performance when
>> using commit date heuristics (before this patch), what is performace now
>> when using corrected commit date.

All right, the new proposed commit message has this benchmark data.
Thanks.

>>> t6024-recursive-merge setups a unique repository where all commits have
>>> the same committer date without well-defined merge-base. As this has
>>> already caused problems (as noted in 859fdc0 (commit-graph: define
>>> GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
>>> test script.
>> 
>> OK?
>
> In hindsight, that is a terrible explanation. Here's what I have revised
> this to:
>
>   With corrected commit dates implemented, we no longer have to rely on
>   commit date as a heuristic in paint_down_to_common().
>
>   While using corrected commit dates Git walks nearly the same number of
>   commits as commit date, the process is slower as for each comparision we
>   have to access the commit-slab (for corrected committer date) instead of
>   accessing struct member (for committer date).
>
>   For example, the command `git merge-base v4.8 v4.9` on the linux
>   repository walks 167468 commits, taking 0.135s for committer date and
>   167496 commits, taking 0.157s for corrected committer date respectively.
>
>   t6404-recursive-merge setups a unique repository where all commits have
>   the same committer date without well-defined merge-base. As this has
>   already caused problems (as noted in 859fdc0 (commit-graph: define
>   GIT_TEST_COMMIT_GRAPH, 2018-08-29)).
>
>   While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
>   date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
>   conflicts' merges commits in the order:
>   - Merge C with B to form a intermediate commit.
>   - Merge the intermediate commit with A.
>
>   With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
>   use the corrected committer date, which changes the order in which
>   commits are merged:
>   - Merge A with B to form a intermediate commit.
>   - Merge the intermediate commit with C.
>
>   While resulting repositories are equivalent, 6404.4 'virtual trees were
>   processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
>   different merge-bases and thus have different object ids for the
>   intermediate commits.
>
>   As this has already causes problems (as noted in 859fdc0 (commit-graph:
>   define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
>   within t6404-recursive-merge.

Much better.  Thanks a lot.

[...]
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index c1292f8e08..6411068411 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>> @@ -703,6 +703,20 @@ int generation_numbers_enabled(struct repository *r)
>>>  	return !!first_generation;
>>>  }
>>>
>>> +int corrected_commit_dates_enabled(struct repository *r)
>>> +{
>>> +	struct commit_graph *g;
>>> +	if (!prepare_commit_graph(r))
>>> +		return 0;
>>> +
>>> +	g = r->objects->commit_graph;
>>> +
>>> +	if (!g->num_commits)
>>> +		return 0;
>>> +
>>> +	return !!g->chunk_generation_data;
>>> +}
>> 
>> The previous commit introduced validate_mixed_generation_chain(), which
>> walked whole split commit-graph chain, and set `read_generation_data`
>> field in `struct commit_graph` for all layers in the chain.
>> 
>> This function examines only the top layer, so it follows the assumption
>> that Git would behave in such way that oly topmost layers in the chai
>> can be GDAT-less.
>> 
>> Why the difference?  Couldn't validate_mixed_generation_chain() simply
>> call corrected_commit_dates_enabled()?
>
> The previous commit didn't need to walk the whole split commit-graph
> chain.

Errr... but `validate_mixed_generation_chain()` introduced in previous
commit in this patch series *does* walk all the layers of the whole
split commit-graph chain.

	static void validate_mixed_generation_chain(struct repository *r)
	{
		struct commit_graph *g = r->objects->commit_graph;
		int read_generation_data = 1;
	
		while (g) {
			if (!g->chunk_generation_data) {
				read_generation_data = 0;
				break;
			}
			g = g->base_graph;
		}
	
		g = r->objects->commit_graph;
	
		while (g) {
			g->read_generation_data = read_generation_data;
			g = g->base_graph;
		}
	}

Moreover it "marks up" the whole chain, actually walking it twice.

You wrote somewhere else (possibly after I wrote this post) that this
was needed to handle `git commit-graph validate`, if I remember it
correctly.

If it is true, then we need both approaches: the less expensive one
(relying on our assumptions) and the more expensive one.  But we need to
better explain both: why we need more expensive one, why we can use the
less expensive onne (how we ensure that the requirements are fulfilled).

>       Because of how we are handling writing in a mixed generation data
> chunk, if a layer has generation data chunk, all layers below it have a
> generation data chunk as well.
>
> So, there are two cases at hand:
>
> - Topmost layer has generation data chunk, so we know all layers below
>   it has generation data chunk and we can read values from it.
> - Topmost layer does not have generation data chunk, so we know we can't
>   read from generation data chunk.
>
> Just checking the topmost layer suffices - modified the previous commit.
>
> Then, this function is more or less the same as
> `g->read_generation_data` that is, if we are reading from generation
> data chunk, we are using corrected commit dates.

All right.  That explains how corrected_commit_dates_enabled() works,
but not why we need also validate_mixed_generation_chain() that sets
g->read_generation_data for every layer in the chain.

>> 
>>> +
>>>  static void close_commit_graph_one(struct commit_graph *g)
>>>  {
>>>  	if (!g)
>>> diff --git a/commit-reach.c b/commit-reach.c
>>> index 470bc80139..3a1b925274 100644
>>> --- a/commit-reach.c
>>> +++ b/commit-reach.c
>>> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>>>  	int i;
>>>  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>>>
>>> -	if (!min_generation)
>> 
>> This check was added in 091f4cf (commit: don't use generation numbers if
>> not needed, 2018-08-30) by Derrick Stolee, and its commit message
>> includes benchmark results for running 'git merge-base v4.8 v4.9' in
>> Linux kernel repository:
>> 
>>       v2.18.0: 0.122s    167,468 walked
>>   v2.19.0-rc1: 0.547s    635,579 walked
>>          HEAD: 0.127s
>> 
>>> +	if (!min_generation && !corrected_commit_dates_enabled(r))
>>>  		queue.compare = compare_commits_by_commit_date;
>> 
>> It would be nice to have similar benchmark for this change... unless of
>> course there is no change in performance, but I think then it needs to
>> be stated explicitly.  I think.
>> 
>
> Mentioned in the commit message - we walk (nearly) the same number of
> commits but take somewhat longer.

All right, the new proposed commit message has it.

Sidenote: this is outside of the scope of this patch series, but perhaps
we should think about bringing the `generation` field from the
commit-slab back as a member of the `struct commit`; this would need
profiling and benchmarking of the typical workload to get amortized
performance across many git commands.

>>>
>>>  	one->object.flags |= PARENT1;
>>> diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
>>> index 332cfc53fd..d3def66e7d 100755
>>> --- a/t/t6024-recursive-merge.sh
>>> +++ b/t/t6024-recursive-merge.sh

Note: this might be now t/t6404-recursive-merge.sh -- 6404 ot 6024.

>>> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
>>>  export GIT_COMMITTER_DATE
>>>
>>>  test_expect_success 'setup tests' '
>>> +	GIT_TEST_COMMIT_GRAPH=0 &&
>>> +	export GIT_TEST_COMMIT_GRAPH &&
>>>  	echo 1 >a1 &&
>>>  	git add a1 &&
>>>  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
>>> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
>>>  '
>>>
>>>  test_expect_success 'combined merge conflicts' '
>>> -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
>>> +	test_must_fail git merge -m final G
>>>  '
>>>
>>>  test_expect_success 'result contains a conflict' '
>> 
>> OK, so instead of disabling commit-graph for this test, now we disable
>> it for the whole script.
>> 
>> Maybe this change should be in a separate patch?
>
> With the explanation in commit message, it's clear to see how using
> corrected commit dates leads to an (incorrectly) failing test. Does it
> still make sense to seperate them?

No, I think that the new commit message explains why those changes are
together.

On the other hand it might be a good idea to add a TODO comment to this
test to mark it as fragile (fixing it is certainly out of scope of this
patch series, but better have something to remind us about the issue).
Perhaps:

  # TODO: fragile test, relies on specific resolving of ambiguity

Or something like that.  The original commit that added
GIT_TEST_COMMIT_GRAPH=0 (for a single test) explained:

  There is one test in t6024-recursive-merge.sh that relies on the
  merge-base algorithm picking one of two ambiguous merge-bases, and
  the commit-graph feature changes which merge-base is picked.

I'm not sure of we could salvage some of this test as it is now adding
`env GIT_TEST_COMMIT_GRAPH=0` in more individual tests instead of
turning it off for the whole test script.  But that is something that we
can do later.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-03 13:42               ` Jakub Narębski
@ 2020-09-05 17:21                 ` Abhishek Kumar
  2020-09-13 15:39                   ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-09-05 17:21 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee, gitgitgadget

On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
> Hi,
> 
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> > On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
> >> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> >> 
> >> ...
> >> 
> >> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
> >> difficult.  As far as I can see, we can choose one of the *three*
> >> solutions (the third one is _new_):
> >> 
> >> a. store 64-bit corrected commit date in the GDAT chunk
> >>    all possible values are able to be stored, no need for
> >>    GENERATION_NUMBER_V2_MAX,
> >> 
> >> b. store 32-bit corrected commit date offset in the GDAT chunk,
> >>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
> >>    do not write GDAT chunk at all (like for backward compatibility
> >>    with mixed-version chains of split commit-graph layers),
> >> 
> >> c. store 32-bit corrected commit date offset in the GDAT chunk,
> >>    using some kind of overflow handling scheme; for example if
> >>    the most significant bit of 32-bit value is 1, then the
> >>    rest 31-bits are position in GDOV chunk, which uses 64-bit
> >>    to store those corrected commit date offsets that do not
> >>    fit in 32 bits.
> 
> Note that I have posted more detailed analysis of advantages and
> disadvantages of each of the above solutions in response to 11/11
> https://public-inbox.org/git/85o8mwb6nq.fsf@gmail.com/
> 
> I can think of yet another solution, a variant of approach 'c' with
> different overflow handling scheme:
> 
> c'.  Store 32-bit corrected commit date offset in the GDAT chunk,
>      using the following overflow handling scheme: if the value
>      is 0xFFFFFFFF (all bits set to 1, the maximum possible value
>      for uint32_t), then the corrected commit date or corrected
>      commit date offset can be found in GDOV chunk (Generation
>      Data OVerflow handling).
> 
>      The GDOV chunk is composed of:
>      - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
>      - 8 bytes (64 bits) of corrected commit date or its offset
>      
>      Commits in GDOV chunk are sorted; as we expect for the number
>      of commits that require GDOV to be zero or a very small number
>      there is no need for GDO Fanout chunk.
>      
>    - advantages:
>      * commit-graph is smaller, increasing for abnormal repos
>      * overflow limit reduced only by 1 (a single value)
>    - disadvantages:
>      * most complex code of all proposed solutions
>        even more complicated than for solution 'c',
>        different from EDGE chunk handling
>      * tests would be needed to exercise the overflow code
> 
> Or we can split overflow handling into two chunks: GDOI (Generation Data
> Overflow Index) and GDOV, where GDOI would be composed of H bytes of
> commit OID or 4 bytes of commit graph position (sorted), and GDOV would
> be composed oly of 8 bytes (64 bits) of corrected commit date data.
> 
> This c'') variant has the same advantages and disadvantages as c'), with
> negligibly slightly larger disk size and possibly slightly better
> performance because of better data locality.
> 

The primary benefit of c') over c) seems to the range of valid offsets -
c') can range from [0, 0xFFFFFFFF) whereas offsets for c) can range
betwen [0, 0x7FFFFFF].

In other words, we should prefer c') over c) only if the offsets are
usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF)

Commits were overflowing corrected committer date offsets are rare, and
offsets in that particular range are doubly rare. To be wrong within
the range would be have an offset of 68 to 136 years, so that's possible
only if the corrupted timestamp is in future (it's been 50.68 years since
Unix epoch 0 so far).

Thikning back to the linux repository, the largest offset was around of
the order of 2 ^ 25 (offset of 1.06 years) and I would assume that holds

Overall, I don't think the added complexity (compared to c) approach)
makes up for by greater versatility.

[1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-05 17:21                 ` Abhishek Kumar
@ 2020-09-13 15:39                   ` Jakub Narębski
  2020-09-28 21:48                     ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-09-13 15:39 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Derrick Stolee, Abhishek Kumar via GitGitGadget, Taylor Blau,
	Jakub Narębski

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>>>
>>>> ...
>>>>
>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>> difficult.  As far as I can see, we can choose one of the *three*
>>>> solutions (the third one is _new_):
>>>>
>>>> a. store 64-bit corrected commit date in the GDAT chunk
>>>>    all possible values are able to be stored, no need for
>>>>    GENERATION_NUMBER_V2_MAX,
>>>>
>>>> b. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>    do not write GDAT chunk at all (like for backward compatibility
>>>>    with mixed-version chains of split commit-graph layers),
>>>>
>>>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>    using some kind of overflow handling scheme; for example if
>>>>    the most significant bit of 32-bit value is 1, then the
>>>>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>>>>    to store those corrected commit date offsets that do not
>>>>    fit in 32 bits.
>>
>> Note that I have posted more detailed analysis of advantages and
>> disadvantages of each of the above solutions in response to 11/11
>> https://public-inbox.org/git/85o8mwb6nq.fsf@gmail.com/
>>
>> I can think of yet another solution, a variant of approach 'c' with
>> different overflow handling scheme:
>>
>> c'.  Store 32-bit corrected commit date offset in the GDAT chunk,
>>      using the following overflow handling scheme: if the value
>>      is 0xFFFFFFFF (all bits set to 1, the maximum possible value
>>      for uint32_t), then the corrected commit date or corrected
>>      commit date offset can be found in GDOV chunk (Generation
>>      Data OVerflow handling).
>>
>>      The GDOV chunk is composed of:
>>      - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
>>      - 8 bytes (64 bits) of corrected commit date or its offset
>>
>>      Commits in GDOV chunk are sorted; as we expect for the number
>>      of commits that require GDOV to be zero or a very small number
>>      there is no need for GDO Fanout chunk.
>>
>>    - advantages:
>>      * commit-graph is smaller, increasing for abnormal repos
>>      * overflow limit reduced only by 1 (a single value)
>>    - disadvantages:
>>      * most complex code of all proposed solutions
>>        even more complicated than for solution 'c',
>>        different from EDGE chunk handling
>>      * tests would be needed to exercise the overflow code
>>
>> Or we can split overflow handling into two chunks: GDOI (Generation Data
>> Overflow Index) and GDOV, where GDOI would be composed of H bytes of
>> commit OID or 4 bytes of commit graph position (sorted), and GDOV would
>> be composed oly of 8 bytes (64 bits) of corrected commit date data.
>>
>> This c'') variant has the same advantages and disadvantages as c'), with
>> negligibly slightly larger disk size and possibly slightly better
>> performance because of better data locality.
>>
>
> The primary benefit of c') over c) seems to the range of valid offsets -
> c') can range from [0, 0xFFFFFFFF) whereas offsets for c) can range
> betwen [0, 0x7FFFFFF].
>
> In other words, we should prefer c') over c) only if the offsets are
> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF)
>
> Commits were overflowing corrected committer date offsets are rare, and
> offsets in that particular range are doubly rare. To be wrong within
> the range would be have an offset of 68 to 136 years, so that's possible
> only if the corrupted timestamp is in future (it's been 50.68 years since
> Unix epoch 0 so far).

Right, the c) variant has the same limitation as if corrected commit
date offsets were stored as signed 32-bit integer (int32_t), so to have
overflow we would have date post Y2k38 followed by date of Unix epoch 0.
Very unlikely.

>
> Thinking back to the linux repository, the largest offset was around of
> the order of 2 ^ 25 (offset of 1.06 years) and I would assume that holds
>
> Overall, I don't think the added complexity (compared to c) approach)
> makes up for by greater versatility.
>
> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/

All right.

With variant c) we have additional advantage in that we can pattern the
code on the code for EDGE chunk handling, as you said.

I wanted to warn about the need for sanity checking, like ensuring that
we have GDOV chunk and that it is large enough -- but it turns out that
we skip this bounds checking for extra edges / EDGE chunk:

	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
		pptr = insert_parent_or_die(r, g, edge_value, pptr);
		return 1;
	}

	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
			  4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
	do {
		edge_value = get_be32(parent_data_ptr);
		pptr = insert_parent_or_die(r, g,
					    edge_value & GRAPH_EDGE_LAST_MASK,
					    pptr);
		parent_data_ptr++;
	} while (!(edge_value & GRAPH_LAST_EDGE));

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-13 15:39                   ` Jakub Narębski
@ 2020-09-28 21:48                     ` Jakub Narębski
  2020-10-05  5:25                       ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-09-28 21:48 UTC (permalink / raw)
  To: git
  Cc: Abhishek Kumar, Jakub Narębski, Derrick Stolee,
	Abhishek Kumar via GitGitGadget, Taylor Blau

Hello,

Jakub Narębski <jnareb@gmail.com> writes:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>>>>
>>>>> ...
>>>>>
>>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>>> difficult.  As far as I can see, we can choose one of the *three*
>>>>> solutions (the third one is _new_):
>>>>>
>>>>> a. store 64-bit corrected commit date in the GDAT chunk
>>>>>    all possible values are able to be stored, no need for
>>>>>    GENERATION_NUMBER_V2_MAX,
>>>>>
>>>>> b. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>>    do not write GDAT chunk at all (like for backward compatibility
>>>>>    with mixed-version chains of split commit-graph layers),
>>>>>
>>>>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>>    using some kind of overflow handling scheme; for example if
>>>>>    the most significant bit of 32-bit value is 1, then the
>>>>>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>>>>>    to store those corrected commit date offsets that do not
>>>>>    fit in 32 bits.
>>>
>>> Note that I have posted more detailed analysis of advantages and
>>> disadvantages of each of the above solutions in response to 11/11
>>> https://public-inbox.org/git/85o8mwb6nq.fsf@gmail.com/
>>>
>>> I can think of yet another solution, a variant of approach 'c'
[...]
>>
>> The primary benefit of c') over c) seems to the range of valid offsets -
>> c') can range from [0, 0xFFFFFFFF) whereas offsets for c) can range
>> betwen [0, 0x7FFFFFF].
>>
>> In other words, we should prefer c') over c) only if the offsets are
>> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF)
>>
>> Commits were overflowing corrected committer date offsets are rare, and
>> offsets in that particular range are doubly rare. To be wrong within
>> the range would be have an offset of 68 to 136 years, so that's possible
>> only if the corrupted timestamp is in future (it's been 50.68 years since
>> Unix epoch 0 so far).
>
> Right, the c) variant has the same limitation as if corrected commit
> date offsets were stored as signed 32-bit integer (int32_t), so to have
> overflow we would have date post Y2k38 followed by date of Unix epoch 0.
> Very unlikely.
>
>> Thinking back to the linux repository, the largest offset was around of
>> the order of 2 ^ 25 (offset of 1.06 years) and I would assume that holds
>>
>> Overall, I don't think the added complexity (compared to c) approach)
>> makes up for by greater versatility.
>>
>> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/
>
> All right.
>
> With variant c) we have additional advantage in that we can pattern the
> code on the code for EDGE chunk handling, as you said.
>
> I wanted to warn about the need for sanity checking, like ensuring that
> we have GDOV chunk and that it is large enough -- but it turns out that
> we skip this bounds checking for extra edges / EDGE chunk:
>
> 	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
> 		pptr = insert_parent_or_die(r, g, edge_value, pptr);
> 		return 1;
> 	}
>
> 	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
> 			  4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
> 	do {
> 		edge_value = get_be32(parent_data_ptr);
> 		pptr = insert_parent_or_die(r, g,
> 					    edge_value & GRAPH_EDGE_LAST_MASK,
> 					    pptr);
> 		parent_data_ptr++;
> 	} while (!(edge_value & GRAPH_LAST_EDGE));

Both Taylor Blau and Junio C Hamano agree that it is better to store
corrected commit date offsets and have overload handling to halve (to
reduce by 50%) the size of the GDAT chunk.  I have not heard from
Derrick Stolee.

It looks then that it is the way to go; as I said that you have
convinced me that variant 'c' (EDGE-like) is the best solution for
overflow handling.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
  2020-09-28 21:48                     ` Jakub Narębski
@ 2020-10-05  5:25                       ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-10-05  5:25 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Mon, Sep 28, 2020 at 11:48:05PM +0200, Jakub Narębski wrote:
> Hello,
> 
> 
> Both Taylor Blau and Junio C Hamano agree that it is better to store
> corrected commit date offsets and have overload handling to halve (to
> reduce by 50%) the size of the GDAT chunk.  I have not heard from
> Derrick Stolee.
> 
> It looks then that it is the way to go; as I said that you have
> convinced me that variant 'c' (EDGE-like) is the best solution for
> overflow handling.
> 

Great!

So I have implemented the variant 'c' and I am unsure whether my tests
are exhaustive enough. Can you preview the commit "commit-graph:
implement generation data chunk" [1] on the pull request?.

Apart from that, I am ready to publish the v4 to the mailing list.

[1]: https://github.com/gitgitgadget/git/pull/676/commits/390973da1d744cbb8a08a3b99c991f6d04ae9baf

> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date
  2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
                       ` (11 preceding siblings ...)
  2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
@ 2020-10-07 14:09     ` Abhishek Kumar via GitGitGadget
  2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
                         ` (11 more replies)
  12 siblings, 12 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem 1
[https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/], we also
needed a way to distinguish between the old and new generation number
without incrementing graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then 
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with timestamp
zero has corrected commit date of 1 to be able distinguish it from
GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it's possible for
the offset to overflow the space allocated. To handle such cases, we
introduce a new chunk, Generation Data Overflow (GDOV) that stores the
corrected commit date. For overflowing offsets, we set MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.

For mixed generation number environment (for example new Git on the command
line, old Git used by GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of split commit-graph files
have GDAT chunk and others do not). As backward compatibility is one of the
goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels cannot
be compared directly.

While writing on top of a split commit-graph, we check if the tip of the
chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
chunk, rest of the chain does too.

If the topmost split commit-graph file does not have a GDAT chunk (meaning
it has been appended by the old Git), we write without GDAT chunk. We do
write a GDAT chunk when the existing chain does not have GDAT chunk - when
we are writing to the commit-graph chain with the 'replace' strategy.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 4:

 * Added GDOV to handle overflows in generation data.
 * Added a test for writing tip graph for a generation number v2 graph chain
   in t5324-split-commit-graph.sh
 * Added a section on how mixed generation number chains are handled in 
   Documentation/technical/commit-graph-format.txt
 * Reverted unimportant whitespace, style changes in commit-graph.c
 * Added header comments about the order of comparision for
   compare_commits_by_gen_then_commit_date in commit.h,
   compare_commits_by_gen in commit-graph.h
 * Elaborated on why t6404 fails with corrected commit date and must be run
   with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected
   commit dates in paint_down_to_common()"
 * Elaborated on write behavior for mixed generation number chains in the
   commit "commit-graph: use generation v2 only if entire chain does"
 * Added notes about adding the topo_level slab to struct
   write_commit_graph_context as well as struct commit_graph.
 * Clarified commit message for "commit-graph: consolidate
   fill_commit_graph_info"
 * Removed the claim "GDAT can store future generation numbers" because it
   hasn't been tested yet.

Changes in version 3:

 * Reordered patches as discussed in 2
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
   .
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed generation commit chains.
 * Improve commit messages for "commit-graph: fix regression when computing
   bloom filter", "commit-graph: consolidate fill_commit_graph_info",
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (10):
  commit-graph: fix regression when computing Bloom filters
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  21 +-
 Documentation/technical/commit-graph.txt      |  62 ++++-
 commit-graph.c                                | 256 ++++++++++++++----
 commit-graph.h                                |  17 +-
 commit-reach.c                                |  38 +--
 commit-reach.h                                |   2 +-
 commit.c                                      |   4 +-
 commit.h                                      |   5 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |  20 +-
 t/t5318-commit-graph.sh                       |  70 ++++-
 t/t5324-split-commit-graph.sh                 |  98 ++++++-
 t/t6404-recursive-merge.sh                    |   5 +-
 t/t6600-test-reach.sh                         |  68 ++---
 upload-pack.c                                 |   2 +-
 18 files changed, 534 insertions(+), 158 deletions(-)


base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v3:

  1:  c6b7ade7af !  1:  fae81b534b commit-graph: fix regression when computing bloom filter
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: fix regression when computing bloom filter
     +    commit-graph: fix regression when computing Bloom filters
      
          commit_gen_cmp is used when writing a commit-graph to sort commits in
          generation order before computing Bloom filters. Since c49c82aa (commit:
     @@ Commit message
          'commit_graph_data_at(c)->generation') in order to access it while
          writing.
      
     +    While measuring performance with `git commit-graph write --reachable
     +    --changed-paths` on the linux repository led to around 1m40s for both
     +    HEAD and master (and could be due to fault in my measurements), it is
     +    still the "right" thing to do.
     +
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
  2:  e673867234 !  2:  4470d91642 revision: parse parent in indegree_walk_step()
     @@ revision.c: static void indegree_walk_step(struct rev_info *revs)
       		struct commit *parent = p->item;
       		int *pi = indegree_slab_at(&info->indegree, parent);
       
     -+		if (parse_commit_gently(parent, 1) < 0)
     ++		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
      +			return;
      +
       		if (*pi)
  3:  18d5864f81 !  3:  18bb3318a1 commit-graph: consolidate fill_commit_graph_info
     @@ Commit message
          implementation by calling fill_commit_graph_info() within
          fill_commit_in_graph().
      
     -    The test 'generate tar with future mtime' creates a commit with commit
     -    time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
     -    generation number (within CDAT chunk) and has undefined behavior.
     +    fill_commit_graph_info() used to not load committer data from commit data
     +    chunk. However, with the corrected committer date, we have to load
     +    committer date to calculate generation number value.
      
     -    The test used to pass as fill_commit_in_graph() guarantees the values of
     -    graph position and generation number, and did not load timestamp.
     -    However, with corrected commit date we will need load the timestamp as
     -    well to populate the generation number.
     +    e51217e15 (t5000: test tar files that overflow ustar headers,
     +    30-06-2016) introduced a test 'generate tar with future mtime' that
     +    creates a commit with committer date of (2 ^ 36 + 1) seconds since
     +    EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
     +    committer time overflows into generation number (within CDAT chunk) and
     +    has undefined behavior.
      
     -    Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds.
     +    The test used to pass as fill_commit_graph_info() would not set struct
     +    member `date` of struct commit and loads committer date from the object
     +    database, generating a tar file with the expected mtime.
     +
     +    However, with corrected commit date, we will load the committer date
     +    from CDAT chunk (truncated to lower 34-bits to populate the generation
     +    number. Thus, Git sets date and generates tar file with the truncated
     +    mtime.
     +
     +    The ustar format (the header format used by most modern tar programs)
     +    only has room for 11 (or 12, depending om some implementations) octal
     +    digits for the size and mtime of each files.
     +
     +    Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
     +    digit implementations while still fitting into commit data chunk.
     +
     +    Since we want to test 12-octal digit implementations of ustar as well,
     +    let's modify the existing test to no longer use commit-graph file.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
      -	graph_data->graph_pos = pos;
       	lex_index = pos - g->num_commits_in_base;
      -
     --	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
     -+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
     + 	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
       
       	item->object.parsed = 1;
       
     @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
       	edge_value = get_be32(commit_data + g->hash_len);
      
       ## t/t5000-tar-tree.sh ##
     -@@ t/t5000-tar-tree.sh: test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
     +@@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
     + 	test_cmp expect actual
     + '
     + 
     ++test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
     ++	rm -f .git/index &&
     ++	echo foo >file &&
     ++	git add file &&
     ++	GIT_COMMITTER_DATE="@17179869183 +0000" \
     ++		git commit -m "tempori parendum"
     ++'
     ++
     ++test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
     ++	git archive HEAD >future.tar
     ++'
     ++
     ++test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
     ++	echo 2514 >expect &&
     ++	tar_info future.tar | cut -d" " -f2 >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     + test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
       	rm -f .git/index &&
       	echo content >file &&
       	git add file &&
      -	GIT_COMMITTER_DATE="@68719476737 +0000" \
     -+	GIT_COMMITTER_DATE="@17179869183 +0000" \
     ++	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
       		git commit -m "tempori parendum"
       '
       
     -@@ t/t5000-tar-tree.sh: test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
     - '
     - 
     - test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
     --	echo 4147 >expect &&
     -+	echo 2514 >expect &&
     - 	tar_info future.tar | cut -d" " -f2 >actual &&
     - 	test_cmp expect actual
     - '
  4:  6a0cde983d <  -:  ---------- commit-graph: consolidate compare_commits_by_gen
  5:  6be759a954 !  4:  011b0aa497 commit-graph: return 64-bit generation number
     @@ Commit message
          commit_graph_generation(), use timestamp_t for local variables and
          define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
      
     +    We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
     +    represent the largest topological level we can store in the commit data
     +    chunk.
     +
     +    With corrected commit dates implemented, we will have two such *_MAX
     +    variables to denote the largest offset and largest topological level
     +    that can be stored.
     +
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
     @@ commit-graph.c: uint32_t commit_graph_position(const struct commit *c)
       {
       	struct commit_graph_data *data =
       		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
     -@@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
     - int compare_commits_by_gen(const void *_a, const void *_b)
     - {
     - 	const struct commit *a = _a, *b = _b;
     --	const uint32_t generation_a = commit_graph_generation(a);
     --	const uint32_t generation_b = commit_graph_generation(b);
     -+	const timestamp_t generation_a = commit_graph_generation(a);
     -+	const timestamp_t generation_b = commit_graph_generation(b);
     - 
     - 	/* older commits first */
     - 	if (generation_a < generation_b)
      @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
       	const struct commit *a = *(const struct commit **)va;
       	const struct commit *b = *(const struct commit **)vb;
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
       	if (generation_a < generation_b)
       		return -1;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     + 					_("Computing commit graph generation numbers"),
     + 					ctx->commits.nr);
     + 	for (i = 0; i < ctx->commits.nr; i++) {
     +-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     ++		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
       
       		display_progress(ctx->progress, i + 1);
     --		if (generation != GENERATION_NUMBER_INFINITY &&
     -+		if (generation != GENERATION_NUMBER_V1_INFINITY &&
     - 		    generation != GENERATION_NUMBER_ZERO)
     - 			continue;
     - 
     + 		if (generation != GENERATION_NUMBER_INFINITY &&
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 			for (parent = current->parents; parent; parent = parent->next) {
     - 				generation = commit_graph_data_at(parent->item)->generation;
     - 
     --				if (generation == GENERATION_NUMBER_INFINITY ||
     -+				if (generation == GENERATION_NUMBER_V1_INFINITY ||
     - 				    generation == GENERATION_NUMBER_ZERO) {
     - 					all_parents_computed = 0;
     - 					commit_list_insert(parent->item, &list);
     + 				data->generation = max_generation + 1;
     + 				pop_commit(&list);
     + 
     +-				if (data->generation > GENERATION_NUMBER_MAX)
     +-					data->generation = GENERATION_NUMBER_MAX;
     ++				if (data->generation > GENERATION_NUMBER_V1_MAX)
     ++					data->generation = GENERATION_NUMBER_V1_MAX;
     + 			}
     + 		}
     + 	}
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       	for (i = 0; i < g->num_commits; i++) {
       		struct commit *graph_commit, *odb_commit;
     @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_grap
       
       		display_progress(progress, i + 1);
       		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     +@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     + 			continue;
     + 
     + 		/*
     +-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     +-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     ++		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
     ++		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
     + 		 * extra logic in the following condition.
     + 		 */
     +-		if (max_generation == GENERATION_NUMBER_MAX)
     ++		if (max_generation == GENERATION_NUMBER_V1_MAX)
     + 			max_generation--;
     + 
     + 		generation = commit_graph_generation(graph_commit);
      
       ## commit-graph.h ##
      @@ commit-graph.h: void disable_commit_graph(struct repository *r);
     @@ commit-graph.h: void disable_commit_graph(struct repository *r);
      -uint32_t commit_graph_generation(const struct commit *);
      +timestamp_t commit_graph_generation(const struct commit *);
       uint32_t commit_graph_position(const struct commit *);
     - 
     - int compare_commits_by_gen(const void *_a, const void *_b);
     + #endif
      
       ## commit-reach.c ##
      @@ commit-reach.c: static int queue_has_nonstale(struct prio_queue *queue)
     @@ commit-reach.c: int repo_in_merge_bases_many(struct repository *r, struct commit
       {
       	struct commit_list *bases;
       	int ret = 0, i;
     --	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
     -+	timestamp_t generation, min_generation = GENERATION_NUMBER_INFINITY;
     +-	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
     ++	timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
       
       	if (repo_parse_commit(r, commit))
       		return ret;
     @@ commit-reach.c: static enum contains_result contains_tag_algo(struct commit *can
       		struct commit *c = p->item;
       		load_commit_graph_info(the_repository, c);
       		generation = commit_graph_generation(c);
     +@@ commit-reach.c: static int compare_commits_by_gen(const void *_a, const void *_b)
     + 	const struct commit *a = *(const struct commit * const *)_a;
     + 	const struct commit *b = *(const struct commit * const *)_b;
     + 
     +-	uint32_t generation_a = commit_graph_generation(a);
     +-	uint32_t generation_b = commit_graph_generation(b);
     ++	timestamp_t generation_a = commit_graph_generation(a);
     ++	timestamp_t generation_b = commit_graph_generation(b);
     + 
     + 	if (generation_a < generation_b)
     + 		return -1;
      @@ commit-reach.c: int can_all_from_reach_with_flag(struct object_array *from,
       				 unsigned int with_flag,
       				 unsigned int assign_flag,
     @@ commit-reach.h: int can_all_from_reach_with_flag(struct object_array *from,
       		       int commit_date_cutoff);
       
      
     + ## commit.c ##
     +@@ commit.c: int compare_commits_by_author_date(const void *a_, const void *b_,
     + int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
     + {
     + 	const struct commit *a = a_, *b = b_;
     +-	const uint32_t generation_a = commit_graph_generation(a),
     +-		       generation_b = commit_graph_generation(b);
     ++	const timestamp_t generation_a = commit_graph_generation(a),
     ++			  generation_b = commit_graph_generation(b);
     + 
     + 	/* newer commits first */
     + 	if (generation_a < generation_b)
     +
       ## commit.h ##
      @@
       #include "commit-slab.h"
       
       #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
      -#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
     +-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
      +#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
     -+#define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
     - #define GENERATION_NUMBER_MAX 0x3FFFFFFF
     ++#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
       #define GENERATION_NUMBER_ZERO 0
       
     + struct commit_list {
      
       ## revision.c ##
      @@ revision.c: define_commit_slab(indegree_slab, int);
     @@ revision.c: static void init_topo_walk(struct rev_info *revs)
      -		uint32_t generation;
      +		timestamp_t generation;
       
     - 		if (parse_commit_gently(c, 1))
     + 		if (repo_parse_commit_gently(revs->repo, c, 1))
       			continue;
      @@ revision.c: static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
       	for (p = commit->parents; p; p = p->next) {
  6:  b347dbb01b !  5:  e067f653ad commit-graph: add a slab to store topological levels
     @@ Metadata
       ## Commit message ##
          commit-graph: add a slab to store topological levels
      
     -    As we are writing topological levels to commit data chunk to ensure
     -    backwards compatibility with "Old" Git and the member `generation` of
     -    struct commit_graph_data will store corrected commit date in a later
     -    commit, let's introduce a commit-slab to store topological levels while
     -    writing commit-graph.
     +    In a later commit we will introduce corrected commit date as the
     +    generation number v2. This value will be stored in the new seperate
     +    Generation Data chunk. However, to ensure backwards compatibility with
     +    "Old" Git we need to continue to write generation number v1, which is
     +    topological level, to the commit data chunk. This means that we need to
     +    compute both versions of generation numbers when writing the
     +    commit-graph file. Therefore, let's introduce a commit-slab to store
     +    topological levels; corrected commit date will be stored in the member
     +    `generation` of struct commit_graph_data.
      
          When Git creates a split commit-graph, it takes advantage of the
          generation values that have been computed already and present in
          existing commit-graph files.
      
     -    So, let's add a pointer to struct commit_graph to the topological level
     -    commit-slab and populate it with topological levels while writing a
     -    split commit-graph.
     +    So, let's add a pointer to struct commit_graph as well as struct
     +    write_commit_graph_context to the topological level commit-slab
     +    and populate it with topological levels while writing a commit-graph
     +    file.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: struct write_commit_graph_context {
       		 order_by_pack:1;
       
      +	struct topo_level_slab *topo_levels;
     - 	const struct split_commit_graph_opts *split_opts;
     + 	const struct commit_graph_opts *opts;
       	size_t total_bloom_filter_data_size;
       	const struct bloom_filter_settings *bloom_settings;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       					_("Computing commit graph generation numbers"),
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
     --		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     +-		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     ++		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
       
       		display_progress(ctx->progress, i + 1);
     --		if (generation != GENERATION_NUMBER_V1_INFINITY &&
     +-		if (generation != GENERATION_NUMBER_INFINITY &&
      -		    generation != GENERATION_NUMBER_ZERO)
     -+		if (level != GENERATION_NUMBER_V1_INFINITY &&
     ++		if (level != GENERATION_NUMBER_INFINITY &&
      +		    level != GENERATION_NUMBER_ZERO)
       			continue;
       
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
      -				generation = commit_graph_data_at(parent->item)->generation;
      +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
       
     --				if (generation == GENERATION_NUMBER_V1_INFINITY ||
     +-				if (generation == GENERATION_NUMBER_INFINITY ||
      -				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (level == GENERATION_NUMBER_V1_INFINITY ||
     ++				if (level == GENERATION_NUMBER_INFINITY ||
      +				    level == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
      -				data->generation = max_generation + 1;
       				pop_commit(&list);
       
     --				if (data->generation > GENERATION_NUMBER_MAX)
     --					data->generation = GENERATION_NUMBER_MAX;
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     +-				if (data->generation > GENERATION_NUMBER_V1_MAX)
     +-					data->generation = GENERATION_NUMBER_V1_MAX;
     ++				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
     ++					max_level = GENERATION_NUMBER_V1_MAX - 1;
      +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
       	}
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	uint32_t i, count_distinct = 0;
       	int res = 0;
       	int replace = 0;
     + 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
      +	struct topo_level_slab topo_levels;
       
       	if (!commit_graph_compatible(the_repository))
       		return 0;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 		}
     - 	}
     + 							 bloom_settings.max_changed_paths);
     + 	ctx->bloom_settings = &bloom_settings;
       
      +	init_topo_level_slab(&topo_levels);
      +	ctx->topo_levels = &topo_levels;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      +		}
      +	}
      +
     - 	if (pack_indexes) {
     - 		ctx->order_by_pack = 1;
     - 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
     + 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     + 		ctx->changed_paths = 1;
     + 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
      
       ## commit-graph.h ##
      @@ commit-graph.h: struct commit_graph {
     @@ commit-graph.h: struct commit_graph {
       	struct bloom_filter_settings *bloom_filter_settings;
       };
       
     -
     - ## commit.h ##
     -@@
     - #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
     - #define GENERATION_NUMBER_MAX 0x3FFFFFFF
     - #define GENERATION_NUMBER_ZERO 0
     -+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
     - 
     - struct commit_list {
     - 	struct commit *item;
  7:  4074ace65b !  6:  694ef1ec08 commit-graph: implement corrected commit date
     @@ Commit message
            the maximum of its commit date and one more than the largest corrected
            commit date among its parents.
      
     +    As a special case, a root commit with timestamp of zero (01.01.1970
     +    00:00:00Z) has corrected commit date of one, to be able to distinguish
     +    from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
     +    date).
     +
          To minimize the space required to store corrected commit date, Git
          stores corrected commit date offsets into the commit-graph file. The
          corrected commit date offset for a commit is defined as the difference
          between its corrected commit date and actual commit date.
      
     +    Storing corrected commit date requires sizeof(timestamp_t) bytes, which
     +    in most cases is 64 bits (uintmax_t). However, corrected commit date
     +    offsets can be safely stored using only 32-bits. This halves the size
     +    of GDAT chunk, which is a reduction of around 6% in the size of
     +    commit-graph file.
     +
     +    However, using offsets be problematic if one of commits is malformed but
     +    valid and has committerdate of 0 Unix time, as the offset would be the
     +    same as corrected commit date and thus require 64-bits to be stored
     +    properly.
     +
          While Git does not write out offsets at this stage, Git stores the
          corrected commit dates in member generation of struct commit_graph_data.
          It will begin writing commit date offsets with the introduction of
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
     - 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     + 		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
      +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
       
       		display_progress(ctx->progress, i + 1);
     - 		if (level != GENERATION_NUMBER_V1_INFINITY &&
     + 		if (level != GENERATION_NUMBER_INFINITY &&
      -		    level != GENERATION_NUMBER_ZERO)
      +		    level != GENERATION_NUMBER_ZERO &&
      +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       
       			for (parent = current->parents; parent; parent = parent->next) {
       				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     +-
      +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
     - 
     - 				if (level == GENERATION_NUMBER_V1_INFINITY ||
     + 				if (level == GENERATION_NUMBER_INFINITY ||
      -				    level == GENERATION_NUMBER_ZERO) {
      +				    level == GENERATION_NUMBER_ZERO ||
      +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       			}
       
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 				if (max_level > GENERATION_NUMBER_MAX - 1)
     - 					max_level = GENERATION_NUMBER_MAX - 1;
     + 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
     + 					max_level = GENERATION_NUMBER_V1_MAX - 1;
       				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
      +
     -+				if (current->date > max_corrected_commit_date)
     ++				if (current->date && current->date > max_corrected_commit_date)
      +					max_corrected_commit_date = current->date - 1;
      +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
       			}
       		}
       	}
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 	for (i = 0; i < g->num_commits; i++) {
     - 		struct commit *graph_commit, *odb_commit;
     - 		struct commit_list *graph_parents, *odb_parents;
     --		timestamp_t max_generation = 0;
     --		timestamp_t generation;
     -+		timestamp_t max_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     - 
     - 		display_progress(progress, i + 1);
     - 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_corrected_commit_date)
     -+				max_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       		if (generation_zero == GENERATION_ZERO_EXISTS)
       			continue;
       
      -		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     +-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
     +-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
      -		 * extra logic in the following condition.
      -		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     +-		if (max_generation == GENERATION_NUMBER_V1_MAX)
      -			max_generation--;
      -
     --		generation = commit_graph_generation(graph_commit);
     + 		generation = commit_graph_generation(graph_commit);
      -		if (generation != max_generation + 1)
      -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_corrected_commit_date + 1)
     ++		if (generation < max_generation + 1)
      +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
       				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
     + 				     generation,
     + 				     max_generation + 1);
  8:  4e746628ac !  7:  b903efe2ea commit-graph: implement generation data chunk
     @@ Commit message
          between graph versions in a backwards compatible manner.
      
          We are going to introduce a new chunk called Generation Data chunk (or
     -    GDAT). GDAT stores generation number v2 (and any subsequent versions),
     -    whereas CDAT will still store topological level.
     +    GDAT). GDAT stores corrected committer date offsets whereas CDAT will
     +    still store topological level.
      
          Old Git does not understand GDAT chunk and would ignore it, reading
          topological levels from CDAT. New Git can parse GDAT and take advantage
     @@ Commit message
          which forces commit-graph file to be written without generation data
          chunk to emulate a commit-graph file written by old Git.
      
     +    While storing corrected commit date offset instead of the corrected
     +    commit date saves us 4 bytes per commit, it's possible for the offsets
     +    to overflow the 4-bytes allocated. As such overflows are exceedingly
     +    rare, we use the following overflow management scheme:
     +
     +    We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
     +    to store corrected commit dates for commits with offsets greater than
     +    GENERATION_NUMBER_V2_OFFSET_MAX.
     +
     +    If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
     +    the MSB of the offset and the other bits store the position of corrected
     +    commit date in GDOV chunk, similar to how Extra Edge List is maintained.
     +
     +    We test the overflow-related code with the following repo history:
     +
     +               F - N - U
     +              /         \
     +    U - N - U            N
     +             \          /
     +              N - F - N
     +
     +    Where the commits denoted by U have committer date of zero seconds
     +    since Unix epoch, the commits denoted by N have committer date of
     +    1112354055 (default committer date for the test suite) seconds since
     +    Unix epoch and the commits denoted by F have committer date of
     +    (2 ^ 31 - 2) seconds since Unix epoch.
     +
     +    The largest offset observed is 2 ^ 31, just large enough to overflow.
     +
          [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
     @@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
       #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
       #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
      +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
     ++#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
       #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
       #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
       #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
       #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
      -#define MAX_NUM_CHUNKS 7
     -+#define MAX_NUM_CHUNKS 8
     ++#define MAX_NUM_CHUNKS 9
       
       #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
       
     -@@ commit-graph.c: struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
     +@@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
     + #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
     + 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
     + 
     ++#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
     ++
     + /* Remember to update object flag allocation in object.h */
     + #define REACHABLE       (1u<<15)
     + 
     +@@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
       				graph->chunk_commit_data = data + chunk_offset;
       			break;
       
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(void *graph_map, size_t
      +			else
      +				graph->chunk_generation_data = data + chunk_offset;
      +			break;
     ++
     ++		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
     ++			if (graph->chunk_generation_data_overflow)
     ++				chunk_repeated = 1;
     ++			else
     ++				graph->chunk_generation_data_overflow = data + chunk_offset;
     ++			break;
      +
       		case GRAPH_CHUNKID_EXTRAEDGES:
       			if (graph->chunk_extra_edges)
       				chunk_repeated = 1;
     +@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     + {
     + 	const unsigned char *commit_data;
     + 	struct commit_graph_data *graph_data;
     +-	uint32_t lex_index;
     +-	uint64_t date_high, date_low;
     ++	uint32_t lex_index, offset_pos;
     ++	uint64_t date_high, date_low, offset;
     + 
     + 	while (pos < g->num_commits_in_base)
     + 		g = g->base_graph;
      @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
       	date_low = get_be32(commit_data + g->hash_len + 12);
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
      -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     -+	if (g->chunk_generation_data)
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	else
     ++	if (g->chunk_generation_data) {
     ++		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++
     ++		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
     ++			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
     ++			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
     ++		} else
     ++			graph_data->generation = item->date + offset;
     ++	} else
      +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
       
       	if (g->topo_levels)
       		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
     +@@ commit-graph.c: struct write_commit_graph_context {
     + 	struct packed_oid_list oids;
     + 	struct packed_commit_list commits;
     + 	int num_extra_edges;
     ++	int num_generation_data_overflows;
     + 	unsigned long approx_nr_objects;
     + 	struct progress *progress;
     + 	int progress_done;
      @@ commit-graph.c: struct write_commit_graph_context {
       		 report_progress:1,
       		 split:1,
     @@ commit-graph.c: struct write_commit_graph_context {
      +		 write_generation_data:1;
       
       	struct topo_level_slab *topo_levels;
     - 	const struct split_commit_graph_opts *split_opts;
     + 	const struct commit_graph_opts *opts;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       	return 0;
       }
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      +static int write_graph_chunk_generation_data(struct hashfile *f,
      +					      struct write_commit_graph_context *ctx)
      +{
     -+	int i;
     ++	int i, num_generation_data_overflows = 0;
      +	for (i = 0; i < ctx->commits.nr; i++) {
      +		struct commit *c = ctx->commits.list[i];
      +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
      +
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     ++		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
     ++			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
     ++			num_generation_data_overflows++;
     ++		}
     ++
      +		hashwrite_be32(f, offset);
      +	}
      +
      +	return 0;
      +}
     ++
     ++static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
     ++						       struct write_commit_graph_context *ctx)
     ++{
     ++	int i;
     ++	for (i = 0; i < ctx->commits.nr; i++) {
     ++		struct commit *c = ctx->commits.list[i];
     ++		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
     ++		display_progress(ctx->progress, ++ctx->progress_cnt);
     ++
     ++		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
     ++			hashwrite_be32(f, offset >> 32);
     ++			hashwrite_be32(f, (uint32_t) offset);
     ++		}
     ++	}
     ++
     ++	return 0;
     ++}
      +
       static int write_graph_chunk_extra_edges(struct hashfile *f,
     --					 struct write_commit_graph_context *ctx)
     -+					  struct write_commit_graph_context *ctx)
     + 					 struct write_commit_graph_context *ctx)
       {
     - 	struct commit **list = ctx->commits.list;
     - 	struct commit **last = ctx->commits.list + ctx->commits.nr;
     +@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     + 
     + 				if (current->date && current->date > max_corrected_commit_date)
     + 					max_corrected_commit_date = current->date - 1;
     ++
     + 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
     ++
     ++				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
     ++					ctx->num_generation_data_overflows++;
     + 			}
     + 		}
     + 	}
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
       	chunks[2].id = GRAPH_CHUNKID_DATA;
       	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
      +		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
      +		num_chunks++;
     ++	}
     ++	if (ctx->num_generation_data_overflows) {
     ++		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
     ++		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
     ++		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
     ++		num_chunks++;
      +	}
       	if (ctx->num_extra_edges) {
       		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
       		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
     - 	ctx->split_opts = split_opts;
     + 	ctx->opts = opts;
       	ctx->total_bloom_filter_data_size = 0;
      +	ctx->write_generation_data = 1;
     ++	ctx->num_generation_data_overflows = 0;
       
     - 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     - 		ctx->changed_paths = 1;
     + 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
     + 						      bloom_settings.bits_per_entry);
      
       ## commit-graph.h ##
      @@
     @@ commit-graph.h: struct commit_graph {
       	const unsigned char *chunk_oid_lookup;
       	const unsigned char *chunk_commit_data;
      +	const unsigned char *chunk_generation_data;
     ++	const unsigned char *chunk_generation_data_overflow;
       	const unsigned char *chunk_extra_edges;
       	const unsigned char *chunk_base_graphs;
       	const unsigned char *chunk_bloom_indexes;
      
     + ## commit.h ##
     +@@
     + #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
     + #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
     + #define GENERATION_NUMBER_ZERO 0
     ++#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
     + 
     + struct commit_list {
     + 	struct commit *item;
     +
       ## t/README ##
      @@ t/README: GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
       be written after every 'git commit' command, and overrides the
     @@ t/helper/test-read-graph.c: int cmd__read_graph(int argc, const char **argv)
       		printf(" commit_metadata");
      +	if (graph->chunk_generation_data)
      +		printf(" generation_data");
     ++	if (graph->chunk_generation_data_overflow)
     ++		printf(" generation_data_overflow");
       	if (graph->chunk_extra_edges)
       		printf(" extra_edges");
       	if (graph->chunk_bloom_indexes)
      
       ## t/t4216-log-bloom.sh ##
      @@ t/t4216-log-bloom.sh: test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
     - 	git commit-graph write --reachable --changed-paths
       '
     + 
       graph_read_expect () {
      -	NUM_CHUNKS=5
      +	NUM_CHUNKS=6
       	cat >expect <<- EOF
     - 	header: 43475048 1 1 $NUM_CHUNKS 0
     + 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
       	num_commits: $1
      -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
      +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
     @@ t/t5318-commit-graph.sh: test_expect_success 'write graph in bare repo' '
       '
       
       graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
     -@@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit-graph' '
     +@@ t/t5318-commit-graph.sh: test_expect_success 'warn on improper hash version' '
       
       test_expect_success 'git commit-graph verify' '
       	cd "$TRASH_DIRECTORY/full" &&
     @@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit
       '
       
       NUM_COMMITS=9
     +@@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missing tree)' '
     + 	)
     + '
     + 
     ++test_commit_with_date() {
     ++  file="$1.t" &&
     ++  echo "$1" >"$file" &&
     ++  git add "$file" &&
     ++  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
     ++  git tag "$1"
     ++}
     ++
     ++test_expect_success 'overflow corrected commit date offset' '
     ++	objdir=".git/objects" &&
     ++	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
     ++	FUTURE_DATE="@2147483646 +0000" &&
     ++	test_oid_cache <<-EOF &&
     ++	oid_version sha1:1
     ++	oid_version sha256:2
     ++	EOF
     ++	cd "$TRASH_DIRECTORY" &&
     ++	mkdir repo &&
     ++	cd repo &&
     ++	git init &&
     ++	test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
     ++	test_commit 2 &&
     ++	test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 3 generation_data &&
     ++	test_commit_with_date 4 "$FUTURE_DATE" &&
     ++	test_commit 5 &&
     ++	test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
     ++	git branch left &&
     ++	git reset --hard 3 &&
     ++	test_commit 7 &&
     ++	test_commit_with_date 8 "$FUTURE_DATE" &&
     ++	test_commit 9 &&
     ++	git branch right &&
     ++	git reset --hard 3 &&
     ++	git merge left right &&
     ++	git commit-graph write --reachable &&
     ++	graph_read_expect 10 "generation_data generation_data_overflow" &&
     ++	git commit-graph verify
     ++'
     ++
     ++graph_git_behavior 'overflow corrected commit date offset' repo left right
     ++
     + test_done
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
      -	base sha256:1496
      +	base sha1:1408
      +	base sha256:1528
     - 	EOM
     - '
       
     + 	oid_version sha1:1
     + 	oid_version sha256:2
      @@ t/t5324-split-commit-graph.sh: graph_read_expect() {
       		NUM_BASE=$2
       	fi
       	cat >expect <<- EOF
     --	header: 43475048 1 1 3 $NUM_BASE
     -+	header: 43475048 1 1 4 $NUM_BASE
     +-	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
     ++	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
       	num_commits: $1
      -	chunks: oid_fanout oid_lookup commit_metadata
      +	chunks: oid_fanout oid_lookup commit_metadata generation_data
     @@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:miss' '
      +	test_all_modes in_merge_bases
       '
       
     + test_expect_success 'in_merge_bases_many:hit' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:hit' '
     + 	X:commit-5-7
     + 	EOF
     + 	echo "in_merge_bases_many(A,X):1" >expect &&
     +-	test_three_modes in_merge_bases_many
     ++	test_all_modes in_merge_bases_many
     + '
     + 
     + test_expect_success 'in_merge_bases_many:miss' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss' '
     + 	X:commit-8-6
     + 	EOF
     + 	echo "in_merge_bases_many(A,X):0" >expect &&
     +-	test_three_modes in_merge_bases_many
     ++	test_all_modes in_merge_bases_many
     + '
     + 
     + test_expect_success 'in_merge_bases_many:miss-heuristic' '
     +@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss-heuristic' '
     + 	X:commit-6-6
     + 	EOF
     + 	echo "in_merge_bases_many(A,X):0" >expect &&
     +-	test_three_modes in_merge_bases_many
     ++	test_all_modes in_merge_bases_many
     + '
     + 
       test_expect_success 'is_descendant_of:hit' '
      @@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:hit' '
       	X:commit-1-1
  9:  5a147a9704 !  8:  8ec119edc6 commit-graph: use generation v2 only if entire chain does
     @@ Commit message
          commits in the lower layer before allowing the topo-order queue to write
          anything to output (depending on the size of the upper layer).
      
     +    When writing the new layer in split commit-graph, we write a GDAT chunk
     +    only if the topmost layer has a GDAT chunk. This guarantees that if a
     +    layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
     +
     +    Rewriting layers follows similar approach: if the topmost layer below
     +    the set of layers being rewritten (in the split commit-graph chain)
     +    exists, and it does not contain GDAT chunk, then the result of rewrite
     +    does not have GDAT chunks either.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static struct commit_graph *load_commit_graph_chain(struct repos
       	return graph_chain;
       }
       
     -+static void validate_mixed_generation_chain(struct repository *r)
     ++static void validate_mixed_generation_chain(struct commit_graph *g)
      +{
     -+	struct commit_graph *g = r->objects->commit_graph;
     -+	int read_generation_data = 1;
     ++	int read_generation_data;
      +
     -+	while (g) {
     -+		if (!g->chunk_generation_data) {
     -+			read_generation_data = 0;
     -+			break;
     -+		}
     -+		g = g->base_graph;
     -+	}
     ++	if (!g)
     ++		return;
      +
     -+	g = r->objects->commit_graph;
     ++	read_generation_data = !!g->chunk_generation_data;
      +
      +	while (g) {
      +		g->read_generation_data = read_generation_data;
     @@ commit-graph.c: struct commit_graph *read_commit_graph_one(struct repository *r,
       	if (!g)
       		g = load_commit_graph_chain(r, odb);
       
     -+	validate_mixed_generation_chain(r);
     ++	validate_mixed_generation_chain(g);
      +
       	return g;
       }
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       	date_low = get_be32(commit_data + g->hash_len + 12);
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     --	if (g->chunk_generation_data)
     -+	if (g->chunk_generation_data && g->read_generation_data)
     - 		graph_data->generation = item->date +
     - 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     - 	else
     -@@ commit-graph.c: void load_commit_graph_info(struct repository *r, struct commit *item)
     - 	uint32_t pos;
     - 	if (!prepare_commit_graph(r))
     - 		return;
     +-	if (g->chunk_generation_data) {
     ++	if (g->chunk_generation_data && g->read_generation_data) {
     + 		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     + 
     + 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
     +@@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
     + 		}
     + 	}
     + 
     ++	if (!ctx->write_generation_data && g->chunk_generation_data)
     ++		ctx->write_generation_data = 1;
      +
     - 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
     - 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
     - }
     + 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
     + 		ctx->new_base_graph = g;
     + 	else if (ctx->num_commit_graphs_after != 1)
     +@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     + 		struct commit_graph *g = ctx->r->objects->commit_graph;
     + 
     + 		while (g) {
     ++			g->read_generation_data = 1;
     + 			g->topo_levels = &topo_levels;
     + 			g = g->base_graph;
     + 		}
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       		g = ctx->r->objects->commit_graph;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       			g = g->base_graph;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
     - 		if (ctx->split_opts)
     - 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
     + 		if (ctx->opts)
     + 			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
      +
      +		if (replace)
      +			ctx->write_generation_data = 1;
     @@ commit-graph.h: struct commit_graph {
       	struct object_directory *odb;
       
       	uint32_t num_commits_in_base;
     -+	uint32_t read_generation_data;
     ++	unsigned int read_generation_data;
       	struct commit_graph *base_graph;
       
       	const uint32_t *chunk_oid_fanout;
      
       ## t/t5324-split-commit-graph.sh ##
     -@@ t/t5324-split-commit-graph.sh: done <<\EOF
     - 0600 -r--------
     - EOF
     +@@ t/t5324-split-commit-graph.sh: test_expect_success '--split=replace with partial Bloom data' '
     + 	verify_chain_files_exist $graphdir
     + '
       
      +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
      +	mkdir mixed &&
      +	graphdir=".git/objects/info/commit-graphs" &&
     ++	test_oid_cache <<-EOM &&
     ++	oid_version sha1:1
     ++	oid_version sha256:2
     ++	EOM
      +	cd "$TRASH_DIRECTORY/mixed" &&
      +	git init &&
      +	git config core.commitGraph true &&
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 4 1
     ++	header: 43475048 1 $(test_oid oid_version) 4 1
      +	num_commits: 2
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 4 2
     ++	header: 43475048 1 $(test_oid oid_version) 4 2
      +	num_commits: 3
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	graph_read_expect 15 &&
      +	git commit-graph verify
      +'
     ++
     ++test_expect_success 'add one commit, write a tip graph' '
     ++	cd "$TRASH_DIRECTORY/mixed" &&
     ++	test_commit 11 &&
     ++	git branch commits/11 &&
     ++	git commit-graph write --reachable --split &&
     ++	test_path_is_missing $infodir/commit-graph &&
     ++	test_path_is_file $graphdir/commit-graph-chain &&
     ++	ls $graphdir/graph-*.graph >graph-files &&
     ++	test_line_count = 2 graph-files &&
     ++	verify_chain_files_exist $graphdir
     ++'
      +
       test_done
 10:  439adc1718 !  9:  bb9b02af32 commit-reach: use corrected commit dates in paint_down_to_common()
     @@ Commit message
          With corrected commit dates implemented, we no longer have to rely on
          commit date as a heuristic in paint_down_to_common().
      
     -    t6024-recursive-merge setups a unique repository where all commits have
     -    the same committer date without well-defined merge-base. As this has
     -    already caused problems (as noted in 859fdc0 (commit-graph: define
     -    GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within the
     -    test script.
     +    While using corrected commit dates Git walks nearly the same number of
     +    commits as commit date, the process is slower as for each comparision we
     +    have to access a commit-slab (for corrected committer date) instead of
     +    accessing struct member (for committer date).
     +
     +    For example, the command `git merge-base v4.8 v4.9` on the linux
     +    repository walks 167468 commits, taking 0.135s for committer date and
     +    167496 commits, taking 0.157s for corrected committer date respectively.
     +
     +    t6404-recursive-merge setups a unique repository where all commits have
     +    the same committer date without well-defined merge-base.
     +
     +    While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
     +    date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
     +    conflicts' merges commits in the order:
     +    - Merge C with B to form a intermediate commit.
     +    - Merge the intermediate commit with A.
     +
     +    With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
     +    use the corrected committer date, which changes the order in which
     +    commits are merged:
     +    - Merge A with B to form a intermediate commit.
     +    - Merge the intermediate commit with C.
     +
     +    While resulting repositories are equivalent, 6404.4 'virtual trees were
     +    processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
     +    different merge-bases and thus have different object ids for the
     +    intermediate commits.
     +
     +    As this has already causes problems (as noted in 859fdc0 (commit-graph:
     +    define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
     +    within t6404-recursive-merge.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
      +	if (!g->num_commits)
      +		return 0;
      +
     -+	return !!g->chunk_generation_data;
     ++	return g->read_generation_data;
      +}
      +
     - static void close_commit_graph_one(struct commit_graph *g)
     + struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
       {
     - 	if (!g)
     + 	struct commit_graph *g = r->objects->commit_graph;
      
       ## commit-graph.h ##
     -@@ commit-graph.h: struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
     +@@ commit-graph.h: struct commit_graph *read_commit_graph_one(struct repository *r,
     + struct commit_graph *parse_commit_graph(struct repository *r,
     + 					void *graph_map, size_t graph_size);
     + 
     ++struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
     ++
     + /*
     +  * Return 1 if and only if the repository has a commit-graph
     +  * file and generation numbers are computed in that file.
        */
       int generation_numbers_enabled(struct repository *r);
       
     +-struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
      +/*
      + * Return 1 if and only if the repository has a commit-graph
      + * file and generation data chunk has been written for the file.
      + */
      +int corrected_commit_dates_enabled(struct repository *r);
     -+
     + 
       enum commit_graph_write_flags {
       	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
     - 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
      
       ## commit-reach.c ##
      @@ commit-reach.c: static struct commit_list *paint_down_to_common(struct repository *r,
     @@ commit-reach.c: static struct commit_list *paint_down_to_common(struct repositor
       
       	one->object.flags |= PARENT1;
      
     - ## t/t6024-recursive-merge.sh ##
     -@@ t/t6024-recursive-merge.sh: GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
     + ## t/t6404-recursive-merge.sh ##
     +@@ t/t6404-recursive-merge.sh: GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
       export GIT_COMMITTER_DATE
       
       test_expect_success 'setup tests' '
     @@ t/t6024-recursive-merge.sh: GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
       	echo 1 >a1 &&
       	git add a1 &&
       	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
     -@@ t/t6024-recursive-merge.sh: test_expect_success 'setup tests' '
     +@@ t/t6404-recursive-merge.sh: test_expect_success 'setup tests' '
       '
       
       test_expect_success 'combined merge conflicts' '
     @@ t/t6024-recursive-merge.sh: test_expect_success 'setup tests' '
       '
       
       test_expect_success 'result contains a conflict' '
     +@@ t/t6404-recursive-merge.sh: test_expect_success 'result contains a conflict' '
     + '
     + 
     + test_expect_success 'virtual trees were processed' '
     ++	# TODO: fragile test, relies on ambigious merge-base resolution
     + 	git ls-files --stage >out &&
     + 
     + 	cat >expect <<-EOF &&
 11:  f6f91af305 ! 10:  9ada43967d doc: add corrected commit date info
     @@ Documentation/technical/commit-graph-format.txt: Git commit graph format
       - The root tree OID.
       
      @@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
     +       position. If there are more than two parents, the second value
     +       has its most-significant bit on and the other bits store an array
     +       position into the Extra Edge List chunk.
     +-    * The next 8 bytes store the generation number of the commit and
     ++    * The next 8 bytes store the topological level (generation number v1)
     ++      of the commit and
     +       the commit time in seconds since EPOCH. The generation number
     +       uses the higher 30 bits of the first 4 bytes, while the commit
     +       time uses the 32 bits of the second 4 bytes, along with the lowest
             2 bits of the lowest byte, storing the 33rd and 34th bit of the
             commit time.
       
     -+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
     ++  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
      +    * This list of 4-byte values store corrected commit date offsets for the
      +      commits, arranged in the same order as commit data chunk.
     -+    * This list can be later modified to store future generation number related
     -+      data.
     ++    * If the corrected commit date offset cannot be stored within 31 bits,
     ++      the value has its most-significant bit on and the other bits store
     ++      the position of corrected commit date into the Generation Data Overflow
     ++      chunk.
     ++
     ++  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
     ++    * This list of 8-byte values stores the corrected commit dates for commits
     ++      with corrected commit date offsets that cannot be stored within 31 bits.
      +
         Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
             This list of 4-byte values store the second through nth parents for
     @@ Documentation/technical/commit-graph.txt: A consumer may load the following info
       
      -Define the "generation number" of a commit recursively as follows:
      +There are two definitions of generation number:
     -+1. Corrected committer dates
     -+2. Topological levels
     -+
     ++1. Corrected committer dates (generation number v2)
     ++2. Topological levels (generation nummber v1)
     + 
     +- * A commit with no parents (a root commit) has generation number one.
      +Define "corrected committer date" of a commit recursively as follows:
     -+
     + 
     +- * A commit with at least one parent has generation number one more than
     +-   the largest generation number among its parents.
      +  * A commit with no parents (a root commit) has corrected committer date
      +    equal to its committer date.
     -+
     + 
     +-Equivalently, the generation number of a commit A is one more than the
      +  * A commit with at least one parent has corrected committer date equal to
      +    the maximum of its commiter date and one more than the largest corrected
      +    committer date among its parents.
      +
     ++  * As a special case, a root commit with timestamp zero has corrected commit
     ++    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
     ++    (that is, an uncomputed corrected commit date).
     ++
      +Define the "topological level" of a commit recursively as follows:
     - 
     -  * A commit with no parents (a root commit) has generation number one.
     - 
     -- * A commit with at least one parent has generation number one more than
     --   the largest generation number among its parents.
     ++
     ++ * A commit with no parents (a root commit) has topological level of one.
     ++
      + * A commit with at least one parent has topological level one more than
      +   the largest topological level among its parents.
     - 
     --Equivalently, the generation number of a commit A is one more than the
     ++
      +Equivalently, the topological level of a commit A is one more than the
       length of a longest path from A to a root commit. The recursive definition
       is easier to use for computation and observing the following property:
       
     +@@ Documentation/technical/commit-graph.txt: is easier to use for computation and observing the following property:
     +     generation numbers, then we always expand the boundary commit with highest
     +     generation number and can easily detect the stopping condition.
     + 
     ++The properties applies to both versions of generation number, that is both
     ++corrected committer dates and topological levels.
     ++
     + This property can be used to significantly reduce the time it takes to
     + walk commits and determine topological relationships. Without generation
     + numbers, the general heuristic is the following:
      @@ Documentation/technical/commit-graph.txt: numbers, the general heuristic is the following:
           If A and B are commits with commit time X and Y, respectively, and
           X < Y, then A _probably_ cannot reach B.
       
      -This heuristic is currently used whenever the computation is allowed to
     --violate topological relationships due to clock skew (such as "git log"
     --with default order), but is not used when the topological order is
     --required (such as merge base calculations, "git log --graph").
     --
     - In practice, we expect some commits to be created recently and not stored
     - in the commit graph. We can treat these commits as having "infinite"
     ++In absence of corrected commit dates (for example, old versions of Git or
     ++mixed generation graph chains),
     ++this heuristic is currently used whenever the computation is allowed to
     + violate topological relationships due to clock skew (such as "git log"
     + with default order), but is not used when the topological order is
     + required (such as merge base calculations, "git log --graph").
     +@@ Documentation/technical/commit-graph.txt: in the commit graph. We can treat these commits as having "infinite"
       generation number and walk until reaching commits with known generation
       number.
       
     @@ Documentation/technical/commit-graph.txt: fully-computed generation numbers. Usi
       with generation number *_INFINITY or *_ZERO is valuable.
       
      -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
     --generation numbers are computed to be at least this value. We limit at
     --this value since it is the largest value that can be stored in the
     --commit-graph file using the 30 bits available to generation numbers. This
     --presents another case where a commit can have generation number equal to
     --that of a parent.
     -+We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
     -+are computed to be at least this value. We limit at this value since it is
     -+the largest value that can be stored in the commit-graph file using the
     -+available to generation numbers. This presents another case where a
     -+commit can have generation number equal to that of a parent.
     - 
     - Design Details
     - --------------
     ++We use the macro GENERATION_NUMBER_MAX for commits whose
     + generation numbers are computed to be at least this value. We limit at
     + this value since it is the largest value that can be stored in the
     + commit-graph file using the 30 bits available to generation numbers. This
      @@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the size multiple, 64,000 for the maximum
       number of commits) could be extracted into config settings for full
       flexibility.
       
     -+We also merge commit-graph chains when we try to write a commit graph with
     -+two different generation number definitions as they cannot be compared directly.
     -+We overwrite the existing chain and create a commit-graph with the newer or more
     -+efficient defintion. For example, overwriting topological levels commit graph
     -+chain to create a corrected commit dates commit graph chain.
     ++## Handling Mixed Generation Number Chains
     ++
     ++With the introduction of generation number v2 and generation data chunk, the
     ++following scenario is possible:
     ++
     ++1. "New" Git writes a commit-graph with the corrected commit dates.
     ++2. "Old" Git writes a split commit-graph on top without corrected commit dates.
     ++
     ++A naive approach of using the newest available generation number from
     ++each layer would lead to violated expectations: the lower layer would
     ++use corrected commit dates which are much larger than the topological
     ++levels of the higher layer. For this reason, Git inspects each layer to
     ++see if any layer is missing corrected commit dates. In such a case, Git
     ++only uses topological level
     ++
     ++When writing a new layer in split commit-graph, we write corrected commit
     ++dates if the topmost layer has corrected commit dates written. This
     ++guarantees that if a layer has corrected commit dates, all lower layers
     ++must have corrected commit dates as well.
     ++
     ++When merging layers, we do not consider whether the merged layers had corrected
     ++commit dates. Instead, the new layer will have corrected commit dates if and
     ++only if all existing layers below the new layer have corrected commit dates.
      +
       ## Deleting graph-{hash} files
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-24 23:16         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                         ` (10 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

While measuring performance with `git commit-graph write --reachable
--changed-paths` on the linux repository led to around 1m40s for both
HEAD and master (and could be due to fault in my measurements), it is
still the "right" thing to do.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..94503e584b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 02/10] revision: parse parent in indegree_walk_step()
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
  2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-24 23:41         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                         ` (9 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index aa62212040..c97abcdde1 100644
--- a/revision.c
+++ b/revision.c
@@ -3381,6 +3381,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
  2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
  2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-25 10:52         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                         ` (8 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the corrected committer date, we have to load
committer date to calculate generation number value.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2 ^ 36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and loads committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending om some implementations) octal
digits for the size and mtime of each files.

Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
digit implementations while still fitting into commit data chunk.

Since we want to test 12-octal digit implementations of ustar as well,
let's modify the existing test to no longer use commit-graph file.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 27 ++++++++++-----------------
 t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 94503e584b..e8362e144e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
 	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b65..8f41cdc509 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
 	test_cmp expect actual
 '
 
+test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+	rm -f .git/index &&
+	echo foo >file &&
+	git add file &&
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
+		git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+	git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+	echo 2514 >expect &&
+	tar_info future.tar | cut -d" " -f2 >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
 		git commit -m "tempori parendum"
 '
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 04/10] commit-graph: return 64-bit generation number
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (2 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-25 13:48         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                         ` (7 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 22 +++++++++++-----------
 commit-graph.h |  4 ++--
 commit-reach.c | 36 ++++++++++++++++++------------------
 commit-reach.h |  2 +-
 commit.c       |  4 ++--
 commit.h       |  4 ++--
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 8 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e8362e144e..bfc532de6f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1350,7 +1350,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (generation != GENERATION_NUMBER_INFINITY &&
@@ -1383,8 +1383,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (data->generation > GENERATION_NUMBER_V1_MAX)
+					data->generation = GENERATION_NUMBER_V1_MAX;
 			}
 		}
 	}
@@ -2404,8 +2404,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2469,11 +2469,11 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
 		 * extra logic in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
+		if (max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6..8be247fa35 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -144,12 +144,12 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e..20b48b872b 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 	const struct commit *a = *(const struct commit * const *)_a;
 	const struct commit *b = *(const struct commit * const *)_b;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
 
 	if (generation_a < generation_b)
 		return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.c b/commit.c
index f53429c0ac..3b488381d5 100644
--- a/commit.c
+++ b/commit.c
@@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a),
+			  generation_b = commit_graph_generation(b);
 
 	/* newer commits first */
 	if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index 5467786c7b..33c66b2177 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
diff --git a/revision.c b/revision.c
index c97abcdde1..2861f1c45c 100644
--- a/revision.c
+++ b/revision.c
@@ -3308,7 +3308,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3354,7 +3354,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3397,7 +3397,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3455,7 +3455,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (repo_parse_commit_gently(revs->repo, c, 1))
 			continue;
@@ -3516,7 +3516,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b858eb457..fdb82885b6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 05/10] commit-graph: add a slab to store topological levels
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (3 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-25 22:17         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                         ` (6 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a later commit we will introduce corrected commit date as the
generation number v2. This value will be stored in the new seperate
Generation Data chunk. However, to ensure backwards compatibility with
"Old" Git we need to continue to write generation number v1, which is
topological level, to the commit data chunk. This means that we need to
compute both versions of generation numbers when writing the
commit-graph file. Therefore, let's introduce a commit-slab to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

When Git creates a split commit-graph, it takes advantage of the
generation values that have been computed already and present in
existing commit-graph files.

So, let's add a pointer to struct commit_graph as well as struct
write_commit_graph_context to the topological level commit-slab
and populate it with topological levels while writing a commit-graph
file.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index bfc532de6f..cedd311024 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -962,6 +967,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1108,7 +1114,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1350,11 +1356,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_INFINITY &&
+		    level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1362,29 +1368,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_INFINITY ||
+				    level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_V1_MAX)
-					data->generation = GENERATION_NUMBER_V1_MAX;
+				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+					max_level = GENERATION_NUMBER_V1_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2142,6 +2146,7 @@ int write_commit_graph(struct object_directory *odb,
 	int res = 0;
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2163,6 +2168,18 @@ int write_commit_graph(struct object_directory *odb,
 							 bloom_settings.max_changed_paths);
 	ctx->bloom_settings = &bloom_settings;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index 8be247fa35..2e9aa7824e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (4 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-27 18:53         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                         ` (5 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if one of commits is malformed but
valid and has committerdate of 0 Unix time, as the offset would be the
same as corrected commit date and thus require 64-bits to be stored
properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 43 +++++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cedd311024..03948adfce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (level != GENERATION_NUMBER_INFINITY &&
-		    level != GENERATION_NUMBER_ZERO)
+		    level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO
+		    )
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
-
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 				if (level == GENERATION_NUMBER_INFINITY ||
-				    level == GENERATION_NUMBER_ZERO) {
+				    level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO
+				    ) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date && current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
@@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_V1_MAX)
-			max_generation--;
-
 		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		if (generation < max_generation + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 07/10] commit-graph: implement generation data chunk
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (5 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-10-30 12:45         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                         ` (4 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number was to distinguish
between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores corrected committer date offsets whereas CDAT will
still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

While storing corrected commit date offset instead of the corrected
commit date saves us 4 bytes per commit, it's possible for the offsets
to overflow the 4-bytes allocated. As such overflows are exceedingly
rare, we use the following overflow management scheme:

We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 98 +++++++++++++++++++++++++++++++++--
 commit-graph.h                |  3 ++
 commit.h                      |  1 +
 t/README                      |  3 ++
 t/helper/test-read-graph.c    |  4 ++
 t/t4216-log-bloom.sh          |  4 +-
 t/t5318-commit-graph.sh       | 70 ++++++++++++++++++++-----
 t/t5324-split-commit-graph.sh | 12 ++---
 t/t6600-test-reach.sh         | 68 +++++++++++++-----------
 9 files changed, 206 insertions(+), 57 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 03948adfce..71d0b243db 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
@@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+			if (graph->chunk_generation_data_overflow)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data_overflow = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
-	uint32_t lex_index;
-	uint64_t date_high, date_low;
+	uint32_t lex_index, offset_pos;
+	uint64_t date_high, date_low, offset;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data) {
+		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+		} else
+			graph_data->generation = item->date + offset;
+	} else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -942,6 +969,7 @@ struct write_commit_graph_context {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
 	int num_extra_edges;
+	int num_generation_data_overflows;
 	unsigned long approx_nr_objects;
 	struct progress *progress;
 	int progress_done;
@@ -960,7 +988,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
@@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i, num_generation_data_overflows = 0;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+			num_generation_data_overflows++;
+		}
+
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+						       struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			hashwrite_be32(f, offset >> 32);
+			hashwrite_be32(f, (uint32_t) offset);
+		}
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1399,7 +1466,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
+
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
@@ -1765,6 +1836,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
+	if (ctx->num_generation_data_overflows) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2170,6 +2256,8 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
+	ctx->num_generation_data_overflows = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824e..19a02001fd 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -68,6 +69,8 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
+	const unsigned char *chunk_generation_data_overflow;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 33c66b2177..251d877fcf 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
 #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
 #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
 
 struct commit_list {
 	struct commit *item;
diff --git a/t/README b/t/README
index 2adaf7c2d2..975c054bc9 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a1725..75927b2c81 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
+	if (graph->chunk_generation_data_overflow)
+		printf(" generation_data_overflow");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41..dbde016188 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 '
 
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544d..0328e98564 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
@@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+test_commit_with_date() {
+  file="$1.t" &&
+  echo "$1" >"$file" &&
+  git add "$file" &&
+  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
+  git tag "$1"
+}
+
+test_expect_success 'overflow corrected commit date offset' '
+	objdir=".git/objects" &&
+	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
+	FUTURE_DATE="@2147483646 +0000" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	cd "$TRASH_DIRECTORY" &&
+	mkdir repo &&
+	cd repo &&
+	git init &&
+	test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
+	test_commit 2 &&
+	test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
+	git commit-graph write --reachable &&
+	graph_read_expect 3 generation_data &&
+	test_commit_with_date 4 "$FUTURE_DATE" &&
+	test_commit 5 &&
+	test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
+	git branch left &&
+	git reset --hard 3 &&
+	test_commit 7 &&
+	test_commit_with_date 8 "$FUTURE_DATE" &&
+	test_commit 9 &&
+	git branch right &&
+	git reset --hard 3 &&
+	git merge left right &&
+	git commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data generation_data_overflow" &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'overflow corrected commit date offset' repo left right
+
 test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c334ee9155..651df89ab2 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 
 	oid_version sha1:1
 	oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337..e2d33a8a4c 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +123,7 @@ test_expect_success 'in_merge_bases_many:hit' '
 	X:commit-5-7
 	EOF
 	echo "in_merge_bases_many(A,X):1" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +133,7 @@ test_expect_success 'in_merge_bases_many:miss' '
 	X:commit-8-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +143,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
 	X:commit-6-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +154,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +165,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -174,7 +180,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -196,7 +202,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +225,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +247,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +270,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -280,8 +286,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -297,8 +303,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +316,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +328,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +340,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +352,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +364,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +374,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +388,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +408,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +426,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +440,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (6 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-11-01  0:55         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                         ` (3 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 29 +++++++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 71d0b243db..5d15a1399b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+	int read_generation_data;
+
+	if (!g)
+		return;
+
+	read_generation_data = !!g->chunk_generation_data;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(g);
+
 	return g;
 }
 
@@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data) {
+	if (g->chunk_generation_data && g->read_generation_data) {
 		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)
@@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
 		struct commit_graph *g = ctx->r->objects->commit_graph;
 
 		while (g) {
+			g->read_generation_data = 1;
 			g->topo_levels = &topo_levels;
 			g = g->base_graph;
 		}
@@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->opts)
 			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fd..ad52130883 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	unsigned int read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 651df89ab2..d0949a9eb8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
 	verify_chain_files_exist $graphdir
 '
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	test_oid_cache <<-EOM &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOM
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 $(test_oid oid_version) 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 $(test_oid oid_version) 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
+test_expect_success 'add one commit, write a tip graph' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (7 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-11-03 17:59         ` Jakub Narębski
  2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                         ` (2 subsequent siblings)
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

While using corrected commit dates Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

For example, the command `git merge-base v4.8 v4.9` on the linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

t6404-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form a intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form a intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  8 +++++++-
 commit-reach.c             |  2 +-
 t/t6404-recursive-merge.sh |  5 ++++-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5d15a1399b..3de1933ede 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -705,6 +705,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return g->read_generation_data;
+}
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
 	struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883..d2c048dc64 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 /*
  * Return 1 if and only if the repository has a commit-graph
  * file and generation numbers are computed in that file.
  */
 int generation_numbers_enabled(struct repository *r);
 
-struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
 
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
diff --git a/commit-reach.c b/commit-reach.c
index 20b48b872b..46f5a9e638 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index 332cfc53fd..7055771b62 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
 '
 
 test_expect_success 'virtual trees were processed' '
+	# TODO: fragile test, relies on ambigious merge-base resolution
 	git ls-files --stage >out &&
 
 	cat >expect <<-EOF &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v4 10/10] doc: add corrected commit date info
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (8 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09       ` Abhishek Kumar via GitGitGadget
  2020-11-04  1:37         ` Jakub Narębski
  2020-11-04 23:37       ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 21 +++++--
 Documentation/technical/commit-graph.txt      | 62 ++++++++++++++++---
 2 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b9..08d9026ad4 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,26 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit dates for commits
+      with corrected commit date offsets that cannot be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa..75f71c4c7b 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation nummber v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+  * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+  * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+  * As a special case, a root commit with timestamp zero has corrected commit
+    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+    (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
     generation numbers, then we always expand the boundary commit with highest
     generation number and can easily detect the stopping condition.
 
+The properties applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
+We use the macro GENERATION_NUMBER_MAX for commits whose
 generation numbers are computed to be at least this value. We limit at
 this value since it is the largest value that can be stored in the
 commit-graph file using the 30 bits available to generation numbers. This
@@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects each layer to
+see if any layer is missing corrected commit dates. In such a case, Git
+only uses topological level
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if and
+only if all existing layers below the new layer have corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
  2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-10-24 23:16         ` Jakub Narębski
  2020-10-25 20:58           ` Taylor Blau
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-24 23:16 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Garima Singh,
	Jeff King

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> commit_gen_cmp is used when writing a commit-graph to sort commits in
> generation order before computing Bloom filters. Since c49c82aa (commit:
> move members graph_pos, generation to a slab, 2020-06-17) made it so
> that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
> during writing, we cannot call it within this function. Instead, access
> the generation number directly through the slab (i.e., by calling
> 'commit_graph_data_at(c)->generation') in order to access it while
> writing.

This description is all right, but I think it can be made more clear:

  When running `git commit-graph write --reachable --changed-paths` to
  compute Bloom filters for changed paths, commits are first sorted by
  generation number using 'commit_gen_cmp()'.  Commits with similar
  generation are more likely to have many trees in common, making the
  diff faster, see 3d112755.

  However, since c49c82aa (commit: move members graph_pos, generation to
  a slab, 2020-06-17) made it so that 'commit_graph_generation()'
  returns 'GENERATION_NUMBER_INFINITY' during writing, we cannot call it
  within this function.  Instead, access the generation number directly
  through the slab (i.e., by calling 'commit_graph_data_at(c)->generation')
  in order to access it while writing.

Or something like that.

We should also add an explanation why avoiding getter is safe here,
perhaps adding the following line to the second paragraph:

  It is safe to do because 'commit_gen_cmp()' from commit-graph.c is
  static and used only when writing Bloom filters, and because writing
  changed-paths filters is done after computing generation numbers (if
  necessary).

Or something like that.

>
> While measuring performance with `git commit-graph write --reachable
> --changed-paths` on the linux repository led to around 1m40s for both
> HEAD and master (and could be due to fault in my measurements), it is
> still the "right" thing to do.

I had to read the above paragraph several times to understand it,
possibly because I have expected here to be a fix for a performance
regression.  The commit message for 3d112755 (commit-graph: examine
commits by generation number) describes reduction of computation time
from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
before those changes) to be around 3m, not the same before and after
changes being around 1m40s.

Can anyone recheck this before-and-after benchmark, please?

Anyway, it might be more clear to write it as the following:

  On the Linux kernel repository, this patch didn't reduce the
  computation time for 'git commit-graph write --reachable
  --changed-paths', which is around 1m40s both before and after this
  change.  This could be a fault in my measurements; it is still the
  "right" thing to do.

Or something like that.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

Anyway, it is nice and clear change.

>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cb042bdba8..94503e584b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>  
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 02/10] revision: parse parent in indegree_walk_step()
  2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-10-24 23:41         ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-10-24 23:41 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

All right, we need to ensure the parent commit is parsed to know its
generation number, to insert in into priority queue in a correct order.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Looks good.

> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index aa62212040..c97abcdde1 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3381,6 +3381,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>  
> +		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
> +			return;
> +
>  		if (*pi)
>  			(*pi)++;
>  		else

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
  2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-10-25 10:52         ` Jakub Narębski
  2020-10-27  6:33           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 10:52 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

Hi Abhishek,

In short: everything is all right, except for the now duplicated test
names in t5000 after this commit.

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
>
> fill_commit_graph_info() used to not load committer data from commit data
> chunk. However, with the corrected committer date, we have to load
> committer date to calculate generation number value.

Nice writeup, however the last sentence would in my opinion read better
in the future tense: we don't use generation number v2 yet.  For
example:

  However, with upcoming switch to using corrected committer date as
  generation number v2, we will have to load committer date to compute
  generation number value anyway.

Or something like that - notice the minor addition and changes.

The following is slightly unrelated change, but we agreed that it would
be better to not separate them; the need for change to the t5000 test is
caused by the change described above.

>
> e51217e15 (t5000: test tar files that overflow ustar headers,
> 30-06-2016) introduced a test 'generate tar with future mtime' that
> creates a commit with committer date of (2 ^ 36 + 1) seconds since
> EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
> committer time overflows into generation number (within CDAT chunk) and
> has undefined behavior.
>
> The test used to pass as fill_commit_graph_info() would not set struct
> member `date` of struct commit and loads committer date from the object
> database, generating a tar file with the expected mtime.

I think it should be s/loads/load/, as in "would load", but I am not a
native English speaker.

>
> However, with corrected commit date, we will load the committer date
> from CDAT chunk (truncated to lower 34-bits to populate the generation
> number. Thus, Git sets date and generates tar file with the truncated
> mtime.
>
> The ustar format (the header format used by most modern tar programs)
> only has room for 11 (or 12, depending om some implementations) octal
> digits for the size and mtime of each files.
>
> Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
> digit implementations while still fitting into commit data chunk.
>
> Since we want to test 12-octal digit implementations of ustar as well,
> let's modify the existing test to no longer use commit-graph file.

The description above is for me does not make it entirely clear that we
add new test for handling possible 11-octal digit overflow nearly
identical to the existing one, and turn off use of commit-graph file for
test that checks handling 12-octal digit overflow.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 27 ++++++++++-----------------
>  t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
>  2 files changed, 29 insertions(+), 18 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 94503e584b..e8362e144e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>  
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>  
> @@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));
> +	fill_commit_graph_info(item, g, pos);
>  
> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;
>  	lex_index = pos - g->num_commits_in_base;
> -
>  	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
>  
>  	item->object.parsed = 1;
>  
>  	set_commit_tree(item, NULL);
>  
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);

All right, looks good for me.

Here second change begins.

> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 3ebb0d3b65..8f41cdc509 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> +	rm -f .git/index &&
> +	echo foo >file &&
> +	git add file &&
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> +		git commit -m "tempori parendum"
> +'
> +
> +test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> +	git archive HEAD >future.tar
> +'
> +
> +test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> +	echo 2514 >expect &&
> +	tar_info future.tar | cut -d" " -f2 >actual &&
> +	test_cmp expect actual
> +'
> +

Everything is all right, except we now have duplicated test names.

Perhaps in the three following tests we should use 'far-far-future
commit' and 'far future mtime' in place of current 'far-future commit'
and 'future mtime' for tests checking handling 12-digital ditgits
overflow, or add description how far the future is, for example
'far-future commit (2^11 + 1)', etc.

>  test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
>  		git commit -m "tempori parendum"
>  '

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 04/10] commit-graph: return 64-bit generation number
  2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-10-25 13:48         ` Jakub Narębski
  2020-11-03  6:40           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 13:48 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a preparatory step, let's return timestamp_t values from
> commit_graph_generation(), use timestamp_t for local variables and
> define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.

I think it would be easier to understand if it was explicitely said what
this preparatory step prepares for, e.g.:

  In a preparatory step for introducing corrected commit dates as
  generation number, let's return timestamp_t values from...

Or even

  generation number, let's change the return type of
  commit_graph_generation() to timestamp_t, and use ...

Otherwise it looks good.

>
> We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
> represent the largest topological level we can store in the commit data
> chunk.
>
> With corrected commit dates implemented, we will have two such *_MAX
> variables to denote the largest offset and largest topological level
> that can be stored.

All right, nice explanation.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Note that there are two changes that are not mentioned in the commit
message, namely adding 'const'-ness to generation_a/b local variables in
commit_gen_cmp() from commit-graph.c, and switching from
GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY as the default
(initial) value for 'max_generation' in repo_in_merge_bases_many().

While the former is a simple "while-at-it" change that shouldn't affect
correctness, the latter needs an explanation (or fixing if it is wrong).

> ---
>  commit-graph.c | 22 +++++++++++-----------
>  commit-graph.h |  4 ++--
>  commit-reach.c | 36 ++++++++++++++++++------------------
>  commit-reach.h |  2 +-
>  commit.c       |  4 ++--
>  commit.h       |  4 ++--
>  revision.c     | 10 +++++-----
>  upload-pack.c  |  2 +-
>  8 files changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e8362e144e..bfc532de6f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
>  	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
>  }
>  
> -uint32_t commit_graph_generation(const struct commit *c)
> +timestamp_t commit_graph_generation(const struct commit *c)

All right.

>  {
>  	struct commit_graph_data *data =
>  		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>  
> -	uint32_t generation_a = commit_graph_data_at(a)->generation;
> -	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
> +	const timestamp_t generation_b = commit_graph_data_at(b)->generation;

All right... but this also adds 'const' qualifier.  I understand that
you don't want to create separate commit for this "while at it"
change...

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> @@ -1350,7 +1350,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;

All right.

>  
>  		display_progress(ctx->progress, i + 1);
>  		if (generation != GENERATION_NUMBER_INFINITY &&
> @@ -1383,8 +1383,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				data->generation = max_generation + 1;
>  				pop_commit(&list);
>  
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				if (data->generation > GENERATION_NUMBER_V1_MAX)
> +					data->generation = GENERATION_NUMBER_V1_MAX;

All right, this is the other mentioned change.

>  			}
>  		}
>  	}
> @@ -2404,8 +2404,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		uint32_t max_generation = 0;
> -		uint32_t generation;
> +		timestamp_t max_generation = 0;
> +		timestamp_t generation;

All right.

>  
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> @@ -2469,11 +2469,11 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  			continue;
>  
>  		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> +		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> +		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>  		 * extra logic in the following condition.
>  		 */
> -		if (max_generation == GENERATION_NUMBER_MAX)
> +		if (max_generation == GENERATION_NUMBER_V1_MAX)
>  			max_generation--;

All right.  Nice fixing a comment too.

>  
>  		generation = commit_graph_generation(graph_commit);
> diff --git a/commit-graph.h b/commit-graph.h
> index f8e92500c6..8be247fa35 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -144,12 +144,12 @@ void disable_commit_graph(struct repository *r);
>  
>  struct commit_graph_data {
>  	uint32_t graph_pos;
> -	uint32_t generation;
> +	timestamp_t generation;
>  };

All right.

>  
>  /*
>   * Commits should be parsed before accessing generation, graph positions.
>   */
> -uint32_t commit_graph_generation(const struct commit *);
> +timestamp_t commit_graph_generation(const struct commit *);
>  uint32_t commit_graph_position(const struct commit *);
>  #endif

All right.

> diff --git a/commit-reach.c b/commit-reach.c
> index 50175b159e..20b48b872b 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  static struct commit_list *paint_down_to_common(struct repository *r,
>  						struct commit *one, int n,
>  						struct commit **twos,
> -						int min_generation)
> +						timestamp_t min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> -	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
> +	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;

All right.

>  
>  	if (!min_generation)
>  		queue.compare = compare_commits_by_commit_date;
> @@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  		struct commit *commit = prio_queue_get(&queue);
>  		struct commit_list *parents;
>  		int flags;
> -		uint32_t generation = commit_graph_generation(commit);
> +		timestamp_t generation = commit_graph_generation(commit);

All right.

>  
>  		if (min_generation && generation > last_gen)
> -			BUG("bad generation skip %8x > %8x at %s",
> +			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",

All right; nice of you noticing this issue.

>  			    generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));
>  		last_gen = generation;
> @@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  		repo_parse_commit(r, array[i]);
>  	for (i = 0; i < cnt; i++) {
>  		struct commit_list *common;
> -		uint32_t min_generation = commit_graph_generation(array[i]);
> +		timestamp_t min_generation = commit_graph_generation(array[i]);
>  
>  		if (redundant[i])
>  			continue;
>  		for (j = filled = 0; j < cnt; j++) {
> -			uint32_t curr_generation;
> +			timestamp_t curr_generation;
>  			if (i == j || redundant[j])
>  				continue;
>  			filled_index[filled] = j;

All right.

> @@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> -	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
> +	timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;

The change of type from uint32_t to timestamp_t is expected, but the
change from GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY is not.

This might be caused by the fact that repo_in_merge_bases_many()
switched from using min_generation and GENERATION_NUMBER_INFINITY to
using max_generation and GENERATION_NUMBER_ZERO. Or the reverse: I see
one version on https://github.com/git/git, and other version in 'master'
pulled from https://github.com/git-for-windows/git

Certainly max_generation should be paired with GENERATION_NUMBER_ZERO,
and min_generation with GENERATION_NUMBER_INFINITY.

>  
>  	if (repo_parse_commit(r, commit))
>  		return ret;
> @@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
>  					  struct contains_cache *cache,
> -					  uint32_t cutoff)
> +					  timestamp_t cutoff)

All right.

Sidenote: this parameter should probably be named gen_cutoff, for
consistency and better readability (but that was the existing state),
but this would also mean more changes.


>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>  
> @@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
>  	enum contains_result result;
> -	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;

Sidenote: this variable should probably be named gen_cutoff, for
consistency and better readability (but that was the existing state).
However changing it would pollute this commit with unrelated changes;
it is not that big of an isseu that it *requires* fixing.

>  	const struct commit_list *p;
>  
>  	for (p = want; p; p = p->next) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = p->item;
>  		load_commit_graph_info(the_repository, c);
>  		generation = commit_graph_generation(c);

All right.

> @@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
>  	const struct commit *a = *(const struct commit * const *)_a;
>  	const struct commit *b = *(const struct commit * const *)_b;
>  
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	timestamp_t generation_a = commit_graph_generation(a);
> +	timestamp_t generation_b = commit_graph_generation(b);

All right.

>  
>  	if (generation_a < generation_b)
>  		return -1;
> @@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation)
> +				 timestamp_t min_generation)
>  {
>  	struct commit **list = NULL;
>  	int i;

All right.

> @@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
>  	struct commit_list *from_iter = from, *to_iter = to;
>  	int result;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>  
>  	while (from_iter) {
>  		add_object_array(&from_iter->item->object, NULL, &from_objs);
>  
>  		if (!parse_commit(from_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (from_iter->item->date < min_commit_date)
>  				min_commit_date = from_iter->item->date;
>

All right.

> @@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  
>  	while (to_iter) {
>  		if (!parse_commit(to_iter->item)) {
> -			uint32_t generation;
> +			timestamp_t generation;
>  			if (to_iter->item->date < min_commit_date)
>  				min_commit_date = to_iter->item->date;
>

All right.

> @@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>  	struct commit_list *found_commits = NULL;
>  	struct commit **to_last = to + nr_to;
>  	struct commit **from_last = from + nr_from;
> -	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>  	int num_to_find = 0;
>  
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  
>  	for (item = to; item < to_last; item++) {
> -		uint32_t generation;
> +		timestamp_t generation;
>  		struct commit *c = *item;
>  
>  		parse_commit(c);

All right.

> diff --git a/commit-reach.h b/commit-reach.h
> index b49ad71a31..148b56fea5 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  				 unsigned int with_flag,
>  				 unsigned int assign_flag,
>  				 time_t min_commit_date,
> -				 uint32_t min_generation);
> +				 timestamp_t min_generation);
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  		       int commit_date_cutoff);
>

All right.

> diff --git a/commit.c b/commit.c
> index f53429c0ac..3b488381d5 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
>  int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> -	const uint32_t generation_a = commit_graph_generation(a),
> -		       generation_b = commit_graph_generation(b);
> +	const timestamp_t generation_a = commit_graph_generation(a),
> +			  generation_b = commit_graph_generation(b);
>

All right (assuming that the indent after change looks all right; but
even if it doesn't t would be a very minor issue).

>  	/* newer commits first */
>  	if (generation_a < generation_b)
> diff --git a/commit.h b/commit.h
> index 5467786c7b..33c66b2177 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -11,8 +11,8 @@
>  #include "commit-slab.h"
>  
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> -#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> -#define GENERATION_NUMBER_MAX 0x3FFFFFFF
> +#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
> +#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
>  #define GENERATION_NUMBER_ZERO 0
>

All right, we redefine GENERATION_NUMBER_INFINITY and rename
GENERATION_NUMBER_MAX.

>  struct commit_list {
> diff --git a/revision.c b/revision.c
> index c97abcdde1..2861f1c45c 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3308,7 +3308,7 @@ define_commit_slab(indegree_slab, int);
>  define_commit_slab(author_date_slab, timestamp_t);
>  
>  struct topo_walk_info {
> -	uint32_t min_generation;
> +	timestamp_t min_generation;
>  	struct prio_queue explore_queue;
>  	struct prio_queue indegree_queue;
>  	struct prio_queue topo_queue;

All right.

> @@ -3354,7 +3354,7 @@ static void explore_walk_step(struct rev_info *revs)
>  }
>  
>  static void explore_to_depth(struct rev_info *revs,
> -			     uint32_t gen_cutoff)
> +			     timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;

All right.

> @@ -3397,7 +3397,7 @@ static void indegree_walk_step(struct rev_info *revs)
>  }
>  
>  static void compute_indegrees_to_depth(struct rev_info *revs,
> -				       uint32_t gen_cutoff)
> +				       timestamp_t gen_cutoff)
>  {
>  	struct topo_walk_info *info = revs->topo_walk_info;
>  	struct commit *c;

All right.

> @@ -3455,7 +3455,7 @@ static void init_topo_walk(struct rev_info *revs)
>  	info->min_generation = GENERATION_NUMBER_INFINITY;
>  	for (list = revs->commits; list; list = list->next) {
>  		struct commit *c = list->item;
> -		uint32_t generation;
> +		timestamp_t generation;
>  
>  		if (repo_parse_commit_gently(revs->repo, c, 1))
>  			continue;

All right.

> @@ -3516,7 +3516,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  	for (p = commit->parents; p; p = p->next) {
>  		struct commit *parent = p->item;
>  		int *pi;
> -		uint32_t generation;
> +		timestamp_t generation;
>  
>  		if (parent->object.flags & UNINTERESTING)
>  			continue;

All right.

> diff --git a/upload-pack.c b/upload-pack.c
> index 3b858eb457..fdb82885b6 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
>  
>  static int ok_to_give_up(struct upload_pack_data *data)
>  {
> -	uint32_t min_generation = GENERATION_NUMBER_ZERO;
> +	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
>  
>  	if (!data->have_obj.nr)
>  		return 0;

All right.

The only thing to check is if you have changed the type in all the
places that need it. My cursory examination shows that those are all
places than need fixing.

Note that the 'generation' variable in git-name-rev, git-fsck and in
git-show-branch (snd sha1-name.c) means something different.

Also, 'first_generation' variable in generation_numbers_enabled() (part
of commit-graph.c) examines and will examine generation number v1 i.e.
topological levels, and do not need type change... though it may require
name change in some time in the future; the generation number
computation path also does not require change type, though variables
would be renamed in the future commit.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
  2020-10-24 23:16         ` Jakub Narębski
@ 2020-10-25 20:58           ` Taylor Blau
  2020-11-03  5:36             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-10-25 20:58 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee, Taylor Blau,
	Abhishek Kumar, Garima Singh, Jeff King

On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > While measuring performance with `git commit-graph write --reachable
> > --changed-paths` on the linux repository led to around 1m40s for both
> > HEAD and master (and could be due to fault in my measurements), it is
> > still the "right" thing to do.
>
> I had to read the above paragraph several times to understand it,
> possibly because I have expected here to be a fix for a performance
> regression.  The commit message for 3d112755 (commit-graph: examine
> commits by generation number) describes reduction of computation time
> from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
> before those changes) to be around 3m, not the same before and after
> changes being around 1m40s.
>
> Can anyone recheck this before-and-after benchmark, please?

My hunch is that our heuristic to fall back to the commits 'date'
value is saving us here. commit_gen_cmp() first compares the generation
numbers, breaking ties by 'date' as a heuristic. But since all
generation number queries return GENERATION_NUMBER_INFINITY during
writing, we're relying on our heuristic entirely.

I haven't looked much further than that, other than to see that I could
get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
computing Bloom filters region on the kernel.

> Anyway, it might be more clear to write it as the following:
>
>   On the Linux kernel repository, this patch didn't reduce the
>   computation time for 'git commit-graph write --reachable
>   --changed-paths', which is around 1m40s both before and after this
>   change.  This could be a fault in my measurements; it is still the
>   "right" thing to do.
>
> Or something like that.

Assuming that we are in fact being saved by the "date" heuristic, I'd
probably write the following commit message instead:

  Before computing Bloom filters, the commit-graph machinery uses
  commit_gen_cmp to sort commits by generation order for improved diff
  performance. 3d11275505 (commit-graph: examine commits by generation
  number, 2020-03-30) claims that this sort can reduce the time spent to
  compute Bloom filters by nearly half.

  But since c49c82aa4c (commit: move members graph_pos, generation to a
  slab, 2020-06-17), this optimization is broken, since asking for
  'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
  while writing.

  Not all hope is lost, though: 'commit_graph_generation()' falls
  back to comparing commits by their date when they have equal generation
  number, and so since c49c82aa4c is purely a date comparison function.
  This heuristic is good enough that we don't seem to loose appreciable
  performance while computing Bloom filters. [Benchmark that we loose
  about ~4sec before/after c49c82aa4c9...]

  So, avoid the uesless 'commit_graph_generation()' while writing by
  instead accessing the slab directly. This returns the newly-computed
  generation numbers, and allows us to avoid the heuristic by directly
  comparing generation numbers.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 05/10] commit-graph: add a slab to store topological levels
  2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-10-25 22:17         ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 22:17 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a later commit we will introduce corrected commit date as the
> generation number v2. This value will be stored in the new seperate
> Generation Data chunk. However, to ensure backwards compatibility with
> "Old" Git we need to continue to write generation number v1, which is
> topological level, to the commit data chunk. This means that we need to
> compute both versions of generation numbers when writing the
> commit-graph file. Therefore, let's introduce a commit-slab to store
> topological levels; corrected commit date will be stored in the member
> `generation` of struct commit_graph_data.
>
> When Git creates a split commit-graph, it takes advantage of the
> generation values that have been computed already and present in
> existing commit-graph files.
>
> So, let's add a pointer to struct commit_graph as well as struct
> write_commit_graph_context to the topological level commit-slab
> and populate it with topological levels while writing a commit-graph
> file.

I think you meant here "add a pointer in `struct commit_graph` as well
as in `struct write_commit_graph_context`...".

Perhaps we should add the information that it is done that way to be
able to allocate topo_level_slab only when needed, in the
write_commit_graph(), and adding new member to those struct is required
to pass it through the call chain (modifying `struct commit_graph` is
needed for fill_commit_graph_info()).  But that might be too much detail
to put in the commit message.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
>  commit-graph.h |  1 +
>  2 files changed, 33 insertions(+), 15 deletions(-)
>

Let me reorder those files for easier review.

> diff --git a/commit-graph.h b/commit-graph.h
> index 8be247fa35..2e9aa7824e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -73,6 +73,7 @@ struct commit_graph {
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
>  
> +	struct topo_level_slab *topo_levels;
>  	struct bloom_filter_settings *bloom_filter_settings;
>  };

All right, here we add new member to `struct commit_graph` type.

> diff --git a/commit-graph.c b/commit-graph.c
> index bfc532de6f..cedd311024 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -962,6 +967,7 @@ struct write_commit_graph_context {
>  		 changed_paths:1,
>  		 order_by_pack:1;
>  
> +	struct topo_level_slab *topo_levels;
>  	const struct commit_graph_opts *opts;
>  	size_t total_bloom_filter_data_size;
>  	const struct bloom_filter_settings *bloom_settings;

All right, here we add new member to `struct write_commit_graph_context`
type, which is local to commit-graph.c.

> @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>  
> +define_commit_slab(topo_level_slab, uint32_t);
> +

All right, here we define new slab for storing topological levels; this
just defines new type. Note that we do not define any setters and
getters to handle non-zero initialization, like we have for
commit_graph_data_slab.

>  /* Keep track of the order in which commits are added to our list. */
>  define_commit_slab(commit_pos, int);
>  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> @@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
> +	if (g->topo_levels)
> +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;

I guess using get_be32() is repeated in this newly added part of code
because previous part would be changed to read in generation number v2,
if available, and we won't be then able to use

		*topo_level_slab_at(g->topo_levels, item) = graph_data->generation;

All right, that's smart.


I guess that in fill_commit_graph_info() we don't know if we are reading
commit-graph, when topo levels slab is not present, or whether we are
extending and writing the commit-graph file, when we need to fill it
with current commit-graph data.

The fact that fill_commit_graph_info() takes 'struct commit_graph' also
explains why we need to add pointer to a topo_levels slab to both
structs.

>  }
>  
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
[...]
> @@ -2142,6 +2146,7 @@ int write_commit_graph(struct object_directory *odb,
>  	int res = 0;
>  	int replace = 0;
>  	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +	struct topo_level_slab topo_levels;
>  
>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2163,6 +2168,18 @@ int write_commit_graph(struct object_directory *odb,
>  							 bloom_settings.max_changed_paths);
>  	ctx->bloom_settings = &bloom_settings;
>  
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
> +	if (ctx->r->objects->commit_graph) {
> +		struct commit_graph *g = ctx->r->objects->commit_graph;
> +
> +		while (g) {
> +			g->topo_levels = &topo_levels;
> +			g = g->base_graph;
> +		}
> +	}
> +
>  	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
>  		ctx->changed_paths = 1;
>  	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {

All right, we need topo_level_slab only for writing the commit-graph, so
we allocate it with init_*_slab() in write_commit_graph(), and set
pointers to it in `struct write_commit_graph_context *ctx` and in
`struct commit_graph` for each layer in the commit graph.  This is
needed to pass it down the call-chain.

Looks good to me.

> @@ -1108,7 +1114,7 @@ static int write_graph_chunk_data(struct hashfile *f,
>  		else
>  			packedDate[0] = 0;
>  
> -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
>

All right, write_graph_chunk_data() is called from write_commit_graph(),
so we know that cxt->topo_levels is not NULL.

>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
> @@ -1350,11 +1356,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>

All right, we know that compute_generation_numbers() is called by the
write_commit_graph(), so we know that cxt->topo_levels is not NULL.

Also, we rename 'generation' to 'level' in preparation for the time when
we would be computing *both* topological level (for backward
compatibility) and corrected committer date (to be used as generation
number v2).  All right.

>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (level != GENERATION_NUMBER_INFINITY &&
> +		    level != GENERATION_NUMBER_ZERO)
>  			continue;

Same here, the results of renaming of 'generation' local variable to
'level'.

>  
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1362,29 +1368,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;

Similarly, we rename 'max_generation' to 'max_level'.

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>  
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_INFINITY ||
> +				    level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else if (level > max_level) {
> +					max_level = level;
>  				}
>  			}

Continuation of those renames.

>  
>  			if (all_parents_computed) {
> -				struct commit_graph_data *data = commit_graph_data_at(current);
> -
> -				data->generation = max_generation + 1;
>  				pop_commit(&list);
>  
> -				if (data->generation > GENERATION_NUMBER_V1_MAX)
> -					data->generation = GENERATION_NUMBER_V1_MAX;
> +				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> +					max_level = GENERATION_NUMBER_V1_MAX - 1;
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;

This is a bit safer way to handle possible overflow: instead of

  final = max_found + 1;            /* set to maximum plus 1 */
  if (final > MAX_POSSIBLE_VALUE)   /* handle overflow */
      final = MAX_POSSIBLE_VALUE;

where we can have problems if MAX_POSSIBLE_VALUE overflows, we use the
following pattern:

  if (max_found > MAX_POSSIBLE_VALUE - 1)  /* handle overflow */
      max_found > MAX_POSSIBLE_VALUE - 1;
  final = max_found + 1;                   /* set to maximum plus 1 */

It is just a bit obscured by renaming variable and switch to using
commit slab.

It is not that important for topological level, where
GENERATION_NUMBER_V1_MAX is smaller than maximum possible value, but it
would be important for generation number v2.

>  			}
>  		}
>  	}

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
  2020-10-25 10:52         ` Jakub Narębski
@ 2020-10-27  6:33           ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-10-27  6:33 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee, me

Hello Dr. Narębski,

On Sun, Oct 25, 2020 at 11:52:42AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> In short: everything is all right, except for the now duplicated test
> names in t5000 after this commit.
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > fill_commit_graph_info() used to not load committer data from commit data
> > chunk. However, with the corrected committer date, we have to load
> > committer date to calculate generation number value.
> 
> Nice writeup, however the last sentence would in my opinion read better
> in the future tense: we don't use generation number v2 yet.  For
> example:
> 
>   However, with upcoming switch to using corrected committer date as
>   generation number v2, we will have to load committer date to compute
>   generation number value anyway.
> 
> Or something like that - notice the minor addition and changes.
> 

Thanks for the change, it looks better!

> The following is slightly unrelated change, but we agreed that it would
> be better to not separate them; the need for change to the t5000 test is
> caused by the change described above.

> 
> >
> > e51217e15 (t5000: test tar files that overflow ustar headers,
> > 30-06-2016) introduced a test 'generate tar with future mtime' that
> > creates a commit with committer date of (2 ^ 36 + 1) seconds since
> > EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
> > committer time overflows into generation number (within CDAT chunk) and
> > has undefined behavior.
> >
> > The test used to pass as fill_commit_graph_info() would not set struct
> > member `date` of struct commit and loads committer date from the object
> > database, generating a tar file with the expected mtime.
> 
> I think it should be s/loads/load/, as in "would load", but I am not a
> native English speaker.
> 

That's correct - since I have used "would not set" in the first half of
sentence, the later half should follow suit too.

> >
> > However, with corrected commit date, we will load the committer date
> > from CDAT chunk (truncated to lower 34-bits to populate the generation
> > number. Thus, Git sets date and generates tar file with the truncated
> > mtime.
> >
> > The ustar format (the header format used by most modern tar programs)
> > only has room for 11 (or 12, depending om some implementations) octal
> > digits for the size and mtime of each files.
> >
> > Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
> > digit implementations while still fitting into commit data chunk.
> >
> > Since we want to test 12-octal digit implementations of ustar as well,
> > let's modify the existing test to no longer use commit-graph file.
> 
> The description above is for me does not make it entirely clear that we
> add new test for handling possible 11-octal digit overflow nearly
> identical to the existing one, and turn off use of commit-graph file for
> test that checks handling 12-octal digit overflow.
> 

Revised the last paragraphs to:

  The ustar format (the header format used by most modern tar programs)
  only has room for 11 (or 12, depending on some implementations) octal
  digits for the size and mtime of each file.

  To test the 11-octal digit implementation, we create a future commit
  with committer date of 2^34 - 1, which overflows 11-octal digits
  without overflowing 34-bits of the Commit Data chunk.

  To test the 12-octal digit implementation, the smallest committer date
  possible is 2^36, which overflows the Commit Data chunk and thus
  commit-graph must be disabled for the test.

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 27 ++++++++++-----------------
> >  t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
> >  2 files changed, 29 insertions(+), 18 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 94503e584b..e8362e144e 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >  
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >  
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >  
> > @@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >  
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +	fill_commit_graph_info(item, g, pos);
> >  
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> >  	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> >  
> >  	item->object.parsed = 1;
> >  
> >  	set_commit_tree(item, NULL);
> >  
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >  
> >  	edge_value = get_be32(commit_data + g->hash_len);
> 
> All right, looks good for me.
> 
> Here second change begins.
> 
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 3ebb0d3b65..8f41cdc509 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
> >  	test_cmp expect actual
> >  '
> >  
> > +test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> > +	rm -f .git/index &&
> > +	echo foo >file &&
> > +	git add file &&
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> > +		git commit -m "tempori parendum"
> > +'
> > +
> > +test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> > +	git archive HEAD >future.tar
> > +'
> > +
> > +test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > +	echo 2514 >expect &&
> > +	tar_info future.tar | cut -d" " -f2 >actual &&
> > +	test_cmp expect actual
> > +'
> > +
> 
> Everything is all right, except we now have duplicated test names.
> 
> Perhaps in the three following tests we should use 'far-far-future
> commit' and 'far future mtime' in place of current 'far-future commit'
> and 'future mtime' for tests checking handling 12-digital ditgits
> overflow, or add description how far the future is, for example
> 'far-future commit (2^11 + 1)', etc.
> 

Changed, thanks for pointing this out.

> >  test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-10-27 18:53         ` Jakub Narębski
  2020-11-03 11:44           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-27 18:53 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

All right.  We might want to say that it fulfills the same reachability
criteria as topological level, but perhaps this level of detail is not
necessary here.

> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).

I'm not sure if this special case is really necessary, but it makes for
cleaner reasoning.

> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
>
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
>
> However, using offsets be problematic if one of commits is malformed but
> valid and has committerdate of 0 Unix time, as the offset would be the
> same as corrected commit date and thus require 64-bits to be stored
> properly.
>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

All right.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Somewhere in the commit message we should also describe that this commit
changes how commit-graph is verified: from checking that the generation
number agrees with _topological level definition_, that is that for a
given commit it is 1 more than maximum of its parents (with the caveat
that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
checking that slightly weaker condition fulfilled by both topological
levels (generation number v1) and by corrected commit date (generation
number v2) that for a given commit its generation number is 1 more than
maximum of its parents or larger.

But, as far as I understand it, current code does not handle correctly
GENERATION_NUMBER_V1_MAX case (if we use generation number v1).

On the other hand we could have simpy use functional check, that
generation number used (which can be v1 or v2, or any similar other)
fulfills the reachability condition for each edge, which can be
simplified to checking that generation(parents) <= generation(commit).
If the reachability condition is true for each edge, then it is true for
each path, and for each commit.

> ---
>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>  1 file changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cedd311024..03948adfce 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>  
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

Why this change?  It is not described in the commit message.

Note that while this tie-breaking fallback doesn't make much sense for
corrected committer date generation number v2, this tie-breaking helps
if we have to use topological levels (generation number v2).

>  	return 0;
>  }
>  
> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
enough; no need for 'timestamp_t' for 'level' variable.

> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>

All right, we compute both generation numbers: topological levels and
corrected commit date.

I guess we use 'corrected_commit_date' instead of simply 'generation' to
make it asier to remember which is which.

>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO

Straightforward addition.

> +		    )

Why this closing parenthesis is now in separated line?

>  			continue;
>  
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;

All right, straightforward addition.

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> -

Why we have removed this empty line?

> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;

All right.

>  				if (level == GENERATION_NUMBER_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {

All right, same as above.

>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}

All right, reasonable and straightforward.

>  			}
>  
> @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date && current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;

All right.

Here we use the same trick as in previous commit (and as above) to avoid
any possible overflow, to minimize number of conditionals.  The fact
that max_corrected_commit_date might store incorrect value doesn't
matter, as it is reset at beginning of this loop.

>  			}
>  		}
>  	}
> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>  
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
> -			max_generation--;
> -

Perhaps in the future we should check that both topological levels, and
also corrected committer date (if it exists) for correctness according
to their definition.  Then the above removed part would be restored (but
with s/max_generation/max_level/).

>  		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		if (generation < max_generation + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

All right, so we relaxed the check so that it will be fulfilled by
generation number v2 (and also by generation number v1, as it implies
the more strict check for v1).

What would happen however if generation holds topological levels, and it
is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
so the `git commit-graph verify` would incorrectly say that there is
a problem with generation number, while there isn't one (false positive
detection of error).

Sidenote: I think we don't have to worry about having to introduce
GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
disck representation) corrected commiter date is the same as of commiter
date itself, plus some, and I don't see us coming close to 64-bit limit
of timestamp_t for commit dates.

>  				     oid_to_hex(&cur_oid),
>  				     generation,
>  				     max_generation + 1);

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
  2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-10-30 12:45         ` Jakub Narębski
  2020-11-06 11:25           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-30 12:45 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

Tl;dr summary: the code writing GDOV chunk could be made more performant
(I think), but that could be left for the future commit.

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.

Minor nitpick: I think you meant "implementing generation number v2",
to be more precise.

>
> We are going to introduce a new chunk called Generation Data chunk (or

Very minor nitpick: perhaps s/Generation Data/Generation DATa/, to provide
mnemonics for chunk name.

> GDAT). GDAT stores corrected committer date offsets whereas CDAT will
> still store topological level.

Minor nitpick: I think the second sentence should use consistent
grammatical tense (but I am not a native English speaker); also
s/level/levels/:

    GDAT will store corrected committer date offsets, whereas CDAT will
    still store topological levels.

But it is perfectly understandable as it is.

>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).

Minor nitpick: I think we use commit-graph with dash when writing about
the commit-graph file, like below.

>
> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.

All right.

>
> While storing corrected commit date offset instead of the corrected
> commit date saves us 4 bytes per commit, it's possible for the offsets
> to overflow the 4-bytes allocated. As such overflows are exceedingly
> rare, we use the following overflow management scheme:

Perhaps it would be good idea to write the idea in full from start, as
the commit message is intended to be read stadalone and not in the
context of the patch series.  On the other hand it might be too much
detail in already [necessarily] lengthty commit message.

Perhaps something like the following proposal would read better.

  To minimize the space required to store corrected commit date, Git
  stores corrected commit date offsets into the commit-graph file,
  instead of corrected commit dates themselves. This saves us 4 bytes
  per commit, decreasing the GDAT chunk size by half, but it's possible
  for the offset to overflow the 4-bytes allocated for storage. As such
  overflows are and should be exceedingly rare, we use the following
  overflow management scheme:


NOTE: this overflow handling is a *new* code (or new-ish code, as it is
inspired and similar to EDGE chunk data handling), so it needs more
careful review.

>
> We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')

Minor issue: why GENERATION_DATA_OVERFLOW and not Generation Data
OVerflow, like for the GDAT chunk?

> to store corrected commit dates for commits with offsets greater than
> GENERATION_NUMBER_V2_OFFSET_MAX.
>
> If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
> the MSB of the offset and the other bits store the position of corrected
> commit date in GDOV chunk, similar to how Extra Edge List is maintained.
>
> We test the overflow-related code with the following repo history:
>
>            F - N - U
>           /         \
> U - N - U            N
>          \          /
>            N - F - N

Do we need such complex history? I guess we need to test the handling of
merge commits too.

>
> Where the commits denoted by U have committer date of zero seconds
> since Unix epoch, the commits denoted by N have committer date of
> 1112354055 (default committer date for the test suite) seconds since
> Unix epoch and the commits denoted by F have committer date of
> (2 ^ 31 - 2) seconds since Unix epoch.
>
> The largest offset observed is 2 ^ 31, just large enough to overflow.
>
> [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 98 +++++++++++++++++++++++++++++++++--
>  commit-graph.h                |  3 ++
>  commit.h                      |  1 +
>  t/README                      |  3 ++
>  t/helper/test-read-graph.c    |  4 ++
>  t/t4216-log-bloom.sh          |  4 +-
>  t/t5318-commit-graph.sh       | 70 ++++++++++++++++++++-----
>  t/t5324-split-commit-graph.sh | 12 ++---
>  t/t6600-test-reach.sh         | 68 +++++++++++++-----------
>  9 files changed, 206 insertions(+), 57 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 03948adfce..71d0b243db 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 9

All right.

>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
>  			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
>  
> +#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
> +

All right, though the naming convention is different from the one used
for EDGE chunk: GRAPH_EXTRA_EDGES_NEEDED and GRAPH_EDGE_LAST_MASK.

>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>  
> @@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>  
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +
> +		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
> +			if (graph->chunk_generation_data_overflow)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data_overflow = data + chunk_offset;
> +			break;
> +

Necessary but unavoidable boilerplate for adding new chunks to the
commit-graph file format.  All right.

>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  {
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
> -	uint32_t lex_index;
> -	uint64_t date_high, date_low;
> +	uint32_t lex_index, offset_pos;
> +	uint64_t date_high, date_low, offset;

All right, we are adding two new variables: `offset` to read data stored
in GDAT chunk, and `offset_pos` to help read data from GDOV chunk if
necessary i.e. to handle overflow in corrected commit data offset
storage.

>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
> @@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data) {
> +		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

Style: why space after the `(timestamp_t)` cast operator?

Though CodingGuidelines do not say anything on this topic... perhaps the
space after cast operator makes it more readable?

> +
> +		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {

All right, so the CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW is equivalent of
GRAPH_EXTRA_EDGES_NEEDED.

> +			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;

Hmmm... instead of using bitwise and on an equivalent to the
GRAPH_EDGE_LAST_MASK, we utilize the fact that we know that the MSB bit
is set, so we can clear it with bitwise xor.  Clever trick.

> +			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> +		} else
> +			graph_data->generation = item->date + offset;

All right, this handles the case when we have generation number v2, with
or without overflow.

> +	} else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

All right, this handles the case where we have only generation number
v1, like for commit-graph file written by old Git.

>  
>  	if (g->topo_levels)
>  		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> @@ -942,6 +969,7 @@ struct write_commit_graph_context {
>  	struct packed_oid_list oids;
>  	struct packed_commit_list commits;
>  	int num_extra_edges;
> +	int num_generation_data_overflows;
>  	unsigned long approx_nr_objects;
>  	struct progress *progress;
>  	int progress_done;
> @@ -960,7 +988,8 @@ struct write_commit_graph_context {
>  		 report_progress:1,
>  		 split:1,
>  		 changed_paths:1,
> -		 order_by_pack:1;
> +		 order_by_pack:1,
> +		 write_generation_data:1;
>  
>  	struct topo_level_slab *topo_levels;
>  	const struct commit_graph_opts *opts;

All right, this adds necessary fields to `struct write_commit_graph_context`.

> @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
>  	return 0;
>  }
>  
> +static int write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	int i, num_generation_data_overflows = 0;

Minor nitpick: in my opinion there should be empty line here, between
the variables declaration and the code... however not all
write_graph_chunk_*() functions have it.

> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		display_progress(ctx->progress, ++ctx->progress_cnt);

All right.

> +
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> +			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> +			num_generation_data_overflows++;
> +		}

Hmmm... shouldn't we store these commits that need overflow handling
(with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
in a list or a queue, to remember them for writing GDOV chunk?

We could store oids, or we could store commits themselves, for example:

		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
			num_generation_data_overflows++;

			ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
			ctx->commits.list[ctx->gdov_commits.nr] = c;
            ctx->gdov_commits.nr++;
		}

Though in the above proposal we could get rid of `num_generation_data_overflows`, 
as it should be the same as `ctx->gdov_commits.nr`.

I have called the extra commit list member of write_commit_graph_context
`gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`, 
or similar more descriptive name.

> +
> +		hashwrite_be32(f, offset);
> +	}
> +
> +	return 0;
> +}

All right.

> +
> +static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
> +						       struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	for (i = 0; i < ctx->commits.nr; i++) {

Here we loop over *all* commits again, instead of looping over those
very rare commits that need overflow handling for their corrected commit
date data.

Though this possible performance issue^* could be fixed in the future commit.

*) It needs to be actually benchmarked which version is faster.

With the change proposed above (and required changes to the `struct
write_commit_graph_context`) it could look like this:

	for (i = 0; i < ctx->gcov_commits.nr; i++) {


> +		struct commit *c = ctx->commits.list[i];
> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +
> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> +			hashwrite_be32(f, offset >> 32);
> +			hashwrite_be32(f, (uint32_t) offset);
> +		}
> +	}

The above would be as simple as the following:

		struct commit *c = ctx->gcov_commits.list[i];
		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
		display_progress(ctx->progress, ++ctx->progress_cnt);

		hashwrite_be64(f, offset);

Assumming that there would be hashwrite_be64(), it would be the
following otherwise:

		hashwrite_be32(f, offset >> 32);
		hashwrite_be32(f, (uint32_t)offset);

> +
> +	return 0;
> +}
> +
>  static int write_graph_chunk_extra_edges(struct hashfile *f,
>  					 struct write_commit_graph_context *ctx)
>  {
> @@ -1399,7 +1466,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  
>  				if (current->date && current->date > max_corrected_commit_date)
>  					max_corrected_commit_date = current->date - 1;
> +

This is a bit unrelated change, adding this empty line.

>  				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> +
> +				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> +					ctx->num_generation_data_overflows++;

All right, we need to track number of commits that need overflow
handling for generation number v2 to know what size GDOV chunk would
need to be.

>  			}
>  		}
>  	}
> @@ -1765,6 +1836,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
>  	chunks[2].write_fn = write_graph_chunk_data;
> +
> +	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
> +		ctx->write_generation_data = 0;

All right, here we handle GIT_TEST_COMMIT_GRAPH_NO_GDAT.

> +	if (ctx->write_generation_data) {
> +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
> +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
> +		num_chunks++;
> +	}

All right, the GDAT chunk consist of <number of commits> entries.

> +	if (ctx->num_generation_data_overflows) {
> +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
> +		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
> +		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
> +		num_chunks++;
> +	}

All right, that's what num_generation_data_overflows was for.

>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> @@ -2170,6 +2256,8 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->opts = opts;
>  	ctx->total_bloom_filter_data_size = 0;
> +	ctx->write_generation_data = 1;
> +	ctx->num_generation_data_overflows = 0;
>  
>  	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
>  						      bloom_settings.bits_per_entry);
> diff --git a/commit-graph.h b/commit-graph.h
> index 2e9aa7824e..19a02001fd 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -6,6 +6,7 @@
>  #include "oidset.h"
>  
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
> +#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
>  #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>  
> @@ -68,6 +69,8 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
> +	const unsigned char *chunk_generation_data_overflow;

All right, two new chunks: GDAT and GDOV.

>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
> diff --git a/commit.h b/commit.h
> index 33c66b2177..251d877fcf 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -14,6 +14,7 @@
>  #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
>  #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
>  #define GENERATION_NUMBER_ZERO 0
> +#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)

Should we use this form, or hexadecimal constant?

   #define GENERATION_NUMBER_V2_OFFSET_MAX 0x7FFFFFFF

But I think the current definition is more explicit: all bits set to one
except for the most significant digit.  All right.

>  
>  struct commit_list {
>  	struct commit *item;
> diff --git a/t/README b/t/README
> index 2adaf7c2d2..975c054bc9 100644
> --- a/t/README
> +++ b/t/README
> @@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
>  be written after every 'git commit' command, and overrides the
>  'core.commitGraph' setting to true.
>  
> +GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
> +commit-graph to be written without generation data chunk.
> +

All right. Nice have it documented.

>  GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
>  commit-graph write to compute and write changed path Bloom filters for
>  every 'git commit-graph write', as if the `--changed-paths` option was
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 5f585a1725..75927b2c81 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
> +	if (graph->chunk_generation_data_overflow)
> +		printf(" generation_data_overflow");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)

All right, updating `test-tool read-graph` with new chunks.

> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index d11040ce41..dbde016188 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  '
>  
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF

Sidenote: I have just noticed this, and as I see it is not something you
wrote, but usually we write it with no space after the dash and before
'EOF':

   	cat >expect <<-EOF

>  	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual

All right, updating expect value for `test-tool read-graph` in the usual
case, with generation number chunk.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 2ed0c1544d..0328e98564 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>  graph_read_expect() {
>  	OPTIONAL=""
>  	NUM_CHUNKS=3
> -	if test ! -z $2
> +	if test ! -z "$2"

All right, that is straighforward fix, which is now needed.

>  	then
>  		OPTIONAL=" $2"
>  		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> @@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
>  	# valid commit and tree OID
>  	git rev-parse HEAD HEAD^{tree} >in &&
>  	git commit-graph write --stdin-commits <in &&
> -	graph_read_expect 3
> +	graph_read_expect 3 generation_data
>  '
>  
>  test_expect_success 'write graph' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "3"
> +	graph_read_expect "3" generation_data
>  '
>  
>  test_expect_success POSIXPERM 'write graph has correct permissions' '
> @@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
> @@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	cat new-idx | git commit-graph write --stdin-packs &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "9" "extra_edges"
> +	graph_read_expect "9" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
> @@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
>  	git rev-parse merge/1 >>commits-in &&
>  	cat commits-in | git commit-graph write --stdin-commits &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "6"
> +	graph_read_expect "6" "generation_data"
>  '
>  
>  graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
> @@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "10" "extra_edges"
> +	graph_read_expect "10" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write --reachable &&
>  	test_path_is_file $objdir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '
>  
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
>  	cd "$TRASH_DIRECTORY/bare" &&
>  	git commit-graph write &&
>  	test_path_is_file $baredir/info/commit-graph &&
> -	graph_read_expect "11" "extra_edges"
> +	graph_read_expect "11" "generation_data extra_edges"
>  '

All those just add "generation_data" (aka GDAT) to expected chunks. All
right.

>  
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
>  
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> -	git commit-graph verify >output
> +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> +	git commit-graph verify >output &&

All right, this simply adds GIT_TEST_COMMIT_GRAPH_NO_GDAT=1.  I assume
this is needed because this test is also setup for the following commits
_without_ even saying that in the test name (bad practice, in my
opinion), and the comment above this test says the following:

  # the verify tests below expect the commit-graph to contain
  # exactly the commits reachable from the commits/8 branch.
  # If the file changes the set of commits in the list, then the
  # offsets into the binary file will result in different edits
  # and the tests will likely break.

So the following tests are fragile (though perhaps unavoidably fragile),
and without this change they would not work, I assume.

> +	graph_read_expect 9 extra_edges

I guess that this is here to check that GIT_TEST_COMMIT_GRAPH_NO_GDAT=1
work as intended, and that the following "verify" tests wouldn't break.
I understand its necessity, even if I don't quite like having a test
that checks multiple things.  This is a minor issue, though.

All right.


We might want to have a separate test that checks that we get
commit-graph with and without GDAT chunk depending on whether we use
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1.  On the other hand, this environment
variable is there purely for tests, so the question is should we test
the test infrastructure?

>  '
>  
>  NUM_COMMITS=9
> @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>  	)
>  '
>  
> +test_commit_with_date() {
> +  file="$1.t" &&
> +  echo "$1" >"$file" &&
> +  git add "$file" &&
> +  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
> +  git tag "$1"
> +}

Here we add a helper function.  All right.

I wonder though if it wouldn't be a better idea to add `--date <date>`
option to the test_commit() function in test-lib-functions.sh (which
option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
set notick=yes).

For example:

diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index f1ae935fee..a1f9a2b09b 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
 		--signoff)
 			signoff="$1"
 			;;
+        --date)
+            notick=yes
+            GIT_COMMITTER_DATE="$2"
+            GIT_AUTHOR_DATE="$2"
+            shift
+            ;;
 		-C)
 			indir="$2"
 			shift


> +

It would be nice to have there comment describing the shape of the
revision history we generate here, that currenly is present only in the
commmit message.

# We test the overflow-related code with the following repo history:
#
#               4:F - 5:N - 6:U
#              /               \
# 1:U - 2:N - 3:U               M:N
#              \               /
#               7:N - 8:F - 9:N
#
# Here the commits denoted by U have committer date of zero seconds
# since Unix epoch, the commits denoted by N have committer date
# starting from 1112354055 seconds since Unix epoch (default committer
# date for the test suite), and the commits denoted by F have committer
# date of (2 ^ 31 - 2) seconds since Unix epoch.
#
# The largest offset observed is 2 ^ 31, just large enough to overflow.
#

> +test_expect_success 'overflow corrected commit date offset' '
> +	objdir=".git/objects" &&
> +	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
> +	FUTURE_DATE="@2147483646 +0000" &&

It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
FUTURE_DATE other way.

Wouldn't be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?

> +	test_oid_cache <<-EOF &&
> +	oid_version sha1:1
> +	oid_version sha256:2
> +	EOF
> +	cd "$TRASH_DIRECTORY" &&
> +	mkdir repo &&
> +	cd repo &&
> +	git init &&
> +	test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
> +	test_commit 2 &&
> +	test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
> +	git commit-graph write --reachable &&
> +	graph_read_expect 3 generation_data &&
> +	test_commit_with_date 4 "$FUTURE_DATE" &&
> +	test_commit 5 &&
> +	test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
> +	git branch left &&
> +	git reset --hard 3 &&
> +	test_commit 7 &&
> +	test_commit_with_date 8 "$FUTURE_DATE" &&
> +	test_commit 9 &&
> +	git branch right &&
> +	git reset --hard 3 &&
> +	git merge left right &&

We have test_merge() function in test-lib-functions.sh, perhaps we
should use it here.

> +	git commit-graph write --reachable &&
> +	graph_read_expect 10 "generation_data generation_data_overflow" &&

All right, we write the commit-graph and check that it has both GDAT and
GDOV chunks present.

> +	git commit-graph verify

All right, we checks that created commit graph with GDAT and GDOV passes
'git commit-graph verify` checks.

> +'
> +
> +graph_git_behavior 'overflow corrected commit date offset' repo left right

All right, here we compare the Git behavior with the commit-graph to the
behavior without it... however I think that those two tests really
should have distinct (different) test names. Currently they both use
'overflow corrected commit date offset'.

> +
>  test_done
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index c334ee9155..651df89ab2 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
>  	infodir=".git/objects/info" &&
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>  
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  
>  	oid_version sha1:1
>  	oid_version sha256:2
> @@ -31,9 +31,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
> +	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output

All right, we now expect the commit graph to include the GDAT chunk...
though shouldn't be there old expected value for no GDAT, for future
tests?  But perhaps this is not necessary.

Note that I have not checked the details, but it looks OK to me.

> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index f807276337..e2d33a8a4c 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -55,10 +55,13 @@ test_expect_success 'setup' '
>  	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
>  	mv .git/objects/info/commit-graph commit-graph-half &&
>  	chmod u+w commit-graph-half &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> +	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> +	chmod u+w commit-graph-no-gdat &&

All right, this prepares for testing one more mode.  The run_all_modes()
function would test the following cases:
 - no commit-graph
 - commit-graph for all commits, with GDAT
 - commit-graph with half of commits, with GDAT
 - commit-graph for all commits, without GDAT

>  	git config core.commitGraph true
>  '
>  
> -run_three_modes () {
> +run_all_modes () {
>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
>  	test_cmp expect actual &&
> @@ -67,11 +70,14 @@ run_three_modes () {
>  	test_cmp expect actual &&
>  	cp commit-graph-half .git/objects/info/commit-graph &&
>  	"$@" <input >actual &&
> +	test_cmp expect actual &&
> +	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> +	"$@" <input >actual &&
>  	test_cmp expect actual
>  }
>  
> -test_three_modes () {
> -	run_three_modes test-tool reach "$@"
> +test_all_modes () {
> +	run_all_modes test-tool reach "$@"
>  }

All right.

Though to reduce "noise" in this patch, the rename of run_three_modes()
to run_all_modes() and test_three_modes() to test_all_modes() could have
been done in a separate preparatory patch.  It would be pure refactoring
patch, without introducing any new functionality.

>  
>  test_expect_success 'ref_newer:miss' '
> @@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
>  	B:commit-4-9
>  	EOF
>  	echo "ref_newer(A,B):0" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>  
>  test_expect_success 'ref_newer:hit' '
> @@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
>  	B:commit-2-3
>  	EOF
>  	echo "ref_newer(A,B):1" >expect &&
> -	test_three_modes ref_newer
> +	test_all_modes ref_newer
>  '
>  
>  test_expect_success 'in_merge_bases:hit' '
> @@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
>  	B:commit-8-8
>  	EOF
>  	echo "in_merge_bases(A,B):1" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>  
>  test_expect_success 'in_merge_bases:miss' '
> @@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
>  	B:commit-5-9
>  	EOF
>  	echo "in_merge_bases(A,B):0" >expect &&
> -	test_three_modes in_merge_bases
> +	test_all_modes in_merge_bases
>  '
>  
>  test_expect_success 'in_merge_bases_many:hit' '
> @@ -117,7 +123,7 @@ test_expect_success 'in_merge_bases_many:hit' '
>  	X:commit-5-7
>  	EOF
>  	echo "in_merge_bases_many(A,X):1" >expect &&
> -	test_three_modes in_merge_bases_many
> +	test_all_modes in_merge_bases_many
>  '
>  
>  test_expect_success 'in_merge_bases_many:miss' '
> @@ -127,7 +133,7 @@ test_expect_success 'in_merge_bases_many:miss' '
>  	X:commit-8-6
>  	EOF
>  	echo "in_merge_bases_many(A,X):0" >expect &&
> -	test_three_modes in_merge_bases_many
> +	test_all_modes in_merge_bases_many
>  '
>  
>  test_expect_success 'in_merge_bases_many:miss-heuristic' '
> @@ -137,7 +143,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
>  	X:commit-6-6
>  	EOF
>  	echo "in_merge_bases_many(A,X):0" >expect &&
> -	test_three_modes in_merge_bases_many
> +	test_all_modes in_merge_bases_many
>  '
>  
>  test_expect_success 'is_descendant_of:hit' '
> @@ -148,7 +154,7 @@ test_expect_success 'is_descendant_of:hit' '
>  	X:commit-1-1
>  	EOF
>  	echo "is_descendant_of(A,X):1" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>  
>  test_expect_success 'is_descendant_of:miss' '
> @@ -159,7 +165,7 @@ test_expect_success 'is_descendant_of:miss' '
>  	X:commit-7-6
>  	EOF
>  	echo "is_descendant_of(A,X):0" >expect &&
> -	test_three_modes is_descendant_of
> +	test_all_modes is_descendant_of
>  '
>  
>  test_expect_success 'get_merge_bases_many' '
> @@ -174,7 +180,7 @@ test_expect_success 'get_merge_bases_many' '
>  		git rev-parse commit-5-6 \
>  			      commit-4-7 | sort
>  	} >expect &&
> -	test_three_modes get_merge_bases_many
> +	test_all_modes get_merge_bases_many
>  '
>  
>  test_expect_success 'reduce_heads' '
> @@ -196,7 +202,7 @@ test_expect_success 'reduce_heads' '
>  			      commit-2-8 \
>  			      commit-1-10 | sort
>  	} >expect &&
> -	test_three_modes reduce_heads
> +	test_all_modes reduce_heads
>  '
>  
>  test_expect_success 'can_all_from_reach:hit' '
> @@ -219,7 +225,7 @@ test_expect_success 'can_all_from_reach:hit' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach(X,Y):1" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>  
>  test_expect_success 'can_all_from_reach:miss' '
> @@ -241,7 +247,7 @@ test_expect_success 'can_all_from_reach:miss' '
>  	Y:commit-8-5
>  	EOF
>  	echo "can_all_from_reach(X,Y):0" >expect &&
> -	test_three_modes can_all_from_reach
> +	test_all_modes can_all_from_reach
>  '
>  
>  test_expect_success 'can_all_from_reach_with_flag: tags case' '
> @@ -264,7 +270,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
>  	Y:commit-8-1
>  	EOF
>  	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
> -	test_three_modes can_all_from_reach_with_flag
> +	test_all_modes can_all_from_reach_with_flag
>  '
>  
>  test_expect_success 'commit_contains:hit' '
> @@ -280,8 +286,8 @@ test_expect_success 'commit_contains:hit' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):1" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>  
>  test_expect_success 'commit_contains:miss' '
> @@ -297,8 +303,8 @@ test_expect_success 'commit_contains:miss' '
>  	X:commit-9-3
>  	EOF
>  	echo "commit_contains(_,A,X,_):0" >expect &&
> -	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_all_modes commit_contains &&
> +	test_all_modes commit_contains --tag
>  '
>  
>  test_expect_success 'rev-list: basic topo-order' '
> @@ -310,7 +316,7 @@ test_expect_success 'rev-list: basic topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-6-6
> +	run_all_modes git rev-list --topo-order commit-6-6
>  '
>  
>  test_expect_success 'rev-list: first-parent topo-order' '
> @@ -322,7 +328,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-6-6
>  '
>  
>  test_expect_success 'rev-list: range topo-order' '
> @@ -334,7 +340,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
>  '
>  
>  test_expect_success 'rev-list: range topo-order' '
> @@ -346,7 +352,7 @@ test_expect_success 'rev-list: range topo-order' '
>  		commit-6-2 commit-5-2 commit-4-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
>  '
>  
>  test_expect_success 'rev-list: first-parent range topo-order' '
> @@ -358,7 +364,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
>  		commit-6-2 \
>  		commit-6-1 commit-5-1 commit-4-1 \
>  	>expect &&
> -	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
> +	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
>  '
>  
>  test_expect_success 'rev-list: ancestry-path topo-order' '
> @@ -368,7 +374,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
>  		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
>  		commit-6-3 commit-5-3 commit-4-3 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
> +	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
>  '
>  
>  test_expect_success 'rev-list: symmetric difference topo-order' '
> @@ -382,7 +388,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
>  		commit-3-8 commit-2-8 commit-1-8 \
>  		commit-3-7 commit-2-7 commit-1-7 \
>  	>expect &&
> -	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
> +	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
>  '
>  
>  test_expect_success 'get_reachable_subset:all' '
> @@ -402,7 +408,7 @@ test_expect_success 'get_reachable_subset:all' '
>  			      commit-1-7 \
>  			      commit-5-6 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>  
>  test_expect_success 'get_reachable_subset:some' '
> @@ -420,7 +426,7 @@ test_expect_success 'get_reachable_subset:some' '
>  		git rev-parse commit-3-3 \
>  			      commit-1-7 | sort
>  	) >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset
>  '
>  
>  test_expect_success 'get_reachable_subset:none' '
> @@ -434,7 +440,7 @@ test_expect_success 'get_reachable_subset:none' '
>  	Y:commit-2-8
>  	EOF
>  	echo "get_reachable_subset(X,Y)" >expect &&
> -	test_three_modes get_reachable_subset
> +	test_all_modes get_reachable_subset

All those are pure renames of test_three_modes() to test_all_modes(),
which now does tests for one more mode -- without GDAT.

All right.

>  '
>  
>  test_done

Best,
-- 
Jakub Narębski

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
  2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-11-01  0:55         ` Jakub Narębski
  2020-11-12 10:01           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-01  0:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

All right.

>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

I think this paragraphs tries too much to be concise, with the result it
is less clear than it could be.  Perhaps it would be better to separate
"what-if" from the current behavior.

  If each layer of split commit-graph is treated independently, as it
  were the case before this commit, with Git inspecting only the current
  layer for chunk_generation_data pointer, commits in the lower layer
  (one with GDAT) would have corrected commit date as their generation
  number, while commits in the upper layer would have topological levels
  as their generation.  Corrected commit dates have usually much larger
  values than topological levels.  This means that if we take two
  commits, one from the upper layer, and one reachable from it in the
  lower layer, then the expectation that the generation of a parent is
  smaller than the generation of a child would be violated.

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.

I don't quite understand the issue here. Unless none of the following
query commands short-circuit and all walk the commit graph regardless of
what generation numbers tell them, they should give different results
with and without the commit graph, if we take two commits one from lower
layer of split commit graph with GDAT, and one commit from the higher
layer without GDAT, one lower reachable from the other higher.

We have the following query commands that we can check:
  $ git merge-base --is-ancestor <lower> <higher>
  $ git merge-base --independent <lower> <higher>

  $ git tag --contains <tag-to-lower>
  $ git tag --merged <tag-to-higher>
  $ git branch --contains <branch-to-lower>
  $ git branch --merged <branch-to-higher>

The second set of queries require for those commits to be tagged, or
have branch pointing at them, respectively.

Also, shouldn't `git commit-graph verify` fail with split commit graph
where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?

Let's assume that we have the following history, with newer commits
shown on top like in `git log --graph --oneline --all`:

          topological     corrected         generation
          level           commit date       number^*

      d    3                                3
      |
   c  |    3                                3
   |  |                                                 without GDAT
 ..|..|.....[layer.boundary]........................................
   |  |                                                    with GDAT
   |  b    2              1112912113        1112912113
   |  |
   a  |    2              1112912053        1112912053
   | /
   |/
   r       1              1112911993        1112911993

*) each layer inspected individually.

With such history, we can for example reach 'a' from 'c', thus
`git merge-base --is-ancestor a b` should return true value, but
without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
I use here weaker reachability condition, but the one that works
also for commits outside the commit-graph (and those for which
generation numbers overflows).

>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

All right, that's good explanation.

>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
> Rewriting layers follows similar approach: if the topmost layer below
> the set of layers being rewritten (in the split commit-graph chain)
> exists, and it does not contain GDAT chunk, then the result of rewrite
> does not have GDAT chunks either.

All right, very good explanation; the only minor suggestion would be to
add some 'intro' to the first of those two paragraphs, for example:

  Therefore, when writing the new layer in split commit-graph...

Though I am not sure if it is necessary.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 29 +++++++++++-
>  commit-graph.h                |  1 +
>  t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
>  3 files changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 71d0b243db..5d15a1399b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>  
> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> +	int read_generation_data;
> +
> +	if (!g)
> +		return;
> +
> +	read_generation_data = !!g->chunk_generation_data;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}
> +}

All right, this function checks assumedly topmost layer if it is
GDAT-less, and if it is propagates this status down the layers of split
commit graph.  This is needed because if we have mixed-generation commit
graph, then for each and every layer we need to use topological levels
as generation number.

The only minor issue is the name of this function (it does not hint that
it propagates the GDAT status downwards), but I don't have a better
idea, unfortunately.  And it does reflect what this function is used for.

> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>  
> +	validate_mixed_generation_chain(g);
> +

All right, this looks like a good place to put this new check: just
after reading commit-graph chain.

>  	return g;
>  }
>  
> @@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	if (g->chunk_generation_data) {
> +	if (g->chunk_generation_data && g->read_generation_data) {
>  		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

All right, instead of simply checking if the current layer has
generation data chunk, we need to also check if the whole graph allows
for it (if there are no mixed-generation layers).

The g->read_generation_data should be filled correctly, because
fill_commit_graph_info() is always preceded by read_commit_graph(), if I
understand it correctly.

>  
>  		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>  
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +

This needs more careful examination, and looking at larger context of
those lines.

At this point, unless `--split=replace` option is used, 'g' points to
the bottom layer out of all topmost layers being merged. We know that if
there are GDAT-less layers then these must be top layers, so this means
that we can write GDAT chunk in the result of the merge -- because we
would be replacing all possible GDAT-less layers (and maybe some with
GDAT) with a single layer with the GDAT chunk.

The ctx->write_generation_data is set to true unless environment
variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
write_commit_graph() it would be set to false if topmost layer doesn't
have GDAT chunk, and to true if `--split=replace` option is used; see
below.

Looks good to me.

NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
writing GDAT chunk with generation data v2 unless we are merging layers,
or replacing all of them with a single layer: then it is _ignored_.

Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
in t/README?  Currently it reads:

  GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
  commit-graph to be written without generation data chunk.

>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)
> @@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
>  		struct commit_graph *g = ctx->r->objects->commit_graph;
>  
>  		while (g) {
> +			g->read_generation_data = 1;
>  			g->topo_levels = &topo_levels;
>  			g = g->base_graph;
>  		}

All right, when writing the commit graph we want to make use of existing
generation data chunks.  This is safe, because when computing generation
numbers for writing we have separate place to store topoogical levels
(`topo_levels`) so they would not be mixed with corrected commit dates:
generation number v1 and v2 are kept separate.

> @@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +

All right, if current (topmost) layed does not have GDAT, then when
creating a new layer do not create GDAT layer either (merging layers and
rewriting the commit-graph is handled separately).

>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->opts)
>  			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;

All right, when replacing all layers (`git commit-graph write --split=replace`),
then we can safely write the GDAT chunk.

Note however that here we don't take into account the value of the
environment variable GIT_TEST_COMMIT_GRAPH_NO_GDAT.  Which maybe is what
we want...

>  	}
>  
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/commit-graph.h b/commit-graph.h
> index 19a02001fd..ad52130883 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -64,6 +64,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>  
>  	uint32_t num_commits_in_base;
> +	unsigned int read_generation_data;
>  	struct commit_graph *base_graph;

All right, this new field is here to propagate to each layer the
information whether we can read from the generation number v2 data
chunk.

Though I am not sure whether this field should be added here, and
whether it should be `unsigned int` (we don't have to be that careful
about saving space for this type).

>  
>  	const uint32_t *chunk_oid_fanout;
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 651df89ab2..d0949a9eb8 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
>  	verify_chain_files_exist $graphdir
>  '
>  
> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&

This should probably go just before cd-ing into just created
subdirectory.

> +	graphdir=".git/objects/info/commit-graphs" &&
> +	test_oid_cache <<-EOM &&
> +	oid_version sha1:1
> +	oid_version sha256:2
> +	EOM

Minor nitpick: Why use "EOM", which is used only twice in Git the test
suite, and not the conventional "EOF" (used at least 4000 times)?

> +	cd "$TRASH_DIRECTORY/mixed" &&

The t/README says:

   - Don't chdir around in tests.  It is not sufficient to chdir to
     somewhere and then chdir back to the original location later in
     the test, as any intermediate step can fail and abort the test,
     causing the next test to start in an unexpected directory.  Do so
     inside a subshell if necessary.

Though I am not sure if it should apply also to this situation.

> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&

All right.

> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&

Is there a reason why we do not check just written commit-graph file
with `test-tool read-graph >output-layer-1`?

> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&

Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
to use one or the other?

> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&

It would be nice to have ASCII-art of the history (of the graph of
revisions) created here for subsequent tests:

           /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
          /    \-\
         /        \
  1 <-- 2 <-- 3*   \--\
  |      \             \ 
  |       \-----\       \
   \             \       \
    \-- 4*<------ M/1     M/2
        |\               /  
        | \-- 5*        /
        \              /
         \------------/

  * - 1st layer  

Though I am not sure if what I have created is readable; I think a
better way to draw this graph is possible, for example:

               /- 3*
              /
             /
  1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
   \         \       \
    \         \       \
     \         \       \
      \- 4* <-- M/1     \    
         |\              \
         | \------------- M/2
         \
          \---- 5*

Edit: as I see the history gets even more complicated, so perhaps
ASCII-art diagram of the history with layers marked would be too
complicated, and wouldn't bring much.

Why do we need such shape of the history in the repository?

> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&

All right, we check that we have 2 commits, and that there is no GDAT
chunk.

> +	git commit-graph verify

All right, we verify commit-graph as a whole (both layers).

> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '

Hmmm... I wonder if we can come up with a better name for this test;
for example should it be "does not write" or "do not write"?

> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&

The commit graph gets complicated, so it would not be easy to visualize
it with ASCII-art diagram without any crossed lines.  Maybe `git log
--graph --oneline --all` would help:

*   (merge/3) Merge branch 'merge/2'
|\
| *   (merge/2) Merge branch 'commits/6'
| |\
* | \   Merge branch 'commits/5'
|\ \ \
| * | | (commits/5) 5
| |/ /
* | |   Merge branch 'merge/1'
|\ \ \
| * | | (merge/1) Merge branch 'commits/4'
| |\| |
| | * | (commits/4) 4
* | | | (commits/3) 3
|/ / /
| | | * (commits/10) 10
| | | * (commits/9) 9
| | | * (commits/8) 8
| | | * (commits/7) 7
| | |/
| | * (commits/6) 6
| |/
|/|
* | (commits/2) 2
|/
* (commits/1) 1

> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify

All right, so here we check that we have layer without GDAT at the top,
and we request not to merge layers thus new layer will be created, then
the new layer also does not have GDAT chunk (and has 3 commits).

Minor nitpick: shouldn't those test be indented?

> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&

All right, this checks that we have split commit-graph chain that
consist of a single layer, and that the commit-graph file for this
single layer exists.

> +	graph_read_expect 15 &&

Shouldn't we use `test-tool read-graph` to check whether generation_data
chunk is present... ah, sorry, I have realized that after previous
patches `graph_read_expect 15` implicitly checks the latter, because in
its' use of `test-tool read-graph` it does expect generation_data chunk.

So we use `test-tool read-graph` manually to check that generation_data
chunk is absent, and we use graph_read_expect to check that it is
present (and in both cases that the number of commits matches).  I
wonder if it would be possible to simplify that...

> +	git commit-graph verify

All right.

> +'
> +
> +test_expect_success 'add one commit, write a tip graph' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	test_commit 11 &&
> +	git branch commits/11 &&
> +	git commit-graph write --reachable --split &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'

What it is meant to test?  That adding single-commit to a 15 commit
commit-graph file in split mode does not result in layers merging, and
actually adds a new layer: we check that we have exactly two layers and
that they are all OK.

We don't check here that the newly created top layer commit-graph does
have GDAT chunk, as it should be if the top layer (in this case the only
layer) has GDAT chunk.

> +
>  test_done

One test we are missing is testing that merging layers is done
correctly, namely that if we are merging layers in split commit-graph
file, and the layer below the ones we are merging lacks GDAT chunk, then
the result of the merge should also be without GDAT chunk.  This would
require at least two GDAT-less layers in a setup.

I'm not sure how difficult writing such test should be.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
  2020-10-25 20:58           ` Taylor Blau
@ 2020-11-03  5:36             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03  5:36 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitgitgadget, jnareb, abhishekkumar8222

On Sun, Oct 25, 2020 at 04:58:14PM -0400, Taylor Blau wrote:
> On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > While measuring performance with `git commit-graph write --reachable
> > > --changed-paths` on the linux repository led to around 1m40s for both
> > > HEAD and master (and could be due to fault in my measurements), it is
> > > still the "right" thing to do.
> >
> > I had to read the above paragraph several times to understand it,
> > possibly because I have expected here to be a fix for a performance
> > regression.  The commit message for 3d112755 (commit-graph: examine
> > commits by generation number) describes reduction of computation time
> > from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
> > before those changes) to be around 3m, not the same before and after
> > changes being around 1m40s.
> >
> > Can anyone recheck this before-and-after benchmark, please?
> 
> My hunch is that our heuristic to fall back to the commits 'date'
> value is saving us here. commit_gen_cmp() first compares the generation
> numbers, breaking ties by 'date' as a heuristic. But since all
> generation number queries return GENERATION_NUMBER_INFINITY during
> writing, we're relying on our heuristic entirely.
> 
> I haven't looked much further than that, other than to see that I could
> get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
> computing Bloom filters region on the kernel.
> 

Thanks for benchmarking it. I wasn't sure if I am testing it correctly
or the patch made no difference.

> > Anyway, it might be more clear to write it as the following:
> >
> >   On the Linux kernel repository, this patch didn't reduce the
> >   computation time for 'git commit-graph write --reachable
> >   --changed-paths', which is around 1m40s both before and after this
> >   change.  This could be a fault in my measurements; it is still the
> >   "right" thing to do.
> >
> > Or something like that.
> 
> Assuming that we are in fact being saved by the "date" heuristic, I'd
> probably write the following commit message instead:
> 
>   Before computing Bloom filters, the commit-graph machinery uses
>   commit_gen_cmp to sort commits by generation order for improved diff
>   performance. 3d11275505 (commit-graph: examine commits by generation
>   number, 2020-03-30) claims that this sort can reduce the time spent to
>   compute Bloom filters by nearly half.
> 
>   But since c49c82aa4c (commit: move members graph_pos, generation to a
>   slab, 2020-06-17), this optimization is broken, since asking for
>   'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
>   while writing.
> 
>   Not all hope is lost, though: 'commit_graph_generation()' falls
>   back to comparing commits by their date when they have equal generation
>   number, and so since c49c82aa4c is purely a date comparison function.
>   This heuristic is good enough that we don't seem to loose appreciable
>   performance while computing Bloom filters. [Benchmark that we loose
>   about ~4sec before/after c49c82aa4c9...]
> 
>   So, avoid the uesless 'commit_graph_generation()' while writing by
>   instead accessing the slab directly. This returns the newly-computed
>   generation numbers, and allows us to avoid the heuristic by directly
>   comparing generation numbers.
> 

That's a lot better, will change.

> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 04/10] commit-graph: return 64-bit generation number
  2020-10-25 13:48         ` Jakub Narębski
@ 2020-11-03  6:40           ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03  6:40 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Sun, Oct 25, 2020 at 02:48:27PM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> Note that there are two changes that are not mentioned in the commit
> message, namely adding 'const'-ness to generation_a/b local variables in
> commit_gen_cmp() from commit-graph.c, and switching from
> GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY as the default
> (initial) value for 'max_generation' in repo_in_merge_bases_many().
> 
> While the former is a simple "while-at-it" change that shouldn't affect
> correctness, the latter needs an explanation (or fixing if it is wrong).
> 

The change from GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY was
incorrect. While fixing merge conflicts on rebasing to master again, I
didn't notice that repo_in_merge_bases_many() switched from using
min_generation and GENERATION_NUMBER_ZERO to max_generation and
GENERATION_NUMBER_ZERO.

Thanks for noticing!

> ...
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-10-27 18:53         ` Jakub Narębski
@ 2020-11-03 11:44           ` Abhishek Kumar
  2020-11-04 16:45             ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03 11:44 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ...
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Somewhere in the commit message we should also describe that this commit
> changes how commit-graph is verified: from checking that the generation
> number agrees with _topological level definition_, that is that for a
> given commit it is 1 more than maximum of its parents (with the caveat
> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
> checking that slightly weaker condition fulfilled by both topological
> levels (generation number v1) and by corrected commit date (generation
> number v2) that for a given commit its generation number is 1 more than
> maximum of its parents or larger.

Sure, that makes sense. Will add.

> 
> But, as far as I understand it, current code does not handle correctly
> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
> 
> On the other hand we could have simpy use functional check, that
> generation number used (which can be v1 or v2, or any similar other)
> fulfills the reachability condition for each edge, which can be
> simplified to checking that generation(parents) <= generation(commit).
> If the reachability condition is true for each edge, then it is true for
> each path, and for each commit.
> 
> > ---
> >  commit-graph.c | 43 +++++++++++++++++++++++--------------------
> >  1 file changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index cedd311024..03948adfce 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >  
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> Why this change?  It is not described in the commit message.
> 
> Note that while this tie-breaking fallback doesn't make much sense for
> corrected committer date generation number v2, this tie-breaking helps
> if we have to use topological levels (generation number v2).
> 

Right, I should have mentioned this change (and it's not something that
makes a difference either way).

We call commit_gen_cmp() only when we are sorting commits by generation
to speed up computation of Bloom filters i.e. while writing a commit
graph (either split commit-graph or a simple commit-graph).

Since we are always computing and storing corrected commit date when we
are writing (whether we write a GDAT chunk or not), using date as
heuristic is longer required.

> >  	return 0;
> >  }
> >  
> > @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
> enough; no need for 'timestamp_t' for 'level' variable.
> 
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> >

We need the 'timestamp_t' as we are comparing level with the now 64-bits
GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
timestamp_t. I have a hunch that since we are explicitly using a fixed
width data type, compiler is unwilling to type coerce into broader data
types.

Advice on this appreciated.

> 
> All right, we compute both generation numbers: topological levels and
> corrected commit date.
> 
> I guess we use 'corrected_commit_date' instead of simply 'generation' to
> make it asier to remember which is which.
> 
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> 
> Straightforward addition.
> 
> > +		    )
> 
> Why this closing parenthesis is now in separated line?
> 
> >  			continue;
> >  
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> 
> All right, straightforward addition.
> 
> >  
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > -
> 
> Why we have removed this empty line?
> 
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> 
> All right.
> 
> >  				if (level == GENERATION_NUMBER_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> 
> All right, same as above.
> 
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> 
> All right, reasonable and straightforward.
> 
> >  			}
> >  
> > @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> >  					max_level = GENERATION_NUMBER_V1_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date && current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> 
> All right.
> 
> Here we use the same trick as in previous commit (and as above) to avoid
> any possible overflow, to minimize number of conditionals.  The fact
> that max_corrected_commit_date might store incorrect value doesn't
> matter, as it is reset at beginning of this loop.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >  
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_V1_MAX)
> > -			max_generation--;
> > -
> 
> Perhaps in the future we should check that both topological levels, and
> also corrected committer date (if it exists) for correctness according
> to their definition.  Then the above removed part would be restored (but
> with s/max_generation/max_level/).
> 
> >  		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		if (generation < max_generation + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> 
> All right, so we relaxed the check so that it will be fulfilled by
> generation number v2 (and also by generation number v1, as it implies
> the more strict check for v1).
> 
> What would happen however if generation holds topological levels, and it
> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
> so the `git commit-graph verify` would incorrectly say that there is
> a problem with generation number, while there isn't one (false positive
> detection of error).

Alright, so the above block still makes sense if we are working with
topological levels but not with corrected commit dates. Instead of
removing it, I will modify the condition to check that one of our parents
has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.

Suprised that no test breaks by this change.

I have also moved changes in the verify function to the next patch, as
we cannot write or read corrected commit dates yet - so little sense in
modifying verify.

> 
> Sidenote: I think we don't have to worry about having to introduce
> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
> disck representation) corrected commiter date is the same as of commiter
> date itself, plus some, and I don't see us coming close to 64-bit limit
> of timestamp_t for commit dates.
> 
> >  				     oid_to_hex(&cur_oid),
> >  				     generation,
> >  				     max_generation + 1);
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-11-03 17:59         ` Jakub Narębski
  2020-11-03 18:19           ` Junio C Hamano
  2020-11-20 10:33           ` Abhishek Kumar
  0 siblings, 2 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-03 17:59 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Eric Sunshine, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With corrected commit dates implemented, we no longer have to rely on
> commit date as a heuristic in paint_down_to_common().
>
> While using corrected commit dates Git walks nearly the same number of
> commits as commit date, the process is slower as for each comparision we
> have to access a commit-slab (for corrected committer date) instead of
> accessing struct member (for committer date).

Something for the future: I wonder if it would be worth it to bring back
generation number from the commit-slab into `struct commit`.

>
> For example, the command `git merge-base v4.8 v4.9` on the linux
> repository walks 167468 commits, taking 0.135s for committer date and
> 167496 commits, taking 0.157s for corrected committer date respectively.

I think it would be good idea to explicitly refer to the commit that
changed paint_down_to_common() to *not* use generation numbers v1
(topological levels) in the cases such as this, namely 091f4cf3 (commit:
don't use generation numbers if not needed).  In this commit we have the
following:

  This change makes a concrete difference depending on the topology
  of the commit graph. For instance, computing the merge-base between
  consecutive versions of the Linux kernel has no effect for versions
  after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
  regression:

      v2.18.0: 0.122s
  v2.19.0-rc1: 0.547s
         HEAD: 0.127s

  To determine that this was simply an ordering issue, I inserted
  a counter within the while loop of paint_down_to_common() and
  found that the loop runs 167,468 times in v2.18.0 and 635,579
  times in v2.19.0-rc1.

The times you report (0.135s and 0.157s) are close to 0.122s / 0.127s
reported in 091f4cf3 - that is most probably because of the differences
in the system performance (hardware, operating system, load, etc.).
Numbers of commits walked for the committed date heuristics, that is
167,468 agrees with your results; 167,496 (+28) for corrected commit
date (generation number v2) is significantly smaller (-468,083) than
635,579 reported for topological levels (generation number v1).

I suspect that there are cases (with date skew) where corrected commit
date gives better performance than committer date heuristics, and I am
quite sure that generation number v2 can give better performance in case
where paint_down_to_common() uses generation numbers.

.................................................................

Here begins separate second change, which is not put into separate
commit because it is fairly tightly connected to the change described
above.  It would be good idea, in my opinion, to add a sentence that
explicitely marks this switch, for example:

  This change accidentally broke fragile t6404-recursive-merge test.
  t6404-recursive-merge setups a unique repository...

Maybe with s/accidentaly/incidentally/.

Or add some other way of connection those two parts of the commit
messages.

> t6404-recursive-merge setups a unique repository where all commits have
> the same committer date without well-defined merge-base.
>
> While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
> date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
> conflicts' merges commits in the order:
> - Merge C with B to form a intermediate commit.
> - Merge the intermediate commit with A.
>
> With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
> use the corrected committer date, which changes the order in which
> commits are merged:
> - Merge A with B to form a intermediate commit.
> - Merge the intermediate commit with C.
>
> While resulting repositories are equivalent, 6404.4 'virtual trees were
> processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
> different merge-bases and thus have different object ids for the
> intermediate commits.
>
> As this has already causes problems (as noted in 859fdc0 (commit-graph:
> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
> within t6404-recursive-merge.

Very nice explanation.

Perhaps in the future we could make this test less fragile.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c             | 14 ++++++++++++++
>  commit-graph.h             |  8 +++++++-
>  commit-reach.c             |  2 +-
>  t/t6404-recursive-merge.sh |  5 ++++-
>  4 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d15a1399b..3de1933ede 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -705,6 +705,20 @@ int generation_numbers_enabled(struct repository *r)
>  	return !!first_generation;
>  }
>  
> +int corrected_commit_dates_enabled(struct repository *r)
> +{
> +	struct commit_graph *g;
> +	if (!prepare_commit_graph(r))
> +		return 0;
> +
> +	g = r->objects->commit_graph;
> +
> +	if (!g->num_commits)
> +		return 0;
> +
> +	return g->read_generation_data;
> +}

Very nice abstraction.

Minor issue: I wonder if it would be better to use _available() or
"_present()" rather than _enabled() suffix.

> +
>  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
>  {
>  	struct commit_graph *g = r->objects->commit_graph;
> diff --git a/commit-graph.h b/commit-graph.h
> index ad52130883..d2c048dc64 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  struct commit_graph *parse_commit_graph(struct repository *r,
>  					void *graph_map, size_t graph_size);
>  
> +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> +
>  /*
>   * Return 1 if and only if the repository has a commit-graph
>   * file and generation numbers are computed in that file.
>   */
>  int generation_numbers_enabled(struct repository *r);
>  
> -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);

This moving get_bloom_filter_settings() before generation_numbers_enabled() 
looks like accidental change.  If not, why it is here?

> +/*
> + * Return 1 if and only if the repository has a commit-graph
> + * file and generation data chunk has been written for the file.
> + */
> +int corrected_commit_dates_enabled(struct repository *r);
>

All right, nice to have documentation for the public function.

>  enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
> diff --git a/commit-reach.c b/commit-reach.c
> index 20b48b872b..46f5a9e638 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  	int i;
>  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>  
> -	if (!min_generation)
> +	if (!min_generation && !corrected_commit_dates_enabled(r))
>  		queue.compare = compare_commits_by_commit_date;
>  
>  	one->object.flags |= PARENT1;

All right, this is the meat of the first change.

> diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
> index 332cfc53fd..7055771b62 100755
> --- a/t/t6404-recursive-merge.sh
> +++ b/t/t6404-recursive-merge.sh
> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
>  export GIT_COMMITTER_DATE
>  
>  test_expect_success 'setup tests' '
> +	GIT_TEST_COMMIT_GRAPH=0 &&
> +	export GIT_TEST_COMMIT_GRAPH &&
>  	echo 1 >a1 &&
>  	git add a1 &&
>  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&

All right, we turn off running this test with commit-graph for the whole
script, not only for a single test.  As this is a setup, it would be run
even if we are skipping some tests.

> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
>  '
>  
>  test_expect_success 'combined merge conflicts' '
> -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> +	test_must_fail git merge -m final G
>  '

All right, it is no longer necessary to run this specific test with
GIT_TEST_COMMIT_GRAPH=0 as now the whole script is run with this
setting.

>  
>  test_expect_success 'result contains a conflict' '
> @@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
>  '
>  
>  test_expect_success 'virtual trees were processed' '
> +	# TODO: fragile test, relies on ambigious merge-base resolution
>  	git ls-files --stage >out &&
>  
>  	cat >expect <<-EOF &&

Good call!  Nice adding TODO comment for the future.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-11-03 17:59         ` Jakub Narębski
@ 2020-11-03 18:19           ` Junio C Hamano
  2020-11-20 10:33           ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2020-11-03 18:19 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee, Taylor Blau,
	Eric Sunshine, Abhishek Kumar

jnareb@gmail.com (Jakub Narębski) writes:

> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.

Thanks for a well reasoned review.

>
> .................................................................
>
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above.  It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
>
>   This change accidentally broke fragile t6404-recursive-merge test.
>   t6404-recursive-merge setups a unique repository...
>
> Maybe with s/accidentaly/incidentally/.

Also "setup" is not a verb.  "... sets up a unique repository".

> Or add some other way of connection those two parts of the commit
> messages.
> ...
>> As this has already causes problems (as noted in 859fdc0 (commit-graph:
>> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
>> within t6404-recursive-merge.
>
> Very nice explanation.
>
> Perhaps in the future we could make this test less fragile.

If "separate second change" is distracting, would it be an option to
fix the test before this step, perhaps?

Thanks.

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 10/10] doc: add corrected commit date info
  2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-11-04  1:37         ` Jakub Narębski
  2020-11-21  6:30           ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04  1:37 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Nice.

> ---
>  .../technical/commit-graph-format.txt         | 21 +++++--
>  Documentation/technical/commit-graph.txt      | 62 ++++++++++++++++---
>  2 files changed, 69 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index b3b58880b9..08d9026ad4 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>  
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.

All right, because we could store both generation number v1 and
generation number v2 in the commit-graph file, and we need to describe
both, the description is now consolidated and in only one place.

>  
>  - The root tree OID.
>  
> @@ -86,13 +82,26 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the generation number of the commit and
> +    * The next 8 bytes store the topological level (generation number v1)
> +      of the commit and

All right, this is updated information about CDAT chunk.

>        the commit time in seconds since EPOCH. The generation number
>        uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>  
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)

Should we mark this chunk as "[Optional]"?  Its absence is not an error.

> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.
> +    * If the corrected commit date offset cannot be stored within 31 bits,
> +      the value has its most-significant bit on and the other bits store
> +      the position of corrected commit date into the Generation Data Overflow
> +      chunk.

All right.

> +
> +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> +    * This list of 8-byte values stores the corrected commit dates for commits
> +      with corrected commit date offsets that cannot be stored within 31 bits.

A question: do we store 8-byte / 64-bit corrected commit date *directly*,
or do we store corrected commit date *offset* as 8-byte / 64-bit value?

Perhaps we should add the information that [like the EDGE chunk] it is
present only when necessary, and that it is present only when GDAT chunk
is present (it might be obvious, but it could be better to state
this explicitly).

> +

All right, this is the information about two new chunks (with the
mentioned above caveat about the clarity of the description of
overflow-handling chunk).

>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index f14a7659aa..75f71c4c7b 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
>  
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>  
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates (generation number v2)
> +2. Topological levels (generation nummber v1)

All right.

>  
> - * A commit with no parents (a root commit) has generation number one.
> +Define "corrected committer date" of a commit recursively as follows:
>  
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> +  * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.

Minor nitpick: the above point has been accidentally indented one space
more than necessary, and than is indented in other places.  Or maybe
that fixes / unifies the formatting... I am not sure.

>  
> -Equivalently, the generation number of a commit A is one more than the
> +  * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> +  * As a special case, a root commit with timestamp zero has corrected commit
> +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> +    (that is, an uncomputed corrected commit date).

All right.  Looks good.

> +
> +Define the "topological level" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has topological level of one.
> +
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
> +

All right, this just repeats what was written before, or in other words
move existing contents lower/later, just with 'generation number'
replaced by 'topological level' (though it might be not obvious from the
patch because of the latter change).

> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:
>  
> @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
>      generation numbers, then we always expand the boundary commit with highest
>      generation number and can easily detect the stopping condition.
>  
> +The properties applies to both versions of generation number, that is both
> +corrected committer dates and topological levels.
> +

I think it should be "This property" or "The property", not "The
properties"; it is a single property, a single condition.

We can alternatively say "This condition is fulfilled by both versions...",
or "This condition is true for both versions...".

>  This property can be used to significantly reduce the time it takes to
>  walk commits and determine topological relationships. Without generation
>  numbers, the general heuristic is the following:
> @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>  
> -This heuristic is currently used whenever the computation is allowed to
> +In absence of corrected commit dates (for example, old versions of Git or
> +mixed generation graph chains),
> +this heuristic is currently used whenever the computation is allowed to
>  violate topological relationships due to clock skew (such as "git log"
>  with default order), but is not used when the topological order is
>  required (such as merge base calculations, "git log --graph").

All right, this explains when commit date heuristics is used (which is
less often than before).

> @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>  
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not

All right, 64-bit GENERATION_NUMBER_INFINITY = 0xFFFFFFFFFFFFFFFF is a
bit unwieldy...

>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>  
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +We use the macro GENERATION_NUMBER_MAX for commits whose

This should be

  +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF to for commits whose
  +topological levels (generation number v1) are computed to be at least this value. We limit at
   this value since it is the largest value that can be stored in the
  +commit-graph file using the 30 bits available to topological levels. This

We need to use "topological levels" or "generation numbers v1" thorough
the rest of this section.

>  generation numbers are computed to be at least this value. We limit at
>  this value since it is the largest value that can be stored in the
>  commit-graph file using the 30 bits available to generation numbers. This
> @@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>

All right, I agree that we don't need to write about overflow handling
for storing corrected committer dates (generation number v2) as offsets;
this is something format-specific, and this documentation is more about
using commit-graph data.  What is present in commit-graph-format.txt
should be enough information.

Sidenote: I wonder if other Git implementations such as JGit, Dulwich,
Gitoxide (gix), go-git have support for the commit-graph file...

> +## Handling Mixed Generation Number Chains
> +
> +With the introduction of generation number v2 and generation data chunk, the
> +following scenario is possible:
> +
> +1. "New" Git writes a commit-graph with the corrected commit dates.
> +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> +
> +A naive approach of using the newest available generation number from
> +each layer would lead to violated expectations: the lower layer would
> +use corrected commit dates which are much larger than the topological
> +levels of the higher layer. For this reason, Git inspects each layer to
> +see if any layer is missing corrected commit dates. In such a case, Git
> +only uses topological level

This should end in full stop:

  +only uses topological levels.

Or maybe we should expand the last sentence a bit:

  +only uses topological levels for generation numbers.

Sidenote: it is a good explanation, even if Git can make use of the
property described below that only topmost layers might be missing
corrected commit graph by the construction (so it needs to check only
the top layer).

> +
> +When writing a new layer in split commit-graph, we write corrected commit
> +dates if the topmost layer has corrected commit dates written. This
> +guarantees that if a layer has corrected commit dates, all lower layers
> +must have corrected commit dates as well.
> +
> +When merging layers, we do not consider whether the merged layers had corrected
> +commit dates. Instead, the new layer will have corrected commit dates if and
> +only if all existing layers below the new layer have corrected commit dates.
> +

Perhaps we should explicitly say that when rewriting split commit-graph
as a single file (`--split=replace`) then the newly created single layer
would store corrected commit dates.

>  ## Deleting graph-{hash} files
>  
>  After a new tip file is written, some `graph-{hash}` files may no longer

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-11-03 11:44           ` Abhishek Kumar
@ 2020-11-04 16:45             ` Jakub Narębski
  2020-11-05 14:05               ` Philip Oakley
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04 16:45 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Hello Abhishek,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> ...
>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>> 
>> Somewhere in the commit message we should also describe that this commit
>> changes how commit-graph is verified: from checking that the generation
>> number agrees with _topological level definition_, that is that for a
>> given commit it is 1 more than maximum of its parents (with the caveat
>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>> checking that slightly weaker condition fulfilled by both topological
>> levels (generation number v1) and by corrected commit date (generation
>> number v2) that for a given commit its generation number is 1 more than
>> maximum of its parents or larger.
>
> Sure, that makes sense. Will add.

Actually this description should match whatever we decide about
mechanism for verifying correctness of generation numbers (see below).
Because we have to choose one.

>> 
>> But, as far as I understand it, current code does not handle correctly
>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>> 
>> On the other hand we could have simpy use functional check, that
>> generation number used (which can be v1 or v2, or any similar other)
>> fulfills the reachability condition for each edge, which can be
>> simplified to checking that generation(parents) <= generation(commit).
>> If the reachability condition is true for each edge, then it is true for
>> each path, and for each commit.

See below.

>>> ---
>>>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>>  1 file changed, 23 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index cedd311024..03948adfce 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	else if (generation_a > generation_b)
>>>  		return 1;
>>>  
>>> -	/* use date as a heuristic when generations are equal */
>>> -	if (a->date < b->date)
>>> -		return -1;
>>> -	else if (a->date > b->date)
>>> -		return 1;
>> 
>> Why this change?  It is not described in the commit message.
>> 
>> Note that while this tie-breaking fallback doesn't make much sense for
>> corrected committer date generation number v2, this tie-breaking helps
>> if we have to use topological levels (generation number v2).
>> 
>
> Right, I should have mentioned this change (and it's not something that
> makes a difference either way).
>
> We call commit_gen_cmp() only when we are sorting commits by generation
> to speed up computation of Bloom filters i.e. while writing a commit
> graph (either split commit-graph or a simple commit-graph).
>
> Since we are always computing and storing corrected commit date when we
> are writing (whether we write a GDAT chunk or not), using date as
> heuristic is longer required.

Thanks.  This description really should be added to the commit message,
because (yet again?) I was confused by this change.

Sidenote: it is not obvious at least to me that this function is used
only for sorting commits to speed up computation of Bloom filters while
writing the commit-graph (`git commit-graph write --changed-paths [other
options]`).

>>>  	return 0;
>>>  }
>>>  
>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  					ctx->commits.nr);
>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>> 
>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>> enough; no need for 'timestamp_t' for 'level' variable.
>> 
>>> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>
>
> We need the 'timestamp_t' as we are comparing level with the now 64-bits
> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
> timestamp_t. I have a hunch that since we are explicitly using a fixed
> width data type, compiler is unwilling to type coerce into broader data
> types.
>
> Advice on this appreciated.

All right, so the wider type is used because of comparison with
wide-uint GENERATION_NUMBER_INFINITY.  I stand corrected.

[...]
>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>>>  			continue;
>>>  
>>> -		/*
>>> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>> -		 * extra logic in the following condition.
>>> -		 */
>>> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
>>> -			max_generation--;
>>> -
>> 
>> Perhaps in the future we should check that both topological levels, and
>> also corrected committer date (if it exists) for correctness according
>> to their definition.  Then the above removed part would be restored (but
>> with s/max_generation/max_level/).
>> 
>>>  		generation = commit_graph_generation(graph_commit);
>>> -		if (generation != max_generation + 1)
>>> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>> +		if (generation < max_generation + 1)
>>> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>> 
>> All right, so we relaxed the check so that it will be fulfilled by
>> generation number v2 (and also by generation number v1, as it implies
>> the more strict check for v1).
>> 
>> What would happen however if generation holds topological levels, and it
>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>> so the `git commit-graph verify` would incorrectly say that there is
>> a problem with generation number, while there isn't one (false positive
>> detection of error).
>
> Alright, so the above block still makes sense if we are working with
> topological levels but not with corrected commit dates. Instead of
> removing it, I will modify the condition to check that one of our parents
> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.

That is one of the 3 possible solutions I can think of.


I. First solution is to switch from checking that generation number
matches its definition to checking that the [weaker] reachability
condition for the generation number is true, that is:

 	if (generation < max_generation)
 		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

The [weaker] reachability condition for generation numbers states that

   A reachable from B    =>    gen(A) <= gen(B)

This condition is true even if one or more generation numbers is
GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
outside of the serialized commit-graph, doesn't matter and cannot happen
during verification of the commit-graph data by definition).

This means that if P* is the parent of C with the maximal generation
number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
true), then there is a problem with generation number.

This is why I thought you were going for, and what I have proposed.

Advantages:
- we are testing what actually matters for speeding up reachability
  queries, namely that the reachability property holds true
- the test works for generation number v1, generation number v2,
  and any possible future use-compatibile generation number
  (not that I think we would need any)
- least complicated solution

Disadvantages:
- weaker test that we have had for generation number v1 (topological
  levels), and weaker that possible test for generation number v2
  that we could have (see below)


II. Verify corrected committed date (generation number v2) if available,
and verify topological levels (generation number v1) otherwise, checking
that it matches the definition of it -- using version-specific checks.

This would probably mean adding a conditional around the code verifying
that given generation number is correct, possibly:

  if (g->read_generation_data) {
  	/* verify corrected commit date */
  } else {
  	/* current code for verifying topological levels */
  }

II.a. For topological levels (generation number v1) we would continue
checking that it matches the definition, that is that the following
condition holds:

  gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1

This includes code for handling the case where `max_generation`, holding 
max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.

II.b. For corrected commiter dates (generation number v2) we can use the
code proposed by this revision of this commit, namely we check if the
following condition holds:

  gen(P) + 1 <= gen(C)   for each P \in parents(C)

or, in other words:

  max_{P: P ∈ parents(C)} { gen(P) } + 1  <=  gen(C)

Which could be checked using the following code (i.e. current state
after this revision of this patch):

	if (generation < max_generation + 1)
		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

This is what I think you are proposing now.

Additionally, theoretically we could also check that the following
condition holds for corrected commiter date:

   committer_date(C) <= gen_v2(C)

but this is automatically fufilled because we use non-negative offsets
to store corrected committed date info.

Alternatively we can check for compliance with the definition of the
corrected committer date:

  if (max_generation + 1 <= graph_commit->date) {
  	/* commit date does not need correction */
  	if (generation != graph_commit->date)
    	graph_report(_("commit-graph corrected commit date for commit %s "
  		               "is %"PRItime" != %"PRItime" commit date"),
                     ...);
  } else {
  	if (generation != max_generation + 1)
  		graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
                     ...);
  }

Though I think it might be overkill.

Advantages:
- more strict tests, checking generation numbers (v2 if present, v1
  otherwise) against their definition
- if there is no GDAT chunk, verify works just like it did before

Disadvantages:
- more complicated code
- possibly measurable performance degradation due to extra conditional


III. Like II., but if there is generation numbers chunk (GDAT chunk), we
verify *both* topological levels (v1) and corrected commit date (v2)
against their definition.  If GDAT chunk is not present, it reduces to
current code (before this patch series).

Advantages:
- if there is no GDAT chunk, verify works just like it did before
- most strict tests, verifying all the data: both generation number v1
  and generation number v2 -- if possible

Disadvantages:
- most complex code; we need to somehow extract topological levels
  if the GDAT chunk is present (they are not on graph data slab in this
  case); I have not even started to think how it could be done
- slower verification

> Suprised that no test breaks by this change.

I don't whink we have any test that created commit graph with
topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
expensive and have to be of course protected by GIT_TEST_LONG aka
EXPENSIVE prerequisite.

  # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
  test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
  	rm -rf long_chain &&
  	git init long_chain &&
  	test_commit_bulk -C long_chain 1073741824 &&
    (
  		cd long_chain &&
  		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
  		git commit-graph verify
    )
  '

This however lies slightly outside the scope of this patch series,
though if you could add this test (in a separate patch), after testing
it, it would be very nice.

>
> I have also moved changes in the verify function to the next patch, as
> we cannot write or read corrected commit dates yet - so little sense in
> modifying verify.

I think putting changes to the verify function in a separate patch, be
it before or after this one (depending on the choice of the algorithm
for verification, see above) would be a good idea.

>> 
>> Sidenote: I think we don't have to worry about having to introduce
>> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
>> disck representation) corrected commiter date is the same as of commiter
>> date itself, plus some, and I don't see us coming close to 64-bit limit
>> of timestamp_t for commit dates.
>> 
>>>  				     oid_to_hex(&cur_oid),
>>>  				     generation,
>>>  				     max_generation + 1);

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (9 preceding siblings ...)
  2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-11-04 23:37       ` Jakub Narębski
  2020-11-22  5:31         ` Abhishek Kumar
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
  11 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04 23:37 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar

Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Thanks a lot for continued working on this patch series.

>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.

Very minor nitpick: it would make it easier to read if the commands
themself would be put inside single quotes or backticks, e.g. `git log
--graph` and `git merge-base v4.8 v4.9`.

I wonder if it is worth mentioning (probably not) that this performance
hit was the reason why since 091f4cf3 `git merge-base` uses committer
date heuristics unless there is a cutoff and using topological levels
(generation date v1) is expected to give better performance.

>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem 1

Minor issue: this looks a bit strange; is there an error in formatting
this part?

> [https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/], we also
> needed a way to distinguish between the old and new generation number
> without incrementing graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk.

Nice writeup about the history of generation number v2, much appreciated.

>                    Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:

Minor nitpick: it would be probably better to use "is defined as
follows." instead of "is defined as:".

>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents plus 1. Then 
> corrected commit date offset is the difference between corrected commit date
> of C and commit date of C. As a special case, a root commit with timestamp
> zero has corrected commit date of 1 to be able distinguish it from
> GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).

Very minor nitpick: s/with timestamp/with *the* timestamp/, and
s/to be able distinguish/to be able *to* distinguish/ (without the '*'
used to mark the additions).

>
> We will introduce an additional commit-graph chunk, Generation Data chunk,

Or "Generation DATa chunk", if we want to emphasize where its name came
from, or even "Generation DATa (GDAT) chunk". But it is fine as it is
now, though it would be good idea to write "Generation Data (GDAT)
chunk" to explicitly state its name / shortcut.

> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

Nice writeup of handling the backward compatibility.

>
> While storing corrected commit date offsets saves us 4 bytes per commit (as
> compared with storing corrected commit dates directly), it's possible for
> the offset to overflow the space allocated. To handle such cases, we
> introduce a new chunk, Generation Data Overflow (GDOV) that stores the
> corrected commit date. For overflowing offsets, we set MSB and store the
> position into the GDOV chunk, in a mechanism similar to the Extra Edges list
> chunk.

Very minor suggestion: perhaps it would be better to use "it's however
possible".

Very minor suggestion: "it's possible for the offset to overflow" could
be simplified to just "the offset can overflow"... though the simplified
version loses a bit of hint that the overflow should be very rare in
real repositories.

But it is just fine as it is now; I am not a native English speaker to
judge which version is better.

>
> For mixed generation number environment (for example new Git on the command
> line, old Git used by GUI client), we can encounter a mixed-chain
> commit-graph (a commit-graph chain where some of split commit-graph files
> have GDAT chunk and others do not). As backward compatibility is one of the
> goals, we can define the following behavior:
>
> While reading a mixed-chain commit-graph version, we fall back on
> topological levels as corrected commit dates and topological levels cannot
> be compared directly.
>
> While writing on top of a split commit-graph, we check if the tip of the
> chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
> chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
> chunk, rest of the chain does too.
>
> If the topmost split commit-graph file does not have a GDAT chunk (meaning
> it has been appended by the old Git), we write without GDAT chunk. We do
> write a GDAT chunk when the existing chain does not have GDAT chunk - when
> we are writing to the commit-graph chain with the 'replace' strategy.

I think the last paragraph can be simplified (or added to) by explicitly
stating the goal:

  When adding new layer to the split commit-graph file, and when merging
  some or all layers (replacing them in the latter case), the new layer
  will have GDAT chunk if and only if in the final result there would be
  no layer without GDAT chunk just below it.

>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.

You are welcome.

>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 4:
>
>  * Added GDOV to handle overflows in generation data.
>  * Added a test for writing tip graph for a generation number v2 graph chain
>    in t5324-split-commit-graph.sh
>  * Added a section on how mixed generation number chains are handled in 
>    Documentation/technical/commit-graph-format.txt
>  * Reverted unimportant whitespace style changes in commit-graph.c
>  * Added header comments about the order of comparision for
>    compare_commits_by_gen_then_commit_date in commit.h,
>    compare_commits_by_gen in commit-graph.h
>  * Elaborated on why t6404 fails with corrected commit date and must be run
>    with GIT_TEST_COMMIT_GRAPH=1 in the commit "commit-reach: use corrected
>    commit dates in paint_down_to_common()"
>  * Elaborated on write behavior for mixed generation number chains in the
>    commit "commit-graph: use generation v2 only if entire chain does"
>  * Added notes about adding the topo_level slab to struct
>    write_commit_graph_context as well as struct commit_graph.
>  * Clarified commit message for "commit-graph: consolidate
>    fill_commit_graph_info"
>  * Removed the claim "GDAT can store future generation numbers" because it
>    hasn't been tested yet.
>
> Changes in version 3:
>
>  * Reordered patches as discussed in 2
>    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
>    .
>  * Split "implement corrected commit date" into two patches - one
>    introducing the topo level slab and other implementing corrected commit
>    dates.
>  * Extended split-commit-graph tests to verify at the end of test.
>  * Use topological levels as generation number if any of split commit-graph
>    files do not have generation data chunk.
>
> Changes in version 2:
>
>  * Add tests for generation data chunk.
>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.
>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().
>  * Update technical documentation.
>  * Handle mixed generation commit chains.
>  * Improve commit messages for "commit-graph: fix regression when computing
>    bloom filter", "commit-graph: consolidate fill_commit_graph_info",
>  * Revert unnecessary whitespace changes.
>  * Split uint_32 -> timestamp_t change into a new commit.

After careful review of those 10 patches it looks like the series is
close to being ready, requiring only small changes to progress.

> Abhishek Kumar (10):
>   commit-graph: fix regression when computing Bloom filters

    All good, beside possible improvement to the commit message.
    Thanks to Taylor Blau for discovering possible reason for strange
    no change in performance.

>   revision: parse parent in indegree_walk_step()

    Looks good.

>   commit-graph: consolidate fill_commit_graph_info

    Needs to fix now duplicated test names (minor change).
    Proposed possible improvement to the commit message.

>   commit-graph: return 64-bit generation number

    Needs fixing due to mismerge: there should be no switch from
    using GENERATION_NUMBER_ZERO to using GENERATION_NUMBER_INFINITY.
    Possible minor improvement to the commit message.

>   commit-graph: add a slab to store topological levels

    Possible minor improvement to the commit message.
    
    There is also not very important issue, but something that would be
    nice to explain, namely that checks for GENERATION_NUMBER_INFINITY 
    can never be true, as topo_level_slab_at() returns 0 for commits
    outside the commit-graph, not GENERATION_NUMBER_INFINITY.  It works
    but it is not obvious why.

>   commit-graph: implement corrected commit date

    The change to commit-graph verification needs fixing, and we need to
    decide how verifying generation numbers should work.  Perhaps a test
    for handling topological level of GENERATION_NUMBER_V1_MAX could be
    added (though this might be left for ater).

    The changes to `git commit-graph verify` code could be put into
    separate patch, either before or after this one.

>   commit-graph: implement generation data chunk

    Proposed possible improvement to the commit message.
    The commit message does not explain why given shape of history is
    needed to test handling corrected commit date offset overflow.

    Proposed minor corrections to the coding style.

    Instead of looping again through all commits when handling overflow
    in corrected commit date offsets, while there should be at most a
    few commits needing it, why not save those commits on list and loop
    only through those commits?  Though this _possible_ performance
    improvement could be left to the followup...

    test_commit_with_date() could be instead implemented via adding
    `--date <date>` option to test_commit() in test-lib-functions.sh.

    Also, to reduce "noise" in this patch, the rename of
    run_three_modes() to run_all_modes() and test_three_modes() to
    test_all_modes() could have been done in a separate preparatory
    patch. It would be pure refactoring patch, without introducing any
    new functionality.  But it is not something that is necessary.

>   commit-graph: use generation v2 only if entire chain does

    Proposed possible improvement to the commit message.
    Proposed minor corrections to the coding style (also in tests).

    There is a question whether merging layers or replacing them should
    honor GIT_TEST_COMMIT_GRAPH_NO_GDAT.

    Tests possibly could be made more strict, and check more things
    explicitly. One test we are missing is testing that merging layers
    is done correctly, namely that if we are merging layers in split
    commit-graph file, and the layer below the ones we are merging lacks
    GDAT chunk, then the result of the merge should also be without GDAT
    chunk -- but that might be left for later.

>   commit-reach: use corrected commit dates in paint_down_to_common()

    This patch consist of two slightly interleaved changes, which
    possibly could be separated: change to paint_down_to_common() and
    change to t6404-recursive-merge test.

    In the commit message for the paint_down_to_common() we should
    explicitly mention 091f4cf3, which this one partially reverts.

    Possible accidental change, question about function naming.

>   doc: add corrected commit date info

    Needs further improvements to the documentation, like adding
    "[Optional]" to chunk description, and leftover switching from
    "generation numbers" to "topological levels" in one place.

>
>  .../technical/commit-graph-format.txt         |  21 +-
>  Documentation/technical/commit-graph.txt      |  62 ++++-
>  commit-graph.c                                | 256 ++++++++++++++----
>  commit-graph.h                                |  17 +-
>  commit-reach.c                                |  38 +--
>  commit-reach.h                                |   2 +-
>  commit.c                                      |   4 +-
>  commit.h                                      |   5 +-
>  revision.c                                    |  13 +-
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |   4 +-
>  t/t5000-tar-tree.sh                           |  20 +-
>  t/t5318-commit-graph.sh                       |  70 ++++-
>  t/t5324-split-commit-graph.sh                 |  98 ++++++-
>  t/t6404-recursive-merge.sh                    |   5 +-
>  t/t6600-test-reach.sh                         |  68 ++---
>  upload-pack.c                                 |   2 +-
>  18 files changed, 534 insertions(+), 158 deletions(-)
>
>
> base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-11-04 16:45             ` Jakub Narębski
@ 2020-11-05 14:05               ` Philip Oakley
  2020-11-05 18:22                 ` Junio C Hamano
  0 siblings, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-05 14:05 UTC (permalink / raw)
  To: Jakub Narębski, Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Hi Abhishek,

On 04/11/2020 16:45, Jakub Narębski wrote:
> Hello Abhishek,
>
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>> ...
>>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> Somewhere in the commit message we should also describe that this commit
>>> changes how commit-graph is verified: from checking that the generation
>>> number agrees with _topological level definition_, that is that for a
>>> given commit it is 1 more than maximum of its parents (with the caveat
>>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>>> checking that slightly weaker condition fulfilled by both topological
>>> levels (generation number v1) and by corrected commit date (generation
>>> number v2) that for a given commit its generation number is 1 more than
>>> maximum of its parents or larger.
>> Sure, that makes sense. Will add.
> Actually this description should match whatever we decide about
> mechanism for verifying correctness of generation numbers (see below).
> Because we have to choose one.

This may be not part of the the main project, but could you consider, if
time permits, also adding some entries into the Git Glossary (`git help
glossary`) for the various terms we are using here and elsewhere, e.g.
'topological levels', 'generation number', 'corrected commit date' (and
its fancy technical name for the use of date heuristics e.g. the
'chronological ordering';).

The glossary can provide a reference, once the issues are resolved. The
History Simplification and Commit Ordering section of git-log maybe a
useful guide to some of the terms that would link to the glossary.
--
Philip

>>> But, as far as I understand it, current code does not handle correctly
>>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>>>
>>> On the other hand we could have simpy use functional check, that
>>> generation number used (which can be v1 or v2, or any similar other)
>>> fulfills the reachability condition for each edge, which can be
>>> simplified to checking that generation(parents) <= generation(commit).
>>> If the reachability condition is true for each edge, then it is true for
>>> each path, and for each commit.
> See below.
>
>>>> ---
>>>>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>>>  1 file changed, 23 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/commit-graph.c b/commit-graph.c
>>>> index cedd311024..03948adfce 100644
>>>> --- a/commit-graph.c
>>>> +++ b/commit-graph.c
>>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>>  	else if (generation_a > generation_b)
>>>>  		return 1;
>>>>  
>>>> -	/* use date as a heuristic when generations are equal */
>>>> -	if (a->date < b->date)
>>>> -		return -1;
>>>> -	else if (a->date > b->date)
>>>> -		return 1;
>>> Why this change?  It is not described in the commit message.
>>>
>>> Note that while this tie-breaking fallback doesn't make much sense for
>>> corrected committer date generation number v2, this tie-breaking helps
>>> if we have to use topological levels (generation number v2).
>>>
>> Right, I should have mentioned this change (and it's not something that
>> makes a difference either way).
>>
>> We call commit_gen_cmp() only when we are sorting commits by generation
>> to speed up computation of Bloom filters i.e. while writing a commit
>> graph (either split commit-graph or a simple commit-graph).
>>
>> Since we are always computing and storing corrected commit date when we
>> are writing (whether we write a GDAT chunk or not), using date as
>> heuristic is longer required.
> Thanks.  This description really should be added to the commit message,
> because (yet again?) I was confused by this change.
>
> Sidenote: it is not obvious at least to me that this function is used
> only for sorting commits to speed up computation of Bloom filters while
> writing the commit-graph (`git commit-graph write --changed-paths [other
> options]`).
>
>>>>  	return 0;
>>>>  }
>>>>  
>>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>>  					ctx->commits.nr);
>>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>>>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>>> enough; no need for 'timestamp_t' for 'level' variable.
>>>
>>>> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>>
>> We need the 'timestamp_t' as we are comparing level with the now 64-bits
>> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
>> timestamp_t. I have a hunch that since we are explicitly using a fixed
>> width data type, compiler is unwilling to type coerce into broader data
>> types.
>>
>> Advice on this appreciated.
> All right, so the wider type is used because of comparison with
> wide-uint GENERATION_NUMBER_INFINITY.  I stand corrected.
>
> [...]
>>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>>>>  			continue;
>>>>  
>>>> -		/*
>>>> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>>> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>>> -		 * extra logic in the following condition.
>>>> -		 */
>>>> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
>>>> -			max_generation--;
>>>> -
>>> Perhaps in the future we should check that both topological levels, and
>>> also corrected committer date (if it exists) for correctness according
>>> to their definition.  Then the above removed part would be restored (but
>>> with s/max_generation/max_level/).
>>>
>>>>  		generation = commit_graph_generation(graph_commit);
>>>> -		if (generation != max_generation + 1)
>>>> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>>> +		if (generation < max_generation + 1)
>>>> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>>> All right, so we relaxed the check so that it will be fulfilled by
>>> generation number v2 (and also by generation number v1, as it implies
>>> the more strict check for v1).
>>>
>>> What would happen however if generation holds topological levels, and it
>>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>>> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
>>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>>> so the `git commit-graph verify` would incorrectly say that there is
>>> a problem with generation number, while there isn't one (false positive
>>> detection of error).
>> Alright, so the above block still makes sense if we are working with
>> topological levels but not with corrected commit dates. Instead of
>> removing it, I will modify the condition to check that one of our parents
>> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.
> That is one of the 3 possible solutions I can think of.
>
>
> I. First solution is to switch from checking that generation number
> matches its definition to checking that the [weaker] reachability
> condition for the generation number is true, that is:
>
>  	if (generation < max_generation)
>  		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> The [weaker] reachability condition for generation numbers states that
>
>    A reachable from B    =>    gen(A) <= gen(B)
>
> This condition is true even if one or more generation numbers is
> GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
> GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
> for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
> outside of the serialized commit-graph, doesn't matter and cannot happen
> during verification of the commit-graph data by definition).
>
> This means that if P* is the parent of C with the maximal generation
> number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
> true), then there is a problem with generation number.
>
> This is why I thought you were going for, and what I have proposed.
>
> Advantages:
> - we are testing what actually matters for speeding up reachability
>   queries, namely that the reachability property holds true
> - the test works for generation number v1, generation number v2,
>   and any possible future use-compatibile generation number
>   (not that I think we would need any)
> - least complicated solution
>
> Disadvantages:
> - weaker test that we have had for generation number v1 (topological
>   levels), and weaker that possible test for generation number v2
>   that we could have (see below)
>
>
> II. Verify corrected committed date (generation number v2) if available,
> and verify topological levels (generation number v1) otherwise, checking
> that it matches the definition of it -- using version-specific checks.
>
> This would probably mean adding a conditional around the code verifying
> that given generation number is correct, possibly:
>
>   if (g->read_generation_data) {
>   	/* verify corrected commit date */
>   } else {
>   	/* current code for verifying topological levels */
>   }
>
> II.a. For topological levels (generation number v1) we would continue
> checking that it matches the definition, that is that the following
> condition holds:
>
>   gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1
>
> This includes code for handling the case where `max_generation`, holding 
> max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.
>
> II.b. For corrected commiter dates (generation number v2) we can use the
> code proposed by this revision of this commit, namely we check if the
> following condition holds:
>
>   gen(P) + 1 <= gen(C)   for each P \in parents(C)
>
> or, in other words:
>
>   max_{P: P ∈ parents(C)} { gen(P) } + 1  <=  gen(C)
>
> Which could be checked using the following code (i.e. current state
> after this revision of this patch):
>
> 	if (generation < max_generation + 1)
> 		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> This is what I think you are proposing now.
>
> Additionally, theoretically we could also check that the following
> condition holds for corrected commiter date:
>
>    committer_date(C) <= gen_v2(C)
>
> but this is automatically fufilled because we use non-negative offsets
> to store corrected committed date info.
>
> Alternatively we can check for compliance with the definition of the
> corrected committer date:
>
>   if (max_generation + 1 <= graph_commit->date) {
>   	/* commit date does not need correction */
>   	if (generation != graph_commit->date)
>     	graph_report(_("commit-graph corrected commit date for commit %s "
>   		               "is %"PRItime" != %"PRItime" commit date"),
>                      ...);
>   } else {
>   	if (generation != max_generation + 1)
>   		graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
>                      ...);
>   }
>
> Though I think it might be overkill.
>
> Advantages:
> - more strict tests, checking generation numbers (v2 if present, v1
>   otherwise) against their definition
> - if there is no GDAT chunk, verify works just like it did before
>
> Disadvantages:
> - more complicated code
> - possibly measurable performance degradation due to extra conditional
>
>
> III. Like II., but if there is generation numbers chunk (GDAT chunk), we
> verify *both* topological levels (v1) and corrected commit date (v2)
> against their definition.  If GDAT chunk is not present, it reduces to
> current code (before this patch series).
>
> Advantages:
> - if there is no GDAT chunk, verify works just like it did before
> - most strict tests, verifying all the data: both generation number v1
>   and generation number v2 -- if possible
>
> Disadvantages:
> - most complex code; we need to somehow extract topological levels
>   if the GDAT chunk is present (they are not on graph data slab in this
>   case); I have not even started to think how it could be done
> - slower verification
>
>> Suprised that no test breaks by this change.
> I don't whink we have any test that created commit graph with
> topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
> expensive and have to be of course protected by GIT_TEST_LONG aka
> EXPENSIVE prerequisite.
>
>   # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
>   test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
>   	rm -rf long_chain &&
>   	git init long_chain &&
>   	test_commit_bulk -C long_chain 1073741824 &&
>     (
>   		cd long_chain &&
>   		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
>   		git commit-graph verify
>     )
>   '
>
> This however lies slightly outside the scope of this patch series,
> though if you could add this test (in a separate patch), after testing
> it, it would be very nice.
>
>> I have also moved changes in the verify function to the next patch, as
>> we cannot write or read corrected commit dates yet - so little sense in
>> modifying verify.
> I think putting changes to the verify function in a separate patch, be
> it before or after this one (depending on the choice of the algorithm
> for verification, see above) would be a good idea.
>
>>> Sidenote: I think we don't have to worry about having to introduce
>>> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
>>> disck representation) corrected commiter date is the same as of commiter
>>> date itself, plus some, and I don't see us coming close to 64-bit limit
>>> of timestamp_t for commit dates.
>>>
>>>>  				     oid_to_hex(&cur_oid),
>>>>  				     generation,
>>>>  				     max_generation + 1);
> Best,


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
  2020-11-05 14:05               ` Philip Oakley
@ 2020-11-05 18:22                 ` Junio C Hamano
  2020-11-06 18:26                   ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Junio C Hamano @ 2020-11-05 18:22 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Jakub Narębski, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Philip Oakley <philipoakley@iee.email> writes:

> This may be not part of the the main project, but could you consider, if
> time permits, also adding some entries into the Git Glossary (`git help
> glossary`) for the various terms we are using here and elsewhere, e.g.
> 'topological levels', 'generation number', 'corrected commit date' (and
> its fancy technical name for the use of date heuristics e.g. the
> 'chronological ordering';).
>
> The glossary can provide a reference, once the issues are resolved. The
> History Simplification and Commit Ordering section of git-log maybe a
> useful guide to some of the terms that would link to the glossary.

Ah, I first thought that Documentation/rev-list-options.txt (which
is the relevant part of "git log" documentation you mention here)
already have references to deep technical terms explained in the
glossary and you are suggesting Abhishek to mimic the arrangement by
adding new and agreed-upon terms to the glossary and referring to
them from the commit-graph documentation updated by this series.

But sadly that is not the case.  What you are saying is that you
noticed that rev-list-options.txt needs a similar "the terms we use
to explain these two sections should be defined and explained in the
glossary (if they are not) and new references to glossary should be
added there" update.

In any case, that is a very good suggestion.  I agree that updating
"git log" doc may be outside the scope of Abhishek's theme, but it
would be very good to have such an update by anybody ;-)

Thanks

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
  2020-10-30 12:45         ` Jakub Narębski
@ 2020-11-06 11:25           ` Abhishek Kumar
  2020-11-06 17:56             ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-06 11:25 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Fri, Oct 30, 2020 at 01:45:29PM +0100, Jakub Narębski wrote:
> 
> ...
>
> >
> > While storing corrected commit date offset instead of the corrected
> > commit date saves us 4 bytes per commit, it's possible for the offsets
> > to overflow the 4-bytes allocated. As such overflows are exceedingly
> > rare, we use the following overflow management scheme:
> 
> Perhaps it would be good idea to write the idea in full from start, as
> the commit message is intended to be read stadalone and not in the
> context of the patch series.  On the other hand it might be too much
> detail in already [necessarily] lengthty commit message.
> 
> Perhaps something like the following proposal would read better.
> 
>   To minimize the space required to store corrected commit date, Git
>   stores corrected commit date offsets into the commit-graph file,
>   instead of corrected commit dates themselves. This saves us 4 bytes
>   per commit, decreasing the GDAT chunk size by half, but it's possible
>   for the offset to overflow the 4-bytes allocated for storage. As such
>   overflows are and should be exceedingly rare, we use the following
>   overflow management scheme:
>

Thanks, that's better.

> 
> ...
>
> > We test the overflow-related code with the following repo history:
> >
> >            F - N - U
> >           /         \
> > U - N - U            N
> >          \          /
> >            N - F - N
> 
> Do we need such complex history? I guess we need to test the handling of
> merge commits too.
> 

I wanted to test three cases - a root epoch zero commit, a commit that's
far enough in past to overflow the offset and a commit that's far enough
in the future to overflow the offset.

> >
> > Where the commits denoted by U have committer date of zero seconds
> > since Unix epoch, the commits denoted by N have committer date of
> > 1112354055 (default committer date for the test suite) seconds since
> > Unix epoch and the commits denoted by F have committer date of
> > (2 ^ 31 - 2) seconds since Unix epoch.
> >
> > The largest offset observed is 2 ^ 31, just large enough to overflow.
> >
> > [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c                | 98 +++++++++++++++++++++++++++++++++--
> >  commit-graph.h                |  3 ++
> >  commit.h                      |  1 +
> >  t/README                      |  3 ++
> >  t/helper/test-read-graph.c    |  4 ++
> >  t/t4216-log-bloom.sh          |  4 +-
> >  t/t5318-commit-graph.sh       | 70 ++++++++++++++++++++-----
> >  t/t5324-split-commit-graph.sh | 12 ++---
> >  t/t6600-test-reach.sh         | 68 +++++++++++++-----------
> >  9 files changed, 206 insertions(+), 57 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 03948adfce..71d0b243db 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
> >  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> >  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> >  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> >  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> >  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> >  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> >  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> > -#define MAX_NUM_CHUNKS 7
> > +#define MAX_NUM_CHUNKS 9
> 
> All right.
> 
> >  
> >  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
> >  
> > @@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
> >  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
> >  			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
> >  
> > +#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
> > +
> 
> All right, though the naming convention is different from the one used
> for EDGE chunk: GRAPH_EXTRA_EDGES_NEEDED and GRAPH_EDGE_LAST_MASK.
> 
> >  /* Remember to update object flag allocation in object.h */
> >  #define REACHABLE       (1u<<15)
> >  
> > @@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  				graph->chunk_commit_data = data + chunk_offset;
> >  			break;
> >  
> > +		case GRAPH_CHUNKID_GENERATION_DATA:
> > +			if (graph->chunk_generation_data)
> > +				chunk_repeated = 1;
> > +			else
> > +				graph->chunk_generation_data = data + chunk_offset;
> > +			break;
> > +
> > +		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
> > +			if (graph->chunk_generation_data_overflow)
> > +				chunk_repeated = 1;
> > +			else
> > +				graph->chunk_generation_data_overflow = data + chunk_offset;
> > +			break;
> > +
> 
> Necessary but unavoidable boilerplate for adding new chunks to the
> commit-graph file format.  All right.
> 
> >  		case GRAPH_CHUNKID_EXTRAEDGES:
> >  			if (graph->chunk_extra_edges)
> >  				chunk_repeated = 1;
> > @@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  {
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> > -	uint32_t lex_index;
> > -	uint64_t date_high, date_low;
> > +	uint32_t lex_index, offset_pos;
> > +	uint64_t date_high, date_low, offset;
> 
> All right, we are adding two new variables: `offset` to read data stored
> in GDAT chunk, and `offset_pos` to help read data from GDOV chunk if
> necessary i.e. to handle overflow in corrected commit data offset
> storage.
> 
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> > @@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	date_low = get_be32(commit_data + g->hash_len + 12);
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >  
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +	if (g->chunk_generation_data) {
> > +		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> 
> Style: why space after the `(timestamp_t)` cast operator?
> 
> Though CodingGuidelines do not say anything on this topic... perhaps the
> space after cast operator makes it more readable?
> 
> > +
> > +		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> 
> All right, so the CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW is equivalent of
> GRAPH_EXTRA_EDGES_NEEDED.
> 
> > +			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
> 
> Hmmm... instead of using bitwise and on an equivalent to the
> GRAPH_EDGE_LAST_MASK, we utilize the fact that we know that the MSB bit
> is set, so we can clear it with bitwise xor.  Clever trick.
>
> 
> > +			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> > +		} else
> > +			graph_data->generation = item->date + offset;
> 
> All right, this handles the case when we have generation number v2, with
> or without overflow.
> 
> > +	} else
> > +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 
> All right, this handles the case where we have only generation number
> v1, like for commit-graph file written by old Git.
> 
> >  
> >  	if (g->topo_levels)
> >  		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> > @@ -942,6 +969,7 @@ struct write_commit_graph_context {
> >  	struct packed_oid_list oids;
> >  	struct packed_commit_list commits;
> >  	int num_extra_edges;
> > +	int num_generation_data_overflows;
> >  	unsigned long approx_nr_objects;
> >  	struct progress *progress;
> >  	int progress_done;
> > @@ -960,7 +988,8 @@ struct write_commit_graph_context {
> >  		 report_progress:1,
> >  		 split:1,
> >  		 changed_paths:1,
> > -		 order_by_pack:1;
> > +		 order_by_pack:1,
> > +		 write_generation_data:1;
> >  
> >  	struct topo_level_slab *topo_levels;
> >  	const struct commit_graph_opts *opts;
> 
> All right, this adds necessary fields to `struct write_commit_graph_context`.
> 
> > @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
> >  	return 0;
> >  }
> >  
> > +static int write_graph_chunk_generation_data(struct hashfile *f,
> > +					      struct write_commit_graph_context *ctx)
> > +{
> > +	int i, num_generation_data_overflows = 0;
> 
> Minor nitpick: in my opinion there should be empty line here, between
> the variables declaration and the code... however not all
> write_graph_chunk_*() functions have it.
> 
> > +	for (i = 0; i < ctx->commits.nr; i++) {
> > +		struct commit *c = ctx->commits.list[i];
> > +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> > +		display_progress(ctx->progress, ++ctx->progress_cnt);
> 
> All right.
> 
> > +
> > +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> > +			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> > +			num_generation_data_overflows++;
> > +		}
> 
> Hmmm... shouldn't we store these commits that need overflow handling
> (with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
> in a list or a queue, to remember them for writing GDOV chunk?
> 

We could, although write_graph_chunk_extra_edges() (just like this function)
prefers to iterate over all commits again. Both octopus merges and
overflowing corrected commit dates are exceedingly rare, might be
worthwhile to trade some memory to avoid looping again.

> We could store oids, or we could store commits themselves, for example:
> 
> 		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> 			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> 			num_generation_data_overflows++;
> 
> 			ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
> 			ctx->commits.list[ctx->gdov_commits.nr] = c;
>             ctx->gdov_commits.nr++;
> 		}
> 
> Though in the above proposal we could get rid of `num_generation_data_overflows`, 
> as it should be the same as `ctx->gdov_commits.nr`.
> 
> I have called the extra commit list member of write_commit_graph_context
> `gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`, 
> or similar more descriptive name.
> 
> > +
> > +		hashwrite_be32(f, offset);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> All right.
> 
> > +
> > +static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
> > +						       struct write_commit_graph_context *ctx)
> > +{
> > +	int i;
> > +	for (i = 0; i < ctx->commits.nr; i++) {
> 
> Here we loop over *all* commits again, instead of looping over those
> very rare commits that need overflow handling for their corrected commit
> date data.
> 
> Though this possible performance issue^* could be fixed in the future commit.
> 
> *) It needs to be actually benchmarked which version is faster.
> 
> ...
> 
> >  
> >  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> > @@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
> >  
> >  test_expect_success 'git commit-graph verify' '
> >  	cd "$TRASH_DIRECTORY/full" &&
> > -	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> > -	git commit-graph verify >output
> > +	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> > +	git commit-graph verify >output &&
> 
> All right, this simply adds GIT_TEST_COMMIT_GRAPH_NO_GDAT=1.  I assume
> this is needed because this test is also setup for the following commits
> _without_ even saying that in the test name (bad practice, in my
> opinion), and the comment above this test says the following:
> 
>   # the verify tests below expect the commit-graph to contain
>   # exactly the commits reachable from the commits/8 branch.
>   # If the file changes the set of commits in the list, then the
>   # offsets into the binary file will result in different edits
>   # and the tests will likely break.
> 
> So the following tests are fragile (though perhaps unavoidably fragile),
> and without this change they would not work, I assume.
> 
> > +	graph_read_expect 9 extra_edges
> 
> I guess that this is here to check that GIT_TEST_COMMIT_GRAPH_NO_GDAT=1
> work as intended, and that the following "verify" tests wouldn't break.
> I understand its necessity, even if I don't quite like having a test
> that checks multiple things.  This is a minor issue, though.
> 
> All right.
> 
> 
> We might want to have a separate test that checks that we get
> commit-graph with and without GDAT chunk depending on whether we use
> GIT_TEST_COMMIT_GRAPH_NO_GDAT=1.  On the other hand, this environment
> variable is there purely for tests, so the question is should we test
> the test infrastructure?
> 


> >  '
> >  
> >  NUM_COMMITS=9
> > @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
> >  	)
> >  '
> >  
> > +test_commit_with_date() {
> > +  file="$1.t" &&
> > +  echo "$1" >"$file" &&
> > +  git add "$file" &&
> > +  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
> > +  git tag "$1"
> > +}
> 
> Here we add a helper function.  All right.
> 
> I wonder though if it wouldn't be a better idea to add `--date <date>`
> option to the test_commit() function in test-lib-functions.sh (which
> option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
> set notick=yes).
> 

Yes, that's a better idea - I didn't know how to change test_commit()
well enough to tinker with what's working.

> For example:
> 
> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
> index f1ae935fee..a1f9a2b09b 100644
> --- a/t/test-lib-functions.sh
> +++ b/t/test-lib-functions.sh
> @@ -202,6 +202,12 @@ test_commit () {
>  		--signoff)
>  			signoff="$1"
>  			;;
> +        --date)
> +            notick=yes
> +            GIT_COMMITTER_DATE="$2"
> +            GIT_AUTHOR_DATE="$2"
> +            shift
> +            ;;
>  		-C)
>  			indir="$2"
>  			shift
> 
> 
> > +
> 
> It would be nice to have there comment describing the shape of the
> revision history we generate here, that currenly is present only in the
> commmit message.
> 
> # We test the overflow-related code with the following repo history:
> #
> #               4:F - 5:N - 6:U
> #              /               \
> # 1:U - 2:N - 3:U               M:N
> #              \               /
> #               7:N - 8:F - 9:N
> #
> # Here the commits denoted by U have committer date of zero seconds
> # since Unix epoch, the commits denoted by N have committer date
> # starting from 1112354055 seconds since Unix epoch (default committer
> # date for the test suite), and the commits denoted by F have committer
> # date of (2 ^ 31 - 2) seconds since Unix epoch.
> #
> # The largest offset observed is 2 ^ 31, just large enough to overflow.
> #

Yes, it would. Added.
> 
> > +test_expect_success 'overflow corrected commit date offset' '
> > +	objdir=".git/objects" &&
> > +	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
> > +	FUTURE_DATE="@2147483646 +0000" &&
> 
> It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
> FUTURE_DATE other way.
> 
> Wouldn't be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?

It would, for some reason - I couldn't figure out the valid format for
this. Changed.

> 
> > +	test_oid_cache <<-EOF &&
> > +	oid_version sha1:1
> > +	oid_version sha256:2
> > +	EOF
> > +	cd "$TRASH_DIRECTORY" &&
> > +	mkdir repo &&
> > +	cd repo &&
> > +	git init &&
> > +	test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
> > +	test_commit 2 &&
> > +	test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
> > +	git commit-graph write --reachable &&
> > +	graph_read_expect 3 generation_data &&
> > +	test_commit_with_date 4 "$FUTURE_DATE" &&
> > +	test_commit 5 &&
> > +	test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
> > +	git branch left &&
> > +	git reset --hard 3 &&
> > +	test_commit 7 &&
> > +	test_commit_with_date 8 "$FUTURE_DATE" &&
> > +	test_commit 9 &&
> > +	git branch right &&
> > +	git reset --hard 3 &&
> > +	git merge left right &&
> 
> We have test_merge() function in test-lib-functions.sh, perhaps we
> should use it here.
> 
> > +	git commit-graph write --reachable &&
> > +	graph_read_expect 10 "generation_data generation_data_overflow" &&
> 
> All right, we write the commit-graph and check that it has both GDAT and
> GDOV chunks present.
> 
> > +	git commit-graph verify
> 
> All right, we checks that created commit graph with GDAT and GDOV passes
> 'git commit-graph verify` checks.
> 
> > +'
> > +
> > +graph_git_behavior 'overflow corrected commit date offset' repo left right
> 
> All right, here we compare the Git behavior with the commit-graph to the
> behavior without it... however I think that those two tests really
> should have distinct (different) test names. Currently they both use
> 'overflow corrected commit date offset'.
> 

Following the earlier tests, the first test could be "set up and verify
repo with generation data overflow chunk" and the git behavior test can
be "generation data overflow chunk repo"

> > +
> >  test_done
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index c334ee9155..651df89ab2 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
> >  	infodir=".git/objects/info" &&
> >  	graphdir="$infodir/commit-graphs" &&
> >  	test_oid_cache <<-EOM
> > -	shallow sha1:1760
> > -	shallow sha256:2064
> > +	shallow sha1:2132
> > +	shallow sha256:2436
> >  
> > -	base sha1:1376
> > -	base sha256:1496
> > +	base sha1:1408
> > +	base sha256:1528
> >  
> >  	oid_version sha1:1
> >  	oid_version sha256:2
> > @@ -31,9 +31,9 @@ graph_read_expect() {
> >  		NUM_BASE=$2
> >  	fi
> >  	cat >expect <<- EOF
> > -	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
> > +	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> 
> All right, we now expect the commit graph to include the GDAT chunk...
> though shouldn't be there old expected value for no GDAT, for future
> tests?  But perhaps this is not necessary.
> 
> Note that I have not checked the details, but it looks OK to me.
> 
> > diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> > index f807276337..e2d33a8a4c 100755
> > --- a/t/t6600-test-reach.sh
> > +++ b/t/t6600-test-reach.sh
> > @@ -55,10 +55,13 @@ test_expect_success 'setup' '
> >  	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
> >  	mv .git/objects/info/commit-graph commit-graph-half &&
> >  	chmod u+w commit-graph-half &&
> > +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> > +	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> > +	chmod u+w commit-graph-no-gdat &&
> 
> All right, this prepares for testing one more mode.  The run_all_modes()
> function would test the following cases:
>  - no commit-graph
>  - commit-graph for all commits, with GDAT
>  - commit-graph with half of commits, with GDAT
>  - commit-graph for all commits, without GDAT
> 
> >  	git config core.commitGraph true
> >  '
> >  
> > -run_three_modes () {
> > +run_all_modes () {
> >  	test_when_finished rm -rf .git/objects/info/commit-graph &&
> >  	"$@" <input >actual &&
> >  	test_cmp expect actual &&
> > @@ -67,11 +70,14 @@ run_three_modes () {
> >  	test_cmp expect actual &&
> >  	cp commit-graph-half .git/objects/info/commit-graph &&
> >  	"$@" <input >actual &&
> > +	test_cmp expect actual &&
> > +	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> > +	"$@" <input >actual &&
> >  	test_cmp expect actual
> >  }
> >  
> > -test_three_modes () {
> > -	run_three_modes test-tool reach "$@"
> > +test_all_modes () {
> > +	run_all_modes test-tool reach "$@"
> >  }
> 
> All right.
> 
> Though to reduce "noise" in this patch, the rename of run_three_modes()
> to run_all_modes() and test_three_modes() to test_all_modes() could have
> been done in a separate preparatory patch.  It would be pure refactoring
> patch, without introducing any new functionality.
> 

Sure, that makes sense to me - this is patch is over 200 lines long
already.

> ...

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
  2020-11-06 11:25           ` Abhishek Kumar
@ 2020-11-06 17:56             ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-06 17:56 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar via GitGitGadget

In short: I think that because current implementation of writing GDOV
chunk follows an example of writing EDGE chunk, it should be left as it
is now (simple), and posible performance improvements be postponed to
some future commit.

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Oct 30, 2020 at 01:45:29PM +0100, Jakub Narębski wrote:
[...]
>>> We test the overflow-related code with the following repo history:
>>>
>>>            F - N - U
>>>           /         \
>>> U - N - U            N
>>>          \          /
>>>            N - F - N
>> 
>> Do we need such complex history? I guess we need to test the handling of
>> merge commits too.
>> 
>
> I wanted to test three cases - a root epoch zero commit, a commit that's
> far enough in past to overflow the offset and a commit that's far enough
> in the future to overflow the offset.

All right, if I understand this correctly this would be U as root, U-F
pair of commits and N-F pair of commits, respectively.  Did I get it
right?

Anyway, it might be a good idea to put this explanation in the commit
message.

>>>
>>> Where the commits denoted by U have committer date of zero seconds
>>> since Unix epoch, the commits denoted by N have committer date of
>>> 1112354055 (default committer date for the test suite) seconds since
>>> Unix epoch and the commits denoted by F have committer date of
>>> (2 ^ 31 - 2) seconds since Unix epoch.
>>>
>>> The largest offset observed is 2 ^ 31, just large enough to overflow.

[...]
>>> @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
>>>  	return 0;
>>>  }
>>>  
>>> +static int write_graph_chunk_generation_data(struct hashfile *f,
>>> +					      struct write_commit_graph_context *ctx)
>>> +{
>>> +	int i, num_generation_data_overflows = 0;
>> 
>> Minor nitpick: in my opinion there should be empty line here, between
>> the variables declaration and the code... however not all
>> write_graph_chunk_*() functions have it.
>> 
>>> +	for (i = 0; i < ctx->commits.nr; i++) {
>>> +		struct commit *c = ctx->commits.list[i];
>>> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
>>> +		display_progress(ctx->progress, ++ctx->progress_cnt);
>> 
>> All right.
>> 
>>> +
>>> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>>> +			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>>> +			num_generation_data_overflows++;
>>> +		}
>> 
>> Hmmm... shouldn't we store these commits that need overflow handling
>> (with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
>> in a list or a queue, to remember them for writing GDOV chunk?
>> 
>
> We could, although write_graph_chunk_extra_edges() (just like this function)
> prefers to iterate over all commits again. Both octopus merges and
> overflowing corrected commit dates are exceedingly rare, might be
> worthwhile to trade some memory to avoid looping again.

I'm sorry, I have not looked what write_graph_chunk_extra_edges() does,
or rather how it does what it does -- it is a good idea to pattern your
solution in similar existing code.

For me this is an even stronger hint that we should strive for
simplicity first, and leave possible performance improvements for the
future commit.  Especially that you perform the most significant
optimization for this overflow handling: ensuring that we do not perform
any work if there are no commits with generation data overflow.

Maybe, maybe we should add that information about similarity between
write_graph_chunk_generation_data_overflow() and write_graph_chunk_extra_edges() 
in the commit message.  I am unsure...

>> We could store oids, or we could store commits themselves, for example:
>> 
>> 		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>> 			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>> 			num_generation_data_overflows++;
>> 
>> 			ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
>> 			ctx->commits.list[ctx->gdov_commits.nr] = c;
>>             ctx->gdov_commits.nr++;
>> 		}
>> 
>> Though in the above proposal we could get rid of `num_generation_data_overflows`, 
>> as it should be the same as `ctx->gdov_commits.nr`.
>> 
>> I have called the extra commit list member of write_commit_graph_context
>> `gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`, 
>> or similar more descriptive name.

[...]
>>> @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>>>  	)
>>>  '
>>>  
>>> +test_commit_with_date() {
>>> +  file="$1.t" &&
>>> +  echo "$1" >"$file" &&
>>> +  git add "$file" &&
>>> +  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
>>> +  git tag "$1"
>>> +}
>> 
>> Here we add a helper function.  All right.
>> 
>> I wonder though if it wouldn't be a better idea to add `--date <date>`
>> option to the test_commit() function in test-lib-functions.sh (which
>> option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
>> set notick=yes).
>> 
>
> Yes, that's a better idea - I didn't know how to change test_commit()
> well enough to tinker with what's working.
>
>> For example:
>> 
>> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
>> index f1ae935fee..a1f9a2b09b 100644
>> --- a/t/test-lib-functions.sh
>> +++ b/t/test-lib-functions.sh
>> @@ -202,6 +202,12 @@ test_commit () {
>>  		--signoff)
>>  			signoff="$1"
>>  			;;
>> +        --date)
>> +            notick=yes
>> +            GIT_COMMITTER_DATE="$2"
>> +            GIT_AUTHOR_DATE="$2"
>> +            shift
>> +            ;;
>>  		-C)
>>  			indir="$2"
>>  			shift

Note however that I have while I have followed example of other options
(namely '-C <directory>'), I have not actually tested this proposed
implementation in tests; I have just tested that it looks like it works
OK.

[...]
>>> +test_expect_success 'overflow corrected commit date offset' '
>>> +	objdir=".git/objects" &&
>>> +	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
>>> +	FUTURE_DATE="@2147483646 +0000" &&
>> 
>> It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
>> FUTURE_DATE other way.
>> 
>> Wouldn't be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?
>
> It would, for some reason - I couldn't figure out the valid format for
> this. Changed.

Well, if "@2147483646 +0000" works (i.e. "@<Unix epoch/timestamp> <offset>"),
why the same for timestamp 0, i.e. "@0 +0000", wouldn't work?

[...]
>>> +graph_git_behavior 'overflow corrected commit date offset' repo left right
>> 
>> All right, here we compare the Git behavior with the commit-graph to the
>> behavior without it... however I think that those two tests really
>> should have distinct (different) test names. Currently they both use
>> 'overflow corrected commit date offset'.
>> 
>
> Following the earlier tests, the first test could be "set up and verify
> repo with generation data overflow chunk" and the git behavior test can
> be "generation data overflow chunk repo"

First is OK, the second could possibly be improved but is all right.

[...]
>> Though to reduce "noise" in this patch, the rename of run_three_modes()
>> to run_all_modes() and test_three_modes() to test_all_modes() could have
>> been done in a separate preparatory patch.  It would be pure refactoring
>> patch, without introducing any new functionality.
>> 
>
> Sure, that makes sense to me - this is patch is over 200 lines long
> already.

Thanks in advance.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date)
  2020-11-05 18:22                 ` Junio C Hamano
@ 2020-11-06 18:26                   ` Jakub Narębski
  2020-11-06 19:33                     ` Extending and updating gitglossary Junio C Hamano
  2020-11-08 17:23                     ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
  0 siblings, 2 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-06 18:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Philip Oakley, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Junio C Hamano <gitster@pobox.com> writes:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> This may be not part of the the main project, but could you consider, if
>> time permits, also adding some entries into the Git Glossary (`git help
>> glossary`) for the various terms we are using here and elsewhere, e.g.
>> 'topological levels', 'generation number', 'corrected commit date' (and
>> its fancy technical name for the use of date heuristics e.g. the
>> 'chronological ordering';).
>>
>> The glossary can provide a reference, once the issues are resolved. The
>> History Simplification and Commit Ordering section of git-log maybe a
>> useful guide to some of the terms that would link to the glossary.
>
> Ah, I first thought that Documentation/rev-list-options.txt (which
> is the relevant part of "git log" documentation you mention here)
> already have references to deep technical terms explained in the
> glossary and you are suggesting Abhishek to mimic the arrangement by
> adding new and agreed-upon terms to the glossary and referring to
> them from the commit-graph documentation updated by this series.
>
> But sadly that is not the case.  What you are saying is that you
> noticed that rev-list-options.txt needs a similar "the terms we use
> to explain these two sections should be defined and explained in the
> glossary (if they are not) and new references to glossary should be
> added there" update.
>
> In any case, that is a very good suggestion.  I agree that updating
> "git log" doc may be outside the scope of Abhishek's theme, but it
> would be very good to have such an update by anybody ;-)

The only possible problem I see with this suggestion is that some of
those terms (like 'topological levels' and 'corrected commit date') are
technical terms that should be not of concern for Git user, only for
developers working on Git.  (However one could encounter the term
"generation number" in `git commit-graph verify` output.)

I don't think adding technical terms that the user won't encounter in
the documentation or among messages that Git outputs would be not a good
idea.  It could confuse users, rather than help them.

Conversely, perhaps we should add Documentation/technical/glossary.txt
to help developers.


P.S. By the way, when looking at Documentation/glossary-content.txt, I
have noticed few obsolescent entries, like "Git archive", few that have
description that soon could be or is obsolete and would need updating,
like "master" (when default branch switch to "main"), or "object
identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
function).

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Extending and updating gitglossary
  2020-11-06 18:26                   ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
@ 2020-11-06 19:33                     ` Junio C Hamano
  2020-11-08 17:23                     ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
  1 sibling, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2020-11-06 19:33 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Philip Oakley, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Jakub Narębski <jnareb@gmail.com> writes:

> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be not a good
> idea.  It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.

Thanks for a thoughtful suggestion to help the target audience.  I
agree 100% with the above two paragraphs.

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date)
  2020-11-06 18:26                   ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
  2020-11-06 19:33                     ` Extending and updating gitglossary Junio C Hamano
@ 2020-11-08 17:23                     ` Philip Oakley
  2020-11-10  1:35                       ` Extending and updating gitglossary Jakub Narębski
  1 sibling, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-08 17:23 UTC (permalink / raw)
  To: Jakub Narębski, Junio C Hamano
  Cc: Abhishek Kumar, git, Abhishek Kumar via GitGitGadget,
	Derrick Stolee, Taylor Blau

Hi Jakub,

On 06/11/2020 18:26, Jakub Narębski wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Philip Oakley <philipoakley@iee.email> writes:
>>
>>> This may be not part of the the main project, but could you consider, if
>>> time permits, also adding some entries into the Git Glossary (`git help
>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>> its fancy technical name for the use of date heuristics e.g. the
>>> 'chronological ordering';).
>>>
>>> The glossary can provide a reference, once the issues are resolved. The
>>> History Simplification and Commit Ordering section of git-log maybe a
>>> useful guide to some of the terms that would link to the glossary.
>> Ah, I first thought that Documentation/rev-list-options.txt (which
>> is the relevant part of "git log" documentation you mention here)
>> already have references to deep technical terms explained in the
>> glossary and you are suggesting Abhishek to mimic the arrangement by
>> adding new and agreed-upon terms to the glossary and referring to
>> them from the commit-graph documentation updated by this series.
>>
>> But sadly that is not the case.  What you are saying is that you
>> noticed that rev-list-options.txt needs a similar "the terms we use
>> to explain these two sections should be defined and explained in the
>> glossary (if they are not) and new references to glossary should be
>> added there" update.
>>
>> In any case, that is a very good suggestion.  I agree that updating
>> "git log" doc may be outside the scope of Abhishek's theme, but it
>> would be very good to have such an update by anybody ;-)
> The only possible problem I see with this suggestion is that some of
> those terms (like 'topological levels' and 'corrected commit date') are
> technical terms that should be not of concern for Git user, only for
> developers working on Git.  (However one could encounter the term
> "generation number" in `git commit-graph verify` output.)
However we do mention "topolog*"  in a number of the manual pages, and
rather less, as yet, in the technical pages.

"Lexicographic" and "chronological" are in the same group of fancy
technical words ;-)

>
> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be not a good
> idea.  It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.

I would agree that the Glossary probably ought to be split into the
primary, secondary and background terms so that the core concepts are
separated from the academic/developer style terms.

Git does rip up most of what folks think about version "control",
usually based on the imperfect replication of physical artefacts.
>
> P.S. By the way, when looking at Documentation/glossary-content.txt, I
> have noticed few obsolescent entries, like "Git archive", few that have
> description that soon could be or is obsolete and would need updating,
> like "master" (when default branch switch to "main"), or "object
> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
> function).
The obsolescent items can be updated. I'm expecting that the 'main' and
'SHA-' changes will eventually be picked up as part of the respective
patch series, hopefully as part of the global replacements.

--
Philip

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Extending and updating gitglossary
  2020-11-08 17:23                     ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
@ 2020-11-10  1:35                       ` Jakub Narębski
  2020-11-10 14:04                         ` Philip Oakley
  0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-10  1:35 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Junio C Hamano, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Hello Philip,

Philip Oakley <philipoakley@iee.email> writes:
> On 06/11/2020 18:26, Jakub Narębski wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>> Philip Oakley <philipoakley@iee.email> writes:
>>>
>>>> This may be not part of the the main project, but could you consider, if
>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>> its fancy technical name for the use of date heuristics e.g. the
>>>> 'chronological ordering';).
>>>>
>>>> The glossary can provide a reference, once the issues are resolved. The
>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>> useful guide to some of the terms that would link to the glossary.
>>>
>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>> is the relevant part of "git log" documentation you mention here)
>>> already have references to deep technical terms explained in the
>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>> adding new and agreed-upon terms to the glossary and referring to
>>> them from the commit-graph documentation updated by this series.
>>>
>>> But sadly that is not the case.  What you are saying is that you
>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>> to explain these two sections should be defined and explained in the
>>> glossary (if they are not) and new references to glossary should be
>>> added there" update.

What terms you feel need glossary entry?

>>> In any case, that is a very good suggestion.  I agree that updating
>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>> would be very good to have such an update by anybody ;-)
>>
>> The only possible problem I see with this suggestion is that some of
>> those terms (like 'topological levels' and 'corrected commit date') are
>> technical terms that should be not of concern for Git user, only for
>> developers working on Git.  (However one could encounter the term
>> "generation number" in `git commit-graph verify` output.)

To be more precise, I think that user-facing glossary should include
only terms that appear in user-facing documentation and in output
messages of Git commands (with the possible exception of maybe output
messages of some low-level plumbing).

I think that the developer-facing glossary should include terms that
appear in technical documentation, and in commit messages in Git
history.

> However we do mention "topolog*"  in a number of the manual pages, and
> rather less, as yet, in the technical pages.
>
> "Lexicographic" and "chronological" are in the same group of fancy
> technical words ;-)

I think that 'topological level' would appear only in technical
documentation; if it would be the case then there is no reason to add it
to user-facing glossary (to gitglossary manpage).

'Topological order' or 'topological sort', 'lexicographical order' and
'chronological order' are not Git-specific terms, and there are no
Git-specific ambiguities.  I am therefore a bit unsure about adding them
to *Git* glossary.

- In computer science, a _topological sort_ or _topological_ ordering of
  a directed graph is a linear ordering of its vertices such that for
  every directed edge uv from vertex u to vertex v, u comes before v in
  the ordering.

  For Git it means that top to bottom, commits always appear before
  their parents. With `--graph` or `--topo-order` Git also avoids
  showing commits on multiple lines of history intermixed.

- In mathematics, the _lexicographic_ or _lexicographical order_ (also
  known as lexical order, dictionary order, etc.) is a generalization of
  the alphabetical order.

  For Git it is simply alphabetical order.

- _Chronological order_ is the arrangement of things following one after
  another in time; or in other words date order.

  Note that `git log --date-order` commits also always appear before
  their parents, but otherwise commits are shown in the commit timestamp
  order (committer date order)

>>
>> I don't think adding technical terms that the user won't encounter in
>> the documentation or among messages that Git outputs would be not a good
>> idea.  It could confuse users, rather than help them.
>>
>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>> to help developers.
>
> I would agree that the Glossary probably ought to be split into the
> primary, secondary and background terms so that the core concepts are
> separated from the academic/developer style terms.

I don't thing we need three separate layers; in my opinion separating
terms that user of Git might encounter from terms that somebody working
on developing Git may encounter would be enough.

The technical glossary / dictionary could also help onboarding...

>
> Git does rip up most of what folks think about version "control",
> usually based on the imperfect replication of physical artefacts.

I don't quite understand what you wanted to say there.  Could you
explain in more detail, please?

>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>> have noticed few obsolescent entries, like "Git archive", few that have
>> description that soon could be or is obsolete and would need updating,
>> like "master" (when default branch switch to "main"), or "object
>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>> function).
>
> The obsolescent items can be updated. I'm expecting that the 'main' and
> 'SHA-' changes will eventually be picked up as part of the respective
> patch series, hopefully as part of the global replacements.

Here I meant that "Git archive" entry is not important anymore, as I
think there are no active users of GNU arch version control system (no
"arch people"); arch's last release was in 2006, and its replacement,
Bazaar (or 'bzr') doesn't use this term. So I think it can be safely
removed in 2020, after 14 years after last release of arch.

In most cases "SHA-1" in the descriptions of terms in glossary should be
replaced by "object identifier" (to be more generic).  This can be
safely done before switch to NewHash is ready and announced.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Extending and updating gitglossary
  2020-11-10  1:35                       ` Extending and updating gitglossary Jakub Narębski
@ 2020-11-10 14:04                         ` Philip Oakley
  2020-11-10 23:52                           ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-10 14:04 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Hi Jakub,

On 10/11/2020 01:35, Jakub Narębski wrote:
> Hello Philip,
>
> Philip Oakley <philipoakley@iee.email> writes:
>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>
>>>>> This may be not part of the the main project, but could you consider, if
>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>> its fancy technical name for the use of date heuristics e.g. the
>>>>> 'chronological ordering';).
>>>>>
>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>>> useful guide to some of the terms that would link to the glossary.
>>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>>> is the relevant part of "git log" documentation you mention here)
>>>> already have references to deep technical terms explained in the
>>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>>> adding new and agreed-upon terms to the glossary and referring to
>>>> them from the commit-graph documentation updated by this series.
>>>>
>>>> But sadly that is not the case.  What you are saying is that you
>>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>>> to explain these two sections should be defined and explained in the
>>>> glossary (if they are not) and new references to glossary should be
>>>> added there" update.
> What terms you feel need glossary entry?
While it was Junio that made the comment, I'd agree that we should be
using the glossary to explain, in a general sense, the terms that are
used is a specialist sense. As the user community expands, their natural
understanding of some of the terms diminishes.
>
>>>> In any case, that is a very good suggestion.  I agree that updating
>>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>>> would be very good to have such an update by anybody ;-)
>>> The only possible problem I see with this suggestion is that some of
>>> those terms (like 'topological levels' and 'corrected commit date') are
>>> technical terms that should be not of concern for Git user, only for
>>> developers working on Git.  (However one could encounter the term
>>> "generation number" in `git commit-graph verify` output.)
> To be more precise, I think that user-facing glossary should include
> only terms that appear in user-facing documentation and in output
> messages of Git commands (with the possible exception of maybe output
> messages of some low-level plumbing).
And where implied, the underlying concepts when they aren't obvious, or
lack general terms (e.g. the 'staging area' discussions)
>
> I think that the developer-facing glossary should include terms that
> appear in technical documentation, and in commit messages in Git
> history.
>
>> However we do mention "topolog*"  in a number of the manual pages, and
>> rather less, as yet, in the technical pages.
>>
>> "Lexicographic" and "chronological" are in the same group of fancy
>> technical words ;-)
> I think that 'topological level' would appear only in technical
> documentation; if it would be the case then there is no reason to add it
> to user-facing glossary (to gitglossary manpage).
>
> 'Topological order' or 'topological sort', 'lexicographical order' and
> 'chronological order' are not Git-specific terms, and there are no
> Git-specific ambiguities.  I am therefore a bit unsure about adding them
> to *Git* glossary.

It is that they aren't terms used in normal speech, so many folks do not
comprehend the implied precision that the docs assume, nor the problems
they may hide.
>
> - In computer science, a _topological sort_ or _topological_ ordering of
>   a directed graph is a linear ordering of its vertices such that for
>   every directed edge uv from vertex u to vertex v, u comes before v in
>   the ordering.
Does this imply that those who aren't computer scientists shouldn't be
using Git?
>
>   For Git it means that top to bottom, commits always appear before
>   their parents. With `--graph` or `--topo-order` Git also avoids
>   showing commits on multiple lines of history intermixed.
>
> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
>   known as lexical order, dictionary order, etc.) is a generalization of
>   the alphabetical order.
>
>   For Git it is simply alphabetical order. 
ASCII order, Case sensitivity, Special characters, etc.
>
> - _Chronological order_ is the arrangement of things following one after
>   another in time; or in other words date order.
Given that most  résumés (the thing most folk see that asks for date
order) is latest first, does this clarify which way chronological is? (I
see this regularly in my other volunteer work).
>
>   Note that `git log --date-order` commits also always appear before
>   their parents, but otherwise commits are shown in the commit timestamp
>   order (committer date order)

>
>>> I don't think adding technical terms that the user won't encounter in
>>> the documentation or among messages that Git outputs would be not a good
>>> idea.  It could confuse users, rather than help them.
>>>
>>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>>> to help developers.
>> I would agree that the Glossary probably ought to be split into the
>> primary, secondary and background terms so that the core concepts are
>> separated from the academic/developer style terms.
> I don't thing we need three separate layers; in my opinion separating
> terms that user of Git might encounter from terms that somebody working
> on developing Git may encounter would be enough.
>
> The technical glossary / dictionary could also help onboarding...
>
>> Git does rip up most of what folks think about version "control",
>> usually based on the imperfect replication of physical artefacts.
> I don't quite understand what you wanted to say there.  Could you
> explain in more detail, please?
Background, I see Git & Version Control from an engineers view point,
rather than developers view.

In the "real" world there are no perfect copies, we serialise key items
so that we can track their degradation, and replace them when required.
We attempt to "Control" what is happening. Our documentation and
monitoring systems have layers of control to ensure only suitably
qualified persons may access and inspect critical items, can record and
access previous status reports, etc. There is only one "Mona Lisa", with
critical access controls, even though there are 'copies'
https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
Almost all of our terminology for configuration control comes from the
'real' world, i.e. pre-modern computing.

Git turns all that on its head. We can make perfect duplicates (they're
not copies, not replicas..). The Object name is immutable. It's either
right or wrong (exempt the SHAttered sha-1 breakage; were moving to
sha-256). Git does *not* provide any access control. It supports the
'software freedoms' by distributing the control to the user. The
repository is a version storage system, and the OIDs allow easy
authentication between folks that they are looking at the same object,
and all its implied descendants.

Git has ripped up classical 'real' world version control. In many areas
we need new or alternative terms, and documents that explain them to
screen writers(*) and the many other non CS-major users of Git (and some
engineers;-)

(*) there's a diff pattern for them, IIRC, or at least one was proposed.
>
>>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>>> have noticed few obsolescent entries, like "Git archive", few that have
>>> description that soon could be or is obsolete and would need updating,
>>> like "master" (when default branch switch to "main"), or "object
>>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>>> function).
>> The obsolescent items can be updated. I'm expecting that the 'main' and
>> 'SHA-' changes will eventually be picked up as part of the respective
>> patch series, hopefully as part of the global replacements.
> Here I meant that "Git archive" entry is not important anymore, as I
> think there are no active users of GNU arch version control system (no
> "arch people"); arch's last release was in 2006, and its replacement,
> Bazaar (or 'bzr') doesn't use this term. So I think it can be safely
> removed in 2020, after 14 years after last release of arch.
>
> In most cases "SHA-1" in the descriptions of terms in glossary should be
> replaced by "object identifier" (to be more generic).  This can be
> safely done before switch to NewHash is ready and announced.
>
> Best,
--
Philip

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: Extending and updating gitglossary
  2020-11-10 14:04                         ` Philip Oakley
@ 2020-11-10 23:52                           ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-10 23:52 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Junio C Hamano, Abhishek Kumar, git,
	Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Hello Philip,

Philip Oakley <philipoakley@iee.email> writes:
> On 10/11/2020 01:35, Jakub Narębski wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>>
>>>>>> This may be not part of the the main project, but could you consider, if
>>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>>> its fancy technical name for the use of date heuristics e.g. the
>>>>>> 'chronological ordering';).
>>>>>>
>>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>>>> useful guide to some of the terms that would link to the glossary.
[...]
>> What terms you feel need glossary entry?
>
> While it was Junio that made the comment, I'd agree that we should be
> using the glossary to explain, in a general sense, the terms that are
> used is a specialist sense. As the user community expands, their natural
> understanding of some of the terms diminishes.

I was hoping for a list of terms from the abovementioned sections of
git-log manpage you feel need entry in gitglosary(7).

[...]
>> To be more precise, I think that user-facing glossary should include
>> only terms that appear in user-facing documentation and in output
>> messages of Git commands (with the possible exception of maybe output
>> messages of some low-level plumbing).
>
> And where implied, the underlying concepts when they aren't obvious, or
> lack general terms (e.g. the 'staging area' discussions)

True, 'staging area' should IMVHO be in glossary (replacing or in
addition to older less specific term 'index', previous name for 'staging
area' term).

>> I think that the developer-facing glossary should include terms that
>> appear in technical documentation, and in commit messages in Git
>> history.

Such as 'topological levels', 'commit slab' / 'on the slab', etc.

>>> However we do mention "topolog*"  in a number of the manual pages, and
>>> rather less, as yet, in the technical pages.
>>>
>>> "Lexicographic" and "chronological" are in the same group of fancy
>>> technical words ;-)
>>
>> I think that 'topological level' would appear only in technical
>> documentation; if it would be the case then there is no reason to add it
>> to user-facing glossary (to gitglossary manpage).
>>
>> 'Topological order' or 'topological sort', 'lexicographical order' and
>> 'chronological order' are not Git-specific terms, and there are no
>> Git-specific ambiguities.  I am therefore a bit unsure about adding them
>> to *Git* glossary.
>
> It is that they aren't terms used in normal speech, so many folks do not
> comprehend the implied precision that the docs assume, nor the problems
> they may hide.

Right.

>> - In computer science, a _topological sort_ or _topological_ ordering of
>>   a directed graph is a linear ordering of its vertices such that for
>>   every directed edge uv from vertex u to vertex v, u comes before v in
>>   the ordering.
>
> Does this imply that those who aren't computer scientists shouldn't be
> using Git?

I think that in most cases where we refer to topological order in the
documentation we describe it there.  It might be good idea to add it to
the glossary, especially because Git uses it often in a very specific
sense.

On the other hand, should we define 'topology' or 'graph' as well? Or
'glossary' ;-) ? Those don't have any special meaning in Git, and can be
as well found in the dictionary or Wikipedia.

>>   For Git it means that top to bottom, commits always appear before
>>   their parents. With `--graph` or `--topo-order` Git also avoids
>>   showing commits on multiple lines of history intermixed.
>>
>> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
>>   known as lexical order, dictionary order, etc.) is a generalization of
>>   the alphabetical order.
>>
>>   For Git it is simply alphabetical order. 
>
> ASCII order, Case sensitivity, Special characters, etc.

Actually I don't know. Let me check: the only place this term appears in
the documentation is in git-tag(1) manpage and related documentation.
It simplly uses strcmp(), or strcasecmp() when using `--ignore-case`
option; so by default case sensitive.

It looks like it does not take locale-specific rules.

>> - _Chronological order_ is the arrangement of things following one after
>>   another in time; or in other words date order.
>
> Given that most résumés (the thing most folk see that asks for date
> order) is latest first, does this clarify which way chronological is? (I
> see this regularly in my other volunteer work).

Right, it might be not obvious at first glance that Git outputs most
recent commits first, that is newest commits are on top. Though if you
think about it in more detail, it is the only ordering that makes sense,
especially for projects with a long history; first, it is newest commits
that are most interesting, and second Git always walks the history from
child to parent.

>>   Note that `git log --date-order` commits also always appear before
>>   their parents, but otherwise commits are shown in the commit timestamp
>>   order (committer date order)

[...]
>>> Git does rip up most of what folks think about version "control",
>>> usually based on the imperfect replication of physical artefacts.
>>
>> I don't quite understand what you wanted to say there.  Could you
>> explain in more detail, please?
>
> Background, I see Git & Version Control from an engineers view point,
> rather than developers view.
>
> In the "real" world there are no perfect copies, we serialise key items
> so that we can track their degradation, and replace them when required.
> We attempt to "Control" what is happening. Our documentation and
> monitoring systems have layers of control to ensure only suitably
> qualified persons may access and inspect critical items, can record and
> access previous status reports, etc. There is only one "Mona Lisa", with
> critical access controls, even though there are 'copies'
> https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
> Almost all of our terminology for configuration control comes from the
> 'real' world, i.e. pre-modern computing.
>
> Git turns all that on its head. We can make perfect duplicates (they're
> not copies, not replicas..). The Object name is immutable. It's either
> right or wrong (exempt the SHAttered sha-1 breakage; were moving to
> sha-256). Git does *not* provide any access control. It supports the
> 'software freedoms' by distributing the control to the user. The
> repository is a version storage system, and the OIDs allow easy
> authentication between folks that they are looking at the same object,
> and all its implied descendants.
>
> Git has ripped up classical 'real' world version control. In many areas
> we need new or alternative terms, and documents that explain them to
> screen writers(*) and the many other non CS-major users of Git (and some
> engineers;-)
>
> (*) there's a diff pattern for them, IIRC, or at least one was proposed.

Right, though for me the concept of 'version control' was by default
always about the digital, usually the source code.

There are different editions of books, changes to non-digital technical
drawings and plans (AFAIK often in the form of physical foil overlays as
subsequent layers, if done well; overdrawing on the same layer if not),
amendment and changes to laws, etc.


Anyway, the question is what level of knowledge can we assume from the
average Git user -- this would affect the spread of terms that should be
considered for the Git glossary.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
  2020-11-01  0:55         ` Jakub Narębski
@ 2020-11-12 10:01           ` Abhishek Kumar
  2020-11-13  9:59             ` Jakub Narębski
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-12 10:01 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Since there are released versions of Git that understand generation
> > numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> > chunk, the following scenario is possible:
> >
> > 1. "New" Git writes a commit-graph with the GDAT chunk.
> > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> All right.
> 
> >
> > Because of the current use of inspecting the current layer for a
> > chunk_generation_data pointer, the commits in the lower layer will be
> > interpreted as having very large generation values (commit date plus
> > offset) compared to the generation numbers in the top layer (topological
> > level). This violates the expectation that the generation of a parent is
> > strictly smaller than the generation of a child.
> 
> I think this paragraphs tries too much to be concise, with the result it
> is less clear than it could be.  Perhaps it would be better to separate
> "what-if" from the current behavior.
> 
>   If each layer of split commit-graph is treated independently, as it
>   were the case before this commit, with Git inspecting only the current
>   layer for chunk_generation_data pointer, commits in the lower layer
>   (one with GDAT) would have corrected commit date as their generation
>   number, while commits in the upper layer would have topological levels
>   as their generation.  Corrected commit dates have usually much larger
>   values than topological levels.  This means that if we take two
>   commits, one from the upper layer, and one reachable from it in the
>   lower layer, then the expectation that the generation of a parent is
>   smaller than the generation of a child would be violated.
> 

Thanks, that's better.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> 
> I don't quite understand the issue here. Unless none of the following
> query commands short-circuit and all walk the commit graph regardless of
> what generation numbers tell them, they should give different results
> with and without the commit graph, if we take two commits one from lower
> layer of split commit graph with GDAT, and one commit from the higher
> layer without GDAT, one lower reachable from the other higher.
> 
> We have the following query commands that we can check:
>   $ git merge-base --is-ancestor <lower> <higher>
>   $ git merge-base --independent <lower> <higher>
>   
>   $ git tag --contains <tag-to-lower>
>   $ git tag --merged <tag-to-higher>
>   $ git branch --contains <branch-to-lower>
>   $ git branch --merged <branch-to-higher>
> 
> The second set of queries require for those commits to be tagged, or
> have branch pointing at them, respectively.
> 
> Also, shouldn't `git commit-graph verify` fail with split commit graph
> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
> 
> Let's assume that we have the following history, with newer commits
> shown on top like in `git log --graph --oneline --all`:
> 
>           topological     corrected         generation
>           level           commit date       number^*
> 
>       d    3                                3
>       |
>    c  |    3                                3
>    |  |                                                 without GDAT
>  ..|..|.....[layer.boundary]........................................
>    |  |                                                    with GDAT
>    |  b    2              1112912113        1112912113
>    |  |
>    a  |    2              1112912053        1112912053
>    | /
>    |/
>    r       1              1112911993        1112911993
> 
> *) each layer inspected individually.
> 
> With such history, we can for example reach 'a' from 'c', thus
> `git merge-base --is-ancestor a b` should return true value, but
> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
> I use here weaker reachability condition, but the one that works
> also for commits outside the commit-graph (and those for which
> generation numbers overflows).
> 

The original explanation was given by Dr. Stolee and he might not have
thought exhaustively about the issue.

In any case, your explanation and the history make sense to me. I will
try to add test and report back to the mailing list if something goes
wrong.

Thank you for clarifying in such detail.

> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> All right, that's good explanation.
> 
> ...
>
> > @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
> >  		}
> >  	}
> >  
> > +	if (!ctx->write_generation_data && g->chunk_generation_data)
> > +		ctx->write_generation_data = 1;
> > +
> 
> This needs more careful examination, and looking at larger context of
> those lines.
> 
> At this point, unless `--split=replace` option is used, 'g' points to
> the bottom layer out of all topmost layers being merged. We know that if
> there are GDAT-less layers then these must be top layers, so this means
> that we can write GDAT chunk in the result of the merge -- because we
> would be replacing all possible GDAT-less layers (and maybe some with
> GDAT) with a single layer with the GDAT chunk.
> 
> The ctx->write_generation_data is set to true unless environment
> variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
> write_commit_graph() it would be set to false if topmost layer doesn't
> have GDAT chunk, and to true if `--split=replace` option is used; see
> below.
> 
> Looks good to me.
> 
> 
> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
> writing GDAT chunk with generation data v2 unless we are merging layers,
> or replacing all of them with a single layer: then it is _ignored_.
> 
> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
> in t/README?  Currently it reads:
> 
>   GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>   commit-graph to be written without generation data chunk.

I think it's better to *not* write generation data chunk if
GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
are merged, that is:

  if (!ctx->write_generation_data &&
      g->chunk_generation_data &&
     !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
    ctx->write_generation_data = 1;

With this change, we would have a method to force-write commit-graph
without generation data chunk regardless of the shape of split
commit-graph files.

> 
> ...
>
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 19a02001fd..ad52130883 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -64,6 +64,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >  
> >  	uint32_t num_commits_in_base;
> > +	unsigned int read_generation_data;
> >  	struct commit_graph *base_graph;
> 
> All right, this new field is here to propagate to each layer the
> information whether we can read from the generation number v2 data
> chunk.
> 
> Though I am not sure whether this field should be added here, and
> whether it should be `unsigned int` (we don't have to be that careful
> about saving space for this type).
> 

I cannot think of a more appropriate struct than `struct commit_graph`. 
Any particular suggestions?

> > 
> >  	const uint32_t *chunk_oid_fanout;
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 651df89ab2..d0949a9eb8 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
> >  	verify_chain_files_exist $graphdir
> >  '
> >  
> > +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> > +	mkdir mixed &&
> 
> This should probably go just before cd-ing into just created
> subdirectory.
> 
> > +	graphdir=".git/objects/info/commit-graphs" &&
> > +	test_oid_cache <<-EOM &&
> > +	oid_version sha1:1
> > +	oid_version sha256:2
> > +	EOM
> 
> Minor nitpick: Why use "EOM", which is used only twice in Git the test
> suite, and not the conventional "EOF" (used at least 4000 times)?

Right, both instances of "EOM" are actually my own. I looked up some
test script for oid cache that did use EOM when I first wrote the tests
but it's changed now. Will replace.
> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> 
> The t/README says:
> 
>    - Don't chdir around in tests.  It is not sufficient to chdir to
>      somewhere and then chdir back to the original location later in
>      the test, as any intermediate step can fail and abort the test,
>      causing the next test to start in an unexpected directory.  Do so
>      inside a subshell if necessary.
> 
> Though I am not sure if it should apply also to this situation.

While I cannot avoid changing directory, using a subshell would be best
to avoid causing the later tests to start in unexpected directories.

> 
> > +	git init &&
> > +	git config core.commitGraph true &&
> > +	git config gc.writeCommitGraph false &&
> 
> All right.
> 
> > +	for i in $(test_seq 3)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/1 &&
> > +	for i in $(test_seq 4 5)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/2 &&
> > +	for i in $(test_seq 6 10)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git commit-graph write --reachable --split &&
> 
> Is there a reason why we do not check just written commit-graph file
> with `test-tool read-graph >output-layer-1`?

We could check the written commit-graph file at this point but it's same
as existing tests as above.

> 
> > +	git reset --hard commits/2 &&
> > +	git merge commits/4 &&
> 
> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
> to use one or the other?

`test_merge` is used in 26 places whereas `git merge` is used in over a
thousand places. `test_merge` is just not widely adopted and this lack
of adoption prevents further use.

> 
> > +	git branch merge/1 &&
> > +	git reset --hard commits/4 &&
> > +	git merge commits/6 &&
> > +	git branch merge/2 &&
> 
> It would be nice to have ASCII-art of the history (of the graph of
> revisions) created here for subsequent tests:
> 
>                                         
>            /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>           /    \-\
>          /        \
>   1 <-- 2 <-- 3*   \--\
>   |      \             \ 
>   |       \-----\       \
>    \             \       \
>     \-- 4*<------ M/1     M/2
>         |\               /  
>         | \-- 5*        /
>         \              /
>          \------------/
> 
>   * - 1st layer  
> 
> Though I am not sure if what I have created is readable; I think a
> better way to draw this graph is possible, for example:
> 
>                /- 3*
>               /
>              /
>   1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>    \         \       \
>     \         \       \
>      \         \       \
>       \- 4* <-- M/1     \    
>          |\              \
>          | \------------- M/2
>          \
>           \---- 5*
> 
> Edit: as I see the history gets even more complicated, so perhaps
> ASCII-art diagram of the history with layers marked would be too
> complicated, and wouldn't bring much.
> 
> Why do we need such shape of the history in the repository?

We don't need such a complicated shape. Any commit-graph file with 2-3
layers regardless of how commits are related should suffice. Will
simplify.

> 
> > +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 1
> > +	num_commits: 2
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> 
> All right, we check that we have 2 commits, and that there is no GDAT
> chunk.
> 
> > +	git commit-graph verify
> 
> All right, we verify commit-graph as a whole (both layers).
> 
> > +'
> > +
> > +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> 
> Hmmm... I wonder if we can come up with a better name for this test;
> for example should it be "does not write" or "do not write"?

That's better.

> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git reset --hard commits/3 &&
> > +	git merge merge/1 &&
> > +	git merge commits/5 &&
> > +	git merge merge/2 &&
> > +	git branch merge/3 &&
> 
> The commit graph gets complicated, so it would not be easy to visualize
> it with ASCII-art diagram without any crossed lines.  Maybe `git log
> --graph --oneline --all` would help:
> 
> *   (merge/3) Merge branch 'merge/2'
> |\
> | *   (merge/2) Merge branch 'commits/6'
> | |\
> * | \   Merge branch 'commits/5'
> |\ \ \
> | * | | (commits/5) 5
> | |/ /
> * | |   Merge branch 'merge/1'
> |\ \ \
> | * | | (merge/1) Merge branch 'commits/4'
> | |\| |
> | | * | (commits/4) 4
> * | | | (commits/3) 3
> |/ / /
> | | | * (commits/10) 10
> | | | * (commits/9) 9
> | | | * (commits/8) 8
> | | | * (commits/7) 7
> | | |/
> | | * (commits/6) 6
> | |/
> |/|
> * | (commits/2) 2
> |/
> * (commits/1) 1
> 
> 
> > +	git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 2
> > +	num_commits: 3
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> > +	git commit-graph verify
> 
> All right, so here we check that we have layer without GDAT at the top,
> and we request not to merge layers thus new layer will be created, then
> the new layer also does not have GDAT chunk (and has 3 commits).
> 
> Minor nitpick: shouldn't those test be indented?
> 

The tests look indented to me and `git diff HEAD^ --check` gives nothing.

Did you mean the lines enclosed by EOF delimiter?

> > +'
> > +
> > +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git commit-graph write --reachable --split=replace &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	test_line_count = 1 $graphdir/commit-graph-chain &&
> > +	verify_chain_files_exist $graphdir &&
> 
> All right, this checks that we have split commit-graph chain that
> consist of a single layer, and that the commit-graph file for this
> single layer exists.
> 
> > +	graph_read_expect 15 &&
> 
> Shouldn't we use `test-tool read-graph` to check whether generation_data
> chunk is present... ah, sorry, I have realized that after previous
> patches `graph_read_expect 15` implicitly checks the latter, because in
> its' use of `test-tool read-graph` it does expect generation_data chunk.
> 
> So we use `test-tool read-graph` manually to check that generation_data
> chunk is absent, and we use graph_read_expect to check that it is
> present (and in both cases that the number of commits matches).  I
> wonder if it would be possible to simplify that...
>

The problem here is graph_read_expect() as defined in
t5324-split-commit-graph takes two parameters - number of commits and
number of base graphs. If the number of base graphs is not passed to
the function call, it's assumed to be zero. Using a default parameter
is tricky - I can fix it by manually adding a zero to each of 
graph_read_expect() in an additional preparatory patch.

Any other suggestions are welcome too.

> 
> > +	git commit-graph verify
> 
> All right.
> 
> > +'
> > +
> > +test_expect_success 'add one commit, write a tip graph' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	test_commit 11 &&
> > +	git branch commits/11 &&
> > +	git commit-graph write --reachable --split &&
> > +	test_path_is_missing $infodir/commit-graph &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	ls $graphdir/graph-*.graph >graph-files &&
> > +	test_line_count = 2 graph-files &&
> > +	verify_chain_files_exist $graphdir
> > +'
> 
> What it is meant to test?  That adding single-commit to a 15 commit
> commit-graph file in split mode does not result in layers merging, and
> actually adds a new layer: we check that we have exactly two layers and
> that they are all OK.

This test is meant to check writing to a split graph in "normal"
conditions (i.e. all existing layers have generation data chunk). The
above tests are special cases as they involve merging layers with mixed 
generation number versions.

> 
> We don't check here that the newly created top layer commit-graph does
> have GDAT chunk, as it should be if the top layer (in this case the only
> layer) has GDAT chunk.
> > +
> >  test_done
> 
> One test we are missing is testing that merging layers is done
> correctly, namely that if we are merging layers in split commit-graph
> file, and the layer below the ones we are merging lacks GDAT chunk, then
> the result of the merge should also be without GDAT chunk.  This would
> require at least two GDAT-less layers in a setup.
> 
> I'm not sure how difficult writing such test should be.

It wouldn't be too hard. 

After the last test, I can write some more commits and write split 
commit-graph file without GDAT chunk. Then write some more commits 
and merge layers using `git commit-graph write --max-commits=<nr>`.

Thanks for pointing this out!

> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
  2020-11-12 10:01           ` Abhishek Kumar
@ 2020-11-13  9:59             ` Jakub Narębski
  0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-13  9:59 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> It is difficult to expose this issue in a test. Since we _start_ with
>>> artificially low generation numbers, any commit walk that prioritizes
>>> generation numbers will walk all of the commits with high generation
>>> number before walking the commits with low generation number. In all the
>>> cases I tried, the commit-graph layers themselves "protect" any
>>> incorrect behavior since none of the commits in the lower layer can
>>> reach the commits in the upper layer.
>> 
>> I don't quite understand the issue here. Unless none of the following
>> query commands short-circuit and all walk the commit graph regardless of
>> what generation numbers tell them, they should give different results
>> with and without the commit graph, if we take two commits one from lower
>> layer of split commit graph with GDAT, and one commit from the higher
>> layer without GDAT, one lower reachable from the other higher.
>> 
>> We have the following query commands that we can check:
>>   $ git merge-base --is-ancestor <lower> <higher>
>>   $ git merge-base --independent <lower> <higher>
>>   
>>   $ git tag --contains <tag-to-lower>
>>   $ git tag --merged <tag-to-higher>
>>   $ git branch --contains <branch-to-lower>
>>   $ git branch --merged <branch-to-higher>
>> 
>> The second set of queries require for those commits to be tagged, or
>> have branch pointing at them, respectively.
>> 
>> Also, shouldn't `git commit-graph verify` fail with split commit graph
>> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
>> 
>> Let's assume that we have the following history, with newer commits
>> shown on top like in `git log --graph --oneline --all`:
>> 
>>           topological     corrected         generation
>>           level           commit date       number^*
>> 
>>       d    3                                3
>>       |
>>    c  |    3                                3
>>    |  |                                                 without GDAT
>>  ..|..|.....[layer.boundary]........................................
>>    |  |                                                    with GDAT
>>    |  b    2              1112912113        1112912113
>>    |  |
>>    a  |    2              1112912053        1112912053
>>    | /
>>    |/
>>    r       1              1112911993        1112911993
>> 
>> *) each layer inspected individually.
>> 
>> With such history, we can for example reach 'a' from 'c', thus
>> `git merge-base --is-ancestor a b` should return true value, but
>> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
>> I use here weaker reachability condition, but the one that works
>> also for commits outside the commit-graph (and those for which
>> generation numbers overflows).
>> 
>
> The original explanation was given by Dr. Stolee and he might not have
> thought exhaustively about the issue.
>
> In any case, your explanation and the history make sense to me. I will
> try to add test and report back to the mailing list if something goes
> wrong.
>
> Thank you for clarifying in such detail.

I don't think you need to add any new test.  It should be enough to check
that the first test introduced in this patch, namely 'setup repo for
mixed generation commit-graph-chain', fails without the change in this
patch -- as I think it does.  This is because `git commit-graph verify`
should fail with mixed-version split commit-graph with GDAT-less layer
on top without this change.

Reporting this (possibly as from one sentence to one paragraph in the
commit message) would be enough, in my opinion.

[...]
>> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
>> writing GDAT chunk with generation data v2 unless we are merging layers,
>> or replacing all of them with a single layer: then it is _ignored_.
>> 
>> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> in t/README?  Currently it reads:
>> 
>>   GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>>   commit-graph to be written without generation data chunk.
>
> I think it's better to *not* write generation data chunk if
> GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
> are merged, that is:
>
>   if (!ctx->write_generation_data &&
>       g->chunk_generation_data &&
>      !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
>     ctx->write_generation_data = 1;
>
> With this change, we would have a method to force-write commit-graph
> without generation data chunk regardless of the shape of split
> commit-graph files.

While it would be more consistent to always behave like the old Git with
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1, it is in my opinion not necessary.

The only thing we need to test the mixed-version commit-graph chain is
the ability to add new layer on top without GDAT.  It does not matter if
this layer is created from new commits or a result of partial or full
merge of layers.

So the alternative to extending what GIT_TEST_COMMIT_GRAPH_NO_GDAT does
that you propose here would be simply improving the description of t in
t/README, e.g.

     GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
     commit-graph, or new layer in split commit-graph chain, to be written
     without generation data chunk.  It does not affect merging of layers.

For me either solution is fine.

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index 19a02001fd..ad52130883 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -64,6 +64,7 @@ struct commit_graph {
>>>  	struct object_directory *odb;
>>>  
>>>  	uint32_t num_commits_in_base;
>>> +	unsigned int read_generation_data;
>>>  	struct commit_graph *base_graph;
>> 
>> All right, this new field is here to propagate to each layer the
>> information whether we can read from the generation number v2 data
>> chunk.
>> 
>> Though I am not sure whether this field should be added here, and
>> whether it should be `unsigned int` (we don't have to be that careful
>> about saving space for this type).
>
> I cannot think of a more appropriate struct than `struct commit_graph`. 
> Any particular suggestions?

After thinking about it a bit more, I think it is fine to have it here
in `struct commit_graph`, it is better than using a global variable
(which would make code non-reentrant; not that we use multiple threads
for reading multiple layers of the commit graph, but we might want to in
the future).

[...]
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>> 
>> The t/README says:
>> 
>>    - Don't chdir around in tests.  It is not sufficient to chdir to
>>      somewhere and then chdir back to the original location later in
>>      the test, as any intermediate step can fail and abort the test,
>>      causing the next test to start in an unexpected directory.  Do so
>>      inside a subshell if necessary.
>> 
>> Though I am not sure if it should apply also to this situation.
>
> While I cannot avoid changing directory, using a subshell would be best
> to avoid causing the later tests to start in unexpected directories.

This would allow for easier skipping of tests, and failed tests would
not propagate the error (because of subsequent tests after a failed one
starting in unexpected directory).

[...]
>>> +	for i in $(test_seq 3)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git reset --hard commits/1 &&
>>> +	for i in $(test_seq 4 5)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git reset --hard commits/2 &&
>>> +	for i in $(test_seq 6 10)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git commit-graph write --reachable --split &&
>> 
>> Is there a reason why we do not check just written commit-graph file
>> with `test-tool read-graph >output-layer-1`?
>
> We could check the written commit-graph file at this point but it's same
> as existing tests as above.

All right, thanks for an explanation.

>> 
>>> +	git reset --hard commits/2 &&
>>> +	git merge commits/4 &&
>> 
>> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
>> to use one or the other?
>
> `test_merge` is used in 26 places whereas `git merge` is used in over a
> thousand places. `test_merge` is just not widely adopted and this lack
> of adoption prevents further use.

All right then.

>>> +	git branch merge/1 &&
>>> +	git reset --hard commits/4 &&
>>> +	git merge commits/6 &&
>>> +	git branch merge/2 &&
>> 
>> It would be nice to have ASCII-art of the history (of the graph of
>> revisions) created here for subsequent tests:
>> 
>>                                         
>>            /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>>           /    \-\
>>          /        \
>>   1 <-- 2 <-- 3*   \--\
>>   |      \             \ 
>>   |       \-----\       \
>>    \             \       \
>>     \-- 4*<------ M/1     M/2
>>         |\               /  
>>         | \-- 5*        /
>>         \              /
>>          \------------/
>> 
>>   * - 1st layer  
>> 
>> Though I am not sure if what I have created is readable; I think a
>> better way to draw this graph is possible, for example:
>> 
>>                /- 3*
>>               /
>>              /
>>   1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>>    \         \       \
>>     \         \       \
>>      \         \       \
>>       \- 4* <-- M/1     \    
>>          |\              \
>>          | \------------- M/2
>>          \
>>           \---- 5*
>> 
>> Edit: as I see the history gets even more complicated, so perhaps
>> ASCII-art diagram of the history with layers marked would be too
>> complicated, and wouldn't bring much.
>> 
>> Why do we need such shape of the history in the repository?
>
> We don't need such a complicated shape. Any commit-graph file with 2-3
> layers regardless of how commits are related should suffice. Will
> simplify.

If you are unsire if we need this shape of history to properly test all
corner cases of the algorithm, or whether simple history would be
enough, you can simply compare code coverage.  Git Makefile ha the
'coverage' target (which requires 'gcov' tool).

NOTE: if it is possible to run 'make coverage' for you, it can be used
to check if there are any parts of the new code that are not tested.

[...]
>>> +	git commit-graph write --reachable --split=no-merge &&
>>> +	test-tool read-graph >output &&
>>> +	cat >expect <<-EOF &&
>>> +	header: 43475048 1 $(test_oid oid_version) 4 2
>>> +	num_commits: 3
>>> +	chunks: oid_fanout oid_lookup commit_metadata
>>> +	EOF
>>> +	test_cmp expect output &&
>>> +	git commit-graph verify
>> 
>> All right, so here we check that we have layer without GDAT at the top,
>> and we request not to merge layers thus new layer will be created, then
>> the new layer also does not have GDAT chunk (and has 3 commits).
>> 
>> Minor nitpick: shouldn't those test be indented?
>> 
>
> The tests look indented to me and `git diff HEAD^ --check` gives nothing.
>
> Did you mean the lines enclosed by EOF delimiter?

I'm sorry, that was my mistake -- tabs are used for indent, and the
tabstop (in my newsreader) when being quoted made it look like it was
not indented.

[...]
>>> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>>> +	git commit-graph write --reachable --split=replace &&
>>> +	test_path_is_file $graphdir/commit-graph-chain &&
>>> +	test_line_count = 1 $graphdir/commit-graph-chain &&
>>> +	verify_chain_files_exist $graphdir &&
>> 
>> All right, this checks that we have split commit-graph chain that
>> consist of a single layer, and that the commit-graph file for this
>> single layer exists.
>> 
>>> +	graph_read_expect 15 &&
>> 
>> Shouldn't we use `test-tool read-graph` to check whether generation_data
>> chunk is present... ah, sorry, I have realized that after previous
>> patches `graph_read_expect 15` implicitly checks the latter, because in
>> its' use of `test-tool read-graph` it does expect generation_data chunk.
>> 
>> So we use `test-tool read-graph` manually to check that generation_data
>> chunk is absent, and we use graph_read_expect to check that it is
>> present (and in both cases that the number of commits matches).  I
>> wonder if it would be possible to simplify that...

What I wanted to say that it might be better to have a second variant of
graph_read_expect() for GDAT-less layers -- but this might be
unnecessary complication.

> The problem here is graph_read_expect() as defined in
> t5324-split-commit-graph takes two parameters - number of commits and
> number of base graphs. If the number of base graphs is not passed to
> the function call, it's assumed to be zero. Using a default parameter
> is tricky - I can fix it by manually adding a zero to each of 
> graph_read_expect() in an additional preparatory patch.

All right, thanks for an explanation.  I should have examined
graph_read_expect() in more detail.

> Any other suggestions are welcome too.

[...]
>>> +test_expect_success 'add one commit, write a tip graph' '
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>>> +	test_commit 11 &&
>>> +	git branch commits/11 &&
>>> +	git commit-graph write --reachable --split &&
>>> +	test_path_is_missing $infodir/commit-graph &&
>>> +	test_path_is_file $graphdir/commit-graph-chain &&
>>> +	ls $graphdir/graph-*.graph >graph-files &&
>>> +	test_line_count = 2 graph-files &&
>>> +	verify_chain_files_exist $graphdir
>>> +'
>> 
>> What it is meant to test?  That adding single-commit to a 15 commit
>> commit-graph file in split mode does not result in layers merging, and
>> actually adds a new layer: we check that we have exactly two layers and
>> that they are all OK.
>
> This test is meant to check writing to a split graph in "normal"
> conditions (i.e. all existing layers have generation data chunk). The
> above tests are special cases as they involve merging layers with mixed 
> generation number versions.

All right.

>> 
>> We don't check here that the newly created top layer commit-graph does
>> have GDAT chunk, as it should be if the top layer (in this case the only
>> layer) has GDAT chunk.
>>> +
>>>  test_done
>> 
>> One test we are missing is testing that merging layers is done
>> correctly, namely that if we are merging layers in split commit-graph
>> file, and the layer below the ones we are merging lacks GDAT chunk, then
>> the result of the merge should also be without GDAT chunk.  This would
>> require at least two GDAT-less layers in a setup.
>> 
>> I'm not sure how difficult writing such test should be.
>
> It wouldn't be too hard. 
>
> After the last test, I can write some more commits and write split 
> commit-graph file without GDAT chunk. Then write some more commits 
> and merge layers using `git commit-graph write --max-commits=<nr>`.
>
> Thanks for pointing this out!

Good.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-11-03 17:59         ` Jakub Narębski
  2020-11-03 18:19           ` Junio C Hamano
@ 2020-11-20 10:33           ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-20 10:33 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: abhishekkumar8222, git, gitgitgadget, stolee, sunshine

On Tue, Nov 03, 2020 at 06:59:03PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With corrected commit dates implemented, we no longer have to rely on
> > commit date as a heuristic in paint_down_to_common().
> >
> > While using corrected commit dates Git walks nearly the same number of
> > commits as commit date, the process is slower as for each comparision we
> > have to access a commit-slab (for corrected committer date) instead of
> > accessing struct member (for committer date).
> 
> Something for the future: I wonder if it would be worth it to bring back
> generation number from the commit-slab into `struct commit`.
> 
> >
> > For example, the command `git merge-base v4.8 v4.9` on the linux
> > repository walks 167468 commits, taking 0.135s for committer date and
> > 167496 commits, taking 0.157s for corrected committer date respectively.
> 
> I think it would be good idea to explicitly refer to the commit that
> changed paint_down_to_common() to *not* use generation numbers v1
> (topological levels) in the cases such as this, namely 091f4cf3 (commit:
> don't use generation numbers if not needed).  In this commit we have the
> following:
> ...
>

I have re-arranged the first half of commit message: 

  091f4cf3 (commit: don't use generation numbers if not needed,
  2018-08-30) changed paint_down_to_common() to use commit dates instead
  of generation numbers v1 (topological levels) as the performance
  regressed on certain topologies. With generation number v2 (corrected
  commit dates) implemented, we no longer have to rely on commit dates and
  can use generation numbers.
  
  For example, the command `git merge-base v4.8 v4.9` on the Linux
  repository walks 167468 commits, taking 0.135s for committer date and
  167496 commits, taking 0.157s for corrected committer date respectively.
  
  While using corrected commit dates Git walks nearly the same number of
  commits as commit date, the process is slower as for each comparision we
  have to access a commit-slab (for corrected committer date) instead of
  accessing struct member (for committer date).

> 
> The times you report (0.135s and 0.157s) are close to 0.122s / 0.127s
> reported in 091f4cf3 - that is most probably because of the differences
> in the system performance (hardware, operating system, load, etc.).
> Numbers of commits walked for the committed date heuristics, that is
> 167,468 agrees with your results; 167,496 (+28) for corrected commit
> date (generation number v2) is significantly smaller (-468,083) than
> 635,579 reported for topological levels (generation number v1).
> 
> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.
> 
> .................................................................
> 
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above.  It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
> 
>   This change accidentally broke fragile t6404-recursive-merge test.
>   t6404-recursive-merge setups a unique repository...
> 
> Maybe with s/accidentaly/incidentally/.
> 

Thanks, will add.

> Or add some other way of connection those two parts of the commit
> messages.
> ...
> >  
> > +int corrected_commit_dates_enabled(struct repository *r)
> > +{
> > +	struct commit_graph *g;
> > +	if (!prepare_commit_graph(r))
> > +		return 0;
> > +
> > +	g = r->objects->commit_graph;
> > +
> > +	if (!g->num_commits)
> > +		return 0;
> > +
> > +	return g->read_generation_data;
> > +}
> 
> Very nice abstraction.
> 
> Minor issue: I wonder if it would be better to use _available() or
> "_present()" rather than _enabled() suffix.
> 

We could, but that breaks conformity with `generation_numbers_enabled()`.

I see both functions to be similar in nature, to answer whether the
commit-graph has X? X could be topological levels or corrected commit
dates.

> > +
> >  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
> >  {
> >  	struct commit_graph *g = r->objects->commit_graph;
> > diff --git a/commit-graph.h b/commit-graph.h
> > index ad52130883..d2c048dc64 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> >  struct commit_graph *parse_commit_graph(struct repository *r,
> >  					void *graph_map, size_t graph_size);
> >  
> > +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> > +
> >  /*
> >   * Return 1 if and only if the repository has a commit-graph
> >   * file and generation numbers are computed in that file.
> >   */
> >  int generation_numbers_enabled(struct repository *r);
> >  
> > -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> 
> This moving get_bloom_filter_settings() before generation_numbers_enabled() 
> looks like accidental change.  If not, why it is here?

Right, that's an accidental change. I wanted to group
generation_numbers_enabled() and corrected_commit_dates_enabled()
together.

> 
> ...
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 10/10] doc: add corrected commit date info
  2020-11-04  1:37         ` Jakub Narębski
@ 2020-11-21  6:30           ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-21  6:30 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Wed, Nov 04, 2020 at 02:37:41AM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Nice.
> 
> > ---
> >  .../technical/commit-graph-format.txt         | 21 +++++--
> >  Documentation/technical/commit-graph.txt      | 62 ++++++++++++++++---
> >  2 files changed, 69 insertions(+), 14 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index b3b58880b9..08d9026ad4 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >  
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> 
> All right, because we could store both generation number v1 and
> generation number v2 in the commit-graph file, and we need to describe
> both, the description is now consolidated and in only one place.
> 
> >  
> >  - The root tree OID.
> >  
> > @@ -86,13 +82,26 @@ CHUNK DATA:
> >        position. If there are more than two parents, the second value
> >        has its most-significant bit on and the other bits store an array
> >        position into the Extra Edge List chunk.
> > -    * The next 8 bytes store the generation number of the commit and
> > +    * The next 8 bytes store the topological level (generation number v1)
> > +      of the commit and
> 
> All right, this is updated information about CDAT chunk.
> 
> >        the commit time in seconds since EPOCH. The generation number
> >        uses the higher 30 bits of the first 4 bytes, while the commit
> >        time uses the 32 bits of the second 4 bytes, along with the lowest
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >  
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
> 
> Should we mark this chunk as "[Optional]"?  Its absence is not an error.

I think we should mark it as "optional", although optional might not
have been the best choice word. 

Optional (for me) implies that it is configurable and decided by the end-user
directly.  However, it is *conditional* - on the existing commit graph file(s)
(if any) and the version of Git.

> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> > +    * If the corrected commit date offset cannot be stored within 31 bits,
> > +      the value has its most-significant bit on and the other bits store
> > +      the position of corrected commit date into the Generation Data Overflow
> > +      chunk.
> 
> All right.
> 
> > +
> > +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> > +    * This list of 8-byte values stores the corrected commit dates for commits
> > +      with corrected commit date offsets that cannot be stored within 31 bits.
> 
> A question: do we store 8-byte / 64-bit corrected commit date *directly*,
> or do we store corrected commit date *offset* as 8-byte / 64-bit value?
> 

We store the dates directly rather 8-byte offsets. Will clarify.

> Perhaps we should add the information that [like the EDGE chunk] it is
> present only when necessary, and that it is present only when GDAT chunk
> is present (it might be obvious, but it could be better to state
> this explicitly).
> 

It's always better to be explicit. Thanks for the detailed review.

> > +
> 
> All right, this is the information about two new chunks (with the
> mentioned above caveat about the clarity of the description of
> overflow-handling chunk).
> 
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index f14a7659aa..75f71c4c7b 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
> >  
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >  
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates (generation number v2)
> > +2. Topological levels (generation nummber v1)
> 
> All right.
> 
> >  
> > - * A commit with no parents (a root commit) has generation number one.
> > +Define "corrected committer date" of a commit recursively as follows:
> >  
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > +  * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> 
> Minor nitpick: the above point has been accidentally indented one space
> more than necessary, and than is indented in other places.  Or maybe
> that fixes / unifies the formatting... I am not sure.
> 

That's a force of habit - I like to write markdown with greater
indentation. Should have been indented with one space instead of two.

> >  
> > -Equivalently, the generation number of a commit A is one more than the
> > +  * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > +  * As a special case, a root commit with timestamp zero has corrected commit
> > +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> > +    (that is, an uncomputed corrected commit date).
> 
> All right.  Looks good.
> 
> > +
> > +Define the "topological level" of a commit recursively as follows:
> > +
> > + * A commit with no parents (a root commit) has topological level of one.
> > +
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> > +
> 
> All right, this just repeats what was written before, or in other words
> move existing contents lower/later, just with 'generation number'
> replaced by 'topological level' (though it might be not obvious from the
> patch because of the latter change).
> 
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> >  
> > @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
> >      generation numbers, then we always expand the boundary commit with highest
> >      generation number and can easily detect the stopping condition.
> >  
> > +The properties applies to both versions of generation number, that is both
> > +corrected committer dates and topological levels.
> > +
> 
> I think it should be "This property" or "The property", not "The
> properties"; it is a single property, a single condition.
> 
> We can alternatively say "This condition is fulfilled by both versions...",
> or "This condition is true for both versions...".
> 
> >  This property can be used to significantly reduce the time it takes to
> >  walk commits and determine topological relationships. Without generation
> >  numbers, the general heuristic is the following:
> > @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >  
> > -This heuristic is currently used whenever the computation is allowed to
> > +In absence of corrected commit dates (for example, old versions of Git or
> > +mixed generation graph chains),
> > +this heuristic is currently used whenever the computation is allowed to
> >  violate topological relationships due to clock skew (such as "git log"
> >  with default order), but is not used when the topological order is
> >  required (such as merge base calculations, "git log --graph").
> 
> All right, this explains when commit date heuristics is used (which is
> less often than before).
> 
> > @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >  
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> 
> All right, 64-bit GENERATION_NUMBER_INFINITY = 0xFFFFFFFFFFFFFFFF is a
> bit unwieldy...
> 
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >  
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > +We use the macro GENERATION_NUMBER_MAX for commits whose
> 
> This should be
> 
>   +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF to for commits whose
>   +topological levels (generation number v1) are computed to be at least this value. We limit at
>    this value since it is the largest value that can be stored in the
>   +commit-graph file using the 30 bits available to topological levels. This
> 
> We need to use "topological levels" or "generation numbers v1" thorough
> the rest of this section.
> 
> >  generation numbers are computed to be at least this value. We limit at
> >  this value since it is the largest value that can be stored in the
> >  commit-graph file using the 30 bits available to generation numbers. This
> > @@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >
> 
> All right, I agree that we don't need to write about overflow handling
> for storing corrected committer dates (generation number v2) as offsets;
> this is something format-specific, and this documentation is more about
> using commit-graph data.  What is present in commit-graph-format.txt
> should be enough information.
> 
> Sidenote: I wonder if other Git implementations such as JGit, Dulwich,
> Gitoxide (gix), go-git have support for the commit-graph file...
> 
> > +## Handling Mixed Generation Number Chains
> > +
> > +With the introduction of generation number v2 and generation data chunk, the
> > +following scenario is possible:
> > +
> > +1. "New" Git writes a commit-graph with the corrected commit dates.
> > +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> > +
> > +A naive approach of using the newest available generation number from
> > +each layer would lead to violated expectations: the lower layer would
> > +use corrected commit dates which are much larger than the topological
> > +levels of the higher layer. For this reason, Git inspects each layer to
> > +see if any layer is missing corrected commit dates. In such a case, Git
> > +only uses topological level
> 
> This should end in full stop:
> 
>   +only uses topological levels.
> 
> Or maybe we should expand the last sentence a bit:
> 
>   +only uses topological levels for generation numbers.
> 
> Sidenote: it is a good explanation, even if Git can make use of the
> property described below that only topmost layers might be missing
> corrected commit graph by the construction (so it needs to check only
> the top layer).
> 
> > +
> > +When writing a new layer in split commit-graph, we write corrected commit
> > +dates if the topmost layer has corrected commit dates written. This
> > +guarantees that if a layer has corrected commit dates, all lower layers
> > +must have corrected commit dates as well.
> > +
> > +When merging layers, we do not consider whether the merged layers had corrected
> > +commit dates. Instead, the new layer will have corrected commit dates if and
> > +only if all existing layers below the new layer have corrected commit dates.
> > +
> 
> Perhaps we should explicitly say that when rewriting split commit-graph
> as a single file (`--split=replace`) then the newly created single layer
> would store corrected commit dates.
> 

Rewriting split commit-graph as a single file is a case where there are
no "existing layers below the new layer". We should clarify that if the
new layer is the only layer, it will always have corrected commit dates
when written by compatible versions of Git. 

I have appended a paragraph at the end:

  While writing or merging layers, if the new layer is the only layer,
  it will have corrected commit dates when written by compatible
  versions of Git. Thus, rewriting split commit-graph as a singel file
  (`--split=replace`) creates a single layer with corrected commit
  dates.

> >  ## Deleting graph-{hash} files
> >  
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date
  2020-11-04 23:37       ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
@ 2020-11-22  5:31         ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-22  5:31 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee

On Thu, Nov 05, 2020 at 12:37:49AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Thanks a lot for continued working on this patch series.

Thank you so much for the careful review of the series.

> 
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
> 
> Very minor nitpick: it would make it easier to read if the commands
> themself would be put inside single quotes or backticks, e.g. `git log
> --graph` and `git merge-base v4.8 v4.9`.

That's unexpected - I wrote the commands within single quotes in the pull
request. Since backticks are rendered as "code-tags" on Github, let me 
try single quotes.

> 
> I wonder if it is worth mentioning (probably not) that this performance
> hit was the reason why since 091f4cf3 `git merge-base` uses committer
> date heuristics unless there is a cutoff and using topological levels
> (generation date v1) is expected to give better performance.
> 

I think that's useful context for someone wondering whether we continue
to take the performance hit with topological levels or have abandoned
topological levels or chosen some another alternative altogether.

> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem 1
> 
> Minor issue: this looks a bit strange; is there an error in formatting
> this part?

Yes. The plaintext in pull request description reads as follows:

  Thus, the need for generation number v2 was born. New generation number
  needed to provide good performance, increment updates, and backward
  compatibility. Due to an unfortunate problem [1], we also needed a way
  to distinguish between the old and new generation number without
  incrementing graph version.

  [1]: https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

I have been reviewing other pull request descriptions to match their
style (and hope the cover letter renders correctly) and Dr. Stolee in
his patch series to "add --literal value" option to configuration has
written:
  
  As reported [1], 'git maintenance unregister' fails when a repository
  is located in a directory with regex glob characters.

  [1] https://lore.kernel.org/git/2c2db228-069a-947d-8446-89f4d3f6181a@gmail.com/T/#mb96fa4187a0d6aeda097cd95804a8aafc0273022

(Note the lack of colon after [1])

> 
> > [https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/], we also
> > needed a way to distinguish between the old and new generation number
> > without incrementing graph version.
> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk.
> 
> Nice writeup about the history of generation number v2, much appreciated.
> 
> >                    Thus, Corrected Commit Date was chosen as generation
> > number v2. The Corrected Commit Date is defined as:
> 
> Minor nitpick: it would be probably better to use "is defined as
> follows." instead of "is defined as:".
> 
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents plus 1. Then 
> > corrected commit date offset is the difference between corrected commit date
> > of C and commit date of C. As a special case, a root commit with timestamp
> > zero has corrected commit date of 1 to be able distinguish it from
> > GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
> 
> Very minor nitpick: s/with timestamp/with *the* timestamp/, and
> s/to be able distinguish/to be able *to* distinguish/ (without the '*'
> used to mark the additions).
> 
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> 
> Or "Generation DATa chunk", if we want to emphasize where its name came
> from, or even "Generation DATa (GDAT) chunk". But it is fine as it is
> now, though it would be good idea to write "Generation Data (GDAT)
> chunk" to explicitly state its name / shortcut.
> 
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
> 
> Nice writeup of handling the backward compatibility.
> 
> >
> > While storing corrected commit date offsets saves us 4 bytes per commit (as
> > compared with storing corrected commit dates directly), it's possible for
> > the offset to overflow the space allocated. To handle such cases, we
> > introduce a new chunk, Generation Data Overflow (GDOV) that stores the
> > corrected commit date. For overflowing offsets, we set MSB and store the
> > position into the GDOV chunk, in a mechanism similar to the Extra Edges list
> > chunk.
> 
> Very minor suggestion: perhaps it would be better to use "it's however
> possible".
> 
> Very minor suggestion: "it's possible for the offset to overflow" could
> be simplified to just "the offset can overflow"... though the simplified
> version loses a bit of hint that the overflow should be very rare in
> real repositories.
> 
> But it is just fine as it is now; I am not a native English speaker to
> judge which version is better.
> 

I think it is better to indicate the rareness of overflows.

> >
> > For mixed generation number environment (for example new Git on the command
> > line, old Git used by GUI client), we can encounter a mixed-chain
> > commit-graph (a commit-graph chain where some of split commit-graph files
> > have GDAT chunk and others do not). As backward compatibility is one of the
> > goals, we can define the following behavior:
> >
> > While reading a mixed-chain commit-graph version, we fall back on
> > topological levels as corrected commit dates and topological levels cannot
> > be compared directly.
> >
> > While writing on top of a split commit-graph, we check if the tip of the
> > chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
> > chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
> > chunk, rest of the chain does too.
> >
> > If the topmost split commit-graph file does not have a GDAT chunk (meaning
> > it has been appended by the old Git), we write without GDAT chunk. We do
> > write a GDAT chunk when the existing chain does not have GDAT chunk - when
> > we are writing to the commit-graph chain with the 'replace' strategy.
> 
> I think the last paragraph can be simplified (or added to) by explicitly
> stating the goal:
> 
>   When adding new layer to the split commit-graph file, and when merging
>   some or all layers (replacing them in the latter case), the new layer
>   will have GDAT chunk if and only if in the final result there would be
>   no layer without GDAT chunk just below it.
> 

Thanks, that is much clearer to understand.

> ...
> 
> After careful review of those 10 patches it looks like the series is
> close to being ready, requiring only small changes to progress.
> 

Thank you for writing this handy reference for changes.

> > Abhishek Kumar (10):
> >   commit-graph: fix regression when computing Bloom filters
> 
>     All good, beside possible improvement to the commit message.
>     Thanks to Taylor Blau for discovering possible reason for strange
>     no change in performance.
> 
> >   revision: parse parent in indegree_walk_step()
> 
>     Looks good.
> 
> >   commit-graph: consolidate fill_commit_graph_info
> 
>     Needs to fix now duplicated test names (minor change).
>     Proposed possible improvement to the commit message.
> 
> >   commit-graph: return 64-bit generation number
> 
>     Needs fixing due to mismerge: there should be no switch from
>     using GENERATION_NUMBER_ZERO to using GENERATION_NUMBER_INFINITY.
>     Possible minor improvement to the commit message.
> 
> >   commit-graph: add a slab to store topological levels
> 
>     Possible minor improvement to the commit message.
>     
>     There is also not very important issue, but something that would be
>     nice to explain, namely that checks for GENERATION_NUMBER_INFINITY 
>     can never be true, as topo_level_slab_at() returns 0 for commits
>     outside the commit-graph, not GENERATION_NUMBER_INFINITY.  It works
>     but it is not obvious why.
> 
> >   commit-graph: implement corrected commit date
> 
>     The change to commit-graph verification needs fixing, and we need to
>     decide how verifying generation numbers should work.  Perhaps a test
>     for handling topological level of GENERATION_NUMBER_V1_MAX could be
>     added (though this might be left for ater).
> 
>     The changes to `git commit-graph verify` code could be put into
>     separate patch, either before or after this one.
> 
> >   commit-graph: implement generation data chunk
> 
>     Proposed possible improvement to the commit message.
>     The commit message does not explain why given shape of history is
>     needed to test handling corrected commit date offset overflow.
> 
>     Proposed minor corrections to the coding style.
> 
>     Instead of looping again through all commits when handling overflow
>     in corrected commit date offsets, while there should be at most a
>     few commits needing it, why not save those commits on list and loop
>     only through those commits?  Though this _possible_ performance
>     improvement could be left to the followup...

Since the improvement can be applied to both 
`write_graph_chunk_generation_data_overflow()` and 
`write_graph_chunk_extra_edges()`, I am planning to cover this in a
followup.

> 
>     test_commit_with_date() could be instead implemented via adding
>     `--date <date>` option to test_commit() in test-lib-functions.sh.
> 
>     Also, to reduce "noise" in this patch, the rename of
>     run_three_modes() to run_all_modes() and test_three_modes() to
>     test_all_modes() could have been done in a separate preparatory
>     patch. It would be pure refactoring patch, without introducing any
>     new functionality.  But it is not something that is necessary.
> 
> >   commit-graph: use generation v2 only if entire chain does
> 
>     Proposed possible improvement to the commit message.
>     Proposed minor corrections to the coding style (also in tests).
> 
>     There is a question whether merging layers or replacing them should
>     honor GIT_TEST_COMMIT_GRAPH_NO_GDAT.
> 
>     Tests possibly could be made more strict, and check more things
>     explicitly. One test we are missing is testing that merging layers
>     is done correctly, namely that if we are merging layers in split
>     commit-graph file, and the layer below the ones we are merging lacks
>     GDAT chunk, then the result of the merge should also be without GDAT
>     chunk -- but that might be left for later.
> 
> >   commit-reach: use corrected commit dates in paint_down_to_common()
> 
>     This patch consist of two slightly interleaved changes, which
>     possibly could be separated: change to paint_down_to_common() and
>     change to t6404-recursive-merge test.
> 
>     In the commit message for the paint_down_to_common() we should
>     explicitly mention 091f4cf3, which this one partially reverts.
> 
>     Possible accidental change, question about function naming.
> 
> >   doc: add corrected commit date info
> 
>     Needs further improvements to the documentation, like adding
>     "[Optional]" to chunk description, and leftover switching from
>     "generation numbers" to "topological levels" in one place.
> 
> ...
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
  2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
                         ` (10 preceding siblings ...)
  2020-11-04 23:37       ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
@ 2020-12-28 11:15       ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
                           ` (12 more replies)
  11 siblings, 13 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, using
topological levels can result in a worse performance than without them when
compared with committer date as a heuristics. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using topological
levels and walks 167,468 using committer date. Since 091f4cf3 (commit: don't
use generation numbers if not needed, 2018-08-30), 'git merge-base' uses
committer date heuristic unless there is a cutoff because of the performance
hit.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem [1], we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

[1] https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Various candidates were examined (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets
performed much worse than committer date (506,577 vs. 167,468 commits walked
for 'git merge-base v4.8 v4.9') and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as follows:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to be able to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).

We will introduce an additional commit-graph chunk, Generation DATa (GDAT)
chunk, and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores
the corrected commit date. For overflowing offsets, we set MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.

For mixed generation number environment (for example new Git on the command
line, old Git used by GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of split commit-graph files
have GDAT chunk and others do not). As backward compatibility is one of the
goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels cannot
be compared directly.

When adding new layer to the split commit-graph file, and when merging some
or all layers (replacing them in the latter case), the new layer will have
GDAT chunk if and only if in the final result there would be no layer
without GDAT chunk just below it.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.

I look forward to everyone's reviews!

Thanks

 * Abhishek

----------------------------------------------------------------------------

Improvements left for a future series:

 * Save commits with generation data overflow and extra edge commits instead
   of looping over all commits. cf. 858sbel67n.fsf@gmail.com
 * Verify both topological levels and corrected commit dates when present.
   cf. 85pn4tnk8u.fsf@gmail.com

Changes in version 5:

 * Explained a possible reason for no change in performance for
   "commit-graph: fix regression when computing bloom-filters"
 * Clarified about the addition of a new test for 11-digit octal
   implementations of ustar.
 * Fixed duplicate test names in "commit-graph: consolidate
   fill_commit_graph_info".
 * Swapped the order "commit-graph: return 64-bit generation number",
   "commit-graph: add a slab to store topological levels" to minimize lines
   changed.
 * Fixed the mismerge in "commit-graph: return 64-bit generation number"
 * Clarified the preparatory steps are for the larger goal of implementing
   generation number v2 in "commit-graph: return 64-bit generation number".
 * Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
   patch "t6600-test-reach: generalize *_three_modes".
 * Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
   never be true in "commit-graph: add a slab to store topological levels".
 * Fixed incorrect logic for verifying commit-graph in "commit-graph:
   implement corrected commit date".
 * Added minor improvements to commit message of "commit-graph: implement
   generation data chunk".
 * Added '--date ' option to test_commit() in 'test-lib-functions.sh' in
   "commit-graph: implement generation data chunk".
 * Improved coding style (also in tests) for "commit-graph: use generation
   v2 only if entire chain does".
 * Simplified test repository structure in "commit-graph: use generation v2
   only if entire chain does" as only the number of commits in a split
   commit-graph layer are relevant.
 * Added a new test in "commit-graph: use generation v2 only if entire chain
   does" to check if the layers are merged correctly.
 * Explicitly mentioned commit "091f4cf3" in the commit-message of
   "commit-graph: use corrected commit dates in paint_down_to_common()".
 * Minor corrections to documentation in "doc: add corrected commit date
   info".
 * Minor corrections to coding style.

Changes in version 4:

 * Added GDOV to handle overflows in generation data.
 * Added a test for writing tip graph for a generation number v2 graph chain
   in t5324-split-commit-graph.sh
 * Added a section on how mixed generation number chains are handled in
   Documentation/technical/commit-graph-format.txt
 * Reverted unimportant whitespace, style changes in commit-graph.c
 * Added header comments about the order of comparision for
   compare_commits_by_gen_then_commit_date in commit.h,
   compare_commits_by_gen in commit-graph.h
 * Elaborated on why t6404 fails with corrected commit date and must be run
   with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected
   commit dates in paint_down_to_common()"
 * Elaborated on write behavior for mixed generation number chains in the
   commit "commit-graph: use generation v2 only if entire chain does"
 * Added notes about adding the topo_level slab to struct
   write_commit_graph_context as well as struct commit_graph.
 * Clarified commit message for "commit-graph: consolidate
   fill_commit_graph_info"
 * Removed the claim "GDAT can store future generation numbers" because it
   hasn't been tested yet.

Changes in version 3:

 * Reordered patches as discussed in 2
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/].
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed generation commit chains.
 * Improve commit messages for "commit-graph: fix regression when computing
   bloom filter", "commit-graph: consolidate fill_commit_graph_info",
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing Bloom filters
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  t6600-test-reach: generalize *_three_modes
  commit-graph: add a slab to store topological levels
  commit-graph: return 64-bit generation number
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  28 +-
 Documentation/technical/commit-graph.txt      |  77 +++++-
 commit-graph.c                                | 243 ++++++++++++++----
 commit-graph.h                                |  15 +-
 commit-reach.c                                |  38 +--
 commit-reach.h                                |   2 +-
 commit.c                                      |   4 +-
 commit.h                                      |   5 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |  24 +-
 t/t5318-commit-graph.sh                       |  79 +++++-
 t/t5324-split-commit-graph.sh                 | 193 +++++++++++++-
 t/t6404-recursive-merge.sh                    |   5 +-
 t/t6600-test-reach.sh                         |  68 ++---
 t/test-lib-functions.sh                       |   6 +
 upload-pack.c                                 |   2 +-
 19 files changed, 659 insertions(+), 154 deletions(-)


base-commit: 4a0de43f4923993377dbbc42cfc0a1054b6c5ccf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v4:

  1:  fae81b534b1 !  1:  c4e817abf7d commit-graph: fix regression when computing Bloom filters
     @@ Metadata
       ## Commit message ##
          commit-graph: fix regression when computing Bloom filters
      
     -    commit_gen_cmp is used when writing a commit-graph to sort commits in
     -    generation order before computing Bloom filters. Since c49c82aa (commit:
     -    move members graph_pos, generation to a slab, 2020-06-17) made it so
     -    that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
     -    during writing, we cannot call it within this function. Instead, access
     -    the generation number directly through the slab (i.e., by calling
     -    'commit_graph_data_at(c)->generation') in order to access it while
     -    writing.
     +    Before computing Bloom fitlers, the commit-graph machinery uses
     +    commit_gen_cmp to sort commits by generation order for improved diff
     +    performance. 3d11275505 (commit-graph: examine commits by generation
     +    number, 2020-03-30) claims that this sort can reduce the time spent to
     +    compute Bloom filters by nearly half.
      
     -    While measuring performance with `git commit-graph write --reachable
     -    --changed-paths` on the linux repository led to around 1m40s for both
     -    HEAD and master (and could be due to fault in my measurements), it is
     -    still the "right" thing to do.
     +    But since c49c82aa4c (commit: move members graph_pos, generation to a
     +    slab, 2020-06-17), this optimization is broken, since asking for a
     +    'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
     +    while writing.
     +
     +    Not all hope is lost, though: 'commit_graph_generation()' falls back to
     +    comparing commits by their date when they have equal generation number,
     +    and so since c49c82aa4c is purely a date comparision function. This
     +    heuristic is good enough that we don't seem to loose appreciable
     +    performance while computing Bloom filters. Applying this patch (compared
     +    with v2.29.1) speeds up computing Bloom filters by around ~4
     +    seconds.
     +
     +    So, avoid the useless 'commit_graph_generation()' while writing by
     +    instead accessing the slab directly. This returns the newly-computed
     +    generation numbers, and allows us to avoid the heuristic by directly
     +    comparing generation numbers.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
  2:  4470d916428 =  2:  7645e0bcef0 revision: parse parent in indegree_walk_step()
  3:  18bb3318a12 !  3:  ca646912b2b commit-graph: consolidate fill_commit_graph_info
     @@ Commit message
          fill_commit_in_graph().
      
          fill_commit_graph_info() used to not load committer data from commit data
     -    chunk. However, with the corrected committer date, we have to load
     -    committer date to calculate generation number value.
     +    chunk. However, with the upcoming switch to using corrected committer
     +    date as generation number v2, we will have to load committer date to
     +    compute generation number value anyway.
      
          e51217e15 (t5000: test tar files that overflow ustar headers,
          30-06-2016) introduced a test 'generate tar with future mtime' that
     -    creates a commit with committer date of (2 ^ 36 + 1) seconds since
     +    creates a commit with committer date of (2^36 + 1) seconds since
          EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
          committer time overflows into generation number (within CDAT chunk) and
          has undefined behavior.
      
          The test used to pass as fill_commit_graph_info() would not set struct
     -    member `date` of struct commit and loads committer date from the object
     +    member `date` of struct commit and load committer date from the object
          database, generating a tar file with the expected mtime.
      
          However, with corrected commit date, we will load the committer date
     @@ Commit message
          mtime.
      
          The ustar format (the header format used by most modern tar programs)
     -    only has room for 11 (or 12, depending om some implementations) octal
     -    digits for the size and mtime of each files.
     +    only has room for 11 (or 12, depending on some implementations) octal
     +    digits for the size and mtime of each file.
      
     -    Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
     -    digit implementations while still fitting into commit data chunk.
     +    As the CDAT chunk is overflow by 12-octal digits but not 11-octal
     +    digits, we split the existing tests to test both implementations
     +    separately and add a new explicit test for 11-digit implementation.
      
     -    Since we want to test 12-octal digit implementations of ustar as well,
     -    let's modify the existing test to no longer use commit-graph file.
     +    To test the 11-octal digit implementation, we create a future commit
     +    with committer date of 2^34 - 1, which overflows 11-octal digits without
     +    overflowing 34-bits of the Commit Date chunks.
     +
     +    To test the 12-octal digit implementation, the smallest committer date
     +    possible is 2^36 + 1, which overflows the CDAT chunk and thus
     +    commit-graph must be disabled for the test.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
       	test_cmp expect actual
       '
       
     -+test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
     +-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
     ++test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
      +	rm -f .git/index &&
      +	echo foo >file &&
      +	git add file &&
     @@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
      +		git commit -m "tempori parendum"
      +'
      +
     -+test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
     ++test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
      +	git archive HEAD >future.tar
      +'
      +
     @@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
      +	test_cmp expect actual
      +'
      +
     - test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
     ++test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
       	rm -f .git/index &&
       	echo content >file &&
       	git add file &&
     @@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
       		git commit -m "tempori parendum"
       '
       
     +-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
     ++test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
     + 	git archive HEAD >future.tar
     + '
     + 
  -:  ----------- >  4:  591935075f1 t6600-test-reach: generalize *_three_modes
  5:  e067f653ad5 !  5:  baae7006764 commit-graph: add a slab to store topological levels
     @@ Commit message
          commit-graph: add a slab to store topological levels
      
          In a later commit we will introduce corrected commit date as the
     -    generation number v2. This value will be stored in the new seperate
     -    Generation Data chunk. However, to ensure backwards compatibility with
     -    "Old" Git we need to continue to write generation number v1, which is
     -    topological level, to the commit data chunk. This means that we need to
     -    compute both versions of generation numbers when writing the
     -    commit-graph file. Therefore, let's introduce a commit-slab to store
     +    generation number v2. Corrected commit dates will be stored in the new
     +    seperate Generation Data chunk. However, to ensure backwards
     +    compatibility with "Old" Git we need to continue to write generation
     +    number v1 (topological levels) to the commit data chunk. Thus, we need
     +    to compute and store both versions of generation numbers to write the
     +    commit-graph file.
     +
     +    Therefore, let's introduce a commit-slab `topo_level_slab` to store
          topological levels; corrected commit date will be stored in the member
          `generation` of struct commit_graph_data.
      
     -    When Git creates a split commit-graph, it takes advantage of the
     -    generation values that have been computed already and present in
     -    existing commit-graph files.
     +    The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
     +    mark commits not in the commit-graph file and commits written by a
     +    version of Git that did not compute generation numbers respectively.
     +    Generation numbers are computed identically for both kinds of commits.
     +
     +    A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
     +    is not in the commit-graph file. However, since the slab is
     +    zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
     +    Thus, we no longer need to check if the topological level of a commit is
     +    `GENERATION_NUMBER_INFINITY`.
      
     -    So, let's add a pointer to struct commit_graph as well as struct
     -    write_commit_graph_context to the topological level commit-slab
     -    and populate it with topological levels while writing a commit-graph
     -    file.
     +    We will add a pointer to the slab in `struct write_commit_graph_context`
     +    and `struct commit_graph` to populate the slab in
     +    `fill_commit_graph_info` if the commit has a pre-computed topological
     +    level as in case of split commit-graphs.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       					_("Computing commit graph generation numbers"),
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
     --		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     +-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     ++		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_INFINITY &&
      -		    generation != GENERATION_NUMBER_ZERO)
     -+		if (level != GENERATION_NUMBER_INFINITY &&
     -+		    level != GENERATION_NUMBER_ZERO)
     ++		if (level != GENERATION_NUMBER_ZERO)
       			continue;
       
       		commit_list_insert(ctx->commits.list[i], &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       
      -				if (generation == GENERATION_NUMBER_INFINITY ||
      -				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (level == GENERATION_NUMBER_INFINITY ||
     -+				    level == GENERATION_NUMBER_ZERO) {
     ++				if (level == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
       					break;
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
      -				data->generation = max_generation + 1;
       				pop_commit(&list);
       
     --				if (data->generation > GENERATION_NUMBER_V1_MAX)
     --					data->generation = GENERATION_NUMBER_V1_MAX;
     -+				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
     -+					max_level = GENERATION_NUMBER_V1_MAX - 1;
     +-				if (data->generation > GENERATION_NUMBER_MAX)
     +-					data->generation = GENERATION_NUMBER_MAX;
     ++				if (max_level > GENERATION_NUMBER_MAX - 1)
     ++					max_level = GENERATION_NUMBER_MAX - 1;
      +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
      +	struct topo_level_slab topo_levels;
       
     - 	if (!commit_graph_compatible(the_repository))
     - 		return 0;
     + 	prepare_repo_settings(the_repository);
     + 	if (!the_repository->settings.core_commit_graph) {
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       							 bloom_settings.max_changed_paths);
       	ctx->bloom_settings = &bloom_settings;
  4:  011b0aa497d !  6:  26bd6f49100 commit-graph: return 64-bit generation number
     @@ Metadata
       ## Commit message ##
          commit-graph: return 64-bit generation number
      
     -    In a preparatory step, let's return timestamp_t values from
     -    commit_graph_generation(), use timestamp_t for local variables and
     -    define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
     +    In a preparatory step for introducing corrected commit dates, let's
     +    return timestamp_t values from commit_graph_generation(), use
     +    timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
     +    as (2 ^ 63 - 1) instead.
      
          We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
          represent the largest topological level we can store in the commit data
     @@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
       	if (generation_a < generation_b)
       		return -1;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 					_("Computing commit graph generation numbers"),
     - 					ctx->commits.nr);
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     --		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     - 
     - 		display_progress(ctx->progress, i + 1);
     - 		if (generation != GENERATION_NUMBER_INFINITY &&
     -@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 				data->generation = max_generation + 1;
     + 			if (all_parents_computed) {
       				pop_commit(&list);
       
     --				if (data->generation > GENERATION_NUMBER_MAX)
     --					data->generation = GENERATION_NUMBER_MAX;
     -+				if (data->generation > GENERATION_NUMBER_V1_MAX)
     -+					data->generation = GENERATION_NUMBER_V1_MAX;
     +-				if (max_level > GENERATION_NUMBER_MAX - 1)
     +-					max_level = GENERATION_NUMBER_MAX - 1;
     ++				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
     ++					max_level = GENERATION_NUMBER_V1_MAX - 1;
     + 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
     - 	}
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       	for (i = 0; i < g->num_commits; i++) {
       		struct commit *graph_commit, *odb_commit;
     @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_grap
       			max_generation--;
       
       		generation = commit_graph_generation(graph_commit);
     + 		if (generation != max_generation + 1)
     +-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     ++			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
     + 				     oid_to_hex(&cur_oid),
     + 				     generation,
     + 				     max_generation + 1);
      
       ## commit-graph.h ##
      @@ commit-graph.h: void disable_commit_graph(struct repository *r);
     @@ commit-reach.c: int repo_in_merge_bases_many(struct repository *r, struct commit
       	struct commit_list *bases;
       	int ret = 0, i;
      -	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
     -+	timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
     ++	timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
       
       	if (repo_parse_commit(r, commit))
       		return ret;
  6:  694ef1ec08d !  7:  859c39eff52 commit-graph: implement corrected commit date
     @@ Commit message
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
     - 		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     + 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
      +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
       
       		display_progress(ctx->progress, i + 1);
     - 		if (level != GENERATION_NUMBER_INFINITY &&
     --		    level != GENERATION_NUMBER_ZERO)
     -+		    level != GENERATION_NUMBER_ZERO &&
     -+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
     -+		    corrected_commit_date != GENERATION_NUMBER_ZERO
     -+		    )
     +-		if (level != GENERATION_NUMBER_ZERO)
     ++		if (level != GENERATION_NUMBER_ZERO &&
     ++		    corrected_commit_date != GENERATION_NUMBER_ZERO)
       			continue;
       
       		commit_list_insert(ctx->commits.list[i], &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       
       			for (parent = current->parents; parent; parent = parent->next) {
       				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     --
      +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
     - 				if (level == GENERATION_NUMBER_INFINITY ||
     --				    level == GENERATION_NUMBER_ZERO) {
     -+				    level == GENERATION_NUMBER_ZERO ||
     -+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
     -+				    corrected_commit_date == GENERATION_NUMBER_ZERO
     -+				    ) {
     + 
     +-				if (level == GENERATION_NUMBER_ZERO) {
     ++				if (level == GENERATION_NUMBER_ZERO ||
     ++				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
       					break;
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       			}
       		}
       	}
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     - 
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
     --		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_V1_MAX)
     --			max_generation--;
     --
     - 		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		if (generation < max_generation + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     - 				     generation,
     - 				     max_generation + 1);
  7:  b903efe2ea1 !  8:  8403c4d0257 commit-graph: implement generation data chunk
     @@ Commit message
      
          As discovered by Ævar, we cannot increment graph version to
          distinguish between generation numbers v1 and v2 [1]. Thus, one of
     -    pre-requistes before implementing generation number was to distinguish
     -    between graph versions in a backwards compatible manner.
     +    pre-requistes before implementing generation number v2 was to
     +    distinguish between graph versions in a backwards compatible manner.
      
     -    We are going to introduce a new chunk called Generation Data chunk (or
     -    GDAT). GDAT stores corrected committer date offsets whereas CDAT will
     -    still store topological level.
     +    We are going to introduce a new chunk called Generation DATa chunk (or
     +    GDAT). GDAT will store corrected committer date offsets whereas CDAT
     +    will still store topological level.
      
          Old Git does not understand GDAT chunk and would ignore it, reading
          topological levels from CDAT. New Git can parse GDAT and take advantage
          of newer generation numbers, falling back to topological levels when
     -    GDAT chunk is missing (as it would happen with a commit graph written
     +    GDAT chunk is missing (as it would happen with a commit-graph written
          by old Git).
      
          We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
          which forces commit-graph file to be written without generation data
          chunk to emulate a commit-graph file written by old Git.
      
     -    While storing corrected commit date offset instead of the corrected
     -    commit date saves us 4 bytes per commit, it's possible for the offsets
     -    to overflow the 4-bytes allocated. As such overflows are exceedingly
     -    rare, we use the following overflow management scheme:
     +    To minimize the space required to store corrrected commit date, Git
     +    stores corrected commit date offsets into the commit-graph file, instea
     +    of corrected commit dates. This saves us 4 bytes per commit, decreasing
     +    the GDAT chunk size by half, but it's possible for the offset to
     +    overflow the 4-bytes allocated for storage. As such overflows are and
     +    should be exceedingly rare, we use the following overflow management
     +    scheme:
      
     -    We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
     +    We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
          to store corrected commit dates for commits with offsets greater than
          GENERATION_NUMBER_V2_OFFSET_MAX.
      
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       
      -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
      +	if (g->chunk_generation_data) {
     -+		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
      +
      +		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
      +			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       	if (g->topo_levels)
       		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct packed_oid_list oids;
     + 	struct oid_array oids;
       	struct packed_commit_list commits;
       	int num_extra_edges;
      +	int num_generation_data_overflows;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      +					      struct write_commit_graph_context *ctx)
      +{
      +	int i, num_generation_data_overflows = 0;
     ++
      +	for (i = 0; i < ctx->commits.nr; i++) {
      +		struct commit *c = ctx->commits.list[i];
      +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       					 struct write_commit_graph_context *ctx)
       {
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
     - 
       				if (current->date && current->date > max_corrected_commit_date)
       					max_corrected_commit_date = current->date - 1;
     -+
       				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
      +
      +				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
       						      bloom_settings.bits_per_entry);
     +@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     + 			continue;
     + 
     + 		/*
     +-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
     +-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
     +-		 * extra logic in the following condition.
     ++		 * If we are using topological level and one of our parents has
     ++		 * generation GENERATION_NUMBER_V1_MAX, then our generation is
     ++		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
     ++		 * in the following condition.
     + 		 */
     +-		if (max_generation == GENERATION_NUMBER_V1_MAX)
     ++		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
     + 			max_generation--;
     + 
     + 		generation = commit_graph_generation(graph_commit);
     +-		if (generation != max_generation + 1)
     +-			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
     ++		if (generation < max_generation + 1)
     ++			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     + 				     oid_to_hex(&cur_oid),
     + 				     generation,
     + 				     max_generation + 1);
      
       ## commit-graph.h ##
      @@
     @@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missin
       	)
       '
       
     -+test_commit_with_date() {
     -+  file="$1.t" &&
     -+  echo "$1" >"$file" &&
     -+  git add "$file" &&
     -+  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
     -+  git tag "$1"
     -+}
     ++# We test the overflow-related code with the following repo history:
     ++#
     ++#               4:F - 5:N - 6:U
     ++#              /                \
     ++# 1:U - 2:N - 3:U                M:N
     ++#              \                /
     ++#               7:N - 8:F - 9:N
     ++#
     ++# Here the commits denoted by U have committer date of zero seconds
     ++# since Unix epoch, the commits denoted by N have committer date
     ++# starting from 1112354055 seconds since Unix epoch (default committer
     ++# date for the test suite), and the commits denoted by F have committer
     ++# date of (2 ^ 31 - 2) seconds since Unix epoch.
     ++#
     ++# The largest offset observed is 2 ^ 31, just large enough to overflow.
     ++#
      +
     -+test_expect_success 'overflow corrected commit date offset' '
     ++test_expect_success 'set up and verify repo with generation data overflow chunk' '
      +	objdir=".git/objects" &&
     -+	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
     ++	UNIX_EPOCH_ZERO="@0 +0000" &&
      +	FUTURE_DATE="@2147483646 +0000" &&
      +	test_oid_cache <<-EOF &&
      +	oid_version sha1:1
     @@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missin
      +	mkdir repo &&
      +	cd repo &&
      +	git init &&
     -+	test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
      +	test_commit 2 &&
     -+	test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
      +	git commit-graph write --reachable &&
      +	graph_read_expect 3 generation_data &&
     -+	test_commit_with_date 4 "$FUTURE_DATE" &&
     ++	test_commit --date "$FUTURE_DATE" 4 &&
      +	test_commit 5 &&
     -+	test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
     ++	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
      +	git branch left &&
      +	git reset --hard 3 &&
      +	test_commit 7 &&
     -+	test_commit_with_date 8 "$FUTURE_DATE" &&
     ++	test_commit --date "$FUTURE_DATE" 8 &&
      +	test_commit 9 &&
      +	git branch right &&
      +	git reset --hard 3 &&
     -+	git merge left right &&
     ++	test_merge M left right &&
      +	git commit-graph write --reachable &&
      +	graph_read_expect 10 "generation_data generation_data_overflow" &&
      +	git commit-graph verify
      +'
      +
     -+graph_git_behavior 'overflow corrected commit date offset' repo left right
     ++graph_git_behavior 'generation data overflow chunk repo' repo left right
      +
       test_done
      
     @@ t/t6600-test-reach.sh: test_expect_success 'setup' '
       	git config core.commitGraph true
       '
       
     --run_three_modes () {
     -+run_all_modes () {
     - 	test_when_finished rm -rf .git/objects/info/commit-graph &&
     - 	"$@" <input >actual &&
     - 	test_cmp expect actual &&
     -@@ t/t6600-test-reach.sh: run_three_modes () {
     +@@ t/t6600-test-reach.sh: run_all_modes () {
       	test_cmp expect actual &&
       	cp commit-graph-half .git/objects/info/commit-graph &&
       	"$@" <input >actual &&
     @@ t/t6600-test-reach.sh: run_three_modes () {
       	test_cmp expect actual
       }
       
     --test_three_modes () {
     --	run_three_modes test-tool reach "$@"
     -+test_all_modes () {
     -+	run_all_modes test-tool reach "$@"
     - }
     - 
     - test_expect_success 'ref_newer:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:miss' '
     - 	B:commit-4-9
     - 	EOF
     - 	echo "ref_newer(A,B):0" >expect &&
     --	test_three_modes ref_newer
     -+	test_all_modes ref_newer
     - '
     - 
     - test_expect_success 'ref_newer:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:hit' '
     - 	B:commit-2-3
     - 	EOF
     - 	echo "ref_newer(A,B):1" >expect &&
     --	test_three_modes ref_newer
     -+	test_all_modes ref_newer
     - '
     - 
     - test_expect_success 'in_merge_bases:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:hit' '
     - 	B:commit-8-8
     - 	EOF
     - 	echo "in_merge_bases(A,B):1" >expect &&
     --	test_three_modes in_merge_bases
     -+	test_all_modes in_merge_bases
     - '
     - 
     - test_expect_success 'in_merge_bases:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:miss' '
     - 	B:commit-5-9
     - 	EOF
     - 	echo "in_merge_bases(A,B):0" >expect &&
     --	test_three_modes in_merge_bases
     -+	test_all_modes in_merge_bases
     - '
     - 
     - test_expect_success 'in_merge_bases_many:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:hit' '
     - 	X:commit-5-7
     - 	EOF
     - 	echo "in_merge_bases_many(A,X):1" >expect &&
     --	test_three_modes in_merge_bases_many
     -+	test_all_modes in_merge_bases_many
     - '
     - 
     - test_expect_success 'in_merge_bases_many:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss' '
     - 	X:commit-8-6
     - 	EOF
     - 	echo "in_merge_bases_many(A,X):0" >expect &&
     --	test_three_modes in_merge_bases_many
     -+	test_all_modes in_merge_bases_many
     - '
     - 
     - test_expect_success 'in_merge_bases_many:miss-heuristic' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss-heuristic' '
     - 	X:commit-6-6
     - 	EOF
     - 	echo "in_merge_bases_many(A,X):0" >expect &&
     --	test_three_modes in_merge_bases_many
     -+	test_all_modes in_merge_bases_many
     - '
     - 
     - test_expect_success 'is_descendant_of:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:hit' '
     - 	X:commit-1-1
     - 	EOF
     - 	echo "is_descendant_of(A,X):1" >expect &&
     --	test_three_modes is_descendant_of
     -+	test_all_modes is_descendant_of
     - '
     - 
     - test_expect_success 'is_descendant_of:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:miss' '
     - 	X:commit-7-6
     - 	EOF
     - 	echo "is_descendant_of(A,X):0" >expect &&
     --	test_three_modes is_descendant_of
     -+	test_all_modes is_descendant_of
     - '
     - 
     - test_expect_success 'get_merge_bases_many' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'get_merge_bases_many' '
     - 		git rev-parse commit-5-6 \
     - 			      commit-4-7 | sort
     - 	} >expect &&
     --	test_three_modes get_merge_bases_many
     -+	test_all_modes get_merge_bases_many
     - '
     - 
     - test_expect_success 'reduce_heads' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'reduce_heads' '
     - 			      commit-2-8 \
     - 			      commit-1-10 | sort
     - 	} >expect &&
     --	test_three_modes reduce_heads
     -+	test_all_modes reduce_heads
     - '
     - 
     - test_expect_success 'can_all_from_reach:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:hit' '
     - 	Y:commit-8-1
     - 	EOF
     - 	echo "can_all_from_reach(X,Y):1" >expect &&
     --	test_three_modes can_all_from_reach
     -+	test_all_modes can_all_from_reach
     - '
     - 
     - test_expect_success 'can_all_from_reach:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:miss' '
     - 	Y:commit-8-5
     - 	EOF
     - 	echo "can_all_from_reach(X,Y):0" >expect &&
     --	test_three_modes can_all_from_reach
     -+	test_all_modes can_all_from_reach
     - '
     - 
     - test_expect_success 'can_all_from_reach_with_flag: tags case' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach_with_flag: tags case' '
     - 	Y:commit-8-1
     - 	EOF
     - 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
     --	test_three_modes can_all_from_reach_with_flag
     -+	test_all_modes can_all_from_reach_with_flag
     - '
     - 
     - test_expect_success 'commit_contains:hit' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:hit' '
     - 	X:commit-9-3
     - 	EOF
     - 	echo "commit_contains(_,A,X,_):1" >expect &&
     --	test_three_modes commit_contains &&
     --	test_three_modes commit_contains --tag
     -+	test_all_modes commit_contains &&
     -+	test_all_modes commit_contains --tag
     - '
     - 
     - test_expect_success 'commit_contains:miss' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:miss' '
     - 	X:commit-9-3
     - 	EOF
     - 	echo "commit_contains(_,A,X,_):0" >expect &&
     --	test_three_modes commit_contains &&
     --	test_three_modes commit_contains --tag
     -+	test_all_modes commit_contains &&
     -+	test_all_modes commit_contains --tag
     - '
     - 
     - test_expect_success 'rev-list: basic topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: basic topo-order' '
     - 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
     - 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     - 	>expect &&
     --	run_three_modes git rev-list --topo-order commit-6-6
     -+	run_all_modes git rev-list --topo-order commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: first-parent topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent topo-order' '
     - 		commit-6-2 \
     - 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
     - 	>expect &&
     --	run_three_modes git rev-list --first-parent --topo-order commit-6-6
     -+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: range topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     - 		commit-6-2 commit-5-2 commit-4-2 \
     - 		commit-6-1 commit-5-1 commit-4-1 \
     - 	>expect &&
     --	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
     -+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: range topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
     - 		commit-6-2 commit-5-2 commit-4-2 \
     - 		commit-6-1 commit-5-1 commit-4-1 \
     - 	>expect &&
     --	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
     -+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: first-parent range topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent range topo-order' '
     - 		commit-6-2 \
     - 		commit-6-1 commit-5-1 commit-4-1 \
     - 	>expect &&
     --	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     -+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: ancestry-path topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: ancestry-path topo-order' '
     - 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
     - 		commit-6-3 commit-5-3 commit-4-3 \
     - 	>expect &&
     --	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     -+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
     - '
     - 
     - test_expect_success 'rev-list: symmetric difference topo-order' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: symmetric difference topo-order' '
     - 		commit-3-8 commit-2-8 commit-1-8 \
     - 		commit-3-7 commit-2-7 commit-1-7 \
     - 	>expect &&
     --	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
     -+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
     - '
     - 
     - test_expect_success 'get_reachable_subset:all' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:all' '
     - 			      commit-1-7 \
     - 			      commit-5-6 | sort
     - 	) >expect &&
     --	test_three_modes get_reachable_subset
     -+	test_all_modes get_reachable_subset
     - '
     - 
     - test_expect_success 'get_reachable_subset:some' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:some' '
     - 		git rev-parse commit-3-3 \
     - 			      commit-1-7 | sort
     - 	) >expect &&
     --	test_three_modes get_reachable_subset
     -+	test_all_modes get_reachable_subset
     - '
     - 
     - test_expect_success 'get_reachable_subset:none' '
     -@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:none' '
     - 	Y:commit-2-8
     - 	EOF
     - 	echo "get_reachable_subset(X,Y)" >expect &&
     --	test_three_modes get_reachable_subset
     -+	test_all_modes get_reachable_subset
     - '
     - 
     - test_done
     +
     + ## t/test-lib-functions.sh ##
     +@@ t/test-lib-functions.sh: test_commit () {
     + 		--signoff)
     + 			signoff="$1"
     + 			;;
     ++		--date)
     ++			notick=yes
     ++			GIT_COMMITTER_DATE="$2"
     ++			GIT_AUTHOR_DATE="$2"
     ++			shift
     ++			;;
     + 		-C)
     + 			indir="$2"
     + 			shift
  8:  8ec119edc66 !  9:  a3a70a1edd0 commit-graph: use generation v2 only if entire chain does
     @@ Commit message
          1. "New" Git writes a commit-graph with the GDAT chunk.
          2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
      
     -    Because of the current use of inspecting the current layer for a
     -    chunk_generation_data pointer, the commits in the lower layer will be
     -    interpreted as having very large generation values (commit date plus
     -    offset) compared to the generation numbers in the top layer (topological
     -    level). This violates the expectation that the generation of a parent is
     -    strictly smaller than the generation of a child.
     +    If each layer of split commit-graph is treated independently, as it was
     +    the case before this commit, with Git inspecting only the current layer
     +    for chunk_generation_data pointer, commits in the lower layer (one with
     +    GDAT) whould have corrected commit date as their generation number,
     +    while commits in the upper layer would have topological levels as their
     +    generation. Corrected commit dates usually have much larger values than
     +    topological levels. This means that if we take two commits, one from the
     +    upper layer, and one reachable from it in the lower layer, then the
     +    expectation that the generation of a parent is smaller than the
     +    generation of a child would be violated.
      
          It is difficult to expose this issue in a test. Since we _start_ with
          artificially low generation numbers, any commit walk that prioritizes
     @@ Commit message
          commits in the lower layer before allowing the topo-order queue to write
          anything to output (depending on the size of the upper layer).
      
     -    When writing the new layer in split commit-graph, we write a GDAT chunk
     -    only if the topmost layer has a GDAT chunk. This guarantees that if a
     -    layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
     +    Therefore, When writing the new layer in split commit-graph, we write a
     +    GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
     +    that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
     +    as well.
      
          Rewriting layers follows similar approach: if the topmost layer below
          the set of layers being rewritten (in the split commit-graph chain)
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
      -	if (g->chunk_generation_data) {
     -+	if (g->chunk_generation_data && g->read_generation_data) {
     - 		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++	if (g->read_generation_data) {
     + 		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
       
       		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
      @@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
     - 		}
     - 	}
     + 		if (i < ctx->num_commit_graphs_after)
     + 			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
       
     -+	if (!ctx->write_generation_data && g->chunk_generation_data)
     -+		ctx->write_generation_data = 1;
     ++		/*
     ++		 * If the topmost remaining layer has generation data chunk, the
     ++		 * resultant layer also has generation data chunk.
     ++		 */
     ++		if (i == ctx->num_commit_graphs_after - 2)
     ++			ctx->write_generation_data = !!g->chunk_generation_data;
      +
     - 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
     - 		ctx->new_base_graph = g;
     - 	else if (ctx->num_commit_graphs_after != 1)
     + 		i--;
     + 		g = g->base_graph;
     + 	}
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       		struct commit_graph *g = ctx->r->objects->commit_graph;
       
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       			g->topo_levels = &topo_levels;
       			g = g->base_graph;
       		}
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 
     - 		g = ctx->r->objects->commit_graph;
     +@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     + 		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
     + 		 * in the following condition.
     + 		 */
     +-		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
     ++		if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
     + 			max_generation--;
       
     -+		if (g && !g->chunk_generation_data)
     -+			ctx->write_generation_data = 0;
     -+
     - 		while (g) {
     - 			ctx->num_commit_graphs_before++;
     - 			g = g->base_graph;
     -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 
     - 		if (ctx->opts)
     - 			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
     -+
     -+		if (replace)
     -+			ctx->write_generation_data = 1;
     - 	}
     - 
     - 	ctx->approx_nr_objects = approximate_object_count();
     + 		generation = commit_graph_generation(graph_commit);
      
       ## commit-graph.h ##
      @@ commit-graph.h: struct commit_graph {
     @@ commit-graph.h: struct commit_graph {
       	const uint32_t *chunk_oid_fanout;
      
       ## t/t5324-split-commit-graph.sh ##
     -@@ t/t5324-split-commit-graph.sh: test_expect_success '--split=replace with partial Bloom data' '
     - 	verify_chain_files_exist $graphdir
     +@@ t/t5324-split-commit-graph.sh: test_expect_success 'prevent regression for duplicate commits across layers' '
     + 	git -C dup commit-graph verify
       '
       
     ++NUM_FIRST_LAYER_COMMITS=64
     ++NUM_SECOND_LAYER_COMMITS=16
     ++NUM_THIRD_LAYER_COMMITS=7
     ++NUM_FOURTH_LAYER_COMMITS=8
     ++NUM_FIFTH_LAYER_COMMITS=16
     ++SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
     ++SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
     ++THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
     ++THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
     ++FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
     ++FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
     ++FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
     ++FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
     ++
     ++# Current split graph chain:
     ++#
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
      +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
     -+	mkdir mixed &&
      +	graphdir=".git/objects/info/commit-graphs" &&
     -+	test_oid_cache <<-EOM &&
     ++	test_oid_cache <<-EOF &&
      +	oid_version sha1:1
      +	oid_version sha256:2
     -+	EOM
     -+	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git init &&
     -+	git config core.commitGraph true &&
     -+	git config gc.writeCommitGraph false &&
     -+	for i in $(test_seq 3)
     -+	do
     -+		test_commit $i &&
     -+		git branch commits/$i || return 1
     -+	done &&
     -+	git reset --hard commits/1 &&
     -+	for i in $(test_seq 4 5)
     -+	do
     -+		test_commit $i &&
     -+		git branch commits/$i || return 1
     -+	done &&
     -+	git reset --hard commits/2 &&
     -+	for i in $(test_seq 6 10)
     -+	do
     -+		test_commit $i &&
     -+		git branch commits/$i || return 1
     -+	done &&
     -+	git commit-graph write --reachable --split &&
     -+	git reset --hard commits/2 &&
     -+	git merge commits/4 &&
     -+	git branch merge/1 &&
     -+	git reset --hard commits/4 &&
     -+	git merge commits/6 &&
     -+	git branch merge/2 &&
     -+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
     -+	test-tool read-graph >output &&
     -+	cat >expect <<-EOF &&
     -+	header: 43475048 1 $(test_oid oid_version) 4 1
     -+	num_commits: 2
     -+	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output &&
     -+	git commit-graph verify
     ++	git init mixed &&
     ++	(
     ++		cd mixed &&
     ++		git config core.commitGraph true &&
     ++		git config gc.writeCommitGraph false &&
     ++		for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
     ++		do
     ++			test_commit $i &&
     ++			git branch commits/$i || return 1
     ++		done &&
     ++		git commit-graph write --reachable --split &&
     ++		graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
     ++		test_line_count = 1 $graphdir/commit-graph-chain &&
     ++		for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
     ++		do
     ++			test_commit $i &&
     ++			git branch commits/$i || return 1
     ++		done &&
     ++		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
     ++		test_line_count = 2 $graphdir/commit-graph-chain &&
     ++		test-tool read-graph >output &&
     ++		cat >expect <<-EOF &&
     ++		header: 43475048 1 $(test_oid oid_version) 4 1
     ++		num_commits: $NUM_SECOND_LAYER_COMMITS
     ++		chunks: oid_fanout oid_lookup commit_metadata
     ++		EOF
     ++		test_cmp expect output &&
     ++		git commit-graph verify &&
     ++		cat $graphdir/commit-graph-chain
     ++	)
      +'
      +
     -+test_expect_success 'does not write generation data chunk if not present on existing tip' '
     -+	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git reset --hard commits/3 &&
     -+	git merge merge/1 &&
     -+	git merge commits/5 &&
     -+	git merge merge/2 &&
     -+	git branch merge/3 &&
     -+	git commit-graph write --reachable --split=no-merge &&
     -+	test-tool read-graph >output &&
     -+	cat >expect <<-EOF &&
     -+	header: 43475048 1 $(test_oid oid_version) 4 2
     -+	num_commits: 3
     -+	chunks: oid_fanout oid_lookup commit_metadata
     -+	EOF
     -+	test_cmp expect output &&
     -+	git commit-graph verify
     ++# The new layer will be added without generation data chunk as it was not
     ++# present on the layer underneath it.
     ++#
     ++#      7 commits (No GDAT)
     ++# ------------------------
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
     ++test_expect_success 'do not write generation data chunk if not present on existing tip' '
     ++	git clone mixed mixed-no-gdat &&
     ++	(
     ++		cd mixed-no-gdat &&
     ++		for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
     ++		do
     ++			test_commit $i &&
     ++			git branch commits/$i || return 1
     ++		done &&
     ++		git commit-graph write --reachable --split=no-merge &&
     ++		test_line_count = 3 $graphdir/commit-graph-chain &&
     ++		test-tool read-graph >output &&
     ++		cat >expect <<-EOF &&
     ++		header: 43475048 1 $(test_oid oid_version) 4 2
     ++		num_commits: $NUM_THIRD_LAYER_COMMITS
     ++		chunks: oid_fanout oid_lookup commit_metadata
     ++		EOF
     ++		test_cmp expect output &&
     ++		git commit-graph verify
     ++	)
     ++'
     ++
     ++# Number of commits in each layer of the split-commit graph before merge:
     ++#
     ++#      8 commits (No GDAT)
     ++# ------------------------
     ++#      7 commits (No GDAT)
     ++# ------------------------
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
     ++# The top two layers are merged and do not have generation data chunk as layer below them does
     ++# not have generation data chunk.
     ++#
     ++#     15 commits (No GDAT)
     ++# ------------------------
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
     ++test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
     ++	git clone mixed-no-gdat mixed-merge-no-gdat &&
     ++	(
     ++		cd mixed-merge-no-gdat &&
     ++		for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
     ++		do
     ++			test_commit $i &&
     ++			git branch commits/$i || return 1
     ++		done &&
     ++		git commit-graph write --reachable --split --size-multiple 1 &&
     ++		test_line_count = 3 $graphdir/commit-graph-chain &&
     ++		test-tool read-graph >output &&
     ++		cat >expect <<-EOF &&
     ++		header: 43475048 1 $(test_oid oid_version) 4 2
     ++		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
     ++		chunks: oid_fanout oid_lookup commit_metadata
     ++		EOF
     ++		test_cmp expect output &&
     ++		git commit-graph verify
     ++	)
      +'
      +
     -+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
     -+	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git commit-graph write --reachable --split=replace &&
     -+	test_path_is_file $graphdir/commit-graph-chain &&
     -+	test_line_count = 1 $graphdir/commit-graph-chain &&
     -+	verify_chain_files_exist $graphdir &&
     -+	graph_read_expect 15 &&
     -+	git commit-graph verify
     ++# Number of commits in each layer of the split-commit graph before merge:
     ++#
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     15 commits (No GDAT)
     ++# ------------------------
     ++#     16 commits (No GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
     ++# The top three layers are merged and has generation data chunk as the topmost remaining layer
     ++# has generation data chunk.
     ++#
     ++#     47 commits (GDAT)
     ++# ------------------------
     ++#     64 commits (GDAT)
     ++#
     ++test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
     ++	git clone mixed-merge-no-gdat mixed-merge-gdat &&
     ++	(
     ++		cd mixed-merge-gdat &&
     ++		for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
     ++		do
     ++			test_commit $i &&
     ++			git branch commits/$i || return 1
     ++		done &&
     ++		git commit-graph write --reachable --split --size-multiple 1 &&
     ++		test_line_count = 2 $graphdir/commit-graph-chain &&
     ++		test-tool read-graph >output &&
     ++		cat >expect <<-EOF &&
     ++		header: 43475048 1 $(test_oid oid_version) 5 1
     ++		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
     ++		chunks: oid_fanout oid_lookup commit_metadata generation_data
     ++		EOF
     ++		test_cmp expect output
     ++	)
      +'
      +
     -+test_expect_success 'add one commit, write a tip graph' '
     -+	cd "$TRASH_DIRECTORY/mixed" &&
     -+	test_commit 11 &&
     -+	git branch commits/11 &&
     -+	git commit-graph write --reachable --split &&
     -+	test_path_is_missing $infodir/commit-graph &&
     -+	test_path_is_file $graphdir/commit-graph-chain &&
     -+	ls $graphdir/graph-*.graph >graph-files &&
     -+	test_line_count = 2 graph-files &&
     -+	verify_chain_files_exist $graphdir
     ++test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
     ++	git clone mixed mixed-replace &&
     ++	(
     ++		cd mixed-replace &&
     ++		git commit-graph write --reachable --split=replace &&
     ++		test_path_is_file $graphdir/commit-graph-chain &&
     ++		test_line_count = 1 $graphdir/commit-graph-chain &&
     ++		verify_chain_files_exist $graphdir &&
     ++		graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
     ++		git commit-graph verify
     ++	)
      +'
      +
       test_done
  9:  bb9b02af32d ! 10:  093101f908b commit-reach: use corrected commit dates in paint_down_to_common()
     @@ Metadata
       ## Commit message ##
          commit-reach: use corrected commit dates in paint_down_to_common()
      
     -    With corrected commit dates implemented, we no longer have to rely on
     -    commit date as a heuristic in paint_down_to_common().
     +    091f4cf (commit: don't use generation numbers if not needed,
     +    2018-08-30) changed paint_down_to_common() to use commit dates instead
     +    of generation numbers v1 (topological levels) as the performance
     +    regressed on certain topologies. With generation number v2 (corrected
     +    commit dates) implemented, we no longer have to rely on commit dates and
     +    can use generation numbers.
      
     -    While using corrected commit dates Git walks nearly the same number of
     +    For example, the command `git merge-base v4.8 v4.9` on the Linux
     +    repository walks 167468 commits, taking 0.135s for committer date and
     +    167496 commits, taking 0.157s for corrected committer date respectively.
     +
     +    While using corrected commit dates, Git walks nearly the same number of
          commits as commit date, the process is slower as for each comparision we
          have to access a commit-slab (for corrected committer date) instead of
          accessing struct member (for committer date).
      
     -    For example, the command `git merge-base v4.8 v4.9` on the linux
     -    repository walks 167468 commits, taking 0.135s for committer date and
     -    167496 commits, taking 0.157s for corrected committer date respectively.
     -
     -    t6404-recursive-merge setups a unique repository where all commits have
     -    the same committer date without well-defined merge-base.
     +    This change incidentally broke the fragile t6404-recursive-merge test.
     +    t6404-recursive-merge sets up a unique repository where all commits have
     +    the same committer date without a well-defined merge-base.
      
          While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
          date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
          conflicts' merges commits in the order:
     -    - Merge C with B to form a intermediate commit.
     +    - Merge C with B to form an intermediate commit.
          - Merge the intermediate commit with A.
      
          With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
          use the corrected committer date, which changes the order in which
          commits are merged:
     -    - Merge A with B to form a intermediate commit.
     +    - Merge A with B to form an intermediate commit.
          - Merge the intermediate commit with C.
      
          While resulting repositories are equivalent, 6404.4 'virtual trees were
     @@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
       	struct commit_graph *g = r->objects->commit_graph;
      
       ## commit-graph.h ##
     -@@ commit-graph.h: struct commit_graph *read_commit_graph_one(struct repository *r,
     - struct commit_graph *parse_commit_graph(struct repository *r,
     - 					void *graph_map, size_t graph_size);
     - 
     -+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
     -+
     - /*
     -  * Return 1 if and only if the repository has a commit-graph
     -  * file and generation numbers are computed in that file.
     +@@ commit-graph.h: struct commit_graph *parse_commit_graph(struct repository *r,
        */
       int generation_numbers_enabled(struct repository *r);
       
     --struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
      +/*
      + * Return 1 if and only if the repository has a commit-graph
      + * file and generation data chunk has been written for the file.
      + */
      +int corrected_commit_dates_enabled(struct repository *r);
     ++
     + struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
       
       enum commit_graph_write_flags {
     - 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
      
       ## commit-reach.c ##
      @@ commit-reach.c: static struct commit_list *paint_down_to_common(struct repository *r,
 10:  9ada43967d2 ! 11:  20299e57457 doc: add corrected commit date info
     @@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
             2 bits of the lowest byte, storing the 33rd and 34th bit of the
             commit time.
       
     -+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
     ++  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
      +    * This list of 4-byte values store corrected commit date offsets for the
      +      commits, arranged in the same order as commit data chunk.
      +    * If the corrected commit date offset cannot be stored within 31 bits,
      +      the value has its most-significant bit on and the other bits store
      +      the position of corrected commit date into the Generation Data Overflow
      +      chunk.
     ++    * Generation Data chunk is present only when commit-graph file is written
     ++      by compatible versions of Git and in case of split commit-graph chains,
     ++      the topmost layer also has Generation Data chunk.
      +
      +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
     -+    * This list of 8-byte values stores the corrected commit dates for commits
     -+      with corrected commit date offsets that cannot be stored within 31 bits.
     ++    * This list of 8-byte values stores the corrected commit date offsets
     ++      for commits with corrected commit date offsets that cannot be
     ++      stored within 31 bits.
     ++    * Generation Data Overflow chunk is present only when Generation Data
     ++      chunk is present and atleast one corrected commit date offset cannot
     ++      be stored within 31 bits.
      +
         Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
             This list of 4-byte values store the second through nth parents for
     @@ Documentation/technical/commit-graph.txt: A consumer may load the following info
       
      - * A commit with at least one parent has generation number one more than
      -   the largest generation number among its parents.
     -+  * A commit with no parents (a root commit) has corrected committer date
     ++ * A commit with no parents (a root commit) has corrected committer date
      +    equal to its committer date.
       
      -Equivalently, the generation number of a commit A is one more than the
     -+  * A commit with at least one parent has corrected committer date equal to
     ++ * A commit with at least one parent has corrected committer date equal to
      +    the maximum of its commiter date and one more than the largest corrected
      +    committer date among its parents.
      +
     -+  * As a special case, a root commit with timestamp zero has corrected commit
     ++ * As a special case, a root commit with timestamp zero has corrected commit
      +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
      +    (that is, an uncomputed corrected commit date).
      +
     @@ Documentation/technical/commit-graph.txt: is easier to use for computation and o
           generation numbers, then we always expand the boundary commit with highest
           generation number and can easily detect the stopping condition.
       
     -+The properties applies to both versions of generation number, that is both
     ++The property applies to both versions of generation number, that is both
      +corrected committer dates and topological levels.
      +
       This property can be used to significantly reduce the time it takes to
     @@ Documentation/technical/commit-graph.txt: fully-computed generation numbers. Usi
       with generation number *_INFINITY or *_ZERO is valuable.
       
      -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
     -+We use the macro GENERATION_NUMBER_MAX for commits whose
     - generation numbers are computed to be at least this value. We limit at
     - this value since it is the largest value that can be stored in the
     - commit-graph file using the 30 bits available to generation numbers. This
     +-generation numbers are computed to be at least this value. We limit at
     +-this value since it is the largest value that can be stored in the
     +-commit-graph file using the 30 bits available to generation numbers. This
     +-presents another case where a commit can have generation number equal to
     +-that of a parent.
     ++We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
     ++topological levels (generation number v1) are computed to be at least
     ++this value. We limit at this value since it is the largest value that
     ++can be stored in the commit-graph file using the 30 bits available
     ++to topological levels. This presents another case where a commit can
     ++have generation number equal to that of a parent.
     + 
     + Design Details
     + --------------
      @@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the size multiple, 64,000 for the maximum
       number of commits) could be extracted into config settings for full
       flexibility.
     @@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the s
      +A naive approach of using the newest available generation number from
      +each layer would lead to violated expectations: the lower layer would
      +use corrected commit dates which are much larger than the topological
     -+levels of the higher layer. For this reason, Git inspects each layer to
     -+see if any layer is missing corrected commit dates. In such a case, Git
     -+only uses topological level
     ++levels of the higher layer. For this reason, Git inspects the topmost
     ++layer to see if the layer is missing corrected commit dates. In such a case
     ++Git only uses topological level for generation numbers.
      +
      +When writing a new layer in split commit-graph, we write corrected commit
      +dates if the topmost layer has corrected commit dates written. This
     @@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the s
      +must have corrected commit dates as well.
      +
      +When merging layers, we do not consider whether the merged layers had corrected
     -+commit dates. Instead, the new layer will have corrected commit dates if and
     -+only if all existing layers below the new layer have corrected commit dates.
     ++commit dates. Instead, the new layer will have corrected commit dates if the
     ++layer below the new layer has corrected commit dates.
     ++
     ++While writing or merging layers, if the new layer is the only layer, it will
     ++have corrected commit dates when written by compatible versions of Git. Thus,
     ++rewriting split commit-graph as a single file (`--split=replace`) creates a
     ++single layer with corrected commit dates.
      +
       ## Deleting graph-{hash} files
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:15         ` Abhishek Kumar via GitGitGadget
  2020-12-30  1:35           ` Derrick Stolee
  2021-01-05  9:45           ` SZEDER Gábor
  2020-12-28 11:15         ` [PATCH v5 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                           ` (11 subsequent siblings)
  12 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Before computing Bloom fitlers, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.

But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.

Not all hope is lost, though: 'commit_graph_generation()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparision function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters. Applying this patch (compared
with v2.29.1) speeds up computing Bloom filters by around ~4
seconds.

So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 06f8dc1d896..caf823295f4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;

-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 02/11] revision: parse parent in indegree_walk_step()
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
  2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:15         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                           ` (10 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 9dff845bed6..de8e45f462f 100644
--- a/revision.c
+++ b/revision.c
@@ -3373,6 +3373,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
  2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
  2020-12-28 11:15         ` [PATCH v5 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
                           ` (9 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 27 ++++++++++-----------------
 t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index caf823295f4..d5b33b4f7ac 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
 	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
 	test_cmp expect actual
 '
 
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+	rm -f .git/index &&
+	echo foo >file &&
+	git add file &&
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
+		git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+	git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+	echo 2514 >expect &&
+	tar_info future.tar | cut -d" " -f2 >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
 		git commit -m "tempori parendum"
 '
 
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
 	git archive HEAD >future.tar
 '
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (2 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                           ` (8 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without Generation Data
chunk. These files represent commit-graph files written by Old Git and
are neccesary for backward compatability.

We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
 	X:commit-5-7
 	EOF
 	echo "in_merge_bases_many(A,X):1" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
 	X:commit-8-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
 	X:commit-6-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 05/11] commit-graph: add a slab to store topological levels
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (3 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                           ` (7 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d5b33b4f7ac..c98e8910fe2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -956,6 +961,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1102,7 +1108,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1332,11 +1338,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1344,29 +1349,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2102,6 +2104,7 @@ int write_commit_graph(struct object_directory *odb,
 	int res = 0;
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct topo_level_slab topo_levels;
 
 	prepare_repo_settings(the_repository);
 	if (!the_repository->settings.core_commit_graph) {
@@ -2128,6 +2131,18 @@ int write_commit_graph(struct object_directory *odb,
 							 bloom_settings.max_changed_paths);
 	ctx->bloom_settings = &bloom_settings;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 06/11] commit-graph: return 64-bit generation number
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (4 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                           ` (6 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 22 +++++++++++-----------
 commit-graph.h |  4 ++--
 commit-reach.c | 36 ++++++++++++++++++------------------
 commit-reach.h |  2 +-
 commit.c       |  4 ++--
 commit.h       |  4 ++--
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 8 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c98e8910fe2..1b2a015f92f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -146,8 +146,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1366,8 +1366,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			if (all_parents_computed) {
 				pop_commit(&list);
 
-				if (max_level > GENERATION_NUMBER_MAX - 1)
-					max_level = GENERATION_NUMBER_MAX - 1;
+				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
@@ -2363,8 +2363,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2428,16 +2428,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
 		 * extra logic in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
+		if (max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
 		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 	const struct commit *a = *(const struct commit * const *)_a;
 	const struct commit *b = *(const struct commit * const *)_b;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
 
 	if (generation_a < generation_b)
 		return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.c b/commit.c
index fe1fa3dc41f..17abf92a2d2 100644
--- a/commit.c
+++ b/commit.c
@@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a),
+			  generation_b = commit_graph_generation(b);
 
 	/* newer commits first */
 	if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index 5467786c7be..33c66b2177c 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
diff --git a/revision.c b/revision.c
index de8e45f462f..d55c2e4d566 100644
--- a/revision.c
+++ b/revision.c
@@ -3300,7 +3300,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3346,7 +3346,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3389,7 +3389,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3447,7 +3447,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (repo_parse_commit_gently(revs->repo, c, 1))
 			continue;
@@ -3508,7 +3508,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 07/11] commit-graph: implement corrected commit date
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (5 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-30  1:53           ` Derrick Stolee
  2020-12-28 11:16         ` [PATCH v5 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                           ` (5 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if one of commits is malformed but
valid and has committerdate of 0 Unix time, as the offset would be the
same as corrected commit date and thus require 64-bits to be stored
properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1b2a015f92f..bfc3aae5f93 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
-		if (level != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 
-				if (level == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1369,6 +1378,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date && current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 08/11] commit-graph: implement generation data chunk
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (6 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                           ` (4 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 111 ++++++++++++++++++++++++++++++----
 commit-graph.h                |   3 +
 commit.h                      |   1 +
 t/README                      |   3 +
 t/helper/test-read-graph.c    |   4 ++
 t/t4216-log-bloom.sh          |   4 +-
 t/t5318-commit-graph.sh       |  79 ++++++++++++++++++++----
 t/t5324-split-commit-graph.sh |  12 ++--
 t/t6600-test-reach.sh         |   6 ++
 t/test-lib-functions.sh       |   6 ++
 10 files changed, 197 insertions(+), 32 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index bfc3aae5f93..629b2f17fbc 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
@@ -390,6 +394,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+			if (graph->chunk_generation_data_overflow)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data_overflow = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -750,8 +768,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
-	uint32_t lex_index;
-	uint64_t date_high, date_low;
+	uint32_t lex_index, offset_pos;
+	uint64_t date_high, date_low, offset;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -769,7 +787,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data) {
+		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+		} else
+			graph_data->generation = item->date + offset;
+	} else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -941,6 +968,7 @@ struct write_commit_graph_context {
 	struct oid_array oids;
 	struct packed_commit_list commits;
 	int num_extra_edges;
+	int num_generation_data_overflows;
 	unsigned long approx_nr_objects;
 	struct progress *progress;
 	int progress_done;
@@ -959,7 +987,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
@@ -1119,6 +1148,45 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i, num_generation_data_overflows = 0;
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+			num_generation_data_overflows++;
+		}
+
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+						       struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			hashwrite_be32(f, offset >> 32);
+			hashwrite_be32(f, (uint32_t) offset);
+		}
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1382,6 +1450,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
@@ -1715,6 +1786,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
+	if (ctx->num_generation_data_overflows) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2135,6 +2221,8 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
+	ctx->num_generation_data_overflows = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
@@ -2441,16 +2529,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-		 * extra logic in the following condition.
+		 * If we are using topological level and one of our parents has
+		 * generation GENERATION_NUMBER_V1_MAX, then our generation is
+		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+		 * in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+		if (generation < max_generation + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -68,6 +69,8 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
+	const unsigned char *chunk_generation_data_overflow;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 33c66b2177c..251d877fcf6 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
 #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
 #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
 
 struct commit_list {
 	struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
+	if (graph->chunk_generation_data_overflow)
+		printf(" generation_data_overflow");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41c..dbde0161882 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 '
 
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+# We test the overflow-related code with the following repo history:
+#
+#               4:F - 5:N - 6:U
+#              /                \
+# 1:U - 2:N - 3:U                M:N
+#              \                /
+#               7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+	objdir=".git/objects" &&
+	UNIX_EPOCH_ZERO="@0 +0000" &&
+	FUTURE_DATE="@2147483646 +0000" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	cd "$TRASH_DIRECTORY" &&
+	mkdir repo &&
+	cd repo &&
+	git init &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+	test_commit 2 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 3 generation_data &&
+	test_commit --date "$FUTURE_DATE" 4 &&
+	test_commit 5 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+	git branch left &&
+	git reset --hard 3 &&
+	test_commit 7 &&
+	test_commit --date "$FUTURE_DATE" 8 &&
+	test_commit 9 &&
+	git branch right &&
+	git reset --hard 3 &&
+	test_merge M left right &&
+	git commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data generation_data_overflow" &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
 test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 
 	oid_version sha1:1
 	oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
@@ -67,6 +70,9 @@ run_all_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 999982fe4a9..3ad712c3acc 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
 		--signoff)
 			signoff="$1"
 			;;
+		--date)
+			notick=yes
+			GIT_COMMITTER_DATE="$2"
+			GIT_AUTHOR_DATE="$2"
+			shift
+			;;
 		-C)
 			indir="$2"
 			shift
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (7 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-30  3:23           ` Derrick Stolee
  2020-12-28 11:16         ` [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                           ` (3 subsequent siblings)
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                |  29 +++++-
 commit-graph.h                |   1 +
 t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
 3 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 629b2f17fbc..41a65d98738 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -610,6 +610,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+	int read_generation_data;
+
+	if (!g)
+		return;
+
+	read_generation_data = !!g->chunk_generation_data;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -618,6 +633,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(g);
+
 	return g;
 }
 
@@ -787,7 +804,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data) {
+	if (g->read_generation_data) {
 		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2012,6 +2029,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		if (i < ctx->num_commit_graphs_after)
 			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
 
+		/*
+		 * If the topmost remaining layer has generation data chunk, the
+		 * resultant layer also has generation data chunk.
+		 */
+		if (i == ctx->num_commit_graphs_after - 2)
+			ctx->write_generation_data = !!g->chunk_generation_data;
+
 		i--;
 		g = g->base_graph;
 	}
@@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
 		struct commit_graph *g = ctx->r->objects->commit_graph;
 
 		while (g) {
+			g->read_generation_data = 1;
 			g->topo_levels = &topo_levels;
 			g = g->base_graph;
 		}
@@ -2534,7 +2559,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
 		 * in the following condition.
 		 */
-		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	unsigned int read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
 	git -C dup commit-graph verify
 '
 
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	graphdir=".git/objects/info/commit-graphs" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	git init mixed &&
+	(
+		cd mixed &&
+		git config core.commitGraph true &&
+		git config gc.writeCommitGraph false &&
+		for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split &&
+		graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 1
+		num_commits: $NUM_SECOND_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify &&
+		cat $graphdir/commit-graph-chain
+	)
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+	git clone mixed mixed-no-gdat &&
+	(
+		cd mixed-no-gdat &&
+		for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $NUM_THIRD_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#      8 commits (No GDAT)
+# ------------------------
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+	git clone mixed-no-gdat mixed-merge-no-gdat &&
+	(
+		cd mixed-merge-no-gdat &&
+		for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top three layers are merged and has generation data chunk as the topmost remaining layer
+# has generation data chunk.
+#
+#     47 commits (GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+	git clone mixed-merge-no-gdat mixed-merge-gdat &&
+	(
+		cd mixed-merge-gdat &&
+		for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 5 1
+		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		EOF
+		test_cmp expect output
+	)
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+	git clone mixed mixed-replace &&
+	(
+		cd mixed-replace &&
+		git commit-graph write --reachable --split=replace &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir &&
+		graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+		git commit-graph verify
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (8 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-28 11:16         ` [PATCH v5 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                           ` (2 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6404-recursive-merge.sh |  5 ++++-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 41a65d98738..c8d7ed13302 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -710,6 +710,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return g->read_generation_data;
+}
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
 	struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
 
 enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index b1c3d4dda49..86f74ae5847 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
 '
 
 test_expect_success 'virtual trees were processed' '
+	# TODO: fragile test, relies on ambigious merge-base resolution
 	git ls-files --stage >out &&
 
 	cat >expect <<-EOF &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v5 11/11] doc: add corrected commit date info
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (9 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16         ` Abhishek Kumar via GitGitGadget
  2020-12-30  4:35         ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 28 +++++--
 Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
 2 files changed, 86 insertions(+), 19 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,33 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+    * Generation Data chunk is present only when commit-graph file is written
+      by compatible versions of Git and in case of split commit-graph chains,
+      the topmost layer also has Generation Data chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit date offsets
+      for commits with corrected commit date offsets that cannot be
+      stored within 31 bits.
+    * Generation Data Overflow chunk is present only when Generation Data
+      chunk is present and atleast one corrected commit date offset cannot
+      be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation nummber v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+    (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
     generation numbers, then we always expand the boundary commit with highest
     generation number and can easily detect the stopping condition.
 
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-12-30  1:35           ` Derrick Stolee
  2021-01-08  5:45             ` Abhishek Kumar
  2021-01-05  9:45           ` SZEDER Gábor
  1 sibling, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30  1:35 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Before computing Bloom fitlers, the commit-graph machinery uses

s/fitlers/filters/

> commit_gen_cmp to sort commits by generation order for improved diff
> performance. 3d11275505 (commit-graph: examine commits by generation
> number, 2020-03-30) claims that this sort can reduce the time spent to
> compute Bloom filters by nearly half.
> 
> But since c49c82aa4c (commit: move members graph_pos, generation to a
> slab, 2020-06-17), this optimization is broken, since asking for a
> 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> while writing.
> 
> Not all hope is lost, though: 'commit_graph_generation()' falls back to
> comparing commits by their date when they have equal generation number,
> and so since c49c82aa4c is purely a date comparision function. This

s/comparision/comparison/

> heuristic is good enough that we don't seem to loose appreciable
> performance while computing Bloom filters. Applying this patch (compared
> with v2.29.1) speeds up computing Bloom filters by around ~4
> seconds.

Using "~4 seconds" here is odd since there is no baseline. Which
repository did you use?

Previous discussion used relative terms. Something like "speeds up by
a factor of 1.25" or something might be interesting.

> So, avoid the useless 'commit_graph_generation()' while writing by
> instead accessing the slab directly. This returns the newly-computed
> generation numbers, and allows us to avoid the heuristic by directly
> comparing generation numbers.

This introduces some timing restrictions to the ability for this
comparison function. It would be dangerous if someone extracted
the method for another purpose. A comment above these lines could
warn future developers from making that mistake, but they would
probably use the comparison functions in commit.c instead.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 06f8dc1d896..caf823295f4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>  
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> 


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 07/11] commit-graph: implement corrected commit date
  2020-12-28 11:16         ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-12-30  1:53           ` Derrick Stolee
  2021-01-10 12:21             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30  1:53 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With most of preparations done, let's implement corrected commit date.
> 
> The corrected commit date for a commit is defined as:
> 
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.
> 
> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).
> 
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
> 
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
> 
> However, using offsets be problematic if one of commits is malformed but

However, using 32-bit offsets is problematic if a commit is malformed...

> valid and has committerdate of 0 Unix time, as the offset would be the

s/committerdate/committer date/

> same as corrected commit date and thus require 64-bits to be stored
> properly.
> 
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 1b2a015f92f..bfc3aae5f93 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>  
>  		display_progress(ctx->progress, i + 1);
> -		if (level != GENERATION_NUMBER_ZERO)
> +		if (level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO)
>  			continue;
>  
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;
>  
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>  
> -				if (level == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;

nit: the "break" in the first case makes it so this large else block
is unnecessary. 

-				if (level == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				
+				if (level > max_level)
+					max_level = level;
+
+				if (corrected_commit_date > max_corrected_commit_date)
+					max_corrected_commit_date = corrected_commit_date;
-				}
 			}

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
  2020-12-28 11:16         ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-12-30  3:23           ` Derrick Stolee
  2021-01-10 13:13             ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30  3:23 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Abhishek Kumar, Taylor Blau

On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>

...

> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> +	int read_generation_data;
> +
> +	if (!g)
> +		return;
> +
> +	read_generation_data = !!g->chunk_generation_data;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}
> +}
> +

This method exists to say "use generation v2 if the top layer has it"
and that helps with the future layer checks.

> @@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
>  		struct commit_graph *g = ctx->r->objects->commit_graph;
>  
>  		while (g) {
> +			g->read_generation_data = 1;
>  			g->topo_levels = &topo_levels;
>  			g = g->base_graph;
>  		}

However, here you just turn them on automatically.

I think the diff you want is here:

 		struct commit_graph *g = ctx->r->objects->commit_graph;
 
+ 		validate_mixed_generation_chain(g);
+ 
 		while (g) {
 			g->topo_levels = &topo_levels;
 			g = g->base_graph;
 		}

But maybe you have a good reason for what you already have.

I paid attention to this because I hit a problem in my local testing.
After trying to reproduce it, I think the root cause is that I had a
commit-graph that was written by an older version of your series, so
it caused an unexpected pairing of an "offset required" bit but no
offset chunk.

Perhaps this diff is required in the proper place to avoid the
segfault I hit, in the case of a malformed commit-graph file:

diff --git a/commit-graph.c b/commit-graph.c
index c8d7ed1330..d264c90868 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -822,6 +822,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+			if (!g->chunk_generation_data_overflow)
+				die(_("commit-graph requires overflow generation data but has none"));
+
 			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
 			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
 		} else

Your tests in this patch seem very thorough, covering all the cases
I could think to create this strange situation. I even tried creating
cases where the overflow would be necessary. The following test actually
fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
process I was trying to trick into failure.

diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 8e90f3423b..cfef8e52b9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
        git -C dup commit-graph verify
 '
 
+test_expect_success 'upgrade to generation data succeeds when there was none' '
+	(
+		cd dup &&
+		rm -rf .git/objects/info/commit-graph* &&
+		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
+			write --reachable &&
+		GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
+		GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
+		GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
+		git commit-graph write --reachable &&
+		graph_read_expect 6
+	)
+'
+
 NUM_FIRST_LAYER_COMMITS=64
 NUM_SECOND_LAYER_COMMITS=16
 NUM_THIRD_LAYER_COMMITS=7

Thanks,
-Stolee

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (10 preceding siblings ...)
  2020-12-28 11:16         ` [PATCH v5 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-12-30  4:35         ` Derrick Stolee
  2021-01-10 14:06           ` Abhishek Kumar
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
  12 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30  4:35 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar

On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Abhishek,

Thank you for this version. I appreciate your hard work on this topic,
especially after GSoC ended and you returned to being a full-time student.

My hope was that I could completely approve this series and only provide
forward-fixes from here on out, as necessary. I think there are a few minor
typos that you might want to address, but I was also able to understand your
intention.

I did make a particular case about a SEGFAULT I hit that I have been unable
to replicate. I saw it both in my copy of torvalds/linux and of
chromium/chromium. I have the file for chromium/chromium that is in a bad
state where a GDAT value includes the bit saying it should be in the long
offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
exist in a from-scratch write.

I'm now taking backups of my existing commit-graph files before any later
test, but it doesn't repro for my Git repository or any other repo I try on
purpose.

However, I did some performance testing to double-check your numbers. I sent
a patch [1] that helps with some of the hard numbers.

[1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/

The big question is whether the overhead from using a slab to store the
generation values is worth it. I still think it is, for these reasons:

1. Generation number v2 is measurably better than v1 in most user cases.

2. Generation number v2 is slower than using committer date due to the
   overhead, but _guarantees correctness_.

I like to use "git log --graph -<N>" to compare against topological levels
(v1), for various levels of <N>. When <N> is small, we hope to minimize
the amount we need to walk using the extra commit-date information as an
assistance. Repos like git/git and torvalds/linux use the philosophy of
"base your changes on oldest applicable commit" enough that v1 struggles
sometimes.

git/git: N=1000

	Benchmark #1: baseline
	Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
	Range (min … max):    94.5 ms … 105.1 ms    28 runs

	Benchmark #2: test
	Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
	Range (min … max):    29.8 ms …  40.6 ms    81 runs

	Summary
	'test' ran
	2.80 ± 0.27 times faster than 'baseline'

This is a dramatic improvement! Using my topo-walk stats commit, I see that
v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
4,335 commits!

torvalds/linux: N=1000 (starting at v5.10)

	Benchmark #1: baseline
	Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
	Range (min … max):    85.2 ms …  96.2 ms    31 runs

	Benchmark #2: test
	Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
	Range (min … max):    42.9 ms …  54.0 ms    61 runs

	Summary
	'test' ran
	1.85 ± 0.15 times faster than 'baseline'

Similarly, v1 walked 38,161 commits compared to 4,340 by v2.

If I increase N to something like 10,000, then usually these values get
washed out due to the width of the parallel topics.

The place we were still using commit-date as a heuristic was paint_down_to_common
which caused a regression the first time we used v1, at least for certain cases.

Specifically, computing the merge-base in torvalds/linux between v4.8 and v4.9
hit a strangeness about a pair of recent commits both based on a very old commit,
but the generation numbers forced walking farther than necessary. This doesn't
happen with v2, but we see the overhead cost of the slabs:

	Benchmark #1: baseline
	Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
	Range (min … max):   107.7 ms … 118.0 ms    26 runs

	Benchmark #2: test
	Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
	Range (min … max):   141.4 ms … 162.2 ms    18 runs

	Summary
	'baseline' ran
	1.30 ± 0.06 times faster than 'test'

The overhead still exists for a more recent pair of versions (v5.0 and v5.1):

	Benchmark #1: baseline
	Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
	Range (min … max):    19.0 ms …  32.8 ms    99 runs

	Benchmark #2: test
	Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
	Range (min … max):    27.0 ms …  38.4 ms    105 runs

	Summary
	'baseline' ran
	1.33 ± 0.22 times faster than 'test'

I still think this overhead is worth it. In case not everyone agrees, it _might_
be worth a command-line option to skip the GDAT chunk. That also prevents an
ability to eventually wean entirely of generation number v1 and allow the commit
date to take the full 64-bit column (instead of only 34 bits, saving 30 for
topo-levels).

Again, such a modification should not be considered required for this series.

> ----------------------------------------------------------------------------
> 
> Improvements left for a future series:
> 
>  * Save commits with generation data overflow and extra edge commits instead
>    of looping over all commits. cf. 858sbel67n.fsf@gmail.com
>  * Verify both topological levels and corrected commit dates when present.
>    cf. 85pn4tnk8u.fsf@gmail.com

These seem like reasonable things to delay for a later series
or for #leftoverbits

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
  2020-12-30  1:35           ` Derrick Stolee
@ 2021-01-05  9:45           ` SZEDER Gábor
  2021-01-05  9:47             ` SZEDER Gábor
  2021-01-08  5:51             ` Abhishek Kumar
  1 sibling, 2 replies; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-05  9:45 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau,
	Abhishek Kumar

On Mon, Dec 28, 2020 at 11:15:58AM +0000, Abhishek Kumar via GitGitGadget wrote:
> Before computing Bloom fitlers, the commit-graph machinery uses
> commit_gen_cmp to sort commits by generation order for improved diff
> performance. 3d11275505 (commit-graph: examine commits by generation
> number, 2020-03-30) claims that this sort can reduce the time spent to
> compute Bloom filters by nearly half.

That's true, though there are repositories where it has basically no
effect.  Alas we can't directly test it, because in 3d11275505 there
is no '--changed-paths' option yet... one has to revert 3d11275505 on
top of d38e07b8c4 (commit-graph: add --changed-paths option to write
subcommand, 2020-04-06) to make any runtime comparisons ('git
commit-graph write --reachable --changed-paths', best of five):

                   Sorting by
               pack    | generation
             position  |
    -------------------+------------
    gcc      114.821s  |    38.963s 
    git        8.896s  |     5.620s
    linux    209.984s  |   104.900s
    webkit    35.193s  |    35.482s

Note the almost 3x speedup in the gcc repository, and the basically
negligible slowdown in the webkit repo.

> But since c49c82aa4c (commit: move members graph_pos, generation to a
> slab, 2020-06-17), this optimization is broken, since asking for a
> 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> while writing.

I wouldn't say that c49c82aa4c broke this optimisation, because:

did not break that optimization.  Though, sadly, it's not
mentioned in 3d11275505's commit message, when commit_gen_cmp()
compares two commits with identical generation numbers, then it
doesn't leave them unsorted, but falls back to use their committer
date as a tie-braker.  This means that after c49c82aa4c the commits
are sorted by committer date, which appears to be so good a heuristic
for Bloom filter computation that there is barely any slowdown
compared to sorting by generation numbers:

> Not all hope is lost, though: 'commit_graph_generation()' falls back to

You mean commit_gen_cmp() here.

> comparing commits by their date when they have equal generation number,
> and so since c49c82aa4c is purely a date comparision function. This
> heuristic is good enough that we don't seem to loose appreciable
> performance while computing Bloom filters.

Indeed, c49c82aa4c barely caused any runtime difference in the
repositories I usually use to test modified path Bloom filter
performance:

                 c49c82aa4c^  c49c82aa4c
  ---------------------------------------------
  android-base     43.057s     43.091s   0.07%
  cmssw            21.781s     21.856s   0.34%
  cpython           9.626s      9.724s   1.01%
  elasticsearch    18.049s     18.224s   0.96%
  gcc              40.312s     40.255s  -0.14%
  gecko-dev       104.515s    104.740s   0.21%
  git               5.559s      5.570s   0.19%
  glibc             4.455s      4.468s   0.29%
  go                4.009s      4.016s   0.17%
  homebrew-cask    30.759s     30.523s  -0.76%
  homebrew-core    57.122s     56.553s  -0.99%
  jdk              18.297s     18.364s   0.36%
  linux           104.499s    105.302s   0.76%
  llvm-project     34.074s     34.446s   1.09%
  rails             6.472s      6.486s   0.21%
  rust             14.943s     14.947s   0.02%
  tensorflow       13.362s     13.477s   0.86%
  webkit           34.583s     34.601s   0.05%

> Applying this patch (compared
> with v2.29.1) speeds up computing Bloom filters by around ~4
> seconds.

Without a baseline and knowing which repo, this "~4 seconds" is
meaningless.

Here are my results comparing this fix to v2.30.0, best of five:

                              v2.30.0 +
                   v2.30.0    this fix
  ---------------------------------------------
  android-base     42.786s     42.933s   0.34%
  cmssw            20.229s     20.160s  -0.34%
  cpython           9.616s      9.647s   0.32%
  elasticsearch    16.859s     16.936s   0.45%
  gcc              38.909s     36.889s  -5.19%
  gecko-dev        99.417s     98.558s  -0.86%
  git               5.620s      5.509s  -1.97%
  glibc             4.307s      4.301s  -0.13%
  go                3.971s      3.938s  -0.83%
  homebrew-cask    31.262s     30.283s  -3.13%
  homebrew-core    57.842s     55.663s  -3.76%
  jdk              12.557s     12.251s  -2.43%
  linux            94.335s     94.760s   0.45%
  llvm-project     34.432s     33.988s  -1.28%
  rails             6.481s      6.454s  -0.41%
  rust             14.772s     14.601s  -1.15%
  tensorflow       11.759s     11.711s  -0.40%
  webkit           33.917s     33.759s  -0.46%

> So, avoid the useless 'commit_graph_generation()' while writing by
> instead accessing the slab directly. This returns the newly-computed
> generation numbers, and allows us to avoid the heuristic by directly
> comparing generation numbers.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 06f8dc1d896..caf823295f4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>  
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2021-01-05  9:45           ` SZEDER Gábor
@ 2021-01-05  9:47             ` SZEDER Gábor
  2021-01-08  5:51             ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-05  9:47 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau,
	Abhishek Kumar

On Tue, Jan 05, 2021 at 10:45:35AM +0100, SZEDER Gábor wrote:
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
> 
> I wouldn't say that c49c82aa4c broke this optimisation, because:
> 
> did not break that optimization.  Though, sadly, it's not
> mentioned in 3d11275505's commit message, when commit_gen_cmp()
> compares two commits with identical generation numbers, then it
> doesn't leave them unsorted, but falls back to use their committer
> date as a tie-braker.  This means that after c49c82aa4c the commits
> are sorted by committer date, which appears to be so good a heuristic
> for Bloom filter computation that there is barely any slowdown
> compared to sorting by generation numbers:

Gaah, scratch this paragraph; I first misunderstood what you wrote in
the paragraph below, but then forgot to remove it.

> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
> 
> You mean commit_gen_cmp() here.
> 
> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters.

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2020-12-30  1:35           ` Derrick Stolee
@ 2021-01-08  5:45             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-08  5:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On Tue, Dec 29, 2020 at 08:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > Before computing Bloom fitlers, the commit-graph machinery uses
> 
> s/fitlers/filters/
> 
> > commit_gen_cmp to sort commits by generation order for improved diff
> > performance. 3d11275505 (commit-graph: examine commits by generation
> > number, 2020-03-30) claims that this sort can reduce the time spent to
> > compute Bloom filters by nearly half.
> > 
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
> > 
> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
> 
> s/comparision/comparison/
> 
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters. Applying this patch (compared
> > with v2.29.1) speeds up computing Bloom filters by around ~4
> > seconds.
> 
> Using "~4 seconds" here is odd since there is no baseline. Which
> repository did you use?
> 

I used the linux repository, will mention that.

> Previous discussion used relative terms. Something like "speeds up by
> a factor of 1.25" or something might be interesting.
> 

As SZEDER Gábor found, the improvements are rather minor - ranging from
0.40% to 5.19% [1]. I want to make sure this is the correct way to word
in the commit message:

Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various
repositories. 

https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/

> > So, avoid the useless 'commit_graph_generation()' while writing by
> > instead accessing the slab directly. This returns the newly-computed
> > generation numbers, and allows us to avoid the heuristic by directly
> > comparing generation numbers.
> 
> This introduces some timing restrictions to the ability for this
> comparison function. It would be dangerous if someone extracted
> the method for another purpose. A comment above these lines could
> warn future developers from making that mistake, but they would
> probably use the comparison functions in commit.c instead.
> 

Sure, will add a comment above.

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 06f8dc1d896..caf823295f4 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >  
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > 
> 

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
  2021-01-05  9:45           ` SZEDER Gábor
  2021-01-05  9:47             ` SZEDER Gábor
@ 2021-01-08  5:51             ` Abhishek Kumar
  1 sibling, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-08  5:51 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me, stolee

On Tue, Jan 05, 2021 at 10:45:35AM +0100, SZEDER Gábor wrote:
> On Mon, Dec 28, 2020 at 11:15:58AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > Before computing Bloom fitlers, the commit-graph machinery uses
> > commit_gen_cmp to sort commits by generation order for improved diff
> > performance. 3d11275505 (commit-graph: examine commits by generation
> > number, 2020-03-30) claims that this sort can reduce the time spent to
> > compute Bloom filters by nearly half.
> 
> That's true, though there are repositories where it has basically no
> effect.  Alas we can't directly test it, because in 3d11275505 there
> is no '--changed-paths' option yet... one has to revert 3d11275505 on
> top of d38e07b8c4 (commit-graph: add --changed-paths option to write
> subcommand, 2020-04-06) to make any runtime comparisons ('git
> commit-graph write --reachable --changed-paths', best of five):
> 
>                    Sorting by
>                pack    | generation
>              position  |
>     -------------------+------------
>     gcc      114.821s  |    38.963s 
>     git        8.896s  |     5.620s
>     linux    209.984s  |   104.900s
>     webkit    35.193s  |    35.482s
> 
> Note the almost 3x speedup in the gcc repository, and the basically
> negligible slowdown in the webkit repo.
> 
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
> 
> I wouldn't say that c49c82aa4c broke this optimisation, because:
> 
> did not break that optimization.  Though, sadly, it's not
> mentioned in 3d11275505's commit message, when commit_gen_cmp()
> compares two commits with identical generation numbers, then it
> doesn't leave them unsorted, but falls back to use their committer
> date as a tie-braker.  This means that after c49c82aa4c the commits
> are sorted by committer date, which appears to be so good a heuristic
> for Bloom filter computation that there is barely any slowdown
> compared to sorting by generation numbers:
> 
> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
> 
> You mean commit_gen_cmp() here.
> 

Yes, fixed.

> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters.
> 
> Indeed, c49c82aa4c barely caused any runtime difference in the
> repositories I usually use to test modified path Bloom filter
> performance:
> 
>                  c49c82aa4c^  c49c82aa4c
>   ---------------------------------------------
>   android-base     43.057s     43.091s   0.07%
>   cmssw            21.781s     21.856s   0.34%
>   cpython           9.626s      9.724s   1.01%
>   elasticsearch    18.049s     18.224s   0.96%
>   gcc              40.312s     40.255s  -0.14%
>   gecko-dev       104.515s    104.740s   0.21%
>   git               5.559s      5.570s   0.19%
>   glibc             4.455s      4.468s   0.29%
>   go                4.009s      4.016s   0.17%
>   homebrew-cask    30.759s     30.523s  -0.76%
>   homebrew-core    57.122s     56.553s  -0.99%
>   jdk              18.297s     18.364s   0.36%
>   linux           104.499s    105.302s   0.76%
>   llvm-project     34.074s     34.446s   1.09%
>   rails             6.472s      6.486s   0.21%
>   rust             14.943s     14.947s   0.02%
>   tensorflow       13.362s     13.477s   0.86%
>   webkit           34.583s     34.601s   0.05%
> 
> > Applying this patch (compared
> > with v2.29.1) speeds up computing Bloom filters by around ~4
> > seconds.
> 
> Without a baseline and knowing which repo, this "~4 seconds" is
> meaningless.
> 
> Here are my results comparing this fix to v2.30.0, best of five:
> 
>                               v2.30.0 +
>                    v2.30.0    this fix
>   ---------------------------------------------
>   android-base     42.786s     42.933s   0.34%
>   cmssw            20.229s     20.160s  -0.34%
>   cpython           9.616s      9.647s   0.32%
>   elasticsearch    16.859s     16.936s   0.45%
>   gcc              38.909s     36.889s  -5.19%
>   gecko-dev        99.417s     98.558s  -0.86%
>   git               5.620s      5.509s  -1.97%
>   glibc             4.307s      4.301s  -0.13%
>   go                3.971s      3.938s  -0.83%
>   homebrew-cask    31.262s     30.283s  -3.13%
>   homebrew-core    57.842s     55.663s  -3.76%
>   jdk              12.557s     12.251s  -2.43%
>   linux            94.335s     94.760s   0.45%
>   llvm-project     34.432s     33.988s  -1.28%
>   rails             6.481s      6.454s  -0.41%
>   rust             14.772s     14.601s  -1.15%
>   tensorflow       11.759s     11.711s  -0.40%
>   webkit           33.917s     33.759s  -0.46%
>

Thank you for the detailed performance benchmarking.

> 
> > So, avoid the useless 'commit_graph_generation()' while writing by
> > instead accessing the slab directly. This returns the newly-computed
> > generation numbers, and allows us to avoid the heuristic by directly
> > comparing generation numbers.
> > 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 06f8dc1d896..caf823295f4 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >  
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > -- 
> > gitgitgadget
> > 

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 07/11] commit-graph: implement corrected commit date
  2020-12-30  1:53           ` Derrick Stolee
@ 2021-01-10 12:21             ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 12:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On Tue, Dec 29, 2020 at 08:53:11PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > With most of preparations done, let's implement corrected commit date.
> > 
> > The corrected commit date for a commit is defined as:
> > 
> > * A commit with no parents (a root commit) has corrected commit date
> >   equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> >   the maximum of its commit date and one more than the largest corrected
> >   commit date among its parents.
> > 
> > As a special case, a root commit with timestamp of zero (01.01.1970
> > 00:00:00Z) has corrected commit date of one, to be able to distinguish
> > from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> > date).
> > 
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> > 
> > Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> > in most cases is 64 bits (uintmax_t). However, corrected commit date
> > offsets can be safely stored using only 32-bits. This halves the size
> > of GDAT chunk, which is a reduction of around 6% in the size of
> > commit-graph file.
> > 
> > However, using offsets be problematic if one of commits is malformed but
> 
> However, using 32-bit offsets is problematic if a commit is malformed...
> 
> > valid and has committerdate of 0 Unix time, as the offset would be the
> 
> s/committerdate/committer date/
> 
> > same as corrected commit date and thus require 64-bits to be stored
> > properly.
> > 
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> > 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1b2a015f92f..bfc3aae5f93 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> >  
> >  		display_progress(ctx->progress, i + 1);
> > -		if (level != GENERATION_NUMBER_ZERO)
> > +		if (level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO)
> >  			continue;
> >  
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> >  
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >  
> > -				if (level == GENERATION_NUMBER_ZERO) {
> > +				if (level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> 
> nit: the "break" in the first case makes it so this large else block
> is unnecessary. 

Thanks, removed.

> 
> -				if (level == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				
> +				if (level > max_level)
> +					max_level = level;
> +
> +				if (corrected_commit_date > max_corrected_commit_date)
> +					max_corrected_commit_date = corrected_commit_date;
> -				}
>  			}
> 
> Thanks,
> -Stolee
> 

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
  2020-12-30  3:23           ` Derrick Stolee
@ 2021-01-10 13:13             ` Abhishek Kumar
  2021-01-11 12:43               ` Derrick Stolee
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 13:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On Tue, Dec 29, 2020 at 10:23:54PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> ...
> 
> > +static void validate_mixed_generation_chain(struct commit_graph *g)
> > +{
> > +	int read_generation_data;
> > +
> > +	if (!g)
> > +		return;
> > +
> > +	read_generation_data = !!g->chunk_generation_data;
> > +
> > +	while (g) {
> > +		g->read_generation_data = read_generation_data;
> > +		g = g->base_graph;
> > +	}
> > +}
> > +
> 
> This method exists to say "use generation v2 if the top layer has it"
> and that helps with the future layer checks.
> 
> > @@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
> >  		struct commit_graph *g = ctx->r->objects->commit_graph;
> >  
> >  		while (g) {
> > +			g->read_generation_data = 1;
> >  			g->topo_levels = &topo_levels;
> >  			g = g->base_graph;
> >  		}
> 
> However, here you just turn them on automatically.
> 
> I think the diff you want is here:
> 
>  		struct commit_graph *g = ctx->r->objects->commit_graph;
>  
> + 		validate_mixed_generation_chain(g);
> + 
>  		while (g) {
>  			g->topo_levels = &topo_levels;
>  			g = g->base_graph;
>  		}
> 
> But maybe you have a good reason for what you already have.
> 

Thanks, that was an oversight.

My (incorrect) reasoning at the time was:

Since we are computing both topological levels and corrected commit
dates, we can read corrected commit dates from layers with a GDAT chunk
hidden below non-GDAT layer.

But we end up storing both corrected commit date offsets (for a layers with
GDAT chunk) and topological level (for layers without GDAT chunk) in the
same slab with no way to distinguish between the two!

> I paid attention to this because I hit a problem in my local testing.
> After trying to reproduce it, I think the root cause is that I had a
> commit-graph that was written by an older version of your series, so
> it caused an unexpected pairing of an "offset required" bit but no
> offset chunk.
> 
> Perhaps this diff is required in the proper place to avoid the
> segfault I hit, in the case of a malformed commit-graph file:
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index c8d7ed1330..d264c90868 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -822,6 +822,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>  
>  		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> +			if (!g->chunk_generation_data_overflow)
> +				die(_("commit-graph requires overflow generation data but has none"));
> +
>  			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
>  			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
>  		} else
> 
> Your tests in this patch seem very thorough, covering all the cases
> I could think to create this strange situation. I even tried creating
> cases where the overflow would be necessary. The following test actually
> fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
> process I was trying to trick into failure.
> 
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 8e90f3423b..cfef8e52b9 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
>         git -C dup commit-graph verify
>  '
>  
> +test_expect_success 'upgrade to generation data succeeds when there was none' '
> +	(
> +		cd dup &&
> +		rm -rf .git/objects/info/commit-graph* &&
> +		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
> +			write --reachable &&
> +		GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
> +		GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
> +		GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
> +		git commit-graph write --reachable &&
> +		graph_read_expect 6
> +	)
> +'

I am not sure what this test adds over the existing generation data
overflow related tests added in t5318-commit-graph.sh

> +
>  NUM_FIRST_LAYER_COMMITS=64
>  NUM_SECOND_LAYER_COMMITS=16
>  NUM_THIRD_LAYER_COMMITS=7
> 
> Thanks,
> -Stolee

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
  2020-12-30  4:35         ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-10 14:06           ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 14:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On Tue, Dec 29, 2020 at 11:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Abhishek,
> 
> Thank you for this version. I appreciate your hard work on this topic,
> especially after GSoC ended and you returned to being a full-time student.
> 
> My hope was that I could completely approve this series and only provide
> forward-fixes from here on out, as necessary. I think there are a few minor
> typos that you might want to address, but I was also able to understand your
> intention.
> 
> I did make a particular case about a SEGFAULT I hit that I have been unable
> to replicate. I saw it both in my copy of torvalds/linux and of
> chromium/chromium. I have the file for chromium/chromium that is in a bad
> state where a GDAT value includes the bit saying it should be in the long
> offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
> exist in a from-scratch write.

I hope validating mixed generation chain while writing as well was
enough to fix the SEGFAULT.

>
> I'm now taking backups of my existing commit-graph files before any later
> test, but it doesn't repro for my Git repository or any other repo I try on
> purpose.
> 
> However, I did some performance testing to double-check your numbers. I sent
> a patch [1] that helps with some of the hard numbers.
> 
> [1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/
> 
> The big question is whether the overhead from using a slab to store the
> generation values is worth it. I still think it is, for these reasons:
> 
> 1. Generation number v2 is measurably better than v1 in most user cases.
> 
> 2. Generation number v2 is slower than using committer date due to the
>    overhead, but _guarantees correctness_.
> 
> I like to use "git log --graph -<N>" to compare against topological levels
> (v1), for various levels of <N>. When <N> is small, we hope to minimize
> the amount we need to walk using the extra commit-date information as an
> assistance. Repos like git/git and torvalds/linux use the philosophy of
> "base your changes on oldest applicable commit" enough that v1 struggles
> sometimes.
> 
> git/git: N=1000
> 
> 	Benchmark #1: baseline
> 	Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
> 	Range (min … max):    94.5 ms … 105.1 ms    28 runs
> 	
> 	Benchmark #2: test
> 	Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
> 	Range (min … max):    29.8 ms …  40.6 ms    81 runs
> 	
> 	Summary
> 	'test' ran
> 	2.80 ± 0.27 times faster than 'baseline'
> 
> This is a dramatic improvement! Using my topo-walk stats commit, I see that
> v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
> 4,335 commits!
> 
> torvalds/linux: N=1000 (starting at v5.10)
> 
> 	Benchmark #1: baseline
> 	Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
> 	Range (min … max):    85.2 ms …  96.2 ms    31 runs
> 	
> 	Benchmark #2: test
> 	Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
> 	Range (min … max):    42.9 ms …  54.0 ms    61 runs
> 	
> 	Summary
> 	'test' ran
> 	1.85 ± 0.15 times faster than 'baseline'
> 
> Similarly, v1 walked 38,161 commits compared to 4,340 by v2.
> 
> If I increase N to something like 10,000, then usually these values get
> washed out due to the width of the parallel topics.

That's not too bad, as large N would be needed rather infrequently.

> 
> The place we were still using commit-date as a heuristic was paint_down_to_common
> which caused a regression the first time we used v1, at least for certain cases.
> 
> Specifically, computing the merge-base in torvalds/linux between v4.8 and v4.9
> hit a strangeness about a pair of recent commits both based on a very old commit,
> but the generation numbers forced walking farther than necessary. This doesn't
> happen with v2, but we see the overhead cost of the slabs:
> 
> 	Benchmark #1: baseline
> 	Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
> 	Range (min … max):   107.7 ms … 118.0 ms    26 runs
> 	
> 	Benchmark #2: test
> 	Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
> 	Range (min … max):   141.4 ms … 162.2 ms    18 runs
> 	
> 	Summary
> 	'baseline' ran
> 	1.30 ± 0.06 times faster than 'test'
> 
> The overhead still exists for a more recent pair of versions (v5.0 and v5.1):
> 
> 	Benchmark #1: baseline
> 	Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
> 	Range (min … max):    19.0 ms …  32.8 ms    99 runs
> 	
> 	Benchmark #2: test
> 	Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
> 	Range (min … max):    27.0 ms …  38.4 ms    105 runs
> 	
> 	Summary
> 	'baseline' ran
> 	1.33 ± 0.22 times faster than 'test'
> 
> I still think this overhead is worth it. In case not everyone agrees, it _might_
> be worth a command-line option to skip the GDAT chunk. That also prevents an
> ability to eventually wean entirely of generation number v1 and allow the commit
> date to take the full 64-bit column (instead of only 34 bits, saving 30 for
> topo-levels).

Thank you for the detailed benchmarking and discussion. 

I don't think there is any disagreement on utility of corrected commit
dates so far. 

We will run out of 34-bits for the commit date by the year 2514, so I
am not exactly worried about weaning of generation number v1 anytime
soon.

> 
> Again, such a modification should not be considered required for this series.
> 
> > ----------------------------------------------------------------------------
> > 
> > Improvements left for a future series:
> > 
> >  * Save commits with generation data overflow and extra edge commits instead
> >    of looping over all commits. cf. 858sbel67n.fsf@gmail.com
> >  * Verify both topological levels and corrected commit dates when present.
> >    cf. 85pn4tnk8u.fsf@gmail.com
> 
> These seem like reasonable things to delay for a later series
> or for #leftoverbits
> 
> Thanks,
> -Stolee
> 

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
  2021-01-10 13:13             ` Abhishek Kumar
@ 2021-01-11 12:43               ` Derrick Stolee
  0 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2021-01-11 12:43 UTC (permalink / raw)
  To: 2e89c6e1-e8e8-0d51-5670-038b4e296d93
  Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On 1/10/2021 8:13 AM, Abhishek Kumar wrote:
> On Tue, Dec 29, 2020 at 10:23:54PM -0500, Derrick Stolee wrote:
>> Your tests in this patch seem very thorough, covering all the cases
>> I could think to create this strange situation. I even tried creating
>> cases where the overflow would be necessary. The following test actually
>> fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
>> process I was trying to trick into failure.
>>
>> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
>> index 8e90f3423b..cfef8e52b9 100755
>> --- a/t/t5324-split-commit-graph.sh
>> +++ b/t/t5324-split-commit-graph.sh
>> @@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
>>         git -C dup commit-graph verify
>>  '
>>  
>> +test_expect_success 'upgrade to generation data succeeds when there was none' '
>> +	(
>> +		cd dup &&
>> +		rm -rf .git/objects/info/commit-graph* &&
>> +		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
>> +			write --reachable &&
>> +		GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
>> +		GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
>> +		GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
>> +		git commit-graph write --reachable &&
>> +		graph_read_expect 6
>> +	)
>> +'
> 
> I am not sure what this test adds over the existing generation data
> overflow related tests added in t5318-commit-graph.sh

Good point.

-Stolee

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
                           ` (11 preceding siblings ...)
  2020-12-30  4:35         ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-16 18:11         ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
                             ` (12 more replies)
  12 siblings, 13 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, using
topological levels can result in a worse performance than without them when
compared with committer date as a heuristics. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using topological
levels and walks 167,468 using committer date. Since 091f4cf3 (commit: don't
use generation numbers if not needed, 2018-08-30), 'git merge-base' uses
committer date heuristic unless there is a cutoff because of the performance
hit.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem [1], we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

[1] https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Various candidates were examined (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets
performed much worse than committer date (506,577 vs. 167,468 commits walked
for 'git merge-base v4.8 v4.9') and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as follows:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to be able to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).

We will introduce an additional commit-graph chunk, Generation DATa (GDAT)
chunk, and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores
the corrected commit date. For overflowing offsets, we set MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.

For mixed generation number environment (for example new Git on the command
line, old Git used by GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of split commit-graph files
have GDAT chunk and others do not). As backward compatibility is one of the
goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels cannot
be compared directly.

When adding new layer to the split commit-graph file, and when merging some
or all layers (replacing them in the latter case), the new layer will have
GDAT chunk if and only if in the final result there would be no layer
without GDAT chunk just below it.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.

I look forward to everyone's reviews!

Thanks

 * Abhishek

----------------------------------------------------------------------------

Improvements left for a future series:

 * Save commits with generation data overflow and extra edge commits instead
   of looping over all commits. cf. 858sbel67n.fsf@gmail.com
 * Verify both topological levels and corrected commit dates when present.
   cf. 85pn4tnk8u.fsf@gmail.com

Changes in version 6:

 * Fixed typos in commit message for "commit-graph: implement corrected
   commit date".
 * Removed an unnecessary else-block in "commit-graph: implement corrected
   commit date".
 * Validate mixed generation chain correctly while writing in "commit-graph:
   use generation v2 only if the entire chain does".
 * Die if the GDAT chunk indicates data has overflown but there are is no
   generation data overflow chunk.

Changes in version 5:

 * Explained a possible reason for no change in performance for
   "commit-graph: fix regression when computing bloom-filters"
 * Clarified about the addition of a new test for 11-digit octal
   implementations of ustar.
 * Fixed duplicate test names in "commit-graph: consolidate
   fill_commit_graph_info".
 * Swapped the order "commit-graph: return 64-bit generation number",
   "commit-graph: add a slab to store topological levels" to minimize lines
   changed.
 * Fixed the mismerge in "commit-graph: return 64-bit generation number"
 * Clarified the preparatory steps are for the larger goal of implementing
   generation number v2 in "commit-graph: return 64-bit generation number".
 * Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
   patch "t6600-test-reach: generalize *_three_modes".
 * Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
   never be true in "commit-graph: add a slab to store topological levels".
 * Fixed incorrect logic for verifying commit-graph in "commit-graph:
   implement corrected commit date".
 * Added minor improvements to commit message of "commit-graph: implement
   generation data chunk".
 * Added '--date ' option to test_commit() in 'test-lib-functions.sh' in
   "commit-graph: implement generation data chunk".
 * Improved coding style (also in tests) for "commit-graph: use generation
   v2 only if entire chain does".
 * Simplified test repository structure in "commit-graph: use generation v2
   only if entire chain does" as only the number of commits in a split
   commit-graph layer are relevant.
 * Added a new test in "commit-graph: use generation v2 only if entire chain
   does" to check if the layers are merged correctly.
 * Explicitly mentioned commit "091f4cf3" in the commit-message of
   "commit-graph: use corrected commit dates in paint_down_to_common()".
 * Minor corrections to documentation in "doc: add corrected commit date
   info".
 * Minor corrections to coding style.

Changes in version 4:

 * Added GDOV to handle overflows in generation data.
 * Added a test for writing tip graph for a generation number v2 graph chain
   in t5324-split-commit-graph.sh
 * Added a section on how mixed generation number chains are handled in
   Documentation/technical/commit-graph-format.txt
 * Reverted unimportant whitespace, style changes in commit-graph.c
 * Added header comments about the order of comparision for
   compare_commits_by_gen_then_commit_date in commit.h,
   compare_commits_by_gen in commit-graph.h
 * Elaborated on why t6404 fails with corrected commit date and must be run
   with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected
   commit dates in paint_down_to_common()"
 * Elaborated on write behavior for mixed generation number chains in the
   commit "commit-graph: use generation v2 only if entire chain does"
 * Added notes about adding the topo_level slab to struct
   write_commit_graph_context as well as struct commit_graph.
 * Clarified commit message for "commit-graph: consolidate
   fill_commit_graph_info"
 * Removed the claim "GDAT can store future generation numbers" because it
   hasn't been tested yet.

Changes in version 3:

 * Reordered patches as discussed in 2
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/].
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed generation commit chains.
 * Improve commit messages for "commit-graph: fix regression when computing
   bloom filter", "commit-graph: consolidate fill_commit_graph_info",
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing Bloom filters
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  t6600-test-reach: generalize *_three_modes
  commit-graph: add a slab to store topological levels
  commit-graph: return 64-bit generation number
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  28 +-
 Documentation/technical/commit-graph.txt      |  77 +++++-
 commit-graph.c                                | 251 ++++++++++++++----
 commit-graph.h                                |  15 +-
 commit-reach.c                                |  38 +--
 commit-reach.h                                |   2 +-
 commit.c                                      |   4 +-
 commit.h                                      |   5 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |  24 +-
 t/t5318-commit-graph.sh                       |  79 +++++-
 t/t5324-split-commit-graph.sh                 | 193 +++++++++++++-
 t/t6404-recursive-merge.sh                    |   5 +-
 t/t6600-test-reach.sh                         |  68 ++---
 t/test-lib-functions.sh                       |   6 +
 upload-pack.c                                 |   2 +-
 19 files changed, 667 insertions(+), 154 deletions(-)

base-commit: 4151fdb1c76c1a190ac9241b67223efd19f3e478
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v6
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v6
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v5:

  1:  c4e817abf7d !  1:  4d8eb415578 commit-graph: fix regression when computing Bloom filters
     @@ Metadata
       ## Commit message ##
          commit-graph: fix regression when computing Bloom filters

     -    Before computing Bloom fitlers, the commit-graph machinery uses
     +    Before computing Bloom filters, the commit-graph machinery uses
          commit_gen_cmp to sort commits by generation order for improved diff
          performance. 3d11275505 (commit-graph: examine commits by generation
          number, 2020-03-30) claims that this sort can reduce the time spent to
     @@ Commit message
          'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
          while writing.

     -    Not all hope is lost, though: 'commit_graph_generation()' falls back to
     +    Not all hope is lost, though: 'commit_gen_cmp()' falls back to
          comparing commits by their date when they have equal generation number,
     -    and so since c49c82aa4c is purely a date comparision function. This
     +    and so since c49c82aa4c is purely a date comparison function. This
          heuristic is good enough that we don't seem to loose appreciable
     -    performance while computing Bloom filters. Applying this patch (compared
     -    with v2.29.1) speeds up computing Bloom filters by around ~4
     -    seconds.
     +    performance while computing Bloom filters.
     +
     +    Applying this patch (compared with v2.30.0) speeds up computing Bloom
     +    filters by factors ranging from 0.40% to 5.19% on various repositories [1].

          So, avoid the useless 'commit_graph_generation()' while writing by
          instead accessing the slab directly. This returns the newly-computed
          generation numbers, and allows us to avoid the heuristic by directly
          comparing generation numbers.

     +    [1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
     +
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

       ## commit-graph.c ##
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     +@@ commit-graph.c: static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
     + 	return data;
     + }
     + 
     ++/* 
     ++ * Should be used only while writing commit-graph as it compares
     ++ * generation value of commits by directly accessing commit-slab.
     ++ */
     + static int commit_gen_cmp(const void *va, const void *vb)
     + {
       	const struct commit *a = *(const struct commit **)va;
       	const struct commit *b = *(const struct commit **)vb;

  2:  7645e0bcef0 =  2:  05dcb862818 revision: parse parent in indegree_walk_step()
  3:  ca646912b2b =  3:  dcb9891d819 commit-graph: consolidate fill_commit_graph_info
  4:  591935075f1 =  4:  4fbdee7ac90 t6600-test-reach: generalize *_three_modes
  5:  baae7006764 =  5:  fbd8feb5d8c commit-graph: add a slab to store topological levels
  6:  26bd6f49100 =  6:  855ff662a44 commit-graph: return 64-bit generation number
  7:  859c39eff52 !  7:  8fbe7486405 commit-graph: implement corrected commit date
     @@ Commit message
          of GDAT chunk, which is a reduction of around 6% in the size of
          commit-graph file.

     -    However, using offsets be problematic if one of commits is malformed but
     -    valid and has committerdate of 0 Unix time, as the offset would be the
     -    same as corrected commit date and thus require 64-bits to be stored
     -    properly.
     +    However, using offsets be problematic if a commit is malformed but valid
     +    and has committer date of 0 Unix time, as the offset would be the same
     +    as corrected commit date and thus require 64-bits to be stored properly.

          While Git does not write out offsets at this stage, Git stores the
          corrected commit dates in member generation of struct commit_graph_data.
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       					break;
      -				} else if (level > max_level) {
      -					max_level = level;
     -+				} else {
     -+					if (level > max_level)
     -+						max_level = level;
     -+
     -+					if (corrected_commit_date > max_corrected_commit_date)
     -+						max_corrected_commit_date = corrected_commit_date;
       				}
     ++
     ++				if (level > max_level)
     ++					max_level = level;
     ++
     ++				if (corrected_commit_date > max_corrected_commit_date)
     ++					max_corrected_commit_date = corrected_commit_date;
       			}

     + 			if (all_parents_computed) {
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
       					max_level = GENERATION_NUMBER_V1_MAX - 1;
  8:  8403c4d0257 !  8:  6d0696ae216 commit-graph: implement generation data chunk
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
      +		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
      +
      +		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
     ++			if (!g->chunk_generation_data_overflow)
     ++				die(_("commit-graph requires overflow generation data but has none"));
     ++
      +			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
      +			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
      +		} else
  9:  a3a70a1edd0 !  9:  fba0d7f3dfe commit-graph: use generation v2 only if entire chain does
     @@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph
       		g = g->base_graph;
       	}
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 		struct commit_graph *g = ctx->r->objects->commit_graph;
     + 	} else
     + 		ctx->num_commit_graphs_after = 1;

     - 		while (g) {
     -+			g->read_generation_data = 1;
     - 			g->topo_levels = &topo_levels;
     - 			g = g->base_graph;
     - 		}
     ++	validate_mixed_generation_chain(ctx->r->objects->commit_graph);
     ++
     + 	compute_generation_numbers(ctx);
     + 
     + 	if (ctx->changed_paths)
      @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
       		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
       		 * in the following condition.
 10:  093101f908b = 10:  ba1f2c5555f commit-reach: use corrected commit dates in paint_down_to_common()
 11:  20299e57457 = 11:  e571f03d8bd doc: add corrected commit date info

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                             ` (11 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.

But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.

Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters.

Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].

So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.

[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e9124d4a412..0267886e76c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -139,13 +139,17 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }

+/* 
+ * Should be used only while writing commit-graph as it compares
+ * generation value of commits by directly accessing commit-slab.
+ */
 static int commit_gen_cmp(const void *va, const void *vb)
 {
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;

-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 02/11] revision: parse parent in indegree_walk_step()
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                             ` (10 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 1bb590ece78..be2d828a4cc 100644
--- a/revision.c
+++ b/revision.c
@@ -3397,6 +3397,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
                             ` (9 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 27 ++++++++++-----------------
 t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0267886e76c..3d59b8b905d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -753,15 +753,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -776,38 +785,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
 	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
 	test_cmp expect actual
 '
 
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+	rm -f .git/index &&
+	echo foo >file &&
+	git add file &&
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
+		git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+	git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+	echo 2514 >expect &&
+	tar_info future.tar | cut -d" " -f2 >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
 		git commit -m "tempori parendum"
 '
 
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
 	git archive HEAD >future.tar
 '
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (2 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                             ` (8 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without Generation Data
chunk. These files represent commit-graph files written by Old Git and
are neccesary for backward compatability.

We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
 	X:commit-5-7
 	EOF
 	echo "in_merge_bases_many(A,X):1" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
 	X:commit-8-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
 	X:commit-6-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 05/11] commit-graph: add a slab to store topological levels
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (3 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                             ` (7 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3d59b8b905d..3b69c3cc329 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -772,6 +774,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -960,6 +965,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1106,7 +1112,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1336,11 +1342,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1348,29 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2106,6 +2108,7 @@ int write_commit_graph(struct object_directory *odb,
 	int res = 0;
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct topo_level_slab topo_levels;
 
 	prepare_repo_settings(the_repository);
 	if (!the_repository->settings.core_commit_graph) {
@@ -2132,6 +2135,18 @@ int write_commit_graph(struct object_directory *odb,
 							 bloom_settings.max_changed_paths);
 	ctx->bloom_settings = &bloom_settings;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 06/11] commit-graph: return 64-bit generation number
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (4 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                             ` (6 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 22 +++++++++++-----------
 commit-graph.h |  4 ++--
 commit-reach.c | 36 ++++++++++++++++++------------------
 commit-reach.h |  2 +-
 commit.c       |  4 ++--
 commit.h       |  4 ++--
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 8 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3b69c3cc329..6d42e30cd9a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -150,8 +150,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1370,8 +1370,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			if (all_parents_computed) {
 				pop_commit(&list);
 
-				if (max_level > GENERATION_NUMBER_MAX - 1)
-					max_level = GENERATION_NUMBER_MAX - 1;
+				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
@@ -2367,8 +2367,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2432,16 +2432,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
 		 * extra logic in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
+		if (max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
 		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 	const struct commit *a = *(const struct commit * const *)_a;
 	const struct commit *b = *(const struct commit * const *)_b;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
 
 	if (generation_a < generation_b)
 		return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.c b/commit.c
index bab8d5ab07c..4c717329ee0 100644
--- a/commit.c
+++ b/commit.c
@@ -753,8 +753,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a),
+			  generation_b = commit_graph_generation(b);
 
 	/* newer commits first */
 	if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index f4e7b0158e2..742d96c41e8 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
diff --git a/revision.c b/revision.c
index be2d828a4cc..31fd3219e65 100644
--- a/revision.c
+++ b/revision.c
@@ -3300,7 +3300,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3368,7 +3368,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3413,7 +3413,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3471,7 +3471,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (repo_parse_commit_gently(revs->repo, c, 1))
 			continue;
@@ -3539,7 +3539,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 07/11] commit-graph: implement corrected commit date
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (5 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                             ` (5 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6d42e30cd9a..a899f429093 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1343,9 +1343,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;

 		display_progress(ctx->progress, i + 1);
-		if (level != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO)
 			continue;

 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1356,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;

 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;

-				if (level == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
 				}
+
+				if (level > max_level)
+					max_level = level;
+
+				if (corrected_commit_date > max_corrected_commit_date)
+					max_corrected_commit_date = corrected_commit_date;
 			}

 			if (all_parents_computed) {
@@ -1373,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date && current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 08/11] commit-graph: implement generation data chunk
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (6 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                             ` (4 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 114 ++++++++++++++++++++++++++++++----
 commit-graph.h                |   3 +
 commit.h                      |   1 +
 t/README                      |   3 +
 t/helper/test-read-graph.c    |   4 ++
 t/t4216-log-bloom.sh          |   4 +-
 t/t5318-commit-graph.sh       |  79 +++++++++++++++++++----
 t/t5324-split-commit-graph.sh |  12 ++--
 t/t6600-test-reach.sh         |   6 ++
 t/test-lib-functions.sh       |   6 ++
 10 files changed, 200 insertions(+), 32 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a899f429093..7365958d9d3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
@@ -394,6 +398,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+			if (graph->chunk_generation_data_overflow)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data_overflow = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -754,8 +772,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
-	uint32_t lex_index;
-	uint64_t date_high, date_low;
+	uint32_t lex_index, offset_pos;
+	uint64_t date_high, date_low, offset;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -773,7 +791,19 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data) {
+		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+			if (!g->chunk_generation_data_overflow)
+				die(_("commit-graph requires overflow generation data but has none"));
+
+			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+		} else
+			graph_data->generation = item->date + offset;
+	} else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -945,6 +975,7 @@ struct write_commit_graph_context {
 	struct oid_array oids;
 	struct packed_commit_list commits;
 	int num_extra_edges;
+	int num_generation_data_overflows;
 	unsigned long approx_nr_objects;
 	struct progress *progress;
 	int progress_done;
@@ -963,7 +994,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
@@ -1123,6 +1155,45 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i, num_generation_data_overflows = 0;
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+			num_generation_data_overflows++;
+		}
+
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+						       struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			hashwrite_be32(f, offset >> 32);
+			hashwrite_be32(f, (uint32_t) offset);
+		}
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1386,6 +1457,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
@@ -1719,6 +1793,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
+	if (ctx->num_generation_data_overflows) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2139,6 +2228,8 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
+	ctx->num_generation_data_overflows = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
@@ -2445,16 +2536,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-		 * extra logic in the following condition.
+		 * If we are using topological level and one of our parents has
+		 * generation GENERATION_NUMBER_V1_MAX, then our generation is
+		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+		 * in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+		if (generation < max_generation + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -68,6 +69,8 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
+	const unsigned char *chunk_generation_data_overflow;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 742d96c41e8..eff94f3f7c2 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
 #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
 #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
 
 struct commit_list {
 	struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
+	if (graph->chunk_generation_data_overflow)
+		printf(" generation_data_overflow");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41c..dbde0161882 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 '
 
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+# We test the overflow-related code with the following repo history:
+#
+#               4:F - 5:N - 6:U
+#              /                \
+# 1:U - 2:N - 3:U                M:N
+#              \                /
+#               7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+	objdir=".git/objects" &&
+	UNIX_EPOCH_ZERO="@0 +0000" &&
+	FUTURE_DATE="@2147483646 +0000" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	cd "$TRASH_DIRECTORY" &&
+	mkdir repo &&
+	cd repo &&
+	git init &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+	test_commit 2 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 3 generation_data &&
+	test_commit --date "$FUTURE_DATE" 4 &&
+	test_commit 5 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+	git branch left &&
+	git reset --hard 3 &&
+	test_commit 7 &&
+	test_commit --date "$FUTURE_DATE" 8 &&
+	test_commit 9 &&
+	git branch right &&
+	git reset --hard 3 &&
+	test_merge M left right &&
+	git commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data generation_data_overflow" &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
 test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 
 	oid_version sha1:1
 	oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
@@ -67,6 +70,9 @@ run_all_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 999982fe4a9..3ad712c3acc 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
 		--signoff)
 			signoff="$1"
 			;;
+		--date)
+			notick=yes
+			GIT_COMMITTER_DATE="$2"
+			GIT_AUTHOR_DATE="$2"
+			shift
+			;;
 		-C)
 			indir="$2"
 			shift
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (7 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
                             ` (3 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                |  30 +++++-
 commit-graph.h                |   1 +
 t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
 3 files changed, 210 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7365958d9d3..d32492f3724 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -614,6 +614,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+	int read_generation_data;
+
+	if (!g)
+		return;
+
+	read_generation_data = !!g->chunk_generation_data;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -622,6 +637,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(g);
+
 	return g;
 }
 
@@ -791,7 +808,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data) {
+	if (g->read_generation_data) {
 		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2019,6 +2036,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		if (i < ctx->num_commit_graphs_after)
 			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
 
+		/*
+		 * If the topmost remaining layer has generation data chunk, the
+		 * resultant layer also has generation data chunk.
+		 */
+		if (i == ctx->num_commit_graphs_after - 2)
+			ctx->write_generation_data = !!g->chunk_generation_data;
+
 		i--;
 		g = g->base_graph;
 	}
@@ -2343,6 +2367,8 @@ int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
+	validate_mixed_generation_chain(ctx->r->objects->commit_graph);
+
 	compute_generation_numbers(ctx);
 
 	if (ctx->changed_paths)
@@ -2541,7 +2567,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
 		 * in the following condition.
 		 */
-		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	unsigned int read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
 	git -C dup commit-graph verify
 '
 
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	graphdir=".git/objects/info/commit-graphs" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	git init mixed &&
+	(
+		cd mixed &&
+		git config core.commitGraph true &&
+		git config gc.writeCommitGraph false &&
+		for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split &&
+		graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 1
+		num_commits: $NUM_SECOND_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify &&
+		cat $graphdir/commit-graph-chain
+	)
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+	git clone mixed mixed-no-gdat &&
+	(
+		cd mixed-no-gdat &&
+		for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $NUM_THIRD_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#      8 commits (No GDAT)
+# ------------------------
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+	git clone mixed-no-gdat mixed-merge-no-gdat &&
+	(
+		cd mixed-merge-no-gdat &&
+		for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top three layers are merged and has generation data chunk as the topmost remaining layer
+# has generation data chunk.
+#
+#     47 commits (GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+	git clone mixed-merge-no-gdat mixed-merge-gdat &&
+	(
+		cd mixed-merge-gdat &&
+		for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 5 1
+		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		EOF
+		test_cmp expect output
+	)
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+	git clone mixed mixed-replace &&
+	(
+		cd mixed-replace &&
+		git commit-graph write --reachable --split=replace &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir &&
+		graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+		git commit-graph verify
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (8 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-16 18:11           ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
                             ` (2 subsequent siblings)
  12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6404-recursive-merge.sh |  5 ++++-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d32492f3724..d3d14601d4d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -714,6 +714,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return g->read_generation_data;
+}
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
 	struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
 
 enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index b1c3d4dda49..86f74ae5847 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
 '
 
 test_expect_success 'virtual trees were processed' '
+	# TODO: fragile test, relies on ambigious merge-base resolution
 	git ls-files --stage >out &&
 
 	cat >expect <<-EOF &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v6 11/11] doc: add corrected commit date info
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (9 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11           ` Abhishek Kumar via GitGitGadget
  2021-01-27  0:04             ` SZEDER Gábor
  2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
  12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 28 +++++--
 Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
 2 files changed, 86 insertions(+), 19 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,33 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+    * Generation Data chunk is present only when commit-graph file is written
+      by compatible versions of Git and in case of split commit-graph chains,
+      the topmost layer also has Generation Data chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit date offsets
+      for commits with corrected commit date offsets that cannot be
+      stored within 31 bits.
+    * Generation Data Overflow chunk is present only when Generation Data
+      chunk is present and atleast one corrected commit date offset cannot
+      be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation nummber v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+    (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
     generation numbers, then we always expand the boundary commit with highest
     generation number and can easily detect the stopping condition.
 
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (10 preceding siblings ...)
  2021-01-16 18:11           ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2021-01-18 21:04           ` Derrick Stolee
  2021-01-18 22:00             ` Taylor Blau
                               ` (2 more replies)
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
  12 siblings, 3 replies; 211+ messages in thread
From: Derrick Stolee @ 2021-01-18 21:04 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar,
	SZEDER Gábor

On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
...
> Changes in version 6:
> 
>  * Fixed typos in commit message for "commit-graph: implement corrected
>    commit date".
>  * Removed an unnecessary else-block in "commit-graph: implement corrected
>    commit date".
>  * Validate mixed generation chain correctly while writing in "commit-graph:
>    use generation v2 only if the entire chain does".
>  * Die if the GDAT chunk indicates data has overflown but there are is no
>    generation data overflow chunk.

I checked the range-diff and looked once more through the patch
series. This version is good to go by my standards.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks, Abhishek!


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-18 22:00             ` Taylor Blau
  2021-01-23 12:11               ` Abhishek Kumar
  2021-01-19  0:02             ` Junio C Hamano
  2021-01-23 12:07             ` Abhishek Kumar
  2 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2021-01-18 22:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Taylor Blau, Abhishek Kumar, SZEDER Gábor

On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

I re-read this series now that it seems to have stabilized, and I agree
with Stolee that it LGTM.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

> Thanks, Abhishek!

Incredible work!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2021-01-18 22:00             ` Taylor Blau
@ 2021-01-19  0:02             ` Junio C Hamano
  2021-01-23 12:07             ` Abhishek Kumar
  2 siblings, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2021-01-19  0:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Taylor Blau, Abhishek Kumar, SZEDER Gábor

Derrick Stolee <stolee@gmail.com> writes:

> On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
>> This patch series implements the corrected commit date offsets as generation
>> number v2, along with other pre-requisites.
> ...
>> Changes in version 6:
>> 
>>  * Fixed typos in commit message for "commit-graph: implement corrected
>>    commit date".
>>  * Removed an unnecessary else-block in "commit-graph: implement corrected
>>    commit date".
>>  * Validate mixed generation chain correctly while writing in "commit-graph:
>>    use generation v2 only if the entire chain does".
>>  * Die if the GDAT chunk indicates data has overflown but there are is no
>>    generation data overflow chunk.
>
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks, both.  I'll give it a (hopefully) final read-over after
replacing what we have kept in 'seen'.

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  2021-01-18 22:00             ` Taylor Blau
  2021-01-19  0:02             ` Junio C Hamano
@ 2021-01-23 12:07             ` Abhishek Kumar
  2 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-23 12:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me

On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> ...
> > Changes in version 6:
> > 
> >  * Fixed typos in commit message for "commit-graph: implement corrected
> >    commit date".
> >  * Removed an unnecessary else-block in "commit-graph: implement corrected
> >    commit date".
> >  * Validate mixed generation chain correctly while writing in "commit-graph:
> >    use generation v2 only if the entire chain does".
> >  * Die if the GDAT chunk indicates data has overflown but there are is no
> >    generation data overflow chunk.
> 
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
> 
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
> 
> Thanks, Abhishek!
> 

Thanks a lot for the review and continued guidance through out the
patch series!

- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-18 22:00             ` Taylor Blau
@ 2021-01-23 12:11               ` Abhishek Kumar
  0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-23 12:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, stolee

On Mon, Jan 18, 2021 at 05:00:41PM -0500, Taylor Blau wrote:
> On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> > I checked the range-diff and looked once more through the patch
> > series. This version is good to go by my standards.
> >
> > Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
> 
> I re-read this series now that it seems to have stabilized, and I agree
> with Stolee that it LGTM.
> 
>   Reviewed-by: Taylor Blau <me@ttaylorr.com>
> 
> > Thanks, Abhishek!
> 
> Incredible work!

Thanks a lot for the reviews and help in identifying the reason behind
(relatively) minor performance increase when we switched from useless
'commit_graph_generation()' calls to direct slab calls.

> 
> Thanks,
> Taylor

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 11/11] doc: add corrected commit date info
  2021-01-16 18:11           ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2021-01-27  0:04             ` SZEDER Gábor
  2021-01-30  5:29               ` Abhishek Kumar
  0 siblings, 1 reply; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-27  0:04 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau,
	Abhishek Kumar

On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.

This patch should come much earlier in this series, before patch 07/11
(commit-graph: implement corrected commit date), or perhaps even
earlier.  That way if someone were to investigate an issue in this
series and checks out one of its commits, then the specification and
the will be right there under 'Documentation/technical/'.

Furthermore, a patch introducing a new chunk format is the right place
to justify the introduction of said new chunk.  What problems does a
chunk of corrected commit dates solve?  Why does it solve them?  Why
do we need corrected commit dates instead of simple commit dates?
What alternatives were considered [1]?  Any other design considerations
worth mentioning for the benefit of future readers?

None of the patches' log messages properly explain these, and while
much of these is indeed explained in the cover letter, the cover
letter will not be part of the history.  Requiring to look up mailing
list archives for the justification puts unnecessary burden on other
developers who might get interested in this feature in the future.

You might want to take
https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
as an inspiration.


[1] Please remember the following snippet from SubmittingPatches:
    "Try to make sure your explanation can be understood without
    external resources. Instead of giving a URL to a mailing list
    archive, summarize the relevant points of the discussion."

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  .../technical/commit-graph-format.txt         | 28 +++++--
>  Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
>  2 files changed, 86 insertions(+), 19 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index b3b58880b92..b6658eff188 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>  
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.
>  
>  - The root tree OID.
>  
> @@ -86,13 +82,33 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the generation number of the commit and
> +    * The next 8 bytes store the topological level (generation number v1)
> +      of the commit and
>        the commit time in seconds since EPOCH. The generation number
>        uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>  
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.
> +    * If the corrected commit date offset cannot be stored within 31 bits,
> +      the value has its most-significant bit on and the other bits store
> +      the position of corrected commit date into the Generation Data Overflow
> +      chunk.
> +    * Generation Data chunk is present only when commit-graph file is written
> +      by compatible versions of Git and in case of split commit-graph chains,
> +      the topmost layer also has Generation Data chunk.
> +
> +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> +    * This list of 8-byte values stores the corrected commit date offsets
> +      for commits with corrected commit date offsets that cannot be
> +      stored within 31 bits.
> +    * Generation Data Overflow chunk is present only when Generation Data
> +      chunk is present and atleast one corrected commit date offset cannot
> +      be stored within 31 bits.
> +
>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index f14a7659aa8..f05e7bda1a9 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
>  
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>  
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates (generation number v2)
> +2. Topological levels (generation nummber v1)
>  
> - * A commit with no parents (a root commit) has generation number one.
> +Define "corrected committer date" of a commit recursively as follows:
>  
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> + * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.
>  
> -Equivalently, the generation number of a commit A is one more than the
> + * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> + * As a special case, a root commit with timestamp zero has corrected commit
> +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> +    (that is, an uncomputed corrected commit date).
> +
> +Define the "topological level" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has topological level of one.
> +
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
> +
> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:
>  
> @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
>      generation numbers, then we always expand the boundary commit with highest
>      generation number and can easily detect the stopping condition.
>  
> +The property applies to both versions of generation number, that is both
> +corrected committer dates and topological levels.
> +
>  This property can be used to significantly reduce the time it takes to
>  walk commits and determine topological relationships. Without generation
>  numbers, the general heuristic is the following:
> @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>  
> -This heuristic is currently used whenever the computation is allowed to
> +In absence of corrected commit dates (for example, old versions of Git or
> +mixed generation graph chains),
> +this heuristic is currently used whenever the computation is allowed to
>  violate topological relationships due to clock skew (such as "git log"
>  with default order), but is not used when the topological order is
>  required (such as merge base calculations, "git log --graph").
> @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>  
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>  
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> -generation numbers are computed to be at least this value. We limit at
> -this value since it is the largest value that can be stored in the
> -commit-graph file using the 30 bits available to generation numbers. This
> -presents another case where a commit can have generation number equal to
> -that of a parent.
> +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
> +topological levels (generation number v1) are computed to be at least
> +this value. We limit at this value since it is the largest value that
> +can be stored in the commit-graph file using the 30 bits available
> +to topological levels. This presents another case where a commit can
> +have generation number equal to that of a parent.
>  
>  Design Details
>  --------------
> @@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>  
> +## Handling Mixed Generation Number Chains
> +
> +With the introduction of generation number v2 and generation data chunk, the
> +following scenario is possible:
> +
> +1. "New" Git writes a commit-graph with the corrected commit dates.
> +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> +
> +A naive approach of using the newest available generation number from
> +each layer would lead to violated expectations: the lower layer would
> +use corrected commit dates which are much larger than the topological
> +levels of the higher layer. For this reason, Git inspects the topmost
> +layer to see if the layer is missing corrected commit dates. In such a case
> +Git only uses topological level for generation numbers.
> +
> +When writing a new layer in split commit-graph, we write corrected commit
> +dates if the topmost layer has corrected commit dates written. This
> +guarantees that if a layer has corrected commit dates, all lower layers
> +must have corrected commit dates as well.
> +
> +When merging layers, we do not consider whether the merged layers had corrected
> +commit dates. Instead, the new layer will have corrected commit dates if the
> +layer below the new layer has corrected commit dates.
> +
> +While writing or merging layers, if the new layer is the only layer, it will
> +have corrected commit dates when written by compatible versions of Git. Thus,
> +rewriting split commit-graph as a single file (`--split=replace`) creates a
> +single layer with corrected commit dates.
> +
>  ## Deleting graph-{hash} files
>  
>  After a new tip file is written, some `graph-{hash}` files may no longer
> -- 
> gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 11/11] doc: add corrected commit date info
  2021-01-27  0:04             ` SZEDER Gábor
@ 2021-01-30  5:29               ` Abhishek Kumar
  2021-01-31  1:45                 ` Taylor Blau
  0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-30  5:29 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me, stolee

On Wed, Jan 27, 2021 at 01:04:54AM +0100, SZEDER Gábor wrote:
> On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> 
> This patch should come much earlier in this series, before patch 07/11
> (commit-graph: implement corrected commit date), or perhaps even
> earlier.  That way if someone were to investigate an issue in this
> series and checks out one of its commits, then the specification and
> the will be right there under 'Documentation/technical/'.
> 
> Furthermore, a patch introducing a new chunk format is the right place
> to justify the introduction of said new chunk.  What problems does a
> chunk of corrected commit dates solve?  Why does it solve them?  Why
> do we need corrected commit dates instead of simple commit dates?
> What alternatives were considered [1]?  Any other design considerations
> worth mentioning for the benefit of future readers?
> 
> None of the patches' log messages properly explain these, and while
> much of these is indeed explained in the cover letter, the cover
> letter will not be part of the history.  Requiring to look up mailing
> list archives for the justification puts unnecessary burden on other
> developers who might get interested in this feature in the future.
> 
> You might want to take
> https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
> as an inspiration.
> 

Alright, the suggestion makes a lot of sense and the patch introducing
documentation is the perfect place to justify the introduction of new
chunk format.

> 
> [1] Please remember the following snippet from SubmittingPatches:
>     "Try to make sure your explanation can be understood without
>     external resources. Instead of giving a URL to a mailing list
>     archive, summarize the relevant points of the discussion."
> 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  .../technical/commit-graph-format.txt         | 28 +++++--
> >  Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
> >  2 files changed, 86 insertions(+), 19 deletions(-)
> > 
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index b3b58880b92..b6658eff188 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >  
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> >  
> >  - The root tree OID.
> >  
> > @@ -86,13 +82,33 @@ CHUNK DATA:
> >        position. If there are more than two parents, the second value
> >        has its most-significant bit on and the other bits store an array
> >        position into the Extra Edge List chunk.
> > -    * The next 8 bytes store the generation number of the commit and
> > +    * The next 8 bytes store the topological level (generation number v1)
> > +      of the commit and
> >        the commit time in seconds since EPOCH. The generation number
> >        uses the higher 30 bits of the first 4 bytes, while the commit
> >        time uses the 32 bits of the second 4 bytes, along with the lowest
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >  
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> > +    * If the corrected commit date offset cannot be stored within 31 bits,
> > +      the value has its most-significant bit on and the other bits store
> > +      the position of corrected commit date into the Generation Data Overflow
> > +      chunk.
> > +    * Generation Data chunk is present only when commit-graph file is written
> > +      by compatible versions of Git and in case of split commit-graph chains,
> > +      the topmost layer also has Generation Data chunk.
> > +
> > +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> > +    * This list of 8-byte values stores the corrected commit date offsets
> > +      for commits with corrected commit date offsets that cannot be
> > +      stored within 31 bits.
> > +    * Generation Data Overflow chunk is present only when Generation Data
> > +      chunk is present and atleast one corrected commit date offset cannot
> > +      be stored within 31 bits.
> > +
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index f14a7659aa8..f05e7bda1a9 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
> >  
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >  
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates (generation number v2)
> > +2. Topological levels (generation nummber v1)
> >  
> > - * A commit with no parents (a root commit) has generation number one.
> > +Define "corrected committer date" of a commit recursively as follows:
> >  
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > + * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> >  
> > -Equivalently, the generation number of a commit A is one more than the
> > + * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > + * As a special case, a root commit with timestamp zero has corrected commit
> > +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> > +    (that is, an uncomputed corrected commit date).
> > +
> > +Define the "topological level" of a commit recursively as follows:
> > +
> > + * A commit with no parents (a root commit) has topological level of one.
> > +
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> > +
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> >  
> > @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
> >      generation numbers, then we always expand the boundary commit with highest
> >      generation number and can easily detect the stopping condition.
> >  
> > +The property applies to both versions of generation number, that is both
> > +corrected committer dates and topological levels.
> > +
> >  This property can be used to significantly reduce the time it takes to
> >  walk commits and determine topological relationships. Without generation
> >  numbers, the general heuristic is the following:
> > @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >  
> > -This heuristic is currently used whenever the computation is allowed to
> > +In absence of corrected commit dates (for example, old versions of Git or
> > +mixed generation graph chains),
> > +this heuristic is currently used whenever the computation is allowed to
> >  violate topological relationships due to clock skew (such as "git log"
> >  with default order), but is not used when the topological order is
> >  required (such as merge base calculations, "git log --graph").
> > @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >  
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >  
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > -generation numbers are computed to be at least this value. We limit at
> > -this value since it is the largest value that can be stored in the
> > -commit-graph file using the 30 bits available to generation numbers. This
> > -presents another case where a commit can have generation number equal to
> > -that of a parent.
> > +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
> > +topological levels (generation number v1) are computed to be at least
> > +this value. We limit at this value since it is the largest value that
> > +can be stored in the commit-graph file using the 30 bits available
> > +to topological levels. This presents another case where a commit can
> > +have generation number equal to that of a parent.
> >  
> >  Design Details
> >  --------------
> > @@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >  
> > +## Handling Mixed Generation Number Chains
> > +
> > +With the introduction of generation number v2 and generation data chunk, the
> > +following scenario is possible:
> > +
> > +1. "New" Git writes a commit-graph with the corrected commit dates.
> > +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> > +
> > +A naive approach of using the newest available generation number from
> > +each layer would lead to violated expectations: the lower layer would
> > +use corrected commit dates which are much larger than the topological
> > +levels of the higher layer. For this reason, Git inspects the topmost
> > +layer to see if the layer is missing corrected commit dates. In such a case
> > +Git only uses topological level for generation numbers.
> > +
> > +When writing a new layer in split commit-graph, we write corrected commit
> > +dates if the topmost layer has corrected commit dates written. This
> > +guarantees that if a layer has corrected commit dates, all lower layers
> > +must have corrected commit dates as well.
> > +
> > +When merging layers, we do not consider whether the merged layers had corrected
> > +commit dates. Instead, the new layer will have corrected commit dates if the
> > +layer below the new layer has corrected commit dates.
> > +
> > +While writing or merging layers, if the new layer is the only layer, it will
> > +have corrected commit dates when written by compatible versions of Git. Thus,
> > +rewriting split commit-graph as a single file (`--split=replace`) creates a
> > +single layer with corrected commit dates.
> > +
> >  ## Deleting graph-{hash} files
> >  
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> > -- 
> > gitgitgadget

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v6 11/11] doc: add corrected commit date info
  2021-01-30  5:29               ` Abhishek Kumar
@ 2021-01-31  1:45                 ` Taylor Blau
  0 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2021-01-31  1:45 UTC (permalink / raw)
  To: 20210127000454.GA1440011
  Cc: SZEDER Gábor, abhishekkumar8222, git, gitgitgadget, jnareb,
	me, stolee

On Sat, Jan 30, 2021 at 10:59:05AM +0530, Abhishek Kumar wrote:
> > You might want to take
> > https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
> > as an inspiration.
> >
> Alright, the suggestion makes a lot of sense and the patch introducing
> documentation is the perfect place to justify the introduction of new
> chunk format.

I don't have any strong feelings about Gábor's suggestion itself, but
note that there isn't any work for you to do in this series, since the
patches are on track to be merged to master.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
  2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
                             ` (11 preceding siblings ...)
  2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-02-01  6:58           ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
                               ` (11 more replies)
  12 siblings, 12 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit differ
greatly in generation numbers [1]. For example, 'git merge-base v4.8 v4.9'
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date. Since 091f4cf3 (commit: don't use
generation numbers if not needed, 2018-08-30), 'git merge-base' uses
committer date heuristic unless there is a cutoff because of the performance
hit.

[1]
https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thus, the need for generation number v2 was born. As Git used to die when
graph version understood by it and in the commit-graph file are different
[2], we needed a way to distinguish between the old and new generation
number without incrementing the graph version.

[2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

The following candidates were proposed
(https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1):

 * (Epoch, Date) Pairs.
 * Maximum Generation Numbers.
 * Corrected Commit Date.
 * FELINE Index.
 * Corrected Commit Date with Monotonically Increasing Offsets.

Based on performance, local computability, and immutability (along with the
introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen as
generation number v2 and is defined as follows:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to distinguish it from
GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).

While it was proposed initially to store corrected commit date offsets
within Commit Data Chunk, storing the offsets in a new chunk did not affect
the performance measurably. The new chunk is "Generation DATa (GDAT) chunk"
and it stores corrected commit date offsets while CDAT chunk stores
topological level. The old versions of Git would ignore GDAT chunk, using
topological levels from CDAT chunk. In contrast, new versions of Git would
use corrected commit dates, falling back to topological level if the
generation data chunk is absent in the commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores
the corrected commit date. For overflowing offsets, we set MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.

For mixed generation number environment (for example new Git on the command
line, old Git used by GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of split commit-graph files
have GDAT chunk and others do not). As backward compatibility is one of the
goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels cannot
be compared directly.

When adding new layer to the split commit-graph file, and when merging some
or all layers (replacing them in the latter case), the new layer will have
GDAT chunk if and only if in the final result there would be no layer
without GDAT chunk just below it.

Thanks to Dr. Stolee, Dr. Narębski, Taylor Blau and SZEDER Gábor for their
reviews.

I look forward to everyone's reviews!

Thanks

 * Abhishek

----------------------------------------------------------------------------

Improvements left for a future series:

 * Save commits with generation data overflow and extra edge commits instead
   of looping over all commits. cf. 858sbel67n.fsf@gmail.com
 * Verify both topological levels and corrected commit dates when present.
   cf. 85pn4tnk8u.fsf@gmail.com

Changes in version 7:

 * Moved the documentation patch ahead of "commit-graph: implement corrected
   commit date" and elaborated on the introduction of generation number v2.

Changes in version 6:

 * Fixed typos in commit message for "commit-graph: implement corrected
   commit date".
 * Removed an unnecessary else-block in "commit-graph: implement corrected
   commit date".
 * Validate mixed generation chain correctly while writing in "commit-graph:
   use generation v2 only if the entire chain does".
 * Die if the GDAT chunk indicates data has overflown but there are is no
   generation data overflow chunk.

Changes in version 5:

 * Explained a possible reason for no change in performance for
   "commit-graph: fix regression when computing bloom-filters"
 * Clarified about the addition of a new test for 11-digit octal
   implementations of ustar.
 * Fixed duplicate test names in "commit-graph: consolidate
   fill_commit_graph_info".
 * Swapped the order "commit-graph: return 64-bit generation number",
   "commit-graph: add a slab to store topological levels" to minimize lines
   changed.
 * Fixed the mismerge in "commit-graph: return 64-bit generation number"
 * Clarified the preparatory steps are for the larger goal of implementing
   generation number v2 in "commit-graph: return 64-bit generation number".
 * Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
   patch "t6600-test-reach: generalize *_three_modes".
 * Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
   never be true in "commit-graph: add a slab to store topological levels".
 * Fixed incorrect logic for verifying commit-graph in "commit-graph:
   implement corrected commit date".
 * Added minor improvements to commit message of "commit-graph: implement
   generation data chunk".
 * Added '--date ' option to test_commit() in 'test-lib-functions.sh' in
   "commit-graph: implement generation data chunk".
 * Improved coding style (also in tests) for "commit-graph: use generation
   v2 only if entire chain does".
 * Simplified test repository structure in "commit-graph: use generation v2
   only if entire chain does" as only the number of commits in a split
   commit-graph layer are relevant.
 * Added a new test in "commit-graph: use generation v2 only if entire chain
   does" to check if the layers are merged correctly.
 * Explicitly mentioned commit "091f4cf3" in the commit-message of
   "commit-graph: use corrected commit dates in paint_down_to_common()".
 * Minor corrections to documentation in "doc: add corrected commit date
   info".
 * Minor corrections to coding style.

Changes in version 4:

 * Added GDOV to handle overflows in generation data.
 * Added a test for writing tip graph for a generation number v2 graph chain
   in t5324-split-commit-graph.sh
 * Added a section on how mixed generation number chains are handled in
   Documentation/technical/commit-graph-format.txt
 * Reverted unimportant whitespace, style changes in commit-graph.c
 * Added header comments about the order of comparision for
   compare_commits_by_gen_then_commit_date in commit.h,
   compare_commits_by_gen in commit-graph.h
 * Elaborated on why t6404 fails with corrected commit date and must be run
   with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected
   commit dates in paint_down_to_common()"
 * Elaborated on write behavior for mixed generation number chains in the
   commit "commit-graph: use generation v2 only if entire chain does"
 * Added notes about adding the topo_level slab to struct
   write_commit_graph_context as well as struct commit_graph.
 * Clarified commit message for "commit-graph: consolidate
   fill_commit_graph_info"
 * Removed the claim "GDAT can store future generation numbers" because it
   hasn't been tested yet.

Changes in version 3:

 * Reordered patches to implement corrected commit date before generation
   data chunk [3].
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

[3]
https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed generation commit chains.
 * Improve commit messages for "commit-graph: fix regression when computing
   bloom filter", "commit-graph: consolidate fill_commit_graph_info",
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing Bloom filters
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  t6600-test-reach: generalize *_three_modes
  commit-graph: add a slab to store topological levels
  commit-graph: return 64-bit generation number
  commit-graph: document generation number v2
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()

 .../technical/commit-graph-format.txt         |  28 +-
 Documentation/technical/commit-graph.txt      |  77 +++++-
 commit-graph.c                                | 251 ++++++++++++++----
 commit-graph.h                                |  15 +-
 commit-reach.c                                |  38 +--
 commit-reach.h                                |   2 +-
 commit.c                                      |   4 +-
 commit.h                                      |   5 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |  24 +-
 t/t5318-commit-graph.sh                       |  79 +++++-
 t/t5324-split-commit-graph.sh                 | 193 +++++++++++++-
 t/t6404-recursive-merge.sh                    |   5 +-
 t/t6600-test-reach.sh                         |  68 ++---
 t/test-lib-functions.sh                       |   6 +
 upload-pack.c                                 |   2 +-
 19 files changed, 667 insertions(+), 154 deletions(-)

base-commit: e6362826a0409539642a5738db61827e5978e2e4
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v7
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v7
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v6:

  1:  4d8eb415578 =  1:  9ac331b63ee commit-graph: fix regression when computing Bloom filters
  2:  05dcb862818 =  2:  90ca0a1fd69 revision: parse parent in indegree_walk_step()
  3:  dcb9891d819 =  3:  b3040696d43 commit-graph: consolidate fill_commit_graph_info
  4:  4fbdee7ac90 =  4:  085085a4330 t6600-test-reach: generalize *_three_modes
  5:  fbd8feb5d8c =  5:  3b1aae4106a commit-graph: add a slab to store topological levels
  6:  855ff662a44 =  6:  ea32cba16ef commit-graph: return 64-bit generation number
 11:  e571f03d8bd !  7:  8647b5d2e38 doc: add corrected commit date info
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>

       ## Commit message ##
     -    doc: add corrected commit date info
     +    commit-graph: document generation number v2

     -    With generation data chunk and corrected commit dates implemented, let's
     -    update the technical documentation for commit-graph.
     +    Git uses topological levels in the commit-graph file for commit-graph
     +    traversal operations like 'git log --graph'. Unfortunately, topological
     +    levels can perform worse than committer date when parents of a commit
     +    differ greatly in generation numbers [1]. For example, 'git merge-base
     +    v4.8 v4.9' on the Linux repository walks 635,579 commits using
     +    topological levels and walks 167,468 using committer date. Since
     +    091f4cf3 (commit: don't use generation numbers if not needed,
     +    2018-08-30), 'git merge-base' uses committer date heuristic unless there
     +    is a cutoff because of the performance hit.
     +
     +    [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
     +
     +    Thus, the need for generation number v2 was born. As Git used to die
     +    when graph version understood by it and in the commit-graph file are
     +    different [2], we needed a way to distinguish between the old and new
     +    generation number without incrementing the graph version.
     +
     +    [2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
     +
     +    The following candidates were proposed (https://github.com/derrickstolee/gen-test,
     +    https://github.com/abhishekkumar2718/git/pull/1):
     +    - (Epoch, Date) Pairs.
     +    - Maximum Generation Numbers.
     +    - Corrected Commit Date.
     +    - FELINE Index.
     +    - Corrected Commit Date with Monotonically Increasing Offsets.
     +
     +    Based on performance, local computability, and immutability (along with
     +    the introduction of an additional commit-graph chunk which relieved the
     +    requirement of backwards-compatibility) Corrected Commit Date was chosen
     +    as generation number v2 and is defined as follows:
     +
     +    For a commit C, let its corrected commit date  be the maximum of the
     +    commit date of C and the corrected commit dates of its parents plus 1.
     +    Then corrected commit date offset is the difference between corrected
     +    commit date of C and commit date of C. As a special case, a root commit
     +    with the timestamp zero has corrected commit date of 1 to distinguish it
     +    from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
     +
     +    While it was proposed initially to store corrected commit date offsets
     +    within Commit Data Chunk, storing the offsets in a new chunk did not
     +    affect the performance measurably. The new chunk is "Generation DATa
     +    (GDAT) chunk" and it stores corrected commit date offsets while CDAT
     +    chunk stores topological level. The old versions of Git would ignore
     +    GDAT chunk, using topological levels from CDAT chunk. In contrast, new
     +    versions of Git would use corrected commit dates, falling back to
     +    topological level if the generation data chunk is absent in the
     +    commit-graph file.
     +
     +    While storing corrected commit date offsets saves us 4 bytes per commit
     +    (as compared with storing corrected commit dates directly), it's however
     +    possible for the offset to overflow the space allocated. To handle such
     +    cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
     +    stores the corrected commit date. For overflowing offsets, we set MSB
     +    and store the position into the GDOV chunk, in a mechanism similar to
     +    the Extra Edges list chunk.
     +
     +    For mixed generation number environment (for example new Git on the
     +    command line, old Git used by GUI client), we can encounter a
     +    mixed-chain commit-graph (a commit-graph chain where some of split
     +    commit-graph files have GDAT chunk and others do not). As backward
     +    compatibility is one of the goals, we can define the following behavior:
     +
     +    While reading a mixed-chain commit-graph version, we fall back on
     +    topological levels as corrected commit dates and topological levels
     +    cannot be compared directly.
     +
     +    When adding new layer to the split commit-graph file, and when merging
     +    some or all layers (replacing them in the latter case), the new layer
     +    will have GDAT chunk if and only if in the final result there would be
     +    no layer without GDAT chunk just below it.

          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

  7:  8fbe7486405 =  8:  ec598f1d500 commit-graph: implement corrected commit date
  8:  6d0696ae216 =  9:  71d81518857 commit-graph: implement generation data chunk
  9:  fba0d7f3dfe = 10:  07a88f1aae6 commit-graph: use generation v2 only if entire chain does
 10:  ba1f2c5555f = 11:  523e2d4a902 commit-reach: use corrected commit dates in paint_down_to_common()

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 211+ messages in thread

* [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                               ` (10 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.

But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.

Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters.

Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].

So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.

[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f3486ec18f1..78de312ccec 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -139,13 +139,17 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }

+/* 
+ * Should be used only while writing commit-graph as it compares
+ * generation value of commits by directly accessing commit-slab.
+ */
 static int commit_gen_cmp(const void *va, const void *vb)
 {
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;

-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 02/11] revision: parse parent in indegree_walk_step()
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                               ` (9 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 0b5c7231401..5474001331a 100644
--- a/revision.c
+++ b/revision.c
@@ -3399,6 +3399,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+			return;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
                               ` (8 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 27 ++++++++++-----------------
 t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78de312ccec..955418bd6e5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -753,15 +753,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -776,38 +785,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
+	fill_commit_graph_info(item, g, pos);
 
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
 	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
 	test_cmp expect actual
 '
 
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+	rm -f .git/index &&
+	echo foo >file &&
+	git add file &&
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
+		git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+	git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+	echo 2514 >expect &&
+	tar_info future.tar | cut -d" " -f2 >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
 		git commit -m "tempori parendum"
 '
 
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
 	git archive HEAD >future.tar
 '
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (2 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
                               ` (7 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without Generation Data
chunk. These files represent commit-graph files written by Old Git and
are neccesary for backward compatability.

We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-run_three_modes () {
+run_all_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
 	test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
 	test_cmp expect actual
 }
 
-test_three_modes () {
-	run_three_modes test-tool reach "$@"
+test_all_modes () {
+	run_all_modes test-tool reach "$@"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
 	B:commit-4-9
 	EOF
 	echo "ref_newer(A,B):0" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
 	B:commit-2-3
 	EOF
 	echo "ref_newer(A,B):1" >expect &&
-	test_three_modes ref_newer
+	test_all_modes ref_newer
 '
 
 test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
 	B:commit-8-8
 	EOF
 	echo "in_merge_bases(A,B):1" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
 	B:commit-5-9
 	EOF
 	echo "in_merge_bases(A,B):0" >expect &&
-	test_three_modes in_merge_bases
+	test_all_modes in_merge_bases
 '
 
 test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
 	X:commit-5-7
 	EOF
 	echo "in_merge_bases_many(A,X):1" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
 	X:commit-8-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
 	X:commit-6-6
 	EOF
 	echo "in_merge_bases_many(A,X):0" >expect &&
-	test_three_modes in_merge_bases_many
+	test_all_modes in_merge_bases_many
 '
 
 test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
 	X:commit-1-1
 	EOF
 	echo "is_descendant_of(A,X):1" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
 	X:commit-7-6
 	EOF
 	echo "is_descendant_of(A,X):0" >expect &&
-	test_three_modes is_descendant_of
+	test_all_modes is_descendant_of
 '
 
 test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
 		git rev-parse commit-5-6 \
 			      commit-4-7 | sort
 	} >expect &&
-	test_three_modes get_merge_bases_many
+	test_all_modes get_merge_bases_many
 '
 
 test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
 			      commit-2-8 \
 			      commit-1-10 | sort
 	} >expect &&
-	test_three_modes reduce_heads
+	test_all_modes reduce_heads
 '
 
 test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach(X,Y):1" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
 	Y:commit-8-5
 	EOF
 	echo "can_all_from_reach(X,Y):0" >expect &&
-	test_three_modes can_all_from_reach
+	test_all_modes can_all_from_reach
 '
 
 test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
 	Y:commit-8-1
 	EOF
 	echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-	test_three_modes can_all_from_reach_with_flag
+	test_all_modes can_all_from_reach_with_flag
 '
 
 test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
 	X:commit-9-3
 	EOF
 	echo "commit_contains(_,A,X,_):0" >expect &&
-	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_all_modes commit_contains &&
+	test_all_modes commit_contains --tag
 '
 
 test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-6-6
+	run_all_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+	run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+	run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+	run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
 			      commit-1-7 \
 			      commit-5-6 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
 		git rev-parse commit-3-3 \
 			      commit-1-7 | sort
 	) >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
 	Y:commit-2-8
 	EOF
 	echo "get_reachable_subset(X,Y)" >expect &&
-	test_three_modes get_reachable_subset
+	test_all_modes get_reachable_subset
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 05/11] commit-graph: add a slab to store topological levels
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (3 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
                               ` (6 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
 commit-graph.h |  1 +
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 955418bd6e5..2f344cce151 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -772,6 +774,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+	if (g->topo_levels)
+		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -960,6 +965,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
+	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
@@ -1106,7 +1112,7 @@ static int write_graph_chunk_data(struct hashfile *f,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1336,11 +1342,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1348,29 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else if (level > max_level) {
+					max_level = level;
 				}
 			}
 
 			if (all_parents_computed) {
-				struct commit_graph_data *data = commit_graph_data_at(current);
-
-				data->generation = max_generation + 1;
 				pop_commit(&list);
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
 	}
@@ -2106,6 +2108,7 @@ int write_commit_graph(struct object_directory *odb,
 	int res = 0;
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct topo_level_slab topo_levels;
 
 	prepare_repo_settings(the_repository);
 	if (!the_repository->settings.core_commit_graph) {
@@ -2132,6 +2135,18 @@ int write_commit_graph(struct object_directory *odb,
 							 bloom_settings.max_changed_paths);
 	ctx->bloom_settings = &bloom_settings;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
+	if (ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			g->topo_levels = &topo_levels;
+			g = g->base_graph;
+		}
+	}
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
 
+	struct topo_level_slab *topo_levels;
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 06/11] commit-graph: return 64-bit generation number
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (4 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 07/11] commit-graph: document generation number v2 Abhishek Kumar via GitGitGadget
                               ` (5 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 22 +++++++++++-----------
 commit-graph.h |  4 ++--
 commit-reach.c | 36 ++++++++++++++++++------------------
 commit-reach.h |  2 +-
 commit.c       |  4 ++--
 commit.h       |  4 ++--
 revision.c     | 10 +++++-----
 upload-pack.c  |  2 +-
 8 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2f344cce151..8f17815021d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -150,8 +150,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	const timestamp_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
@@ -1370,8 +1370,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			if (all_parents_computed) {
 				pop_commit(&list);
 
-				if (max_level > GENERATION_NUMBER_MAX - 1)
-					max_level = GENERATION_NUMBER_MAX - 1;
+				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 			}
 		}
@@ -2367,8 +2367,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_generation = 0;
+		timestamp_t generation;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2432,16 +2432,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
 		 * extra logic in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
+		if (max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
 		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 	const struct commit *a = *(const struct commit * const *)_a;
 	const struct commit *b = *(const struct commit * const *)_b;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
 
 	if (generation_a < generation_b)
 		return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.c b/commit.c
index bab8d5ab07c..4c717329ee0 100644
--- a/commit.c
+++ b/commit.c
@@ -753,8 +753,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a),
+			  generation_b = commit_graph_generation(b);
 
 	/* newer commits first */
 	if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index f4e7b0158e2..742d96c41e8 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
 #include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
diff --git a/revision.c b/revision.c
index 5474001331a..a54d2bd28df 100644
--- a/revision.c
+++ b/revision.c
@@ -3302,7 +3302,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3370,7 +3370,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3415,7 +3415,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3473,7 +3473,7 @@ static void init_topo_walk(struct rev_info *revs)
 	info->min_generation = GENERATION_NUMBER_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (repo_parse_commit_gently(revs->repo, c, 1))
 			continue;
@@ -3541,7 +3541,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 07/11] commit-graph: document generation number v2
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (5 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 08/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
                               ` (4 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit
differ greatly in generation numbers [1]. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using
topological levels and walks 167,468 using committer date. Since
091f4cf3 (commit: don't use generation numbers if not needed,
2018-08-30), 'git merge-base' uses committer date heuristic unless there
is a cutoff because of the performance hit.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thus, the need for generation number v2 was born. As Git used to die
when graph version understood by it and in the commit-graph file are
different [2], we needed a way to distinguish between the old and new
generation number without incrementing the graph version.

[2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

The following candidates were proposed (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1):
- (Epoch, Date) Pairs.
- Maximum Generation Numbers.
- Corrected Commit Date.
- FELINE Index.
- Corrected Commit Date with Monotonically Increasing Offsets.

Based on performance, local computability, and immutability (along with
the introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen
as generation number v2 and is defined as follows:

For a commit C, let its corrected commit date  be the maximum of the
commit date of C and the corrected commit dates of its parents plus 1.
Then corrected commit date offset is the difference between corrected
commit date of C and commit date of C. As a special case, a root commit
with the timestamp zero has corrected commit date of 1 to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).

While it was proposed initially to store corrected commit date offsets
within Commit Data Chunk, storing the offsets in a new chunk did not
affect the performance measurably. The new chunk is "Generation DATa
(GDAT) chunk" and it stores corrected commit date offsets while CDAT
chunk stores topological level. The old versions of Git would ignore
GDAT chunk, using topological levels from CDAT chunk. In contrast, new
versions of Git would use corrected commit dates, falling back to
topological level if the generation data chunk is absent in the
commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit
(as compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
stores the corrected commit date. For overflowing offsets, we set MSB
and store the position into the GDOV chunk, in a mechanism similar to
the Extra Edges list chunk.

For mixed generation number environment (for example new Git on the
command line, old Git used by GUI client), we can encounter a
mixed-chain commit-graph (a commit-graph chain where some of split
commit-graph files have GDAT chunk and others do not). As backward
compatibility is one of the goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels
cannot be compared directly.

When adding new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer
will have GDAT chunk if and only if in the final result there would be
no layer without GDAT chunk just below it.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 28 +++++--
 Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
 2 files changed, 86 insertions(+), 19 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,33 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+    * Generation Data chunk is present only when commit-graph file is written
+      by compatible versions of Git and in case of split commit-graph chains,
+      the topmost layer also has Generation Data chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit date offsets
+      for commits with corrected commit date offsets that cannot be
+      stored within 31 bits.
+    * Generation Data Overflow chunk is present only when Generation Data
+      chunk is present and atleast one corrected commit date offset cannot
+      be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation nummber v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+    (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
     generation numbers, then we always expand the boundary commit with highest
     generation number and can easily detect the stopping condition.
 
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 08/11] commit-graph: implement corrected commit date
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (6 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 07/11] commit-graph: document generation number v2 Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 09/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                               ` (3 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8f17815021d..d1e6ced8647 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1343,9 +1343,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;

 		display_progress(ctx->progress, i + 1);
-		if (level != GENERATION_NUMBER_ZERO)
+		if (level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO)
 			continue;

 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1356,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;

 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;

-				if (level == GENERATION_NUMBER_ZERO) {
+				if (level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
 				}
+
+				if (level > max_level)
+					max_level = level;
+
+				if (corrected_commit_date > max_corrected_commit_date)
+					max_corrected_commit_date = corrected_commit_date;
 			}

 			if (all_parents_computed) {
@@ -1373,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date && current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 09/11] commit-graph: implement generation data chunk
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (7 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 08/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
                               ` (2 subsequent siblings)
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 114 ++++++++++++++++++++++++++++++----
 commit-graph.h                |   3 +
 commit.h                      |   1 +
 t/README                      |   3 +
 t/helper/test-read-graph.c    |   4 ++
 t/t4216-log-bloom.sh          |   4 +-
 t/t5318-commit-graph.sh       |  79 +++++++++++++++++++----
 t/t5324-split-commit-graph.sh |  12 ++--
 t/t6600-test-reach.sh         |   6 ++
 t/test-lib-functions.sh       |   6 ++
 10 files changed, 200 insertions(+), 32 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d1e6ced8647..d2afcc83283 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
@@ -394,6 +398,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+			if (graph->chunk_generation_data_overflow)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data_overflow = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -754,8 +772,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 {
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
-	uint32_t lex_index;
-	uint64_t date_high, date_low;
+	uint32_t lex_index, offset_pos;
+	uint64_t date_high, date_low, offset;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -773,7 +791,19 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data) {
+		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+			if (!g->chunk_generation_data_overflow)
+				die(_("commit-graph requires overflow generation data but has none"));
+
+			offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+			graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+		} else
+			graph_data->generation = item->date + offset;
+	} else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 
 	if (g->topo_levels)
 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -945,6 +975,7 @@ struct write_commit_graph_context {
 	struct oid_array oids;
 	struct packed_commit_list commits;
 	int num_extra_edges;
+	int num_generation_data_overflows;
 	unsigned long approx_nr_objects;
 	struct progress *progress;
 	int progress_done;
@@ -963,7 +994,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 changed_paths:1,
-		 order_by_pack:1;
+		 order_by_pack:1,
+		 write_generation_data:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
@@ -1123,6 +1155,45 @@ static int write_graph_chunk_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	int i, num_generation_data_overflows = 0;
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+			num_generation_data_overflows++;
+		}
+
+		hashwrite_be32(f, offset);
+	}
+
+	return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+						       struct write_commit_graph_context *ctx)
+{
+	int i;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+			hashwrite_be32(f, offset >> 32);
+			hashwrite_be32(f, (uint32_t) offset);
+		}
+	}
+
+	return 0;
+}
+
 static int write_graph_chunk_extra_edges(struct hashfile *f,
 					 struct write_commit_graph_context *ctx)
 {
@@ -1386,6 +1457,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (current->date && current->date > max_corrected_commit_date)
 					max_corrected_commit_date = current->date - 1;
 				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+					ctx->num_generation_data_overflows++;
 			}
 		}
 	}
@@ -1719,6 +1793,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunks[2].id = GRAPH_CHUNKID_DATA;
 	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
 	chunks[2].write_fn = write_graph_chunk_data;
+
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+		ctx->write_generation_data = 0;
+	if (ctx->write_generation_data) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+		num_chunks++;
+	}
+	if (ctx->num_generation_data_overflows) {
+		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+		num_chunks++;
+	}
 	if (ctx->num_extra_edges) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
 		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2139,6 +2228,8 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
+	ctx->write_generation_data = 1;
+	ctx->num_generation_data_overflows = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
@@ -2445,16 +2536,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 			continue;
 
 		/*
-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-		 * extra logic in the following condition.
+		 * If we are using topological level and one of our parents has
+		 * generation GENERATION_NUMBER_V1_MAX, then our generation is
+		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+		 * in the following condition.
 		 */
-		if (max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+		if (generation < max_generation + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
 #include "oidset.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
@@ -68,6 +69,8 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
+	const unsigned char *chunk_generation_data_overflow;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 742d96c41e8..eff94f3f7c2 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
 #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
 #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
 
 struct commit_list {
 	struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
 commit-graph write to compute and write changed path Bloom filters for
 every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
+	if (graph->chunk_generation_data_overflow)
+		printf(" generation_data_overflow");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 0f16c4b9d52..50f206db550 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -43,11 +43,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 '
 
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
-	if test ! -z $2
+	if test ! -z "$2"
 	then
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
 	# valid commit and tree OID
 	git rev-parse HEAD HEAD^{tree} >in &&
 	git commit-graph write --stdin-commits <in &&
-	graph_read_expect 3
+	graph_read_expect 3 generation_data
 '
 
 test_expect_success 'write graph' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "3"
+	graph_read_expect "3" generation_data
 '
 
 test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cat new-idx | git commit-graph write --stdin-packs &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "9" "extra_edges"
+	graph_read_expect "9" "generation_data extra_edges"
 '
 
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
 	git rev-parse merge/1 >>commits-in &&
 	cat commits-in | git commit-graph write --stdin-commits &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "6"
+	graph_read_expect "6" "generation_data"
 '
 
 graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "10" "extra_edges"
+	graph_read_expect "10" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --reachable &&
 	test_path_is_file $objdir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
 	cd "$TRASH_DIRECTORY/bare" &&
 	git commit-graph write &&
 	test_path_is_file $baredir/info/commit-graph &&
-	graph_read_expect "11" "extra_edges"
+	graph_read_expect "11" "generation_data extra_edges"
 '
 
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
 
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
-	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
-	git commit-graph verify >output
+	git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+	git commit-graph verify >output &&
+	graph_read_expect 9 extra_edges
 '
 
 NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
 	)
 '
 
+# We test the overflow-related code with the following repo history:
+#
+#               4:F - 5:N - 6:U
+#              /                \
+# 1:U - 2:N - 3:U                M:N
+#              \                /
+#               7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+	objdir=".git/objects" &&
+	UNIX_EPOCH_ZERO="@0 +0000" &&
+	FUTURE_DATE="@2147483646 +0000" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	cd "$TRASH_DIRECTORY" &&
+	mkdir repo &&
+	cd repo &&
+	git init &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+	test_commit 2 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 3 generation_data &&
+	test_commit --date "$FUTURE_DATE" 4 &&
+	test_commit 5 &&
+	test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+	git branch left &&
+	git reset --hard 3 &&
+	test_commit 7 &&
+	test_commit --date "$FUTURE_DATE" 8 &&
+	test_commit 9 &&
+	git branch right &&
+	git reset --hard 3 &&
+	test_merge M left right &&
+	git commit-graph write --reachable &&
+	graph_read_expect 10 "generation_data generation_data_overflow" &&
+	git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
 test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
 	infodir=".git/objects/info" &&
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 
 	oid_version sha1:1
 	oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
 	git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
 	mv .git/objects/info/commit-graph commit-graph-half &&
 	chmod u+w commit-graph-half &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+	mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+	chmod u+w commit-graph-no-gdat &&
 	git config core.commitGraph true
 '
 
@@ -67,6 +70,9 @@ run_all_modes () {
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
 	"$@" <input >actual &&
+	test_cmp expect actual &&
+	cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 6bca0023168..df5bba07295 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -218,6 +218,12 @@ test_commit () {
 		--signoff)
 			signoff="$1"
 			;;
+		--date)
+			notick=yes
+			GIT_COMMITTER_DATE="$2"
+			GIT_AUTHOR_DATE="$2"
+			shift
+			;;
 		-C)
 			indir="$2"
 			shift
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (8 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 09/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01  6:58             ` [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
  2021-02-01 13:14             ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                |  30 +++++-
 commit-graph.h                |   1 +
 t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
 3 files changed, 210 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2afcc83283..77fef5a240e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -614,6 +614,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+	int read_generation_data;
+
+	if (!g)
+		return;
+
+	read_generation_data = !!g->chunk_generation_data;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -622,6 +637,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(g);
+
 	return g;
 }
 
@@ -791,7 +808,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data) {
+	if (g->read_generation_data) {
 		offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2019,6 +2036,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		if (i < ctx->num_commit_graphs_after)
 			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
 
+		/*
+		 * If the topmost remaining layer has generation data chunk, the
+		 * resultant layer also has generation data chunk.
+		 */
+		if (i == ctx->num_commit_graphs_after - 2)
+			ctx->write_generation_data = !!g->chunk_generation_data;
+
 		i--;
 		g = g->base_graph;
 	}
@@ -2343,6 +2367,8 @@ int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
+	validate_mixed_generation_chain(ctx->r->objects->commit_graph);
+
 	compute_generation_numbers(ctx);
 
 	if (ctx->changed_paths)
@@ -2541,7 +2567,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		 * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
 		 * in the following condition.
 		 */
-		if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+		if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
 			max_generation--;
 
 		generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	unsigned int read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
 	git -C dup commit-graph verify
 '
 
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	graphdir=".git/objects/info/commit-graphs" &&
+	test_oid_cache <<-EOF &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOF
+	git init mixed &&
+	(
+		cd mixed &&
+		git config core.commitGraph true &&
+		git config gc.writeCommitGraph false &&
+		for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split &&
+		graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 1
+		num_commits: $NUM_SECOND_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify &&
+		cat $graphdir/commit-graph-chain
+	)
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+	git clone mixed mixed-no-gdat &&
+	(
+		cd mixed-no-gdat &&
+		for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split=no-merge &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $NUM_THIRD_LAYER_COMMITS
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#      8 commits (No GDAT)
+# ------------------------
+#      7 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+	git clone mixed-no-gdat mixed-merge-no-gdat &&
+	(
+		cd mixed-merge-no-gdat &&
+		for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 4 2
+		num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata
+		EOF
+		test_cmp expect output &&
+		git commit-graph verify
+	)
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+#     16 commits (No GDAT)
+# ------------------------
+#     15 commits (No GDAT)
+# ------------------------
+#     16 commits (No GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+# The top three layers are merged and has generation data chunk as the topmost remaining layer
+# has generation data chunk.
+#
+#     47 commits (GDAT)
+# ------------------------
+#     64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+	git clone mixed-merge-no-gdat mixed-merge-gdat &&
+	(
+		cd mixed-merge-gdat &&
+		for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+		do
+			test_commit $i &&
+			git branch commits/$i || return 1
+		done &&
+		git commit-graph write --reachable --split --size-multiple 1 &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test-tool read-graph >output &&
+		cat >expect <<-EOF &&
+		header: 43475048 1 $(test_oid oid_version) 5 1
+		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+		chunks: oid_fanout oid_lookup commit_metadata generation_data
+		EOF
+		test_cmp expect output
+	)
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+	git clone mixed mixed-replace &&
+	(
+		cd mixed-replace &&
+		git commit-graph write --reachable --split=replace &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir &&
+		graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+		git commit-graph verify
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 211+ messages in thread

* [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common()
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (9 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2021-02-01  6:58             ` Abhishek Kumar via GitGitGadget
  2021-02-01 13:14             ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
  11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01  6:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
	SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  6 ++++++
 commit-reach.c             |  2 +-
 t/t6404-recursive-merge.sh |  5 ++++-
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 77fef5a240e..bf735fac4ea 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -714,6 +714,20 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return g->read_generation_data;
+}
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
 	struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
  */
 int generation_numbers_enabled(struct repository *r);
 
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
 
 enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index c7ab7048f58..eaf48e941e2 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -18,6 +18,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -69,7 +71,7 @@ test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
@@ -85,6 +87,7 @@ test_expect_success 'result contains a conflict' '
 '
 
 test_expect_success 'virtual trees were processed' '
+	# TODO: fragile test, relies on ambigious merge-base resolution
 	git ls-files --stage >out &&
 
 	cat >expect <<-EOF &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 211+ messages in thread

* Re: [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
  2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
                               ` (10 preceding siblings ...)
  2021-02-01  6:58             ` [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2021-02-01 13:14             ` Derrick Stolee
  2021-02-01 18:26               ` Junio C Hamano
  11 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2021-02-01 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Abhishek Kumar, SZEDER Gábor,
	Taylor Blau, Junio C Hamano

On 2/1/2021 1:58 AM, Abhishek Kumar via GitGitGadget wrote:

> Changes in version 7:
> 
>  * Moved the documentation patch ahead of "commit-graph: implement corrected
>    commit date" and elaborated on the introduction of generation number v2.

The only change in this version is this commit message:

>  11:  e571f03d8bd !  7:  8647b5d2e38 doc: add corrected commit date info
>      @@ Metadata
>       Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
>       
>        ## Commit message ##
>      -    doc: add corrected commit date info
>      +    commit-graph: document generation number v2
>       
>      -    With generation data chunk and corrected commit dates implemented, let's
>      -    update the technical documentation for commit-graph.
>      +    Git uses topological levels in the commit-graph file for commit-graph
>      +    traversal operations like 'git log --graph'. Unfortunately, topological
>      +    levels can perform worse than committer date when parents of a commit
>      +    differ greatly in generation numbers [1]. For example, 'git merge-base
>      +    v4.8 v4.9' on the Linux repository walks 635,579 commits using
>      +    topological levels and walks 167,468 using committer date. Since
>      +    091f4cf3 (commit: don't use generation numbers if not needed,
>      +    2018-08-30), 'git merge-base' uses committer date heuristic unless there
>      +    is a cutoff because of the performance hit.
>      +
>      +    [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
>      +
>      +    Thus, the need for generation number v2 was born. As Git used to die
>      +    when graph version understood by it and in the commit-graph file are
>      +    different [2], we needed a way to distinguish between the old and new
>      +    generation number without incrementing the graph version.
>      +
>      +    [2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
>      +
>      +    The following candidates were proposed (https://github.com/derrickstolee/gen-test,
>      +    https://github.com/abhishekkumar2718/git/pull/1):
>      +    - (Epoch, Date) Pairs.
>      +    - Maximum Generation Numbers.
>      +    - Corrected Commit Date.
>      +    - FELINE Index.
>      +    - Corrected Commit Date with Monotonically Increasing Offsets.
>      +
>      +    Based on performance, local computability, and immutability (along with
>      +    the introduction of an additional commit-graph chunk which relieved the
>      +    requirement of backwards-compatibility) Corrected Commit Date was chosen
>      +    as generation number v2 and is defined as follows:
>      +
>      +    For a commit C, let its corrected commit date  be the maximum of the
>      +    commit date of C and the corrected commit dates of its parents plus 1.
>      +    Then corrected commit date offset is the difference between corrected
>      +    commit date of C and commit date of C. As a special case, a root commit
>      +    with the timestamp zero has corrected commit date of 1 to distinguish it
>      +    from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
>      +
>      +    While it was proposed initially to store corrected commit date offsets
>      +    within Commit Data Chunk, storing the offsets in a new chunk did not
>      +    affect the performance measurably. The new chunk is "Generation DATa
>      +    (GDAT) chunk" and it stores corrected commit date offsets while CDAT
>      +    chunk stores topological level. The old versions of Git would ignore
>      +    GDAT chunk, using topological levels from CDAT chunk. In contrast, new
>      +    versions of Git would use corrected commit dates, falling back to
>      +    topological level if the generation data chunk is absent in the
>      +    commit-graph file.
>      +
>      +    While storing corrected commit date offsets saves us 4 bytes per commit
>      +    (as compared with storing corrected commit dates directly), it's however
>      +    possible for the offset to overflow the space allocated. To handle such
>      +    cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
>      +    stores the corrected commit date. For overflowing offsets, we set MSB
>      +    and store the position into the GDOV chunk, in a mechanism similar to
>      +    the Extra Edges list chunk.
>      +
>      +    For mixed generation number environment (for example new Git on the
>      +    command line, old Git used by GUI client), we can encounter a
>      +    mixed-chain commit-graph (a commit-graph chain where some of split
>      +    commit-graph files have GDAT chunk and others do not). As backward
>      +    compatibility is one of the goals, we can define the following behavior:
>      +
>      +    While reading a mixed-chain commit-graph version, we fall back on
>      +    topological levels as corrected commit dates and topological levels
>      +    cannot be compared directly.
>      +
>      +    When adding new layer to the split commit-graph file, and when merging
>      +    some or all layers (replacing them in the latter case), the new layer
>      +    will have GDAT chunk if and only if in the final result there would be
>      +    no layer without GDAT chunk just below it.

While that is a quality message, v6 has landed in 'next' and I've begun
working off of that version. As Taylor attempted to say [1], this topic
should be considered final and updates should be follow-ups on top.

[1] https://lore.kernel.org/git/YBYLwpKdUfxCNwaz@nand.local/

(Of course, if Junio says differently, then listen to him.)

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 211+ messages in thread

* Re: [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
  2021-02-01 13:14             ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-02-01 18:26               ` Junio C Hamano
  0 siblings, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2021-02-01 18:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar, SZEDER Gábor, Taylor Blau

Derrick Stolee <stolee@gmail.com> writes:

>>      +    When adding new layer to the split commit-graph file, and when merging
>>      +    some or all layers (replacing them in the latter case), the new layer
>>      +    will have GDAT chunk if and only if in the final result there would be
>>      +    no layer without GDAT chunk just below it.
>
> While that is a quality message, v6 has landed in 'next' and I've begun
> working off of that version. As Taylor attempted to say [1], this topic
> should be considered final and updates should be follow-ups on top.
>
> [1] https://lore.kernel.org/git/YBYLwpKdUfxCNwaz@nand.local/

Sounds sensible, modulo s/final/solid enough/ ;-)

I would imagine that the "quality message" has something of value to
keep to help future developers, and if that is the case, a follow-up
patch to add to the Documentation/technical/ would be appropriate.

Thanks all, for a quality series.

^ permalink raw reply	[flat|nested] 211+ messages in thread

end of thread, other threads:[~2021-02-01 18:27 UTC | newest]

Thread overview: 211+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-07-28 15:28   ` Taylor Blau
2020-07-30  5:24     ` Abhishek Kumar
2020-08-04  0:46   ` Jakub Narębski
2020-08-04  0:56     ` Taylor Blau
2020-08-04 10:10       ` Jakub Narębski
2020-08-04  7:55     ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-07-28 13:00   ` Derrick Stolee
2020-07-28 15:30     ` Taylor Blau
2020-08-05 23:16   ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-07-28 13:14   ` Derrick Stolee
2020-07-28 15:19     ` René Scharfe
2020-07-28 15:58       ` Derrick Stolee
2020-07-28 16:01     ` Taylor Blau
2020-07-30  6:07     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-07-28 16:03   ` Taylor Blau
2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-07-28 16:12   ` Taylor Blau
2020-07-30  6:52     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
2020-07-28 15:55   ` Derrick Stolee
2020-07-28 16:23     ` Taylor Blau
2020-07-30  7:27     ` Abhishek Kumar
2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
2020-07-30  7:47   ` Abhishek Kumar
2020-07-28 16:35 ` Derrick Stolee
2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-10 16:28     ` Derrick Stolee
2020-08-11 11:03       ` Abhishek Kumar
2020-08-11 12:27         ` Derrick Stolee
2020-08-11 18:58           ` Taylor Blau
2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-10 14:23     ` Derrick Stolee
2020-08-14  4:59       ` Abhishek Kumar
2020-08-14 12:24         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
2020-08-10 16:42     ` Derrick Stolee
2020-08-11 11:36       ` Abhishek Kumar
2020-08-11 12:43         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-17 22:30       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-18 14:18       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-19 17:54       ` Jakub Narębski
2020-08-21  4:11         ` Abhishek Kumar
2020-08-25 11:11           ` Jakub Narębski
2020-09-01 11:35             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-17 13:22       ` Derrick Stolee
2020-08-21 11:05       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-21 13:14       ` Jakub Narębski
2020-08-25  5:04         ` Abhishek Kumar
2020-08-25 12:18           ` Jakub Narębski
2020-09-01 12:06             ` Abhishek Kumar
2020-09-03 13:42               ` Jakub Narębski
2020-09-05 17:21                 ` Abhishek Kumar
2020-09-13 15:39                   ` Jakub Narębski
2020-09-28 21:48                     ` Jakub Narębski
2020-10-05  5:25                       ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-08-21 18:43       ` Jakub Narębski
2020-08-25  6:14         ` Abhishek Kumar
2020-08-25  7:33           ` Jakub Narębski
2020-08-25  7:56             ` Jakub Narębski
2020-09-01 10:26               ` Abhishek Kumar
2020-09-03  9:25                 ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-22  0:05       ` Jakub Narębski
2020-08-25  6:49         ` Abhishek Kumar
2020-08-25 10:07           ` Jakub Narębski
2020-09-01 11:01             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-22 13:09       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-08-22 17:14       ` Jakub Narębski
2020-08-26  7:15         ` Abhishek Kumar
2020-08-26 10:38           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-22 19:09       ` Jakub Narębski
2020-09-01 10:08         ` Abhishek Kumar
2020-09-03 19:11           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-22 22:20       ` Jakub Narębski
2020-08-27  6:39         ` Abhishek Kumar
2020-08-27 12:43           ` Jakub Narębski
2020-08-27 13:15           ` Derrick Stolee
2020-09-01 13:01             ` Abhishek Kumar
2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
     [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
2020-08-17  1:32         ` Fwd: " Taylor Blau
2020-08-17  7:56           ` Jakub Narębski
2020-08-18  6:12       ` Abhishek Kumar
2020-08-23 15:27       ` Jakub Narębski
2020-08-24  2:49         ` Abhishek Kumar
2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-10-24 23:16         ` Jakub Narębski
2020-10-25 20:58           ` Taylor Blau
2020-11-03  5:36             ` Abhishek Kumar
2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-10-24 23:41         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-10-25 10:52         ` Jakub Narębski
2020-10-27  6:33           ` Abhishek Kumar
2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-10-25 13:48         ` Jakub Narębski
2020-11-03  6:40           ` Abhishek Kumar
2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-10-25 22:17         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-10-27 18:53         ` Jakub Narębski
2020-11-03 11:44           ` Abhishek Kumar
2020-11-04 16:45             ` Jakub Narębski
2020-11-05 14:05               ` Philip Oakley
2020-11-05 18:22                 ` Junio C Hamano
2020-11-06 18:26                   ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
2020-11-06 19:33                     ` Extending and updating gitglossary Junio C Hamano
2020-11-08 17:23                     ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
2020-11-10  1:35                       ` Extending and updating gitglossary Jakub Narębski
2020-11-10 14:04                         ` Philip Oakley
2020-11-10 23:52                           ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-10-30 12:45         ` Jakub Narębski
2020-11-06 11:25           ` Abhishek Kumar
2020-11-06 17:56             ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-11-01  0:55         ` Jakub Narębski
2020-11-12 10:01           ` Abhishek Kumar
2020-11-13  9:59             ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-11-03 17:59         ` Jakub Narębski
2020-11-03 18:19           ` Junio C Hamano
2020-11-20 10:33           ` Abhishek Kumar
2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-11-04  1:37         ` Jakub Narębski
2020-11-21  6:30           ` Abhishek Kumar
2020-11-04 23:37       ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
2020-11-22  5:31         ` Abhishek Kumar
2020-12-28 11:15       ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
2020-12-28 11:15         ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-12-30  1:35           ` Derrick Stolee
2021-01-08  5:45             ` Abhishek Kumar
2021-01-05  9:45           ` SZEDER Gábor
2021-01-05  9:47             ` SZEDER Gábor
2021-01-08  5:51             ` Abhishek Kumar
2020-12-28 11:15         ` [PATCH v5 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-12-30  1:53           ` Derrick Stolee
2021-01-10 12:21             ` Abhishek Kumar
2020-12-28 11:16         ` [PATCH v5 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-12-30  3:23           ` Derrick Stolee
2021-01-10 13:13             ` Abhishek Kumar
2021-01-11 12:43               ` Derrick Stolee
2020-12-28 11:16         ` [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-12-28 11:16         ` [PATCH v5 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-12-30  4:35         ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-01-10 14:06           ` Abhishek Kumar
2021-01-16 18:11         ` [PATCH v6 " Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2021-01-16 18:11           ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2021-01-27  0:04             ` SZEDER Gábor
2021-01-30  5:29               ` Abhishek Kumar
2021-01-31  1:45                 ` Taylor Blau
2021-01-18 21:04           ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-01-18 22:00             ` Taylor Blau
2021-01-23 12:11               ` Abhishek Kumar
2021-01-19  0:02             ` Junio C Hamano
2021-01-23 12:07             ` Abhishek Kumar
2021-02-01  6:58           ` [PATCH v7 " Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 07/11] commit-graph: document generation number v2 Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 08/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 09/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2021-02-01  6:58             ` [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2021-02-01 13:14             ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-02-01 18:26               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).