git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH 0/6] [GSoC] Implement Corrected Commit Date
@ 2020-07-28  9:13 Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Here's what left for the PR (which I intend to take on with the second
version of pull request):

 1. Add an option to skip writing generation data chunk (to test whether new
    Git works without GDAT as intended).
 2. Handle writing to commit-graph for mismatched version (that is, merging
    all graphs into a new graph with a GDAT chunk).
 3. Update technical documentation.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
unrelated to my code changes. Still need to investigate further.

Abhishek Kumar (6):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date offset

 blame.c                       |   2 +-
 commit-graph.c                | 181 +++++++++++++++++++++-------------
 commit-graph.h                |   7 +-
 commit-reach.c                |  47 +++------
 commit-reach.h                |   2 +-
 commit.c                      |   9 +-
 commit.h                      |   3 +
 revision.c                    |  17 ++--
 t/helper/test-read-graph.c    |   2 +
 t/t4216-log-bloom.sh          |   4 +-
 t/t5000-tar-tree.sh           |   4 +-
 t/t5318-commit-graph.sh       |  21 ++--
 t/t5324-split-commit-graph.sh |  12 +--
 upload-pack.c                 |   2 +-
 14 files changed, 178 insertions(+), 135 deletions(-)


base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/676
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
  2020-08-04  0:46   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With 3d112755 (commit-graph: examine commits by generation number), Git
knew to sort by generation number before examining the diff when not
using pack order. c49c82aa (commit: move members graph_pos, generation
to a slab, 2020-06-17) moved generation number into a slab and
introduced a helper which returns GENERATION_NUMBER_INFINITY when
writing the graph. Sorting is no longer useful and essentially reverts
the earlier commit.

Let's fix this by accessing generation number directly through the slab.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1af68c297d..5d3c9bd23c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
+
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
  2020-08-05 23:16   ` Jakub Narębski
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 revision.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/revision.c b/revision.c
index 6aa7f4f567..23287d26c3 100644
--- a/revision.c
+++ b/revision.c
@@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
 		struct commit *parent = p->item;
 		int *pi = indegree_slab_at(&info->indegree, parent);
 
+		if (parse_commit_gently(parent, 1) < 0)
+			return ;
+
 		if (*pi)
 			(*pi)++;
 		else
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

The test 'generate tar with future mtime' creates a commit with commit
time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
generation number and has undefined behavior. The test used to pass as
fill_commit_in_graph() did not read commit time from commit graph,
reading commit date from odb instead.

Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c      | 31 ++++++++++++-------------------
 t/t5000-tar-tree.sh |  4 ++--
 2 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5d3c9bd23c..204eb454b2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	const unsigned char *commit_data;
 	struct commit_graph_data *graph_data;
 	uint32_t lex_index;
+	uint64_t date_high, date_low;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
 	lex_index = pos - g->num_commits_in_base;
 	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	graph_data = commit_graph_data_at(item);
 	graph_data->graph_pos = pos;
+
+	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+	date_low = get_be32(commit_data + g->hash_len + 12);
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
-	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	struct commit_graph_data *graph_data;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
 
+	fill_commit_graph_info(item, g, pos);
+
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (pos >= g->num_commits + g->num_commits_in_base)
-		die(_("invalid commit position. commit-graph is likely corrupt"));
-
-	/*
-	 * Store the "full" position, but then use the
-	 * "local" position for the rest of the calculation.
-	 */
-	graph_data = commit_graph_data_at(item);
-	graph_data->graph_pos = pos;
 	lex_index = pos - g->num_commits_in_base;
-
-	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 
 	item->object.parsed = 1;
 
 	set_commit_tree(item, NULL);
 
-	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
-	date_low = get_be32(commit_data + g->hash_len + 12);
-	item->date = (timestamp_t)((date_high << 32) | date_low);
-
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 37655a237c..1986354fc3 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
 	rm -f .git/index &&
 	echo content >file &&
 	git add file &&
-	GIT_COMMITTER_DATE="@68719476737 +0000" \
+	GIT_COMMITTER_DATE="@17179869183 +0000" \
 		git commit -m "tempori parendum"
 '
 
@@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
 '
 
 test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
-	echo 4147 >expect &&
+	echo 2514 >expect &&
 	tar_info future.tar | cut -d" " -f2 >actual &&
 	test_cmp expect actual
 '
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:03   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Comparing commits by generation has been independently defined twice, in
commit-reach and commit. Let's simplify the implementation by moving
compare_commits_by_gen() to commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 15 +++++++++++++++
 commit-graph.h |  2 ++
 commit-reach.c | 15 ---------------
 commit.c       |  9 +++------
 4 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 204eb454b2..1c98f38d69 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -112,6 +112,21 @@ uint32_t commit_graph_generation(const struct commit *c)
 	return data->generation;
 }
 
+int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = _a, *b = _b;
+	const uint32_t generation_a = commit_graph_generation(a);
+	const uint32_t generation_b = commit_graph_generation(b);
+
+	/* older commits first */
+	if (generation_a < generation_b)
+		return -1;
+	else if (generation_a > generation_b)
+		return 1;
+
+	return 0;
+}
+
 static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 {
 	unsigned int i, nth_slab;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..98cc5a3b9d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,4 +145,6 @@ struct commit_graph_data {
  */
 uint32_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
+
+int compare_commits_by_gen(const void *_a, const void *_b);
 #endif
diff --git a/commit-reach.c b/commit-reach.c
index efd5925cbb..c83cc291e7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -561,21 +561,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
diff --git a/commit.c b/commit.c
index 7128895c3a..bed63b41fb 100644
--- a/commit.c
+++ b/commit.c
@@ -731,14 +731,11 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
-	const uint32_t generation_a = commit_graph_generation(a),
-		       generation_b = commit_graph_generation(b);
+	int ret_val = compare_commits_by_gen(a_, b_);
 
 	/* newer commits first */
-	if (generation_a < generation_b)
-		return 1;
-	else if (generation_a > generation_b)
-		return -1;
+	if (ret_val)
+		return -ret_val;
 
 	/* use date as a heuristic when generations are equal */
 	if (a->date < b->date)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 16:12   ` Taylor Blau
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

One of the essential pre-requisites before implementing generation
number as to distinguish between generation numbers v1 and v2 while
still being compatible with old Git.

We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores generation number v2 (and any subsequent versions),
whereas CDAT will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. Newer versions of Git can parse GDAT and
take advantage of newer generation numbers, falling back to topological
levels when GDAT chunk is missing (as it would happen with a commit
graph written by old Git).

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 33 +++++++++++++++++++++++++++++----
 commit-graph.h                |  1 +
 t/helper/test-read-graph.c    |  2 ++
 t/t4216-log-bloom.sh          |  4 ++--
 t/t5318-commit-graph.sh       | 19 +++++++++++--------
 t/t5324-split-commit-graph.sh | 12 ++++++------
 6 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1c98f38d69..ab714f4a76 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 				graph->chunk_commit_data = data + chunk_offset;
 			break;
 
+		case GRAPH_CHUNKID_GENERATION_DATA:
+			if (graph->chunk_generation_data)
+				chunk_repeated = 1;
+			else
+				graph->chunk_generation_data = data + chunk_offset;
+			break;
+
 		case GRAPH_CHUNKID_EXTRAEDGES:
 			if (graph->chunk_extra_edges)
 				chunk_repeated = 1;
@@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+	if (g->chunk_generation_data)
+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	else
+		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
 static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 	}
 }
 
+static void write_graph_chunk_generation_data(struct hashfile *f,
+					      struct write_commit_graph_context *ctx)
+{
+	struct commit **list = ctx->commits.list;
+	int count;
+	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		display_progress(ctx->progress, ++ctx->progress_cnt);
+		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+	}
+}
+
 static void write_graph_chunk_extra_edges(struct hashfile *f,
 					  struct write_commit_graph_context *ctx)
 {
@@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
+	int num_chunks = 4;
 	struct object_id file_hash;
 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
@@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
 	if (ctx->num_extra_edges) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
@@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
 
-	num_chunks = 3;
+	num_chunks = 4;
 	if (ctx->num_extra_edges) {
 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
 						4 * ctx->num_extra_edges;
@@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_fanout(f, ctx);
 	write_graph_chunk_oids(f, hashsz, ctx);
 	write_graph_chunk_data(f, hashsz, ctx);
+	write_graph_chunk_generation_data(f, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
 	if (ctx->changed_paths) {
diff --git a/commit-graph.h b/commit-graph.h
index 98cc5a3b9d..e3d4ba96f4 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -67,6 +67,7 @@ struct commit_graph {
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_generation_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..1c2a5366c7 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" oid_lookup");
 	if (graph->chunk_commit_data)
 		printf(" commit_metadata");
+	if (graph->chunk_generation_data)
+		printf(" generation_data");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
 	if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c855bcd3e7..780855e691 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 26f332d6a3..3ec5248d70 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
 
 graph_read_expect() {
 	OPTIONAL=""
-	NUM_CHUNKS=3
+	NUM_CHUNKS=4
 	if test ! -z $2
 	then
 		OPTIONAL=" $2"
-		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
 	fi
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
 GRAPH_CHUNK_LOOKUP_OFFSET=8
 GRAPH_CHUNK_LOOKUP_WIDTH=12
-GRAPH_CHUNK_LOOKUP_ROWS=5
+GRAPH_CHUNK_LOOKUP_ROWS=6
 GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
 GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
 			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
@@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
 GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
-GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
 GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
 GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
-GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
-			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
+				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
+GRAPH_GENERATION_DATA_WIDTH=4
+GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
+GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
+			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
 GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
 GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
 
@@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
 '
 
 test_expect_success 'detect incorrect generation number' '
-	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 269d0964a3..096a96ec41 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
 	graphdir="$infodir/commit-graphs" &&
 	test_oid_init &&
 	test_oid_cache <<-EOM
-	shallow sha1:1760
-	shallow sha256:2064
+	shallow sha1:2132
+	shallow sha256:2436
 
-	base sha1:1376
-	base sha256:1496
+	base sha1:1408
+	base sha256:1528
 	EOM
 '
 
@@ -29,9 +29,9 @@ graph_read_expect() {
 		NUM_BASE=$2
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 3 $NUM_BASE
+	header: 43475048 1 1 4 $NUM_BASE
 	num_commits: $1
-	chunks: oid_fanout oid_lookup commit_metadata
+	chunks: oid_fanout oid_lookup commit_metadata generation_data
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28  9:13 ` Abhishek Kumar via GitGitGadget
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
  2020-07-28 16:35 ` Derrick Stolee
  7 siblings, 1 reply; 30+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-07-28  9:13 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar, Abhishek Kumar

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With preparations done, let's implement corrected commit date offset.
We add a new commit-slab to store topological levels while writing
commit graph and upgrade number of struct commit_graph_data to 64-bits.

We have to touch many files, upgrading generation number from uint32_t
to timestamp_t.

We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
which tests if verify can detect if a commit graph have
GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
one of dates is Unix epoch zero.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 blame.c                 |   2 +-
 commit-graph.c          | 109 ++++++++++++++++++++++------------------
 commit-graph.h          |   4 +-
 commit-reach.c          |  32 ++++++------
 commit-reach.h          |   2 +-
 commit.h                |   3 ++
 revision.c              |  14 +++---
 t/t5318-commit-graph.sh |   2 +-
 upload-pack.c           |   2 +-
 9 files changed, 93 insertions(+), 77 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..48aa632461 100644
--- a/blame.c
+++ b/blame.c
@@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
 	if (!bd)
 		return 1;
 
-	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
 		return 1;
 
 	filter = get_bloom_filter(r, origin->commit, 0);
diff --git a/commit-graph.c b/commit-graph.c
index ab714f4a76..9647d9f0df 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -65,6 +65,8 @@ void git_test_write_commit_graph_or_die(void)
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+define_commit_slab(topo_level_slab, uint32_t);
+
 /* Keep track of the order in which commits are added to our list. */
 define_commit_slab(commit_pos, int);
 static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -100,15 +102,15 @@ uint32_t commit_graph_position(const struct commit *c)
 	return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
 }
 
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
 {
 	struct commit_graph_data *data =
 		commit_graph_data_slab_peek(&commit_graph_data_slab, c);
 
 	if (!data)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 	else if (data->graph_pos == COMMIT_NOT_FROM_GRAPH)
-		return GENERATION_NUMBER_INFINITY;
+		return GENERATION_NUMBER_V2_INFINITY;
 
 	return data->generation;
 }
@@ -116,8 +118,8 @@ uint32_t commit_graph_generation(const struct commit *c)
 int compare_commits_by_gen(const void *_a, const void *_b)
 {
 	const struct commit *a = _a, *b = _b;
-	const uint32_t generation_a = commit_graph_generation(a);
-	const uint32_t generation_b = commit_graph_generation(b);
+	const timestamp_t generation_a = commit_graph_generation(a);
+	const timestamp_t generation_b = commit_graph_generation(b);
 
 	/* older commits first */
 	if (generation_a < generation_b)
@@ -160,8 +162,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_data_at(a)->generation;
-	uint32_t generation_b = commit_graph_data_at(b)->generation;
+	timestamp_t generation_a = commit_graph_data_at(a)->generation;
+	timestamp_t generation_b = commit_graph_data_at(b)->generation;
 
 	/* lower generation commits first */
 	if (generation_a < generation_b)
@@ -169,11 +171,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -777,8 +774,13 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
 	if (g->chunk_generation_data)
-		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	{
+		/* Read corrected commit date offset from GDAT */
+		graph_data->generation = item->date +
+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+	}
 	else
+		/* Read topological level from CDAT */
 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -950,6 +952,7 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+	struct topo_level_slab *topo_levels;
 
 	char *base_graph_name;
 	int num_commit_graphs_before;
@@ -1102,7 +1105,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
-		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
 
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
@@ -1117,8 +1120,13 @@ static void write_graph_chunk_generation_data(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		timestamp_t offset = commit_graph_data_at(*list)->generation - (*list)->date;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
-		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
+
+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
+
+		hashwrite_be32(f, offset);
 	}
 }
 
@@ -1316,7 +1324,7 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static void compute_generation_numbers(struct write_commit_graph_context *ctx)
+static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit_list *list = NULL;
@@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
-		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
 
 		display_progress(ctx->progress, i + 1);
-		if (generation != GENERATION_NUMBER_INFINITY &&
-		    generation != GENERATION_NUMBER_ZERO)
+		if (topo_level != GENERATION_NUMBER_INFINITY &&
+		    topo_level != GENERATION_NUMBER_ZERO)
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit *current = list->item;
 			struct commit_list *parent;
 			int all_parents_computed = 1;
-			uint32_t max_generation = 0;
+			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = current->date;
 
 			for (parent = current->parents; parent; parent = parent->next) {
-				generation = commit_graph_data_at(parent->item)->generation;
+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
 
-				if (generation == GENERATION_NUMBER_INFINITY ||
-				    generation == GENERATION_NUMBER_ZERO) {
+				if (topo_level == GENERATION_NUMBER_INFINITY ||
+				    topo_level == GENERATION_NUMBER_ZERO) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (generation > max_generation) {
-					max_generation = generation;
+				} else {
+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
+
+					if (topo_level > max_level)
+						max_level = topo_level;
+
+					if (data->generation > max_corrected_commit_date)
+						max_corrected_commit_date = data->generation;
 				}
 			}
 
 			if (all_parents_computed) {
 				struct commit_graph_data *data = commit_graph_data_at(current);
 
-				data->generation = max_generation + 1;
-				pop_commit(&list);
+				if (max_level > GENERATION_NUMBER_MAX - 1)
+					max_level = GENERATION_NUMBER_MAX - 1;
+
+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+				data->generation = max_corrected_commit_date + 1;
 
-				if (data->generation > GENERATION_NUMBER_MAX)
-					data->generation = GENERATION_NUMBER_MAX;
+				pop_commit(&list);
 			}
 		}
 	}
@@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct topo_level_slab topo_levels;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
 	ctx->total_bloom_filter_data_size = 0;
 
+	init_topo_level_slab(&topo_levels);
+	ctx->topo_levels = &topo_levels;
+
 	if (ctx->split) {
 		struct commit_graph *g;
 		prepare_commit_graph(ctx->r);
@@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
-	compute_generation_numbers(ctx);
+	compute_corrected_commit_date_offsets(ctx);
 
 	if (ctx->changed_paths)
 		compute_bloom_filters(ctx);
@@ -2325,8 +2346,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		uint32_t max_generation = 0;
-		uint32_t generation;
+		timestamp_t max_parent_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2365,9 +2386,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_parent_corrected_commit_date)
+				max_parent_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2389,20 +2410,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_parent_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
diff --git a/commit-graph.h b/commit-graph.h
index e3d4ba96f4..20c5848587 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -138,13 +138,13 @@ void disable_commit_graph(struct repository *r);
 
 struct commit_graph_data {
 	uint32_t graph_pos;
-	uint32_t generation;
+	timestamp_t generation;
 };
 
 /*
  * Commits should be parsed before accessing generation, graph positions.
  */
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
 uint32_t commit_graph_position(const struct commit *);
 
 int compare_commits_by_gen(const void *_a, const void *_b);
diff --git a/commit-reach.c b/commit-reach.c
index c83cc291e7..2ce9867ff3 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
 static struct commit_list *paint_down_to_common(struct repository *r,
 						struct commit *one, int n,
 						struct commit **twos,
-						int min_generation)
+						timestamp_t min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
-	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	timestamp_t last_gen = GENERATION_NUMBER_V2_INFINITY;
 
 	if (!min_generation)
 		queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
-		uint32_t generation = commit_graph_generation(commit);
+		timestamp_t generation = commit_graph_generation(commit);
 
 		if (min_generation && generation > last_gen)
-			BUG("bad generation skip %8x > %8x at %s",
+			BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
 			    generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
 		last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		repo_parse_commit(r, array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
-		uint32_t min_generation = commit_graph_generation(array[i]);
+		timestamp_t min_generation = commit_graph_generation(array[i]);
 
 		if (redundant[i])
 			continue;
 		for (j = filled = 0; j < cnt; j++) {
-			uint32_t curr_generation;
+			timestamp_t curr_generation;
 			if (i == j || redundant[j])
 				continue;
 			filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
 {
 	struct commit_list *bases;
 	int ret = 0, i;
-	uint32_t generation, min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t generation, min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	if (repo_parse_commit(r, commit))
 		return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
 					  struct contains_cache *cache,
-					  uint32_t cutoff)
+					  timestamp_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
 	enum contains_result result;
-	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	timestamp_t cutoff = GENERATION_NUMBER_V2_INFINITY;
 	const struct commit_list *p;
 
 	for (p = want; p; p = p->next) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = p->item;
 		load_commit_graph_info(the_repository, c);
 		generation = commit_graph_generation(c);
@@ -565,7 +565,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation)
+				 timestamp_t min_generation)
 {
 	struct commit **list = NULL;
 	int i;
@@ -666,13 +666,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	struct commit_list *from_iter = from, *to_iter = to;
 	int result;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 
 	while (from_iter) {
 		add_object_array(&from_iter->item->object, NULL, &from_objs);
 
 		if (!parse_commit(from_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (from_iter->item->date < min_commit_date)
 				min_commit_date = from_iter->item->date;
 
@@ -686,7 +686,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 
 	while (to_iter) {
 		if (!parse_commit(to_iter->item)) {
-			uint32_t generation;
+			timestamp_t generation;
 			if (to_iter->item->date < min_commit_date)
 				min_commit_date = to_iter->item->date;
 
@@ -726,13 +726,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit_list *found_commits = NULL;
 	struct commit **to_last = to + nr_to;
 	struct commit **from_last = from + nr_from;
-	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	timestamp_t min_generation = GENERATION_NUMBER_V2_INFINITY;
 	int num_to_find = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	for (item = to; item < to_last; item++) {
-		uint32_t generation;
+		timestamp_t generation;
 		struct commit *c = *item;
 
 		parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
 				 time_t min_commit_date,
-				 uint32_t min_generation);
+				 timestamp_t min_generation);
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
diff --git a/commit.h b/commit.h
index e901538909..dd17a81672 100644
--- a/commit.h
+++ b/commit.h
@@ -15,6 +15,9 @@
 #define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_ZERO 0
 
+#define GENERATION_NUMBER_V2_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
diff --git a/revision.c b/revision.c
index 23287d26c3..b978e79601 100644
--- a/revision.c
+++ b/revision.c
@@ -725,7 +725,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (!revs->repo->objects->commit_graph)
 		return -1;
 
-	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
+	if (commit_graph_generation(commit) == GENERATION_NUMBER_V2_INFINITY)
 		return -1;
 
 	filter = get_bloom_filter(revs->repo, commit, 0);
@@ -3270,7 +3270,7 @@ define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
 struct topo_walk_info {
-	uint32_t min_generation;
+	timestamp_t min_generation;
 	struct prio_queue explore_queue;
 	struct prio_queue indegree_queue;
 	struct prio_queue topo_queue;
@@ -3316,7 +3316,7 @@ static void explore_walk_step(struct rev_info *revs)
 }
 
 static void explore_to_depth(struct rev_info *revs,
-			     uint32_t gen_cutoff)
+			     timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3359,7 +3359,7 @@ static void indegree_walk_step(struct rev_info *revs)
 }
 
 static void compute_indegrees_to_depth(struct rev_info *revs,
-				       uint32_t gen_cutoff)
+				       timestamp_t gen_cutoff)
 {
 	struct topo_walk_info *info = revs->topo_walk_info;
 	struct commit *c;
@@ -3414,10 +3414,10 @@ static void init_topo_walk(struct rev_info *revs)
 	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
 	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
 
-	info->min_generation = GENERATION_NUMBER_INFINITY;
+	info->min_generation = GENERATION_NUMBER_V2_INFINITY;
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *c = list->item;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parse_commit_gently(c, 1))
 			continue;
@@ -3478,7 +3478,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	for (p = commit->parents; p; p = p->next) {
 		struct commit *parent = p->item;
 		int *pi;
-		uint32_t generation;
+		timestamp_t generation;
 
 		if (parent->object.flags & UNINTERESTING)
 			continue;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3ec5248d70..43801f07a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -596,7 +596,7 @@ test_expect_success 'detect incorrect generation number' '
 		"generation for commit"
 '
 
-test_expect_success 'detect incorrect generation number' '
+test_expect_failure 'detect incorrect generation number' '
 	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
 		"non-zero generation number"
 '
diff --git a/upload-pack.c b/upload-pack.c
index 951a2b23aa..db2332e687 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -489,7 +489,7 @@ static int got_oid(struct upload_pack_data *data,
 
 static int ok_to_give_up(struct upload_pack_data *data)
 {
-	uint32_t min_generation = GENERATION_NUMBER_ZERO;
+	timestamp_t min_generation = GENERATION_NUMBER_ZERO;
 
 	if (!data->have_obj.nr)
 		return 0;
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:00   ` Derrick Stolee
  2020-07-28 15:30     ` Taylor Blau
  2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 1 reply; 30+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:00 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.

You mentioned this in your blog post. I'm sorry that such a small
issue caused you pain. Perhaps you could summarize a little bit of
how that investigation led you to find this issue?

Question: is this something that is only necessary when we change
the generation number, or is it something that is only _exposed_
by the test suite when we change the generation number? It seems that
it is likely to be an existing bug, but it might be hard to expose
in a test case.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>  
> +		if (parse_commit_gently(parent, 1) < 0)
> +			return ;

Drop the extra space.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Derrick Stolee @ 2020-07-28 13:14 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
> 
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> generation number and has undefined behavior. The test used to pass as
> fill_commit_in_graph() did not read commit time from commit graph,
> reading commit date from odb instead.

I was first confused as to why fill_commit_graph_info() did not
load the timestamp, but the reason is that it is only used by
two methods:

1. fill_commit_in_graph(): this actually leaves the commit in a
   "parsed" state, so the date must be correct. Thus, it parses
   the date out of the commit-graph.

2. load_commit_graph_info(): this only helps to guarantee we
   know the graph_pos and generation number values.

Perhaps add this extra context: you will _need_ the commit date
from the commit-graph in order to populate the generation number
v2 in fill_commit_graph_info().

> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.

The timestamp limit placed in the commit-graph is more restrictive
than 64-bit timestamps, but as your test points out, the maximum
timestamp allowed takes place in the year 2514. That is far enough
away for all real data.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c      | 31 ++++++++++++-------------------
>  t/t5000-tar-tree.sh |  4 ++--
>  2 files changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d3c9bd23c..204eb454b2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	const unsigned char *commit_data;
>  	struct commit_graph_data *graph_data;
>  	uint32_t lex_index;
> +	uint64_t date_high, date_low;
>  
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;
>  
> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		die(_("invalid commit position. commit-graph is likely corrupt"));
> +
>  	lex_index = pos - g->num_commits_in_base;
>  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>  
>  	graph_data = commit_graph_data_at(item);
>  	graph_data->graph_pos = pos;
> +
> +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> +	date_low = get_be32(commit_data + g->hash_len + 12);
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>  
> @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
> -	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	struct commit_graph_data *graph_data;
>  	const unsigned char *commit_data;
>  	uint32_t lex_index;
>  
> +	fill_commit_graph_info(item, g, pos);
> +
>  	while (pos < g->num_commits_in_base)
>  		g = g->base_graph;

This 'while' loop happens in both implementations, so you could
save a miniscule amount of time by placing the call to
fill_commit_graph_info() after the while loop.

> -	if (pos >= g->num_commits + g->num_commits_in_base)
> -		die(_("invalid commit position. commit-graph is likely corrupt"));

> -	/*
> -	 * Store the "full" position, but then use the
> -	 * "local" position for the rest of the calculation.
> -	 */
> -	graph_data = commit_graph_data_at(item);
> -	graph_data->graph_pos = pos;
>  	lex_index = pos - g->num_commits_in_base;
> -
> -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;

I was about to complain about this change, but GRAPH_DATA_WIDTH
is a macro that does an equivalent thing (except the_hash_algo->rawsz
instead of g->hash_len).

>  
>  	item->object.parsed = 1;
>  
>  	set_commit_tree(item, NULL);
>  
> -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> -	date_low = get_be32(commit_data + g->hash_len + 12);
> -	item->date = (timestamp_t)((date_high << 32) | date_low);
> -
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 37655a237c..1986354fc3 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
>  	rm -f .git/index &&
>  	echo content >file &&
>  	git add file &&
> -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> +	GIT_COMMITTER_DATE="@17179869183 +0000" \
>  		git commit -m "tempori parendum"
>  '
>  
> @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
>  '
>  
>  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> -	echo 4147 >expect &&
> +	echo 2514 >expect &&
>  	tar_info future.tar | cut -d" " -f2 >actual &&
>  	test_cmp expect actual
>  '
> 

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 14:54 ` Taylor Blau
  2020-07-30  7:47   ` Abhishek Kumar
  2020-07-28 16:35 ` Derrick Stolee
  7 siblings, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 14:54 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

Hi Abhishek,

On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Very exciting. I have been eagerly following your blog and asking
Stolee about your progress, so I am excited to read these patches.

> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

Interestingly, we use a very similar metric at GitHub to sort commits in
various UI views which have lots of existing machinery that sorts
an abstract collection by each element's "date". Since that sort is
stable, and we want to respect the order that Git delivered, we take the
pairwise max of each successive pair of commits.

> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

I'm sure that I'll learn more when I get to this point, but I would like
to hear more about why you want to store the offset rather than the
corrected commit date itself. It seems that the offset could be either
positive or negative, so you'd only have the range of a signed integer
(rather than storing 8 bytes of a time_t for the full breadth of
possibilities).

I know also that Peff is working on negative timestamp support, so I
would want to hear about what he thinks of this, too.

> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
>
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This will be good to gradually roll-out the new chunk. Another thought
is to control whether or not the commit-graph machinery _reads_ this
chunk if it's present. That can be useful for debugging too (eg., I have
a commit-graph with a GDAT chunk that is broken in some way, what
happens if I don't read that chunk?)

Maybe something like `commitgraph.readsGenerationData`? Incidentally,
I'm preparing a `commitgraph.readsChangedPaths` to control whether or
not we read the Bloom index and data chunks. I'll send that to the list
shortly (it's in my fork somewhere if you want an earlier look), but
that may be a useful reference for you.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).
>  3. Update technical documentation.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
> unrelated to my code changes. Still need to investigate further.
>
> Abhishek Kumar (6):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: implement generation data chunk
>   commit-graph: implement corrected commit date offset
>
>  blame.c                       |   2 +-
>  commit-graph.c                | 181 +++++++++++++++++++++-------------
>  commit-graph.h                |   7 +-
>  commit-reach.c                |  47 +++------
>  commit-reach.h                |   2 +-
>  commit.c                      |   9 +-
>  commit.h                      |   3 +
>  revision.c                    |  17 ++--
>  t/helper/test-read-graph.c    |   2 +
>  t/t4216-log-bloom.sh          |   4 +-
>  t/t5000-tar-tree.sh           |   4 +-
>  t/t5318-commit-graph.sh       |  21 ++--
>  t/t5324-split-commit-graph.sh |  12 +--
>  upload-pack.c                 |   2 +-
>  14 files changed, 178 insertions(+), 135 deletions(-)
>
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
@ 2020-07-28 15:19     ` René Scharfe
  2020-07-28 15:58       ` Derrick Stolee
  2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2020-07-28 15:19 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git, Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

[Had to remove stolee@gmail.com because with it my mail provider
 rejected this email with the following error message:

   Requested action not taken: mailbox unavailable
   invalid DNS MX or A/AAAA resource record.]

Am 28.07.20 um 15:14 schrieb Derrick Stolee:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>
>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>> information present in commit data chunk. Let's simplify the
>> implementation by calling fill_commit_graph_info() within
>> fill_commit_in_graph().
>>
>> The test 'generate tar with future mtime' creates a commit with commit
>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>> generation number and has undefined behavior. The test used to pass as
>> fill_commit_in_graph() did not read commit time from commit graph,
>> reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.

We all may feel like the end of the world is imminent, but do we really
need to set such an arbitrary limit?  OK, that limit was already set two
years ago, and I'm really late.  But still: It's sad to see anything
else than signed 64-bit timestamps to be used in fresh code (after Y2K).
The extra four bytes would fatten up the structures less than the
transition from SHA-1 to SHA-256 will, and no bit twiddling would be
required.  *sigh*

René

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:28   ` Taylor Blau
  2020-07-30  5:24     ` Abhishek Kumar
  2020-08-04  0:46   ` Jakub Narębski
  1 sibling, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 15:28 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.

This last sentence is slightly confusing. Do you think it would be more
clear if you said elaborated a bit? Perhaps something like:

  [...]

  commit_gen_cmp is used when writing a commit-graph to sort commits in
  generation order before computing Bloom filters. Since c49c82aa made
  it so that 'commit_graph_generation()' returns
  'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
  this function. Instead, access the generation number directly through
  the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
  order to access it while writing.

I think the above would be a good extra paragraph in the commit message
provided that you remove the sentence beginning with "Sorting is no
longer useful..."

> Let's fix this by accessing generation number directly through the slab.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +

Nit; this whitespace diff is extraneous, but it's not hurting anything
either. Since it looks like you're rerolling anyway, it would be good to
just get rid of it.

Otherwise this fix makes sense to me.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-07-28 15:30     ` Taylor Blau
  0 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 15:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 09:00:51AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In indegree_walk_step(), we add unvisited parents to the indegree queue.
> > However, parents are not guaranteed to be parsed. As the indegree queue
> > sorts by generation number, let's parse parents before inserting them to
> > ensure the correct priority order.
>
> You mentioned this in your blog post. I'm sorry that such a small
> issue caused you pain. Perhaps you could summarize a little bit of
> how that investigation led you to find this issue?

Indeed ;-). I feel like forgetting to call 'parse_commit_gently()' is a
rite of passage for this part of the code in some sense.

> Question: is this something that is only necessary when we change
> the generation number, or is it something that is only _exposed_
> by the test suite when we change the generation number? It seems that
> it is likely to be an existing bug, but it might be hard to expose
> in a test case.

I tend to agree that this bug probably existed before Abhishek's
changes, but that it's probably more trouble than it's worth to tickle
with a test case. So, I'd be fine with this fix as it is (provided that
the style nit is addressed below, too).

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  revision.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/revision.c b/revision.c
> > index 6aa7f4f567..23287d26c3 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
> >  		struct commit *parent = p->item;
> >  		int *pi = indegree_slab_at(&info->indegree, parent);
> >
> > +		if (parse_commit_gently(parent, 1) < 0)
> > +			return ;
>
> Drop the extra space.
>
> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
@ 2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  0 siblings, 2 replies; 30+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git; +Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> With preparations done,...

I feel like this commit could have been made smaller by doing the
uint32_t -> timestamp_t conversion in a separate patch. That would
make it easier to focus on the changes to the generation number v2
logic.

> let's implement corrected commit date offset.
> We add a new commit-slab to store topological levels while writing

It's important to add: we store topological levels to ensure that older
versions of Git will still have the performance benefits from generation
number v1.

> commit graph and upgrade number of struct commit_graph_data to 64-bits.

Do you mean "update the generation member in struct commit_graph_data
to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
member.

> We have to touch many files, upgrading generation number from uint32_t
> to timestamp_t.

Yes, that's why I recommend doing that in a different step.

> We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> which tests if verify can detect if a commit graph have
> GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> one of dates is Unix epoch zero.

What about the topological levels? Are we caring about verifying the data
that we start to ignore in this new version? I'm hesitant to drop this
right now, but I'm open to it if we really don't see it as a valuable test.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  blame.c                 |   2 +-
>  commit-graph.c          | 109 ++++++++++++++++++++++------------------
>  commit-graph.h          |   4 +-
>  commit-reach.c          |  32 ++++++------
>  commit-reach.h          |   2 +-
>  commit.h                |   3 ++
>  revision.c              |  14 +++---
>  t/t5318-commit-graph.sh |   2 +-
>  upload-pack.c           |   2 +-
>  9 files changed, 93 insertions(+), 77 deletions(-)
> 
> diff --git a/blame.c b/blame.c
> index 82fa16d658..48aa632461 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
>  	if (!bd)
>  		return 1;
>  
> -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
>  		return 1;

I don't see value in changing the name of this macro. It
is only used as the default value for a commit not in the
commit-graph. Changing its value to 0xFFFFFFFF works for
both versions when the type is updated to timestamp_t.

The actually-important change in this patch (not just the
type change) is here:

> -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
>  {
>  	int i;
>  	struct commit_list *list = NULL;
> @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>  
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> +		    topo_level != GENERATION_NUMBER_ZERO)
>  			continue;

Here, our "skip" condition is that the topo_level has been computed.
This should be fine, as we are never reading that out of the commit-graph.
We will never be in a mode where topo_level is computed but corrected
commit-date is not.

>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = current->date;

Later you assign data->generation to be "max_corrected_commit_date + 1",
which made me think this should be "current->date - 1". Is that so? Or,
do we want most offsets to be one instead of zero? Is there value there?

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>  
> -				if (generation == GENERATION_NUMBER_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> +				    topo_level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else {
> +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> +
> +					if (topo_level > max_level)
> +						max_level = topo_level;
> +
> +					if (data->generation > max_corrected_commit_date)
> +						max_corrected_commit_date = data->generation;
>  				}
>  			}
>  
>  			if (all_parents_computed) {
>  				struct commit_graph_data *data = commit_graph_data_at(current);
>  
> -				data->generation = max_generation + 1;
> -				pop_commit(&list);
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +				data->generation = max_corrected_commit_date + 1;
>  
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				pop_commit(&list);
>  			}
>  		}
>  	}

This looks correct, and I've done a tiny bit of perf tests locally.

> @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>  
>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>  	ctx->total_bloom_filter_data_size = 0;
>  
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
>  	if (ctx->split) {
>  		struct commit_graph *g;
>  		prepare_commit_graph(ctx->r);
> @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
>  	} else
>  		ctx->num_commit_graphs_after = 1;
>  
> -	compute_generation_numbers(ctx);
> +	compute_corrected_commit_date_offsets(ctx);

This rename might not be necessary. You are computing both
versions (v1 and v2) so the name change is actually less
accurate than the old name.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 15:58       ` Derrick Stolee
  0 siblings, 0 replies; 30+ messages in thread
From: Derrick Stolee @ 2020-07-28 15:58 UTC (permalink / raw)
  To: René Scharfe, Abhishek Kumar via GitGitGadget, git, Derrick Stolee
  Cc: Jakub Narębski, Abhishek Kumar

On 7/28/2020 11:19 AM, René Scharfe wrote:
> [Had to remove stolee@gmail.com because with it my mail provider
>  rejected this email with the following error message:
> 
>    Requested action not taken: mailbox unavailable
>    invalid DNS MX or A/AAAA resource record.]
> 
> Am 28.07.20 um 15:14 schrieb Derrick Stolee:
>> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number and has undefined behavior. The test used to pass as
>>> fill_commit_in_graph() did not read commit time from commit graph,
>>> reading commit date from odb instead.
>>
>> I was first confused as to why fill_commit_graph_info() did not
>> load the timestamp, but the reason is that it is only used by
>> two methods:
>>
>> 1. fill_commit_in_graph(): this actually leaves the commit in a
>>    "parsed" state, so the date must be correct. Thus, it parses
>>    the date out of the commit-graph.
>>
>> 2. load_commit_graph_info(): this only helps to guarantee we
>>    know the graph_pos and generation number values.
>>
>> Perhaps add this extra context: you will _need_ the commit date
>> from the commit-graph in order to populate the generation number
>> v2 in fill_commit_graph_info().
>>
>>> Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>>
>> The timestamp limit placed in the commit-graph is more restrictive
>> than 64-bit timestamps, but as your test points out, the maximum
>> timestamp allowed takes place in the year 2514. That is far enough
>> away for all real data.
> 
> We all may feel like the end of the world is imminent, but do we really
> need to set such an arbitrary limit?  OK, that limit was already set two
> years ago, and I'm really late.  But still: It's sad to see anything
> else than signed 64-bit timestamps to be used in fresh code (after Y2K).
> The extra four bytes would fatten up the structures less than the
> transition from SHA-1 to SHA-256 will, and no bit twiddling would be
> required.  *sigh*

One thing to consider after generation number v2 is out long enough
is if we could drop the topo-levels and write zeroes for the topo-
level portion. This was valid data in the first version of the
commit-graph, so it would still be valid. Then, we could allow
full 64-bit timestamps again.

This is something to think about again in a year, maybe.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
@ 2020-07-28 16:01     ` Taylor Blau
  2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 16:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
>
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
>
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
>
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
>
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().
>
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
>
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
>
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
>
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
>
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
>
> >
> >  	item->object.parsed = 1;
> >
> >  	set_commit_tree(item, NULL);
> >
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> >
>
> Thanks,
> -Stolee

Agreed with Stolee's review, but otherwise this looks like a faithful
transformation.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen
  2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:03   ` Taylor Blau
  0 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 16:03 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:49AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Comparing commits by generation has been independently defined twice, in
> commit-reach and commit. Let's simplify the implementation by moving
> compare_commits_by_gen() to commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 15 +++++++++++++++
>  commit-graph.h |  2 ++
>  commit-reach.c | 15 ---------------
>  commit.c       |  9 +++------
>  4 files changed, 20 insertions(+), 21 deletions(-)

All looks good to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-07-28 16:12   ` Taylor Blau
  2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 16:12 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Jakub Narębski, Abhishek Kumar

On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> One of the essential pre-requisites before implementing generation
> number as to distinguish between generation numbers v1 and v2 while

s/as/is

> still being compatible with old Git.

Maybe you could add a section here to talk about why this is needed
specifically? That is, you mention it's a prerequisite, but a reader in
a year or two may not remember why. Adding that information here would
be good.

> We are going to introduce a new chunk called Generation Data chunk (or
> GDAT). GDAT stores generation number v2 (and any subsequent versions),
> whereas CDAT will still store topological level.
>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. Newer versions of Git can parse GDAT and
> take advantage of newer generation numbers, falling back to topological
> levels when GDAT chunk is missing (as it would happen with a commit
> graph written by old Git).

...this is exactly the paragraph that I was looking for above. Could you
swap the order of these last two paragraphs? I think that it would make
the patch message far clearer.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
>  commit-graph.h                |  1 +
>  t/helper/test-read-graph.c    |  2 ++
>  t/t4216-log-bloom.sh          |  4 ++--
>  t/t5318-commit-graph.sh       | 19 +++++++++++--------
>  t/t5324-split-commit-graph.sh | 12 ++++++------
>  6 files changed, 51 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1c98f38d69..ab714f4a76 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8

Ugh. I am simultaneously working on a new chunk myself (so a bad
conflict resolution would look at both of us incrementing this number
to the same value without generating a conflict.)

I think the right thing to do here would be to define an enum over chunk
names, and then index an array by that enum (where the value at each
index is the chunk identifier). Then, the last value of that enum would
be a '__COUNT' which you could use to initialize the array (as well as
within the commit-graph writing routines).

Anyway, I think that it's probably not worth it in the meantime, but it
is something that Junio should look out for when merging (if yours and
my topic happen to get merged around the same time, which they may not).

>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
>  				graph->chunk_commit_data = data + chunk_offset;
>  			break;
>
> +		case GRAPH_CHUNKID_GENERATION_DATA:
> +			if (graph->chunk_generation_data)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_generation_data = data + chunk_offset;
> +			break;
> +
>  		case GRAPH_CHUNKID_EXTRAEDGES:
>  			if (graph->chunk_extra_edges)
>  				chunk_repeated = 1;
> @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +	if (g->chunk_generation_data)
> +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> +	else
> +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  	}
>  }
>
> +static void write_graph_chunk_generation_data(struct hashfile *f,
> +					      struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	int count;
> +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> +		display_progress(ctx->progress, ++ctx->progress_cnt);
> +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> +	}
> +}
> +

This pointer arithmetic is not necessary. Why not like:

  int i;
  for (i = 0; i < ctx->commits.nr; i++) {
    struct commit *c = ctx->commits.list[i];
    display_progress(ctx->progress, ++ctx->progress_cnt);
    hashwrite_be32(f, commit_graph_data_at(c)->generation);
  }

instead?

>  static void write_graph_chunk_extra_edges(struct hashfile *f,
>  					  struct write_commit_graph_context *ctx)
>  {
> @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
> -	int num_chunks = 3;
> +	int num_chunks = 4;
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
>  	if (ctx->num_extra_edges) {
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
>  		num_chunks++;
> @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
>  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
>
> -	num_chunks = 3;
> +	num_chunks = 4;
>  	if (ctx->num_extra_edges) {
>  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
>  						4 * ctx->num_extra_edges;
> @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	write_graph_chunk_fanout(f, ctx);
>  	write_graph_chunk_oids(f, hashsz, ctx);
>  	write_graph_chunk_data(f, hashsz, ctx);
> +	write_graph_chunk_generation_data(f, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
>  	if (ctx->changed_paths) {
> diff --git a/commit-graph.h b/commit-graph.h
> index 98cc5a3b9d..e3d4ba96f4 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -67,6 +67,7 @@ struct commit_graph {
>  	const uint32_t *chunk_oid_fanout;
>  	const unsigned char *chunk_oid_lookup;
>  	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_generation_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..1c2a5366c7 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" oid_lookup");
>  	if (graph->chunk_commit_data)
>  		printf(" commit_metadata");
> +	if (graph->chunk_generation_data)
> +		printf(" generation_data");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
>  	if (graph->chunk_bloom_indexes)
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c855bcd3e7..780855e691 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
>  	EOF
>  	test-tool read-graph >actual &&
>  	test_cmp expect actual
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 26f332d6a3..3ec5248d70 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
>
>  graph_read_expect() {
>  	OPTIONAL=""
> -	NUM_CHUNKS=3
> +	NUM_CHUNKS=4
>  	if test ! -z $2
>  	then
>  		OPTIONAL=" $2"
> -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
>  	fi
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
>  GRAPH_BYTE_CHUNK_COUNT=6
>  GRAPH_CHUNK_LOOKUP_OFFSET=8
>  GRAPH_CHUNK_LOOKUP_WIDTH=12
> -GRAPH_CHUNK_LOOKUP_ROWS=5
> +GRAPH_CHUNK_LOOKUP_ROWS=6
>  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
>  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
>  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
>  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
>  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
>  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
>  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> +GRAPH_GENERATION_DATA_WIDTH=4
> +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
>  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
>  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
>
> @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
>  '
>
>  test_expect_success 'detect incorrect generation number' '
> -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
>  		"non-zero generation number"
>  '
>
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 269d0964a3..096a96ec41 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
>  	graphdir="$infodir/commit-graphs" &&
>  	test_oid_init &&
>  	test_oid_cache <<-EOM
> -	shallow sha1:1760
> -	shallow sha256:2064
> +	shallow sha1:2132
> +	shallow sha256:2436
>
> -	base sha1:1376
> -	base sha256:1496
> +	base sha1:1408
> +	base sha256:1528
>  	EOM
>  '
>
> @@ -29,9 +29,9 @@ graph_read_expect() {
>  		NUM_BASE=$2
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 3 $NUM_BASE
> +	header: 43475048 1 1 4 $NUM_BASE
>  	num_commits: $1
> -	chunks: oid_fanout oid_lookup commit_metadata
> +	chunks: oid_fanout oid_lookup commit_metadata generation_data
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output
> --
> gitgitgadget
>

All of this looks good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
@ 2020-07-28 16:23     ` Taylor Blau
  2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2020-07-28 16:23 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
	Abhishek Kumar

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With preparations done,...
>
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.

Yep, agreed.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
>
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
>
> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
>
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
>
> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
>
> Yes, that's why I recommend doing that in a different step.
>
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
>
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  blame.c                 |   2 +-
> >  commit-graph.c          | 109 ++++++++++++++++++++++------------------
> >  commit-graph.h          |   4 +-
> >  commit-reach.c          |  32 ++++++------
> >  commit-reach.h          |   2 +-
> >  commit.h                |   3 ++
> >  revision.c              |  14 +++---
> >  t/t5318-commit-graph.sh |   2 +-
> >  upload-pack.c           |   2 +-
> >  9 files changed, 93 insertions(+), 77 deletions(-)
> >
> > diff --git a/blame.c b/blame.c
> > index 82fa16d658..48aa632461 100644
> > --- a/blame.c
> > +++ b/blame.c
> > @@ -1272,7 +1272,7 @@ static int maybe_changed_path(struct repository *r,
> >  	if (!bd)
> >  		return 1;
> >
> > -	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
> > +	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_V2_INFINITY)
> >  		return 1;
>
> I don't see value in changing the name of this macro. It
> is only used as the default value for a commit not in the
> commit-graph. Changing its value to 0xFFFFFFFF works for
> both versions when the type is updated to timestamp_t.
>
> The actually-important change in this patch (not just the
> type change) is here:
>
> > -static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > +static void compute_corrected_commit_date_offsets(struct write_commit_graph_context *ctx)
> >  {
> >  	int i;
> >  	struct commit_list *list = NULL;
> > @@ -1326,11 +1334,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (topo_level != GENERATION_NUMBER_INFINITY &&
> > +		    topo_level != GENERATION_NUMBER_ZERO)
> >  			continue;
>
> Here, our "skip" condition is that the topo_level has been computed.
> This should be fine, as we are never reading that out of the commit-graph.
> We will never be in a mode where topo_level is computed but corrected
> commit-date is not.
>
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1338,29 +1346,38 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = current->date;
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
>
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (topo_level == GENERATION_NUMBER_INFINITY ||
> > +				    topo_level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else {
> > +					struct commit_graph_data *data = commit_graph_data_at(parent->item);
> > +
> > +					if (topo_level > max_level)
> > +						max_level = topo_level;
> > +
> > +					if (data->generation > max_corrected_commit_date)
> > +						max_corrected_commit_date = data->generation;
> >  				}
> >  			}
> >
> >  			if (all_parents_computed) {
> >  				struct commit_graph_data *data = commit_graph_data_at(current);
> >
> > -				data->generation = max_generation + 1;
> > -				pop_commit(&list);
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +				data->generation = max_corrected_commit_date + 1;
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				pop_commit(&list);
> >  			}
> >  		}
> >  	}
>
> This looks correct, and I've done a tiny bit of perf tests locally.
>
> > @@ -2085,6 +2102,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2099,6 +2117,9 @@ int write_commit_graph(struct object_directory *odb,
> >  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> >  	ctx->total_bloom_filter_data_size = 0;
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> >  	if (ctx->split) {
> >  		struct commit_graph *g;
> >  		prepare_commit_graph(ctx->r);
> > @@ -2197,7 +2218,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	} else
> >  		ctx->num_commit_graphs_after = 1;
> >
> > -	compute_generation_numbers(ctx);
> > +	compute_corrected_commit_date_offsets(ctx);
>
> This rename might not be necessary. You are computing both
> versions (v1 and v2) so the name change is actually less
> accurate than the old name.
>
> Thanks,
> -Stolee

I don't have anything to add that Stolee hasn't already pointed out.
Thanks for your work on this series, and I'm looking forward to another
reroll.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-28 16:35 ` Derrick Stolee
  7 siblings, 0 replies; 30+ messages in thread
From: Derrick Stolee @ 2020-07-28 16:35 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget, git
  Cc: Jakub Narębski, Abhishek Kumar, Taylor Blau

On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
> 
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This would be a good idea, if only as a GIT_TEST_* environment variable.
I think it important we have a test for the compatibility scenario where
we have an "old" commit-graph with the new code and test that reading and
writing still works properly.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).

This is an excellent thing to do. There are a few options when writing an
incremental commit-graph when the base graphs do not have the GDAT chunk:

   i. Do not write the GDAT chunk unless we are merging all levels
      (based on the merging strategy).

  ii. Merge all levels, then write the GDAT chunk.

>  3. Update technical documentation.

Yes, I was going to ask for a patch that updates
Documentation/technical/commit-graph-format.txt.

This is an excellent v1. A lot of small things, but no
really big issues.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28 15:28   ` Taylor Blau
@ 2020-07-30  5:24     ` Abhishek Kumar
  0 siblings, 0 replies; 30+ messages in thread
From: Abhishek Kumar @ 2020-07-30  5:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, abhishekkumar8222, gitgitgadget, jnareb, stolee

On Tue, Jul 28, 2020 at 11:28:44AM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> 
> This last sentence is slightly confusing. Do you think it would be more
> clear if you said elaborated a bit? Perhaps something like:
> 
>   [...]
> 
>   commit_gen_cmp is used when writing a commit-graph to sort commits in
>   generation order before computing Bloom filters. Since c49c82aa made
>   it so that 'commit_graph_generation()' returns
>   'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
>   this function. Instead, access the generation number directly through
>   the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
>   order to access it while writing.
> 

Thanks! That is clearer. Will change.

> I think the above would be a good extra paragraph in the commit message
> provided that you remove the sentence beginning with "Sorting is no
> longer useful..."
> 
> > Let's fix this by accessing generation number directly through the slab.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> 
> Nit; this whitespace diff is extraneous, but it's not hurting anything
> either. Since it looks like you're rerolling anyway, it would be good to
> just get rid of it.
> 
> Otherwise this fix makes sense to me.
> 
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > --
> > gitgitgadget
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info
  2020-07-28 13:14   ` Derrick Stolee
  2020-07-28 15:19     ` René Scharfe
  2020-07-28 16:01     ` Taylor Blau
@ 2020-07-30  6:07     ` Abhishek Kumar
  2 siblings, 0 replies; 30+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb

On Tue, Jul 28, 2020 at 09:14:42AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> > 
> > The test 'generate tar with future mtime' creates a commit with commit
> > time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
> > generation number and has undefined behavior. The test used to pass as
> > fill_commit_in_graph() did not read commit time from commit graph,
> > reading commit date from odb instead.
> 
> I was first confused as to why fill_commit_graph_info() did not
> load the timestamp, but the reason is that it is only used by
> two methods:
> 
> 1. fill_commit_in_graph(): this actually leaves the commit in a
>    "parsed" state, so the date must be correct. Thus, it parses
>    the date out of the commit-graph.
> 
> 2. load_commit_graph_info(): this only helps to guarantee we
>    know the graph_pos and generation number values.
> 
> Perhaps add this extra context: you will _need_ the commit date
> from the commit-graph in order to populate the generation number
> v2 in fill_commit_graph_info().

Thanks, that makes sense. I have revised the commit message to:

commit-graph: consolidate fill_commit_graph_info
    
    Both fill_commit_graph_info() and fill_commit_in_graph() parse
    information present in commit data chunk. Let's simplify the
    implementation by calling fill_commit_graph_info() within
    fill_commit_in_graph().
    
    The test 'generate tar with future mtime' creates a commit with commit
    time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
    generation number (within CDAT chunk) and has undefined behavior.
    
    The test used to pass as fill_commit_in_graph() guarantees the values of
    graph position and generation number, and did not load timestamp.
    However, with corrected commit date we will need load the timestamp as
    well to populate the generation number.
> 
> > Let's fix that by setting commit time of (2 ^ 34 - 1) seconds.
> 
> The timestamp limit placed in the commit-graph is more restrictive
> than 64-bit timestamps, but as your test points out, the maximum
> timestamp allowed takes place in the year 2514. That is far enough
> away for all real data.
> 
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c      | 31 ++++++++++++-------------------
> >  t/t5000-tar-tree.sh |  4 ++--
> >  2 files changed, 14 insertions(+), 21 deletions(-)
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 5d3c9bd23c..204eb454b2 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -735,15 +735,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	const unsigned char *commit_data;
> >  	struct commit_graph_data *graph_data;
> >  	uint32_t lex_index;
> > +	uint64_t date_high, date_low;
> >  
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> >  
> > +	if (pos >= g->num_commits + g->num_commits_in_base)
> > +		die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> >  	lex_index = pos - g->num_commits_in_base;
> >  	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >  
> >  	graph_data = commit_graph_data_at(item);
> >  	graph_data->graph_pos = pos;
> > +
> > +	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > +	date_low = get_be32(commit_data + g->hash_len + 12);
> > +	item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >  
> > @@ -758,38 +767,22 @@ static int fill_commit_in_graph(struct repository *r,
> >  {
> >  	uint32_t edge_value;
> >  	uint32_t *parent_data_ptr;
> > -	uint64_t date_low, date_high;
> >  	struct commit_list **pptr;
> > -	struct commit_graph_data *graph_data;
> >  	const unsigned char *commit_data;
> >  	uint32_t lex_index;
> >  
> > +	fill_commit_graph_info(item, g, pos);
> > +
> >  	while (pos < g->num_commits_in_base)
> >  		g = g->base_graph;
> 
> This 'while' loop happens in both implementations, so you could
> save a miniscule amount of time by placing the call to
> fill_commit_graph_info() after the while loop.
> 
> > -	if (pos >= g->num_commits + g->num_commits_in_base)
> > -		die(_("invalid commit position. commit-graph is likely corrupt"));
> 
> > -	/*
> > -	 * Store the "full" position, but then use the
> > -	 * "local" position for the rest of the calculation.
> > -	 */
> > -	graph_data = commit_graph_data_at(item);
> > -	graph_data->graph_pos = pos;
> >  	lex_index = pos - g->num_commits_in_base;
> > -
> > -	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> > +	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> 
> I was about to complain about this change, but GRAPH_DATA_WIDTH
> is a macro that does an equivalent thing (except the_hash_algo->rawsz
> instead of g->hash_len).
> 
> >  
> >  	item->object.parsed = 1;
> >  
> >  	set_commit_tree(item, NULL);
> >  
> > -	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > -	date_low = get_be32(commit_data + g->hash_len + 12);
> > -	item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> >  	pptr = &item->parents;
> >  
> >  	edge_value = get_be32(commit_data + g->hash_len);
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 37655a237c..1986354fc3 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -406,7 +406,7 @@ test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> >  	rm -f .git/index &&
> >  	echo content >file &&
> >  	git add file &&
> > -	GIT_COMMITTER_DATE="@68719476737 +0000" \
> > +	GIT_COMMITTER_DATE="@17179869183 +0000" \
> >  		git commit -m "tempori parendum"
> >  '
> >  
> > @@ -415,7 +415,7 @@ test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> >  '
> >  
> >  test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > -	echo 4147 >expect &&
> > +	echo 2514 >expect &&
> >  	tar_info future.tar | cut -d" " -f2 >actual &&
> >  	test_cmp expect actual
> >  '
> > 
> 
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] commit-graph: implement generation data chunk
  2020-07-28 16:12   ` Taylor Blau
@ 2020-07-30  6:52     ` Abhishek Kumar
  0 siblings, 0 replies; 30+ messages in thread
From: Abhishek Kumar @ 2020-07-30  6:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: dstolee, jnareb, git, abhishekkumar8222, gitgitgadget

On Tue, Jul 28, 2020 at 12:12:50PM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:50AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > One of the essential pre-requisites before implementing generation
> > number as to distinguish between generation numbers v1 and v2 while
> 
> s/as/is
> 
> > still being compatible with old Git.
> 
> Maybe you could add a section here to talk about why this is needed
> specifically? That is, you mention it's a prerequisite, but a reader in
> a year or two may not remember why. Adding that information here would
> be good.
> 
> > We are going to introduce a new chunk called Generation Data chunk (or
> > GDAT). GDAT stores generation number v2 (and any subsequent versions),
> > whereas CDAT will still store topological level.
> >
> > Old Git does not understand GDAT chunk and would ignore it, reading
> > topological levels from CDAT. Newer versions of Git can parse GDAT and
> > take advantage of newer generation numbers, falling back to topological
> > levels when GDAT chunk is missing (as it would happen with a commit
> > graph written by old Git).
> 
> ...this is exactly the paragraph that I was looking for above. Could you
> swap the order of these last two paragraphs? I think that it would make
> the patch message far clearer.

Here's revised commit message:

  commit-graph: implement generation data chunk
    
  As discovered by Ævar, we cannot increment graph version to
  distinguish between generation numbers v1 and v2 [1]. Thus, one of
  pre-requistes before implementing generation number v2 was to
  distinguish generation numbers in a backwards compatible manner
  without increment graph version.
  
  We are going to introduce a new chunk called Generation Data chunk (or
  GDAT). GDAT stores generation number v2 (and any subsequent versions),
  whereas CDAT will still store topological level.
  
  Old Git does not understand GDAT chunk and would ignore it, reading
  topological levels from CDAT. New Git can parse GDAT and take advantage
  of newer generation numbers, falling back to topological levels when
  GDAT chunk is missing (as it would happen with a commit graph written
  by old Git).
 
  [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

First paragraph explains why we need this patch (cannot increment graph
version) second explains what this patch does (introduce a new chunk)
and third proves why it works (Old Git ignores GDAT, New Git parses GDAT).

Can we improve this commit message further? 

> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c                | 33 +++++++++++++++++++++++++++++----
> >  commit-graph.h                |  1 +
> >  t/helper/test-read-graph.c    |  2 ++
> >  t/t4216-log-bloom.sh          |  4 ++--
> >  t/t5318-commit-graph.sh       | 19 +++++++++++--------
> >  t/t5324-split-commit-graph.sh | 12 ++++++------
> >  6 files changed, 51 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1c98f38d69..ab714f4a76 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -38,11 +38,12 @@ void git_test_write_commit_graph_or_die(void)
> >  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> >  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> >  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> >  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> >  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> >  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> >  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> > -#define MAX_NUM_CHUNKS 7
> > +#define MAX_NUM_CHUNKS 8
> 
> Ugh. I am simultaneously working on a new chunk myself (so a bad
> conflict resolution would look at both of us incrementing this number
> to the same value without generating a conflict.)
> 
> I think the right thing to do here would be to define an enum over chunk
> names, and then index an array by that enum (where the value at each
> index is the chunk identifier). Then, the last value of that enum would
> be a '__COUNT' which you could use to initialize the array (as well as
> within the commit-graph writing routines).
> 
> Anyway, I think that it's probably not worth it in the meantime, but it
> is something that Junio should look out for when merging (if yours and
> my topic happen to get merged around the same time, which they may not).
> 
> >  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
> >
> > @@ -389,6 +390,13 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
> >  				graph->chunk_commit_data = data + chunk_offset;
> >  			break;
> >
> > +		case GRAPH_CHUNKID_GENERATION_DATA:
> > +			if (graph->chunk_generation_data)
> > +				chunk_repeated = 1;
> > +			else
> > +				graph->chunk_generation_data = data + chunk_offset;
> > +			break;
> > +
> >  		case GRAPH_CHUNKID_EXTRAEDGES:
> >  			if (graph->chunk_extra_edges)
> >  				chunk_repeated = 1;
> > @@ -768,7 +776,10 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	date_low = get_be32(commit_data + g->hash_len + 12);
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> > -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +	if (g->chunk_generation_data)
> > +		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> > +	else
> > +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -1100,6 +1111,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> >  	}
> >  }
> >
> > +static void write_graph_chunk_generation_data(struct hashfile *f,
> > +					      struct write_commit_graph_context *ctx)
> > +{
> > +	struct commit **list = ctx->commits.list;
> > +	int count;
> > +	for (count = 0; count < ctx->commits.nr; count++, list++) {
> > +		display_progress(ctx->progress, ++ctx->progress_cnt);
> > +		hashwrite_be32(f, commit_graph_data_at(*list)->generation);
> > +	}
> > +}
> > +
> 
> This pointer arithmetic is not necessary. Why not like:
> 
>   int i;
>   for (i = 0; i < ctx->commits.nr; i++) {
>     struct commit *c = ctx->commits.list[i];
>     display_progress(ctx->progress, ++ctx->progress_cnt);
>     hashwrite_be32(f, commit_graph_data_at(c)->generation);
>   }
> 
> instead?
> 
> >  static void write_graph_chunk_extra_edges(struct hashfile *f,
> >  					  struct write_commit_graph_context *ctx)
> >  {
> > @@ -1605,7 +1627,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
> >  	const unsigned hashsz = the_hash_algo->rawsz;
> >  	struct strbuf progress_title = STRBUF_INIT;
> > -	int num_chunks = 3;
> > +	int num_chunks = 4;
> >  	struct object_id file_hash;
> >  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> >
> > @@ -1656,6 +1678,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> >  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> >  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> > +	chunk_ids[3] = GRAPH_CHUNKID_GENERATION_DATA;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
> >  		num_chunks++;
> > @@ -1677,8 +1700,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> >  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
> >  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> > +	chunk_offsets[4] = chunk_offsets[3] + sizeof(uint32_t) * ctx->commits.nr;
> >
> > -	num_chunks = 3;
> > +	num_chunks = 4;
> >  	if (ctx->num_extra_edges) {
> >  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> >  						4 * ctx->num_extra_edges;
> > @@ -1728,6 +1752,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> >  	write_graph_chunk_fanout(f, ctx);
> >  	write_graph_chunk_oids(f, hashsz, ctx);
> >  	write_graph_chunk_data(f, hashsz, ctx);
> > +	write_graph_chunk_generation_data(f, ctx);
> >  	if (ctx->num_extra_edges)
> >  		write_graph_chunk_extra_edges(f, ctx);
> >  	if (ctx->changed_paths) {
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 98cc5a3b9d..e3d4ba96f4 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -67,6 +67,7 @@ struct commit_graph {
> >  	const uint32_t *chunk_oid_fanout;
> >  	const unsigned char *chunk_oid_lookup;
> >  	const unsigned char *chunk_commit_data;
> > +	const unsigned char *chunk_generation_data;
> >  	const unsigned char *chunk_extra_edges;
> >  	const unsigned char *chunk_base_graphs;
> >  	const unsigned char *chunk_bloom_indexes;
> > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> > index 6d0c962438..1c2a5366c7 100644
> > --- a/t/helper/test-read-graph.c
> > +++ b/t/helper/test-read-graph.c
> > @@ -32,6 +32,8 @@ int cmd__read_graph(int argc, const char **argv)
> >  		printf(" oid_lookup");
> >  	if (graph->chunk_commit_data)
> >  		printf(" commit_metadata");
> > +	if (graph->chunk_generation_data)
> > +		printf(" generation_data");
> >  	if (graph->chunk_extra_edges)
> >  		printf(" extra_edges");
> >  	if (graph->chunk_bloom_indexes)
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c855bcd3e7..780855e691 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -33,11 +33,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> >  	git commit-graph write --reachable --changed-paths
> >  '
> >  graph_read_expect () {
> > -	NUM_CHUNKS=5
> > +	NUM_CHUNKS=6
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
> >  	EOF
> >  	test-tool read-graph >actual &&
> >  	test_cmp expect actual
> > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> > index 26f332d6a3..3ec5248d70 100755
> > --- a/t/t5318-commit-graph.sh
> > +++ b/t/t5318-commit-graph.sh
> > @@ -71,16 +71,16 @@ graph_git_behavior 'no graph' full commits/3 commits/1
> >
> >  graph_read_expect() {
> >  	OPTIONAL=""
> > -	NUM_CHUNKS=3
> > +	NUM_CHUNKS=4
> >  	if test ! -z $2
> >  	then
> >  		OPTIONAL=" $2"
> > -		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> > +		NUM_CHUNKS=$((4 + $(echo "$2" | wc -w)))
> >  	fi
> >  	cat >expect <<- EOF
> >  	header: 43475048 1 1 $NUM_CHUNKS 0
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data$OPTIONAL
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > @@ -433,7 +433,7 @@ GRAPH_BYTE_HASH=5
> >  GRAPH_BYTE_CHUNK_COUNT=6
> >  GRAPH_CHUNK_LOOKUP_OFFSET=8
> >  GRAPH_CHUNK_LOOKUP_WIDTH=12
> > -GRAPH_CHUNK_LOOKUP_ROWS=5
> > +GRAPH_CHUNK_LOOKUP_ROWS=6
> >  GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
> >  GRAPH_BYTE_OID_LOOKUP_ID=$(($GRAPH_CHUNK_LOOKUP_OFFSET + \
> >  			    1 * $GRAPH_CHUNK_LOOKUP_WIDTH))
> > @@ -451,11 +451,14 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
> >  GRAPH_BYTE_COMMIT_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN))
> >  GRAPH_BYTE_COMMIT_EXTRA_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4))
> >  GRAPH_BYTE_COMMIT_WRONG_PARENT=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3))
> > -GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 11))
> >  GRAPH_BYTE_COMMIT_DATE=$(($GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12))
> >  GRAPH_COMMIT_DATA_WIDTH=$(($HASH_LEN + 16))
> > -GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > -			     $GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_OFFSET=$(($GRAPH_COMMIT_DATA_OFFSET + \
> > +				$GRAPH_COMMIT_DATA_WIDTH * $NUM_COMMITS))
> > +GRAPH_GENERATION_DATA_WIDTH=4
> > +GRAPH_BYTE_COMMIT_GENERATION=$(($GRAPH_GENERATION_DATA_OFFSET + 3))
> > +GRAPH_OCTOPUS_DATA_OFFSET=$(($GRAPH_GENERATION_DATA_OFFSET + \
> > +			     $GRAPH_GENERATION_DATA_WIDTH * $NUM_COMMITS))
> >  GRAPH_BYTE_OCTOPUS=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4))
> >  GRAPH_BYTE_FOOTER=$(($GRAPH_OCTOPUS_DATA_OFFSET + 4 * $NUM_OCTOPUS_EDGES))
> >
> > @@ -594,7 +597,7 @@ test_expect_success 'detect incorrect generation number' '
> >  '
> >
> >  test_expect_success 'detect incorrect generation number' '
> > -	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> > +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\00" \
> >  		"non-zero generation number"
> >  '
> >
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 269d0964a3..096a96ec41 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -14,11 +14,11 @@ test_expect_success 'setup repo' '
> >  	graphdir="$infodir/commit-graphs" &&
> >  	test_oid_init &&
> >  	test_oid_cache <<-EOM
> > -	shallow sha1:1760
> > -	shallow sha256:2064
> > +	shallow sha1:2132
> > +	shallow sha256:2436
> >
> > -	base sha1:1376
> > -	base sha256:1496
> > +	base sha1:1408
> > +	base sha256:1528
> >  	EOM
> >  '
> >
> > @@ -29,9 +29,9 @@ graph_read_expect() {
> >  		NUM_BASE=$2
> >  	fi
> >  	cat >expect <<- EOF
> > -	header: 43475048 1 1 3 $NUM_BASE
> > +	header: 43475048 1 1 4 $NUM_BASE
> >  	num_commits: $1
> > -	chunks: oid_fanout oid_lookup commit_metadata
> > +	chunks: oid_fanout oid_lookup commit_metadata generation_data
> >  	EOF
> >  	test-tool read-graph >output &&
> >  	test_cmp expect output
> > --
> > gitgitgadget
> >
> 
> All of this looks good to me.
> 
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] commit-graph: implement corrected commit date offset
  2020-07-28 15:55   ` Derrick Stolee
  2020-07-28 16:23     ` Taylor Blau
@ 2020-07-30  7:27     ` Abhishek Kumar
  1 sibling, 0 replies; 30+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: abhishekkumar8222, git, jnareb, gitgitgadget

On Tue, Jul 28, 2020 at 11:55:12AM -0400, Derrick Stolee wrote:
> On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > 
> > With preparations done,...
> 
> I feel like this commit could have been made smaller by doing the
> uint32_t -> timestamp_t conversion in a separate patch. That would
> make it easier to focus on the changes to the generation number v2
> logic.
> 

Sure, would seperate into two patches.

> > let's implement corrected commit date offset.
> > We add a new commit-slab to store topological levels while writing
> 
> It's important to add: we store topological levels to ensure that older
> versions of Git will still have the performance benefits from generation
> number v1.
> 

Will do.

> > commit graph and upgrade number of struct commit_graph_data to 64-bits.
> 
> Do you mean "update the generation member in struct commit_graph_data
> to a 64-bit timestamp"? The struct itself also has the 32-bit graph_pos
> member.
> 

Yes, "update the generation number".

> > We have to touch many files, upgrading generation number from uint32_t
> > to timestamp_t.
> 
> Yes, that's why I recommend doing that in a different step.
> 
> > We drop 'detect incorrect generation number' from t5318-commit-graph.sh,
> > which tests if verify can detect if a commit graph have
> > GENERATION_NUMBER_ZERO for a commit, followed by a non-zero generation.
> > With corrected commit dates, GENERATION_NUMBER_ZERO is possible only if
> > one of dates is Unix epoch zero.
> 
> What about the topological levels? Are we caring about verifying the data
> that we start to ignore in this new version? I'm hesitant to drop this
> right now, but I'm open to it if we really don't see it as a valuable test.
> 

We haven't tested the scenario "New Git reads a commit graph without
GDAT chunk" yet. Verifying topological levels (along with many of the
changed offsets) would be a part of the scenario.

Now that I think about it, those tests should have been included with
this patch.

> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> [...]
>
> Later you assign data->generation to be "max_corrected_commit_date + 1",
> which made me think this should be "current->date - 1". Is that so? Or,
> do we want most offsets to be one instead of zero? Is there value there?
> 

Does it? 

I had hoped most of the offsets could have been zero, as we could take
advantage of the fact that commit-slab zero initializes values and avoid
a commit-slab access.

Right, What I meant to do was:

        /*
         * max_parent_corrected_commit_date is initialized with zero and
         * takes the maximum of
         * (parent->item->date + commit_graph_data_at(parent->item)->generation)
        */

        if (max_parent_corrected_commit_date >= current->date)
        {
                struct commit_graph_data *data = commit_graph_data_at(current);
                data->generation = max_parent_corrected_commit_date + 1;
        }

Thanks for pointing this out!

> [...]

- Abhishek

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/6] [GSoC] Implement Corrected Commit Date
  2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
@ 2020-07-30  7:47   ` Abhishek Kumar
  0 siblings, 0 replies; 30+ messages in thread
From: Abhishek Kumar @ 2020-07-30  7:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitgitgadget, abhishekkumar8222

On Tue, Jul 28, 2020 at 10:54:58AM -0400, Taylor Blau wrote:
> Hi Abhishek,
> 
> On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Very exciting. I have been eagerly following your blog and asking
> Stolee about your progress, so I am excited to read these patches.
> 

I am so glad to hear that!

> 
> [...]
> 
> I'm sure that I'll learn more when I get to this point, but I would like
> to hear more about why you want to store the offset rather than the
> corrected commit date itself. It seems that the offset could be either
> positive or negative, so you'd only have the range of a signed integer
> (rather than storing 8 bytes of a time_t for the full breadth of
> possibilities).
> 

Corrected commit dates are at least as big as the committer date, so the
offset (i.e. corrected date - committer date) would never be negative.

We store offsets instead of corrected commit dates because:
- We save 4 bytes for each commit, which amounts to 7-8% of the size of
  commit graph file (of course, dependent on the other chunks used).
- We save some time while writing the commit-graph file too, around
  ~200ms for the Linux repo.

While the savings are modest, writing corrected dates does not offer any
advantage that we could think of, at the time.

> I know also that Peff is working on negative timestamp support, so I
> would want to hear about what he thinks of this, too.

I have read up on Peff's work with negative timestamp support and it's
pretty exciting.

> [...]

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
  2020-07-28 15:28   ` Taylor Blau
@ 2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 2 replies; 30+ messages in thread
From: Jakub Narębski @ 2020-08-04  0:46 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.
>
> Let's fix this by accessing generation number directly through the slab.

It looks like unfortunate and unforeseen consequence of putting together
graph position and generation number in the commit_graph_data struct.
During writing of the commit-graph file generation number is computed,
but graph position is undefined (yet), and commit_graph_generation()
uses graph_pos field to find if the data for commit is initialized;
in this case wrongly.

Anyway, when writing the commit graph we first compute generation
number, then (if requested) the changed-paths Bloom filter.  Skipping
the unnecessary check is a good thing... assuming that commit_gen_cmp()
is used only when writing the commit graph, and not when traversing it
(because then some commits may not have generation number set, and maybe
even do not have any data on the commit slab) - which is the case.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

We might want to add function comment either here or in the header that
this comparisonn function is to be used only for `git commit-graph
write`, and not for graph traversal (even if similar funnction exists in
other modules).

> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
@ 2020-08-04  0:56     ` Taylor Blau
  2020-08-04 10:10       ` Jakub Narębski
  2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2020-08-04  0:56 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
	Abhishek Kumar, Taylor Blau

On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> >
> > Let's fix this by accessing generation number directly through the slab.
>
> It looks like unfortunate and unforeseen consequence of putting together
> graph position and generation number in the commit_graph_data struct.
> During writing of the commit-graph file generation number is computed,
> but graph position is undefined (yet), and commit_graph_generation()
> uses graph_pos field to find if the data for commit is initialized;
> in this case wrongly.
>
> Anyway, when writing the commit graph we first compute generation
> number, then (if requested) the changed-paths Bloom filter.  Skipping
> the unnecessary check is a good thing... assuming that commit_gen_cmp()
> is used only when writing the commit graph, and not when traversing it
> (because then some commits may not have generation number set, and maybe
> even do not have any data on the commit slab) - which is the case.
>
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
>
> We might want to add function comment either here or in the header that
> this comparisonn function is to be used only for `git commit-graph
> write`, and not for graph traversal (even if similar funnction exists in
> other modules).

I think that probably within the function is just fine, and that we can
avoid touching commit-graph.h here.

>
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;

Maybe something like:

  /*
   * Access the generation number directly with
   * 'commit_graph_data_at(...)->generation' instead of going through
   * the slab as usual to avoid accessing a yet-uncomputed value.
   */

Folks that are curious for more can blame this commit and read there.
I'd err on the side of being brief in the code comment and verbose in
the commit message than the other way around ;).

> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:46   ` Jakub Narębski
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04  7:55     ` Jakub Narębski
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Narębski @ 2020-08-04  7:55 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget
  Cc: git, Derrick Stolee, Abhishek Kumar, Taylor Blau

Jakub Narębski <jnareb@gmail.com> writes:

[...]
>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>  	const struct commit *a = *(const struct commit **)va;
>>  	const struct commit *b = *(const struct commit **)vb;
>>
>> -	uint32_t generation_a = commit_graph_generation(a);
>> -	uint32_t generation_b = commit_graph_generation(b);
>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>> +
>>  	/* lower generation commits first */
>>  	if (generation_a < generation_b)
>>  		return -1;

NOTE: One more thing: we would want to check if corrected commit date
(generation number v2) or topological level (generation number v1) is
better for this purpose, that is gives better performance.

The commit 3d11275505 (commit-graph: examine commits by generation
number) which introduced using commit_gen_cmp when writing commit graph
when finding commits via `--reachable` flags describes the following
performance improvement:

    On the Linux kernel repository, this change reduced the computation
    time for 'git commit-graph write --reachable --changed-paths' from
    3m00s to 1m37s.

We would probably want time for no sorting, for sorting by generation
number v2, and for sorting by topological level (generation number v1)
for the same or similar case.

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/6] commit-graph: fix regression when computing bloom filter
  2020-08-04  0:56     ` Taylor Blau
@ 2020-08-04 10:10       ` Jakub Narębski
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Narębski @ 2020-08-04 10:10 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee, Abhishek Kumar

Taylor Blau <me@ttaylorr.com> writes:
> On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index 1af68c297d..5d3c9bd23c 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>
>> We might want to add function comment either here or in the header that
>> this comparisonn function is to be used only for `git commit-graph
>> write`, and not for graph traversal (even if similar funnction exists in
>> other modules).
>
> I think that probably within the function is just fine, and that we can
> avoid touching commit-graph.h here.
>
>>
>>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	const struct commit *a = *(const struct commit **)va;
>>>  	const struct commit *b = *(const struct commit **)vb;
>
> Maybe something like:
>
>   /*
>    * Access the generation number directly with
>    * 'commit_graph_data_at(...)->generation' instead of going through
>    * the slab as usual to avoid accessing a yet-uncomputed value.
>    */

I think the last part of this comment should read:

[...]
     * 'commit_graph_data_at(...)->generation' instead of going through
     * the commit_graph_generation() helper function to access just
     * computed data [during `git commit-graph write --reachable --changed-paths`].
     */

Or something like that (the part in square brackets is optional; I am
not sure if adding it helps or not).

>
> Folks that are curious for more can blame this commit and read there.
> I'd err on the side of being brief in the code comment and verbose in
> the commit message than the other way around ;).

I agree.

>>>
>>> -	uint32_t generation_a = commit_graph_generation(a);
>>> -	uint32_t generation_b = commit_graph_generation(b);
>>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>>> +
>>>  	/* lower generation commits first */
>>>  	if (generation_a < generation_b)
>>>  		return -1;

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] revision: parse parent in indegree_walk_step()
  2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
  2020-07-28 13:00   ` Derrick Stolee
@ 2020-08-05 23:16   ` Jakub Narębski
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Narębski @ 2020-08-05 23:16 UTC (permalink / raw)
  To: Abhishek Kumar via GitGitGadget; +Cc: git, Derrick Stolee, Abhishek Kumar

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  revision.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 6aa7f4f567..23287d26c3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3343,6 +3343,9 @@ static void indegree_walk_step(struct rev_info *revs)
>  		struct commit *parent = p->item;
>  		int *pi = indegree_slab_at(&info->indegree, parent);
>
> +		if (parse_commit_gently(parent, 1) < 0)

All right, parse_commit_gently() avoids re-parsing objects, and makes
use of the commit-graph data.  If parents are not guaranteed to be
parsed, this is a correct thing to do.

Though I do wonder how this issue got missed by the test suite, just
like other reviewers...

> +			return ;

Why this need to be 'return' and not 'continue'?

> +
>  		if (*pi)
>  			(*pi)++;
>  		else

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, back to index

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-07-28 15:28   ` Taylor Blau
2020-07-30  5:24     ` Abhishek Kumar
2020-08-04  0:46   ` Jakub Narębski
2020-08-04  0:56     ` Taylor Blau
2020-08-04 10:10       ` Jakub Narębski
2020-08-04  7:55     ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-07-28 13:00   ` Derrick Stolee
2020-07-28 15:30     ` Taylor Blau
2020-08-05 23:16   ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-07-28 13:14   ` Derrick Stolee
2020-07-28 15:19     ` René Scharfe
2020-07-28 15:58       ` Derrick Stolee
2020-07-28 16:01     ` Taylor Blau
2020-07-30  6:07     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-07-28 16:03   ` Taylor Blau
2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-07-28 16:12   ` Taylor Blau
2020-07-30  6:52     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
2020-07-28 15:55   ` Derrick Stolee
2020-07-28 16:23     ` Taylor Blau
2020-07-30  7:27     ` Abhishek Kumar
2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
2020-07-30  7:47   ` Abhishek Kumar
2020-07-28 16:35 ` Derrick Stolee

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git