git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/6] Compute and consume generation numbers
@ 2018-04-03 16:51 Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
                   ` (9 more replies)
  0 siblings, 10 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. This is a modest speedup, but it
provides an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
  object.c: parse commit in graph first
  commit: add generation number to struct commit
  commit-graph: compute generation numbers
  commit: sort by generation number in paint_down_to_common()
  commit.c: use generation number to stop merge-base walks
  commit-graph.txt: update design doc with generation numbers

 Documentation/technical/commit-graph.txt |  7 +---
 alloc.c                                  |  1 +
 commit-graph.c                           | 48 +++++++++++++++++++++
 commit.c                                 | 53 ++++++++++++++++++++----
 commit.h                                 |  7 +++-
 object.c                                 |  4 +-
 6 files changed, 104 insertions(+), 16 deletions(-)

-- 
2.17.0.20.g9f30ba16e1



* [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:21   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commit Derrick Stolee
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (!parse_commit_in_graph(commit) &&
+			    parse_commit_buffer(commit, buffer, size))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
-- 
2.17.0.rc0



* [PATCH 2/6] commit: add generation number to struct commit
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:05   ` Brandon Williams
  2018-04-03 18:24   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_UNDEF 0xFFFFFFFF
GENERATION_NUMBER_NONE 0

The first (_UNDEF) means the generation number has not been loaded or
computed. The second (_NONE) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.
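
As a rough illustration of the recursive definition above (this is not
code from the series), the sketch below computes generation numbers on a
toy graph. The names toy_commit, toy_has_generation and toy_generation
are hypothetical; only the two sentinel values mirror the macros this
patch adds. Both sentinels are treated as "not yet known".

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
#define GENERATION_NUMBER_NONE 0

/* Hypothetical stand-in for struct commit; not Git's real layout. */
struct toy_commit {
	struct toy_commit **parents;
	size_t nr_parents;
	uint32_t generation;
};

/* Nonzero if the commit already carries a usable generation number. */
static int toy_has_generation(const struct toy_commit *c)
{
	return c->generation != GENERATION_NUMBER_UNDEF &&
	       c->generation != GENERATION_NUMBER_NONE;
}

/*
 * Direct transcription of the recursive definition: one more than the
 * maximum over the parents, and one for a commit with no parents.
 * Recursion depth equals history depth, so this form only suits
 * small examples.
 */
static uint32_t toy_generation(struct toy_commit *c)
{
	uint32_t max_parent = 0;
	size_t i;

	if (toy_has_generation(c))
		return c->generation;

	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = toy_generation(c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}

	c->generation = max_parent + 1;
	return c->generation;
}
```

A root commit ends up with generation 1, and a merge takes its value
from whichever parent sits deepest in history.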

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..1a62e85ac3 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_UNDEF;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..3cadd386f3 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,8 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
 	struct commit *item;
@@ -24,6 +26,7 @@ struct commit {
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.rc0



* [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commit Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:30   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
proportional to the number of new commits added to the commit-graph file.
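
The depth-first strategy can be sketched outside of Git roughly as
follows. This is a simplified illustration under stated assumptions, not
the code in the diff below: toy_commit and toy_compute_generations are
hypothetical names, and a plain array stands in for the commit_list used
as a stack. Commits that already have a generation number are never
pushed, which is why the work tracks the number of new commits.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
#define GENERATION_NUMBER_NONE 0

/* Hypothetical stand-in for struct commit; not Git's real layout. */
struct toy_commit {
	struct toy_commit **parents;
	size_t nr_parents;
	uint32_t generation;
};

static int toy_needs_generation(const struct toy_commit *c)
{
	return c->generation == GENERATION_NUMBER_UNDEF ||
	       c->generation == GENERATION_NUMBER_NONE;
}

/*
 * Number every commit reachable from 'tip' that lacks a generation.
 * Returns how many commits were newly numbered, or -1 if memory
 * could not be allocated.
 */
static long toy_compute_generations(struct toy_commit *tip)
{
	struct toy_commit **stack;
	size_t nr = 0, alloc = 16;
	long computed = 0;

	if (!toy_needs_generation(tip))
		return 0;

	stack = malloc(alloc * sizeof(*stack));
	if (!stack)
		return -1;
	stack[nr++] = tip;

	while (nr) {
		struct toy_commit *current = stack[nr - 1];
		struct toy_commit *pending = NULL;
		uint32_t max_generation = 0;
		size_t i;

		/* Stop at the first parent that still needs a number. */
		for (i = 0; i < current->nr_parents; i++) {
			struct toy_commit *p = current->parents[i];
			if (toy_needs_generation(p)) {
				pending = p;
				break;
			}
			if (p->generation > max_generation)
				max_generation = p->generation;
		}

		if (pending) {
			/* Visit that parent first; 'current' stays below it. */
			if (nr == alloc) {
				struct toy_commit **grown =
					realloc(stack, 2 * alloc * sizeof(*stack));
				if (!grown) {
					free(stack);
					return -1;
				}
				stack = grown;
				alloc *= 2;
			}
			stack[nr++] = pending;
			continue;
		}

		/* All parents are numbered, so this commit can be too. */
		current->generation = max_generation + 1;
		computed++;
		nr--;
	}

	free(stack);
	return computed;
}
```

Calling the function a second time on the same tip returns 0, since
nothing is left to compute.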

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 commit.h       |  1 +
 2 files changed, 47 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..b80c8ad80e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_UNDEF &&
+		    commits[i]->generation != GENERATION_NUMBER_NONE)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_UNDEF ||
+				    parent->item->generation == GENERATION_NUMBER_NONE) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
diff --git a/commit.h b/commit.h
index 3cadd386f3..bc7a3186c5 100644
--- a/commit.h
+++ b/commit.h
@@ -11,6 +11,7 @@
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
-- 
2.17.0.rc0



* [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:31   ` Stefan Beller
  2018-04-03 18:31   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_UNDEF.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.
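
To make the ordering concrete, here is a small standalone sketch that
applies the same two-level comparison to a toy array with qsort(). The
struct and function names are hypothetical, and qsort() stands in for
the priority queue; the sign convention (negative means "a comes
first") is assumed to match.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reduced commit: only the two fields used for ordering. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Higher generation first; among equal generations, newer date first. */
static int toy_compare_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	if (a->generation > b->generation)
		return -1;

	if (a->date < b->date)
		return 1;
	if (a->date > b->date)
		return -1;
	return 0;
}

/* Sort commits into the order in which a walk would inspect them. */
static void toy_sort_for_walk(struct toy_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), toy_compare_gen_then_date);
}
```

With this ordering, a child whose commit date is older than its
parent's date still sorts ahead of that parent, because its generation
number is larger.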

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 19 ++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* newer commits with larger date first */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index bc7a3186c5..cb97b7636a 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.rc0



* [PATCH 5/6] commit.c: use generation to halt paint walk
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 19:01   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.
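
The shape of the constant-time check can be sketched as below. This only
restates the condition described above on a toy array and does not argue
that it is sufficient in every case. The names are hypothetical, the
array is assumed to be ordered with the highest generation at index 0,
and every commit is assumed to have a computed generation number (the
linear fallback kept in the diff is left out).

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF

/*
 * O(1) check: look only at the head of the queue. If its generation is
 * below the smallest generation ever queued as nonstale, the walk is
 * assumed to have nothing nonstale left.
 */
static int toy_queue_has_nonstale(const uint32_t *queue_gens, size_t nr,
				  uint32_t min_gen)
{
	if (!nr)
		return 0;
	return queue_gens[0] >= min_gen;
}

/* Update the running minimum when a commit enters the queue. */
static uint32_t toy_track_min(uint32_t min_nonstale_gen, uint32_t gen,
			      int is_stale)
{
	if (!is_stale && gen < min_nonstale_gen)
		return gen;
	return min_nonstale_gen;
}
```

Starting the minimum at GENERATION_NUMBER_UNDEF means the first nonstale
commit queued always lowers it.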

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..858f4fdbc9 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-	int i;
-	for (i = 0; i < queue->nr; i++) {
-		struct commit *commit = queue->array[i].data;
-		if (!(commit->object.flags & STALE))
-			return 1;
+	if (min_gen != GENERATION_NUMBER_UNDEF) {
+		if (queue->nr > 0) {
+			struct commit *commit = queue->array[0].data;
+			return commit->generation >= min_gen;
+		}
+	} else {
+		int i;
+		for (i = 0; i < queue->nr; i++) {
+			struct commit *commit = queue->array[i].data;
+			if (!(commit->object.flags & STALE))
+				return 1;
+		}
 	}
+
 	return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_UNDEF;
+	uint32_t min_nonstale_gen = GENERATION_NUMBER_UNDEF;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		return result;
 	}
 	prio_queue_put(&queue, one);
+	if (one->generation < min_nonstale_gen)
+		min_nonstale_gen = one->generation;
 
 	for (i = 0; i < n; i++) {
 		twos[i]->object.flags |= PARENT2;
 		prio_queue_put(&queue, twos[i]);
+		if (twos[i]->generation < min_nonstale_gen)
+			min_nonstale_gen = twos[i]->generation;
 	}
 
-	while (queue_has_nonstale(&queue)) {
+	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+
+		last_gen = commit->generation;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 				return NULL;
 			p->object.flags |= flags;
 			prio_queue_put(&queue, p);
+
+			if (!(flags & STALE) &&
+			    p->generation < min_nonstale_gen)
+				min_nonstale_gen = p->generation;
 		}
 	}
 
-- 
2.17.0.rc0



* [PATCH 6/6] commit-graph.txt: update future work
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 19:04   ` Jonathan Tan
  2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..be68bee43d 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -98,17 +98,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.rc0



* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-03 16:56 ` Derrick Stolee
  2018-04-03 18:03 ` Brandon Williams
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:56 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: avarab, sbeller, larsxschneider, peff

On 4/3/2018 12:51 PM, Derrick Stolee wrote:
> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a constant-time check in
> queue_has_nonstale() instead of the previous linear-time check.
>
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.
>
> This patch series depends on v7 of ds/commit-graph.
>
> Derrick Stolee (6):
>    object.c: parse commit in graph first
>    commit: add generation number to struct commit
>    commit-graph: compute generation numbers
>    commit: sort by generation number in paint_down_to_common()
>    commit.c: use generation number to stop merge-base walks
>    commit-graph.txt: update design doc with generation numbers

This patch series is also available as a GitHub pull request [1].

[1] https://github.com/derrickstolee/git/pull/5


* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
@ 2018-04-03 18:03 ` Brandon Williams
  2018-04-03 18:29   ` Derrick Stolee
  2018-04-07 16:55 ` Jakub Narebski
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  9 siblings, 1 reply; 162+ messages in thread
From: Brandon Williams @ 2018-04-03 18:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On 04/03, Derrick Stolee wrote:
> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
> 
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).

Thanks for ensuring that this is defined and documented somewhere :)

> 
> This series makes the computation of generation numbers part of the
> commit-graph write process.
> 
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a constant-time check in
> queue_has_nonstale() instead of the previous linear-time check.
> 
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
> 
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

log --graph should benefit a lot more from this, correct?  I know we've
talked a bit about negotiation, and I wonder if these generation numbers
could help out a little bit with that some day.

> 
> This patch series depends on v7 of ds/commit-graph.
> 
> Derrick Stolee (6):
>   object.c: parse commit in graph first
>   commit: add generation number to struct commit
>   commit-graph: compute generation numbers
>   commit: sort by generation number in paint_down_to_common()
>   commit.c: use generation number to stop merge-base walks
>   commit-graph.txt: update design doc with generation numbers
> 
>  Documentation/technical/commit-graph.txt |  7 +---
>  alloc.c                                  |  1 +
>  commit-graph.c                           | 48 +++++++++++++++++++++
>  commit.c                                 | 53 ++++++++++++++++++++----
>  commit.h                                 |  7 +++-
>  object.c                                 |  4 +-
>  6 files changed, 104 insertions(+), 16 deletions(-)
> 
> -- 
> 2.17.0.20.g9f30ba16e1
> 

-- 
Brandon Williams


* Re: [PATCH 2/6] commit: add generation number to struct commit
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commit Derrick Stolee
@ 2018-04-03 18:05   ` Brandon Williams
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:24   ` Jonathan Tan
  1 sibling, 1 reply; 162+ messages in thread
From: Brandon Williams @ 2018-04-03 18:05 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On 04/03, Derrick Stolee wrote:
> The generation number of a commit is defined recursively as follows:
> 
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
> 
> Add a uint32_t generation field to struct commit so we can pass this

Is there any reason to believe this would be too small of a value in the
future?  Or is a 32 bit unsigned good enough?

> information to revision walks. We use two special values to signal
> the generation number is invalid:
> 
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
> 
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  alloc.c        | 1 +
>  commit-graph.c | 2 ++
>  commit.h       | 3 +++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/alloc.c b/alloc.c
> index cf4f8b61e1..1a62e85ac3 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -94,6 +94,7 @@ void *alloc_commit_node(void)
>  	c->object.type = OBJ_COMMIT;
>  	c->index = alloc_commit_index();
>  	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
> +	c->generation = GENERATION_NUMBER_UNDEF;
>  	return c;
>  }
>  
> diff --git a/commit-graph.c b/commit-graph.c
> index 1fc63d541b..d24b947525 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/commit.h b/commit.h
> index e57ae4b583..3cadd386f3 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -10,6 +10,8 @@
>  #include "pretty.h"
>  
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0
>  
>  struct commit_list {
>  	struct commit *item;
> @@ -24,6 +26,7 @@ struct commit {
>  	struct commit_list *parents;
>  	struct tree *tree;
>  	uint32_t graph_pos;
> +	uint32_t generation;
>  };
>  
>  extern int save_commit_buffer;
> -- 
> 2.17.0.rc0
> 

-- 
Brandon Williams


* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-03 18:21   ` Jonathan Tan
  2018-04-03 18:28     ` Jeff King
  0 siblings, 1 reply; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:38 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
> 
> Before adding generation numbers to the commit-graph, we need to ensure
> that any commit that exists in the graph is loaded from the graph, so
> check parse_commit_in_graph() before calling parse_commit_buffer().
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Modifying parse_object_buffer() is the most pragmatic way to accomplish
this, but this also means that parse_object_buffer() now potentially
reads from the local object store (instead of only relying on what's in
memory and what's in the provided buffer). parse_object_buffer() is
called by several callers including in builtin/fsck.c. I would feel more
comfortable if the relevant [1] caller to parse_object_buffer() was
modified instead of parse_object_buffer(), but I'll let others give
their opinions too.

[1] The caller which, if modified, will result in the speedup to
the merge-base calculations in the Windows repository you describe in
your cover letter.


* Re: [PATCH 2/6] commit: add generation number to struct commit
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commit Derrick Stolee
  2018-04-03 18:05   ` Brandon Williams
@ 2018-04-03 18:24   ` Jonathan Tan
  1 sibling, 0 replies; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:39 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> The generation number of a commit is defined recursively as follows:
> 
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
> 
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
> 
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
> 
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

This looks straightforward and correct, thanks. I think some of the
description above should appear as code comments.

> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0

I would include the description above here as documentation, and would
replace "was stored before generation numbers were computed" by "was
written by a version of Git that did not support generation numbers".
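
As a sketch of what that documentation could look like (the comment
wording is only a suggestion, and generation_is_valid() is a hypothetical
helper added for illustration, not part of the patch):

```c
#include <stdint.h>

/*
 * The generation number of this commit has not been loaded or
 * computed yet.
 */
#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF

/*
 * The commit was loaded from a commit-graph file that was written by a
 * version of Git that did not support generation numbers.
 */
#define GENERATION_NUMBER_NONE 0

/* Hypothetical helper: does gen hold a real generation number? */
static int generation_is_valid(uint32_t gen)
{
	return gen != GENERATION_NUMBER_UNDEF &&
	       gen != GENERATION_NUMBER_NONE;
}
```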


* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:05   ` Brandon Williams
@ 2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
                         ` (3 more replies)
  0 siblings, 4 replies; 162+ messages in thread
From: Jeff King @ 2018-04-03 18:28 UTC (permalink / raw)
  To: Brandon Williams; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:

> On 04/03, Derrick Stolee wrote:
> > The generation number of a commit is defined recursively as follows:
> > 
> > * If a commit A has no parents, then the generation number of A is one.
> > * If a commit A has parents, then the generation number of A is one
> >   more than the maximum generation number among the parents of A.
> > 
> > Add a uint32_t generation field to struct commit so we can pass this
> 
> Is there any reason to believe this would be too small of a value in the
> future?  Or is a 32 bit unsigned good enough?

The linux kernel took ~10 years to produce 500k commits. Even assuming
those were all linear (and they're not), that gives us ~80,000 years of
leeway. So even if the pace of development speeds up or we have a
quicker project, it still seems we have a pretty reasonable safety
margin.

-Peff


* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 18:21   ` Jonathan Tan
@ 2018-04-03 18:28     ` Jeff King
  2018-04-03 18:32       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-03 18:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:

> On Tue,  3 Apr 2018 12:51:38 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
> 
> > Most code paths load commits using lookup_commit() and then
> > parse_commit(). In some cases, including some branch lookups, the commit
> > is parsed using parse_object_buffer() which side-steps parse_commit() in
> > favor of parse_commit_buffer().
> > 
> > Before adding generation numbers to the commit-graph, we need to ensure
> > that any commit that exists in the graph is loaded from the graph, so
> > check parse_commit_in_graph() before calling parse_commit_buffer().
> > 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> 
> Modifying parse_object_buffer() is the most pragmatic way to accomplish
> this, but this also means that parse_object_buffer() now potentially
> reads from the local object store (instead of only relying on what's in
> memory and what's in the provided buffer). parse_object_buffer() is
> called by several callers including in builtin/fsck.c. I would feel more
> comfortable if the relevant [1] caller to parse_object_buffer() was
> modified instead of parse_object_buffer(), but I'll let others give
> their opinions too.

It's not just you. This seems like a really odd place to put it.
Especially because if we have the buffer to pass to this function, then
we'd already have incurred the cost to inflate the object.

-Peff


* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:03 ` Brandon Williams
@ 2018-04-03 18:29   ` Derrick Stolee
  2018-04-03 18:47     ` Jeff King
  2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
  0 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:29 UTC (permalink / raw)
  To: Brandon Williams, Derrick Stolee
  Cc: git, avarab, sbeller, larsxschneider, peff

On 4/3/2018 2:03 PM, Brandon Williams wrote:
> On 04/03, Derrick Stolee wrote:
>> This is the first of several "small" patches that follow the serialized
>> Git commit graph patch (ds/commit-graph).
>>
>> As described in Documentation/technical/commit-graph.txt, the generation
>> number of a commit is one more than the maximum generation number among
>> its parents (trivially, a commit with no parents has generation number
>> one).
> Thanks for ensuring that this is defined and documented somewhere :)
>
>> This series makes the computation of generation numbers part of the
>> commit-graph write process.
>>
>> Finally, generation numbers are used to order commits in the priority
>> queue in paint_down_to_common(). This allows a constant-time check in
>> queue_has_nonstale() instead of the previous linear-time check.
>>
>> This does not have a significant performance benefit in repositories
>> of normal size, but in the Windows repository, some merge-base
>> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
>> an actual consumer of generation numbers as a starting point.
>>
>> A more substantial refactoring of revision.c is required before making
>> 'git log --graph' use generation numbers effectively.
> log --graph should benefit a lot more from this, correct?  I know we've
> talked a bit about negotiation and I wonder if these generation numbers
> should be able to help out a little bit with that some day.

'log --graph' should be a HUGE speedup, when it is refactored. Since the 
topo-order can "stream" commits to the pager, it can be very responsive 
to return the graph in almost all conditions. (The case where generation 
numbers are not enough is when filters reduce the set of displayed 
commits to be very sparse, so many commits are walked anyway.)

If we have generic "can X reach Y?" queries, then we can also use 
generation numbers there to great effect (by not walking commits Z with 
gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" 
thread for ideas.
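
A minimal sketch of that cutoff, using a toy commit type rather than
Git's struct commit, and assuming every commit already has a valid
generation number. The names toy_commit, set_generation() and
can_reach() are hypothetical and exist only for this illustration:

```c
#include <stdint.h>

#define MAX_PARENTS 2

/* Toy commit for illustration; not Git's struct commit. */
struct toy_commit {
	uint32_t generation;
	int nr_parents;
	struct toy_commit *parents[MAX_PARENTS];
};

/* Compute c's generation; its parents must already be computed. */
static void set_generation(struct toy_commit *c)
{
	uint32_t max = 0;
	int i;

	for (i = 0; i < c->nr_parents; i++)
		if (c->parents[i]->generation > max)
			max = c->parents[i]->generation;
	c->generation = max + 1;
}

/*
 * Can x reach y by following parent links?  Every proper ancestor of a
 * commit has a strictly smaller generation number, so a commit z != y
 * with gen(z) <= gen(y) cannot reach y and is never walked.
 *
 * This recursion keeps no "seen" cache, so it is only suitable for a
 * small example.
 */
static int can_reach(const struct toy_commit *x, const struct toy_commit *y)
{
	int i;

	if (x == y)
		return 1;
	if (x->generation <= y->generation)
		return 0;
	for (i = 0; i < x->nr_parents; i++)
		if (can_reach(x->parents[i], y))
			return 1;
	return 0;
}
```

In a diamond-shaped history (root, two children, one merge), asking
whether one child reaches the other returns immediately, because both
have generation 2.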

For negotiation, there are some things we can do here. VSTS uses 
generation numbers as a heuristic for determining "all wants connected 
to haves" which is a condition for halting negotiation. The idea is very 
simple, and I'd be happy to discuss it on a separate thread.

Thanks,
-Stolee


* Re: [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-03 18:30   ` Jonathan Tan
  2018-04-03 18:49     ` Stefan Beller
  0 siblings, 1 reply; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:40 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> +		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
> +				die("generation number %u is too large to store in commit-graph",
> +				    (*list)->generation);
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +		}

The die() should have "BUG:" if you agree with my comment below.

> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)

Style: space before **, not after.

> +			if (all_parents_computed) {
> +				current->generation = max_generation + 1;
> +				pop_commit(&list);
> +			}

I think the current->generation should be clamped to _MAX here. If we do, then
the die() I mentioned in my first comment will have "BUG:", since we are never
meant to write any number larger than _MAX in ->generation.
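
A sketch of the clamping, written as a standalone helper so it can be
read in isolation. clamped_generation() is a hypothetical name; in the
patch the assignment happens inline in compute_generation_numbers():

```c
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical helper: the generation number for a commit whose parents
 * have maximum generation max_parent_generation (0 if it has no
 * parents), never exceeding GENERATION_NUMBER_MAX.
 */
static uint32_t clamped_generation(uint32_t max_parent_generation)
{
	if (max_parent_generation >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_parent_generation + 1;
}
```

With this in place, a value above _MAX at write time can only come from
a programming error.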


* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
@ 2018-04-03 18:31       ` Derrick Stolee
  2018-04-03 18:32       ` Brandon Williams
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Jeff King, Brandon Williams
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>    more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

That, and larger projects do not have linear histories. Despite having 
almost 2 million reachable commits, the Windows repository has maximum 
generation number ~100,000.

Thanks,
-Stolee


* Re: [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-03 18:31   ` Stefan Beller
  2018-04-03 18:31   ` Jonathan Tan
  1 sibling, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Lars Schneider, Jeff King

On Tue, Apr 3, 2018 at 9:51 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_UNDEF.
>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.


This patch (or later in this series) may want to touch
Documentation/technical/commit-graph.txt, that mentions this in
the section of Future Work:

- After computing and storing generation numbers, we must make graph
  walks aware of generation numbers to gain the performance benefits they
  enable. This will mostly be accomplished by swapping a commit-date-ordered
  priority queue with one ordered by generation number. The following
  operations are important candidates:

    - paint_down_to_common()
    - 'log --topo-order'

paint_down_to_common() is only internal and is not exposed to the user
for ordering, i.e. the topological ordering still keeps the commits of
a branch adjacent?

Thanks,
Stefan


* Re: [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
  2018-04-03 18:31   ` Stefan Beller
@ 2018-04-03 18:31   ` Jonathan Tan
  1 sibling, 0 replies; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:41 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;
> +
> +	/* newer commits with larger date first */
> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;
> +}

I think it would be clearer if you commented above the first block
"newer commits first", then on the second block, "use date as a
heuristic to determine newer commit".
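
With those comments applied, the function would read roughly as below.
The struct commit here is a minimal stand-in holding only the two fields
the comparison uses, and the date type is an assumption made for this
sketch:

```c
#include <stdint.h>

/* Minimal stand-in for Git's struct commit (assumption for this sketch). */
struct commit {
	uint32_t generation;
	uint64_t date;
};

int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_,
					    void *unused)
{
	const struct commit *a = a_, *b = b_;

	(void)unused;

	/* newer commits first */
	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* use date as a heuristic to determine newer commit */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```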

Other than that, this looks good.


* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
@ 2018-04-03 18:32       ` Brandon Williams
  2018-04-03 18:44       ` Stefan Beller
  2018-04-03 23:17       ` Ramsay Jones
  3 siblings, 0 replies; 162+ messages in thread
From: Brandon Williams @ 2018-04-03 18:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 04/03, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
> 
> > On 04/03, Derrick Stolee wrote:
> > > The generation number of a commit is defined recursively as follows:
> > > 
> > > * If a commit A has no parents, then the generation number of A is one.
> > > * If a commit A has parents, then the generation number of A is one
> > >   more than the maximum generation number among the parents of A.
> > > 
> > > Add a uint32_t generation field to struct commit so we can pass this
> > 
> > Is there any reason to believe this would be too small of a value in the
> > future?  Or is a 32 bit unsigned good enough?
> 
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.
> 
> -Peff

I figured as much, but just wanted to check since the Windows folks
seem to produce commits pretty quickly.

-- 
Brandon Williams


* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 18:28     ` Jeff King
@ 2018-04-03 18:32       ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:32 UTC (permalink / raw)
  To: Jeff King, Jonathan Tan
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider



On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:
>
>> On Tue,  3 Apr 2018 12:51:38 -0400
>> Derrick Stolee <dstolee@microsoft.com> wrote:
>>
>>> Most code paths load commits using lookup_commit() and then
>>> parse_commit(). In some cases, including some branch lookups, the commit
>>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>>> favor of parse_commit_buffer().
>>>
>>> Before adding generation numbers to the commit-graph, we need to ensure
>>> that any commit that exists in the graph is loaded from the graph, so
>>> check parse_commit_in_graph() before calling parse_commit_buffer().
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> Modifying parse_object_buffer() is the most pragmatic way to accomplish
>> this, but this also means that parse_object_buffer() now potentially
>> reads from the local object store (instead of only relying on what's in
>> memory and what's in the provided buffer). parse_object_buffer() is
>> called by several callers including in builtin/fsck.c. I would feel more
>> comfortable if the relevant [1] caller to parse_object_buffer() was
>> modified instead of parse_object_buffer(), but I'll let others give
>> their opinions too.
> It's not just you. This seems like a really odd place to put it.
> Especially because if we have the buffer to pass to this function, then
> we'd already have incurred the cost to inflate the object.
>

OK. Thanks. I'll try to find the better place to put this check.

-Stolee


* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
  2018-04-03 18:32       ` Brandon Williams
@ 2018-04-03 18:44       ` Stefan Beller
  2018-04-03 23:17       ` Ramsay Jones
  3 siblings, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-03 18:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Brandon Williams, Derrick Stolee, git,
	Ævar Arnfjörð Bjarmason, Lars Schneider

On Tue, Apr 3, 2018 at 11:28 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>> > The generation number of a commit is defined recursively as follows:
>> >
>> > * If a commit A has no parents, then the generation number of A is one.
>> > * If a commit A has parents, then the generation number of A is one
>> >   more than the maximum generation number among the parents of A.
>> >
>> > Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
>
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not),

... which you meant in terms of DAG, where a linear history is the worst case
for generation numbers.

I first read it the other way round, as the best case w.r.t. timing.

~/linux$ git log --oneline |wc -l
721223
$ git log --oneline --since 2012 |wc -l
421853
$ git log --oneline --since 2011 |wc -l
477155

The number of commits is growing exponentially, though the exponential
part is very small and the YoY growth can be estimated using linear
interpolation.

In linux, the release is a natural synchronization point IIUC as well
as on a regular schedule. So an interesting question to ask there would
be whether the delta in generation number goes up over time, or if the
DAG just gets wider (=more parallel)

> that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

Thanks for the estimate.
Stefan


* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:29   ` Derrick Stolee
@ 2018-04-03 18:47     ` Jeff King
  2018-04-03 19:05       ` Jeff King
  2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
  1 sibling, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-03 18:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:

> If we have generic "can X reach Y?" queries, then we can also use generation
> numbers there to great effect (by not walking commits Z with gen(Z) <=
> gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> ideas.

I think the gist of it is the patch below. Which I hastily adapted from
the patch we run at GitHub that uses timestamps as a proxy. So it's
possible I completely flubbed the logic. I'm assuming unavailable
generation numbers are set to 0; the logic is actually a bit simpler if
they end up as (uint32_t)-1.

Assuming it works, that would cover for-each-ref and tag. You'd probably
want to drop the "with_commit_tag_algo" flag in ref-filter.h, and just
always use it by default (and that would cover "git branch").

---
diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..6bea6173d1 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation && candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = -1;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation && c->generation < cutoff)
+			cutoff = c->generation;
+	}
+	if (cutoff == -1)
+		cutoff = 0;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,


* Re: [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 18:30   ` Jonathan Tan
@ 2018-04-03 18:49     ` Stefan Beller
  0 siblings, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-03 18:49 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Lars Schneider, Jeff King

On Tue, Apr 3, 2018 at 11:30 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Tue,  3 Apr 2018 12:51:40 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
>
>> +             if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
>> +                     if ((*list)->generation > GENERATION_NUMBER_MAX)
>> +                             die("generation number %u is too large to store in commit-graph",
>> +                                 (*list)->generation);
>> +                     packedDate[0] |= htonl((*list)->generation << 2);
>> +             }
>
> The die() should have "BUG:" if you agree with my comment below.

I would remove the BUG/die() altogether and keep going.
(But do not write it out, i.e. warn and skip the next line)

A degraded commit graph with partial generation numbers is better
than Git refusing to write any part of the commit graph (which later on
will be part of many maintenance operations I would think, leading to
more immediate headache rather than "working but slightly slower")

>
>> +static void compute_generation_numbers(struct commit** commits,
>> +                                    int nr_commits)
>
> Style: space before **, not after.
>
>> +                     if (all_parents_computed) {
>> +                             current->generation = max_generation + 1;
>> +                             pop_commit(&list);
>> +                     }
>
> I think the current->generation should be clamped to _MAX here. If we do, then
> the die() I mentioned in my first comment will have "BUG:", since we are never
> meant to write any number larger than _MAX in ->generation.

When we clamp here, we'd have to treat _MAX specially in all our use
cases, or we'd encounter funny bugs due to misordered commits later?


* Re: [PATCH 5/6] commit.c: use generation to halt paint walk
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-03 19:01   ` Jonathan Tan
  0 siblings, 0 replies; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 19:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:42 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> -static int queue_has_nonstale(struct prio_queue *queue)
> +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
>  {
> -	int i;
> -	for (i = 0; i < queue->nr; i++) {
> -		struct commit *commit = queue->array[i].data;
> -		if (!(commit->object.flags & STALE))
> -			return 1;
> +	if (min_gen != GENERATION_NUMBER_UNDEF) {
> +		if (queue->nr > 0) {
> +			struct commit *commit = queue->array[0].data;
> +			return commit->generation >= min_gen;
> +		}

This only works if the prio_queue has
compare_commits_by_gen_then_commit_date. Also, I don't think that the
min_gen != GENERATION_NUMBER_UNDEF check is necessary. So I would write
this as:

  if (queue->compare == compare_commits_by_gen_then_commit_date &&
      queue->nr) {
    struct commit *commit = queue->array[0].data;
    return commit->generation >= min_gen;
  }
  for (i = 0 ...

If you'd rather not perform the comparison to
compare_commits_by_gen_then_commit_date every time you invoke
queue_has_nonstale(), that's fine with me too, but document somewhere
that queue_has_nonstale() only works if this comparison function is
used.
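
Spelled out in full, the suggested shape could look like the sketch
below. The struct definitions are cut-down stand-ins for Git's types,
the STALE value is chosen only for this example, and the final
"return 0" is an assumption about how the function ends:

```c
#include <stdint.h>

#define STALE (1u << 18) /* value used only for this sketch */

/* Cut-down stand-ins for Git's types (assumptions for this sketch). */
struct object { unsigned flags; };
struct commit {
	struct object object;
	uint32_t generation;
	uint64_t date;
};

typedef int (*prio_queue_compare_fn)(const void *, const void *, void *);
struct prio_queue_entry { void *data; };
struct prio_queue {
	prio_queue_compare_fn compare;
	int nr;
	struct prio_queue_entry *array;
};

int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_,
					    void *unused)
{
	const struct commit *a = a_, *b = b_;

	(void)unused;
	if (a->generation != b->generation)
		return a->generation < b->generation ? 1 : -1;
	if (a->date != b->date)
		return a->date < b->date ? 1 : -1;
	return 0;
}

static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
{
	int i;

	/* Fast path: valid only when the queue is ordered by generation. */
	if (queue->compare == compare_commits_by_gen_then_commit_date &&
	    queue->nr) {
		struct commit *commit = queue->array[0].data;
		return commit->generation >= min_gen;
	}

	/* Fallback: linear scan for any commit not marked STALE. */
	for (i = 0; i < queue->nr; i++) {
		struct commit *commit = queue->array[i].data;
		if (!(commit->object.flags & STALE))
			return 1;
	}
	return 0;
}
```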

> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +
> +		last_gen = commit->generation;

last_gen seems to only be used to ensure that the priority queue returns
elements in the correct order - I think we can generally trust the
queue, and if we need to test it, we can do it elsewhere.


* Re: [PATCH 6/6] commit-graph.txt: update future work
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-03 19:04   ` Jonathan Tan
  0 siblings, 0 replies; 162+ messages in thread
From: Jonathan Tan @ 2018-04-03 19:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:43 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().

For completeness, I'll mention that I don't see any issues with this
patch, of course.

Thanks for this series.


* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:47     ` Jeff King
@ 2018-04-03 19:05       ` Jeff King
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-03 19:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 02:47:27PM -0400, Jeff King wrote:

> On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:
> 
> > If we have generic "can X reach Y?" queries, then we can also use generation
> > numbers there to great effect (by not walking commits Z with gen(Z) <=
> > gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> > ideas.
> 
> I think the gist of it is the patch below. Which I hastily adapted from
> the patch we run at GitHub that uses timestamps as a proxy. So it's
> possible I completely flubbed the logic. I'm assuming unavailable
> generation numbers are set to 0; the logic is actually a bit simpler if
> they end up as (uint32_t)-1.

Oh indeed, that is already the value of your UNDEF. So the patch is more
like this:

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
+	if (cutoff == GENERATION_NUMBER_UNDEF)
+		cutoff = GENERATION_NUMBER_NONE;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,


* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
                         ` (2 preceding siblings ...)
  2018-04-03 18:44       ` Stefan Beller
@ 2018-04-03 23:17       ` Ramsay Jones
  2018-04-03 23:19         ` Jeff King
  3 siblings, 1 reply; 162+ messages in thread
From: Ramsay Jones @ 2018-04-03 23:17 UTC (permalink / raw)
  To: Jeff King, Brandon Williams
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider



On 03/04/18 19:28, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
> 
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>   more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
> 
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

I didn't read the patches closely, but isn't it ~20,000 years?

Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)
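
As a quick check of the arithmetic, here is a minimal sketch. It assumes
the rough rate quoted above (~500k commits in ~10 years, so 50,000 per
year) and a strictly linear history; years_of_leeway() is a made-up
helper, not anything in git.

```c
#include <assert.h>
#include <stdint.h>

/* Largest generation number, as quoted from the patch in this mail. */
#define GENERATION_NUMBER_MAX 0x3FFFFFFFu

/*
 * Hypothetical helper: how many whole years a strictly linear history
 * can grow at `commits_per_year` before reaching `limit` generations.
 */
uint64_t years_of_leeway(uint64_t limit, uint64_t commits_per_year)
{
	assert(commits_per_year > 0);
	return limit / commits_per_year;
}
```

Under those assumptions, years_of_leeway(UINT32_MAX, 50000) is 85899 and
years_of_leeway(GENERATION_NUMBER_MAX, 50000) is 21474, which is where
the "~80,000" and "~20,000" figures come from.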

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 23:17       ` Ramsay Jones
@ 2018-04-03 23:19         ` Jeff King
  0 siblings, 0 replies; 162+ messages in thread
From: Jeff King @ 2018-04-03 23:19 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Wed, Apr 04, 2018 at 12:17:06AM +0100, Ramsay Jones wrote:

> >> Is there any reason to believe this would be too small of a value in the
> >> future?  Or is a 32 bit unsigned good enough?
> > 
> > The linux kernel took ~10 years to produce 500k commits. Even assuming
> > those were all linear (and they're not), that gives us ~80,000 years of
> > leeway. So even if the pace of development speeds up or we have a
> > quicker project, it still seems we have a pretty reasonable safety
> > margin.
> 
> I didn't read the patches closely, but isn't it ~20,000 years?
> 
> Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)

What, I'm supposed to read the patches before responding? Heresy.

-Peff

^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-03 19:05       ` Jeff King
@ 2018-04-04 15:45         ` Derrick Stolee
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
  0 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
lets us short-circuit many commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following performance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
+	if (cutoff == GENERATION_NUMBER_UNDEF)
+		cutoff = GENERATION_NUMBER_NONE;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.rc0
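
The cutoff idea in this patch can be illustrated on a toy history. This
is only a sketch: struct toy_commit, compute_generation() and
toy_contains() are invented for illustration and do not match git's
real structures, and the patch itself works with an explicit stack and
a cache rather than the recursion used here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 2

/* Toy stand-in for a commit; not git's struct commit. */
struct toy_commit {
	struct toy_commit *parents[MAX_PARENTS];
	size_t nr_parents;
	uint32_t generation;	/* 0 until computed */
	int visited;
};

/* generation = 1 + max(parent generations); a root commit gets 1. */
uint32_t compute_generation(struct toy_commit *c)
{
	uint32_t max_parent = 0;
	size_t i;

	if (c->generation)
		return c->generation;
	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = compute_generation(c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}
	c->generation = max_parent + 1;
	return c->generation;
}

/*
 * Return 1 if `want` is reachable from `from`. A commit whose
 * generation is below want's cannot reach it, so the walk stops there.
 * `*walked` counts the commits whose parents were actually examined.
 */
int toy_contains(struct toy_commit *from, const struct toy_commit *want,
		 unsigned *walked)
{
	size_t i;

	assert(from && want && walked);
	if (from == want)
		return 1;
	if (from->generation < want->generation)
		return 0;	/* generation cutoff: dead end */
	if (from->visited)
		return 0;	/* already explored without success */
	from->visited = 1;
	(*walked)++;
	for (i = 0; i < from->nr_parents; i++)
		if (toy_contains(from->parents[i], want, walked))
			return 1;
	return 0;
}
```

On a linear history a <- b <- c <- d with want = c, asking about the
old ref b returns "no" without examining any parents, because b's
generation (2) is below c's (3).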


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-04 15:45           ` Derrick Stolee
  2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
  1 sibling, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the reference commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 858f4fdbc9..2566cba79f 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.rc0
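
The early exit can be sketched in isolation. One caveat: this sketch
assumes the function answers "is commit an ancestor of at least one
reference", and under that assumption commit can only be ruled out when
its generation exceeds every reference's generation. The sketch
therefore compares against the maximum, whereas the patch tracks the
minimum; with a single reference, as in in_merge_bases(), the two
coincide. can_skip_merge_base_walk() is a made-up name, and the value
of GENERATION_NUMBER_UNDEF is assumed, not taken from this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for "commit is not in the commit-graph file". */
#define GENERATION_NUMBER_UNDEF 0xFFFFFFFFu

/*
 * Hypothetical helper, not git API. Returns 1 when `commit_gen` is
 * larger than every reference generation, i.e. when no reference can
 * have the commit as an ancestor and the merge-base walk is pointless.
 * A reference at GENERATION_NUMBER_UNDEF always prevents the skip.
 */
int can_skip_merge_base_walk(uint32_t commit_gen,
			     const uint32_t *ref_gens, size_t nr)
{
	uint32_t max_generation = 0;
	size_t i;

	assert(ref_gens != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (ref_gens[i] > max_generation)
			max_generation = ref_gens[i];

	return commit_gen > max_generation;
}
```

For references at generations 5 and 9, a commit at generation 7 must
still be walked (it may be an ancestor of the second reference), while
a commit at generation 10 can be rejected immediately.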


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 17:01               ` Brandon Williams
  2018-04-04 18:24               ` Jeff King
  0 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:48 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 11:45 AM, Derrick Stolee wrote:
> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the reference commits.
>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:
>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%
>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   commit.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 858f4fdbc9..2566cba79f 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>   {
>   	struct commit_list *bases;
>   	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
>   
>   	if (parse_commit(commit))
>   		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>   		if (parse_commit(reference[i]))
>   			return ret;
> +		if (min_generation > reference[i]->generation)
> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return 0;
>   
>   	bases = paint_down_to_common(commit, nr_reference, reference);
>   	if (commit->object.flags & PARENT2)

This patch may suffice to speed up 'git branch --contains' instead of 
needing to always use the 'git tag --contains' algorithm as considered 
in [1].

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180303051516.GE27689@sigill.intra.peff.net/
     Re: [PATCH 0/4] Speed up git tag --contains

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:48             ` Derrick Stolee
@ 2018-04-04 17:01               ` Brandon Williams
  2018-04-04 18:24               ` Jeff King
  1 sibling, 0 replies; 162+ messages in thread
From: Brandon Williams @ 2018-04-04 17:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider

On 04/04, Derrick Stolee wrote:
> On 4/4/2018 11:45 AM, Derrick Stolee wrote:
> > The containment algorithm for 'git branch --contains' is different
> > from that for 'git tag --contains' in that it uses is_descendant_of()
> > instead of contains_tag_algo(). The expensive portion of the branch
> > algorithm is computing merge bases.
> > 
> > When a commit-graph file exists with generation numbers computed,
> > we can avoid this merge-base calculation when the target commit has
> > a larger generation number than the reference commits.
> > 
> > Performance tests were run on a copy of the Linux repository where
> > HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> > copied to branches and 'git branch --contains' was tested:
> > 
> > Before: 60.0s
> > After:   0.4s
> > Rel %: -99.3%

Now that is an impressive speedup.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-04 18:22           ` Jeff King
  2018-04-04 19:06             ` Derrick Stolee
  1 sibling, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-04 18:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote:

> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		parse_commit_or_die(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}
> +	if (cutoff == GENERATION_NUMBER_UNDEF)
> +		cutoff = GENERATION_NUMBER_NONE;

Hmm, on reflection, I'm not sure if this is right in the face of
multiple "want" commits, only some of which have generation numbers.  We
probably want to disable the cutoff if _any_ "want" commit doesn't have
a number.

There's also an obvious corner case where this won't kick in, and you'd
really like it to: recently added commits. E.g., if I do this:

  git gc ;# imagine this writes generation numbers
  git pull
  git tag --contains HEAD

then HEAD isn't going to have a generation number. But this is the case
where we have the most to gain, since we could throw away all of the
ancient tags immediately upon seeing that their generation numbers are
way less than that of HEAD.

I wonder to what degree it's worth traversing to come up with a
generation number for the "want" commits. If we walked, say, 50 commits
to do it, you'd probably save a lot of work (since the alternative is
walking thousands of commits until you realize that some ancient "v1.0"
tag is not useful).

I'd actually go so far as to say that any amount of traversal is
generally going to be worth it to come up with the correct generation
cutoff here. You can come up with pathological cases where you only have
one really recent tag or something, but in practice every repository
where performance is a concern is going to end up with refs much further
back than it would take to reach the cutoff condition.

-Peff

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 17:01               ` Brandon Williams
@ 2018-04-04 18:24               ` Jeff King
  2018-04-04 18:53                 ` Derrick Stolee
  1 sibling, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-04 18:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote:

> > diff --git a/commit.c b/commit.c
> > index 858f4fdbc9..2566cba79f 100644
> > --- a/commit.c
> > +++ b/commit.c
> > @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
> >   {
> >   	struct commit_list *bases;
> >   	int ret = 0, i;
> > +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
> >   	if (parse_commit(commit))
> >   		return ret;
> > -	for (i = 0; i < nr_reference; i++)
> > +	for (i = 0; i < nr_reference; i++) {
> >   		if (parse_commit(reference[i]))
> >   			return ret;
> > +		if (min_generation > reference[i]->generation)
> > +			min_generation = reference[i]->generation;
> > +	}
> > +
> > +	if (commit->generation > min_generation)
> > +		return 0;
> >   	bases = paint_down_to_common(commit, nr_reference, reference);
> >   	if (commit->object.flags & PARENT2)
> 
> This patch may suffice to speed up 'git branch --contains' instead of
> needing to always use the 'git tag --contains' algorithm as considered in
> [1].

I'd have to do some timings, but I suspect we may want to switch to the
"tag --contains" algorithm anyway. This still does N independent
merge-base operations, one per ref. So with enough refs, you're still
better off throwing it all into one big traversal.

-Peff

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 18:24               ` Jeff King
@ 2018-04-04 18:53                 ` Derrick Stolee
  2018-04-04 18:59                   ` Jeff King
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 18:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 2:24 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote:
>
>>> diff --git a/commit.c b/commit.c
>>> index 858f4fdbc9..2566cba79f 100644
>>> --- a/commit.c
>>> +++ b/commit.c
>>> @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>>    {
>>>    	struct commit_list *bases;
>>>    	int ret = 0, i;
>>> +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
>>>    	if (parse_commit(commit))
>>>    		return ret;
>>> -	for (i = 0; i < nr_reference; i++)
>>> +	for (i = 0; i < nr_reference; i++) {
>>>    		if (parse_commit(reference[i]))
>>>    			return ret;
>>> +		if (min_generation > reference[i]->generation)
>>> +			min_generation = reference[i]->generation;
>>> +	}
>>> +
>>> +	if (commit->generation > min_generation)
>>> +		return 0;
>>>    	bases = paint_down_to_common(commit, nr_reference, reference);
>>>    	if (commit->object.flags & PARENT2)
>> This patch may suffice to speed up 'git branch --contains' instead of
>> needing to always use the 'git tag --contains' algorithm as considered in
>> [1].

I guess I want to specify: the only reason to NOT switch to the tags 
algorithm is because it _may_ hurt existing cases in certain data shapes...

> I'd have to do some timings, but I suspect we may want to switch to the
> "tag --contains" algorithm anyway. This still does N independent
> merge-base operations, one per ref. So with enough refs, you're still
> better off throwing it all into one big traversal.

...and I suppose your timings are to find out if there are data shapes 
where the branch algorithm is faster. Perhaps that is impossible now 
that we have the generation number cutoff for the tag algorithm.

Since the branch algorithm checks generation numbers before triggering 
paint_down_to_common(), we will do N independent merge-base calculations, 
where N is the number of branches with large enough generation numbers 
(which is why my test does so well: most are below the target generation 
number). This doesn't help at all if none of the refs are in the graph.

The other thing to do is add a minimum generation for the walk in 
paint_down_to_common() so even if commit->generation <= min_generation 
we still only walk down to commit->generation instead of all merge 
bases. This is something we could change in a later patch.

Patches 7 and 8 seem to me like simple changes with no downside UNLESS 
we are deciding instead to delete the code I'm changing.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 18:53                 ` Derrick Stolee
@ 2018-04-04 18:59                   ` Jeff King
  0 siblings, 0 replies; 162+ messages in thread
From: Jeff King @ 2018-04-04 18:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 02:53:45PM -0400, Derrick Stolee wrote:

> > I'd have to do some timings, but I suspect we may want to switch to the
> > "tag --contains" algorithm anyway. This still does N independent
> > merge-base operations, one per ref. So with enough refs, you're still
> > better off throwing it all into one big traversal.
> 
> ...and I suppose your timings are to find out if there are data shapes where
> the branch algorithm is faster. Perhaps that is impossible now that we have
> the generation number cutoff for the tag algorithm.

Well, I wanted to show the opposite: that the branch algorithm can still
perform quite poorly. :)

I think with generation numbers that the tag algorithm should always
perform better, since you can't walk past a merge base when using a
cutoff. But it could definitely perform worse in a case where you don't
have generation numbers.

> Patches 7 and 8 seem to me like simple changes with no downside UNLESS we
> are deciding instead to delete the code I'm changing.

Yeah, I think they are strict improvements modulo the inverted UNDEF
logic I mentioned.

-Peff

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
@ 2018-04-04 19:06             ` Derrick Stolee
  2018-04-04 19:16               ` Jeff King
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:06 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 2:22 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote:
>
>> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   					      struct contains_cache *cache)
>>   {
>>   	struct contains_stack contains_stack = { 0, 0, NULL };
>> -	enum contains_result result = contains_test(candidate, want, cache);
>> +	enum contains_result result;
>> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
>> +	const struct commit_list *p;
>> +
>> +	for (p = want; p; p = p->next) {
>> +		struct commit *c = p->item;
>> +		parse_commit_or_die(c);
>> +		if (c->generation < cutoff)
>> +			cutoff = c->generation;
>> +	}

Now that you mention it, let me split out the portion you are probably 
talking about as incorrect:

>> +	if (cutoff == GENERATION_NUMBER_UNDEF)
>> +		cutoff = GENERATION_NUMBER_NONE;

You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we 
get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then 
our walk will be limited to commits NOT in the commit-graph (which we 
hope is small if proper hygiene is followed).

> Hmm, on reflection, I'm not sure if this is right in the face of
> multiple "want" commits, only some of which have generation numbers.  We
> probably want to disable the cutoff if _any_ "want" commit doesn't have
> a number.
>
> There's also an obvious corner case where this won't kick in, and you'd
> really like it to: recently added commits. E.g., if I do this:
>
>    git gc ;# imagine this writes generation numbers
>    git pull
>    git tag --contains HEAD
>
> then HEAD isn't going to have a generation number. But this is the case
> where we have the most to gain, since we could throw away all of the
> ancient tags immediately upon seeing that their generation numbers are
> way less than that of HEAD.
>
> I wonder to what degree it's worth traversing to come up with a
> generation number for the "want" commits. If we walked, say, 50 commits
> to do it, you'd probably save a lot of work (since the alternative is
> walking thousands of commits until you realize that some ancient "v1.0"
> tag is not useful).
>
> I'd actually go so far as to say that any amount of traversal is
> generally going to be worth it to come up with the correct generation
> cutoff here. You can come up with pathological cases where you only have
> one really recent tag or something, but in practice every repository
> where performance is a concern is going to end up with refs much further
> back than it would take to reach the cutoff condition.

Perhaps there is some value in walking to find the correct cutoff value, 
but it is difficult to determine how far we are from commits with 
correct generation numbers _a priori_. I'd rather rely on the 
commit-graph being in a good state, not too far behind the refs. An 
added complexity of computing generation numbers dynamically is that we 
would need to add a dependence on the commit-graph file's existence at all.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:06             ` Derrick Stolee
@ 2018-04-04 19:16               ` Jeff King
  2018-04-04 19:22                 ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-04 19:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:

> > > @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
> > >   					      struct contains_cache *cache)
> > >   {
> > >   	struct contains_stack contains_stack = { 0, 0, NULL };
> > > -	enum contains_result result = contains_test(candidate, want, cache);
> > > +	enum contains_result result;
> > > +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
> > > +	const struct commit_list *p;
> > > +
> > > +	for (p = want; p; p = p->next) {
> > > +		struct commit *c = p->item;
> > > +		parse_commit_or_die(c);
> > > +		if (c->generation < cutoff)
> > > +			cutoff = c->generation;
> > > +	}
> 
> Now that you mention it, let me split out the portion you are probably
> talking about as incorrect:
> 
> > > +	if (cutoff == GENERATION_NUMBER_UNDEF)
> > > +		cutoff = GENERATION_NUMBER_NONE;
> 
> You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
> no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
> will be limited to commits NOT in the commit-graph (which we hope is small
> if proper hygiene is followed).

I think it's more than that. If we leave it at UNDEF, that's wrong,
because contains_test() compares:

  candidate->generation < cutoff

which would _always_ be true. In other words, we're saying that our
"want" has an insanely high generation number, and traversing can never
find it. Which is clearly wrong.

So we have to put it at "0", to say "you should always traverse, we
can't tell you that this is a dead end". So that part of the logic is
currently correct.

But what I was getting at is that the loop behavior can't just pick the
min cutoff. The min is effectively "0" if there's even a single ref for
which we don't have a generation number, because we cannot ever stop
traversing (we might get to that commit if we kept going).

(It's also possible I'm confused about how UNDEF and NONE are used; I'm
assuming commits for which we don't have a generation number available
would get UNDEF in their commit->generation field).

If you could make the assumption that when we have a generation for
commit X, then we have a generation for all of its ancestors, things get
easier. Because then if you hit commit X with a generation number and
want to compare it to a cutoff, you know that either:

  1. The cutoff is defined, in which case you can stop traversing if
     we've gone past the cutoff.

  2. The cutoff is undefined, in which case we cannot possibly reach
     our "want" by traversing. Even if it has a smaller generation
     number than us, it's on an unrelated line of development.

I don't know that the reachability property is explicitly promised by
your work, but it seems like it would be a natural fallout (after all,
you have to know the generation of each ancestor in order to compute the
later ones, so you're really just promising that you've actually stored
all the ones you've computed).

> > I wonder to what degree it's worth traversing to come up with a
> > generation number for the "want" commits. If we walked, say, 50 commits
> > to do it, you'd probably save a lot of work (since the alternative is
> > walking thousands of commits until you realize that some ancient "v1.0"
> > tag is not useful).
> > 
> > I'd actually go so far as to say that any amount of traversal is
> > generally going to be worth it to come up with the correct generation
> > cutoff here. You can come up with pathological cases where you only have
> > one really recent tag or something, but in practice every repository
> > where performance is a concern is going to end up with refs much further
> > back than it would take to reach the cutoff condition.
> 
> Perhaps there is some value in walking to find the correct cutoff value, but
> it is difficult to determine how far we are from commits with correct
> generation numbers _a priori_. I'd rather rely on the commit-graph being in
> a good state, not too far behind the refs. An added complexity of computing
> generation numbers dynamically is that we would need to add a dependence on
> the commit-graph file's existence at all.

If you could make the reachability assumption, I think this question
just goes away. As soon as you hit a commit with _any_ generation
number, you could quit traversing down that path.

-Peff

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:16               ` Jeff King
@ 2018-04-04 19:22                 ` Derrick Stolee
  2018-04-04 19:42                   ` Jeff King
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 3:16 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:
>
>>>> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>>>    					      struct contains_cache *cache)
>>>>    {
>>>>    	struct contains_stack contains_stack = { 0, 0, NULL };
>>>> -	enum contains_result result = contains_test(candidate, want, cache);
>>>> +	enum contains_result result;
>>>> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
>>>> +	const struct commit_list *p;
>>>> +
>>>> +	for (p = want; p; p = p->next) {
>>>> +		struct commit *c = p->item;
>>>> +		parse_commit_or_die(c);
>>>> +		if (c->generation < cutoff)
>>>> +			cutoff = c->generation;
>>>> +	}
>> Now that you mention it, let me split out the portion you are probably
>> talking about as incorrect:
>>
>>>> +	if (cutoff == GENERATION_NUMBER_UNDEF)
>>>> +		cutoff = GENERATION_NUMBER_NONE;
>> You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
>> no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
>> will be limited to commits NOT in the commit-graph (which we hope is small
>> if proper hygiene is followed).
> I think it's more than that. If we leave it at UNDEF, that's wrong,
> because contains_test() compares:
>
>    candidate->generation < cutoff
>
> which would _always_ be true. In other words, we're saying that our
> "want" has an insanely high generation number, and traversing can never
> find it. Which is clearly wrong.

That condition is not always true (which is why we use strict comparison 
instead of <=). If a commit is not in the commit-graph file, then its 
generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c:

void *alloc_commit_node(void)
{
         struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
         c->object.type = OBJ_COMMIT;
         c->index = alloc_commit_index();
         c->graph_pos = COMMIT_NOT_FROM_GRAPH;
         c->generation = GENERATION_NUMBER_UNDEF;
         return c;
}


> So we have to put it at "0", to say "you should always traverse, we
> can't tell you that this is a dead end". So that part of the logic is
> currently correct.
>
> But what I was getting at is that the loop behavior can't just pick the
> min cutoff. The min is effectively "0" if there's even a single ref for
> which we don't have a generation number, because we cannot ever stop
> traversing (we might get to that commit if we kept going).
>
> (It's also possible I'm confused about how UNDEF and NONE are used; I'm
> assuming commits for which we don't have a generation number available
> would get UNDEF in their commit->generation field).

I think it is this case.

> If you could make the assumption that when we have a generation for
> commit X, then we have a generation for all of its ancestors, things get
> easier. Because then if you hit commit X with a generation number and
> want to compare it to a cutoff, you know that either:
>
>    1. The cutoff is defined, in which case you can stop traversing if
>       we've gone past the cutoff.
>
>    2. The cutoff is undefined, in which case we cannot possibly reach
>       our "want" by traversing. Even if it has a smaller generation
>       number than us, it's on an unrelated line of development.
>
> I don't know that the reachability property is explicitly promised by
> your work, but it seems like it would be a natural fallout (after all,
> you have to know the generation of each ancestor in order to compute the
> later ones, so you're really just promising that you've actually stored
> all the ones you've computed).

The commit-graph is closed under reachability, so if a commit has a 
generation number then it is in the graph and so are all its ancestors.

The reason for GENERATION_NUMBER_NONE is that the commit-graph file 
stores "0" for generation number until this patch. It still satisfies 
the condition that gen(A) < gen(B) if B can reach A, but also gives us a 
condition for "this commit still needs its generation number computed".
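
To make the two special values concrete, here is a minimal sketch of
the comparison discussed above. The helper name below_cutoff() is
hypothetical (it is not in the patch); the constants are the ones from
this round of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Special values as named in this round of the series. */
#define GENERATION_NUMBER_UNDEF 0xFFFFFFFFu /* commit is not in the commit-graph */
#define GENERATION_NUMBER_NONE  0u          /* in the graph, generation not yet computed */

/*
 * Hypothetical helper mirroring "candidate->generation < cutoff".
 * Returns 1 when the walk may stop at the candidate: all of its
 * ancestors have an even smaller generation, so none of them can be
 * the "want" whose generation is `cutoff`.
 */
static int below_cutoff(uint32_t candidate_generation, uint32_t cutoff)
{
	assert(GENERATION_NUMBER_NONE < GENERATION_NUMBER_UNDEF);
	return candidate_generation < cutoff;
}
```

With cutoff == UNDEF, a candidate that is in the graph (generation 7,
say) is below the cutoff and the walk stops there, while a candidate
that is itself UNDEF is not (strict comparison), so the walk continues
through commits outside the graph. With cutoff == NONE nothing is ever
below the cutoff, so the walk is never cut short.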

>
>>> I wonder to what degree it's worth traversing to come up with a
>>> generation number for the "want" commits. If we walked, say, 50 commits
>>> to do it, you'd probably save a lot of work (since the alternative is
>>> walking thousands of commits until you realize that some ancient "v1.0"
>>> tag is not useful).
>>>
>>> I'd actually go so far as to say that any amount of traversal is
>>> generally going to be worth it to come up with the correct generation
>>> cutoff here. You can come up with pathological cases where you only have
>>> one really recent tag or something, but in practice every repository
>>> where performance is a concern is going to end up with refs much further
>>> back than it would take to reach the cutoff condition.
>> Perhaps there is some value in walking to find the correct cutoff value, but
>> it is difficult to determine how far we are from commits with correct
>> generation numbers _a priori_. I'd rather rely on the commit-graph being in
>> a good state, not too far behind the refs. An added complexity of computing
>> generation numbers dynamically is that we would need to add a dependence on
>> the commit-graph file's existence at all.
> If you could make the reachability assumption, I think this question
> just goes away. As soon as you hit a commit with _any_ generation
> number, you could quit traversing down that path.
That is the idea. I should make this clearer in all of my commit messages.

Thanks,
-Stolee

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:22                 ` Derrick Stolee
@ 2018-04-04 19:42                   ` Jeff King
  2018-04-04 19:45                     ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jeff King @ 2018-04-04 19:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:

> > I don't know that the reachability property is explicitly promised by
> > your work, but it seems like it would be a natural fallout (after all,
> > you have to know the generation of each ancestor in order to compute the
> > later ones, so you're really just promising that you've actually stored
> > all the ones you've computed).
> 
> The commit-graph is closed under reachability, so if a commit has a
> generation number then it is in the graph and so are all its ancestors.

OK, if we assume that it's closed, then I think we can effectively
ignore the UNDEF cases. They'll just work out. And then yes I'd agree
that the:

  if (cutoff == UNDEF)
    cutoff = NONE;

code is wrong. We'd want to keep it at UNDEF so we stop traversing at
any generation number.

> The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores
> "0" for generation number until this patch. It still satisfies the condition
> that gen(A) < gen(B) if B can reach A, but also gives us a condition for
> "this commit still needs its generation number computed".

OK. I thought at first that would yield wrong results when comparing
UNDEF to NONE, but I think for this kind of --contains traversal, it's
still OK (NONE is less than UNDEF, but we know that the UNDEF thing
cannot be found by traversing from a NONE).

> > If you could make the reachability assumption, I think this question
> > just goes away. As soon as you hit a commit with _any_ generation
> > number, you could quit traversing down that path.
> That is the idea. I should make this clearer in all of my commit messages.

Yes, please. :) And maybe in the documentation of the file format, if
it's not there (I didn't check). It's a very useful property, and we
want to make sure people making use of the graph know they can depend on
it.

-Peff

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:42                   ` Jeff King
@ 2018-04-04 19:45                     ` Derrick Stolee
  2018-04-04 19:46                       ` Jeff King
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 3:42 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:
>
>> That is the idea. I should make this clearer in all of my commit messages.
> Yes, please. :) And maybe in the documentation of the file format, if
> it's not there (I didn't check). It's a very useful property, and we
> want to make sure people making use of the graph know they can depend on
> it.

For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion 
of generation numbers in Documentation/technical/commit-graph.txt (the 
design doc instead of the file format).

Thanks,
-Stolee

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:45                     ` Derrick Stolee
@ 2018-04-04 19:46                       ` Jeff King
  0 siblings, 0 replies; 162+ messages in thread
From: Jeff King @ 2018-04-04 19:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:45:30PM -0400, Derrick Stolee wrote:

> On 4/4/2018 3:42 PM, Jeff King wrote:
> > On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:
> > 
> > > That is the idea. I should make this clearer in all of my commit messages.
> > Yes, please. :) And maybe in the documentation of the file format, if
> > it's not there (I didn't check). It's a very useful property, and we
> > want to make sure people making use of the graph know they can depend on
> > it.
> 
> For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion of
> generation numbers in Documentation/technical/commit-graph.txt (the design
> doc instead of the file format).

Yeah, that makes sense. Thanks, and thanks for a thoughtful discussion.
The performance numbers are very exciting.

-Peff

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-04-03 18:03 ` Brandon Williams
@ 2018-04-07 16:55 ` Jakub Narebski
  2018-04-08  1:06   ` Derrick Stolee
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  9 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-07 16:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Jeff King

Hello,

Derrick Stolee <dstolee@microsoft.com> writes:

> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used [...].
>
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

I have started working on a Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using generation numbers (level
negative-cut filter), FELINE index (negative-cut filter) and min-post
intervals in some spanning tree (positive-cut filter; if I understand it
correctly, the basis of the GRAIL method) in commit graphs.

Currently I am at the stage of reproducing results from the FELINE paper:
"Reachability Queries in Very Large Graphs: A Fast Refined Online Search
Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
J. Zaki (2014).  This paper is available in PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can also be run
locally) uses a Python kernel, the NetworkX library for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting.
--
Jakub Narębski

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:29   ` Derrick Stolee
  2018-04-03 18:47     ` Jeff King
@ 2018-04-07 17:09     ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-07 17:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller,
	larsxschneider, peff

Derrick Stolee <stolee@gmail.com> writes:

> On 4/3/2018 2:03 PM, Brandon Williams wrote:
>> On 04/03, Derrick Stolee wrote:
>>> This is the first of several "small" patches that follow the serialized
>>> Git commit graph patch (ds/commit-graph).
>>>
>>> As described in Documentation/technical/commit-graph.txt, the generation
>>> number of a commit is one more than the maximum generation number among
>>> its parents (trivially, a commit with no parents has generation number
>>> one).
[...]
>>> A more substantial refactoring of revision.c is required before making
>>> 'git log --graph' use generation numbers effectively.
>>
>> log --graph should benefit a lot more from this correct?  I know we've
>> talked a bit about negotiation and I wonder if these generation numbers
>> should be able to help out a little bit with that some day.
>
> 'log --graph' should be a HUGE speedup, when it is refactored. Since
> the topo-order can "stream" commits to the pager, it can be very
> responsive to return the graph in almost all conditions. (The case
> where generation numbers are not enough is when filters reduce the set
> of displayed commits to be very sparse, so many commits are walked
> anyway.)

I wonder if the next big speedup would be to store [some] topological
ordering of commits in the commit graph... It could be done, for example,
in two chunks: a mapping to position in topological order, and a list of
commits sorted in topological order.

Note also that the FELINE index uses (or can use -- but it is supposedly
the optimal choice) the position of a vertex/node in topological order as
one of the two values in the pair that composes the FELINE index.

> If we have generic "can X reach Y?" queries, then we can also use
> generation numbers there to great effect (by not walking commits Z
> with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch
> --contains" thread for ideas.

This is something that is shown in the Google Colab [Jupyter] Notebook
I have mentioned:

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
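
For readers who do not want to open the notebook, the pruning rule
quoted above can be sketched in plain C on a toy graph. Everything here
(struct toy_commit, can_reach(), the array-of-indices representation)
is made up for illustration and is not Git's API; it only assumes that
every commit carries a valid generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Toy commit: parents are indices into the same array, -1 means "none". */
struct toy_commit {
	uint32_t generation;
	int parents[MAX_PARENTS];
};

/*
 * Returns 1 if `from` can reach `to` by following parent links.
 * A commit Z other than `to` with gen(Z) <= gen(to) is never expanded,
 * because none of its ancestors can be `to`.
 * If `walked` is non-NULL it receives the number of commits visited.
 */
static int can_reach(const struct toy_commit *c, int nr,
		     int from, int to, int *walked)
{
	int stack[MAX_COMMITS];
	unsigned char seen[MAX_COMMITS] = { 0 };
	int top = 0, count = 0, found = 0;

	assert(nr <= MAX_COMMITS);
	stack[top++] = from;
	seen[from] = 1;

	while (top > 0) {
		int cur = stack[--top];
		int i;

		count++;
		if (cur == to) {
			found = 1;
			break;
		}
		if (c[cur].generation <= c[to].generation)
			continue; /* generation cut: dead end */

		for (i = 0; i < MAX_PARENTS; i++) {
			int p = c[cur].parents[i];
			if (p >= 0 && !seen[p]) {
				seen[p] = 1; /* each commit is pushed at most once */
				stack[top++] = p;
			}
		}
	}

	if (walked)
		*walked = count;
	return found;
}
```

The interesting case is a negative answer: when `from` sits on a side
branch at the same generation as `to`, the walk ends after visiting a
single commit instead of descending to the root.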

> For negotiation, there are some things we can do here. VSTS uses
> generation numbers as a heuristic for determining "all wants connected
> to haves" which is a condition for halting negotiation. The idea is
> very simple, and I'd be happy to discuss it on a separate thread.

Nice.  How much of a speedup does it give?

Best regards,
--
Jakub Narębski

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-07 16:55 ` Jakub Narebski
@ 2018-04-08  1:06   ` Derrick Stolee
  2018-04-11 19:32     ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-08  1:06 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Jeff King

On 4/7/2018 12:55 PM, Jakub Narebski wrote:
> Currently I am at the stage of reproducing results from the FELINE paper:
> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
> J. Zaki (2014).  This paper is available in PDF form at
> https://openproceedings.org/EDBT/2014/paper_166.pdf
>
> The Jupyter Notebook (which runs on Google cloud, but can also be run
> locally) uses a Python kernel, the NetworkX library for graph manipulation,
> and matplotlib (via NetworkX) for display.
>
> Available at:
> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>
> I hope that could be of help, or at least interesting

Let me know when you can give numbers (either raw performance or # of 
commits walked) for real-world Git commit graphs. The Linux repo is a 
good example to use for benchmarking, but I also use the Kotlin repo 
sometimes as it has over a million objects and over 250K commits.

Of course, the only important statistic at the end of the day is the 
end-to-end time of a 'git ...' command. Your investigations should 
inform whether it is worth prototyping the feature in the git codebase.

Thanks,

-Stolee


* [PATCH v2 00/10] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-04-07 16:55 ` Jakub Narebski
@ 2018-04-09 16:41 ` " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
                     ` (10 more replies)
  9 siblings, 11 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Thanks for the lively discussion of this patch series in v1!

I've incorporated the feedback from the previous round, added patches
[7/6] and [8/6], expanded the discussion of generation numbers in the
design document, and added another speedup for 'git branch --contains'.

One major difference: I renamed the macros from _UNDEF to _INFINITY and
_NONE to _ZERO. This communicates their value more clearly, since the
previous names were unclear about which was larger than the "real"
generation numbers.

Patch 2 includes a change to builtin/merge.c and a new test in
t5318-commit-graph.sh that exposes a problem I found when testing the
previous patch series on my box. The "BUG: bad generation skip" message
from "commit.c: use generation to halt paint walk" would halt a fast-
forward merge since the HEAD commit was loaded before the core.commitGraph
config setting was loaded. It is crucial that all commits that exist
in the commit-graph file are loaded from that file or else we will
lose our expected inequalities of generation numbers.

Thanks,
-Stolee

-- >8 --

This is one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

Further, use generation numbers for '--contains' queries in 'git tag'
and 'git branch', providing a significant speedup (at least 95% for
some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (10):
  object.c: parse commit in graph first
  merge: check config before loading commits
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit.c: use generation to halt paint walk
  commit-graph.txt: update future work
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()

 Documentation/technical/commit-graph.txt | 50 +++++++++++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  5 +-
 commit-graph.c                           | 48 +++++++++++++++
 commit.c                                 | 78 ++++++++++++++++++++----
 commit.h                                 |  5 ++
 object.c                                 |  4 +-
 ref-filter.c                             | 24 ++++++--
 t/t5318-commit-graph.sh                  |  9 +++
 9 files changed, 197 insertions(+), 27 deletions(-)

-- 
2.17.0


* [PATCH v2 01/10] object.c: parse commit in graph first
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
@ 2018-04-09 16:41   ` Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (!parse_commit_in_graph(commit) &&
+			    parse_commit_buffer(commit, buffer, size))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
-- 
2.17.0


* [PATCH v2 02/10] merge: check config before loading commits
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-09 16:41   ` Derrick Stolee
  2018-04-11  2:12     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

In anticipation of using generation numbers from the commit-graph,
we must ensure that all commits that exist in the commit-graph are
loaded from that file instead of from the object database. Since
the commit-graph file is only checked if core.commitGraph is true,
we must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch'
and the commit lookup. Also add a test to t5318-commit-graph.sh
that exercises this code path to prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 5 +++--
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index ee050a47f3..20897f8223 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
 
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0


* [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 17:59     ` Stefan Beller
  2018-04-11  2:31     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
                     ` (7 subsequent siblings)
  10 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_ZERO) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.
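
As an illustration of the recursive definition, here is a minimal
sketch on a toy graph. The struct dag_node layout and the function
assign_generations() are hypothetical and assume the commits are
listed parents-first; the series itself computes the numbers with a
depth-first walk over struct commit.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_ZERO 0u

/* Toy commit: parents are indices into the same array. */
struct dag_node {
	uint32_t generation;
	int nr_parents;
	int parents[2];
};

/*
 * Assign generation numbers to nodes[0..nr-1].
 * Assumes every parent index is smaller than the index of its child,
 * so each parent is finished before any child is looked at.
 */
static void assign_generations(struct dag_node *nodes, int nr)
{
	int i, j;

	for (i = 0; i < nr; i++) {
		uint32_t max_parent = GENERATION_NUMBER_ZERO;

		for (j = 0; j < nodes[i].nr_parents; j++) {
			int p = nodes[i].parents[j];

			assert(p >= 0 && p < i);
			if (nodes[p].generation > max_parent)
				max_parent = nodes[p].generation;
		}
		/* no parents: 0 + 1 = 1; otherwise one more than the max */
		nodes[i].generation = max_parent + 1;
	}
}
```

For two roots, a child of the first root, a merge of that child with
the second root, and one commit on top of the merge, this yields
generations 1, 1, 2, 3 and 4.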

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..b91df315c5 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -24,6 +27,7 @@ struct commit {
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0


* [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-11  2:51     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
proportional to the number of commits newly added to the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..5fd63acc31 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0


* [PATCH v2 05/10] commit: use generations in paint_down_to_common()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (3 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.
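
The bound comes from the storage layout. A rough sketch, assuming (as
the ">> 2" in fill_commit_in_graph() and the "<< 2" in
write_graph_chunk_data() suggest) that the generation occupies the
upper 30 bits of a 32-bit word and that the low 2 bits carry bits 32-33
of the commit date. The function names are hypothetical and byte-order
conversion (htonl/get_be32) is left out:

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX 0x3FFFFFFFu

/* Pack a generation and the two high date bits into one host-order word. */
static uint32_t pack_generation(uint32_t generation, uint64_t date)
{
	/* anything larger would lose bits in the shift below */
	assert(generation <= GENERATION_NUMBER_MAX);
	return (generation << 2) | (uint32_t)((date >> 32) & 0x3);
}

/* Recover the generation: drop the two date bits. */
static uint32_t unpack_generation(uint32_t word)
{
	return word >> 2;
}

/* Recover bits 32-33 of the commit date. */
static uint32_t unpack_date_high(uint32_t word)
{
	return word & 0x3;
}
```

Since a stored value can never exceed 0x3FFFFFFF, any commit read from
the file compares strictly below a commit that still has
GENERATION_NUMBER_INFINITY.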

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.
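
To see the effect on ordering, here is a small standalone sketch that
sorts toy commits with the same rules as the comparator in this patch.
It uses qsort() in place of the priority queue, and struct toy_commit,
cmp_for_qsort() and sort_like_queue() are made up for the example; it
assumes that the element comparing lowest is the one the queue would
hand out first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu

struct toy_commit {
	const char *name;
	uint32_t generation;
	uint64_t date;
};

/* Same rules as the patch: higher generation first, then newer date. */
static int cmp_gen_then_date(const struct toy_commit *a,
			     const struct toy_commit *b)
{
	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* newer commits with larger date first */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* Adapter with the signature qsort() expects. */
static int cmp_for_qsort(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;
	return cmp_gen_then_date(a, b);
}

/* Sort so that index 0 is what the queue would return first. */
static void sort_like_queue(struct toy_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), cmp_for_qsort);
}
```

Given a child (generation 4, date 100) whose parent has a skewed, newer
date (generation 3, date 200), sorting by date alone would visit the
parent first. With this comparator the child comes first, and a commit
outside the graph (GENERATION_NUMBER_INFINITY) comes before both.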

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 19 ++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* newer commits with larger date first */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index b91df315c5..c440f56bf9 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (4 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-11  3:02     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..00bdc2ab21 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-	int i;
-	for (i = 0; i < queue->nr; i++) {
-		struct commit *commit = queue->array[i].data;
-		if (!(commit->object.flags & STALE))
-			return 1;
+	if (min_gen != GENERATION_NUMBER_INFINITY) {
+		if (queue->nr > 0) {
+			struct commit *commit = queue->array[0].data;
+			return commit->generation >= min_gen;
+		}
+	} else {
+		int i;
+		for (i = 0; i < queue->nr; i++) {
+			struct commit *commit = queue->array[i].data;
+			if (!(commit->object.flags & STALE))
+				return 1;
+		}
 	}
+
 	return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	uint32_t min_nonstale_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		return result;
 	}
 	prio_queue_put(&queue, one);
+	if (one->generation < min_nonstale_gen)
+		min_nonstale_gen = one->generation;
 
 	for (i = 0; i < n; i++) {
 		twos[i]->object.flags |= PARENT2;
 		prio_queue_put(&queue, twos[i]);
+		if (twos[i]->generation < min_nonstale_gen)
+			min_nonstale_gen = twos[i]->generation;
 	}
 
-	while (queue_has_nonstale(&queue)) {
+	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+
+		last_gen = commit->generation;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 				return NULL;
 			p->object.flags |= flags;
 			prio_queue_put(&queue, p);
+
+			if (!(flags & STALE) &&
+			    p->generation < min_nonstale_gen)
+				min_nonstale_gen = p->generation;
 		}
 	}
 
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (5 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-12  9:12     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the two
"special" generation numbers GENERATION_NUMBER_INFINITY and *_ZERO
interact with other generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 50 +++++++++++++++++++++---
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..a8df0ae9db 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,49 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+Here is a diagram to visualize the shape of the full commit graph, and
+how different generation numbers relate:
+
+    +-----------------------------------------+
+    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
+    +-----------------------------------------+
+	    |            |      ^
+	    |            |      |
+	    |            +------+
+	    |         [gen(A) = gen(B)]
+	    V
+    +-------------------------------------+
+    | 0 < commit->generation < 0x40000000 |
+    +-------------------------------------+
+	    |            |      ^
+	    |            |      |
+	    |            +------+
+	    |        [gen(A) > gen(B)]
+	    V
+    +-------------------------------------+
+    | GENERATION_NUMBER_ZERO = 0          |
+    +-------------------------------------+
+			 |      ^
+			 |      |
+			 +------+
+		     [gen(A) = gen(B)]
+
 Design Details
 --------------
 
@@ -98,17 +141,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v2 08/10] ref-filter: use generation number for --contains
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (6 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows commit-graph walks to be short-circuited significantly.

Use generation number for '--contains' type queries.
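
A minimal sketch of the cutoff idea, assuming generation numbers are
already available as plain integers. The toy_* names and the
TOY_GENERATION_INFINITY macro are hypothetical and only mirror the
shape of the check, not the actual ref-filter code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GENERATION_INFINITY 0xFFFFFFFFu

/* Smallest generation among the wanted commits; INFINITY if there are none. */
static uint32_t toy_cutoff(const uint32_t *want_gens, size_t nr)
{
	uint32_t cutoff = TOY_GENERATION_INFINITY;
	size_t i;

	assert(want_gens || !nr);
	for (i = 0; i < nr; i++)
		if (want_gens[i] < cutoff)
			cutoff = want_gens[i];
	return cutoff;
}

/* Returns 1 when the candidate sits below every wanted commit. */
static int toy_cannot_contain(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}
```

The comparison is strict, so a candidate whose generation equals the
cutoff is still walked. That is what keeps the check safe when some
commits carry the special values instead of real generation numbers.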

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following performance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..2f5e79b5de 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1648,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1662,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v2 09/10] commit: use generation numbers for in_merge_bases()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (7 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the reference commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 00bdc2ab21..0b155dece8 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (8 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with generation
number lower than that of the commit being tested are not needed to
correctly answer the containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 0b155dece8..7348075e38 100644
--- a/commit.c
+++ b/commit.c
@@ -796,7 +796,9 @@ static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
@@ -830,6 +832,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 
 		last_gen = commit->generation;
 
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -882,7 +887,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -949,7 +954,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1073,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return 0;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 17:59     ` Stefan Beller
  2018-04-11  2:31     ` Junio C Hamano
  1 sibling, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-09 17:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, larsxschneider, bmwill

On Mon, Apr 9, 2018 at 9:42 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_ININITY 0xFFFFFFFF

GENERATION_NUMBER_INFINITY

On disk we currently only store up to 2^30-1
(2 bits fewer than MAX_UINT_32), but here we just take the maximum
value of what a uint32_t can store. That mismatch should not be a
problem, other than aesthetically.

Once we run into scaling problems, we can just move up to uint64_t in the
code, and defer the solution on disk to a new file format.

With both ZERO and _INFINITY we are at the border of uint
wrap-around, so we have to be very careful to not add/subtract
one and then compare. Just to watch out for when reviewing.
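
To make the hazard concrete, here is a small sketch. The TOY_* macros
and toy_* functions are hypothetical names used only for illustration
and do not exist in Git.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GENERATION_INFINITY 0xFFFFFFFFu
#define TOY_GENERATION_ZERO 0u

/* Naive: unsigned arithmetic wraps, so INFINITY + 1 becomes 0. */
static uint32_t toy_next_gen_naive(uint32_t gen)
{
	return gen + 1;
}

/* Guarded: test for the sentinel first, then do the arithmetic. */
static uint32_t toy_next_gen_guarded(uint32_t gen)
{
	if (gen == TOY_GENERATION_INFINITY)
		return TOY_GENERATION_INFINITY;
	return gen + 1;
}

/* Demonstrates both edges of the uint32_t range. */
static void toy_wraparound_demo(void)
{
	uint32_t zero = TOY_GENERATION_ZERO;

	/* "infinity + 1" silently compares as the smallest value. */
	assert(toy_next_gen_naive(TOY_GENERATION_INFINITY) == TOY_GENERATION_ZERO);
	/* "zero - 1" silently compares as the largest value. */
	assert((uint32_t)(zero - 1) == TOY_GENERATION_INFINITY);
	/* The guarded form keeps the sentinel ordered above real values. */
	assert(toy_next_gen_guarded(TOY_GENERATION_INFINITY) > 5);
}
```

Comparing the stored values directly, without adding or subtracting
first, sidesteps the problem entirely.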

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 02/10] merge: check config before loading commits
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-11  2:12     ` Junio C Hamano
  2018-04-11 12:49       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <dstolee@microsoft.com> writes:

> diff --git a/builtin/merge.c b/builtin/merge.c
> index ee050a47f3..20897f8223 100644
> --- a/builtin/merge.c
> +++ b/builtin/merge.c
> @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>  	if (branch)
>  		skip_prefix(branch, "refs/heads/", &branch);
> +	init_diff_ui_defaults();
> +	git_config(git_merge_config, NULL);
> +
>  	if (!branch || is_null_oid(&head_oid))
>  		head_commit = NULL;
>  	else
>  		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>  
> -	init_diff_ui_defaults();
> -	git_config(git_merge_config, NULL);

Wow, that's tricky.  git_merge_config() wants to know which "branch"
we are on, and this place is as early as we can move the call to
without breaking things.  Is this to allow parse_object() called
in lookup_commit_reference_gently() to know if we can rely on the
data cached in the commit-graph data?

> Move the config load to be between the initialization of 'branch'
> and the commit lookup. Also add a test to t5318-commit-graph.sh
> that exercises this code path to prevent a regression.

It is not clear to me how a successful merge of commits/8
demonstrates that reading the config earlier than before is
regression free.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index a380419b65..77d85aefe7 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
>  
> +test_expect_success 'perform fast-forward merge in full repo' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git checkout -b merge-5-to-8 commits/5 &&
> +	git merge commits/8 &&
> +	git show-ref -s merge-5-to-8 >output &&
> +	git show-ref -s commits/8 >expect &&
> +	test_cmp expect output
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
  2018-04-09 17:59     ` Stefan Beller
@ 2018-04-11  2:31     ` Junio C Hamano
  2018-04-11 12:57       ` Derrick Stolee
  1 sibling, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <dstolee@microsoft.com> writes:

> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_ININITY 0xFFFFFFFF
> GENERATION_NUMBER_ZERO 0
>
> The first (_INFINITY) means the generation number has not been loaded or
> computed. The second (_ZERO) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.

Should it also be possible for a caller to tell if a given commit
has too deep a history, i.e. we do not know its generation number
exactly, but we know it is larger than 1<<30?

It seems that we only have a 30-bit field in the file, so wouldn't
we need a special value defined in the file format (e.g. "0") so that we can tell
that the commit has such a large generation number?  E.g.

> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

	if (!item->generation)
		item->generation = GENERATION_NUMBER_OVERFLOW;

when we read it from the file?

We obviously need to do something similar when assigning a
generation number to a child commit, perhaps like

	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)

	commit->generation = 1; /* assume no parent */
	for (p = commit->parents; p; p = p->next) {
		uint32_t gen = p->item->generation + 1;

		if (gen >= GENERATION_NUMBER_OVERFLOW) {
			commit->generation = GENERATION_NUMBER_OVERFLOW;
			break;
		} else if (commit->generation < gen)
			commit->generation = gen;
	}
        
or something?  And then on the writing side you'd encode too large a
generation as '0'.
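
A self-contained variant of the sketch above, for anyone who wants to
compile it. The toy_* types are hypothetical stand-ins for struct
commit and struct commit_list, TOY_GENERATION_MAX is assumed to be
2^30-1 to match the 30-bit on-disk field, and every parent is assumed
to already have a computed generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GENERATION_MAX 0x3FFFFFFFu
#define TOY_GENERATION_OVERFLOW (TOY_GENERATION_MAX + 1)

struct toy_commit;
struct toy_commit_list {
	struct toy_commit *item;
	struct toy_commit_list *next;
};
struct toy_commit {
	uint32_t generation;
	struct toy_commit_list *parents;
};

/* One more than the largest parent generation, saturating at OVERFLOW. */
static uint32_t toy_generation_from_parents(const struct toy_commit *commit)
{
	const struct toy_commit_list *p;
	uint32_t result = 1; /* assume no parent */

	assert(commit);
	for (p = commit->parents; p; p = p->next) {
		uint32_t parent_gen = p->item->generation;

		/* Compare before adding so the sum cannot wrap. */
		if (parent_gen >= TOY_GENERATION_MAX)
			return TOY_GENERATION_OVERFLOW;
		if (result < parent_gen + 1)
			result = parent_gen + 1;
	}
	return result;
}
```

The only difference from the loop above is that the parent value is
tested before the addition, so a parent that already holds a very
large value cannot wrap around to a small one.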

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-11  2:51     ` Junio C Hamano
  2018-04-11 13:02       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <dstolee@microsoft.com> writes:

> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
> +				die("generation number %u is too large to store in commit-graph",
> +				    (*list)->generation);
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +		}


How serious do we want this feature to be?  On one extreme, we could
be irresponsible and say it will be a problem for our descendants in
the future if their repositories have more than a billion pearls on a
single strand, and the above certainly is a reasonable way to punt.
Those who actually encounter the problem will notice by Git dying
somewhere rather deep in the callchain.

Or we could say Git actually does support a history that is
arbitrarily long, even though such a deep portion of history will
not benefit from having generation numbers in commit-graph.

I've been assuming that our stance is the latter and that is why I
made noises about overflowing 30-bit generation field in my review
of the previous step.

In case we want to do the "we know this is very large, but we do not
know the exact value", we may actually want a mode where we can
pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
make sure that the code to handle overflow behaves sensibly.

> +	for (i = 0; i < nr_commits; i++) {
> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
> +			continue;
> +
> +		commit_list_insert(commits[i], &list);
> +		while (list) {
> +...
> +		}
> +	}

So we go over the list of commits just _once_ and make sure each of
them gets the generation assigned correctly by (conceptually
recursively but iteratively in implementation by using a commit
list) making sure that all its parents have generation assigned and
compute the generation for the commit, before moving to the next
one.  Which sounds correct.
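
In case it helps other reviewers, the shape of that walk can be
sketched as below. This is only an illustration under assumptions: the
toy_commit layout (parents as array indices, at most two of them) and
the explicit index stack are hypothetical simplifications of the
commit_list based loop in the patch, and the input is assumed to be
acyclic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_GENERATION_ZERO 0u
#define TOY_MAX_PARENTS 2

struct toy_commit {
	uint32_t generation;          /* TOY_GENERATION_ZERO until computed */
	int nr_parents;
	int parents[TOY_MAX_PARENTS]; /* indices into the commit array */
};

static void toy_compute_generations(struct toy_commit *commits, int nr)
{
	int *stack, top, i;

	assert(commits && nr > 0);
	stack = malloc(sizeof(*stack) * nr);
	if (!stack)
		abort();

	for (i = 0; i < nr; i++) {
		if (commits[i].generation != TOY_GENERATION_ZERO)
			continue;

		top = 0;
		stack[top++] = i;
		while (top) {
			struct toy_commit *c = &commits[stack[top - 1]];
			uint32_t max_gen = 0;
			int all_known = 1, j;

			for (j = 0; j < c->nr_parents; j++) {
				struct toy_commit *p = &commits[c->parents[j]];

				if (p->generation == TOY_GENERATION_ZERO) {
					/* Visit this parent first, then retry c. */
					all_known = 0;
					stack[top++] = c->parents[j];
					break;
				}
				if (p->generation > max_gen)
					max_gen = p->generation;
			}
			if (all_known) {
				c->generation = max_gen + 1;
				top--;
			}
		}
	}
	free(stack);
}
```

The stack only ever holds a chain of ancestors, so in an acyclic
history it cannot grow beyond the number of commits.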



^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-11  3:02     ` Junio C Hamano
  2018-04-11 13:24       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-11  3:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <dstolee@microsoft.com> writes:

> @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		return result;
>  	}
>  	prio_queue_put(&queue, one);
> +	if (one->generation < min_nonstale_gen)
> +		min_nonstale_gen = one->generation;
>  
>  	for (i = 0; i < n; i++) {
>  		twos[i]->object.flags |= PARENT2;
>  		prio_queue_put(&queue, twos[i]);
> +		if (twos[i]->generation < min_nonstale_gen)
> +			min_nonstale_gen = twos[i]->generation;
>  	}
>  
> -	while (queue_has_nonstale(&queue)) {
> +	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
>  		struct commit *commit = prio_queue_get(&queue);
>  		struct commit_list *parents;
>  		int flags;
>  
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +
> +		last_gen = commit->generation;
> +
>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  				return NULL;
>  			p->object.flags |= flags;

Hmph.  Can a commit that used to be not stale (and contributed to
the current value of min_nonstale_gen) become stale here by getting
visited twice, invalidating the value in min_nonstale_gen?

>  			prio_queue_put(&queue, p);
> +
> +			if (!(flags & STALE) &&
> +			    p->generation < min_nonstale_gen)
> +				min_nonstale_gen = p->generation;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 02/10] merge: check config before loading commits
  2018-04-11  2:12     ` Junio C Hamano
@ 2018-04-11 12:49       ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-11 12:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:12 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> diff --git a/builtin/merge.c b/builtin/merge.c
>> index ee050a47f3..20897f8223 100644
>> --- a/builtin/merge.c
>> +++ b/builtin/merge.c
>> @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>>   	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>>   	if (branch)
>>   		skip_prefix(branch, "refs/heads/", &branch);
>> +	init_diff_ui_defaults();
>> +	git_config(git_merge_config, NULL);
>> +
>>   	if (!branch || is_null_oid(&head_oid))
>>   		head_commit = NULL;
>>   	else
>>   		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>>   
>> -	init_diff_ui_defaults();
>> -	git_config(git_merge_config, NULL);
> Wow, that's tricky.  git_merge_config() wants to know which "branch"
> we are on, and this place is as early as we can move the call to
> without breaking things.  Is this to allow parse_object() called
> in lookup_commit_reference_gently() to know if we can rely on the
> data cached in the commit-graph data?

When I saw the bug on my machine, I tracked the issue down to a call to 
parse_commit_in_graph() that skipped the graph check since 
core_commit_graph was not set. The call stack from this call is as follows:

* lookup_commit_or_die()
* lookup_commit_reference()
* lookup_commit_reference_gently()
* parse_object()
* parse_object_buffer()
* parse_commit_in_graph() [as introduced in PATCH 01/10]

>
>> Move the config load to be between the initialization of 'branch'
>> and the commit lookup. Also add a test to t5318-commit-graph.sh
>> that exercises this code path to prevent a regression.
> It is not clear to me how a successful merge of commits/8
> demonstrates that reading the config earlier than before is
> regression free.

I didn't want to introduce commits in an order that led to a commit 
failing tests, but if you drop the change to builtin/merge.c from this 
series, the tip commit will fail this test with "BUG: bad generation skip".

The reason for this failure is that commits/5 is loaded from HEAD from 
the object database, so its generation is marked as 
GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, 
the commit at merges/3 is loaded from the graph with generation 4. This 
triggers the BUG statement in paint_down_to_common(). That is why it is 
important to check a fast-forward merge.

In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were 
already testing 'git merge-base' to check the commit walk logic.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-11  2:31     ` Junio C Hamano
@ 2018-04-11 12:57       ` Derrick Stolee
  2018-04-11 23:28         ` Junio C Hamano
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-11 12:57 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:31 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The generation number of a commit is defined recursively as follows:
>>
>> * If a commit A has no parents, then the generation number of A is one.
>> * If a commit A has parents, then the generation number of A is one
>>    more than the maximum generation number among the parents of A.
>>
>> Add a uint32_t generation field to struct commit so we can pass this
>> information to revision walks. We use two special values to signal
>> the generation number is invalid:
>>
>> GENERATION_NUMBER_INFINITY 0xFFFFFFFF
>> GENERATION_NUMBER_ZERO 0
>>
>> The first (_INFINITY) means the generation number has not been loaded or
>> computed. The second (_ZERO) means the generation number was loaded
>> from a commit graph file that was stored before generation numbers
>> were computed.
> Should it also be possible for a caller to tell if a given commit
> has too deep a history, i.e. we do not know its generation number
> exactly, but we know it is larger than 1<<30?
>
> It seems that we only have a 30-bit field in the file, so wouldn't
> we need a special value defined in (e.g. "0") so that we can tell
> that the commit has such a large generation number?  E.g.
>
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 	if (!item->generation)
> 		item->generation = GENERATION_NUMBER_OVERFLOW;
>
> when we read it from the file?
>
> We obviously need to do something similar when assigning a
> generation number to a child commit, perhaps like
>
> 	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)
>
> 	commit->generation = 1; /* assume no parent */
> 	for (p = commit->parents; p; p = p->next) {
> 		uint32_t gen = p->item->generation + 1;
>
> 		if (gen >= GENERATION_NUMBER_OVERFLOW) {
> 			commit->generation = GENERATION_NUMBER_OVERFLOW;
> 			break;
> 		} else if (commit->generation < gen)
> 			commit->generation = gen;
> 	}
>          
> or something?  And then on the writing side you'd encode too large a
> generation as '0'.

You raise a very good point. How about we do a slightly different 
arrangement for these overflow commits?

Instead of storing the commits in the commit-graph file as "0" (which 
currently means "written by a version of git that did not compute 
generation numbers") we could let GENERATION_NUMBER_MAX be the maximum 
generation of a commit in the commit-graph, and if a commit would have 
larger generation, we collapse it down to that value.
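
A minimal sketch of that collapsing rule, assuming every parent already
has a computed generation in the range 1..GENERATION_NUMBER_MAX. The
helper name generation_from_parents() and its array-based signature are
made up for illustration; they are not part of the series.

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical helper (not from the patch series): derive a commit's
 * generation from its parents' generations, collapsing anything that
 * would exceed GENERATION_NUMBER_MAX down to that value.
 */
static uint32_t generation_from_parents(const uint32_t *parent_gens, size_t nr)
{
	uint32_t max_gen = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (parent_gens[i] > max_gen)
			max_gen = parent_gens[i];

	/* Saturate instead of overflowing the 30-bit on-disk field. */
	if (max_gen >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_gen + 1;
}
```

Under this rule a root commit still gets generation 1, and two commits
that both sit at GENERATION_NUMBER_MAX compare equal even when one is an
ancestor of the other, so a walk cannot use the generation alone to rule
out reachability between them.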

It slightly complicates the diagram I made in 
Documentation/technical/commit-graph.txt, but it was already a bit of a 
simplification. Here is an updated diagram, but likely we will want to 
limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, 
since it is not a practical situation at the moment.

     +-----------------------------------------+
     | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
     +-----------------------------------------+
       |    |            |      ^
       |    |            |      |
       |    |            +------+
       |    |         [gen(A) = gen(B)]
       |    V
       |  +------------------------------------+
       |  | GENERATION_NUMBER_MAX = 0x3FFFFFFF |
       |  +------------------------------------+
       |    |            |      ^
       |    |            |      |
       |    |            +------+
       |    |         [gen(A) = gen(B)]
       V    V
     +-------------------------------------+
     | 0 < commit->generation < 0x3FFFFFFF |
     +-------------------------------------+
         |            |      ^
         |            |      |
         |            +------+
         |        [gen(A) > gen(B)]
         V
     +-------------------------------------+
     | GENERATION_NUMBER_ZERO = 0          |
     +-------------------------------------+
              |      ^
              |      |
              +------+
              [gen(A) = gen(B)]

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11  2:51     ` Junio C Hamano
@ 2018-04-11 13:02       ` Derrick Stolee
  2018-04-11 18:49         ` Stefan Beller
  2018-04-11 19:26         ` Eric Sunshine
  0 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-11 13:02 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:51 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
>> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
>> +				die("generation number %u is too large to store in commit-graph",
>> +				    (*list)->generation);
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +		}
>
> How serious do we want this feature to be?  On one extreme, we could
> be irresponsible and say it will be a problem for our descendants in
> the future if their repositories have more than billion pearls on a
> single strand, and the above certainly is a reasonable way to punt.
> Those who actually encounter the problem will notice by Git dying
> somewhere rather deep in the callchain.
>
> Or we could say Git actually does support a history that is
> arbitrarily long, even though such a deep portion of history will
> not benefit from having generation numbers in commit-graph.
>
> I've been assuming that our stance is the latter and that is why I
> made noises about overflowing 30-bit generation field in my review
> of the previous step.
>
> In case we want to do the "we know this is very large, but we do not
> know the exact value", we may actually want a mode where we can
> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
> make sure that the code to handle overflow behaves sensibly.

I agree. I wonder how we can effectively expose this value into a test. 
It's probably not sufficient to manually test using compiler flags ("-D 
GENERATION_NUMBER_MAX=8").

>
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
>> +
>> +		commit_list_insert(commits[i], &list);
>> +		while (list) {
>> +...
>> +		}
>> +	}
> So we go over the list of commits just _once_ and make sure each of
> them gets the generation assigned correctly by (conceptually
> recursively but iteratively in implementation by using a commit
> list) making sure that all its parents have generation assigned and
> compute the generation for the commit, before moving to the next
> one.  Which sounds correct.

Yes, we compute the generation number of a commit exactly once. We use 
the list as a stack so we do not have recursion limits during our 
depth-first search (DFS). We rely on the object cache to ensure we store 
the computed generation numbers, and computed generation numbers provide 
termination conditions to the DFS.
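
As a rough illustration of that stack-based DFS, here is a toy model. It
is a sketch, not the code from the patch: struct toy_commit, the
index-based parent array, and the fixed-size stack are all assumptions
made to keep the example self-contained, and only 0 is treated as
"not computed yet".

```c
#include <stdint.h>

#define GENERATION_NUMBER_ZERO 0	/* toy marker: "not computed yet" */
#define TOY_MAX_COMMITS 64
#define TOY_MAX_PARENTS 2

/* Toy stand-in for struct commit; parents are array indices, -1 = none. */
struct toy_commit {
	int parents[TOY_MAX_PARENTS];
	uint32_t generation;
};

/*
 * Assign a generation number to every commit exactly once, using an
 * explicit stack instead of recursion. Assumes nr <= TOY_MAX_COMMITS
 * and that the parent links form a DAG.
 */
static void compute_generations(struct toy_commit *commits, int nr)
{
	int stack[TOY_MAX_COMMITS];
	int i;

	for (i = 0; i < nr; i++) {
		int top = 0;

		if (commits[i].generation != GENERATION_NUMBER_ZERO)
			continue; /* already computed: nothing to do */

		stack[top++] = i;
		while (top > 0) {
			struct toy_commit *c = &commits[stack[top - 1]];
			uint32_t max_gen = 0;
			int all_parents_done = 1;
			int k;

			for (k = 0; k < TOY_MAX_PARENTS; k++) {
				int p = c->parents[k];

				if (p < 0)
					continue;
				if (commits[p].generation == GENERATION_NUMBER_ZERO) {
					/* Visit this parent first, then come back. */
					all_parents_done = 0;
					stack[top++] = p;
					break;
				}
				if (commits[p].generation > max_gen)
					max_gen = commits[p].generation;
			}

			if (all_parents_done) {
				c->generation = max_gen + 1;
				top--;
			}
		}
	}
}
```

A commit leaves the stack only after all of its parents have a value, so
a stored generation doubles as the "already visited" mark and no separate
flag is needed.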

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-11  3:02     ` Junio C Hamano
@ 2018-04-11 13:24       ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-11 13:24 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 11:02 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>   		return result;
>>   	}
>>   	prio_queue_put(&queue, one);
>> +	if (one->generation < min_nonstale_gen)
>> +		min_nonstale_gen = one->generation;
>>   
>>   	for (i = 0; i < n; i++) {
>>   		twos[i]->object.flags |= PARENT2;
>>   		prio_queue_put(&queue, twos[i]);
>> +		if (twos[i]->generation < min_nonstale_gen)
>> +			min_nonstale_gen = twos[i]->generation;
>>   	}
>>   
>> -	while (queue_has_nonstale(&queue)) {
>> +	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
>>   		struct commit *commit = prio_queue_get(&queue);
>>   		struct commit_list *parents;
>>   		int flags;
>>   
>> +		if (commit->generation > last_gen)
>> +			BUG("bad generation skip");
>> +
>> +		last_gen = commit->generation;
>> +
>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>   		if (flags == (PARENT1 | PARENT2)) {
>>   			if (!(commit->object.flags & RESULT)) {
>> @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>   				return NULL;
>>   			p->object.flags |= flags;
> Hmph.  Can a commit that used to be not stale (and contributed to
> the current value of min_nonstale_gen) become stale here by getting
> visited twice, invalidating the value in min_nonstale_gen?

min_nonstale_gen can be "wrong" in the way you say, but fits the 
definition from the commit message:

"To properly take advantage of this condition, track the minimum 
generation number of a commit that **enters the queue** with nonstale 
status." (Emphasis added)

You make an excellent point about how this can be problematic. I was
confused by the lack of clear performance benefits here, but I think
that whatever benefit came from making queue_has_nonstale() O(1) was
cancelled out by walking more commits than necessary.

Consider the following commit graph, where M is a parent of both A and 
B, S is a parent of M and B, and there is a large set of commits 
reachable from M with generation number larger than gen(S).

A    B
| __/|
|/   |
M    |
|\   |
. |  |
. |  |
. |_/
|/
S

Between A and B, the true merge base is M. Anything reachable from M is 
marked as stale. When S is added to the queue, it is only reachable from 
B, so it is non-stale. However, it is marked stale after M is walked. 
The old code would detect this as a termination condition, but the new 
code would not.
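
The difference between the two checks can be sketched with a toy queue.
This is an assumption-laden model, not the real prio_queue code: struct
toy_entry, both function names, and the convention that q[0] is the head
of the queue (largest generation) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy queue entry: a commit's generation and whether it is STALE. */
struct toy_entry {
	uint32_t generation;
	int stale;
};

/* Old-style check: scan the whole queue for any non-stale entry. */
static int has_nonstale_linear(const struct toy_entry *q, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!q[i].stale)
			return 1;
	return 0;
}

/*
 * O(1) check modelled on the idea in this patch: compare the head of
 * the queue against the smallest generation that ever entered the
 * queue as non-stale. It cannot notice entries that became stale
 * after they were queued.
 */
static int has_nonstale_tracked(const struct toy_entry *q, size_t nr,
				uint32_t min_nonstale_gen)
{
	return nr > 0 && q[0].generation >= min_nonstale_gen;
}
```

In the graph above, once M has been walked the queue holds only stale
entries, yet min_nonstale_gen is still gen(S). The linear scan reports
"nothing left" while the tracked check keeps the loop running through
every commit between M and S.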

I think this data shape is actually common (not exactly, as it may be 
that some ancestor of M provides a second path to S) especially in the 
world of pull requests and users merging master into their topic branches.

I'll remove this commit in the next version, but use the new prototype 
for queue_has_nonstale() in "commit: add short-circuit to 
paint_down_to_common()" using the given 'min_generation' instead of 
'min_nonstale_gen'.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11 13:02       ` Derrick Stolee
@ 2018-04-11 18:49         ` Stefan Beller
  2018-04-11 19:26         ` Eric Sunshine
  1 sibling, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-11 18:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab,
	larsxschneider, bmwill

On Wed, Apr 11, 2018 at 6:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/10/2018 10:51 PM, Junio C Hamano wrote:
>>
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> +               if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
>>> +                       if ((*list)->generation > GENERATION_NUMBER_MAX)
>>> +                               die("generation number %u is too large to
>>> store in commit-graph",
>>> +                                   (*list)->generation);
>>> +                       packedDate[0] |= htonl((*list)->generation << 2);
>>> +               }
>>
>>
>> How serious do we want this feature to be?  On one extreme, we could
>> be irresponsible and say it will be a problem for our descendants in
>> the future if their repositories have more than billion pearls on a
>> single strand, and the above certainly is a reasonable way to punt.
>> Those who actually encounter the problem will notice by Git dying
>> somewhere rather deep in the callchain.
>>
>> Or we could say Git actually does support a history that is
>> arbitrarily long, even though such a deep portion of history will
>> not benefit from having generation numbers in commit-graph.
>>
>> I've been assuming that our stance is the latter and that is why I
>> made noises about overflowing 30-bit generation field in my review
>> of the previous step.
>>
>> In case we want to do the "we know this is very large, but we do not
>> know the exact value", we may actually want a mode where we can
>> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
>> make sure that the code to handle overflow behaves sensibly.
>
>
> I agree. I wonder how we can effectively expose this value into a test. It's
> probably not sufficient to manually test using compiler flags ("-D
> GENERATION_NUMBER_MAX=8").

Would using an environment variable for this testing purpose be a good idea?

If we allow a user to pass in an arbitrary maximum, then we'd have to care about
generation numbers that are stored in the commit graph file larger than that
user specific maximum, though.

Looking through the output of "git grep getenv" we only have two instances
with _DEBUG, both in transport.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11 13:02       ` Derrick Stolee
  2018-04-11 18:49         ` Stefan Beller
@ 2018-04-11 19:26         ` Eric Sunshine
  1 sibling, 0 replies; 162+ messages in thread
From: Eric Sunshine @ 2018-04-11 19:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab, sbeller,
	larsxschneider, bmwill

On Wed, Apr 11, 2018 at 9:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/10/2018 10:51 PM, Junio C Hamano wrote:
>> In case we want to do the "we know this is very large, but we do not
>> know the exact value", we may actually want a mode where we can
>> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
>> make sure that the code to handle overflow behaves sensibly.
>
> I agree. I wonder how we can effectively expose this value into a test. It's
> probably not sufficient to manually test using compiler flags ("-D
> GENERATION_NUMBER_MAX=8").

A few similar cases of tests needing to tweak some behavior do so by
environment variable. See, for instance, GIT_GETTEXT_POISON and
GIT_FSMONITOR_TEST.
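
A minimal sketch of what such an override could look like. The variable
name GIT_TEST_GENERATION_NUMBER_MAX and both helper functions are
hypothetical; nothing like them exists in the series. Values that are
malformed, zero, or above the compiled-in limit are ignored so that the
30-bit field can never be overflowed by a bad setting.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical: parse a test-only override for the maximum generation.
 * Falls back to GENERATION_NUMBER_MAX on anything unusable.
 */
static uint32_t parse_generation_max(const char *value)
{
	char *end;
	unsigned long n;

	if (!value || !*value)
		return GENERATION_NUMBER_MAX;

	errno = 0;
	n = strtoul(value, &end, 10);
	if (errno || *end || n == 0 || n > GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return (uint32_t)n;
}

/* Hypothetical: consult the (made-up) environment variable. */
static uint32_t effective_generation_max(void)
{
	return parse_generation_max(getenv("GIT_TEST_GENERATION_NUMBER_MAX"));
}
```

A test could then export the variable with a small value such as 8,
write a commit-graph over a history deeper than that, and check that
walks still give correct answers. It would not address the concern about
files already written with a larger maximum.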

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-08  1:06   ` Derrick Stolee
@ 2018-04-11 19:32     ` Jakub Narebski
  2018-04-11 19:58       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-11 19:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Derrick Stolee <stolee@gmail.com> writes:

> On 4/7/2018 12:55 PM, Jakub Narebski wrote:
>> Currently I am at the stage of reproducing results in FELINE paper:
>> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
>> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
>> J. Zaki (2014).  This paper is available in the PDF form at
>> https://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The Jupyter Notebook (which runs on Google cloud, but can be also run
>> locally) uses Python kernel, NetworkX library for graph manipulation,
>> and matplotlib (via NetworkX) for display.
>>
>> Available at:
>> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>>
>> I hope that could be of help, or at least interesting
>
> Let me know when you can give numbers (either raw performance or # of
> commits walked) for real-world Git commit graphs. The Linux repo is a
> good example to use for benchmarking, but I also use the Kotlin repo
> sometimes as it has over a million objects and over 250K commits.

As I am currently converting the git repository into a commit graph, the
number of objects doesn't matter.

Though Kotlin is nicely in the largish size set: not as large as the Linux
kernel, which has 750K commits, but much larger than git.git with 65K
commits.

> Of course, the only important statistic at the end of the day is the
> end-to-end time of a 'git ...' command. Your investigations should
> inform whether it is worth prototyping the feature in the git
> codebase.

What would you suggest as a good test that could imply performance?  The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 10000
randomly selected pairs of commits... but I don't think this is a good
benchmark.

In the Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git),
which have 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 10000 randomly selected pairs of commits:

  level-filter has    91 =  0.91% [all] false positives
  FELINE index has    78 =  0.78% [all] false positives
  FELINE index has 1.16667 times fewer false positives than level filter

  min-post spanning-tree intervals has  3641 = 36.41% [all] false
  negatives

For the git.git repository (https://github.com/git/git.git), which has 52950
nodes and 65887 edges, the numbers are slightly more in favor of the FELINE
index (also out of 10000 random pairs):

  level-filter has   504 =  9.11% false positives
  FELINE index has   125 =  2.26% false positives
  FELINE index has 4.032 times fewer false positives than level filter

This is for FELINE which does not use the level / generation-number filter.

Regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-11 19:32     ` Jakub Narebski
@ 2018-04-11 19:58       ` Derrick Stolee
  2018-04-14 16:52         ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-11 19:58 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

On 4/11/2018 3:32 PM, Jakub Narebski wrote:
> What would you suggest as a good test that could imply performance? The
> Google Colab notebook linked to above includes a function to count
> number of commits (nodes / vertices in the commit graph) walked,
> currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?
2. What is the set of merge-bases between X and Y?

And the thing to measure is a commit count. If possible, it would be 
good to count commits walked (commits whose parent list is enumerated) 
and commits inspected (commits that were listed as a parent of some 
walked commit). Walked commits require a commit parse -- albeit from the 
commit-graph instead of the ODB now -- while inspected commits only 
check the in-memory cache.
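
To make the two counts concrete, here is a toy reachability walk for
question 1 that records both. It is only a sketch: struct toy_commit,
struct walk_stats, and can_reach() are invented for the example, commits
are array indices, and "inspected" counts parent-list entries, so a
commit listed by two walked children is counted twice.

```c
#include <stdint.h>

#define TOY_MAX_COMMITS 64
#define TOY_MAX_PARENTS 2

/* Toy stand-in for struct commit; parents are array indices, -1 = none. */
struct toy_commit {
	int parents[TOY_MAX_PARENTS];
	uint32_t generation;	/* assumed valid: child > every parent */
};

struct walk_stats {
	int walked;	/* commits whose parent list was enumerated */
	int inspected;	/* parent-list entries looked at */
};

/* Can `from` reach `to`? Returns 1/0, or -1 if the graph is too big. */
static int can_reach(const struct toy_commit *commits, int nr,
		     int from, int to, struct walk_stats *stats)
{
	int stack[TOY_MAX_COMMITS];
	unsigned char seen[TOY_MAX_COMMITS] = { 0 };
	int top = 0;

	stats->walked = stats->inspected = 0;
	if (nr > TOY_MAX_COMMITS)
		return -1;
	if (from == to)
		return 1;

	seen[from] = 1;
	stack[top++] = from;
	while (top > 0) {
		const struct toy_commit *c = &commits[stack[--top]];
		int k;

		stats->walked++;
		for (k = 0; k < TOY_MAX_PARENTS; k++) {
			int p = c->parents[k];

			if (p < 0)
				continue;
			stats->inspected++;
			if (p == to)
				return 1;
			if (seen[p])
				continue;
			seen[p] = 1;
			/* Generation cutoff: p is too low to reach `to`. */
			if (commits[p].generation <= commits[to].generation)
				continue;
			stack[top++] = p;
		}
	}
	return 0;
}
```

Comparing the counters with and without the generation cutoff, or with a
different index in its place, gives the kind of numbers that translate
into parse cost rather than just filter accuracy.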

For git.git and Linux, I like to use the release tags as tests. They 
provide a realistic view of the linear history, and maintenance releases 
have their own history from the major releases.

> I have tried finding number of false positives for level (generation
> number) filter and for FELINE index, and number of false negatives for
> min-post intervals in the spanning tree (for DFS tree) for 10000
> randomly selected pairs of commits... but I don't think this is a good
> benchmark.

What is a false-positive? A case where gen(X) < gen(Y) but Y cannot 
reach X? I do not think that is a great benchmark, but I guess it is 
something to measure.

> In the Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git),
> which have 750832 nodes and 811733 edges, and 563747941392 possible
> directed pairs, we have for 10000 randomly selected pairs of commits:
>
>    level-filter has    91 =  0.91% [all] false positives
>    FELINE index has    78 =  0.78% [all] false positives
>    FELINE index has 1.16667 times fewer false positives than level filter
>
>    min-post spanning-tree intervals has  3641 = 36.41% [all] false
>    negatives

Perhaps something you can do instead of sampling from N^2 commits in 
total is to select a pair of generations (say, G = 20000, G' = 20100) or 
regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150) and 
see how many false positives you see by testing all pairs (one from each 
level). The delta between the generations may need to be smaller to 
actually have a large proportion of unreachable pairs. Try different 
levels, since major version releases tend to "pinch" the commit graph to 
a common history.

> For the git.git repository (https://github.com/git/git.git), which has 52950
> nodes and 65887 edges, the numbers are slightly more in favor of the FELINE
> index (also out of 10000 random pairs):
>
>    level-filter has   504 =  9.11% false positives
>    FELINE index has   125 =  2.26% false positives
>    FELINE index has 4.032 times fewer false positives than level filter
>
> This is for FELINE which does not use the level / generation-number filter.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-11 12:57       ` Derrick Stolee
@ 2018-04-11 23:28         ` Junio C Hamano
  0 siblings, 0 replies; 162+ messages in thread
From: Junio C Hamano @ 2018-04-11 23:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <stolee@gmail.com> writes:

> How about we do a slightly different
> arrangement for these overflow commits?
>
> Instead of storing the commits in the commit-graph file as "0" (which
> currently means "written by a version of git that did not compute
> generation numbers") we could let GENERATION_NUMBER_MAX be the maximum
> generation of a commit in the commit-graph, and if a commit would have
> larger generation, we collapse it down to that value.

Sure.  Any value we can tell that it is special is fine.  Thanks.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-12  9:12     ` Junio C Hamano
  2018-04-12 11:35       ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-12  9:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

Derrick Stolee <dstolee@microsoft.com> writes:

> +Here is a diagram to visualize the shape of the full commit graph, and
> +how different generation numbers relate:
> +
> +    +-----------------------------------------+
> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
> +    +-----------------------------------------+
> +	    |            |      ^
> +	    |            |      |
> +	    |            +------+
> +	    |         [gen(A) = gen(B)]
> +	    V
> +    +-------------------------------------+
> +    | 0 < commit->generation < 0x40000000 |
> +    +-------------------------------------+
> +	    |            |      ^
> +	    |            |      |
> +	    |            +------+
> +	    |        [gen(A) > gen(B)]
> +	    V
> +    +-------------------------------------+
> +    | GENERATION_NUMBER_ZERO = 0          |
> +    +-------------------------------------+
> +			 |      ^
> +			 |      |
> +			 +------+
> +		     [gen(A) = gen(B)]

It may be just me but all I can read out of the above is that
commit->generation may store 0xFFFFFFFF, a value between 0 and
0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
<cmp> gen(B)] is trying to say.  I am guessing "Two generation
numbers within the 'valid' range can be compared" is what the second
one is trying to say, but it is much less interesting to know that
two infinities compare equal than how generation numbers from
different classes compare, which cannot be depicted in the above
notation, I am afraid.  For example, don't we want to say that a
commit with INF can never be reached by a commit with a valid
generation number, or something like that?

>  Design Details
>  --------------
>  
> @@ -98,17 +141,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>  
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -
>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:

Good.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-12  9:12     ` Junio C Hamano
@ 2018-04-12 11:35       ` Derrick Stolee
  2018-04-13  9:53         ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-12 11:35 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/12/2018 5:12 AM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> +Here is a diagram to visualize the shape of the full commit graph, and
>> +how different generation numbers relate:
>> +
>> +    +-----------------------------------------+
>> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
>> +    +-----------------------------------------+
>> +	    |            |      ^
>> +	    |            |      |
>> +	    |            +------+
>> +	    |         [gen(A) = gen(B)]
>> +	    V
>> +    +-------------------------------------+
>> +    | 0 < commit->generation < 0x40000000 |
>> +    +-------------------------------------+
>> +	    |            |      ^
>> +	    |            |      |
>> +	    |            +------+
>> +	    |        [gen(A) > gen(B)]
>> +	    V
>> +    +-------------------------------------+
>> +    | GENERATION_NUMBER_ZERO = 0          |
>> +    +-------------------------------------+
>> +			 |      ^
>> +			 |      |
>> +			 +------+
>> +		     [gen(A) = gen(B)]
> It may be just me but all I can read out of the above is that
> commit->generation may store 0xFFFFFFFF, a value between 0 and
> 0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
> <cmp> gen(B)] is trying to say.  I am guessing "Two generation
> numbers within the 'valid' range can be compared" is what the second
> one is trying to say, but it is much less interesting to know that
> two infinities compare equal than how generation numbers from
> different classes compare, which cannot be depicted in the above
> notation, I am afraid.  For example, don't we want to say that a
> commit with INF can never be reached by a commit with a valid
> generation number, or something like that?

My intention with the arrows was to demonstrate where parent 
relationships can go, and the generation-number relation between a 
commit A with parent B. Clearly, this diagram is less than helpful.

>
>>   Design Details
>>   --------------
>>   
>> @@ -98,17 +141,12 @@ Future Work
>>   - The 'commit-graph' subcommand does not have a "verify" mode that is
>>     necessary for integration with fsck.
>>   
>> -- The file format includes room for precomputed generation numbers. These
>> -  are not currently computed, so all generation numbers will be marked as
>> -  0 (or "uncomputed"). A later patch will include this calculation.
>> -
>>   - After computing and storing generation numbers, we must make graph
>>     walks aware of generation numbers to gain the performance benefits they
>>     enable. This will mostly be accomplished by swapping a commit-date-ordered
>>     priority queue with one ordered by generation number. The following
>> -  operations are important candidates:
>> +  operation is an important candidate:
> Good.


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-12 11:35       ` Derrick Stolee
@ 2018-04-13  9:53         ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-13  9:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Brandon Williams

Derrick Stolee <stolee@gmail.com> writes:

> On 4/12/2018 5:12 AM, Junio C Hamano wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> +Here is a diagram to visualize the shape of the full commit graph, and
>>> +how different generation numbers relate:
>>> +
>>> +    +-----------------------------------------+
>>> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
>>> +    +-----------------------------------------+
>>> +	    |            |      ^
>>> +	    |            |      |
>>> +	    |            +------+
>>> +	    |         [gen(A) = gen(B)]
>>> +	    V
>>> +    +-------------------------------------+
>>> +    | 0 < commit->generation < 0x40000000 |
>>> +    +-------------------------------------+
>>> +	    |            |      ^
>>> +	    |            |      |
>>> +	    |            +------+
>>> +	    |        [gen(A) > gen(B)]
>>> +	    V
>>> +    +-------------------------------------+
>>> +    | GENERATION_NUMBER_ZERO = 0          |
>>> +    +-------------------------------------+
>>> +			 |      ^
>>> +			 |      |
>>> +			 +------+
>>> +		     [gen(A) = gen(B)]
>>
>> It may be just me but all I can read out of the above is that

It's not just you.

>> commit->generation may store 0xFFFFFFFF, a value between 0 and
>> 0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
>> <cmp> gen(B)] is trying to say.  I am guessing "Two generation
>> numbers within the 'valid' range can be compared" is what the second
>> one is trying to say, but it is much less interesting to know that
>> two infinities compare equal than how generation numbers from
>> different classes compare, which cannot be depicted in the above
>> notation, I am afraid.  For example, don't we want to say that a
>> commit with INF can never be reached by a commit with a valid
>> generation number, or something like that?
>
> My intention with the arrows was to demonstrate where parent
> relationships can go, and the generation-number relation between a
> commit A with parent B. Clearly, this diagram is less than helpful.

Perhaps the following table would make the information clearer (perhaps
in addition to the above graph, but without "gen(A) {cmp} gen(B)"
arrows).

I assume that it is possible to have both GENERATION_NUMBER_ZERO and
nonzero generation numbers in one repo, perhaps via alternates.  I also
assume that A != B, and that generation numbers (both set, and 0s) are
transitively closed under reachability.

gen(A) \   commit B ->   |                     gen(B)
        \-----\          |
commit A       \         | 0xFFFFFFFF | larger   | smaller | 0x00000000
----------------\--------+------------+----------+---------+------------
0xFFFFFFFF               | =            >          >         >
0 < larger  < 0x40000000 | < N          = n        >         >
0 < smaller < 0x40000000 | < N          < N        = n       >
0x00000000               | < N          < N        < N       =

The "<", "=", ">" denote the result of the comparison between gen(A) and gen(B).

Generation numbers create a negative-cut filter: "N" and "n" denote
situations where we know from gen(A) and gen(B) that B is not reachable
from A.

As can be seen, if we use gen(A) < gen(B) as the cutoff, we don't need to
treat "infinity" and "zero" in a special way.
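
A minimal sketch of that cutoff in C, for illustration only.  The helper
name gen_rules_out_reach() is hypothetical, and 20 and 10 merely stand in
for the "larger" and "smaller" rows of the table.  Note that the strict
comparison covers the "N" cells only; the "n" cells (equal valid
generation numbers with A != B) would need a separate check.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Hypothetical helper, not part of git: returns 1 when the generation
 * numbers alone prove that B is not reachable from A (the "N" cells).
 */
static int gen_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}

/* Spot-check a few cells of the table above. */
static void check_table_cells(void)
{
	assert(gen_rules_out_reach(20, GENERATION_NUMBER_INFINITY));
	assert(gen_rules_out_reach(10, 20));
	assert(gen_rules_out_reach(GENERATION_NUMBER_ZERO, 10));
	assert(!gen_rules_out_reach(GENERATION_NUMBER_INFINITY, 20));
	assert(!gen_rules_out_reach(GENERATION_NUMBER_ZERO,
				    GENERATION_NUMBER_ZERO));
}
```

The same comparison works for every row, which is the point of the last
paragraph: neither special value needs its own branch.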


Best,
-- 
Jakub Narębski


* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-11 19:58       ` Derrick Stolee
@ 2018-04-14 16:52         ` Jakub Narebski
  2018-04-21 20:44           ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-14 16:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Derrick Stolee <stolee@gmail.com> writes:
> On 4/11/2018 3:32 PM, Jakub Narebski wrote:

>> What would you suggest as a good test that could imply performance? The
>> Google Colab notebook linked to above includes a function to count
>> number of commits (nodes / vertices in the commit graph) walked,
>> currently in the worst case scenario.
>
> The two main questions to consider are:
>
> 1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does that...
though it is a direct translation of the pseudocode for "Algorithm 3:
Reachable" from the FELINE paper, which is recursive and does not check
whether a vertex was already visited.  That was not a good idea for
large graphs such as the Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.
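
To illustrate the difference, here is a rough C sketch of an iterative
reachability check that keeps a visited set and applies the level
(generation number) filter.  It is not the notebook code: the toy graph
and the names toy_parents, toy_generation and toy_is_reachable() are
made up for this example, and it assumes fully computed generation
numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy DAG: up to two parent indices per vertex, -1 = none. */
static const int toy_parents[5][2] = {
	{ -1, -1 },	/* 0: root */
	{  0, -1 },	/* 1 */
	{  0, -1 },	/* 2 */
	{  1,  2 },	/* 3: merge of 1 and 2 */
	{  2, -1 },	/* 4 */
};
static const uint32_t toy_generation[5] = { 1, 2, 2, 3, 3 };

/* Iterative "can `from` reach `to`?" with a visited set and level filter. */
static int toy_is_reachable(int from, int to)
{
	unsigned char seen[5] = { 0 };
	int stack[5];	/* each vertex is pushed at most once */
	int top = 0;

	assert(from >= 0 && from < 5 && to >= 0 && to < 5);
	seen[from] = 1;
	stack[top++] = from;
	while (top) {
		int v = stack[--top];
		int k;

		if (v == to)
			return 1;
		/* parents have strictly smaller level, so stop here */
		if (toy_generation[v] <= toy_generation[to])
			continue;
		for (k = 0; k < 2; k++) {
			int p = toy_parents[v][k];
			if (p >= 0 && !seen[p]) {
				seen[p] = 1;
				stack[top++] = p;
			}
		}
	}
	return 0;
}
```

Marking a vertex as seen when it is pushed bounds the stack by the
number of vertices, which is what the recursive version lacks.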

> 2. What is the set of merge-bases between X and Y?

I don't have an algorithm for that in the Google Colaboratory notebook.
Though I see that there exist algorithms for calculating lowest common
ancestors in DAGs...

I'll have to take a look at how Git does that.

>
> And the thing to measure is a commit count. If possible, it would be
> good to count commits walked (commits whose parent list is enumerated)
> and commits inspected (commits that were listed as a parent of some
> walked commit). Walked commits require a commit parse -- albeit from
> the commit-graph instead of the ODB now -- while inspected commits
> only check the in-memory cache.

I don't quite see the distinction.  Whether we access the generation
number of a commit (information about the level of a vertex in the graph)
or its parent list (vertex successors / neighbours), both need access to
the commit-graph; well, accessing parents may be more costly for octopus
merges (due to having to go through the EDGE chunk).

I can easily return the set of visited commits (vertices), or just size
of said set.

>
> For git.git and Linux, I like to use the release tags as tests. They
> provide a realistic view of the linear history, and maintenance
> releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in the Linux kernel commit graph, the
FELINE index does not bring any improvement over using just the level
(generation number) filter.  But that may be caused by the narrowing of
the commit DAG around releases.

I will try to do the same between commits in a wide part, with many
commits with the same level (same generation number) both for the source
and for the target commit.  This may be unfair to the level filter,
though...


Note however that the FELINE index is not unambiguous, like generation
numbers are (modulo the decision whether to start at 0 or at 1); it
depends on the topological ordering chosen for the X elements.

>> I have tried finding number of false positives for level (generation
>> number) filter and for FELINE index, and number of false negatives for
>> min-post intervals in the spanning tree (for DFS tree) for 10000
>> randomly selected pairs of commits... but I don't think this is a good
>> benchmark.
>
> What is a false-positive? A case where gen(X) < gen(Y) but Y cannot
> reach X?

Yes.  (And equivalent for FELINE index, which is a pair of integers).

> I do not think that is a great benchmark, but I guess it is
> something to measure.

I have simply used it to have something to compare.

>> For the Linux kernel repository (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
>> that has 750832 nodes and 811733 edges, and 563747941392 possible
>> directed pairs, we have for 10000 randomly selected pairs of commits:
>>
>>    level-filter has    91 =  0.91% [all] false positives
>>    FELINE index has    78 =  0.78% [all] false positives
>>    FELINE index has 1.16667 times fewer false positives than level filter
>>
>>    min-post spanning-tree intervals has  3641 = 36.41% [all] false
>>    negatives
>
> Perhaps something you can do instead of sampling from N^2 commits in
> total is to select a pair of generations (say, G = 20000, G' = 20100)
> or regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150)
> and see how many false positives you see by testing all pairs (one
> from each level). The delta between the generations may need to be
> smaller to actually have a large proportion of unreachable pairs. Try
> different levels, since major version releases tend to "pinch" the
> commit graph to a common history.

That's a good idea.

>> For git.git repository (https://github.com/git/git.git) that has 52950
>> nodes and 65887 edges the numbers are slightly more in the FELINE
>> index's favor (also out of 10000 random pairs):
>>
>>    level-filter has   504 =  9.11% false positives
>>    FELINE index has   125 =  2.26% false positives
>>    FELINE index has 4.032 times fewer false positives than level filter
>>
>> This is for FELINE which does not use the level / generation-number filter.

Best,
-- 
Jakub Narębski


* [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (9 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00   ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
                       ` (10 more replies)
  10 siblings, 11 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Thanks for all the help on v2. Here are a few changes between versions:

* Removed the constant-time check in queue_has_nonstale() due to the
  possibility of a performance hit and no evidence of a performance
  benefit in typical cases.

* Reordered the commits about loading commits from the commit-graph.
  This way it is easier to demonstrate the incorrect checks. On my
  machine, every commit compiles and the test suite passes, but patches
  6-8 have the bug that is fixed in patch 9 "merge: check config before
  loading commits".

* The interaction with parse_commit_in_graph() from parse_object() is
  replaced with a new 'check_graph' parameter in parse_commit_buffer().
  This allows us to fill in the graph_pos and generation values for
  commits that are parsed directly from a buffer. This keeps the existing
  behavior that a commit parsed this way should match its buffer.

* There was discussion about making GENERATION_NUMBER_MAX assignable by
  an environment variable so we could add tests that exercise the behavior
  of capping a generation at that value. Perhaps the code around this is
  simple enough that we do not need to add that complexity.

Thanks,
-Stolee

-- >8 --

This is one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains', providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is built on ds/lazy-load-trees.

Derrick Stolee (9):
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph.txt: update design document
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit-graph: always load commit-graph information
  merge: check config before loading commits

 Documentation/technical/commit-graph.txt | 30 +++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  5 +-
 commit-graph.c                           | 99 +++++++++++++++++++-----
 commit-graph.h                           |  8 ++
 commit.c                                 | 54 +++++++++++--
 commit.h                                 |  7 +-
 object.c                                 |  2 +-
 ref-filter.c                             | 23 +++++-
 sha1_file.c                              |  2 +-
 t/t5318-commit-graph.sh                  |  9 +++
 11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 1/9] commit: add generation number to struct commmit
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
                       ` (9 subsequent siblings)
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_MAX 0x3FFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.
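
As an illustration only, the recursive definition and the special values
can be sketched in C as below.  The helper names generation_is_computed()
and generation_from_parents() are hypothetical and not part of this
patch; treating _MAX as a stored (capped) value that counts as computed
is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* Hypothetical helper: does this value hold a stored generation number? */
static int generation_is_computed(uint32_t generation)
{
	return generation != GENERATION_NUMBER_INFINITY &&
	       generation != GENERATION_NUMBER_ZERO;
}

/* One more than the maximum among the parents; one for a root commit. */
static uint32_t generation_from_parents(const uint32_t *parent_gen,
					size_t nr_parents)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		/* the definition only applies once all parents are known */
		assert(generation_is_computed(parent_gen[i]));
		if (parent_gen[i] > max)
			max = parent_gen[i];
	}
	return max + 1;
}
```

With no parents the loop body never runs and the result is one, which
matches the first bullet above.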

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 	 */
 	struct tree *maybe_tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 2/9] commit-graph: compute generation numbers
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.
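
A self-contained sketch of the depth-first strategy on a toy graph may
help when reading the diff below.  It is not the patch code: struct
toy_commit, toy_commits and the toy_* functions are hypothetical
stand-ins that use parent indices instead of commit lists, and the
sketch clamps while assigning rather than in a separate step.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
#define TOY_NR 6

/* Hypothetical stand-in for struct commit: parent indices, -1 = none. */
struct toy_commit {
	int parents[2];
	uint32_t generation;
};

static struct toy_commit toy_commits[TOY_NR] = {
	{ {  1,  2 }, GENERATION_NUMBER_INFINITY },	/* 0: merge of 1, 2 */
	{ {  3, -1 }, GENERATION_NUMBER_INFINITY },	/* 1 */
	{ {  3, -1 }, GENERATION_NUMBER_ZERO },		/* 2: from old file */
	{ { -1, -1 }, GENERATION_NUMBER_INFINITY },	/* 3: root */
	{ {  5,  0 }, GENERATION_NUMBER_INFINITY },	/* 4 */
	{ { -1, -1 }, GENERATION_NUMBER_MAX },		/* 5: already capped */
};

static int toy_needs_generation(const struct toy_commit *c)
{
	return c->generation == GENERATION_NUMBER_INFINITY ||
	       c->generation == GENERATION_NUMBER_ZERO;
}

static void toy_compute_generations(struct toy_commit *c)
{
	int stack[TOY_NR];	/* a path in a DAG never repeats a commit */
	int i;

	for (i = 0; i < TOY_NR; i++) {
		int top = 0;

		if (!toy_needs_generation(&c[i]))
			continue;	/* precomputed: not walked at all */
		stack[top++] = i;
		while (top) {
			struct toy_commit *cur = &c[stack[top - 1]];
			uint32_t max = 0;
			int pending = -1, k;

			for (k = 0; k < 2; k++) {
				int p = cur->parents[k];
				if (p < 0)
					continue;
				if (toy_needs_generation(&c[p])) {
					pending = p;
					break;
				}
				if (c[p].generation > max)
					max = c[p].generation;
			}
			if (pending >= 0) {
				assert(top < TOY_NR);
				stack[top++] = pending;	/* parent first */
				continue;
			}
			/* all parents known: assign, capping at _MAX */
			cur->generation = max < GENERATION_NUMBER_MAX ?
					  max + 1 : GENERATION_NUMBER_MAX;
			top--;
		}
	}
}
```

In this toy input, commit 4 has a parent that is already at
GENERATION_NUMBER_MAX, so it is capped at the same value instead of
overflowing the 30 bits available in the file.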

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..688d5b1801 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+
+			if (current->generation > GENERATION_NUMBER_MAX)
+				current->generation = GENERATION_NUMBER_MAX;
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 14:31       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
                       ` (7 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.
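
To make the resulting order concrete, here is a small stand-alone
sketch.  It assumes a hypothetical struct toy_commit with only the
compared fields and uses a two-argument comparator with qsort() in
place of the prio_queue used in commit.c; the comparison logic mirrors
the function added in the diff below.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Hypothetical stand-in for struct commit with only the compared fields. */
struct toy_commit {
	int id;
	uint32_t generation;
	uint64_t date;
};

/* Two-argument variant for qsort(); higher generation sorts first. */
static int toy_cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* use date as a heuristic when generations are equal */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

static void toy_sort_demo(void)
{
	struct toy_commit c[] = {
		{ 0, 2, 100 },
		{ 1, GENERATION_NUMBER_INFINITY, 50 },	/* not in the graph */
		{ 2, 2, 200 },
		{ 3, 5, 10 },
	};

	qsort(c, sizeof(c) / sizeof(c[0]), sizeof(c[0]),
	      toy_cmp_gen_then_date);
	/* not-in-graph commit first, then by generation, then by date */
	assert(c[0].id == 1 && c[1].id == 3);
	assert(c[2].id == 2 && c[3].id == 0);
}
```

The commit with an old date but generation 5 sorts ahead of both
generation-2 commits, so a skewed date cannot place a parent ahead of
its child.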

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 20 +++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	/* newer commits first */
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* use date as a heuristic when generations are equal */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 4/9] commit-graph.txt: update design document
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (2 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 19:47       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
                       ` (6 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --------------
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (3 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 21:02       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
                       ` (5 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.
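
The cutoff logic can be sketched on plain integers as follows.  This is
an illustration under assumptions, not the ref-filter.c code: the names
toy_contains_cutoff() and toy_below_cutoff() are hypothetical, and the
generation numbers in the demo are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Smallest generation among the wanted commits; INFINITY if none. */
static uint32_t toy_contains_cutoff(const uint32_t *want_gen, size_t nr)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++)
		if (want_gen[i] < cutoff)
			cutoff = want_gen[i];
	return cutoff;
}

/* Nonzero when the candidate is too low to contain any wanted commit. */
static int toy_below_cutoff(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}

static void toy_cutoff_demo(void)
{
	const uint32_t want[] = { 40, 25, 31 };
	uint32_t cutoff = toy_contains_cutoff(want, 3);

	assert(cutoff == 25);
	assert(toy_below_cutoff(24, cutoff));	/* pruned without walking */
	assert(!toy_below_cutoff(25, cutoff));	/* equal: must be checked */
	assert(!toy_below_cutoff(GENERATION_NUMBER_INFINITY, cutoff));
}
```

If any wanted commit carries generation number zero, the cutoff becomes
zero and nothing is pruned, so the walk falls back to the old behavior.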

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following performance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
 
 	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (4 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 22:15       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
                       ` (4 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the minimum generation number among
the reference commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a44899c733..bceb79c419 100644
--- a/commit.c
+++ b/commit.c
@@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (5 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 23:19       ` Jakub Narebski
  2018-04-19  8:32       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
                       ` (3 subsequent siblings)
  10 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with generation
number lower than that commit's are not needed to correctly answer
the containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.
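
To make the effect of the cutoff concrete, here is a simplified,
standalone sketch; it is not paint_down_to_common() itself. It assumes
a toy history stored in an array, uses a linear scan in place of the
priority queue, and returns as soon as the target is found, which the
real function does not do. The names struct toy_commit, toy_reaches()
and MAX_TOY_COMMITS are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TOY_COMMITS 16

/* Hypothetical toy commit: parents are array indexes, -1 if absent. */
struct toy_commit {
	uint32_t generation;
	int parent[2];
};

/*
 * Walk from 'tip' in decreasing generation order and report whether
 * 'target' is reachable. The walk stops at the first commit whose
 * generation is below min_generation. *walked counts the commits
 * that were actually processed. Returns -1 if the toy array is too big.
 */
static int toy_reaches(const struct toy_commit *c, size_t nr,
		       int tip, int target, uint32_t min_generation,
		       size_t *walked)
{
	unsigned char queued[MAX_TOY_COMMITS] = { 0 };
	unsigned char done[MAX_TOY_COMMITS] = { 0 };
	size_t i;

	*walked = 0;
	if (nr > MAX_TOY_COMMITS)
		return -1;
	queued[tip] = 1;

	for (;;) {
		int best = -1;
		int p;

		/* A linear scan stands in for the priority queue. */
		for (i = 0; i < nr; i++)
			if (queued[i] && !done[i] &&
			    (best < 0 || c[i].generation > c[best].generation))
				best = (int)i;

		if (best < 0)
			return 0;	/* queue exhausted */
		if (c[best].generation < min_generation)
			return 0;	/* everything left is below the cutoff */

		done[best] = 1;
		(*walked)++;
		if (best == target)
			return 1;

		for (p = 0; p < 2; p++)
			if (c[best].parent[p] >= 0)
				queued[c[best].parent[p]] = 1;
	}
}
```

With min_generation set to 0 the walk visits every ancestor of the
tip before giving up; with min_generation set to the target's
generation it stops as soon as the queue drops below that level.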

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index bceb79c419..a70f120878 100644
--- a/commit.c
+++ b/commit.c
@@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+		last_gen = commit->generation;
+
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return 0;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (6 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:50       ` Derrick Stolee
  2018-04-19  0:02       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
                       ` (2 subsequent siblings)
  10 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
 commit-graph.h |  8 ++++++++
 commit.c       |  7 +++++--
 commit.h       |  2 +-
 object.c       |  2 +-
 sha1_file.c    |  2 +-
 6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
 
 	item->object.parsed = 1;
 	item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
+{
+	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		*pos = item->graph_pos;
+		return 1;
+	} else {
+		return bsearch_graph(commit_graph, &(item->object.oid), pos);
+	}
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+	uint32_t pos;
+
+	if (item->object.parsed)
+		return 0;
 	if (!core_commit_graph)
 		return 0;
-	if (item->object.parsed)
-		return 1;
-
 	prepare_commit_graph();
-	if (commit_graph) {
-		uint32_t pos;
-		int found;
-		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-			pos = item->graph_pos;
-			found = 1;
-		} else {
-			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
-		}
-
-		if (found)
-			return fill_commit_in_graph(item, commit_graph, pos);
-	}
-
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		return fill_commit_in_graph(item, commit_graph, pos);
 	return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+	uint32_t pos;
+	if (!core_commit_graph)
+		return;
+	prepare_commit_graph();
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
 {
 	struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
+	if (check_graph)
+		load_commit_graph_info(item);
+
 	return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return error("Object %s not a commit",
 			     oid_to_hex(&item->object.oid));
 	}
-	ret = parse_commit_buffer(item, buffer, size);
+	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
 		set_commit_buffer(item, buffer, size);
 		return 0;
diff --git a/commit.h b/commit.h
index 64436ff44e..b5afde1ae9 100644
--- a/commit.h
+++ b/commit.h
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
  */
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
diff --git a/object.c b/object.c
index e6ad3f61f0..efe4871325 100644
--- a/object.c
+++ b/object.c
@@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (parse_commit_buffer(commit, buffer, size, 1))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
diff --git a/sha1_file.c b/sha1_file.c
index 1b94f39c4c..0fd4f0b8b6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
 {
 	struct commit c;
 	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(&c, buf, size))
+	if (parse_commit_buffer(&c, buf, size, 0))
 		die("corrupt commit");
 }
 
-- 
2.17.0.39.g685157f7fb



* [PATCH v3 9/9] merge: check config before loading commits
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (7 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
  10 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.
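
The failure mode can be pictured with a small standalone sketch,
assuming we only look at the sequence of generation numbers popped
from the queue. The helper first_bad_generation_skip() is a
hypothetical name, and the sequence in the usage note below is made
up for illustration rather than taken from a real run:

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/*
 * Return the index of the first popped generation that is larger
 * than the one popped before it, or -1 if the sequence never
 * increases. This mirrors the last_gen check that guards the
 * BUG("bad generation skip") call.
 */
static int first_bad_generation_skip(const uint32_t *popped, size_t nr)
{
	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (popped[i] > last_gen)
			return (int)i;
		last_gen = popped[i];
	}
	return -1;
}
```

For example, a sequence such as { INFINITY, 8, 7, 6, INFINITY }
fails at index 4, because an "infinite" commit shows up after finite
ones, while { 8, 7, 6, 5 } passes.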

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 5 +++--
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..7e1da6c6ea 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
 
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-17 17:50       ` Derrick Stolee
  2018-04-19  0:02       ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:50 UTC (permalink / raw)
  To: Derrick Stolee, git
  Cc: peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine,
	jonathantanmy

On 4/17/2018 1:00 PM, Derrick Stolee wrote:
> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
>
> With generation numbers in the commit-graph, we need to ensure that any
> commit that exists in the commit-graph file has its generation number
> loaded.
>
> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file. Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter. This avoids duplicate work when we already
> checked the graph in parse_commit_gently() or when simply checking the
> buffer contents in check_commit().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>   commit-graph.h |  8 ++++++++
>   commit.c       |  7 +++++--
>   commit.h       |  2 +-
>   object.c       |  2 +-
>   sha1_file.c    |  2 +-
>   6 files changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 688d5b1801..21e853c21a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>   	return &commit_list_insert(c, pptr)->next;
>   }
>   
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +}
> +
>   static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>   {
>   	uint32_t edge_value;
>   	uint32_t *parent_data_ptr;
>   	uint64_t date_low, date_high;
>   	struct commit_list **pptr;
> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
>   
>   	item->object.parsed = 1;
>   	item->graph_pos = pos;
> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>   	return 1;
>   }
>   
> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
> +{
> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		*pos = item->graph_pos;
> +		return 1;
> +	} else {
> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);

The reference to 'commit_graph' in the above line should be 'g'. Sorry!

> +	}
> +}
> +
>   int parse_commit_in_graph(struct commit *item)
>   {
> +	uint32_t pos;
> +
> +	if (item->object.parsed)
> +		return 0;
>   	if (!core_commit_graph)
>   		return 0;
> -	if (item->object.parsed)
> -		return 1;
> -
>   	prepare_commit_graph();
> -	if (commit_graph) {
> -		uint32_t pos;
> -		int found;
> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> -			pos = item->graph_pos;
> -			found = 1;
> -		} else {
> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
> -		}
> -
> -		if (found)
> -			return fill_commit_in_graph(item, commit_graph, pos);
> -	}
> -
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		return fill_commit_in_graph(item, commit_graph, pos);
>   	return 0;
>   }
>   
> +void load_commit_graph_info(struct commit *item)
> +{
> +	uint32_t pos;
> +	if (!core_commit_graph)
> +		return;
> +	prepare_commit_graph();
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		fill_commit_graph_info(item, commit_graph, pos);
> +}
> +
>   static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>   {
>   	struct object_id oid;
> diff --git a/commit-graph.h b/commit-graph.h
> index 260a468e73..96cccb10f3 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>    */
>   int parse_commit_in_graph(struct commit *item);
>   
> +/*
> + * It is possible that we loaded commit contents from the commit buffer,
> + * but we also want to ensure the commit-graph content is correctly
> + * checked and filled. Fill the graph_pos and generation members of
> + * the given commit.
> + */
> +void load_commit_graph_info(struct commit *item);
> +
>   struct tree *get_commit_tree_in_graph(const struct commit *c);
>   
>   struct commit_graph {
> diff --git a/commit.c b/commit.c
> index a70f120878..9ef6f699bd 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>   	return ret;
>   }
>   
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>   {
>   	const char *tail = buffer;
>   	const char *bufptr = buffer;
> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>   	}
>   	item->date = parse_commit_date(bufptr, tail);
>   
> +	if (check_graph)
> +		load_commit_graph_info(item);
> +
>   	return 0;
>   }
>   
> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>   		return error("Object %s not a commit",
>   			     oid_to_hex(&item->object.oid));
>   	}
> -	ret = parse_commit_buffer(item, buffer, size);
> +	ret = parse_commit_buffer(item, buffer, size, 0);
>   	if (save_commit_buffer && !ret) {
>   		set_commit_buffer(item, buffer, size);
>   		return 0;
> diff --git a/commit.h b/commit.h
> index 64436ff44e..b5afde1ae9 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>    */
>   struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>   
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>   int parse_commit_gently(struct commit *item, int quiet_on_missing);
>   static inline int parse_commit(struct commit *item)
>   {
> diff --git a/object.c b/object.c
> index e6ad3f61f0..efe4871325 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>   	} else if (type == OBJ_COMMIT) {
>   		struct commit *commit = lookup_commit(oid);
>   		if (commit) {
> -			if (parse_commit_buffer(commit, buffer, size))
> +			if (parse_commit_buffer(commit, buffer, size, 1))
>   				return NULL;
>   			if (!get_cached_commit_buffer(commit, NULL)) {
>   				set_commit_buffer(commit, buffer, size);
> diff --git a/sha1_file.c b/sha1_file.c
> index 1b94f39c4c..0fd4f0b8b6 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>   {
>   	struct commit c;
>   	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(&c, buf, size))
> +	if (parse_commit_buffer(&c, buf, size, 0))
>   		die("corrupt commit");
>   }
>   



* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-18 14:31       ` Jakub Narebski
  2018-04-18 14:46         ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-18 14:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill,
	gitster, sunshine, jonathantanmy

Derrick Stolee <dstolee@microsoft.com> writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.
>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.

Does it mean that it gives no measurable performance improvements for
typical test cases?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 20 +++++++++++++++++++-
>  commit.h |  1 +
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 711f674c18..a44899c733 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
>  	return 0;
>  }
>  
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	/* newer commits first */
> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;
> +
> +	/* use date as a heuristic when generataions are equal */

Very minor typo in above comment:

s/generataions/generations/

> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;
> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>  {
> -	struct prio_queue queue = { compare_commits_by_commit_date };
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>  
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>  
>  LAST_ARG_MUST_BE_NULL
>  extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);


* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-18 14:31       ` Jakub Narebski
@ 2018-04-18 14:46         ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-18 14:46 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 10:31 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Define compare_commits_by_gen_then_commit_date(), which uses generation
>> numbers as a primary comparison and commit date to break ties (or as a
>> comparison when both commits do not have computed generation numbers).
>>
>> Since the commit-graph file is closed under reachability, we know that
>> all commits in the file have generation at most GENERATION_NUMBER_MAX
>> which is less than GENERATION_NUMBER_INFINITY.
>>
>> This change does not affect the number of commits that are walked during
>> the execution of paint_down_to_common(), only the order that those
>> commits are inspected. In the case that commit dates violate topological
>> order (i.e. a parent is "newer" than a child), the previous code could
>> walk a commit twice: if a commit is reached with the PARENT1 bit, but
>> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
>> propagated to its parents. Using generation numbers avoids this extra
>> effort, even if it is somewhat rare.
> Does it mean that it gives no measurable performance improvements for
> typical test cases?

Not in this commit. When we add the `min_generation` parameter in a 
later commit, we do get a significant performance boost (when we can 
supply a non-zero value to `min_generation`).

This step of using generation numbers for the priority is important for 
that commit, but on its own has limited value outside of the clock-skew 
case mentioned above.
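
A standalone sketch of that clock-skew case, assuming a simplified
struct with only the two fields involved. The names struct
toy_commit, by_date() and by_gen_then_date() are hypothetical, and
qsort() stands in for the priority queue:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct commit. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Old ordering: newer commit date first. */
static int by_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* New ordering: higher generation first, date only breaks ties. */
static int by_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;
	return by_date(a_, b_);
}
```

If a parent (generation 2) carries a later date than its child
(generation 3), sorting with by_date() puts the parent first, while
by_gen_then_date() still puts the child first.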

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit.c | 20 +++++++++++++++++++-
>>   commit.h |  1 +
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/commit.c b/commit.c
>> index 711f674c18..a44899c733 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
>>   	return 0;
>>   }
>>   
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>> +{
>> +	const struct commit *a = a_, *b = b_;
>> +
>> +	/* newer commits first */
>> +	if (a->generation < b->generation)
>> +		return 1;
>> +	else if (a->generation > b->generation)
>> +		return -1;
>> +
>> +	/* use date as a heuristic when generataions are equal */
> Very minor typo in above comment:
>
> s/generataions/generations/

Good catch!

>
>> +	if (a->date < b->date)
>> +		return 1;
>> +	else if (a->date > b->date)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>>   int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>>   {
>>   	const struct commit *a = a_, *b = b_;
>> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>>   /* all input commits in one and twos[] must have been parsed! */
>>   static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>>   {
>> -	struct prio_queue queue = { compare_commits_by_commit_date };
>> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>>   	struct commit_list *result = NULL;
>>   	int i;
>>   
>> diff --git a/commit.h b/commit.h
>> index aac3b8c56f..64436ff44e 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>>   extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>>   
>>   int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>>   
>>   LAST_ARG_MUST_BE_NULL
>>   extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);



* Re: [PATCH v3 4/9] commit-graph.txt: update design document
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-18 19:47       ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-18 19:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill,
	gitster, sunshine, jonathantanmy

Derrick Stolee <dstolee@microsoft.com> writes:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().

All right.

>
> Expand the section on generation numbers to discuss how the three
> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
> _MAX interact with other generation numbers.

Very good.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 0550c6d0dc..d9f2713efa 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +in the commit-graph file. If a commit-graph file was written by a version
> +of Git that did not compute generation numbers, then those commits will
> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.

I have to wonder whether there will be any released version of Git that
does not compute generation numbers...

On the other hand in case the user-visible view of the project history
changes, be it because shallow clone is shortened or deepened, or grafts
file is edited, or a commit object is replaced with another with
different parents - we can still use "commit-graph" data, just pretend
that generation numbers (which are invalid in altered history) are all
zero.  (I'll write about this idea in comments to later series.)

Then again, with GENERATION_NUMBER_ZERO this series of patches is
self-contained and bisectable.

> +
> +Since the commit-graph file is closed under reachability, we can guarantee
> +the following weaker condition on all commits:

I have had to look up the contents of the whole file, but it turns out
that it is all right: "weaker condition" refers to earlier "N <= M".

Minor sidenote: if one were to be extremely pedantic, one could say that
the previous condition is incorrect, because it doesn't state explicitly
that commit A != commit B. ;-)

> +
>     If A and B are commits with generation numbers N and M, respectively,
> +    and N < M, then A cannot reach B.
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits, but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.
> +
> We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

I wonder if something like the table I have proposed in v2 version of
this patch [1] would make it easier or harder to understand.

[1]: https://public-inbox.org/git/86a7u7mnzi.fsf@gmail.com/

Something like the following:

             |                      gen(B)
             |
gen(A)       | _INFINITY | _MAX     | larger   | smaller  | _ZERO
-------------+-----------+----------+----------+----------+--------
_INFINITY    | =         | >        | >        | >        | >
_MAX         | < N       | =        | >        | >        | >
larger       | < N       | < N      | = n      | >        | >
smaller      | < N       | < N      | < N      | = n      | >
_ZERO        | < N       | < N      | < N      | < N      | =

Here "n" denotes the stronger condition, and "N" denotes the weaker
condition.  We have _INFINITY > _MAX > larger > smaller > _ZERO.
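
The weaker condition behind the table can be sketched in C. This is only
an illustration, not code from the patch: the helper names
is_valid_generation() and gen_rules_out_reach() are hypothetical, and the
constants take the values quoted from the design document above.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* Hypothetical check: a value is either a stored 30-bit number or _INFINITY. */
static int is_valid_generation(uint32_t gen)
{
	return gen <= GENERATION_NUMBER_MAX || gen == GENERATION_NUMBER_INFINITY;
}

/*
 * Hypothetical helper: returns 1 when the generation numbers alone prove
 * that A cannot reach B (weaker condition: gen(A) < gen(B)), and 0 when
 * nothing can be concluded and a walk is still needed.
 */
static int gen_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	assert(is_valid_generation(gen_a) && is_valid_generation(gen_b));
	return gen_a < gen_b;
}
```

Because the comparison is strict, two commits that both have _ZERO, both
have _MAX, or both have _INFINITY are never ruled out by this test alone.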

> +
>  Design Details
>  --------------
>
> @@ -98,17 +121,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -
>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:
>
> -    - paint_down_to_common()
>      - 'log --topo-order'
>
>  - Currently, parse_commit_gently() requires filling in the root tree

Looks good.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-18 21:02       ` Jakub Narebski
  2018-04-23 14:22         ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-18 21:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Here I can offer only a cursory examination, as I don't know the area
of code in question.

Derrick Stolee <dstolee@microsoft.com> writes:

> A commit A can reach a commit B only if the generation number of A
> is larger than the generation number of B. This condition allows
> significantly short-circuiting commit-graph walks.
>
> Use generation number for 'git tag --contains' queries.
>
> On a copy of the Linux repository where HEAD is contained in v4.13
> but no earlier tag, the command 'git tag --contains HEAD' had the
> following performance improvement:
>
> Before: 0.81s
> After:  0.04s
> Rel %:  -95%

A question: what is the performance after if the "commit-graph" feature
is disabled, or there is no commit-graph file?  Is there performance
regression in this case, or is the difference negligible?

>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  ref-filter.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index cffd8bf3ce..e2fea6d635 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  /*
>   * Test whether the candidate or one of its parents is contained in the list.
                                 ^^^^^^^^^^^^^^^^^^^^^

Sidenote: when examining the code after the change, I have noticed that
the above part of the header comment for the contains_test() function is
no longer entirely correct, as the function only checks the candidate
commit and nowhere accesses its parents.

But that is not your problem.

>   * Do not recurse to find out, though, but return -1 if inconclusive.
>   */
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
> -					  struct contains_cache *cache)
> +					  struct contains_cache *cache,
> +					  uint32_t cutoff)
>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>  
> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
>  
>  	/* Otherwise, we don't know; prepare to recurse */
>  	parse_commit_or_die(candidate);
> +
> +	if (candidate->generation < cutoff)
> +		return CONTAINS_NO;
> +

Looks good to me.

The only [minor] question may be whether to define a separate type for
generation numbers, and whether to future-proof the tests - though the
latter would almost certainly be overengineering, and the former
probably too.

>  	return CONTAINS_UNKNOWN;
>  }
>  
> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		parse_commit_or_die(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}

Shouldn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise, after the change, contains_tag_algo() parses
each commit in 'want', which I think was not done previously.

With a commit-graph file, parsing is [probably] cheap.  Without it, not
necessarily.

But I might be worrying about nothing.
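
To make the concern concrete, here is a minimal sketch of the cutoff
logic from the hunks above, using simplified stand-ins for Git's
structures. The struct layouts and the helper names min_generation_of()
and below_cutoff() are assumptions for illustration, not the real
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Simplified stand-ins for Git's structures (assumed for this sketch). */
struct commit {
	uint32_t generation;
};

struct commit_list {
	struct commit *item;
	struct commit_list *next;
};

/* Smallest generation number among the wanted commits. */
static uint32_t min_generation_of(const struct commit_list *want)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	const struct commit_list *p;

	for (p = want; p; p = p->next) {
		assert(p->item);
		if (p->item->generation < cutoff)
			cutoff = p->item->generation;
	}
	return cutoff;
}

/* A candidate strictly below the cutoff cannot reach any wanted commit. */
static int below_cutoff(const struct commit *candidate, uint32_t cutoff)
{
	return candidate->generation < cutoff;
}
```

In this sketch, if every commit carries _INFINITY (as would be the case
without a commit-graph file), the cutoff stays at _INFINITY and
below_cutoff() never fires, so the only extra cost is the parsing loop.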

>  
> +	result = contains_test(candidate, want, cache, cutoff);

Other than the question about a possible performance regression if
commit-graph data is not available, it looks good to me.

>  	if (result != CONTAINS_UNKNOWN)
>  		return result;
>  
> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		 * If we just popped the stack, parents->item has been marked,
>  		 * therefore contains_test will return a meaningful yes/no.
>  		 */
> -		else switch (contains_test(parents->item, want, cache)) {
> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>  		case CONTAINS_YES:
>  			*contains_cache_at(cache, commit) = CONTAINS_YES;
>  			contains_stack.nr--;
> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		}
>  	}
>  	free(contains_stack.contains_stack);
> -	return contains_test(candidate, want, cache);
> +	return contains_test(candidate, want, cache, cutoff);

Simple change. It looks good to me.

>  }
>  
>  static int commit_contains(struct ref_filter *filter, struct commit *commit,

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-18 22:15       ` Jakub Narebski
  2018-04-23 14:31         ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-18 22:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the target commits.

You have "target" twice in above paragraph; one of those should probably
be something else.

>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:
>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%

Nice...

>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

...especially for so small changes.

>
> diff --git a/commit.c b/commit.c
> index a44899c733..bceb79c419 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>  
>  	if (parse_commit(commit))
>  		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>  		if (parse_commit(reference[i]))
>  			return ret;
> +		if (min_generation > reference[i]->generation)
> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return 0;

Why not use "return ret;" instead of "return 0;", like the rest of the
code [cryptically] does, that is:

  +	if (commit->generation > min_generation)
  +		return ret;

>  
>  	bases = paint_down_to_common(commit, nr_reference, reference);
>  	if (commit->object.flags & PARENT2)

Otherwise, it looks good to me.

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-18 23:19       ` Jakub Narebski
  2018-04-23 14:40         ` Derrick Stolee
  2018-04-19  8:32       ` Jakub Narebski
  1 sibling, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-18 23:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> When running 'git branch --contains', the in_merge_bases_many()
> method calls paint_down_to_common() to discover if a specific
> commit is reachable from a set of branches. Commits with lower
> generation number are not needed to correctly answer the
> containment query of in_merge_bases_many().

Right. This description is not entirely clear to me, but I don't have a
better proposal. Good enough, I guess.

>
> Add a new parameter, min_generation, to paint_down_to_common() that
> prevents walking commits with generation number strictly less than
> min_generation. If 0 is given, then there is no functional change.

Is this new parameter really needed, i.e. do you really need to change the
signature of this function?  See below for details.

>
> For in_merge_bases_many(), we can pass commit->generation as the
> cutoff,...

This is the only callsite that uses min_generation with a non-zero value,
and it uses commit->generation to fill it... while commit itself is one
of the existing parameters.

> [...], and this saves time during 'git branch --contains' queries
> that would otherwise walk "around" the commit we are inspecting.

If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because a max-order priority queue is used: if the commit with
the maximum generation number is below the generation number of the
target commit, then the target commit is not reachable from any commit in
the priority queue (all of which have a generation number less than or
equal to that of the commit at the head of the queue, i.e. all are at the
same level or deeper); compare what I have written in [1]

[1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/

Do I have that right?  If so, it looks all right to me.
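
The short-circuit described above can be sketched as follows. This is a
simplification under stated assumptions: an array already sorted by
non-increasing generation number stands in for the order in which commits
would be popped from the priority queue, and walk_until_cutoff() is a
hypothetical name, not a function in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/*
 * gens[] holds generation numbers in pop order (highest first).
 * Returns how many commits are processed before the walk stops.
 */
static size_t walk_until_cutoff(const uint32_t *gens, size_t n,
				uint32_t min_generation)
{
	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
	size_t walked = 0, i;

	for (i = 0; i < n; i++) {
		/* mirrors the BUG("bad generation skip") check in the patch */
		assert(gens[i] <= last_gen);
		last_gen = gens[i];

		/* everything left in the queue is at or below this level */
		if (gens[i] < min_generation)
			break;
		walked++;
	}
	return walked;
}
```

With a cutoff of 0 the break can never trigger, because no unsigned
generation number is below 0; that matches the "no functional change"
claim in the commit message.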

>
> For a copy of the Linux repository, where HEAD is checked out at
> v4.13~100, we get the following performance improvement for
> 'git branch --contains' over the previous commit:
>
> Before: 0.21s
> After:  0.13s
> Rel %: -38%

Nice.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/commit.c b/commit.c
> index bceb79c419..a70f120878 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  }
>  
>  /* all input commits in one and twos[] must have been parsed! */
> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
> +static struct commit_list *paint_down_to_common(struct commit *one, int n,
> +						struct commit **twos,
> +						int min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;

Do we really need to change the signature of paint_down_to_common(), or
would it be enough to create a local variable min_generation set
initially to one->generation?

 +      uint32_t min_generation = one->generation;
 +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;

>  
>  	one->object.flags |= PARENT1;
>  	if (!n) {
> @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		struct commit_list *parents;
>  		int flags;
>  
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +		last_gen = commit->generation;
> +
> +		if (commit->generation < min_generation)
> +			break;
> +

I think, after looking at the whole post-image code, that it is all
right.

>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>  
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);
>  
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);
>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)
> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  	if (commit->generation > min_generation)
>  		return 0;
>  
> -	bases = paint_down_to_common(commit, nr_reference, reference);
> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.

>  	if (commit->object.flags & PARENT2)
>  		ret = 1;
>  	clear_commit_marks(commit, all_flags);

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
  2018-04-17 17:50       ` Derrick Stolee
@ 2018-04-19  0:02       ` Jakub Narebski
  2018-04-23 14:49         ` Derrick Stolee
  1 sibling, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-19  0:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
>
> With generation numbers in the commit-graph, we need to ensure that any
> commit that exists in the commit-graph file has its generation number
> loaded.

All right, that is nice explanation of the why behind this change.

>
> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file. Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter. This avoids duplicate work when we already
> checked the graph in parse_commit_gently() or when simply checking the
> buffer contents in check_commit().

Couldn't this 'check_graph' parameter be a global variable similar to
the 'commit_graph' variable?  Maybe I am not understanding it.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>  commit-graph.h |  8 ++++++++
>  commit.c       |  7 +++++--
>  commit.h       |  2 +-
>  object.c       |  2 +-
>  sha1_file.c    |  2 +-
>  6 files changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 688d5b1801..21e853c21a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>  	return &commit_list_insert(c, pptr)->next;
>  }
>  
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +}
> +
>  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
>  	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;

I'm probably wrong, but isn't this an unrelated change?
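
As a side illustration of the `>> 2` in fill_commit_graph_info() above:
the shift keeps the upper 30 bits of a big-endian 32-bit word. The sketch
below assumes, as I understand the commit-graph format document, that the
low two bits of that word belong to the commit date; read_be32() and
generation_from_word() are hypothetical helpers standing in for
get_be32() and the surrounding code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's get_be32(): read 4 big-endian bytes. */
static uint32_t read_be32(const unsigned char *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] << 24) |
	       ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) |
	       (uint32_t)p[3];
}

/* Drop the low two bits and keep the 30-bit generation number. */
static uint32_t generation_from_word(const unsigned char *p)
{
	return read_be32(p) >> 2;
}
```

A word with all bits set decodes to 0x3FFFFFFF, which is why
GENERATION_NUMBER_MAX is the largest value the file can store.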

>  
>  	item->object.parsed = 1;
>  	item->graph_pos = pos;
> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	return 1;
>  }
>  
> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
> +{
> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		*pos = item->graph_pos;
> +		return 1;
> +	} else {
> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);
> +	}
> +}

All right (after the fix).

> +
>  int parse_commit_in_graph(struct commit *item)
>  {
> +	uint32_t pos;
> +
> +	if (item->object.parsed)
> +		return 0;
>  	if (!core_commit_graph)
>  		return 0;
> -	if (item->object.parsed)
> -		return 1;

Hmmm... previously the function returned 1 if item->object.parsed, now
it returns 0 for this situation.  I don't understand this change.

> -
>  	prepare_commit_graph();
> -	if (commit_graph) {
> -		uint32_t pos;
> -		int found;
> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> -			pos = item->graph_pos;
> -			found = 1;
> -		} else {
> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
> -		}
> -
> -		if (found)
> -			return fill_commit_in_graph(item, commit_graph, pos);
> -	}
> -
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		return fill_commit_in_graph(item, commit_graph, pos);

Nice refactoring.

>  	return 0;
>  }
>  
> +void load_commit_graph_info(struct commit *item)
> +{
> +	uint32_t pos;
> +	if (!core_commit_graph)
> +		return;
> +	prepare_commit_graph();
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		fill_commit_graph_info(item, commit_graph, pos);
> +}

And the reason for the refactoring.

> +
>  static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>  {
>  	struct object_id oid;
> diff --git a/commit-graph.h b/commit-graph.h
> index 260a468e73..96cccb10f3 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>   */
>  int parse_commit_in_graph(struct commit *item);
>  
> +/*
> + * It is possible that we loaded commit contents from the commit buffer,
> + * but we also want to ensure the commit-graph content is correctly
> + * checked and filled. Fill the graph_pos and generation members of
> + * the given commit.
> + */
> +void load_commit_graph_info(struct commit *item);
> +
>  struct tree *get_commit_tree_in_graph(const struct commit *c);
>  
>  struct commit_graph {
> diff --git a/commit.c b/commit.c
> index a70f120878..9ef6f699bd 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>  	return ret;
>  }
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>  {
>  	const char *tail = buffer;
>  	const char *bufptr = buffer;
> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>  	}
>  	item->date = parse_commit_date(bufptr, tail);
>  
> +	if (check_graph)
> +		load_commit_graph_info(item);
> +
>  	return 0;
>  }
>  
> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return error("Object %s not a commit",
>  			     oid_to_hex(&item->object.oid));
>  	}
> -	ret = parse_commit_buffer(item, buffer, size);
> +	ret = parse_commit_buffer(item, buffer, size, 0);
>  	if (save_commit_buffer && !ret) {
>  		set_commit_buffer(item, buffer, size);
>  		return 0;
> diff --git a/commit.h b/commit.h
> index 64436ff44e..b5afde1ae9 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>   */
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>  int parse_commit_gently(struct commit *item, int quiet_on_missing);
>  static inline int parse_commit(struct commit *item)
>  {
> diff --git a/object.c b/object.c
> index e6ad3f61f0..efe4871325 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>  	} else if (type == OBJ_COMMIT) {
>  		struct commit *commit = lookup_commit(oid);
>  		if (commit) {
> -			if (parse_commit_buffer(commit, buffer, size))
> +			if (parse_commit_buffer(commit, buffer, size, 1))
>  				return NULL;
>  			if (!get_cached_commit_buffer(commit, NULL)) {
>  				set_commit_buffer(commit, buffer, size);
> diff --git a/sha1_file.c b/sha1_file.c
> index 1b94f39c4c..0fd4f0b8b6 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>  {
>  	struct commit c;
>  	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(&c, buf, size))
> +	if (parse_commit_buffer(&c, buf, size, 0))
>  		die("corrupt commit");
>  }

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (8 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
@ 2018-04-19  0:04     ` Jakub Narebski
  2018-04-23 14:54       ` Derrick Stolee
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
  10 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-19  0:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> -- >8 --
>
> This is the one of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph) and lazy-loading trees
> (ds/lazy-load-trees).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one). This section is expanded to describe the interaction with special
> generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
> file) and *_ZERO (commits in a commit-graph file written before generation
> numbers were implemented).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a short-circuit mechanism
> to improve performance of `git branch --contains`.
>
> Further, use generation numbers for 'git tag --contains', providing a
> significant speedup (at least 95% for some cases).
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.
>
> This patch series is build on ds/lazy-load-trees.
>
> Derrick Stolee (9):
>   commit: add generation number to struct commmit

Nice and short patch. Looks good to me.

>   commit-graph: compute generation numbers

Another quite easy to understand patch. LGTM.

>   commit: use generations in paint_down_to_common()

Nice and short patch; minor typo in comment in code.
Otherwise it looks good to me.

>   commit-graph.txt: update design document

I see that diagram got removed in this version; maybe it could be
replaced with relationship table?

Anyway, it looks good to me.

>   ref-filter: use generation number for --contains

A question: what does performance look like after the change if
commit-graph is not available?

>   commit: use generation numbers for in_merge_bases()

Possible typo in the commit message, and a stylistic inconsistency in
in_merge_bases() - though actually clearer than the existing code.

Short, simple, and gives good performance improvements.

>   commit: add short-circuit to paint_down_to_common()

Looks good to me; ignore [mostly] what I have written in response to the
patch in question.

>   commit-graph: always load commit-graph information

Looks all right; question: parameter or one more global variable.

>   merge: check config before loading commits

This looks good to me.

>
>  Documentation/technical/commit-graph.txt | 30 +++++--
>  alloc.c                                  |  1 +
>  builtin/merge.c                          |  5 +-
>  commit-graph.c                           | 99 +++++++++++++++++++-----
>  commit-graph.h                           |  8 ++
>  commit.c                                 | 54 +++++++++++--
>  commit.h                                 |  7 +-
>  object.c                                 |  2 +-
>  ref-filter.c                             | 23 +++++-
>  sha1_file.c                              |  2 +-
>  t/t5318-commit-graph.sh                  |  9 +++
>  11 files changed, 199 insertions(+), 41 deletions(-)
>
>
> base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
  2018-04-18 23:19       ` Jakub Narebski
@ 2018-04-19  8:32       ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-19  8:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>  
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);
>  
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);
>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)

Wouldn't it be better and more readable to create a symbolic name for
this 0, for example:

  -	list = paint_down_to_common(one, n, twos);
  +	list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF);

Best,
-- 
Jakub Narębski

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-14 16:52         ` Jakub Narebski
@ 2018-04-21 20:44           ` Jakub Narebski
  2018-04-23 13:54             ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-21 20:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Jakub Narebski <jnareb@gmail.com> writes:
> Derrick Stolee <stolee@gmail.com> writes:
>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>
>>> What would you suggest as a good test that could imply performance? The
>>> Google Colab notebook linked to above includes a function to count
>>> number of commits (nodes / vertices in the commit graph) walked,
>>> currently in the worst case scenario.
>>
>> The two main questions to consider are:
>>
>> 1. Can X reach Y?
>
> That is easy to do.  The function generic_is_reachable() does
> that... though using direct translation of the pseudocode for
> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
> doesn't check if vertex was already visited was not good idea for large
> graphs such as Linux kernel commit graph, oops.  That is why
> generic_is_reachable_large() was created.
[...]

>> And the thing to measure is a commit count. If possible, it would be
>> good to count commits walked (commits whose parent list is enumerated)
>> and commits inspected (commits that were listed as a parent of some
>> walked commit). Walked commits require a commit parse -- albeit from
>> the commit-graph instead of the ODB now -- while inspected commits
>> only check the in-memory cache.
[...]
>>
>> For git.git and Linux, I like to use the release tags as tests. They
>> provide a realistic view of the linear history, and maintenance
>> releases have their own history from the major releases.
>
> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
> FELINE index does not bring any improvements over using just level
> (generation number) filter.  But that may be caused by narrowing of
> commit DAG around releases.
>
> I try to do the same between commits in wide part, with many commits
> with the same level (same generation number) both for source and for
> target commit.  This may be unfair to level filter, though...
>
>
> Note however that FELINE index is not unambiguous, like generation
> numbers are (modulo decision whether to start at 0 or at 1); it depends
> on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
 * src = 468165c1d = v2.17.0
 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

 * 468165c1d has level 18418 which it shares with 6 commits
 * 66d2e04ec has level 14776 which it shares with 93 commits
 * gen(468165c1d) - gen(66d2e04ec) = 3642

 algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
 -----------+---------+--------+----------+---------+----------+-----------+
 naive      | 48865   | 39599  | 244      | 9200    |          |           |
 level      |  3086   |  2492  | 113      |  528    | 285      |           |
 FELINE     |   283   |   216  |  68      |    0    |          |  25       |
 lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
 -----------+---------+--------+----------+---------+----------+-----------+
 lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [spanning-tree] intervals for positive-cut filter;
  note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
  were used for negative-cut; note that those are not disjoint; node can
  be rejected by both level filter and FELINE filter
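
To make the 'level' row concrete, here is a minimal sketch of a
generation-number negative-cut filter on a toy DAG.  It is not the
notebook's code: struct node, reachable() and the caller-supplied stack
are hypothetical names made up for illustration, and it assumes that
generation numbers are already computed and that all 'visited' flags
start at zero.  It uses an explicit stack rather than recursion, for
the reason mentioned earlier about large graphs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 2

/* Toy commit node; the fields are illustrative, not Git's struct commit. */
struct node {
	uint32_t generation;	/* 1 for roots, else 1 + max over parents */
	int visited;		/* set when the node is pushed on the stack */
	size_t nr_parents;
	struct node *parents[MAX_PARENTS];
};

/*
 * Return 1 if dst is reachable from src by following parent links.
 * 'stack' must have room for 'cap' entries; since every node is pushed
 * at most once, cap = number of nodes is always enough.
 * '*walked' counts the nodes whose parent list was enumerated.
 */
static int reachable(struct node *src, const struct node *dst,
		     struct node **stack, size_t cap, size_t *walked)
{
	size_t top = 0;

	assert(cap > 0);
	src->visited = 1;
	stack[top++] = src;

	while (top) {
		struct node *n = stack[--top];
		size_t i;

		if (n == dst)
			return 1;
		/*
		 * Negative cut: a node other than dst can reach dst only
		 * if its generation number is strictly larger.
		 */
		if (n->generation <= dst->generation)
			continue;

		(*walked)++;
		for (i = 0; i < n->nr_parents; i++) {
			struct node *p = n->parents[i];

			if (p->visited)
				continue;
			p->visited = 1;
			assert(top < cap);
			stack[top++] = p;
		}
	}
	return 0;
}
```

On a small diamond (root, two children, one merge) plus an unrelated
sibling of the children, asking whether the merge reaches the sibling
enumerates only the merge itself: both of its parents share the
sibling's generation number and are cut without being walked.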

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post intervals, Bloom filter for changed files, etc.) into
a separate thread?
-- 
Jakub Narębski

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-21 20:44           ` Jakub Narebski
@ 2018-04-23 13:54             ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 13:54 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

On 4/21/2018 4:44 PM, Jakub Narebski wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>> Derrick Stolee <stolee@gmail.com> writes:
>>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>>>> What would you suggest as a good test that could imply performance? The
>>>> Google Colab notebook linked to above includes a function to count
>>>> number of commits (nodes / vertices in the commit graph) walked,
>>>> currently in the worst case scenario.
>>> The two main questions to consider are:
>>>
>>> 1. Can X reach Y?
>> That is easy to do.  The function generic_is_reachable() does
>> that... though using direct translation of the pseudocode for
>> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
>> doesn't check if vertex was already visited was not good idea for large
>> graphs such as Linux kernel commit graph, oops.  That is why
>> generic_is_reachable_large() was created.
> [...]
>
>>> And the thing to measure is a commit count. If possible, it would be
>>> good to count commits walked (commits whose parent list is enumerated)
>>> and commits inspected (commits that were listed as a parent of some
>>> walked commit). Walked commits require a commit parse -- albeit from
>>> the commit-graph instead of the ODB now -- while inspected commits
>>> only check the in-memory cache.
> [...]
>>> For git.git and Linux, I like to use the release tags as tests. They
>>> provide a realistic view of the linear history, and maintenance
>>> releases have their own history from the major releases.
>> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
>> FELINE index does not bring any improvements over using just level
>> (generation number) filter.  But that may be caused by narrowing of
>> commit DAG around releases.
>>
>> I try to do the same between commits in wide part, with many commits
>> with the same level (same generation number) both for source and for
>> target commit.  This may be unfair to level filter, though...
>>
>>
>> Note however that FELINE index is not unambiguous, like generation
>> numbers are (modulo decision whether to start at 0 or at 1); it depends
>> on the topological ordering chosen for the X elements.
> One can now test reachability on git.git repository; there is a form
> where one can plug source and destination revisions at
> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1
>
> I have tried the case that is quite unfair to the generation numbers
> filter, namely the check between one of recent tags, and one commit that
> shares generation number among largest number of other commits.
>
> Here level = generation number-1 (as it starts at 0 for root commit, not
> 1).
>
> The results are:
>   * src = 468165c1d = v2.17.0
>   * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec
>
>   * 468165c1d has level 18418 which it shares with 6 commits
>   * 66d2e04ec has level 14776 which it shares with 93 commits
>   * gen(468165c1d) - gen(66d2e04ec) = 3642
>
>   algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
>   -----------+---------+--------+----------+---------+----------+-----------+
>   naive      | 48865   | 39599  | 244      | 9200    |          |           |
>   level      |  3086   |  2492  | 113      |  528    | 285      |           |
>   FELINE     |   283   |   216  |  68      |    0    |          |  25       |
>   lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
>   -----------+---------+--------+----------+---------+----------+-----------+
>   lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |
>
> Here we have:
> * 'naive' implementation means simple DFS walk, without any filters (cut-offs)
> * 'level' means using levels / generation numbers based negative-cut filter
> * 'FELINE' means using FELINE index based negative-cut filter
> * 'lev+FELINE' means combining generation numbers filter with FELINE filter
> * 'mpi' means min-post [spanning-tree] intervals for positive-cut filter;
>    note that the code does not walk the path after cut, but it is easy to do
>
> The stats have the following meaning:
> * 'access' means accessing the node
> * 'walk' is actual walking the node
> * 'maxdepth' is maximum depth of the stack used for DFS
> * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
>    were used for negative-cut; note that those are not disjoint; node can
>    be rejected by both level filter and FELINE filter
>
> For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
> the results are the same, with 5 commits accessed and 6 walked compared
> to 61574 accessed in naive algorithm.
>
> The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
> (different child-less commits) and 9 roots, and has average clustering
> coefficient 0.000409217.

Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.

> P.S. Would it be better to move the discussion about possible extensions
> to the commit-graph in the form of new chunks (topological order, FELINE
> index, min-post intervals, Bloom filter for changed files, etc.) into
> a separate thread?

Yes. I think we've exhausted this thought experiment and future 
discussion should revolve around actual implementations in Git with 
end-to-end performance times. The computation time for computing the 
FELINE index should be included in that discussion.

Thanks,
-Stolee

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-18 21:02       ` Jakub Narebski
@ 2018-04-23 14:22         ` Derrick Stolee
  2018-04-24 18:56           ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:22 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 5:02 PM, Jakub Narebski wrote:
> Here I can offer only the cursory examination, as I don't know this area
> of code in question.
>
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> A commit A can reach a commit B only if the generation number of A
>> is larger than the generation number of B. This condition allows
>> significantly short-circuiting commit-graph walks.
>>
>> Use generation number for 'git tag --contains' queries.
>>
>> On a copy of the Linux repository where HEAD is contained in v4.13
>> but no earlier tag, the command 'git tag --contains HEAD' had the
>> following performance improvement:
>>
>> Before: 0.81s
>> After:  0.04s
>> Rel %:  -95%
> A question: what is the performance after if the "commit-graph" feature
> is disabled, or there is no commit-graph file?  Is there performance
> regression in this case, or is the difference negligible?

Negligible, since we are adding a small number of integer comparisons 
and the main cost is in commit parsing. More on commit parsing in 
response to your comments below.

>
>> Helped-by: Jeff King <peff@peff.net>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   ref-filter.c | 23 +++++++++++++++++++----
>>   1 file changed, 19 insertions(+), 4 deletions(-)
>>
>> diff --git a/ref-filter.c b/ref-filter.c
>> index cffd8bf3ce..e2fea6d635 100644
>> --- a/ref-filter.c
>> +++ b/ref-filter.c
>> @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>>   /*
>>    * Test whether the candidate or one of its parents is contained in the list.
>                                   ^^^^^^^^^^^^^^^^^^^^^
>
> Sidenote: when examining the code after the change, I have noticed that
> the above part of the header comment for the contains_test() function is
> no longer entirely correct, as the function only checks the candidate
> commit, and in no place does it access its parents.
>
> But that is not your problem.

I'll add a commit in the next version that fixes this comment before I 
make any changes to the method.

>
>>    * Do not recurse to find out, though, but return -1 if inconclusive.
>>    */
>>   static enum contains_result contains_test(struct commit *candidate,
>>   					  const struct commit_list *want,
>> -					  struct contains_cache *cache)
>> +					  struct contains_cache *cache,
>> +					  uint32_t cutoff)
>>   {
>>   	enum contains_result *cached = contains_cache_at(cache, candidate);
>>   
>> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
>>   
>>   	/* Otherwise, we don't know; prepare to recurse */
>>   	parse_commit_or_die(candidate);
>> +
>> +	if (candidate->generation < cutoff)
>> +		return CONTAINS_NO;
>> +
> Looks good to me.
>
> The only [minor] question may be whether to define separate type for
> generation numbers, and whether to future proof the tests - though the
> latter would be almost certainly overengineering, and the former
> probably too.

If we have multiple notions of generation, then we can refactor all 
references to the "generation" member.

>
>>   	return CONTAINS_UNKNOWN;
>>   }
>>   
>> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   					      struct contains_cache *cache)
>>   {
>>   	struct contains_stack contains_stack = { 0, 0, NULL };
>> -	enum contains_result result = contains_test(candidate, want, cache);
>> +	enum contains_result result;
>> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
>> +	const struct commit_list *p;
>> +
>> +	for (p = want; p; p = p->next) {
>> +		struct commit *c = p->item;
>> +		parse_commit_or_die(c);
>> +		if (c->generation < cutoff)
>> +			cutoff = c->generation;
>> +	}
> Shouldn't the above be made conditional on the ability to get generation
> numbers from the commit-graph file (feature is turned on and file
> exists)?  Otherwise here after the change contains_tag_algo() now parses
> each commit in 'want', which I think was not done previously.
>
> With a commit-graph file, parsing is [probably] cheap.  Without it, not
> necessarily.
>
> But I might be worrying about nothing.

Not nothing. This parses the "wants" when we previously did not parse 
the wants. Further: this parsing happens before we do the simple check 
of comparing the OID of the candidate against the wants.

The question is: are these parsed commits significant compared to the 
walk that will parse many more commits? It is certainly possible.

One way to fix this is to call 'prepare_commit_graph()' directly and 
then test that 'commit_graph' is non-null before performing any parses. 
I'm not thrilled with how that couples the commit-graph implementation 
to this feature, but that may be necessary to avoid regressions in the 
non-commit-graph case.
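
A rough sketch of what that guard could look like, on toy types rather
than the real ref-filter.c code.  Everything here is an assumption for
illustration: struct toy_commit, contains_cutoff(), pruned_by_cutoff()
and the 'graph_loaded' flag (standing in for "commit_graph is non-null
after prepare_commit_graph()") are hypothetical, and returning 0 to mean
"never prune" is just one possible convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed marker for "generation number unknown / not in the graph". */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu

/* Toy stand-in for struct commit; only the generation matters here. */
struct toy_commit {
	uint32_t generation;
};

/*
 * Compute the smallest generation number among the wanted commits.
 * When no commit-graph is loaded, skip the loop entirely (in real code
 * this is where the extra parses would be avoided) and return 0, which
 * makes pruned_by_cutoff() below always answer "no".
 */
static uint32_t contains_cutoff(const struct toy_commit *want, size_t nr,
				int graph_loaded)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	assert(want || !nr);
	if (!graph_loaded)
		return 0;

	for (i = 0; i < nr; i++)
		if (want[i].generation < cutoff)
			cutoff = want[i].generation;
	return cutoff;
}

/* A candidate below the cutoff cannot reach any wanted commit. */
static int pruned_by_cutoff(const struct toy_commit *candidate,
			    uint32_t cutoff)
{
	return candidate->generation < cutoff;
}
```

With wants at generations 10, 7 and 12 the cutoff is 7, so a candidate
at generation 6 is rejected without a walk while one at generation 7 is
not; with graph_loaded == 0 nothing is ever rejected this way.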

>
>>   
>> +	result = contains_test(candidate, want, cache, cutoff);
> Other than the question about possible performance regression if
> commit-graph data is not available, it looks good to me.
>
>>   	if (result != CONTAINS_UNKNOWN)
>>   		return result;
>>   
>> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   		 * If we just popped the stack, parents->item has been marked,
>>   		 * therefore contains_test will return a meaningful yes/no.
>>   		 */
>> -		else switch (contains_test(parents->item, want, cache)) {
>> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>>   		case CONTAINS_YES:
>>   			*contains_cache_at(cache, commit) = CONTAINS_YES;
>>   			contains_stack.nr--;
>> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   		}
>>   	}
>>   	free(contains_stack.contains_stack);
>> -	return contains_test(candidate, want, cache);
>> +	return contains_test(candidate, want, cache, cutoff);
> Simple change. It looks good to me.
>
>>   }
>>   
>>   static int commit_contains(struct ref_filter *filter, struct commit *commit,


* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-18 22:15       ` Jakub Narebski
@ 2018-04-23 14:31         ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:31 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 6:15 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The containment algorithm for 'git branch --contains' is different
>> from that for 'git tag --contains' in that it uses is_descendant_of()
>> instead of contains_tag_algo(). The expensive portion of the branch
>> algorithm is computing merge bases.
>>
>> When a commit-graph file exists with generation numbers computed,
>> we can avoid this merge-base calculation when the target commit has
>> a larger generation number than the target commits.
> You have "target" twice in above paragraph; one of those should probably
> be something else.

Thanks. Second "target" should be "initial".

> [...]
>> +
>> +	if (commit->generation > min_generation)
>> +		return 0;
> Why not use "return ret;" instead of "return 0;", like the rest of the
> code [cryptically] does, that is:
>
>    +	if (commit->generation > min_generation)
>    +		return ret;

Sure.

Thanks,
-Stolee

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-18 23:19       ` Jakub Narebski
@ 2018-04-23 14:40         ` Derrick Stolee
  2018-04-23 21:38           ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:40 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 7:19 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
[...]
>> [...], and this saves time during 'git branch --contains' queries
>> that would otherwise walk "around" the commit we are inspecting.
> If I understand the code properly, what happens is that we can now
> short-circuit if all commits that are left are lower than the target
> commit.
>
> This is because max-order priority queue is used: if the commit with
> maximum generation number is below generation number of target commit,
> then target commit is not reachable from any commit in the priority
> queue (all of which have generation number less than or equal to that of
> the commit at head of queue, i.e. all are same level or deeper); compare
> what I have written in [1]
>
> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/
>
> Do I have that right?  If so, it looks all right to me.

Yes, the priority queue needs to compare via generation number first or 
there will be errors. This is why we could not use commit time before.
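
As a toy illustration of that ordering (not the code from the patch):
the comparison looks at generation number first and uses the commit
date only as a tie-breaker, so once the head of the queue is below the
cutoff, everything behind it is too.  The names struct toy_commit,
compare_gen_then_date() and can_stop_walk() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for struct commit. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/*
 * Ordering for a max-priority queue: negative if 'a' must be popped
 * before 'b'.  Generation number decides; date only breaks ties.
 */
static int compare_gen_then_date(const struct toy_commit *a,
				 const struct toy_commit *b)
{
	assert(a && b);
	if (a->generation > b->generation)
		return -1;
	if (a->generation < b->generation)
		return 1;
	if (a->date > b->date)
		return -1;
	if (a->date < b->date)
		return 1;
	return 0;
}

/*
 * With the ordering above, the head has the largest generation number
 * in the queue, so if it is below the cutoff the walk can stop.
 */
static int can_stop_walk(const struct toy_commit *head,
			 uint32_t min_generation)
{
	assert(head);
	return head->generation < min_generation;
}
```

A date-only ordering would not give this guarantee: a commit with a
newer date but a smaller generation number could sit at the head while
a commit at or above the cutoff is still waiting behind it.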

>
>> For a copy of the Linux repository, where HEAD is checked out at
>> v4.13~100, we get the following performance improvement for
>> 'git branch --contains' over the previous commit:
>>
>> Before: 0.21s
>> After:  0.13s
>> Rel %: -38%
> [...]
>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>   		if (flags == (PARENT1 | PARENT2)) {
>>   			if (!(commit->object.flags & RESULT)) {
>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>   			return NULL;
>>   	}
>>   
>> -	list = paint_down_to_common(one, n, twos);
>> +	list = paint_down_to_common(one, n, twos, 0);
>>   
>>   	while (list) {
>>   		struct commit *commit = pop_commit(&list);
>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>   			filled_index[filled] = j;
>>   			work[filled++] = array[j];
>>   		}
>> -		common = paint_down_to_common(array[i], filled, work);
>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>   		if (array[i]->object.flags & PARENT2)
>>   			redundant[i] = 1;
>>   		for (j = 0; j < filled; j++)
>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>   	if (commit->generation > min_generation)
>>   		return 0;
>>   
>> -	bases = paint_down_to_common(commit, nr_reference, reference);
>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
> Is it the only case where we would call paint_down_to_common() with
> non-zero last parameter?  Would we always use commit->generation where
> commit is the first parameter of paint_down_to_common()?
>
> If both are true and will remain true, then in my humble opinion it is
> not necessary to change the signature of this function.

We need to change the signature some way, but maybe the way I chose is 
not the best.

To elaborate: paint_down_to_common() is used for multiple purposes. The 
caller here that supplies 'commit->generation' uses it only to compute 
reachability (by testing if the flag PARENT2 exists on the commit, then 
clearing all flags). The other callers expect the full walk down to the 
common commits, and keep those PARENT1, PARENT2, and STALE flags for 
future use (such as reporting merge bases). Usually the call to 
paint_down_to_common() is followed by a revision walk that only halts 
when reaching root commits or commits with both PARENT1 and PARENT2 
flags on, so always short-circuiting on generations would break the 
functionality; this is confirmed by the t5318-commit-graph.sh.

An alternative to the signature change is to add a boolean parameter 
"use_cutoff" or something, that specifies "don't walk beyond the 
commit". This may give a clearer description of what it will do 
with the generation value, but since we are already performing 
generation comparisons before calling paint_down_to_common() I find this 
simple enough.

Thanks,
-Stolee


* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-19  0:02       ` Jakub Narebski
@ 2018-04-23 14:49         ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:49 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 8:02 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Most code paths load commits using lookup_commit() and then
>> parse_commit(). In some cases, including some branch lookups, the commit
>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>> favor of parse_commit_buffer().
>>
>> With generation numbers in the commit-graph, we need to ensure that any
>> commit that exists in the commit-graph file has its generation number
>> loaded.
> All right, that is nice explanation of the why behind this change.
>
>> Create new load_commit_graph_info() method to fill in the information
>> for a commit that exists only in the commit-graph file. Call it from
>> parse_commit_buffer() after loading the other commit information from
>> the given buffer. Only fill this information when specified by the
>> 'check_graph' parameter. This avoids duplicate work when we already
>> checked the graph in parse_commit_gently() or when simply checking the
>> buffer contents in check_commit().
> Couldn't this 'check_graph' parameter be a global variable similar to
> the 'commit_graph' variable?  Maybe I am not understanding it.

See the two callers at the bottom of the patch. They have different 
purposes: one needs to fill in a valid commit struct, the other needs to 
check the commit buffer is valid (then throws away the struct). They 
have different values for 'check_graph'. Also, in parse_commit_gently() 
we check parse_commit_in_graph() before we call parse_commit_buffer, so 
we do not want to repeat work; in the case of a valid commit-graph file, 
but the commit is not in the commit-graph, we would repeat our binary 
search for the same commit.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>>   commit-graph.h |  8 ++++++++
>>   commit.c       |  7 +++++--
>>   commit.h       |  2 +-
>>   object.c       |  2 +-
>>   sha1_file.c    |  2 +-
>>   6 files changed, 49 insertions(+), 23 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 688d5b1801..21e853c21a 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>>   	return &commit_list_insert(c, pptr)->next;
>>   }
>>   
>> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
>> +{
>> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>> +}
>> +
>>   static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>>   {
>>   	uint32_t edge_value;
>>   	uint32_t *parent_data_ptr;
>>   	uint64_t date_low, date_high;
>>   	struct commit_list **pptr;
>> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
>> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> I'm probably wrong, but isn't it unrelated change?

You're right. I saw this while I was in here, and there was a similar 
comment on this change in a different patch. Probably best to keep these 
cleanup things in a separate commit.

>>   	item->object.parsed = 1;
>>   	item->graph_pos = pos;
>> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>>   	return 1;
>>   }
>>   
>> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
>> +{
>> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> +		*pos = item->graph_pos;
>> +		return 1;
>> +	} else {
>> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);
>> +	}
>> +}
> All right (after the fix).
>
>> +
>>   int parse_commit_in_graph(struct commit *item)
>>   {
>> +	uint32_t pos;
>> +
>> +	if (item->object.parsed)
>> +		return 0;
>>   	if (!core_commit_graph)
>>   		return 0;
>> -	if (item->object.parsed)
>> -		return 1;
> Hmmm... previously the function returned 1 if item->object.parsed, now
> it returns 0 for this situation.  I don't understand this change.

The good news is that this change is unimportant (the only caller is 
parse_commit_gently() which checks item->object.parsed before calling 
parse_commit_in_graph()). I wonder why I reordered those things, anyway. 
I'll revert to simplify the patch.

>
>> -
>>   	prepare_commit_graph();
>> -	if (commit_graph) {
>> -		uint32_t pos;
>> -		int found;
>> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> -			pos = item->graph_pos;
>> -			found = 1;
>> -		} else {
>> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
>> -		}
>> -
>> -		if (found)
>> -			return fill_commit_in_graph(item, commit_graph, pos);
>> -	}
>> -
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		return fill_commit_in_graph(item, commit_graph, pos);
> Nice refactoring.
>
>>   	return 0;
>>   }
>>   
>> +void load_commit_graph_info(struct commit *item)
>> +{
>> +	uint32_t pos;
>> +	if (!core_commit_graph)
>> +		return;
>> +	prepare_commit_graph();
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		fill_commit_graph_info(item, commit_graph, pos);
>> +}
> And the reason for the refactoring.
>
>> +
>>   static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>>   {
>>   	struct object_id oid;
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 260a468e73..96cccb10f3 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>>    */
>>   int parse_commit_in_graph(struct commit *item);
>>   
>> +/*
>> + * It is possible that we loaded commit contents from the commit buffer,
>> + * but we also want to ensure the commit-graph content is correctly
>> + * checked and filled. Fill the graph_pos and generation members of
>> + * the given commit.
>> + */
>> +void load_commit_graph_info(struct commit *item);
>> +
>>   struct tree *get_commit_tree_in_graph(const struct commit *c);
>>   
>>   struct commit_graph {
>> diff --git a/commit.c b/commit.c
>> index a70f120878..9ef6f699bd 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>>   	return ret;
>>   }
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>>   {
>>   	const char *tail = buffer;
>>   	const char *bufptr = buffer;
>> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>>   	}
>>   	item->date = parse_commit_date(bufptr, tail);
>>   
>> +	if (check_graph)
>> +		load_commit_graph_info(item);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>>   		return error("Object %s not a commit",
>>   			     oid_to_hex(&item->object.oid));
>>   	}
>> -	ret = parse_commit_buffer(item, buffer, size);
>> +	ret = parse_commit_buffer(item, buffer, size, 0);
>>   	if (save_commit_buffer && !ret) {
>>   		set_commit_buffer(item, buffer, size);
>>   		return 0;
>> diff --git a/commit.h b/commit.h
>> index 64436ff44e..b5afde1ae9 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>>    */
>>   struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>>   int parse_commit_gently(struct commit *item, int quiet_on_missing);
>>   static inline int parse_commit(struct commit *item)
>>   {
>> diff --git a/object.c b/object.c
>> index e6ad3f61f0..efe4871325 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>>   	} else if (type == OBJ_COMMIT) {
>>   		struct commit *commit = lookup_commit(oid);
>>   		if (commit) {
>> -			if (parse_commit_buffer(commit, buffer, size))
>> +			if (parse_commit_buffer(commit, buffer, size, 1))
>>   				return NULL;
>>   			if (!get_cached_commit_buffer(commit, NULL)) {
>>   				set_commit_buffer(commit, buffer, size);
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 1b94f39c4c..0fd4f0b8b6 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>>   {
>>   	struct commit c;
>>   	memset(&c, 0, sizeof(c));
>> -	if (parse_commit_buffer(&c, buf, size))
>> +	if (parse_commit_buffer(&c, buf, size, 0))
>>   		die("corrupt commit");
>>   }


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
@ 2018-04-23 14:54       ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:54 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 8:04 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> -- >8 --
>>
>> This is one of several "small" patches that follow the serialized
>> Git commit graph patch (ds/commit-graph) and lazy-loading trees
>> (ds/lazy-load-trees).
>>
>> As described in Documentation/technical/commit-graph.txt, the generation
>> number of a commit is one more than the maximum generation number among
>> its parents (trivially, a commit with no parents has generation number
>> one). This section is expanded to describe the interaction with special
>> generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
>> file) and *_ZERO (commits in a commit-graph file written before generation
>> numbers were implemented).
>>
>> This series makes the computation of generation numbers part of the
>> commit-graph write process.
>>
>> Finally, generation numbers are used to order commits in the priority
>> queue in paint_down_to_common(). This allows a short-circuit mechanism
>> to improve performance of `git branch --contains`.
>>
>> Further, use generation numbers for 'git tag --contains', providing a
>> significant speedup (at least 95% for some cases).
>>
>> A more substantial refactoring of revision.c is required before making
>> 'git log --graph' use generation numbers effectively.
>>
>> This patch series is built on ds/lazy-load-trees.
>>
>> Derrick Stolee (9):
>>    commit: add generation number to struct commmit
> Nice and short patch. Looks good to me.
>
>>    commit-graph: compute generation numbers
> Another quite easy to understand patch. LGTM.
>
>>    commit: use generations in paint_down_to_common()
> Nice and short patch; minor typo in comment in code.
> Otherwise it looks good to me.
>
>>    commit-graph.txt: update design document
> I see that diagram got removed in this version; maybe it could be
> replaced with relationship table?
>
> Anyway, it looks good to me.

The diagrams and tables seemed to cause more confusion than clarity. I 
think the reader should create their own mental model from the 
definitions and description and we should avoid trying to make a summary.

>
>>    ref-filter: use generation number for --contains
> A question: what does performance look like after the change if the
> commit-graph is not available?

The performance issue is minor, but will be fixed in v4.

>
>>    commit: use generation numbers for in_merge_bases()
> Possible typo in the commit message, and stylistic inconsistency in
> in_merge_bases() - though actually more clear than existing code.
>
> Short, simple, and gives good performance improvements.
>
>>    commit: add short-circuit to paint_down_to_common()
> Looks good to me; ignore [mostly] what I have written in response to the
> patch in question.
>
>>    commit-graph: always load commit-graph information
> Looks all right; question: parameter or one more global variable.

I responded to say that the global variable approach is incorrect. 
Parameter is important to functionality and performance.

>
>>    merge: check config before loading commits
> This looks good to me.
>
>>   Documentation/technical/commit-graph.txt | 30 +++++--
>>   alloc.c                                  |  1 +
>>   builtin/merge.c                          |  5 +-
>>   commit-graph.c                           | 99 +++++++++++++++++++-----
>>   commit-graph.h                           |  8 ++
>>   commit.c                                 | 54 +++++++++++--
>>   commit.h                                 |  7 +-
>>   object.c                                 |  2 +-
>>   ref-filter.c                             | 23 +++++-
>>   sha1_file.c                              |  2 +-
>>   t/t5318-commit-graph.sh                  |  9 +++
>>   11 files changed, 199 insertions(+), 41 deletions(-)
>>
>>
>> base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-23 14:40         ` Derrick Stolee
@ 2018-04-23 21:38           ` Jakub Narebski
  2018-04-24 12:31             ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-23 21:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider,
	bmwill, gitster, sunshine, jonathantanmy

Derrick Stolee <stolee@gmail.com> writes:

> On 4/18/2018 7:19 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
> [...]
>>> [...], and this saves time during 'git branch --contains' queries
>>> that would otherwise walk "around" the commit we are inspecting.
>>>
>> If I understand the code properly, what happens is that we can now
>> short-circuit if all commits that are left are lower than the target
>> commit.
>>
>> This is because max-order priority queue is used: if the commit with
>> maximum generation number is below generation number of target commit,
>> then target commit is not reachable from any commit in the priority
>> queue (all of which have generation numbers less than or equal to that of
>> the commit at the head of the queue, i.e. all are at the same level or
>> deeper); compare what I
>> have written in [1]
>>
>> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/
>>
>> Do I have that right?  If so, it looks all right to me.
>
> Yes, the priority queue needs to compare via generation number first
> or there will be errors. This is why we could not use commit time
> before.

I was more concerned about getting the order in the priority queue right
(does it return the minimal or the maximal generation number?).

I understand that the cutoff could not be used without generation
numbers because of the possibility of clock skew - using cutoff on dates
could lead to wrong results.

>>> For a copy of the Linux repository, where HEAD is checked out at
>>> v4.13~100, we get the following performance improvement for
>>> 'git branch --contains' over the previous commit:
>>>
>>> Before: 0.21s
>>> After:  0.13s
>>> Rel %: -38%
>> [...]
>>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>>   		if (flags == (PARENT1 | PARENT2)) {
>>>   			if (!(commit->object.flags & RESULT)) {
>>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>>   			return NULL;
>>>   	}
>>>   -	list = paint_down_to_common(one, n, twos);
>>> +	list = paint_down_to_common(one, n, twos, 0);
>>>     	while (list) {
>>>   		struct commit *commit = pop_commit(&list);
>>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>>   			filled_index[filled] = j;
>>>   			work[filled++] = array[j];
>>>   		}
>>> -		common = paint_down_to_common(array[i], filled, work);
>>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>>   		if (array[i]->object.flags & PARENT2)
>>>   			redundant[i] = 1;
>>>   		for (j = 0; j < filled; j++)
>>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>>   	if (commit->generation > min_generation)
>>>   		return 0;
>>>   -	bases = paint_down_to_common(commit, nr_reference, reference);
>>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>>
>> Is it the only case where we would call paint_down_to_common() with
>> non-zero last parameter?  Would we always use commit->generation where
>> commit is the first parameter of paint_down_to_common()?
>>
>> If both are true and will remain true, then in my humble opinion it is
>> not necessary to change the signature of this function.
>
> We need to change the signature some way, but maybe the way I chose is
> not the best.

No, after taking a longer look I think the new signature is a good choice.

> To elaborate: paint_down_to_common() is used for multiple
> purposes. The caller here that supplies 'commit->generation' is used
> only to compute reachability (by testing if the flag PARENT2 exists on
> the commit, then clears all flags). The other callers expect the full
> walk down to the common commits, and keeps those PARENT1, PARENT2, and
> STALE flags for future use (such as reporting merge bases). Usually
> the call to paint_down_to_common() is followed by a revision walk that
> only halts when reaching root commits or commits with both PARENT1 and
> PARENT2 flags on, so always short-circuiting on generations would
> break the functionality; this is confirmed by the
> t5318-commit-graph.sh.

Right.

I have realized that just after sending the email.  I'm sorry about this.

>
> An alternative to the signature change is to add a boolean parameter
> "use_cutoff" or something, that specifies "don't walk beyond the
> commit". This may give a clearer description of what it will
> do with the generation value, but since we are already performing
> generation comparisons before calling paint_down_to_common() I find
> this simple enough.

Two things:

1. The signature proposed in the patch is more generic.  The cutoff does
   not need to be equal to the generation number of the commit, though
   currently it always is (in the one place the new mechanism is used).

   So now I think the new signature of paint_down_to_common() is all
   right as it is proposed here.

2. Given the way generation numbers are defined (with 0 being a special
   case, and generation numbers starting from 1 for parent-less commits)
   and the way they are compared (using strict comparison, to avoid
   having to special-case _ZERO, _MAX and _INFINITY generation numbers),
   a cutoff of 0 naturally means no cutoff.

   On the other hand, a cutoff of 0 can also be understood as meaning no
   cutoff as a special case.

   It could be made clearer by using (as I proposed elsewhere in this
   thread) a symbolic name for this no-cutoff case via a preprocessor
   constant or an enum, e.g. GENERATION_NO_CUTOFF:

    @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
      			return NULL;
      	}
      -	list = paint_down_to_common(one, n, twos);
    +	list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF);
        	while (list) {
      		struct commit *commit = pop_commit(&list);
    @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
      			filled_index[filled] = j;
      			work[filled++] = array[j];
      		}
    -		common = paint_down_to_common(array[i], filled, work);
    +		common = paint_down_to_common(array[i], filled, work, GENERATION_NO_CUTOFF);
      		if (array[i]->object.flags & PARENT2)
      			redundant[i] = 1;
      		for (j = 0; j < filled; j++)


   But whether it makes code more readable, or less readable, is a
   matter of opinion and taste.
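
To illustrate point 2, here is a minimal sketch of why a strict
comparison makes a cutoff of 0 behave as "no cutoff". The three
GENERATION_NUMBER_* values are the ones defined in this series;
GENERATION_NO_CUTOFF is only the name suggested above, and the toy_*
helpers are hypothetical and not part of git.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
#define GENERATION_NO_CUTOFF 0	/* hypothetical name suggested in this thread */

/* Nonzero when a walk with the given cutoff may stop at this generation. */
int toy_below_cutoff(uint32_t generation, uint32_t min_generation)
{
	return generation < min_generation;
}

/* With a cutoff of 0, no generation value ever ends the walk early. */
void toy_no_cutoff_examples(void)
{
	assert(!toy_below_cutoff(GENERATION_NUMBER_ZERO, GENERATION_NO_CUTOFF));
	assert(!toy_below_cutoff(1, GENERATION_NO_CUTOFF));
	assert(!toy_below_cutoff(GENERATION_NUMBER_MAX, GENERATION_NO_CUTOFF));
	assert(!toy_below_cutoff(GENERATION_NUMBER_INFINITY, GENERATION_NO_CUTOFF));
}
```

Because the generation is an unsigned 32-bit value, "generation < 0" can
never be true, so no extra branch is needed for the no-cutoff case.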

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-23 21:38           ` Jakub Narebski
@ 2018-04-24 12:31             ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-24 12:31 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider,
	bmwill, gitster, sunshine, jonathantanmy

On 4/23/2018 5:38 PM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> On 4/18/2018 7:19 PM, Jakub Narebski wrote:
>>> Derrick Stolee <dstolee@microsoft.com> writes:
>>>
>> [...]
>>>> [...], and this saves time during 'git branch --contains' queries
>>>> that would otherwise walk "around" the commit we are inspecting.
>>>>
>>> If I understand the code properly, what happens is that we can now
>>> short-circuit if all commits that are left are lower than the target
>>> commit.
>>>
>>> This is because max-order priority queue is used: if the commit with
>>> maximum generation number is below generation number of target commit,
>>> then target commit is not reachable from any commit in the priority
>>> queue (all of which have generation numbers less than or equal to that of
>>> the commit at the head of the queue, i.e. all are at the same level or
>>> deeper); compare what I
>>> have written in [1]
>>>
>>> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/
>>>
>>> Do I have that right?  If so, it looks all right to me.
>> Yes, the priority queue needs to compare via generation number first
>> or there will be errors. This is why we could not use commit time
>> before.
> I was more concerned about getting right the order in the priority queue
> (does it return minimal or maximal generation number).
>
> I understand that the cutoff could not be used without generation
> numbers because of the possibility of clock skew - using cutoff on dates
> could lead to wrong results.

Maximal generation number is important so we do not visit commits 
multiple times (say, once with PARENT1 set, and a second time when 
PARENT2 is set). A minimal generation number order would create a DFS 
order and walk until the cutoff every time.

In cases without clock skew, maximal generation number order will walk 
the same set of commits as maximal commit time; the order may differ, 
but only between incomparable commits.
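
The short-circuit discussed here can be sketched on a toy history. This
is only an illustration under stated assumptions, not the code in
commit.c: struct toy_commit, toy_history and the toy_* functions are
hypothetical, the "priority queue" is a linear scan for brevity, and the
cutoff is taken to be the generation number of the target commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR 8

/* Hypothetical toy commit: a generation number and up to two parents. */
struct toy_commit {
	uint32_t generation;
	int parents[2];	/* indices into toy_history, -1 when unused */
};

/*
 * Three branches fork from root 0; commit 6 merges two of them.
 * Commit 2 (generation 3) is NOT reachable from commit 7.
 */
static const struct toy_commit toy_history[TOY_NR] = {
	{ 1, { -1, -1 } },	/* 0: root */
	{ 2, {  0, -1 } },	/* 1 */
	{ 3, {  1, -1 } },	/* 2 */
	{ 2, {  0, -1 } },	/* 3 */
	{ 3, {  3, -1 } },	/* 4 */
	{ 2, {  0, -1 } },	/* 5 */
	{ 4, {  4,  5 } },	/* 6: merge */
	{ 5, {  6, -1 } },	/* 7 */
};

/* Remove and return the queued commit with the largest generation. */
static int toy_pop_max(int *queue, int *nr)
{
	int best = 0, i, result;

	for (i = 1; i < *nr; i++)
		if (toy_history[queue[i]].generation >
		    toy_history[queue[best]].generation)
			best = i;
	result = queue[best];
	queue[best] = queue[--(*nr)];
	return result;
}

/*
 * Walk from 'start' in maximal-generation order and report whether
 * 'target' is reachable. If 'visited' is non-NULL it receives the
 * number of commits examined before the walk ended.
 */
int toy_reachable(int start, int target, int *visited)
{
	int queue[TOY_NR], seen[TOY_NR] = { 0 };
	int nr = 0, count = 0, found = 0, i;
	uint32_t cutoff = toy_history[target].generation;

	seen[start] = 1;
	queue[nr++] = start;
	while (nr) {
		int cur = toy_pop_max(queue, &nr);

		/* Everything still queued has generation <= this one. */
		if (toy_history[cur].generation < cutoff)
			break;
		count++;
		if (cur == target) {
			found = 1;
			break;
		}
		for (i = 0; i < 2; i++) {
			int p = toy_history[cur].parents[i];

			if (p >= 0 && !seen[p]) {
				assert(nr < TOY_NR);	/* each commit is queued once */
				seen[p] = 1;
				queue[nr++] = p;
			}
		}
	}
	if (visited)
		*visited = count;
	return found;
}

/* Convenience wrapper: number of commits examined for a query. */
int toy_visited(int start, int target)
{
	int visited;

	toy_reachable(start, target, &visited);
	return visited;
}
```

For the query "is commit 2 reachable from commit 7", the walk examines
commits 7, 6 and 4 and stops as soon as a commit with generation 2 comes
off the queue; a walk without the cutoff would visit all six commits
reachable from 7 before answering "no".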

>>>> For a copy of the Linux repository, where HEAD is checked out at
>>>> v4.13~100, we get the following performance improvement for
>>>> 'git branch --contains' over the previous commit:
>>>>
>>>> Before: 0.21s
>>>> After:  0.13s
>>>> Rel %: -38%
>>> [...]
>>>>    		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>>>    		if (flags == (PARENT1 | PARENT2)) {
>>>>    			if (!(commit->object.flags & RESULT)) {
>>>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>>>    			return NULL;
>>>>    	}
>>>>    -	list = paint_down_to_common(one, n, twos);
>>>> +	list = paint_down_to_common(one, n, twos, 0);
>>>>      	while (list) {
>>>>    		struct commit *commit = pop_commit(&list);
>>>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>>>    			filled_index[filled] = j;
>>>>    			work[filled++] = array[j];
>>>>    		}
>>>> -		common = paint_down_to_common(array[i], filled, work);
>>>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>>>    		if (array[i]->object.flags & PARENT2)
>>>>    			redundant[i] = 1;
>>>>    		for (j = 0; j < filled; j++)
>>>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>>>    	if (commit->generation > min_generation)
>>>>    		return 0;
>>>>    -	bases = paint_down_to_common(commit, nr_reference, reference);
>>>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>>> Is it the only case where we would call paint_down_to_common() with
>>> non-zero last parameter?  Would we always use commit->generation where
>>> commit is the first parameter of paint_down_to_common()?
>>>
>>> If both are true and will remain true, then in my humble opinion it is
>>> not necessary to change the signature of this function.
>> We need to change the signature some way, but maybe the way I chose is
>> not the best.
> No, after taking longer I think the new signature is a good choice.
>
>> To elaborate: paint_down_to_common() is used for multiple
>> purposes. The caller here that supplies 'commit->generation' is used
>> only to compute reachability (by testing if the flag PARENT2 exists on
>> the commit, then clears all flags). The other callers expect the full
>> walk down to the common commits, and keeps those PARENT1, PARENT2, and
>> STALE flags for future use (such as reporting merge bases). Usually
>> the call to paint_down_to_common() is followed by a revision walk that
>> only halts when reaching root commits or commits with both PARENT1 and
>> PARENT2 flags on, so always short-circuiting on generations would
>> break the functionality; this is confirmed by the
>> t5318-commit-graph.sh.
> Right.
>
> I have realized that just after sending the email.  I'm sorry about this.
>
>> An alternative to the signature change is to add a boolean parameter
>> "use_cutoff" or something, that specifies "don't walk beyond the
>> commit". This may give a clearer description of what it will
>> do with the generation value, but since we are already performing
>> generation comparisons before calling paint_down_to_common() I find
>> this simple enough.
> Two things:
>
> 1. The signature proposed in the patch is more generic.  The cutoff does
>     not need to be equal to the generation number of the commit, though
>     currently it always (all of one time the new mechanism is used) is.
>
>     So now I think the new signature of paint_down_to_common() is all
>     right as it is proposed here.
>
> 2. The way generation numbers are defined (with 0 being a special case,
>     and generation numbers starting from 1 for parent-less commits), and
>     the way they are compared (using strict comparison, to avoid having
>     to special-case _ZERO, _MAX and _INFINITY generation numbers) the
>     cutoff of 0 means no cutoff.
>
>     On the other hand cutoff of 0 can be understood as meaning no cutoff
>     as a special case.
>
>     It could be made more clear to use (as I proposed elsewhere in this
>     thread) symbolic name for this no-cutoff case via preprocessor
>     constants or enums, e.g. GENERATION_NO_CUTOFF:
>
>      @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>        			return NULL;
>        	}
>        -	list = paint_down_to_common(one, n, twos);
>      +	list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF);
>          	while (list) {
>        		struct commit *commit = pop_commit(&list);
>      @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>        			filled_index[filled] = j;
>        			work[filled++] = array[j];
>        		}
>      -		common = paint_down_to_common(array[i], filled, work);
>      +		common = paint_down_to_common(array[i], filled, work, GENERATION_NO_CUTOFF);
>        		if (array[i]->object.flags & PARENT2)
>        			redundant[i] = 1;
>        		for (j = 0; j < filled; j++)
>
>
>     But whether it makes code more readable, or less readable, is a
>     matter of opinion and taste.
>

Since paint_down_to_common() is static to this file, I think 0 is 
cleaner. If the method was external and used by other .c files, then I 
would use this macro trick to clarify "what does this zero parameter mean?".

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-23 14:22         ` Derrick Stolee
@ 2018-04-24 18:56           ` Jakub Narebski
  2018-04-25 14:11             ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-24 18:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider,
	bmwill, gitster, sunshine, jonathantanmy

Derrick Stolee <stolee@gmail.com> writes:
> On 4/18/2018 5:02 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> A commit A can reach a commit B only if the generation number of A
>>> is larger than the generation number of B. This condition allows
>>> significantly short-circuiting commit-graph walks.
>>>
>>> Use generation number for 'git tag --contains' queries.
>>>
>>> On a copy of the Linux repository where HEAD is contained in v4.13
>>> but no earlier tag, the command 'git tag --contains HEAD' had the
>>> following performance improvement:
>>>
>>> Before: 0.81s
>>> After:  0.04s
>>> Rel %:  -95%
>>
>> A question: what is the performance after if the "commit-graph" feature
>> is disabled, or there is no commit-graph file?  Is there performance
>> regression in this case, or is the difference negligible?
>
> Negligible, since we are adding a small number of integer comparisons
> and the main cost is in commit parsing. More on commit parsing in
> response to your comments below.

If it is proven to be always negligible, then it's all right.  If it is
unlikely to be non-negligible, well, still O.K.  But I wonder if maybe
there is some situation where the cost of extra parsing is non-negligible.

[...]
>>>   @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>>   					      struct contains_cache *cache)
>>>   {
>>>   	struct contains_stack contains_stack = { 0, 0, NULL };
>>> -	enum contains_result result = contains_test(candidate, want, cache);
>>> +	enum contains_result result;
>>> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
>>> +	const struct commit_list *p;
>>> +
>>> +	for (p = want; p; p = p->next) {
>>> +		struct commit *c = p->item;
>>> +		parse_commit_or_die(c);
>>> +		if (c->generation < cutoff)
>>> +			cutoff = c->generation;
>>> +	}
>> Shouldn't the above be made conditional on the ability to get generation
>> numbers from the commit-graph file (feature is turned on and file
>> exists)?  Otherwise here after the change contains_tag_algo() now parses
>> each commit in 'want', which I think was not done previously.
>>
>> With the commit-graph file, parsing is [probably] cheap.  Without it,
>> not necessarily.
>>
>> But I might be worrying about nothing.
>
> Not nothing. This parses the "wants" when we previously did not parse
> the wants. Further: this parsing happens before we do the simple check
> of comparing the OID of the candidate against the wants.
>
> The question is: are these parsed commits significant compared to the
> walk that will parse many more commits? It is certainly possible.
>
> One way to fix this is to call 'prepare_commit_graph()' directly and
> then test that 'commit_graph' is non-null before performing any
> parses. I'm not thrilled with how that couples the commit-graph
> implementation to this feature, but that may be necessary to avoid
> regressions in the non-commit-graph case.

Another possible solution (not sure if better or worse) would be to
change the signature of contains_tag_algo() function to take parameter
or flag that would decide whether to parse "wants".

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-24 18:56           ` Jakub Narebski
@ 2018-04-25 14:11             ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:11 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider,
	bmwill, gitster, sunshine, jonathantanmy

On 4/24/2018 2:56 PM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> One way to fix this is to call 'prepare_commit_graph()' directly and
>> then test that 'commit_graph' is non-null before performing any
>> parses. I'm not thrilled with how that couples the commit-graph
>> implementation to this feature, but that may be necessary to avoid
>> regressions in the non-commit-graph case.
> Another possible solution (not sure if better or worse) would be to
> change the signature of contains_tag_algo() function to take parameter
> or flag that would decide whether to parse "wants".

If I reorder commits so "commit-graph: always load commit-graph 
information" is before this one, then we can call 
load_commit_graph_info() which just fills the generation and graph_pos 
information. This will keep the coupling very light, instead of needing 
to call prepare_commit_graph() or checking the config setting.
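
As a rough sketch of the cutoff logic under discussion, assume every
commit in "want" already has its generation field filled in (for example
by something like load_commit_graph_info()) and that commits outside the
commit-graph keep GENERATION_NUMBER_INFINITY. The toy_* types and
functions below are hypothetical stand-ins, not git's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Hypothetical stand-ins for struct commit and struct commit_list. */
struct toy_commit {
	uint32_t generation;
};

struct toy_commit_list {
	struct toy_commit *item;
	struct toy_commit_list *next;
};

/* Smallest generation among the wants; INFINITY for an empty list. */
uint32_t toy_min_generation(const struct toy_commit_list *want)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	const struct toy_commit_list *p;

	for (p = want; p; p = p->next) {
		assert(p->item != NULL);
		if (p->item->generation < cutoff)
			cutoff = p->item->generation;
	}
	return cutoff;
}

/* A candidate strictly below the cutoff cannot reach any of the wants. */
int toy_cannot_contain(uint32_t candidate_generation, uint32_t cutoff)
{
	return candidate_generation < cutoff;
}

/* Wants with generations 7, 4 and "not in the graph" give a cutoff of 4. */
uint32_t toy_example_cutoff(void)
{
	struct toy_commit a = { 7 }, b = { 4 }, c = { GENERATION_NUMBER_INFINITY };
	struct toy_commit_list lc = { &c, NULL };
	struct toy_commit_list lb = { &b, &lc };
	struct toy_commit_list la = { &a, &lb };

	return toy_min_generation(&la);
}
```

Note that when no commit-graph is available every want stays at
GENERATION_NUMBER_INFINITY, so the cutoff is INFINITY as well and the
strict comparison never prunes a candidate that is itself at INFINITY.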

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 00/10] Compute and consume generation numbers
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (9 preceding siblings ...)
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
@ 2018-04-25 14:37     ` " Derrick Stolee
  2018-04-25 14:37       ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
                         ` (11 more replies)
  10 siblings, 12 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

Thanks for the feedback on the previous version. I think this series is
stabilizing nicely. I'll reply to this message with an inter-diff as it
is not too large to share but would clutter this cover letter.

Thanks,
-Stolee

-- >8 --

This is one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains', providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is built on ds/lazy-load-trees.

Derrick Stolee (10):
  ref-filter: fix outdated comment on in_commit_list
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph: always load commit-graph information
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  merge: check config before loading commits
  commit-graph.txt: update design document

 Documentation/technical/commit-graph.txt | 30 ++++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  7 +-
 commit-graph.c                           | 92 ++++++++++++++++++++----
 commit-graph.h                           |  8 +++
 commit.c                                 | 54 +++++++++++---
 commit.h                                 |  7 +-
 object.c                                 |  2 +-
 ref-filter.c                             | 26 +++++--
 sha1_file.c                              |  2 +-
 t/t5318-commit-graph.sh                  |  9 +++
 11 files changed, 198 insertions(+), 40 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-28 17:54         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee
                         ` (10 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.

Reported-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..aff24d93be 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 }
 
 /*
- * Test whether the candidate or one of its parents is contained in the list.
+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 02/10] commit: add generation number to struct commmit
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
  2018-04-25 14:37       ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-28 22:35         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee
                         ` (9 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_MAX 0x3FFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.
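
As an illustration that is not part of this patch, the recursive
definition and the 30-bit storage limit can be sketched as follows. The
toy_* functions are hypothetical; clamping at GENERATION_NUMBER_MAX is an
assumption based on the description of _MAX above, and the sketch assumes
every parent already has a computed generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Generation of a commit given its parents' generations: one for a
 * root commit, otherwise one more than the largest parent generation.
 */
uint32_t toy_generation(const uint32_t *parent_generations, size_t nr_parents)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		/* Assumption: parents were computed before their children. */
		assert(parent_generations[i] != GENERATION_NUMBER_INFINITY);
		if (parent_generations[i] > max)
			max = parent_generations[i];
	}
	/* Assumed clamping: never exceed the largest storable value. */
	if (max >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max + 1;
}

/* Same shift as in fill_commit_in_graph(): keep the top 30 bits. */
uint32_t toy_decode_generation(uint32_t word)
{
	return word >> 2;
}

/* An octopus merge of parents with generations 3, 5 and 2. */
uint32_t toy_merge_example(void)
{
	const uint32_t parents[] = { 3, 5, 2 };

	return toy_generation(parents, 3);
}
```

Since only the top 30 bits of the 32-bit word hold the generation, the
largest value the shift can produce is 0x3FFFFFFF, which matches
GENERATION_NUMBER_MAX.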

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 	 */
 	struct tree *maybe_tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
  2018-04-25 14:37       ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
  2018-04-25 14:37       ` [PATCH v4 02/10] commit: add generation number to struct commit Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-26  2:35         ` Junio C Hamano
  2018-04-29  9:08         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee
                         ` (8 subsequent siblings)
  11 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
proportional to the number of new commits added to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.
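
For readers who want the shape of the walk without Git's list helpers,
here is a minimal standalone sketch. It assumes a hypothetical
toy_commit type, TOY_* stand-ins for the GENERATION_NUMBER_* macros,
and a fixed-size array as the stack; the patch itself uses
commit_list_insert() and pop_commit() on struct commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INFINITY 0xFFFFFFFFu	/* stand-in for GENERATION_NUMBER_INFINITY */
#define TOY_ZERO 0u			/* stand-in for GENERATION_NUMBER_ZERO */
#define TOY_MAX 0x3FFFFFFFu		/* stand-in for GENERATION_NUMBER_MAX */
#define TOY_STACK_MAX 64		/* arbitrary limit for this sketch */

/* Hypothetical toy commit; not Git's struct commit. */
struct toy_commit {
	struct toy_commit *parents[2];
	size_t nr_parents;
	uint32_t generation;
};

static int toy_uncomputed(const struct toy_commit *c)
{
	return c->generation == TOY_INFINITY || c->generation == TOY_ZERO;
}

/* Depth-first walk that only descends into uncomputed commits. */
static void toy_compute_generation(struct toy_commit *start)
{
	struct toy_commit *stack[TOY_STACK_MAX];
	size_t depth = 0;

	if (!toy_uncomputed(start))
		return;
	stack[depth++] = start;

	while (depth) {
		struct toy_commit *current = stack[depth - 1];
		uint32_t max_generation = 0;
		int all_parents_computed = 1;
		size_t i;

		for (i = 0; i < current->nr_parents; i++) {
			struct toy_commit *p = current->parents[i];

			if (toy_uncomputed(p)) {
				/* Handle this parent first, then revisit current. */
				all_parents_computed = 0;
				assert(depth < TOY_STACK_MAX);
				stack[depth++] = p;
				break;
			}
			if (p->generation > max_generation)
				max_generation = p->generation;
		}

		if (all_parents_computed) {
			current->generation = max_generation + 1;
			if (current->generation > TOY_MAX)
				current->generation = TOY_MAX;
			depth--;
		}
	}
}
```

A commit whose parents already carry generation numbers is finished in
one pass over its parent list, which is why the cost tracks the number
of commits that still need a value.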

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..047fa9fca5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
+			packedDate[0] |= htonl((*list)->generation << 2);
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+
+			if (current->generation > GENERATION_NUMBER_MAX)
+				current->generation = GENERATION_NUMBER_MAX;
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 04/10] commit: use generations in paint_down_to_common()
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (2 preceding siblings ...)
  2018-04-25 14:37       ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-26  3:22         ` Junio C Hamano
  2018-04-29 15:40         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee
                         ` (7 subsequent siblings)
  11 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.
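
The ordering can be tried out in isolation with a sketch like the
following. It assumes a hypothetical toy_commit holding only the two
fields the comparison reads, and it uses the two-argument comparator
that qsort() expects rather than the three-argument form used by
prio_queue in the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy commit with only the fields the comparison needs. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Higher generation first; equal generations fall back to newer date first. */
static int toy_cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation != b->generation)
		return a->generation < b->generation ? 1 : -1;
	if (a->date != b->date)
		return a->date < b->date ? 1 : -1;
	return 0;
}

/* Sort an array so that it reads in the order the queue would pop. */
static void toy_sort(struct toy_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), toy_cmp_gen_then_date);
}
```

Note that a commit with the "infinite" value 0xFFFFFFFF sorts ahead of
every commit with a stored generation number, regardless of its date.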

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 20 +++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	/* newer commits first */
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* use date as a heuristic when generations are equal */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 05/10] commit-graph: always load commit-graph information
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (3 preceding siblings ...)
  2018-04-25 14:37       ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-29 22:14         ` Jakub Narebski
  2018-04-29 22:18         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee
                         ` (6 subsequent siblings)
  11 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create a new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
 commit-graph.h |  8 ++++++++
 commit.c       |  7 +++++--
 commit.h       |  2 +-
 object.c       |  2 +-
 sha1_file.c    |  2 +-
 6 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	uint32_t edge_value;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
+{
+	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		*pos = item->graph_pos;
+		return 1;
+	} else {
+		return bsearch_graph(g, &(item->object.oid), pos);
+	}
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+	uint32_t pos;
+
 	if (!core_commit_graph)
 		return 0;
 	if (item->object.parsed)
 		return 1;
-
 	prepare_commit_graph();
-	if (commit_graph) {
-		uint32_t pos;
-		int found;
-		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-			pos = item->graph_pos;
-			found = 1;
-		} else {
-			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
-		}
-
-		if (found)
-			return fill_commit_in_graph(item, commit_graph, pos);
-	}
-
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		return fill_commit_in_graph(item, commit_graph, pos);
 	return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+	uint32_t pos;
+	if (!core_commit_graph)
+		return;
+	prepare_commit_graph();
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
 {
 	struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
+	if (check_graph)
+		load_commit_graph_info(item);
+
 	return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return error("Object %s not a commit",
 			     oid_to_hex(&item->object.oid));
 	}
-	ret = parse_commit_buffer(item, buffer, size);
+	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
 		set_commit_buffer(item, buffer, size);
 		return 0;
diff --git a/commit.h b/commit.h
index 64436ff44e..b5afde1ae9 100644
--- a/commit.h
+++ b/commit.h
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
  */
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
diff --git a/object.c b/object.c
index e6ad3f61f0..efe4871325 100644
--- a/object.c
+++ b/object.c
@@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (parse_commit_buffer(commit, buffer, size, 1))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
diff --git a/sha1_file.c b/sha1_file.c
index 1b94f39c4c..0fd4f0b8b6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
 {
 	struct commit c;
 	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(&c, buf, size))
+	if (parse_commit_buffer(&c, buf, size, 0))
 		die("corrupt commit");
 }
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 06/10] ref-filter: use generation number for --contains
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (4 preceding siblings ...)
  2018-04-25 14:37       ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-30 16:34         ` Jakub Narebski
  2018-04-25 14:37       ` [PATCH v4 07/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
                         ` (5 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
lets us short-circuit commit-graph walks significantly.
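
As a simplified sketch of that short-circuit (not the contains_tag_algo()
code itself), consider a plain recursive reachability test with a
generation cutoff. The toy_commit type, toy_can_reach() and the
"walked" counter are hypothetical and exist only for this example; the
real code also keeps a cache of earlier answers, which this sketch
leaves out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy commit; not Git's struct commit. */
struct toy_commit {
	const struct toy_commit *parents[2];
	size_t nr_parents;
	uint32_t generation;
};

/*
 * Return 1 if "from" can reach "target" by following parent links.
 * "walked" counts visited commits so the effect of the cutoff is visible.
 */
static int toy_can_reach(const struct toy_commit *from,
			 const struct toy_commit *target,
			 unsigned *walked)
{
	size_t i;

	(*walked)++;
	if (from == target)
		return 1;

	/* A commit with a smaller generation number cannot reach target. */
	if (from->generation < target->generation)
		return 0;

	for (i = 0; i < from->nr_parents; i++)
		if (toy_can_reach(from->parents[i], target, walked))
			return 1;
	return 0;
}
```

The cutoff uses a strict "less than", so commits whose generation
numbers are equal to the target's are still walked; only commits that
are provably too low are skipped.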

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following performance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index aff24d93be..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 static struct ref_msg {
 	const char *gone;
@@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate,
 
 	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		load_commit_graph_info(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (5 preceding siblings ...)
  2018-04-25 14:37       ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-25 14:37       ` Derrick Stolee
  2018-04-30 17:05         ` Jakub Narebski
  2018-04-25 14:38       ` [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
                         ` (4 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..7bb007f56a 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return ret;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (6 preceding siblings ...)
  2018-04-25 14:37       ` [PATCH v4 07/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-25 14:38       ` Derrick Stolee
  2018-04-30 22:19         ` Jakub Narebski
  2018-04-25 14:38       ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee
                         ` (3 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:38 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.
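
The effect of the cutoff, together with the ordering check that the
patch adds, can be sketched on a bare sequence of generation numbers.
Here toy_walk() is a hypothetical stand-in that takes the numbers in
the order the priority queue would pop them; it returns -1 at the
point where the patch would call BUG(), and otherwise the number of
commits processed before the walk stops.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_INFINITY 0xFFFFFFFFu	/* stand-in for GENERATION_NUMBER_INFINITY */

/*
 * Walk generation numbers in pop order. Stop at the first value below
 * min_generation; report -1 if the values ever increase.
 */
static int toy_walk(const uint32_t *gens, size_t nr, uint32_t min_generation)
{
	uint32_t last_gen = TOY_INFINITY;
	int processed = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (gens[i] > last_gen)
			return -1;	/* the patch calls BUG() here */
		last_gen = gens[i];

		if (gens[i] < min_generation)
			break;
		processed++;
	}
	return processed;
}
```

With min_generation set to 0 nothing is ever below the cutoff, so the
whole sequence is processed, which matches the "no functional change"
case described above.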

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 7bb007f56a..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+		last_gen = commit->generation;
+
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return ret;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 09/10] merge: check config before loading commits
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (7 preceding siblings ...)
  2018-04-25 14:38       ` [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-25 14:38       ` Derrick Stolee
  2018-04-30 22:54         ` Jakub Narebski
  2018-04-25 14:38       ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee
                         ` (2 subsequent siblings)
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:38 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 7 ++++---
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
-
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
 	argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 162+ messages in thread

* [PATCH v4 10/10] commit-graph.txt: update design document
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (8 preceding siblings ...)
  2018-04-25 14:38       ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-25 14:38       ` Derrick Stolee
  2018-04-30 23:32         ` Jakub Narebski
  2018-04-25 14:40       ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:38 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --------------
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



* Re: [PATCH v4 00/10] Compute and consume generation numbers
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (9 preceding siblings ...)
  2018-04-25 14:38       ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-25 14:40       ` Derrick Stolee
  2018-04-28 17:28         ` Jakub Narebski
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
  11 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-25 14:40 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: gitster, peff, jnareb, avarab

As promised, here is the diff from v3.

Thanks,
-Stolee

-- >8 --

diff --git a/builtin/merge.c b/builtin/merge.c
index 7e1da6c6ea..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)
         branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, 
NULL);
         if (branch)
                 skip_prefix(branch, "refs/heads/", &branch);
+
         init_diff_ui_defaults();
         git_config(git_merge_config, NULL);

@@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)
         else
                 head_commit = lookup_commit_or_die(&head_oid, "HEAD");

-
         if (branch_mergeoptions)
                 parse_branch_merge_options(branch_mergeoptions);
         argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
         uint32_t *parent_data_ptr;
         uint64_t date_low, date_high;
         struct commit_list **pptr;
-       const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+       const unsigned char *commit_data = g->chunk_commit_data + 
(g->hash_len + 16) * pos;

         item->object.parsed = 1;
         item->graph_pos = pos;
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
                 *pos = item->graph_pos;
                 return 1;
         } else {
-               return bsearch_graph(commit_graph, &(item->object.oid), 
pos);
+               return bsearch_graph(g, &(item->object.oid), pos);
         }
  }

@@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item)
  {
         uint32_t pos;

-       if (item->object.parsed)
-               return 0;
         if (!core_commit_graph)
                 return 0;
+       if (item->object.parsed)
+               return 1;
         prepare_commit_graph();
         if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
                 return fill_commit_in_graph(item, commit_graph, pos);
@@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile 
*f, int hash_len,
                 else
                         packedDate[0] = 0;

-               if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+               if ((*list)->generation != GENERATION_NUMBER_INFINITY)
                         packedDate[0] |= htonl((*list)->generation << 2);
-               }

                 packedDate[1] = htonl((*list)->date);
                 hashwrite(f, packedDate, 8);
diff --git a/commit.c b/commit.c
index 9ef6f699bd..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const 
void *a_, const void *b_, void
         else if (a->generation > b->generation)
                 return -1;

-       /* use date as a heuristic when generataions are equal */
+       /* use date as a heuristic when generations are equal */
         if (a->date < b->date)
                 return 1;
         else if (a->date > b->date)
@@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
         }

         if (commit->generation > min_generation)
-               return 0;
+               return ret;

         bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
         if (commit->object.flags & PARENT2)
diff --git a/ref-filter.c b/ref-filter.c
index e2fea6d635..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
  #include "trailer.h"
  #include "wt-status.h"
  #include "commit-slab.h"
+#include "commit-graph.h"

  static struct ref_msg {
         const char *gone;
@@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list 
*want, struct commit *c)
  }

  /*
- * Test whether the candidate or one of its parents is contained in the 
list.
+ * Test whether the candidate is contained in the list.
   * Do not recurse to find out, though, but return -1 if inconclusive.
   */
  static enum contains_result contains_test(struct commit *candidate,
@@ -1629,7 +1630,7 @@ static enum contains_result 
contains_tag_algo(struct commit *candidate,

         for (p = want; p; p = p->next) {
                 struct commit *c = p->item;
-               parse_commit_or_die(c);
+               load_commit_graph_info(c);
                 if (c->generation < cutoff)
                         cutoff = c->generation;
         }




* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-25 14:37       ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-26  2:35         ` Junio C Hamano
  2018-04-26 12:58           ` Derrick Stolee
  2018-04-29  9:08         ` Jakub Narebski
  1 sibling, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-26  2:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, jnareb, avarab

Derrick Stolee <dstolee@microsoft.com> writes:

> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  		else
>  			packedDate[0] = 0;
>  
> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +
>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);

The ones that have infinity are written as zero here.  The code that
reads the generation field off of a file in fill_commit_graph_info()
and fill_commit_in_graph() both leave such a record in file as-is,
so the reader of what we write out will think it is _ZERO, not _INF.

Not that it matters, as it seems that most of the code being added
by this series treat _ZERO and _INF more or less interchangeably.
But it does raise another question, i.e. do we need both _ZERO and
_INF, or is it sufficient to have just a single _UNKNOWN?

> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>  	}
>  }
>  
> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)
> +{
> +	int i;
> +	struct commit_list *list = NULL;
> +
> +	for (i = 0; i < nr_commits; i++) {
> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
> +			continue;
> +
> +		commit_list_insert(commits[i], &list);
> +		while (list) {
> +			struct commit *current = list->item;
> +			struct commit_list *parent;
> +			int all_parents_computed = 1;
> +			uint32_t max_generation = 0;
> +
> +			for (parent = current->parents; parent; parent = parent->next) {
> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
> +					all_parents_computed = 0;
> +					commit_list_insert(parent->item, &list);
> +					break;
> +				} else if (parent->item->generation > max_generation) {
> +					max_generation = parent->item->generation;
> +				}
> +			}
> +
> +			if (all_parents_computed) {
> +				current->generation = max_generation + 1;
> +				pop_commit(&list);
> +			}

If we haven't computed all parents' generations yet,
current->generation is undefined (or at least "left as
initialized"), so it does not make much sense to attempt to clip it
at _MAX at this point.  At least not yet.

IOW, shouldn't the following two lines be inside the "we now know
genno of all parents, so we can compute genno for commit" block
above?

> +			if (current->generation > GENERATION_NUMBER_MAX)
> +				current->generation = GENERATION_NUMBER_MAX;
> +		}
> +	}
> +}
> +
>  void write_commit_graph(const char *obj_dir,
>  			const char **pack_indexes,
>  			int nr_packs,
> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
>  	if (commits.nr >= GRAPH_PARENT_MISSING)
>  		die(_("too many commits to write graph"));
>  
> +	compute_generation_numbers(commits.list, commits.nr);
> +
>  	graph_name = get_commit_graph_filename(obj_dir);
>  	fd = hold_lock_file_for_update(&lk, graph_name, 0);


* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
  2018-04-25 14:37       ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-26  3:22         ` Junio C Hamano
  2018-04-26  9:02           ` Jakub Narebski
  2018-04-29 15:40         ` Jakub Narebski
  1 sibling, 1 reply; 162+ messages in thread
From: Junio C Hamano @ 2018-04-26  3:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, jnareb, avarab

Derrick Stolee <dstolee@microsoft.com> writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.

I suspect that my puzzlement may be coming from my not "getting"
what you meant by "closed under reachability", but could you also
explain how _INF and _ZERO interact with commits with normal
generation numbers?  I've always assumed that genno will be used
only when comparing two commits with valid genno and otherwise we'd
fall back to the traditional date based one, but...

> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	/* newer commits first */
> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;

... this does not check if a->generation is _ZERO or _INF.  

Both being _MAX is OK (the control will fall through and use the
dates below).  One being _MAX and the other being a normal value is
also OK (the above comparisons will declare the commit with _MAX is
farther than less-than-max one from a root).

Or is the assumption that if one has _ZERO, that must have come from
an ancient commit-graph file and none of the commits have anything
but _ZERO?

> +	/* use date as a heuristic when generations are equal */
> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;
> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>  {
> -	struct prio_queue queue = { compare_commits_by_commit_date };
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>  
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>  
>  LAST_ARG_MUST_BE_NULL
>  extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);


* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
  2018-04-26  3:22         ` Junio C Hamano
@ 2018-04-26  9:02           ` Jakub Narebski
  2018-04-28 14:38             ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-26  9:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason

Junio C Hamano <gitster@pobox.com> writes:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Define compare_commits_by_gen_then_commit_date(), which uses generation
>> numbers as a primary comparison and commit date to break ties (or as a
>> comparison when both commits do not have computed generation numbers).
>>
>> Since the commit-graph file is closed under reachability, we know that
>> all commits in the file have generation at most GENERATION_NUMBER_MAX
>> which is less than GENERATION_NUMBER_INFINITY.
>
> I suspect that my puzzlement may be coming from my not "getting"
> what you meant by "closed under reachability",

It means that if commit A is in the commit graph, then all of its
ancestors (all commits reachable from A) are also in the commit graph.

>                                                but could you also
> explain how _INF and _ZERO interact with commits with normal
> generation numbers?  I've always assumed that genno will be used
> only when comparing two commits with valid genno and otherwise we'd
> fall back to the traditional date based one, but...
>
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>> +{
>> +	const struct commit *a = a_, *b = b_;
>> +
>> +	/* newer commits first */
>> +	if (a->generation < b->generation)
>> +		return 1;
>> +	else if (a->generation > b->generation)
>> +		return -1;
>
> ... this does not check if a->generation is _ZERO or _INF.  
>
> Both being _MAX is OK (the control will fall through and use the
> dates below).  One being _MAX and the other being a normal value is
> also OK (the above comparisons will declare the commit with _MAX is
> farther than less-than-max one from a root).
>
> Or is the assumption that if one has _ZERO, that must have come from
> an ancient commit-graph file and none of the commits have anything
> but _ZERO?

There are a stronger and a weaker version of the negative-cut criterion
based on generation numbers.

The stronger criterion:

  if A != B and gen(A) <= gen(B), then A cannot reach B

The weaker criterion:

  if gen(A) < gen(B), then A cannot reach B


Because commit-graph is closed under reachability, this means that

  if A is in commit graph, and B is outside of it, then A cannot reach B

If A is in commit graph, then either _MAX >= gen(A) >= 1,
or gen(A) == _ZERO.  Because _INFINITY > _MAX > _ZERO, then we have

  if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY
  then A cannot reach B

which also fulfills the weaker criterion

  if gen(A) < gen(B), then A cannot reach B


If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY,
or if both A and B have gen(A) = gen(B) = _MAX,
or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO,
then we cannot say anything about reachability... and the weaker criterion
also says nothing about reachability.


Maybe the following ASCII table would make it clear.

             |                      gen(B)
             |            ................................ :::::::
gen(A)       | _INFINITY | _MAX     | larger   | smaller  | _ZERO
-------------+-----------+----------+----------+----------+--------
_INFINITY    | =         | >        | >        | >        | >
_MAX         | < Nn      | =        | >        | >        | >
larger       | < Nn      | < Nn     | = n      | >        | >
smaller      | < Nn      | < Nn     | < Nn     | = n      | >
_ZERO        | < Nn      | < Nn     | < Nn     | < Nn     | =

Here "n" denotes the stronger condition, and "N" denotes the weaker condition.
We have _INFINITY > _MAX > larger > smaller > _ZERO.
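
A minimal C sketch of the weaker criterion, as one reading of the table
above.  The macro values are the ones given in the design document; the
function name cannot_reach_weak() is hypothetical and is not part of the
series.  It only reports the negative case: a return value of 0 means
"generation numbers cannot decide", not "A reaches B".

```c
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX      0x3FFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

/*
 * Weaker negative-cut test (hypothetical helper).
 * Returns 1 when A provably cannot reach B, and 0 when the generation
 * numbers alone say nothing.  Because
 * _INFINITY > _MAX > normal values > _ZERO as plain unsigned integers,
 * no special-casing of the three macros is needed.
 */
static int cannot_reach_weak(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

With equal generation numbers (including both _INFINITY, both _MAX, or
both _ZERO) the function returns 0, which matches the "=" diagonal of the
table.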


NOTE however that it is a *tradeoff*.  Using the weaker criterion, with
strict inequality, means that we don't need to handle the _INFINITY, _MAX
and _ZERO corner cases in a special way; but it also means that we would
walk slightly more commits than if we used the stronger criterion, with
less-than-or-equal.

For the Linux kernel public repository commit graph[1] we have a maximum of
512 commits sharing the same level, 5.43 commits sharing the same level on
average, and 50% of the time only 2 commits sharing the same level (the
median, or 2nd quartile, or 50th percentile).  This is roughly the number
of extra commits we walk with the weaker cut-off condition.

[1]: with 750k commits, but which is not the largest commit graph any more :-0

>> +	/* use date as a heuristic when generations are equal */
>> +	if (a->date < b->date)
>> +		return 1;
>> +	else if (a->date > b->date)
>> +		return -1;
>> +	return 0;
>> +}

HTH
-- 
Jakub Narębski


* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-26  2:35         ` Junio C Hamano
@ 2018-04-26 12:58           ` Derrick Stolee
  2018-04-26 13:49             ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-26 12:58 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: git, peff, jnareb, avarab

On 4/25/2018 10:35 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>>   		else
>>   			packedDate[0] = 0;
>>   
>> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +
>>   		packedDate[1] = htonl((*list)->date);
>>   		hashwrite(f, packedDate, 8);
> The ones that have infinity are written as zero here.  The code that
> reads the generation field off of a file in fill_commit_graph_info()
> and fill_commit_in_graph() both leave such a record in file as-is,
> so the reader of what we write out will think it is _ZERO, not _INF.
>
> Not that it matters, as it seems that most of the code being added
> by this series treat _ZERO and _INF more or less interchangeably.
> But it does raise another question, i.e. do we need both _ZERO and
> _INF, or is it sufficient to have just a single _UNKNOWN?

This code is confusing. The 'if' condition is useless, since at this 
point every commit should be finite (since we computed generation 
numbers for everyone). We should just write the value always.

For the sake of discussion, the value _INFINITY means not in the graph 
and _ZERO means in the graph without a computed generation number. It's 
a small distinction, but it gives a single boundary to use for 
reachability queries that test generation number.
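
As a rough illustration of why only 30 bits are available, here is a
sketch of the 8-byte field that the quoted write_graph_chunk_data() hunk
fills in.  The helper names are hypothetical, and the layout of the low
two bits of the first word (assumed here to carry bits 32-33 of the
commit date) is an assumption that the quoted hunk does not show in full.

```c
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFFu

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32_sketch(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Load a 32-bit value stored in network byte order. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * First word: generation in the upper 30 bits, date bits 32-33 in the
 * lower 2 bits (assumed).  Second word: low 32 bits of the date.
 */
static void pack_gen_and_date(unsigned char out[8], uint32_t generation,
			      uint64_t date)
{
	uint32_t word0 = (uint32_t)((date >> 32) & 0x3);

	word0 |= generation << 2;
	put_be32_sketch(out, word0);
	put_be32_sketch(out + 4, (uint32_t)date);
}

static uint32_t unpack_generation(const unsigned char in[8])
{
	return get_be32_sketch(in) >> 2;
}

static uint64_t unpack_date(const unsigned char in[8])
{
	uint64_t high = get_be32_sketch(in) & 0x3;

	return (high << 32) | get_be32_sketch(in + 4);
}
```

In this sketch a generation above GENERATION_NUMBER_MAX loses its top
bits in the shift, so it would not survive the round trip; that is
consistent with clamping at _MAX before writing.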

>
>> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>>   	}
>>   }
>>   
>> +static void compute_generation_numbers(struct commit** commits,
>> +				       int nr_commits)
>> +{
>> +	int i;
>> +	struct commit_list *list = NULL;
>> +
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
>> +
>> +		commit_list_insert(commits[i], &list);
>> +		while (list) {
>> +			struct commit *current = list->item;
>> +			struct commit_list *parent;
>> +			int all_parents_computed = 1;
>> +			uint32_t max_generation = 0;
>> +
>> +			for (parent = current->parents; parent; parent = parent->next) {
>> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
>> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
>> +					all_parents_computed = 0;
>> +					commit_list_insert(parent->item, &list);
>> +					break;
>> +				} else if (parent->item->generation > max_generation) {
>> +					max_generation = parent->item->generation;
>> +				}
>> +			}
>> +
>> +			if (all_parents_computed) {
>> +				current->generation = max_generation + 1;
>> +				pop_commit(&list);
>> +			}
> If we haven't computed all parents' generations yet,
> current->generation is undefined (or at least "left as
> initialized"), so it does not make much sense to attempt to clip it
> at _MAX at this point.  At least not yet.
>
> IOW, shouldn't the following two lines be inside the "we now know
> genno of all parents, so we can compute genno for commit" block
> above?

You're right! Good catch. This code sets every merge commit to _MAX. It 
should be in the block above.

>
>> +			if (current->generation > GENERATION_NUMBER_MAX)
>> +				current->generation = GENERATION_NUMBER_MAX;
>> +		}
>> +	}
>> +}
>> +
>>   void write_commit_graph(const char *obj_dir,
>>   			const char **pack_indexes,
>>   			int nr_packs,
>> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
>>   	if (commits.nr >= GRAPH_PARENT_MISSING)
>>   		die(_("too many commits to write graph"));
>>   
>> +	compute_generation_numbers(commits.list, commits.nr);
>> +
>>   	graph_name = get_commit_graph_filename(obj_dir);
>>   	fd = hold_lock_file_for_update(&lk, graph_name, 0);


* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-26 12:58           ` Derrick Stolee
@ 2018-04-26 13:49             ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-26 13:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: git, peff, jnareb, avarab



On 4/26/2018 8:58 AM, Derrick Stolee wrote:
> On 4/25/2018 10:35 PM, Junio C Hamano wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct 
>>> hashfile *f, int hash_len,
>>>           else
>>>               packedDate[0] = 0;
>>>   +        if ((*list)->generation != GENERATION_NUMBER_INFINITY)
>>> +            packedDate[0] |= htonl((*list)->generation << 2);
>>> +
>>>           packedDate[1] = htonl((*list)->date);
>>>           hashwrite(f, packedDate, 8);
>> The ones that have infinity are written as zero here.  The code that
>> reads the generation field off of a file in fill_commit_graph_info()
>> and fill_commit_in_graph() both leave such a record in file as-is,
>> so the reader of what we write out will think it is _ZERO, not _INF.
>>
>> Not that it matters, as it seems that most of the code being added
>> by this series treat _ZERO and _INF more or less interchangeably.
>> But it does raise another question, i.e. do we need both _ZERO and
>> _INF, or is it sufficient to have just a single _UNKNOWN?
>
> This code is confusing. The 'if' condition is useless, since at this 
> point every commit should be finite (since we computed generation 
> numbers for everyone). We should just write the value always.
>
> For the sake of discussion, the value _INFINITY means not in the graph 
> and _ZERO means in the graph without a computed generation number. 
> It's a small distinction, but it gives a single boundary to use for 
> reachability queries that test generation number.
>
>>
>>> @@ -571,6 +574,46 @@ static void close_reachable(struct 
>>> packed_oid_list *oids)
>>>       }
>>>   }
>>>   +static void compute_generation_numbers(struct commit** commits,
>>> +                       int nr_commits)
>>> +{
>>> +    int i;
>>> +    struct commit_list *list = NULL;
>>> +
>>> +    for (i = 0; i < nr_commits; i++) {
>>> +        if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>>> +            commits[i]->generation != GENERATION_NUMBER_ZERO)
>>> +            continue;
>>> +
>>> +        commit_list_insert(commits[i], &list);
>>> +        while (list) {
>>> +            struct commit *current = list->item;
>>> +            struct commit_list *parent;
>>> +            int all_parents_computed = 1;
>>> +            uint32_t max_generation = 0;
>>> +
>>> +            for (parent = current->parents; parent; parent = 
>>> parent->next) {
>>> +                if (parent->item->generation == 
>>> GENERATION_NUMBER_INFINITY ||
>>> +                    parent->item->generation == 
>>> GENERATION_NUMBER_ZERO) {
>>> +                    all_parents_computed = 0;
>>> +                    commit_list_insert(parent->item, &list);
>>> +                    break;
>>> +                } else if (parent->item->generation > 
>>> max_generation) {
>>> +                    max_generation = parent->item->generation;
>>> +                }
>>> +            }
>>> +
>>> +            if (all_parents_computed) {
>>> +                current->generation = max_generation + 1;
>>> +                pop_commit(&list);
>>> +            }
>> If we haven't computed all parents' generations yet,
>> current->generation is undefined (or at least "left as
>> initialized"), so it does not make much sense to attempt to clip it
>> at _MAX at this point.  At least not yet.
>>
>> IOW, shouldn't the following two lines be inside the "we now know
>> genno of all parents, so we can compute genno for commit" block
>> above?
>
> You're right! Good catch. This code sets every merge commit to _MAX. 
> It should be in the block above.
>
>>
>>> +            if (current->generation > GENERATION_NUMBER_MAX)
>>> +                current->generation = GENERATION_NUMBER_MAX;
>>> +        }
>>> +    }

This bothered me: why didn't I catch a bug here? I rebased my "fsck" RFC 
onto this branch and it succeeded. Then, I realized that this does not 
actually write incorrect values, since we re-visit this commit again 
after we pop the stack down to this commit. However, there is a window in
the middle where the generation is set incorrectly (in memory), and a
later change could easily turn that into a real bug.

I'll stick the _MAX check in the if above to prevent confusion.
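
For illustration only, a self-contained sketch of the loop with the _MAX
clamp moved inside the "all parents computed" block.  It uses a
hypothetical toy_commit struct and a fixed-size array as the stack
instead of git's struct commit and commit_list, so it is not the code
from the series; the limits TOY_MAX_PARENTS and TOY_MAX_DEPTH are
arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX      0x3FFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

#define TOY_MAX_PARENTS 4	/* arbitrary limits for this sketch */
#define TOY_MAX_DEPTH   64

/* Hypothetical, simplified stand-in for git's struct commit. */
struct toy_commit {
	uint32_t generation;
	size_t nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

static int is_uncomputed(const struct toy_commit *c)
{
	return c->generation == GENERATION_NUMBER_INFINITY ||
	       c->generation == GENERATION_NUMBER_ZERO;
}

/* Returns 0 on success, -1 if history is deeper than TOY_MAX_DEPTH. */
static int compute_generation_numbers_sketch(struct toy_commit **commits,
					     size_t nr_commits)
{
	struct toy_commit *stack[TOY_MAX_DEPTH];
	size_t i;

	for (i = 0; i < nr_commits; i++) {
		size_t depth = 0;

		if (!is_uncomputed(commits[i]))
			continue;

		stack[depth++] = commits[i];
		while (depth) {
			struct toy_commit *current = stack[depth - 1];
			int all_parents_computed = 1;
			uint32_t max_generation = 0;
			size_t p;

			for (p = 0; p < current->nr_parents; p++) {
				struct toy_commit *parent = current->parents[p];

				if (is_uncomputed(parent)) {
					if (depth == TOY_MAX_DEPTH)
						return -1;
					/* visit this parent first, then retry */
					all_parents_computed = 0;
					stack[depth++] = parent;
					break;
				}
				if (parent->generation > max_generation)
					max_generation = parent->generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				/* clamp only once the value is actually known */
				if (current->generation > GENERATION_NUMBER_MAX)
					current->generation = GENERATION_NUMBER_MAX;
				depth--;
			}
		}
	}
	return 0;
}
```

With the clamp in this position, a commit that is still waiting on a
parent keeps its _INFINITY or _ZERO marker until its real value is
assigned.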

Thanks,
-Stolee



* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
  2018-04-26  9:02           ` Jakub Narebski
@ 2018-04-28 14:38             ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-28 14:38 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason

Jakub Narebski <jnareb@gmail.com> writes:
> Junio C Hamano <gitster@pobox.com> writes:
>> Derrick Stolee <dstolee@microsoft.com> writes:
[...]
>>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>>> +{
>>> +	const struct commit *a = a_, *b = b_;
>>> +
>>> +	/* newer commits first */
>>> +	if (a->generation < b->generation)
>>> +		return 1;
>>> +	else if (a->generation > b->generation)
>>> +		return -1;
>>
>> ... this does not check if a->generation is _ZERO or _INF.  
>>
>> Both being _MAX is OK (the control will fall through and use the
>> dates below).  One being _MAX and the other being a normal value is
>> also OK (the above comparisons will declare the commit with _MAX is
>> farther than less-than-max one from a root).
>>
>> Or is the assumption that if one has _ZERO, that must have come from
>> an ancient commit-graph file and none of the commits have anything
>> but _ZERO?
>
> There are a stronger and a weaker version of the negative-cut criterion
> based on generation numbers.
>
> The stronger criterion:
>
>   if A != B and gen(A) <= gen(B), then A cannot reach B
>
> The weaker criterion:
>
>   if gen(A) < gen(B), then A cannot reach B
>
>
> Because commit-graph is closed under reachability, this means that
>
>   if A is in commit graph, and B is outside of it, then A cannot reach B
>
> If A is in commit graph, then either _MAX >= gen(A) >= 1,
> or gen(A) == _ZERO.  Because _INFINITY > _MAX > _ZERO, then we have
>
>   if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY
>   then A cannot reach B
>
> which also fulfills the weaker criterion
>
>   if gen(A) < gen(B), then A cannot reach B
>
>
> If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY,
> or if both A and B have gen(A) = gen(B) = _MAX,
> or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO,
> then we cannot say anything about reachability... and the weaker criterion
> also says nothing about reachability.
>
>
> Maybe the following ASCII table would make it clear.
>
>              |                      gen(B)
>              |            ................................ :::::::
> gen(A)       | _INFINITY | _MAX     | larger   | smaller  | _ZERO
> -------------+-----------+----------+----------+----------+--------
> _INFINITY    | =         | >        | >        | >        | >
> _MAX         | < Nn      | =        | >        | >        | >
> larger       | < Nn      | < Nn     | = n      | >        | >
> smaller      | < Nn      | < Nn     | < Nn     | = n      | >
> _ZERO        | < Nn      | < Nn     | < Nn     | < Nn     | =
>
> Here "n" denotes the stronger condition, and "N" denotes the weaker condition.
> We have _INFINITY > _MAX > larger > smaller > _ZERO.
>
>
> NOTE however that it is a *tradeoff*.  Using the weaker criterion, with
> strict inequality, means that we don't need to handle the _INFINITY, _MAX
> and _ZERO corner cases in a special way; but it also means that we would
> walk slightly more commits than if we used the stronger criterion, with
> less-than-or-equal.

Actually, if we look at the table above, it turns out that we can use
the stronger version of the negative-cut criterion without special-casing
all the possible combinations.  Just use the stronger criterion in the
normal range, and the weaker criterion if either generation number is a
special generation number.

  if _MAX > gen(A) > _ZERO and
     _MAX > gen(B) > _ZERO then

    if A != B and gen(A) <= gen(B) then
      A cannot reach B
    else
      A may reach B

  else /* at least one special case */

    if gen(A) < gen(B) then
      A cannot reach B
    else
      A may reach B
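
The pseudo-code above could be sketched in C roughly as follows.  The
function names and the same_commit parameter are hypothetical; the
sketch returns 1 only when A provably cannot reach B, and 0 otherwise,
since generation numbers can rule reachability out but never confirm it.

```c
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX      0x3FFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

/* A "normal" generation number lies strictly between _ZERO and _MAX. */
static int is_normal_gen(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Returns 1 if A provably cannot reach B, 0 if undecided.
 * same_commit is nonzero when A and B are the same commit.
 */
static int cannot_reach_combined(uint32_t gen_a, uint32_t gen_b,
				 int same_commit)
{
	if (same_commit)
		return 0;

	/* stronger criterion: both numbers are fully computed */
	if (is_normal_gen(gen_a) && is_normal_gen(gen_b))
		return gen_a <= gen_b;

	/* weaker criterion: at least one special value is involved */
	return gen_a < gen_b;
}
```

Note that _INFINITY fails the is_normal_gen() test because it is not
below _MAX, so commits outside the commit-graph always take the weaker
branch.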


NOTE that this specifically does not matter for the
compare_commits_by_gen_then_commit_date() created here, as sorting
requires strict inequality.  Since it effectively uses the weaker
criterion, we don't need any special cases in the code here.
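
To make the ordering concrete, here is a small sketch of the same
comparison logic applied with qsort().  The struct dated_commit type and
the two-argument signature are assumptions made so the example is
self-contained; the function in the series takes a third, unused
argument because it is used with a prio_queue.

```c
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX      0x3FFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

/* Hypothetical stand-in holding only the two fields being compared. */
struct dated_commit {
	uint32_t generation;
	uint64_t date;
};

/* Higher generation first; newer date first when generations tie. */
static int compare_by_gen_then_date(const void *a_, const void *b_)
{
	const struct dated_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* use date as a heuristic when generations are equal */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```

Sorting a mixed array with this comparator places _INFINITY commits
first, then _MAX, then normal values in decreasing order, and _ZERO
last, without any explicit check for the special values.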

Best,
-- 
Jakub Narębski


* Re: [PATCH v4 00/10] Compute and consume generation numbers
  2018-04-25 14:40       ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee
@ 2018-04-28 17:28         ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-28 17:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee, git, gitster, peff, avarab

Derrick Stolee <stolee@gmail.com> writes:

> As promised, here is the diff from v3.

What is this strange string "       " in place of tabs in the interdiff?
" " here is Unicode Character 'NO-BREAK SPACE' (U+00A0).

Though it doesn't matter for viewing, my newsreader (Gnus from GNU
Emacs) thinks that it is worth notifying about when replying.

Also, it looks like at least in one place the diff got line-wrapped.

> Thanks,
> -Stolee
>
> -- >8 --
>
> diff --git a/builtin/merge.c b/builtin/merge.c
> index 7e1da6c6ea..b819756946 100644
> --- a/builtin/merge.c
> +++ b/builtin/merge.c
> @@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const
> char *prefix)
>         branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid,
> NULL);
>         if (branch)
>                 skip_prefix(branch, "refs/heads/", &branch);
> +
>         init_diff_ui_defaults();
>         git_config(git_merge_config, NULL);
>
> @@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const
> char *prefix)
>         else
>                 head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>
> -
>         if (branch_mergeoptions)
>                 parse_branch_merge_options(branch_mergeoptions);
>         argc = parse_options(argc, argv, prefix, builtin_merge_options,

Whitespace fixes, all right.

> diff --git a/commit-graph.c b/commit-graph.c
> index 21e853c21a..aebd242def 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit
> *item, struct commit_graph *g, uin
>         uint32_t *parent_data_ptr;
>         uint64_t date_low, date_high;
>         struct commit_list **pptr;
> -       const unsigned char *commit_data = g->chunk_commit_data +
> GRAPH_DATA_WIDTH * pos;
> +       const unsigned char *commit_data = g->chunk_commit_data +
> (g->hash_len + 16) * pos;
>
>         item->object.parsed = 1;
>         item->graph_pos = pos;

This was an accidental change in v3 (unrelated to the other changes in
the commit it was in).  Still, I wonder whether the symbolic constant
route would be better, though as a separate standalone commit.

> @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit
> *item, struct commit_graph *g, uin
>                 *pos = item->graph_pos;
>                 return 1;
>         } else {
> -               return bsearch_graph(commit_graph,
> &(item->object.oid), pos);
> +               return bsearch_graph(g, &(item->object.oid), pos);
>         }
>  }
>

Fixup for a commit, which was sent as a separate fixup email for v3.
All right.

Though I wonder if it wouldn't be better to name the global variable
'the_commit_graph' to avoid such errors in the future...

> @@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item)
>  {
>         uint32_t pos;
>
> -       if (item->object.parsed)
> -               return 0;
>         if (!core_commit_graph)
>                 return 0;
> +       if (item->object.parsed)
> +               return 1;
>         prepare_commit_graph();
>         if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>                 return fill_commit_in_graph(item, commit_graph, pos);

Fixed the accidental flip-flopping of the return value when
item->object.parsed is set.  I'd have to take a look at the actual
commits to say whether I think it is all right or not.

> @@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile
> *f, int hash_len,
>                 else
>                         packedDate[0] = 0;
>
> -               if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
> +               if ((*list)->generation != GENERATION_NUMBER_INFINITY)
>                         packedDate[0] |= htonl((*list)->generation << 2);
> -               }
>
>                 packedDate[1] = htonl((*list)->date);
>                 hashwrite(f, packedDate, 8);

Coding style change, to be more in line with CodingGuidelines, namely
that we usually do not use braces around a single-statement body of a
conditional.

All right.

> diff --git a/commit.c b/commit.c
> index 9ef6f699bd..e2e16ea1a7 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const
> void *a_, const void *b_, void
>         else if (a->generation > b->generation)
>                 return -1;
>
> -       /* use date as a heuristic when generataions are equal */
> +       /* use date as a heuristic when generations are equal */
>         if (a->date < b->date)
>                 return 1;
>         else if (a->date > b->date)

Fixed typo in comment.  All right.

> @@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit,
> int nr_reference, struct commit *
>         }
>
>         if (commit->generation > min_generation)
> -               return 0;
> +               return ret;
>
>         bases = paint_down_to_common(commit, nr_reference, reference,
> commit->generation);
>         if (commit->object.flags & PARENT2)

Unifies the way the result is returned (going back to the one used in
this fragment of the Git code before this commit).  Looks all right,
from what I remember.

> diff --git a/ref-filter.c b/ref-filter.c
> index e2fea6d635..fb35067fc9 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -16,6 +16,7 @@
>  #include "trailer.h"
>  #include "wt-status.h"
>  #include "commit-slab.h"
> +#include "commit-graph.h"
>
>  static struct ref_msg {
>         const char *gone;
> @@ -1629,7 +1630,7 @@ static enum contains_result
> contains_tag_algo(struct commit *candidate,
>
>         for (p = want; p; p = p->next) {
>                 struct commit *c = p->item;
> -               parse_commit_or_die(c);
> +               load_commit_graph_info(c);
>                 if (c->generation < cutoff)
>                         cutoff = c->generation;
>         }

Avoids a performance penalty when the commit-graph feature is not used
(or is turned off).  Looks good at first glance.

> @@ -1582,7 +1583,7 @@ static int in_commit_list(const struct
> commit_list *want, struct commit *c)
>  }
>
>  /*
> - * Test whether the candidate or one of its parents is contained in
> the list.
> + * Test whether the candidate is contained in the list.
>   * Do not recurse to find out, though, but return -1 if inconclusive.
>   */
>  static enum contains_result contains_test(struct commit *candidate,

Bringing comment in line with the function it is about.  Good.

Best,
-- 
Jakub Narębski

* Re: [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list
  2018-04-25 14:37       ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
@ 2018-04-28 17:54         ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-28 17:54 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, avarab

Derrick Stolee <dstolee@microsoft.com> writes:

> The in_commit_list() method does not check the parents of
> the candidate for containment in the list. Fix the comment
> that incorrectly states that it does.
>
> Reported-by: Jakub Narebski <jnareb@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  ref-filter.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index cffd8bf3ce..aff24d93be 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  }
>  
>  /*
> - * Test whether the candidate or one of its parents is contained in the list.
> + * Test whether the candidate is contained in the list.
>   * Do not recurse to find out, though, but return -1 if inconclusive.
>   */
>  static enum contains_result contains_test(struct commit *candidate,

All right. Always good to have comment and code match.


FYI: the contains_test() function described in this comment only checks
the candidate, and never accesses the candidate commit's parents.  All
recursion, which naturally includes checking parents, is in
contains_tag_algo().

I guess that the code was refactored, but the comment was not updated.

-- 
Jakub Narębski

* Re: [PATCH v4 02/10] commit: add generation number to struct commmit
  2018-04-25 14:37       ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-28 22:35         ` Jakub Narebski
  2018-04-30 12:05           ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-28 22:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.

Very minor nitpick: it would be more readable wrapped differently:

  * If a commit A has parents, then the generation number of A is
    one more than the maximum generation number among parents of A.

Very minor nitpick: possibly "parents", not "the parents", but I am
not a native English speaker.

>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use three special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> GENERATION_NUMBER_MAX 0x3FFFFFFF
> GENERATION_NUMBER_ZERO 0
>
> The first (_INFINITY) means the generation number has not been loaded or
> computed. The second (_MAX) means the generation number is too large to
> store in the commit-graph file. The third (_ZERO) means the generation
> number was loaded from a commit graph file that was written by a version
> of git that did not support generation numbers.

Good explanation; I wonder if we want to have it in some shortened form
also in comments, and not only in the commit message.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  alloc.c        | 1 +
>  commit-graph.c | 2 ++
>  commit.h       | 4 ++++
>  3 files changed, 7 insertions(+)

I have reordered the hunks of this patch to make it easier to review.

> diff --git a/commit.h b/commit.h
> index 23a3f364ed..aac3b8c56f 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -10,6 +10,9 @@
>  #include "pretty.h"
>  
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> +#define GENERATION_NUMBER_MAX 0x3FFFFFFF
> +#define GENERATION_NUMBER_ZERO 0

I wonder if it wouldn't be good to have some short in-line comments
explaining those constants, or a block comment above them.

>  
>  struct commit_list {
>  	struct commit *item;
> @@ -30,6 +33,7 @@ struct commit {
>  	 */
>  	struct tree *maybe_tree;
>  	uint32_t graph_pos;
> +	uint32_t generation;
>  };
>  
>  extern int save_commit_buffer;

All right, simple addition of the new field.  Nothing to go wrong here.

Sidenote: With 0x7FFFFFFF being (if I am not wrong) the maximum
graph_pos and the maximum number of nodes in the commit graph, we won't
hit the 0x3FFFFFFF generation number limit except for very, very linear
histories.

>
> diff --git a/alloc.c b/alloc.c
> index cf4f8b61e1..e8ab14f4a1 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -94,6 +94,7 @@ void *alloc_commit_node(void)
>  	c->object.type = OBJ_COMMIT;
>  	c->index = alloc_commit_index();
>  	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
> +	c->generation = GENERATION_NUMBER_INFINITY;
>  	return c;
>  }

All right, start by initializing it to the "not from commit-graph" value
after allocation.

>  
> diff --git a/commit-graph.c b/commit-graph.c
> index 70fa1b25fd..9ad21c3ffb 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +

I guess we should not worry about these "magical constants" sprinkled
here, like "+ 8" above.

Let's check how it goes against
Documentation/technical/commit-graph-format.txt:

 * The first H (g->hash_len) bytes are for the OID of the root tree.
 * The next 8 bytes are for the positions of the first two parents [...]

So 'commit_data + g->hash_len + 8' is our offset from the start of
commit data.  All right.

  * The next 8 bytes store the generation number of the commit and
    the commit time in seconds since EPOCH.  The generation number
    uses the higher 30 bits of the first 4 bytes. [...]

The generation number sits in the higher 30 bits of the 4-byte (32-bit)
value, so we need to shift that value right by 2 bits to get it as the
lower 30 bits.  All right.

  All 4-byte numbers are in network order.

Shouldn't it be ntohl() to convert from network order to host order, and
not get_be32()?  I guess they give the same result (network order is
big-endian order), and get_be32() is what the rest of Git uses...

Looks all right.
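
To make the offset and shift arithmetic concrete, here is a small
self-contained sketch.  It is not Git's code: be32_at() is a stand-in
written in the spirit of get_be32(), and record_generation() is a
hypothetical helper; both names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise big-endian read of 4 bytes (stand-in for get_be32()). */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: extract the generation number from one commit
 * data record.  The record starts with the root tree OID (hash_len
 * bytes), followed by 8 bytes of parent positions; the next 4 bytes
 * hold the generation number in their higher 30 bits.
 */
static uint32_t record_generation(const unsigned char *record,
				  size_t hash_len)
{
	assert(record != NULL);
	return be32_at(record + hash_len + 8) >> 2;
}
```

With a 20-byte hash, a record whose bytes 28..31 are 00 00 00 17 carries
generation (0x17 >> 2) = 5; the low two bits (here 3) belong to the date.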

>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);

* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-25 14:37       ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee
  2018-04-26  2:35         ` Junio C Hamano
@ 2018-04-29  9:08         ` Jakub Narebski
  2018-05-01 12:10           ` Derrick Stolee
  1 sibling, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-29  9:08 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, avarab

Derrick Stolee <dstolee@microsoft.com> writes:

> While preparing commits to be written into a commit-graph file, compute
> the generation numbers using a depth-first strategy.

Sidenote: for generation numbers it does not matter if we use
depth-first or breadth-first strategy, but it is more natural to use
depth-first search because generation numbers need post-order processing
(parents before child).

>
> The only commits that are walked in this depth-first search are those
> without a precomputed generation number. Thus, computation time will be
> relative to the number of new commits to the commit-graph file.

A question: what happens if the existing commit graph is from an older
version of Git and has _ZERO for generation numbers?

Answer: I see that we treat both _INFINITY (not in commit-graph) and
_ZERO (in commit graph but not computed) as not computed generation
numbers.  All right.

>
> If a computed generation number would exceed GENERATION_NUMBER_MAX, then
> use GENERATION_NUMBER_MAX instead.

All right, though I guess this would remain theoretical for a long
while.

We don't have any way of testing this, at least not without recompiling
Git with a lower value of GENERATION_NUMBER_MAX -- which means not
automatically, doesn't it?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 9ad21c3ffb..047fa9fca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  		else
>  			packedDate[0] = 0;
>  
> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +

If we stumble upon a commit marked as "not in commit-graph" while
writing the commit graph, it is a BUG(), isn't it?

(Problem noticed by Junio.)

It is a bit strange to me that the code uses get_be32 for reading, but
htonl for writing.  Is Git tested on non little-endian machines, like
big-endian ppc64 or s390x, or on mixed-endian machines (or
selectable-endian machines with data endianness set to non
little-endian, like ia64)?  If not, could we use for example openSUSE
Build Service (https://build.opensuse.org/) for this?
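
As a sanity check of the read/write asymmetry, the sketch below packs a
word with htonl() the way the write side appears to, and reads it back
byte-wise.  It is only an illustration: store_generation_word() and
load_be32() are hypothetical names, and the date handling (top two bits
of a 34-bit date in the low bits of the word) is my assumption based on
the format description.

```c
#include <arpa/inet.h>	/* htonl(), POSIX */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise big-endian read of 4 bytes (stand-in for get_be32()). */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper mirroring the write side: the top two bits of
 * a 34-bit date go into the low two bits of the word, the generation
 * number into the higher 30 bits, and the word is stored in network
 * (big-endian) byte order.
 */
static void store_generation_word(unsigned char out[4], uint64_t date,
				  uint32_t generation)
{
	uint32_t word;

	assert(generation <= 0x3FFFFFFF);
	word = htonl((uint32_t)((date >> 32) & 0x3));
	/* OR-ing two byte-swapped values equals byte-swapping the OR. */
	word |= htonl(generation << 2);
	memcpy(out, &word, sizeof(word));
}
```

Because htonl() always yields the big-endian byte layout in memory, the
byte-wise reader recovers the same values on little-endian and
big-endian hosts alike.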

>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
>  
> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>  	}
>  }
>  
> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)
> +{
> +	int i;
> +	struct commit_list *list = NULL;

All right, commit_list will work as stack.

> +
> +	for (i = 0; i < nr_commits; i++) {
> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
> +			continue;

All right, we consider _INFINITY and _ZERO as not computed.  If the
generation number is computed (by 'recursion' or from the commit graph),
we (re)use it.  This means that generation number calculation is
incremental, as intended -- good.

> +
> +		commit_list_insert(commits[i], &list);

Start depth-first walks from commits given.

> +		while (list) {
> +			struct commit *current = list->item;
> +			struct commit_list *parent;
> +			int all_parents_computed = 1;

Here all_parents_computed is a boolean flag.  I see that it is easier to
start with the assumption that all parents will have computed generation
numbers.

> +			uint32_t max_generation = 0;

The generation number value of 0 functions as a sentinel; generation
numbers start from 1.  Not that it matters much, as the lowest possible
generation number is 1, and we could have started from that value.

> +
> +			for (parent = current->parents; parent; parent = parent->next) {
> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
> +					all_parents_computed = 0;
> +					commit_list_insert(parent->item, &list);
> +					break;

If some parent doesn't have its generation number calculated, we add it
to the stack (and break out of the loop because it is a depth-first
walk), and mark this situation.  All right.

> +				} else if (parent->item->generation > max_generation) {
> +					max_generation = parent->item->generation;

Otherwise, update max_generation.  All right.

> +				}
> +			}
> +
> +			if (all_parents_computed) {
> +				current->generation = max_generation + 1;
> +				pop_commit(&list);
> +			}
> +
> +			if (current->generation > GENERATION_NUMBER_MAX)
> +				current->generation = GENERATION_NUMBER_MAX;

This conditional should be inside all_parents_computed test, for example
like this:

  +			if (all_parents_computed) {
  +				current->generation = max_generation + 1;
  +				if (current->generation > GENERATION_NUMBER_MAX)
  +					current->generation = GENERATION_NUMBER_MAX;
  +
  +				pop_commit(&list);
  +			}

(Noticed by Junio.)

Sidenote: when we revisit the commit, returning from depth-first walk of
one of its parents, we calculate max_generation from scratch again.
This does not matter for performance, as it's just data access and
calculating maximum - any workaround to not restart those calculations
would take more time and memory.  And it's simple.
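
For reference, here is how the whole walk looks on a toy graph, with the
clamp moved inside the all-parents-computed branch as suggested above.
This is a sketch, not the patch: struct node, the GEN_* names and the
array-based stack are assumptions made to keep the example
self-contained, and it assumes that every ancestor without a computed
generation number is itself present in nodes[] (the set is closed under
reachability).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GEN_INFINITY 0xFFFFFFFFu	/* not loaded or computed */
#define GEN_MAX      0x3FFFFFFFu	/* largest storable value */
#define GEN_ZERO     0u			/* written by an older version */

/* Toy stand-in for struct commit. */
struct node {
	struct node **parents;
	int nr_parents;
	uint32_t generation;
};

static int gen_is_computed(const struct node *n)
{
	return n->generation != GEN_INFINITY && n->generation != GEN_ZERO;
}

static void compute_generations(struct node **nodes, int nr)
{
	/* The stack holds one path in a DAG, so nr slots are enough. */
	struct node **stack = malloc(sizeof(*stack) * (nr > 0 ? nr : 1));
	int i, top = 0;

	assert(stack != NULL);
	for (i = 0; i < nr; i++) {
		if (gen_is_computed(nodes[i]))
			continue;
		stack[top++] = nodes[i];
		while (top) {
			struct node *cur = stack[top - 1];
			uint32_t max_gen = 0;
			int all_parents_computed = 1, p;

			for (p = 0; p < cur->nr_parents; p++) {
				struct node *par = cur->parents[p];
				if (!gen_is_computed(par)) {
					/* Descend first, revisit cur later. */
					all_parents_computed = 0;
					stack[top++] = par;
					break;
				}
				if (par->generation > max_gen)
					max_gen = par->generation;
			}
			if (all_parents_computed) {
				cur->generation = max_gen + 1;
				if (cur->generation > GEN_MAX)
					cur->generation = GEN_MAX;
				top--;
			}
		}
	}
	free(stack);
}
```

On a root commit with child a, grandchild b, and a merge m of the root
and b, this assigns 1, 2, 3 and 4 respectively; a child of a commit
already at GEN_MAX stays clamped at GEN_MAX.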

> +		}
> +	}
> +}
> +
>  void write_commit_graph(const char *obj_dir,
>  			const char **pack_indexes,
>  			int nr_packs,
> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
>  	if (commits.nr >= GRAPH_PARENT_MISSING)
>  		die(_("too many commits to write graph"));
>  
> +	compute_generation_numbers(commits.list, commits.nr);
> +

Nice and simple.  All right.

I guess that we pass "struct commit** commits.list" and "int commits.nr"
to compute_generation_numbers(), instead of passing "struct
packed_commit_list commits" as a single argument, to keep the function
nice and generic?

>  	graph_name = get_commit_graph_filename(obj_dir);
>  	fd = hold_lock_file_for_update(&lk, graph_name, 0);

Best,
-- 
Jakub Narębski

* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
  2018-04-25 14:37       ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee
  2018-04-26  3:22         ` Junio C Hamano
@ 2018-04-29 15:40         ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-29 15:40 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).

All right, this looks like a reasonable thing to do when we have access
to commit generation numbers.

> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.

Thus the condition "if B is reachable from A, then gen(A) >= gen(B)"
holds even if they have generation numbers _INFINITY, _MAX or _ZERO.

We use generation numbers, if possible, to choose closest commit; if
not, we use dates.

>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.

Actually the ordering of commits walked does not affect the correctness
of the result.  Better ordering means that commits do not need to be
walked twice; I think it would be possible to craft a repository in
which unlucky clock skew would lead to a depth-first walk of commits
that a later part of the walk would then mark STALE.

Pedantry aside, I think it is a good description of the effects of the
change.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 20 +++++++++++++++++++-
>  commit.h |  1 +
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 711f674c18..4d00b0a1d6 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
>  	return 0;
>  }
>  
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	/* newer commits first */

To be pedantic, a larger generation number does not necessarily mean
that the commit was created later (is newer), only that it is on a
longer chain since a common ancestor or root commit.

> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;

If the commit-graph feature is not available, or is disabled, all
commits would have the same generation number (_INFINITY), and this
block would always be practically a no-op.

This is not very costly: 2 accesses to data which should be in cache, and
1 to 2 comparison operations.  But I wonder if we wouldn't want to avoid
this nano-cost if possible...

> +
> +	/* use date as a heuristic when generations are equal */
> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;

The above is the same code as in compare_commits_by_commit_date(), but
there it is with "newer commits with larger date first" as comment
instead.

All right: we need inlining for speed.

> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>  {
> -	struct prio_queue queue = { compare_commits_by_commit_date };
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };

I wonder if it would be worth it to avoid comparing by generation
numbers without commit graph data:

  +	struct prio_queue queue;
  [...]
  +	if (commit_graph)
  +		queue.compare = compare_commits_by_gen_then_commit_date;
  +	else
  +		queue.compare = compare_commits_by_commit_date;

Or something like that.  But perhaps this nano-optimization is not worth
it (it is not that complicated, though).
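
Spelled out on a toy type, the idea would look roughly like the sketch
below.  Everything here is hypothetical (struct toy_commit,
toy_compare_fn, pick_compare()); it only shows choosing the comparison
function once, up front, so that the generation check is skipped when
there is no commit-graph data.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the two struct commit fields that matter here. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

typedef int (*toy_compare_fn)(const void *a_, const void *b_, void *unused);

/* Larger date sorts first. */
static int cmp_dates(const struct toy_commit *a, const struct toy_commit *b)
{
	assert(a && b);
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

static int toy_compare_by_date(const void *a_, const void *b_, void *unused)
{
	(void)unused;
	return cmp_dates(a_, b_);
}

static int toy_compare_by_gen_then_date(const void *a_, const void *b_,
					void *unused)
{
	const struct toy_commit *a = a_, *b = b_;

	(void)unused;
	/* Larger generation number sorts first. */
	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;
	/* Use date as a heuristic when generations are equal. */
	return cmp_dates(a, b);
}

/* Hypothetical: pick the comparison once, before filling the queue. */
static toy_compare_fn pick_compare(int have_commit_graph)
{
	return have_commit_graph ? toy_compare_by_gen_then_date
				 : toy_compare_by_date;
}
```

Whether saving one or two integer comparisons per queue operation is
measurable at all would need benchmarking.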


Sidenote: when I searched for compare_commits_by_commit_date use, I have
noticed that it is used, I think as heuristics, for packfile creation in
upload-pack.c and fetch-pack.c.  Would they similarly improve with
compare_commits_by_gen_then_commit_date?

This is of course not something that this commit, or this patch series,
needs to address now.

>  	struct commit_list *result = NULL;
>  	int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>  
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);

All right.

>  
>  LAST_ARG_MUST_BE_NULL
>  extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);

* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
  2018-04-25 14:37       ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-29 22:14         ` Jakub Narebski
  2018-05-01 12:19           ` Derrick Stolee
  2018-04-29 22:18         ` Jakub Narebski
  1 sibling, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-29 22:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> Most code paths load commits using lookup_commit() and then
> parse_commit().

And this automatically loads commit graph if needed, thanks to changes
in parse_commit_gently(), which parse_commit() uses.

>                 In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().

I guess the problem is that we cannot just add parse_commit_in_graph()
there like we did in parse_commit_gently(), for some reason?  For
example, parse_commit_gently() uses parse_commit_buffer(), so this could
have been handled by moving parse_commit_in_graph() down the call chain
from parse_commit_gently() to parse_commit_buffer()... if not for the
fact that check_commit() also uses parse_commit_buffer(), but does not
want to load the commit graph.  Am I right?

>
> With generation numbers in the commit-graph, we need to ensure that any
> commit that exists in the commit-graph file has its generation number
> loaded.

Is it generation number, or generation number and position in commit
graph?

>
> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file. Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter.

I think this commit would be easier to review if it was split into a
pure refactoring part (extracting fill_commit_graph_info() and
find_commit_in_graph()) and the actual change.  On the other hand the
refactoring was needed to reduce code duplication between the existing
parse_commit_in_graph() and the new load_commit_graph_info() functions.

I guess that the difference between parse_commit_in_graph() and
load_commit_graph_info() is that the former cares only about having just
enough information that is needed for parse_commit_gently() - and does
not load graph data if commit is parsed, while the latter is about
loading commit-graph data like generation numbers.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
>  commit-graph.h |  8 ++++++++
>  commit.c       |  7 +++++--
>  commit.h       |  2 +-
>  object.c       |  2 +-
>  sha1_file.c    |  2 +-
>  6 files changed, 46 insertions(+), 20 deletions(-)

I wonder if it would be possible to add tests for this feature, for
example that commit-graph is read when it should (including those branch
lookups), and is not read when the feature should be disabled.

But the only way to test it I can think of is a stupid one: create an
invalid commit graph, and check that Git fails as expected (trying to
read said malformed file), and does not fail if the commit graph feature
is disabled.

>

Let me reorder files (BTW, is there a way for Git to put *.h files
before *.c files in diff?) for easier review:

> diff --git a/commit-graph.h b/commit-graph.h
> index 260a468e73..96cccb10f3 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>   */
>  int parse_commit_in_graph(struct commit *item);
>  
> +/*
> + * It is possible that we loaded commit contents from the commit buffer,
> + * but we also want to ensure the commit-graph content is correctly
> + * checked and filled. Fill the graph_pos and generation members of
> + * the given commit.
> + */
> +void load_commit_graph_info(struct commit *item);
> +
>  struct tree *get_commit_tree_in_graph(const struct commit *c);
>  
>  struct commit_graph {
> diff --git a/commit-graph.c b/commit-graph.c
> index 047fa9fca5..aebd242def 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>  	return &commit_list_insert(c, pptr)->next;
>  }
>  
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +}

The comment in the header file commit-graph.h talks about filling
graph_pos and generation members of the given commit, but I don't see
filling graph_pos member here.

Sidenote: it is a tiny little bit strange to see symbolic constants like
GRAPH_DATA_WIDTH near using magic values such as 8 and 2.

> +
>  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>  {
>  	uint32_t edge_value;
> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	return 1;
>  }
>  
> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
> +{
> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		*pos = item->graph_pos;
> +		return 1;
> +	} else {
> +		return bsearch_graph(g, &(item->object.oid), pos);
> +	}
> +}

Nice refactoring here.
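
The shape of the extracted helper, reduced to a toy: use the cached
position when there is one, otherwise fall back to a binary search.
This sketch uses plain integers instead of object IDs, and all names in
it (toy_item, toy_bsearch(), toy_find()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOT_FROM_GRAPH 0xFFFFFFFFu

/* Toy stand-in: an integer id instead of an object ID. */
struct toy_item {
	uint32_t id;
	uint32_t graph_pos;
};

/* Binary search in a sorted id table; returns 1 and sets *pos if found. */
static int toy_bsearch(const uint32_t *sorted_ids, uint32_t nr,
		       uint32_t id, uint32_t *pos)
{
	uint32_t lo = 0, hi = nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (sorted_ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (sorted_ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Prefer the cached position; search only when there is none. */
static int toy_find(const struct toy_item *item, const uint32_t *sorted_ids,
		    uint32_t nr, uint32_t *pos)
{
	assert(item && pos);
	if (item->graph_pos != TOY_NOT_FROM_GRAPH) {
		*pos = item->graph_pos;
		return 1;
	}
	return toy_bsearch(sorted_ids, nr, item->id, pos);
}
```

Note that the lookup itself does not store the found position back into
the item; whether the caller should do so is a separate decision.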

> +
>  int parse_commit_in_graph(struct commit *item)
>  {
> +	uint32_t pos;
> +
>  	if (!core_commit_graph)
>  		return 0;
>  	if (item->object.parsed)
>  		return 1;
> -
>  	prepare_commit_graph();
> -	if (commit_graph) {
> -		uint32_t pos;
> -		int found;
> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> -			pos = item->graph_pos;
> -			found = 1;
> -		} else {
> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
> -		}
> -
> -		if (found)
> -			return fill_commit_in_graph(item, commit_graph, pos);
> -	}
> -
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		return fill_commit_in_graph(item, commit_graph, pos);
>  	return 0;
>  }
>  
> +void load_commit_graph_info(struct commit *item)
> +{
> +	uint32_t pos;
> +	if (!core_commit_graph)
> +		return;
> +	prepare_commit_graph();
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		fill_commit_graph_info(item, commit_graph, pos);
> +}

Similar functions, different goals (as the names imply).

> +
>  static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>  {
>  	struct object_id oid;
> diff --git a/commit.c b/commit.c
> index 4d00b0a1d6..39a3749abd 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>  	return ret;
>  }
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>  {
>  	const char *tail = buffer;
>  	const char *bufptr = buffer;
> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>  	}
>  	item->date = parse_commit_date(bufptr, tail);
>  
> +	if (check_graph)
> +		load_commit_graph_info(item);
> +

All right, read commit-graph specific data after parsing the commit itself.
It is at the end because the commit object needs to be parsed sequentially,
and it includes more info than is contained in the commit-graph CDAT+EDGE
data.

>  	return 0;
>  }
>  
> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return error("Object %s not a commit",
>  			     oid_to_hex(&item->object.oid));
>  	}
> -	ret = parse_commit_buffer(item, buffer, size);
> +	ret = parse_commit_buffer(item, buffer, size, 0);

The parse_commit_gently() contract is that it provides only the bare minimum
of information, from the commit-graph if possible, and that it reads the
object from disk and parses it only when it cannot avoid doing so.  If it
needs to parse it, it doesn't need to fill commit-graph specific data again.

All right.

>  	if (save_commit_buffer && !ret) {
>  		set_commit_buffer(item, buffer, size);
>  		return 0;
> diff --git a/commit.h b/commit.h
> index 64436ff44e..b5afde1ae9 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>   */
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>  int parse_commit_gently(struct commit *item, int quiet_on_missing);
>  static inline int parse_commit(struct commit *item)
>  {
> diff --git a/object.c b/object.c
> index e6ad3f61f0..efe4871325 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>  	} else if (type == OBJ_COMMIT) {
>  		struct commit *commit = lookup_commit(oid);
>  		if (commit) {
> -			if (parse_commit_buffer(commit, buffer, size))
> +			if (parse_commit_buffer(commit, buffer, size, 1))

All that rigamarole was needed because of

DS>                 In some cases, including some branch lookups, the commit
DS> is parsed using parse_object_buffer() which side-steps parse_commit() in
DS> favor of parse_commit_buffer().

Here we want parse_object_buffer() to also get commit-graph specific
data, if available.  All right.

>  				return NULL;
>  			if (!get_cached_commit_buffer(commit, NULL)) {
>  				set_commit_buffer(commit, buffer, size);
> diff --git a/sha1_file.c b/sha1_file.c
> index 1b94f39c4c..0fd4f0b8b6 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>  {
>  	struct commit c;
>  	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(&c, buf, size))
> +	if (parse_commit_buffer(&c, buf, size, 0))

For check we don't need commit graph data.  Looks all right.

>  		die("corrupt commit");
>  }

Best,
-- 
Jakub Narębski


* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
  2018-04-25 14:37       ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee
  2018-04-29 22:14         ` Jakub Narebski
@ 2018-04-29 22:18         ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-29 22:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

[Forgot about one thing]

Derrick Stolee <dstolee@microsoft.com> writes:

> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file.

The above sentence is a bit hard to parse because of ambiguity: is it
"the information" that exists only in the commit-graph file, or "a
commit" that exists only in the commit-graph file?

>                                                          Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

-- 
Jakub Narębski


* Re: [PATCH v4 02/10] commit: add generation number to struct commmit
  2018-04-28 22:35         ` Jakub Narebski
@ 2018-04-30 12:05           ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-30 12:05 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

On 4/28/2018 6:35 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The generation number of a commit is defined recursively as follows:
>>
>> * If a commit A has no parents, then the generation number of A is one.
>> * If a commit A has parents, then the generation number of A is one
>>    more than the maximum generation number among the parents of A.
> Very minor nitpick: it would be more readable wrapped differently:
>
>    * If a commit A has parents, then the generation number of A is
>      one more than the maximum generation number among parents of A.
>
> Very minor nitpick: possibly "parents", not "the parents", but I am
> not native English speaker.
>
>> Add a uint32_t generation field to struct commit so we can pass this
>> information to revision walks. We use three special values to signal
>> the generation number is invalid:
>>
>> GENERATION_NUMBER_INFINITY 0xFFFFFFFF
>> GENERATION_NUMBER_MAX 0x3FFFFFFF
>> GENERATION_NUMBER_ZERO 0
>>
>> The first (_INFINITY) means the generation number has not been loaded or
>> computed. The second (_MAX) means the generation number is too large to
>> store in the commit-graph file. The third (_ZERO) means the generation
>> number was loaded from a commit graph file that was written by a version
>> of git that did not support generation numbers.
> Good explanation; I wonder if we want to have it in some shortened form
> also in comments, and not only in the commit message.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   alloc.c        | 1 +
>>   commit-graph.c | 2 ++
>>   commit.h       | 4 ++++
>>   3 files changed, 7 insertions(+)
> I have reordered patches to make it easier to review.
>
>> diff --git a/commit.h b/commit.h
>> index 23a3f364ed..aac3b8c56f 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -10,6 +10,9 @@
>>   #include "pretty.h"
>>   
>>   #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
>> +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
>> +#define GENERATION_NUMBER_MAX 0x3FFFFFFF
>> +#define GENERATION_NUMBER_ZERO 0
> I wonder if it wouldn't be good to have some short in-line comments
> explaining those constants, or a block comment above them.
>
>>   
>>   struct commit_list {
>>   	struct commit *item;
>> @@ -30,6 +33,7 @@ struct commit {
>>   	 */
>>   	struct tree *maybe_tree;
>>   	uint32_t graph_pos;
>> +	uint32_t generation;
>>   };
>>   
>>   extern int save_commit_buffer;
> All right, simple addition of the new field.  Nothing to go wrong here.
>
> Sidenote: With 0x7FFFFFFF being (if I am not wrong) maximum graph_pos
> and maximum number of nodes in commit graph, we won't hit 0x3FFFFFFF
> generation number limit for all except very, very linear histories.

Both of these limits are far away from being realistic. But we could 
extend the maximum graph_pos independently from the maximum generation 
number now that we have the "capped" logic.
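
The "capped" logic can be sketched roughly as follows.  This is a
simplified illustration, not the code from commit-graph.c:
compute_generation() is a hypothetical helper that receives the parents'
generation numbers as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical helper: one more than the largest parent generation,
 * or one for a root commit, capped at GENERATION_NUMBER_MAX so the
 * result always fits in 30 bits.
 */
static uint32_t compute_generation(const uint32_t *parent_gens, size_t nr_parents)
{
	uint32_t max_parent = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		assert(parent_gens[i] <= GENERATION_NUMBER_MAX);
		if (parent_gens[i] > max_parent)
			max_parent = parent_gens[i];
	}
	if (max_parent >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_parent + 1;
}
```

Because of the cap, a child of a commit at GENERATION_NUMBER_MAX stays
at GENERATION_NUMBER_MAX, which is why that limit no longer has to be
tied to the number of commits the file can hold.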

>
>> diff --git a/alloc.c b/alloc.c
>> index cf4f8b61e1..e8ab14f4a1 100644
>> --- a/alloc.c
>> +++ b/alloc.c
>> @@ -94,6 +94,7 @@ void *alloc_commit_node(void)
>>   	c->object.type = OBJ_COMMIT;
>>   	c->index = alloc_commit_index();
>>   	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
>> +	c->generation = GENERATION_NUMBER_INFINITY;
>>   	return c;
>>   }
> All right, start with initializing it with "not from commit-graph" value
> after allocation.
>
>>   
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 70fa1b25fd..9ad21c3ffb 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>>   	date_low = get_be32(commit_data + g->hash_len + 12);
>>   	item->date = (timestamp_t)((date_high << 32) | date_low);
>>   
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>> +
> I guess we should not worry about these "magical constants" sprinkled
> here, like "+ 8" above.
>
> Let's examine how it goes, taking a look at commit-graph-format.txt
> in Documentation/technical/commit-graph-format.txt
>
>   * The first H (g->hash_len) bytes are for the OID of the root tree.
>   * The next 8 bytes are for the positions of the first two parents [...]
>
> So 'commit_data + g->hash_len + 8' is our offset from the start of
> commit data.  All right.
>
>    * The next 8 bytes store the generation number of the commit and
>      the commit time in seconds since EPOCH.  The generation number
>      uses the higher 30 bits of the first 4 bytes. [...]
>
> The higher 30 bits of the 4 bytes, which is 32 bits, means that we need
> to shift 32-bit value 2 bits right, so that we get lower 30 bits of
> 32-bit value.  All right.
>
>    All 4-byte numbers are in network order.
>
> Shouldn't it be ntohl() to convert from network order to host order, and
> not get_be32()?  I guess they are the same (network order is big-endian
> order), and get_be32() is what rest of git uses...

ntohl() takes a 32-bit value, while get_be32() takes a pointer. This 
makes pulling network-bytes out of streams much cleaner with get_be32(), 
so I try to use that whenever possible.
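
To illustrate the difference, here is a small sketch.  read_be32() is a
local stand-in for get_be32() (not Git's actual implementation), and the
ntohl() variant assumes a POSIX system that provides <arpa/inet.h>:

```c
#include <arpa/inet.h> /* ntohl(); POSIX, assumed to be available */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for get_be32(): works directly on a byte pointer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* The ntohl() route needs a 32-bit value first, hence the extra copy. */
static uint32_t read_be32_via_ntohl(const unsigned char *p)
{
	uint32_t raw;

	assert(p != NULL);
	memcpy(&raw, p, sizeof(raw)); /* avoids an unaligned load */
	return ntohl(raw);
}
```

Both return the same value; the pointer-based form just skips the
intermediate variable when reading from an mmap()ed file or a buffer.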



* Re: [PATCH v4 06/10] ref-filter: use generation number for --contains
  2018-04-25 14:37       ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-30 16:34         ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-30 16:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> A commit A can reach a commit B only if the generation number of A
> is strictly larger than the generation number of B. This condition
> allows significantly short-circuiting commit-graph walks.
>
> Use generation number for '--contains' type queries.
>
> On a copy of the Linux repository where HEAD is containd in v4.13

Minor typo: containd -> contained.

> but no earlier tag, the command 'git tag --contains HEAD' had the
> following peformance improvement:
>
> Before: 0.81s
> After:  0.04s
> Rel %:  -95%

Very nice.  I guess that any performance changes for when commit-graph
feature is not available are negligible / not measurable.

Rel % = (after - before)/before * 100%, isn't it?

Good.

>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  ref-filter.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index aff24d93be..fb35067fc9 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -16,6 +16,7 @@
>  #include "trailer.h"
>  #include "wt-status.h"
>  #include "commit-slab.h"
> +#include "commit-graph.h"
>  
>  static struct ref_msg {
>  	const char *gone;
> @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>   */
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
> -					  struct contains_cache *cache)
> +					  struct contains_cache *cache,
> +					  uint32_t cutoff)
>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>  
> @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate,
>  
>  	/* Otherwise, we don't know; prepare to recurse */
>  	parse_commit_or_die(candidate);
> +
> +	if (candidate->generation < cutoff)
> +		return CONTAINS_NO;
> +

We use the weaker negative-cut criterion here, which has the advantage of
handling the special values automatically: _INFINITY, _MAX, _ZERO.

Stronger version:

  if A != B and A ---> B, then gen(A) > gen(B)

  if gen(A) <= gen(B) and A != B, then A -/-> B

Weaker version:

  if gen(A) < gen(B), then A -/-> B

If the commit-graph feature is not available, then all generation numbers
would be _INFINITY, and cutoff would also be _INFINITY, which means
this operation is practically a no-op.  One memory access (probably from
cache) and one comparison are very cheap.

All right.
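
A small sketch of why the weaker version needs no special-casing.  The
macro values are the ones from the series; weak_cannot_reach() and
check_special_values() are names made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Weaker criterion: returns 1 when gen(A) < gen(B) proves that A cannot
 * reach B, and 0 when the generation numbers alone decide nothing.
 */
static int weak_cannot_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}

/* Equal special values never prove anything, so the walk simply goes on. */
static void check_special_values(void)
{
	assert(!weak_cannot_reach(GENERATION_NUMBER_INFINITY, GENERATION_NUMBER_INFINITY));
	assert(!weak_cannot_reach(GENERATION_NUMBER_MAX, GENERATION_NUMBER_MAX));
	assert(!weak_cannot_reach(GENERATION_NUMBER_ZERO, GENERATION_NUMBER_ZERO));
}
```

With the stronger version (<=), each of the three equal pairs above
would wrongly be reported as unreachable unless handled separately.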

>  	return CONTAINS_UNKNOWN;
>  }
>  
> @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		load_commit_graph_info(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}

For each commit in wants, load generation numbers if needed and find the
lowest one.  Anything lower cannot reach any of the wants.  All right.

If the commit-graph feature is not available, this is practically a no-op.
It is fast, as it only accesses memory: it does not access the disk, nor
does it need to do any decompression, un-deltafication or parsing.

All right.
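
The cutoff computation can be sketched in isolation like this.  struct
want is a simplified stand-in for struct commit_list, and find_cutoff(),
below_cutoff() and check_no_graph_case() are hypothetical names, not
functions from ref-filter.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Simplified stand-in for a commit list; only the generation matters here. */
struct want {
	uint32_t generation;
	const struct want *next;
};

/* Lowest generation among the wants; INFINITY for an empty list. */
static uint32_t find_cutoff(const struct want *wants)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	const struct want *p;

	for (p = wants; p; p = p->next)
		if (p->generation < cutoff)
			cutoff = p->generation;
	return cutoff;
}

/* A candidate strictly below the cutoff cannot reach any of the wants. */
static int below_cutoff(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}

/* Without a commit-graph everything is INFINITY and nothing is cut. */
static void check_no_graph_case(void)
{
	struct want only = { GENERATION_NUMBER_INFINITY, NULL };
	uint32_t cutoff = find_cutoff(&only);

	assert(cutoff == GENERATION_NUMBER_INFINITY);
	assert(!below_cutoff(GENERATION_NUMBER_INFINITY, cutoff));
}
```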

>  
> +	result = contains_test(candidate, want, cache, cutoff);
>  	if (result != CONTAINS_UNKNOWN)
>  		return result;
>  
> @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		 * If we just popped the stack, parents->item has been marked,
>  		 * therefore contains_test will return a meaningful yes/no.
>  		 */
> -		else switch (contains_test(parents->item, want, cache)) {
> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>  		case CONTAINS_YES:
>  			*contains_cache_at(cache, commit) = CONTAINS_YES;
>  			contains_stack.nr--;
> @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		}
>  	}
>  	free(contains_stack.contains_stack);
> -	return contains_test(candidate, want, cache);
> +	return contains_test(candidate, want, cache, cutoff);

Those two just update the call sites to the new signature.  All right.

>  }
>  
>  static int commit_contains(struct ref_filter *filter, struct commit *commit,


* Re: [PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
  2018-04-25 14:37       ` [PATCH v4 07/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-30 17:05         ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-04-30 17:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the initial commits.

Right.

>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:

I guess that it is the equivalent of the 'git tag --contains' setup from
the previous commit, just for 'git branch --contains', isn't it?

>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%

Very nice.

Sidenote: an alternative to using "Rel %" of -99.3% (which is calculated
as (after-before)/before) would be to use "Speedup" of 150x (calculated
as before/after).  On one hand it might be more readable, on the other
hand it might be a bit misleading.

Yet another alternative would be to use a chart like the following:

           time  Before   After
  Before  60.0s      --  -99.3%
  After    0.4s +14900%      --

Anyway, consistency in presentation within a patch series is good.  So I am
for keeping your notation throughout the series.
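
For reference, the two ways of presenting the numbers, as a small
sketch.  rel_percent() and speedup() are names made up for this
illustration; the sign convention is chosen so that a faster "after"
gives a negative percentage, matching the -99.3% figure:

```c
#include <assert.h>

/* Relative change in percent; negative means "after" is faster. */
static double rel_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Speedup factor; 2.0 means "after" takes half the time. */
static double speedup(double before, double after)
{
	assert(after > 0.0);
	return before / after;
}
```

With before = 60.0 and after = 0.4, rel_percent() gives about -99.3 and
speedup() gives about 150.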

>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 39a3749abd..7bb007f56a 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *

Let's give it a bit more context:

   /*
    * Is "commit" an ancestor of one of the "references"?
    */
   int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit **reference)
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>  
>  	if (parse_commit(commit))
>  		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>  		if (parse_commit(reference[i]))
>  			return ret;

We use parse_commit(), so there is no need for calling
load_commit_graph_info(), like in previous patch.

All right.

> +		if (min_generation > reference[i]->generation)

At first glance, I thought it was wrong; but it is the same as the
following, it is just a matter of taste (which feels more natural):

  +		if (reference[i]->generation < min_generation)

> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return ret;

All right, using the weak version of the generation-number-based negative-cut
automatically handles all corner cases, including the case where the
commit-graph feature is turned off.

If the commit-graph feature is not available, it costs only a few more
memory accesses and comparisons than before, and performance is dominated by
something else anyway.  Negligible and possibly unnoticeable change, I
guess.  Good.

>  
>  	bases = paint_down_to_common(commit, nr_reference, reference);
>  	if (commit->object.flags & PARENT2)


* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
  2018-04-25 14:38       ` [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-30 22:19         ` Jakub Narebski
  2018-05-01 11:47           ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-30 22:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> When running 'git branch --contains', the in_merge_bases_many()
> method calls paint_down_to_common() to discover if a specific
> commit is reachable from a set of branches. Commits with lower
> generation number are not needed to correctly answer the
> containment query of in_merge_bases_many().
>
> Add a new parameter, min_generation, to paint_down_to_common() that
> prevents walking commits with generation number strictly less than
> min_generation. If 0 is given, then there is no functional change.

This is thanks to the fact that generation numbers start at zero (as a
special case, though, with _ZERO), and that we use strict inequality to
avoid handling _ZERO etc. in a special way.

As you wrote in response to the previous version of this series, because
paint_down_to_common() is file-local, there is no need to come up with
a symbolic name for the GENERATION_NO_CUTOFF case.

All right.

>
> For in_merge_bases_many(), we can pass commit->generation as the
> cutoff, and this saves time during 'git branch --contains' queries
> that would otherwise walk "around" the commit we are inspecting.

All right, and when using paint_down_to_common() to actually find merge
bases, and not only check for containment, we cannot use cutoff.
Therefore at least one call site needs to run it without functional
change... which we can do.  Good.

>
> For a copy of the Linux repository, where HEAD is checked out at
> v4.13~100, we get the following performance improvement for
> 'git branch --contains' over the previous commit:
>
> Before: 0.21s
> After:  0.13s
> Rel %: -38%

Nice.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)

Let me reorder chunks a bit to make it easier to review.

>
> diff --git a/commit.c b/commit.c
> index 7bb007f56a..e2e16ea1a7 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  	if (commit->generation > min_generation)
>  		return ret;
>  
> -	bases = paint_down_to_common(commit, nr_reference, reference);
> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>  	if (commit->object.flags & PARENT2)
>  		ret = 1;
>  	clear_commit_marks(commit, all_flags);
> @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  }
>  
>  /* all input commits in one and twos[] must have been parsed! */
> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
> +static struct commit_list *paint_down_to_common(struct commit *one, int n,
> +						struct commit **twos,
> +						int min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>  
>  	one->object.flags |= PARENT1;
>  	if (!n) {
> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		struct commit_list *parents;
>  		int flags;
>  
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +		last_gen = commit->generation;

Shouldn't we provide more information about where the problem is to the
user, to make it easier to debug the repository / commit-graph data?

Good to have this sanity check here.
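
Something along these lines is what I have in mind.  This is only a
sketch: format_gen_skip() is a hypothetical helper that uses snprintf()
as a stand-in for BUG(), and "hex" stands in for the output of
oid_to_hex().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the text a more verbose BUG() call could
 * print, naming both generation numbers and the offending commit.
 */
static int format_gen_skip(char *buf, size_t size, const char *hex,
			   uint32_t gen, uint32_t last_gen)
{
	assert(gen > last_gen); /* only called when the order is violated */
	return snprintf(buf, size,
			"bad generation skip %08" PRIx32 " > %08" PRIx32 " at %s",
			gen, last_gen, hex);
}
```

Printing the generation numbers in hex makes the special values
(ffffffff, 3fffffff, 00000000) easy to spot in a report.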

> +
> +		if (commit->generation < min_generation)
> +			break;

So the reasoning for this, as far as I understand, is the following.
Please correct me if I am wrong.

The callsite with non-zero min_generation, in_merge_bases_many(), tries
to find out if "commit" is an ancestor of one of the "references".  At
least one of "references" is above "commit", so in_merge_bases_many()
uses paint_down_to_common() - but is interested only if "commit" was
painted as reachable from one of "references".

Thus we can interrupt the walk if we know that none of [considered]
commits in the queue can reach "commit"/"one" - as if they were all
STALE.

The search is done using a priority queue (a bit like in Dijkstra's
algorithm), with newer commits, i.e. those with larger generation numbers,
considered first.  Thus if the current commit has a generation number less
than the min_generation cutoff, i.e. if it is below "commit", then all
remaining commits in the queue are below the cutoff.

Good.
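
To check my understanding, here is a toy version of the walk.  It is a
sketch only: it leaves out the PARENT1/PARENT2/STALE painting entirely,
uses a linear scan instead of a real priority queue, and all names
(toy_commit, pop_highest, walk_with_cutoff) are invented for the
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define MAX_COMMITS 16 /* the toy graph must not be larger than this */

/* Toy commit: up to two parents, given as array indexes (-1 = none). */
struct toy_commit {
	uint32_t generation;
	int parent[2];
	int seen;
};

/* Remove and return the queued index with the largest generation, or -1. */
static int pop_highest(const struct toy_commit *c, int *queue, int *nr)
{
	int best = 0, i, ret;

	if (!*nr)
		return -1;
	for (i = 1; i < *nr; i++)
		if (c[queue[i]].generation > c[queue[best]].generation)
			best = i;
	ret = queue[best];
	queue[best] = queue[--*nr];
	return ret;
}

/*
 * Walk from "start" in order of decreasing generation and return how
 * many commits were processed before the cutoff stopped the walk.
 */
static int walk_with_cutoff(struct toy_commit *c, int start, uint32_t min_generation)
{
	int queue[MAX_COMMITS], nr = 0, visited = 0, cur, i;
	uint32_t last_gen = GENERATION_NUMBER_INFINITY;

	c[start].seen = 1;
	queue[nr++] = start;
	while ((cur = pop_highest(c, queue, &nr)) >= 0) {
		assert(c[cur].generation <= last_gen); /* pops never go upward */
		last_gen = c[cur].generation;
		if (c[cur].generation < min_generation)
			break; /* everything still queued is lower as well */
		visited++;
		for (i = 0; i < 2; i++) {
			int p = c[cur].parent[i];
			if (p >= 0 && !c[p].seen) {
				c[p].seen = 1;
				queue[nr++] = p;
			}
		}
	}
	return visited;
}
```

With a cutoff of 0 the loop never breaks early, which corresponds to the
"no functional change" case from the commit message.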

> +
>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>  
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);

When calculating merge bases there is no such possibility of an early
return due to generation number cutoff.  All right then.

>  
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);

Here we are interested not only if "one"/array[i] is reachable from
"twos"/work, but also if "twos" is reachable from "one".  Simple cutoff
only works in one way, though I wonder if we couldn't use cutoff being
minimum generation number of "one" and "twos" together.

But that may be left for a separate commit (after checking that the
above is correct).

Not as simple and obvious as paint_down_to_common() used in
in_merge_bases_many(), so it is all right.

>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)


* Re: [PATCH v4 09/10] merge: check config before loading commits
  2018-04-25 14:38       ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-30 22:54         ` Jakub Narebski
  2018-05-01 11:52           ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-30 22:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> Now that we use generation numbers from the commit-graph, we must
> ensure that all commits that exist in the commit-graph are loaded
> from that file instead of from the object database. Since the
> commit-graph file is only checked if core.commitGraph is true, we
> must check the default config before we load any commits.
>
> In the merge builtin, the config was checked after loading the HEAD
> commit. This was due to the use of the global 'branch' when checking
> merge-specific config settings.
>
> Move the config load to be between the initialization of 'branch' and
> the commit lookup.

Sidenote: I wonder why reading config was postponed to later in the
command lifetime... I guess it was to avoid having to read config if
HEAD was invalid.

>
> Without this change, a fast-forward merge would hit a BUG("bad
> generation skip") statement in commit.c during paint_down_to_common().
> This is because the HEAD commit would be loaded with "infinite"
> generation but then reached by commits with "finite" generation
> numbers.

I guess this is because we avoid re-parsing objects at all costs; we
want to avoid re-reading commit graph too.

>
> Add a test to t5318-commit-graph.sh that exercises this code path to
> prevent a regression.

I would prefer if this commit was put earlier in the series, to avoid
having broken Git (and thus a possibility of problems when bisecting) in
between those two commits.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/merge.c         | 7 ++++---
>  t/t5318-commit-graph.sh | 9 +++++++++
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/builtin/merge.c b/builtin/merge.c
> index 5e5e4497e3..b819756946 100644
> --- a/builtin/merge.c
> +++ b/builtin/merge.c
> @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>  	if (branch)
>  		skip_prefix(branch, "refs/heads/", &branch);
> +
> +	init_diff_ui_defaults();
> +	git_config(git_merge_config, NULL);
> +
>  	if (!branch || is_null_oid(&head_oid))
>  		head_commit = NULL;
>  	else
>  		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>  
> -	init_diff_ui_defaults();
> -	git_config(git_merge_config, NULL);
> -

Good.

>  	if (branch_mergeoptions)
>  		parse_branch_merge_options(branch_mergeoptions);
>  	argc = parse_options(argc, argv, prefix, builtin_merge_options,
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index a380419b65..77d85aefe7 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
>  
> +test_expect_success 'perform fast-forward merge in full repo' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git checkout -b merge-5-to-8 commits/5 &&
> +	git merge commits/8 &&
> +	git show-ref -s merge-5-to-8 >output &&
> +	git show-ref -s commits/8 >expect &&
> +	test_cmp expect output
> +'

All right (though I wonder if this test catches all problems where
BUG("bad generation skip") could have been encountered).

> +
>  test_done

Best,
-- 
Jakub Narębski


* Re: [PATCH v4 10/10] commit-graph.txt: update design document
  2018-04-25 14:38       ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-30 23:32         ` Jakub Narebski
  2018-05-01 12:00           ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-30 23:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().
>
> Expand the section on generation numbers to discuss how the three
> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
> _MAX interact with other generation numbers.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Looks good.

> ---
>  Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 0550c6d0dc..d9f2713efa 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>  
> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +in the commit-graph file. If a commit-graph file was written by a version
> +of Git that did not compute generation numbers, then those commits will
> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> +
> +Since the commit-graph file is closed under reachability, we can guarantee
> +the following weaker condition on all commits:
> +
> +    If A and B are commits with generation numbers N amd M, respectively,
> +    and N < M, then A cannot reach B.
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits,

The Linux kernel commit graph has a maximum of 513 commits sharing the
same generation number, but it is 5.43 commits sharing the same
generation number on average, with standard deviation 10.70; the median is
even lower: it is 2, with 5.35 median absolute deviation (MAD).

So on average it would be a few extra commits.  Right.

>                               but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.

As I wrote before, handling those corner cases is more complicated, but
not that complicated.  We could simply use the stronger condition if both
generation numbers are ordinary generation numbers, and the weaker
condition when at least one generation number has one of those special
values.

> +
> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

Ordinary generation numbers, where the stronger condition holds, are those
with GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.
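
To make the suggestion concrete, here is a minimal sketch of such a
hybrid check. The helper names gen_is_ordinary() and
gen_rules_out_reach() are hypothetical (they are not in the patch), and
the sketch assumes the document's claims hold: the stronger condition
for ordinary generation numbers, the weaker one whenever a special
value is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Values as given in the design document under review. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_MAX      0x3FFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

/* Hypothetical helper: strictly between _ZERO and _MAX. */
static int gen_is_ordinary(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Hypothetical helper: return 1 if the generation numbers alone show
 * that commit A (gen_a) cannot reach a distinct commit B (gen_b).
 */
static int gen_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	if (gen_is_ordinary(gen_a) && gen_is_ordinary(gen_b))
		return gen_a <= gen_b;	/* stronger condition */
	return gen_a < gen_b;		/* weaker condition */
}

/* Usage example; each assertion follows directly from the two rules. */
static void gen_rules_out_reach_examples(void)
{
	/* Equal ordinary numbers: distinct commits cannot reach each other. */
	assert(gen_rules_out_reach(5, 5));
	/* Equal special values tell us nothing. */
	assert(!gen_rules_out_reach(GENERATION_NUMBER_ZERO,
				    GENERATION_NUMBER_ZERO));
	assert(!gen_rules_out_reach(GENERATION_NUMBER_INFINITY,
				    GENERATION_NUMBER_INFINITY));
}
```

The cost is one extra branch per comparison; whether that is worth it
compared with walking a few extra commits is exactly the trade-off
discussed above.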

> +
>  Design Details
>  --------------
>  
> @@ -98,17 +121,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>  
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -

Good.

>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:
>  
> -    - paint_down_to_common()
>      - 'log --topo-order'

Other possible candidates:

       - remove_redundant() - see comment in previous patch
       - still_interesting() - where Git uses date slop to stop walking
         too far

>  
>  - Currently, parse_commit_gently() requires filling in the root tree

One important issue left is handling features that change the view of
project history, and their interaction with the commit-graph feature.

What would happen if we turn on the commit-graph feature, generate a
commit-graph file, and then:

  * use a graft file, or remove graft entries, to cut history, remove a
    cut, or join two [independent] histories
  * use the git-replace mechanism to do the same
  * in a shallow clone, deepen or shorten the clone

What would happen if, without re-generating the commit-graph file
(assuming that Git wouldn't do it for us), we run some feature that makes
use of commit-graph data:

  - git branch --contains
  - git tag --contains
  - git rev-list A..B

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
  2018-04-30 22:19         ` Jakub Narebski
@ 2018-05-01 11:47           ` Derrick Stolee
  2018-05-02 13:05             ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 11:47 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

On 4/30/2018 6:19 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> When running 'git branch --contains', the in_merge_bases_many()
>> method calls paint_down_to_common() to discover if a specific
>> commit is reachable from a set of branches. Commits with lower
>> generation number are not needed to correctly answer the
>> containment query of in_merge_bases_many().
>>
>> Add a new parameter, min_generation, to paint_down_to_common() that
>> prevents walking commits with generation number strictly less than
>> min_generation. If 0 is given, then there is no functional change.
> This is thanks to the fact that generation numbers start at zero (as
> special case, though, with _ZERO), and we use strict inequality to avoid
> handling _ZERO etc. in a special way.
>
> As you wrote in response in previous version of this series, because
> paint_down_to_common() is file-local, there is no need to come up with
> symbolic name for GENERATION_NO_CUTOFF case.
>
> All right.
>
>> For in_merge_bases_many(), we can pass commit->generation as the
>> cutoff, and this saves time during 'git branch --contains' queries
>> that would otherwise walk "around" the commit we are inspecting.
> All right, and when using paint_down_to_common() to actually find merge
> bases, and not only check for containment, we cannot use cutoff.
> Therefore at least one call site needs to run it without functional
> change... which we can do.  Good.
>
>> For a copy of the Linux repository, where HEAD is checked out at
>> v4.13~100, we get the following performance improvement for
>> 'git branch --contains' over the previous commit:
>>
>> Before: 0.21s
>> After:  0.13s
>> Rel %: -38%
> Nice.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit.c | 18 ++++++++++++++----
>>   1 file changed, 14 insertions(+), 4 deletions(-)
> Let me reorder chunks a bit to make it easier to review.
>
>> diff --git a/commit.c b/commit.c
>> index 7bb007f56a..e2e16ea1a7 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>   	if (commit->generation > min_generation)
>>   		return ret;
>>   
>> -	bases = paint_down_to_common(commit, nr_reference, reference);
>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>>   	if (commit->object.flags & PARENT2)
>>   		ret = 1;
>>   	clear_commit_marks(commit, all_flags);
>> @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
>>   }
>>   
>>   /* all input commits in one and twos[] must have been parsed! */
>> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>> +static struct commit_list *paint_down_to_common(struct commit *one, int n,
>> +						struct commit **twos,
>> +						int min_generation)
>>   {
>>   	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>>   	struct commit_list *result = NULL;
>>   	int i;
>> +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>>   
>>   	one->object.flags |= PARENT1;
>>   	if (!n) {
>> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>   		struct commit_list *parents;
>>   		int flags;
>>   
>> +		if (commit->generation > last_gen)
>> +			BUG("bad generation skip");
>> +		last_gen = commit->generation;
> Shouldn't we provide more information about where the problem is to the
> user, to make it easier to debug the repository / commit-graph data?
>
> Good to have this sanity check here.

This BUG() _should_ only be seen by developers who add callers which do 
not load commits from the commit-graph file. There is a chance that 
there are cases not covered by this patch and the added tests, though. 
Hopefully we catch them all by dogfooding the feature before turning it 
on by default.

I can add the following to help debug these bad situations:

+			BUG("bad generation skip %d > %d at %s",
+			    commit->generation, last_gen,
+			    oid_to_hex(&commit->object.oid));



>
>> +
>> +		if (commit->generation < min_generation)
>> +			break;
> So the reasoning for this, as far as I understand, is the following.
> Please correct me if I am wrong.
>
> The callsite with non-zero min_generation, in_merge_bases_many(), tries
> to find out if "commit" is an ancestor of one of the "references".  At
> least one of "references" is above "commit", so in_merge_bases_many()
> uses paint_down_to_common() - but is interested only if "commit" was
> painted as reachable from one of "references".
>
> Thus we can interrupt the walk if we know that none of [considered]
> commits in the queue can reach "commit"/"one" - as if they were all
> STALE.
>
> The search is done using priority queue (a bit like in Dijkstra
> algorithm), with newer commits - with larger generation numbers -
> considered first.  Thus if current commit has generation number less
> than min_generation cutoff, i.e. if it is below "commit", then all
> remaining commits in the queue are below cutoff.
>
> Good.
>
>> +
>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>   		if (flags == (PARENT1 | PARENT2)) {
>>   			if (!(commit->object.flags & RESULT)) {
>> @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>   			return NULL;
>>   	}
>>   
>> -	list = paint_down_to_common(one, n, twos);
>> +	list = paint_down_to_common(one, n, twos, 0);
> When calculating merge bases there is no such possibility of an early
> return due to generation number cutoff.  All right then.
>
>>   
>>   	while (list) {
>>   		struct commit *commit = pop_commit(&list);
>> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>   			filled_index[filled] = j;
>>   			work[filled++] = array[j];
>>   		}
>> -		common = paint_down_to_common(array[i], filled, work);
>> +		common = paint_down_to_common(array[i], filled, work, 0);
> Here we are interested not only if "one"/array[i] is reachable from
> "twos"/work, but also if "twos" is reachable from "one".  Simple cutoff
> only works in one way, though I wonder if we couldn't use cutoff being
> minimum generation number of "one" and "twos" together.
>
> But that may be left for a separate commit (after checking that the
> above is correct).
>
> Not as simple and obvious as paint_down_to_common() used in
> in_merge_bases_many(), so it is all right.

Thanks for reporting this. Since we are only concerned about 
reachability in this method, it is a good candidate to use 
min_generation. It is also subtle enough that we should leave it as a 
separate commit. Also, we can measure performance improvements 
separately, as I will mention in my commit message (but I'll copy it here):

     For a copy of the Linux repository, we measured the following
     performance improvements:

     git merge-base v3.3 v4.5

     Before: 234 ms
      After: 208 ms
      Rel %: -11%

     git merge-base v4.3 v4.5

     Before: 102 ms
      After:  83 ms
      Rel %: -19%

     The experiments above were chosen to demonstrate that we are
     improving the filtering of the merge-base set. In the first
     example, more time is spent walking the history to find the
     set of merge bases before the remove_redundant() call. The
     starting commits are closer together in the second example,
     therefore more time is spent in remove_redundant(). The relative
     change in performance differs as expected.
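
For illustration only, the cutoff idea for remove_redundant() (the
minimum generation number over "one" and "twos" together) might be
sketched as below. Both helpers are hypothetical, and the array
"queue_gen" merely stands in for the pop order of the priority queue;
the real queue in paint_down_to_common() grows while it is walked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: the smallest generation number among "one"
 * and all of "twos", usable as a min_generation cutoff.
 */
static uint32_t min_generation_cutoff(uint32_t one_gen,
				      const uint32_t *twos_gen, size_t n)
{
	uint32_t min_generation = one_gen;
	size_t i;

	for (i = 0; i < n; i++)
		if (twos_gen[i] < min_generation)
			min_generation = twos_gen[i];
	return min_generation;
}

/*
 * Hypothetical helper: count how many entries are walked before the
 * cutoff stops the walk. Entries must arrive in non-increasing
 * generation order, mirroring the "bad generation skip" check.
 */
static size_t count_walked(const uint32_t *queue_gen, size_t n,
			   uint32_t min_generation)
{
	size_t walked = 0;

	while (walked < n) {
		if (walked)
			assert(queue_gen[walked] <= queue_gen[walked - 1]);
		if (queue_gen[walked] < min_generation)
			break;	/* everything left is below the cutoff */
		walked++;
	}
	return walked;
}
```

With a cutoff of 0 nothing is ever skipped, which matches the "no
functional change" behaviour described for the merge-base call site.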


>
>>   		if (array[i]->object.flags & PARENT2)
>>   			redundant[i] = 1;
>>   		for (j = 0; j < filled; j++)


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v4 09/10] merge: check config before loading commits
  2018-04-30 22:54         ` Jakub Narebski
@ 2018-05-01 11:52           ` Derrick Stolee
  2018-05-02 11:41             ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 11:52 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

On 4/30/2018 6:54 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Now that we use generation numbers from the commit-graph, we must
>> ensure that all commits that exist in the commit-graph are loaded
>> from that file instead of from the object database. Since the
>> commit-graph file is only checked if core.commitGraph is true, we
>> must check the default config before we load any commits.
>>
>> In the merge builtin, the config was checked after loading the HEAD
>> commit. This was due to the use of the global 'branch' when checking
>> merge-specific config settings.
>>
>> Move the config load to be between the initialization of 'branch' and
>> the commit lookup.
> Sidenote: I wonder why reading config was postponed to later in the
> command lifetime... I guess it was to avoid having to read config if
> HEAD was invalid.

The 'branch' does need to be loaded before the call to git_config (as I 
found out after moving the config call too early), so I suppose it was 
natural to pair that with resolving head_commit.

>
>> Without this change, a fast-forward merge would hit a BUG("bad
>> generation skip") statement in commit.c during paint_down_to_common().
>> This is because the HEAD commit would be loaded with "infinite"
>> generation but then reached by commits with "finite" generation
>> numbers.
> I guess this is because we avoid re-parsing objects at all costs; we
> want to avoid re-reading commit graph too.
>
>> Add a test to t5318-commit-graph.sh that exercises this code path to
>> prevent a regression.
> I would prefer if this commit was put earlier in the series, to avoid
> having broken Git (and thus a possibility of problems when bisecting) in
> between those two commits.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   builtin/merge.c         | 7 ++++---
>>   t/t5318-commit-graph.sh | 9 +++++++++
>>   2 files changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/builtin/merge.c b/builtin/merge.c
>> index 5e5e4497e3..b819756946 100644
>> --- a/builtin/merge.c
>> +++ b/builtin/merge.c
>> @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>>   	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>>   	if (branch)
>>   		skip_prefix(branch, "refs/heads/", &branch);
>> +
>> +	init_diff_ui_defaults();
>> +	git_config(git_merge_config, NULL);
>> +
>>   	if (!branch || is_null_oid(&head_oid))
>>   		head_commit = NULL;
>>   	else
>>   		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>>   
>> -	init_diff_ui_defaults();
>> -	git_config(git_merge_config, NULL);
>> -
> Good.
>
>>   	if (branch_mergeoptions)
>>   		parse_branch_merge_options(branch_mergeoptions);
>>   	argc = parse_options(argc, argv, prefix, builtin_merge_options,
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index a380419b65..77d85aefe7 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
>>   graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
>>   graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
>>   
>> +test_expect_success 'perform fast-forward merge in full repo' '
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	git checkout -b merge-5-to-8 commits/5 &&
>> +	git merge commits/8 &&
>> +	git show-ref -s merge-5-to-8 >output &&
>> +	git show-ref -s commits/8 >expect &&
>> +	test_cmp expect output
>> +'
> All right (though I wonder if this test catches all problems where
> BUG("bad generation skip") could have been encountered).

We will never know until we have this series running in the wild (and 
even then, some features are very obscure) and enough people turn on the 
config setting.

One goal of the "fsck and gc" series is to get this feature running 
during the rest of the test suite as much as possible, so we can get 
additional coverage. Also to get more experience from the community 
dogfooding the feature.

>
>> +
>>   test_done
> Best,


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v4 10/10] commit-graph.txt: update design document
  2018-04-30 23:32         ` Jakub Narebski
@ 2018-05-01 12:00           ` Derrick Stolee
  2018-05-02  7:57             ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:00 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

On 4/30/2018 7:32 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> We now calculate generation numbers in the commit-graph file and use
>> them in paint_down_to_common().
>>
>> Expand the section on generation numbers to discuss how the three
>> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
>> _MAX interact with other generation numbers.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Looks good.
>
>> ---
>>   Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>>   1 file changed, 24 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
>> index 0550c6d0dc..d9f2713efa 100644
>> --- a/Documentation/technical/commit-graph.txt
>> +++ b/Documentation/technical/commit-graph.txt
>> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>>   generation number and walk until reaching commits with known generation
>>   number.
>>   
>> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
>> +in the commit-graph file. If a commit-graph file was written by a version
>> +of Git that did not compute generation numbers, then those commits will
>> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
>> +
>> +Since the commit-graph file is closed under reachability, we can guarantee
>> +the following weaker condition on all commits:
>> +
>> +    If A and B are commits with generation numbers N amd M, respectively,
>> +    and N < M, then A cannot reach B.
>> +
>> +Note how the strict inequality differs from the inequality when we have
>> +fully-computed generation numbers. Using strict inequality may result in
>> +walking a few extra commits,
> The Linux kernel commit graph has a maximum of 513 commits sharing the
> same generation number, but on average only 5.43 commits share a
> generation number, with standard deviation 10.70; the median is even
> lower: 2, with a median absolute deviation (MAD) of 5.35.
>
> So on average it would be a few extra commits.  Right.
>
>>                                but the simplicity in dealing with commits
>> +with generation number *_INFINITY or *_ZERO is valuable.
> As I wrote before, handling those corner cases is more complicated, but
> not that complicated.  We could simply use the stronger condition if both
> generation numbers are ordinary generation numbers, and the weaker
> condition when at least one generation number has one of those special
> values.
>
>> +
>> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
>> +generation numbers are computed to be at least this value. We limit at
>> +this value since it is the largest value that can be stored in the
>> +commit-graph file using the 30 bits available to generation numbers. This
>> +presents another case where a commit can have generation number equal to
>> +that of a parent.
> Ordinary generation numbers, where the stronger condition holds, are those
> with GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.
>
>> +
>>   Design Details
>>   --------------
>>   
>> @@ -98,17 +121,12 @@ Future Work
>>   - The 'commit-graph' subcommand does not have a "verify" mode that is
>>     necessary for integration with fsck.
>>   
>> -- The file format includes room for precomputed generation numbers. These
>> -  are not currently computed, so all generation numbers will be marked as
>> -  0 (or "uncomputed"). A later patch will include this calculation.
>> -
> Good.
>
>>   - After computing and storing generation numbers, we must make graph
>>     walks aware of generation numbers to gain the performance benefits they
>>     enable. This will mostly be accomplished by swapping a commit-date-ordered
>>     priority queue with one ordered by generation number. The following
>> -  operations are important candidates:
>> +  operation is an important candidate:
>>   
>> -    - paint_down_to_common()
>>       - 'log --topo-order'
> Other possible candidates:
>
>         - remove_redundant() - see comment in previous patch
>         - still_interesting() - where Git uses date slop to stop walking
>           too far

remove_redundant() will be included in v5, thanks.

Instead of "still_interesting()" I'll add "git tag --merged" as the 
candidate to consider, as discussed in [1].

[1] https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u
     "branch --contains / tag --merged inconsistency"

>
>>   
>>   - Currently, parse_commit_gently() requires filling in the root tree
> One important issue left is handling features that change the view of
> project history, and their interaction with the commit-graph feature.
>
> What would happen if we turn on the commit-graph feature, generate a
> commit-graph file, and then:
>
>    * use a graft file, or remove graft entries, to cut history, remove a
>      cut, or join two [independent] histories
>    * use the git-replace mechanism to do the same
>    * in a shallow clone, deepen or shorten the clone
>
> What would happen if, without re-generating the commit-graph file
> (assuming that Git wouldn't do it for us), we run some feature that makes
> use of commit-graph data:
>
>    - git branch --contains
>    - git tag --contains
>    - git rev-list A..B
>

The commit-graph is not supported in these scenarios (yet). Grafts are
specifically mentioned in the future work section.

I'm not particularly interested in supporting these features, so they
are good avenues for other contributors to get involved in the
commit-graph feature. Eventually, they will be blockers to making the
commit-graph feature a "default" feature. That is when I will pay 
attention to these situations. For now, a user must opt-in to having a 
commit-graph file (and that same user has possibly opted in to these 
history modifying features).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-04-29  9:08         ` Jakub Narebski
@ 2018-05-01 12:10           ` Derrick Stolee
  2018-05-02 16:15             ` Jakub Narebski
  0 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:10 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, peff, avarab

On 4/29/2018 5:08 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> While preparing commits to be written into a commit-graph file, compute
>> the generation numbers using a depth-first strategy.
> Sidenote: for generation numbers it does not matter if we use
> depth-first or breadth-first strategy, but it is more natural to use
> depth-first search because generation numbers need post-order processing
> (parents before child).
>
>> The only commits that are walked in this depth-first search are those
>> without a precomputed generation number. Thus, computation time will be
>> relative to the number of new commits to the commit-graph file.
> A question: what happens if the existing commit graph is from older
> version of git and has _ZERO for generation numbers?
>
> Answer: I see that we treat both _INFINITY (not in commit-graph) and
> _ZERO (in commit graph but not computed) as not computed generation
> numbers.  All right.
>
>> If a computed generation number would exceed GENERATION_NUMBER_MAX, then
>> use GENERATION_NUMBER_MAX instead.
> All right, though I guess this would remain theoretical for a long
> while.
>
> We don't have any way of testing this, at least not without recompiling
> Git with lower value of GENERATION_NUMBER_MAX -- which means not
> automatically, isn't it?
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 9ad21c3ffb..047fa9fca5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>>   		else
>>   			packedDate[0] = 0;
>>   
>> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +
> If we stumble upon commit marked as "not in commit-graph" while writing
> commit graph, it is a BUG(), isn't it?
>
> (Problem noticed by Junio.)

Since we are computing the values for all commits in the list, this 
condition is not important and will be removed.

>
> It is a bit strange to me that the code uses get_be32 for reading, but
> htonl for writing.  Is Git tested on non little-endian machines, like
> big-endian ppc64 or s390x, or on mixed-endian machines (or
> selectable-endian machines with data endianness set to non
> little-endian, like ia64)?  If not, could we use for example openSUSE
> Build Service (https://build.opensuse.org/) for this?

Since we are packing two values into 64 bits, I am using htonl() here to 
arrange the 30-bit generation number alongside the 34-bit commit date 
value, then writing with hashwrite(). The other 32-bit integers are 
written with hashwrite_be32() to avoid translating this data in-memory.
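
As a rough illustration of that layout (not the code in the patch), the
8-byte field can also be built byte by byte, which avoids any
dependence on host endianness. The helper names below are hypothetical;
the layout assumed is the one described above: the generation number in
the top 30 bits of the first 32-bit word, and the 34-bit commit date
split between its low 2 bits and the whole second word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit value in network byte order. */
static void sketch_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Hypothetical helper: load a 32-bit value from network byte order. */
static uint32_t sketch_get_be32(const unsigned char *in)
{
	return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Pack a 30-bit generation number and a 34-bit date into 8 bytes. */
static void pack_generation_and_date(unsigned char out[8],
				     uint32_t generation, uint64_t date)
{
	assert(generation <= 0x3FFFFFFFu);
	assert(date < ((uint64_t)1 << 34));

	/* First word: generation in the top 30 bits, date bits 33..32 below. */
	sketch_put_be32(out, (generation << 2) | (uint32_t)(date >> 32));
	/* Second word: the low 32 bits of the date. */
	sketch_put_be32(out + 4, (uint32_t)(date & 0xFFFFFFFFu));
}

/* Reverse of pack_generation_and_date(). */
static void unpack_generation_and_date(const unsigned char in[8],
				       uint32_t *generation, uint64_t *date)
{
	uint32_t hi = sketch_get_be32(in);
	uint32_t lo = sketch_get_be32(in + 4);

	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

The two approaches should produce the same bytes on disk; the
difference is only whether the byte swap happens on a word in memory
or while storing each byte.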

>
>>   		packedDate[1] = htonl((*list)->date);
>>   		hashwrite(f, packedDate, 8);
>>   
>> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>>   	}
>>   }
>>   
>> +static void compute_generation_numbers(struct commit** commits,
>> +				       int nr_commits)
>> +{
>> +	int i;
>> +	struct commit_list *list = NULL;
> All right, commit_list will work as stack.
>
>> +
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
> All right, we consider _INFINITY and _ZERO as not computed.  If
> generation number is computed (by 'recursion' or from commit graph), we
> (re)use it.  This means that generation number calculation is
> incremental, as intended -- good.
>
>> +
>> +		commit_list_insert(commits[i], &list);
> Start depth-first walks from commits given.
>
>> +		while (list) {
>> +			struct commit *current = list->item;
>> +			struct commit_list *parent;
>> +			int all_parents_computed = 1;
> Here all_parents_computed is a boolean flag.  I see that it is easier to
> start with assumption that all parents will have computed generation
> numbers.
>
>> +			uint32_t max_generation = 0;
> The generation number value of 0 functions as sentinel; generation
> numbers start from 1.  Not that it matters much, as lowest possible
> generation number is 1, and we could have started from that value.

Except that for a commit with no parents, we want it to receive 
generation number max_generation + 1 = 1, so this value of 0 is important.
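
To show the sentinel at work outside of Git, here is a small
self-contained sketch of the same walk on a toy graph. Everything in it
(struct toy_commit, the GEN_* constants, the fixed-size stack) is a
stand-in invented for this example, and it places the clamp inside the
all_parents_computed branch as suggested in the review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GEN_INFINITY 0xFFFFFFFFu	/* not in the commit-graph */
#define GEN_MAX      0x3FFFFFFFu	/* largest storable value */
#define GEN_ZERO     0u			/* in the graph, not computed */
#define MAX_PARENTS  2
#define MAX_STACK    64

/* Toy stand-in for struct commit. */
struct toy_commit {
	uint32_t generation;
	size_t nr_parents;
	struct toy_commit *parents[MAX_PARENTS];
};

static int gen_is_computed(const struct toy_commit *c)
{
	return c->generation != GEN_INFINITY && c->generation != GEN_ZERO;
}

/* Depth-first walk with an explicit stack, parents before children. */
static void toy_compute_generation(struct toy_commit *start)
{
	struct toy_commit *stack[MAX_STACK];
	size_t depth = 0;

	if (gen_is_computed(start))
		return;
	stack[depth++] = start;

	while (depth) {
		struct toy_commit *current = stack[depth - 1];
		uint32_t max_generation = 0;	/* root gets 0 + 1 = 1 */
		int all_parents_computed = 1;
		size_t i;

		for (i = 0; i < current->nr_parents; i++) {
			struct toy_commit *p = current->parents[i];

			if (!gen_is_computed(p)) {
				all_parents_computed = 0;
				assert(depth < MAX_STACK);
				stack[depth++] = p;	/* visit parent first */
				break;
			}
			if (p->generation > max_generation)
				max_generation = p->generation;
		}

		if (all_parents_computed) {
			current->generation = max_generation + 1;
			if (current->generation > GEN_MAX)
				current->generation = GEN_MAX;
			depth--;
		}
	}
}
```

Starting max_generation at 1 instead would give a parentless commit
generation 2, which is why the 0 matters here.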

>
>> +
>> +			for (parent = current->parents; parent; parent = parent->next) {
>> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
>> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
>> +					all_parents_computed = 0;
>> +					commit_list_insert(parent->item, &list);
>> +					break;
> If some parent doesn't have generation number calculated, we add it to
> stack (and break out of loop because it is depth-first walk), and mark
> this situation.  All right.
>
>> +				} else if (parent->item->generation > max_generation) {
>> +					max_generation = parent->item->generation;
> Otherwise, update max_generation.  All right.
>
>> +				}
>> +			}
>> +
>> +			if (all_parents_computed) {
>> +				current->generation = max_generation + 1;
>> +				pop_commit(&list);
>> +			}
>> +
>> +			if (current->generation > GENERATION_NUMBER_MAX)
>> +				current->generation = GENERATION_NUMBER_MAX;
> This conditional should be inside all_parents_computed test, for example
> like this:
>
>    +			if (all_parents_computed) {
>    +				current->generation = max_generation + 1;
>    +				if (current->generation > GENERATION_NUMBER_MAX)
>    +					current->generation = GENERATION_NUMBER_MAX;
>    +
>    +				pop_commit(&list);
>    +			}
>
> (Noticed by Junio.)
>
> Sidenote: when we revisit the commit, returning from depth-first walk of
> one of its parents, we calculate max_generation from scratch again.
> This does not matter for performance, as it's just data access and
> calculating maximum - any workaround to not restart those calculations
> would take more time and memory.  And it's simple.
>
>> +		}
>> +	}
>> +}
>> +
>>   void write_commit_graph(const char *obj_dir,
>>   			const char **pack_indexes,
>>   			int nr_packs,
>> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
>>   	if (commits.nr >= GRAPH_PARENT_MISSING)
>>   		die(_("too many commits to write graph"));
>>   
>> +	compute_generation_numbers(commits.list, commits.nr);
>> +
> Nice and simple.  All right.
>
> I guess that we pass "struct commit** commits.list" and "int commits.nr"
> to compute_generation_numbers(), instead of "struct packed_commit_list
> commits", to keep the function nice and generic?

Good catch. There is no reason to not use packed_commit_list here.

>
>>   	graph_name = get_commit_graph_filename(obj_dir);
>>   	fd = hold_lock_file_for_update(&lk, graph_name, 0);
> Best,


^ permalink raw reply	[flat|nested] 162+ messages in thread

* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
  2018-04-29 22:14         ` Jakub Narebski
@ 2018-05-01 12:19           ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:19 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

On 4/29/2018 6:14 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Most code paths load commits using lookup_commit() and then
>> parse_commit().
> And this automatically loads commit graph if needed, thanks to changes
> in parse_commit_gently(), which parse_commit() uses.
>
>>                  In some cases, including some branch lookups, the commit
>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>> favor of parse_commit_buffer().
> I guess the problem is that we cannot just add parse_commit_in_graph()
> like we did in parse_commit_gently(), for some reason?  Like for example
> that parse_commit_gently() uses parse_commit_buffer() - which could have
> been handled by moving parse_commit_in_graph() down the call chain from
> parse_commit_gently() to parse_commit_buffer()... if not the fact that
> check_commit() also uses parse_commit_buffer(), but it does not want to
> load commit graph.  Am I right?

If a caller uses parse_commit_buffer() directly, then we will guarantee 
that all values in the struct commit that would be loaded from the 
buffer are loaded from the buffer. This means we do NOT load the root 
tree id or commit date from the commit-graph file. We do still need to 
load the data that is not available in the buffer, such as graph_pos and 
generation.

>
>> With generation numbers in the commit-graph, we need to ensure that any
>> commit that exists in the commit-graph file has its generation number
>> loaded.
> Is it generation number, or generation number and position in commit
> graph?

We don't need to ensure the graph_pos (the commit will never be 
re-parsed, so we will not try to find it in the commit-graph file 
again), but we DO need to ensure the generation (or our commit walks 
will be incorrect). We get the graph_pos as a side-effect.

>
>> Create new load_commit_graph_info() method to fill in the information
>> for a commit that exists only in the commit-graph file. Call it from
>> parse_commit_buffer() after loading the other commit information from
>> the given buffer. Only fill this information when specified by the
>> 'check_graph' parameter.
> I think this commit would be easier to review if the pure refactoring
> part (extracting fill_commit_graph_info() and find_commit_in_graph())
> were split out.  On the other hand, the refactoring was needed to
> reduce code duplication between the existing parse_commit_in_graph()
> and the new load_commit_graph_info() functions.
>
> I guess that the difference between parse_commit_in_graph() and
> load_commit_graph_info() is that the former cares only about having just
> enough information that is needed for parse_commit_gently() - and does
> not load graph data if commit is parsed, while the latter is about
> loading commit-graph data like generation numbers.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
>>   commit-graph.h |  8 ++++++++
>>   commit.c       |  7 +++++--
>>   commit.h       |  2 +-
>>   object.c       |  2 +-
>>   sha1_file.c    |  2 +-
>>   6 files changed, 46 insertions(+), 20 deletions(-)
> I wonder if it would be possible to add tests for this feature, for
> example that commit-graph is read when it should be (including those
> branch lookups), and is not read when the feature is disabled.
>
> But the only way to test it I can think of is a stupid one: create an
> invalid commit graph, and check that git fails as expected (trying to
> read said malformed file), and does not fail if the commit graph
> feature is disabled.
>
> Let me reorder files (BTW, is there a way for Git to put *.h files
> before *.c files in diff?) for easier review:
>
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 260a468e73..96cccb10f3 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>>    */
>>   int parse_commit_in_graph(struct commit *item);
>>   
>> +/*
>> + * It is possible that we loaded commit contents from the commit buffer,
>> + * but we also want to ensure the commit-graph content is correctly
>> + * checked and filled. Fill the graph_pos and generation members of
>> + * the given commit.
>> + */
>> +void load_commit_graph_info(struct commit *item);
>> +
>>   struct tree *get_commit_tree_in_graph(const struct commit *c);
>>   
>>   struct commit_graph {
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 047fa9fca5..aebd242def 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>>   	return &commit_list_insert(c, pptr)->next;
>>   }
>>   
>> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
>> +{
>> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>> +}
> The comment in the header file commit-graph.h talks about filling
> graph_pos and generation members of the given commit, but I don't see
> filling graph_pos member here.

We are missing the following line:

+    item->graph_pos = pos;

I will add it for v5. The equivalent line exists in fill_commit_in_graph().

>
> Sidenote: it is a tiny little bit strange to see symbolic constants like
> GRAPH_DATA_WIDTH used next to magic values such as 8 and 2.

There needs to be some boundary between abstraction and concreteness 
when dealing directly with a binary file format. GRAPH_DATA_WIDTH helps 
us navigate to the correct "row" in the chunk, while we use the 
constants 8 and 2 to get the correct "column" out of that row.
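
As a rough illustration of that row/column split (a sketch, not code
from the patch): the TOY_* names below are hypothetical, and the row
width assumes a 20-byte hash followed by two 4-byte parent slots and an
8-byte date/generation field.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the patch uses the literals 8 and 2 directly. */
#define TOY_HASH_LEN   20                   /* assumed: SHA-1 sized root tree id */
#define TOY_ROW_WIDTH  (TOY_HASH_LEN + 16)  /* tree id + 2 parents + 8-byte field */
#define TOY_GEN_OFFSET 8                    /* skip the two 4-byte parent slots */
#define TOY_GEN_SHIFT  2                    /* low 2 bits belong to the date */

/* Read a 32-bit big-endian value, in the spirit of git's get_be32(). */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the generation stored in row `pos` of a commit-data chunk. */
static uint32_t toy_read_generation(const unsigned char *chunk, uint32_t pos)
{
	/* "row": one fixed-width record per commit */
	const unsigned char *row = chunk + (size_t)TOY_ROW_WIDTH * pos;

	/* "column": the word after the tree id and both parent slots */
	return toy_get_be32(row + TOY_HASH_LEN + TOY_GEN_OFFSET) >> TOY_GEN_SHIFT;
}
```

The row width is the only value that depends on the position being
looked up, which is one way to read why it gets a symbolic name while
the in-row offsets stay as literals.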

>
>> +
>>   static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>>   {
>>   	uint32_t edge_value;
>> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>>   	return 1;
>>   }
>>   
>> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
>> +{
>> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> +		*pos = item->graph_pos;
>> +		return 1;
>> +	} else {
>> +		return bsearch_graph(g, &(item->object.oid), pos);
>> +	}
>> +}
> Nice refactoring here.
>
>> +
>>   int parse_commit_in_graph(struct commit *item)
>>   {
>> +	uint32_t pos;
>> +
>>   	if (!core_commit_graph)
>>   		return 0;
>>   	if (item->object.parsed)
>>   		return 1;
>> -
>>   	prepare_commit_graph();
>> -	if (commit_graph) {
>> -		uint32_t pos;
>> -		int found;
>> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> -			pos = item->graph_pos;
>> -			found = 1;
>> -		} else {
>> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
>> -		}
>> -
>> -		if (found)
>> -			return fill_commit_in_graph(item, commit_graph, pos);
>> -	}
>> -
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		return fill_commit_in_graph(item, commit_graph, pos);
>>   	return 0;
>>   }
>>   
>> +void load_commit_graph_info(struct commit *item)
>> +{
>> +	uint32_t pos;
>> +	if (!core_commit_graph)
>> +		return;
>> +	prepare_commit_graph();
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		fill_commit_graph_info(item, commit_graph, pos);
>> +}
> Similar functions, different goals (as the names imply).
>
>> +
>>   static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>>   {
>>   	struct object_id oid;
>> diff --git a/commit.c b/commit.c
>> index 4d00b0a1d6..39a3749abd 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>>   	return ret;
>>   }
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>>   {
>>   	const char *tail = buffer;
>>   	const char *bufptr = buffer;
>> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>>   	}
>>   	item->date = parse_commit_date(bufptr, tail);
>>   
>> +	if (check_graph)
>> +		load_commit_graph_info(item);
>> +
> All right, read commit-graph specific data after parsing the commit
> itself.  It is at the end because the commit object needs to be parsed
> sequentially, and it includes more info than is contained in the
> commit-graph CDAT+EDGE data.
>
>>   	return 0;
>>   }
>>   
>> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>>   		return error("Object %s not a commit",
>>   			     oid_to_hex(&item->object.oid));
>>   	}
>> -	ret = parse_commit_buffer(item, buffer, size);
>> +	ret = parse_commit_buffer(item, buffer, size, 0);
> The parse_commit_gently() contract is that it provides only the bare
> minimum of information, from the commit-graph if possible, and reads
> the object from disk and parses it only when it cannot avoid doing so.
> If it needs to parse it, it doesn't need to fill commit-graph specific
> data again.
>
> All right.
>
>>   	if (save_commit_buffer && !ret) {
>>   		set_commit_buffer(item, buffer, size);
>>   		return 0;
>> diff --git a/commit.h b/commit.h
>> index 64436ff44e..b5afde1ae9 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>>    */
>>   struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>>   int parse_commit_gently(struct commit *item, int quiet_on_missing);
>>   static inline int parse_commit(struct commit *item)
>>   {
>> diff --git a/object.c b/object.c
>> index e6ad3f61f0..efe4871325 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>>   	} else if (type == OBJ_COMMIT) {
>>   		struct commit *commit = lookup_commit(oid);
>>   		if (commit) {
>> -			if (parse_commit_buffer(commit, buffer, size))
>> +			if (parse_commit_buffer(commit, buffer, size, 1))
> All that rigamarole was needed because of
>
> DS>                 In some cases, including some branch lookups, the commit
> DS> is parsed using parse_object_buffer() which side-steps parse_commit() in
> DS> favor of parse_commit_buffer().
>
> Here we want parse_object_buffer() to get also commit-graph specific
> data, if available.  All right.
>
>>   				return NULL;
>>   			if (!get_cached_commit_buffer(commit, NULL)) {
>>   				set_commit_buffer(commit, buffer, size);
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 1b94f39c4c..0fd4f0b8b6 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>>   {
>>   	struct commit c;
>>   	memset(&c, 0, sizeof(c));
>> -	if (parse_commit_buffer(&c, buf, size))
>> +	if (parse_commit_buffer(&c, buf, size, 0))
> For check we don't need commit graph data.  Looks all right.
>
>>   		die("corrupt commit");
>>   }
> Best,



* [PATCH v5 00/11] Compute and consume generation numbers
  2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
                         ` (10 preceding siblings ...)
  2018-04-25 14:40       ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee
@ 2018-05-01 12:47       ` " Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
                           ` (11 more replies)
  11 siblings, 12 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

Most of the changes from v4 are cosmetic, but there is one new commit:

	commit: use generation number in remove_redundant()

Other changes are non-functional, but do clarify things.

Inter-diff from v4:

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index d9f2713efa..e1a883eb46 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -125,9 +125,10 @@ Future Work
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operation is an important candidate:
+  operations are important candidates:

     - 'log --topo-order'
+    - 'tag --merged'

 - Currently, parse_commit_gently() requires filling in the root tree
   object for a commit. This passes through lookup_tree() and consequently
diff --git a/commit-graph.c b/commit-graph.c
index aebd242def..a8c337dd77 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -248,6 +248,7 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
        const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+       item->graph_pos = pos;
        item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }

@@ -454,8 +455,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
                else
                        packedDate[0] = 0;

-               if ((*list)->generation != GENERATION_NUMBER_INFINITY)
-                       packedDate[0] |= htonl((*list)->generation << 2);
+               packedDate[0] |= htonl((*list)->generation << 2);

                packedDate[1] = htonl((*list)->date);
                hashwrite(f, packedDate, 8);
@@ -589,18 +589,17 @@ static void close_reachable(struct packed_oid_list *oids)
        }
 }

-static void compute_generation_numbers(struct commit** commits,
-                                      int nr_commits)
+static void compute_generation_numbers(struct packed_commit_list* commits)
 {
        int i;
        struct commit_list *list = NULL;

-       for (i = 0; i < nr_commits; i++) {
-               if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
-                   commits[i]->generation != GENERATION_NUMBER_ZERO)
+       for (i = 0; i < commits->nr; i++) {
+               if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY &&
+                   commits->list[i]->generation != GENERATION_NUMBER_ZERO)
                        continue;

-               commit_list_insert(commits[i], &list);
+               commit_list_insert(commits->list[i], &list);
                while (list) {
                        struct commit *current = list->item;
                        struct commit_list *parent;
@@ -621,10 +620,10 @@ static void compute_generation_numbers(struct commit** commits,
                        if (all_parents_computed) {
                                current->generation = max_generation + 1;
                                pop_commit(&list);
-                       }

-                       if (current->generation > GENERATION_NUMBER_MAX)
-                               current->generation = GENERATION_NUMBER_MAX;
+                               if (current->generation > GENERATION_NUMBER_MAX)
+                                       current->generation = GENERATION_NUMBER_MAX;
+                       }
                }
        }
 }
@@ -752,7 +751,7 @@ void write_commit_graph(const char *obj_dir,
        if (commits.nr >= GRAPH_PARENT_MISSING)
                die(_("too many commits to write graph"));

-       compute_generation_numbers(commits.list, commits.nr);
+       compute_generation_numbers(&commits);

        graph_name = get_commit_graph_filename(obj_dir);
        fd = hold_lock_file_for_update(&lk, graph_name, 0);
diff --git a/commit.c b/commit.c
index e2e16ea1a7..5064db4e61 100644
--- a/commit.c
+++ b/commit.c
@@ -835,7 +835,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
                int flags;

                if (commit->generation > last_gen)
-                       BUG("bad generation skip");
+                       BUG("bad generation skip %8x > %8x at %s",
+                           commit->generation, last_gen,
+                           oid_to_hex(&commit->object.oid));
                last_gen = commit->generation;

                if (commit->generation < min_generation)
@@ -947,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
                parse_commit(array[i]);
        for (i = 0; i < cnt; i++) {
                struct commit_list *common;
+               uint32_t min_generation = GENERATION_NUMBER_INFINITY;

                if (redundant[i])
                        continue;
@@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
                                continue;
                        filled_index[filled] = j;
                        work[filled++] = array[j];
+
+                       if (array[j]->generation < min_generation)
+                               min_generation = array[j]->generation;
                }
-               common = paint_down_to_common(array[i], filled, work, 0);
+               common = paint_down_to_common(array[i], filled, work,
+                                             min_generation);
                if (array[i]->object.flags & PARENT2)
                        redundant[i] = 1;
                for (j = 0; j < filled; j++)
@@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
        for (i = 0; i < nr_reference; i++) {
                if (parse_commit(reference[i]))
                        return ret;
-               if (min_generation > reference[i]->generation)
+               if (reference[i]->generation < min_generation)
                        min_generation = reference[i]->generation;
        }


-- >8 --

Derrick Stolee (11):
  ref-filter: fix outdated comment on in_commit_list
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph: always load commit-graph information
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit: use generation number in remove_redundant()
  merge: check config before loading commits
  commit-graph.txt: update design document

 Documentation/technical/commit-graph.txt | 30 ++++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  7 +-
 commit-graph.c                           | 91 ++++++++++++++++++++----
 commit-graph.h                           |  8 +++
 commit.c                                 | 61 +++++++++++++---
 commit.h                                 |  7 +-
 object.c                                 |  2 +-
 ref-filter.c                             | 26 +++++--
 sha1_file.c                              |  2 +-
 t/t5318-commit-graph.sh                  |  9 +++
 11 files changed, 204 insertions(+), 40 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 02/11] commit: add generation number to struct commmit Derrick Stolee
                           ` (10 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.

Reported-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..aff24d93be 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 }
 
 /*
- * Test whether the candidate or one of its parents is contained in the list.
+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 02/11] commit: add generation number to struct commmit
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 03/11] commit-graph: compute generation numbers Derrick Stolee
                           ` (9 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_MAX 0x3FFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 	 */
 	struct tree *maybe_tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb
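
The recursive definition in the commit message above can be sketched as
a small standalone helper. This is only an illustration under stated
assumptions: toy_generation_from_parents() and toy_generation_is_unset()
are hypothetical names, the caller is assumed to pass parent generations
that are already valid, and the clamp to GENERATION_NUMBER_MAX mirrors
what a later patch in the series does.

```c
#include <stdint.h>

/* Same values as the macros added to commit.h by the patch. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Hypothetical helper: one more than the maximum parent generation,
 * or one for a commit with no parents. The result is clamped so it
 * still fits in the 30 bits the file format leaves for it.
 */
static uint32_t toy_generation_from_parents(const uint32_t *parent_gens,
					    int nr_parents)
{
	uint32_t max_gen = 0;
	int i;

	for (i = 0; i < nr_parents; i++)
		if (parent_gens[i] > max_gen)
			max_gen = parent_gens[i];

	if (max_gen >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_gen + 1;
}

/* Hypothetical helper: true when no real generation has been stored yet. */
static int toy_generation_is_unset(uint32_t gen)
{
	return gen == GENERATION_NUMBER_INFINITY ||
	       gen == GENERATION_NUMBER_ZERO;
}
```

A root commit gets generation 1 from this helper, and a merge of parents
with generations 3 and 7 gets 8.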



* [PATCH v5 03/11] commit-graph: compute generation numbers
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 02/11] commit: add generation number to struct commmit Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 04/11] commit: use generations in paint_down_to_common() Derrick Stolee
                           ` (8 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..36d765e10a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		packedDate[0] |= htonl((*list)->generation << 2);
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -571,6 +573,45 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct packed_commit_list* commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < commits->nr; i++) {
+		if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits->list[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits->list[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+
+				if (current->generation > GENERATION_NUMBER_MAX)
+					current->generation = GENERATION_NUMBER_MAX;
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -694,6 +735,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(&commits);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb
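
For readers who want to try the depth-first strategy from the patch
above outside of git, here is a minimal standalone sketch. It is not
the patch's code: the toy_* names are hypothetical, commits are array
indices rather than struct commit pointers, a single TOY_GEN_UNKNOWN
value stands in for both _INFINITY and _ZERO, and the fixed-size stack
assumes an acyclic graph of at most TOY_MAX_COMMITS commits.

```c
#include <stdint.h>

#define TOY_GEN_UNKNOWN 0u          /* stands in for _INFINITY and _ZERO */
#define TOY_GEN_MAX     0x3FFFFFFFu
#define TOY_MAX_PARENTS 2
#define TOY_MAX_COMMITS 64

struct toy_commit {
	int parents[TOY_MAX_PARENTS];   /* indices into the array, -1 for none */
	uint32_t generation;
};

/* Fill in every unknown generation; nr must not exceed TOY_MAX_COMMITS. */
static void toy_compute_generations(struct toy_commit *commits, int nr)
{
	int stack[TOY_MAX_COMMITS];
	int i;

	for (i = 0; i < nr; i++) {
		int top = 0;

		if (commits[i].generation != TOY_GEN_UNKNOWN)
			continue;   /* already computed, nothing to walk */

		stack[top++] = i;
		while (top > 0) {
			struct toy_commit *cur = &commits[stack[top - 1]];
			uint32_t max_gen = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < TOY_MAX_PARENTS; p++) {
				int parent = cur->parents[p];

				if (parent < 0)
					continue;
				if (commits[parent].generation == TOY_GEN_UNKNOWN) {
					/* visit this parent first, then retry cur */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (commits[parent].generation > max_gen)
					max_gen = commits[parent].generation;
			}

			if (all_parents_computed) {
				cur->generation = max_gen < TOY_GEN_MAX ?
						  max_gen + 1 : TOY_GEN_MAX;
				top--;
			}
		}
	}
}
```

The explicit stack plays the role of the commit_list in the patch: a
commit stays on it until all of its parents have a generation, so the
walk only descends into commits that still lack one.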



* [PATCH v5 04/11] commit: use generations in paint_down_to_common()
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (2 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 03/11] commit-graph: compute generation numbers Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 05/11] commit-graph: always load commit-graph information Derrick Stolee
                           ` (7 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 20 +++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	/* newer commits first */
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* use date as a heuristic when generations are equal */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.39.g685157f7fb
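
To see the ordering this comparator produces, the following standalone
sketch applies the same comparison to a plain array with qsort(). The
toy_* names are hypothetical, and the two-argument comparator is an
adaptation for qsort(); the function in the patch takes a third, unused
argument because it is used with a prio_queue.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two struct commit fields being compared. */
struct toy_queued_commit {
	uint32_t generation;
	uint64_t date;
};

/* Higher generation first; among equal generations, newer date first. */
static int toy_compare_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_queued_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* date is only a tie-breaker when generations are equal */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* Sort into the order in which a walk would pop the commits. */
static void toy_sort_for_walk(struct toy_queued_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), toy_compare_gen_then_date);
}
```

A commit whose generation is still 0xFFFFFFFF (not in the commit-graph
file) sorts ahead of every commit that has a computed generation, which
matches the commit message's point that generations in the file are
always below GENERATION_NUMBER_INFINITY.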



* [PATCH v5 05/11] commit-graph: always load commit-graph information
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (3 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 04/11] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 06/11] ref-filter: use generation number for --contains Derrick Stolee
                           ` (6 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 +++++++++++++++++++++++++++++++---------------
 commit-graph.h |  8 ++++++++
 commit.c       |  7 +++++--
 commit.h       |  2 +-
 object.c       |  2 +-
 sha1_file.c    |  2 +-
 6 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 36d765e10a..a8c337dd77 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,13 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	item->graph_pos = pos;
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	uint32_t edge_value;
@@ -292,31 +299,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
+{
+	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		*pos = item->graph_pos;
+		return 1;
+	} else {
+		return bsearch_graph(g, &(item->object.oid), pos);
+	}
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+	uint32_t pos;
+
 	if (!core_commit_graph)
 		return 0;
 	if (item->object.parsed)
 		return 1;
-
 	prepare_commit_graph();
-	if (commit_graph) {
-		uint32_t pos;
-		int found;
-		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-			pos = item->graph_pos;
-			found = 1;
-		} else {
-			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
-		}
-
-		if (found)
-			return fill_commit_in_graph(item, commit_graph, pos);
-	}
-
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		return fill_commit_in_graph(item, commit_graph, pos);
 	return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+	uint32_t pos;
+	if (!core_commit_graph)
+		return;
+	prepare_commit_graph();
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
 {
 	struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
+	if (check_graph)
+		load_commit_graph_info(item);
+
 	return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return error("Object %s not a commit",
 			     oid_to_hex(&item->object.oid));
 	}
-	ret = parse_commit_buffer(item, buffer, size);
+	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
 		set_commit_buffer(item, buffer, size);
 		return 0;
diff --git a/commit.h b/commit.h
index 64436ff44e..b5afde1ae9 100644
--- a/commit.h
+++ b/commit.h
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
  */
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
diff --git a/object.c b/object.c
index e6ad3f61f0..efe4871325 100644
--- a/object.c
+++ b/object.c
@@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (parse_commit_buffer(commit, buffer, size, 1))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
diff --git a/sha1_file.c b/sha1_file.c
index 1b94f39c4c..0fd4f0b8b6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
 {
 	struct commit c;
 	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(&c, buf, size))
+	if (parse_commit_buffer(&c, buf, size, 0))
 		die("corrupt commit");
 }
 
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 06/11] ref-filter: use generation number for --contains
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (4 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 05/11] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() Derrick Stolee
                           ` (5 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.
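
As a rough sketch of the cutoff logic (not the ref-filter.c code
itself), the check can be modeled on plain arrays of generation
numbers. contains_cutoff() and can_skip_candidate() are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Smallest generation number among the wanted commits. */
uint32_t contains_cutoff(const uint32_t *want_gen, size_t nr)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	assert(want_gen != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (want_gen[i] < cutoff)
			cutoff = want_gen[i];
	return cutoff;
}

/*
 * A candidate whose generation is below the cutoff cannot reach any
 * wanted commit, so the walk need not descend into its parents.
 */
int can_skip_candidate(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}
```

A candidate whose generation equals the cutoff is not skipped, since
it may itself be one of the wanted commits.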

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following performance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index aff24d93be..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 static struct ref_msg {
 	const char *gone;
@@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate,
 
 	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		load_commit_graph_info(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 07/11] commit: use generation numbers for in_merge_bases()
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (5 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 06/11] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() Derrick Stolee
                           ` (4 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..3ecdc13356 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (reference[i]->generation < min_generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return ret;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common()
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (6 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee
                           ` (3 subsequent siblings)
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with a
generation number lower than that of the specific commit are not
needed to correctly answer the containment query of
in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.
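
The effect of the cutoff can be sketched on a toy DAG. This is a
simplified model, not the commit.c implementation: it leaves out the
PARENT1/PARENT2 painting, uses array indices in place of commit
pointers, and replaces the priority queue with a linear scan. struct
toy_commit and walk_with_cutoff() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Toy commit: indices into a commit array stand in for pointers. */
struct toy_commit {
	uint32_t generation;
	int nr_parents;
	int parents[MAX_PARENTS];
};

/*
 * Count the commits visited when walking from 'start', always taking
 * the queued commit with the highest generation next, and stopping
 * once that commit falls below min_generation.
 */
int walk_with_cutoff(const struct toy_commit *commits, int nr, int start,
		     uint32_t min_generation)
{
	int queued[MAX_COMMITS] = { 0 };
	int done[MAX_COMMITS] = { 0 };
	int visited = 0;

	assert(nr <= MAX_COMMITS && start >= 0 && start < nr);
	queued[start] = 1;
	for (;;) {
		int best = -1, i;

		/* Linear scan stands in for the priority queue. */
		for (i = 0; i < nr; i++) {
			if (!queued[i] || done[i])
				continue;
			if (best < 0 ||
			    commits[i].generation > commits[best].generation)
				best = i;
		}
		if (best < 0 || commits[best].generation < min_generation)
			break;
		done[best] = 1;
		visited++;
		for (i = 0; i < commits[best].nr_parents; i++)
			queued[commits[best].parents[i]] = 1;
	}
	return visited;
}
```

On a linear chain of four commits with generations 1 through 4, a
cutoff of 0 visits every commit, while a cutoff of 3 stops after the
two commits at generations 4 and 3.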

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 3ecdc13356..9875feec01 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -831,6 +834,15 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip %8x > %8x at %s",
+			    commit->generation, last_gen,
+			    oid_to_hex(&commit->object.oid));
+		last_gen = commit->generation;
+
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -879,7 +891,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -946,7 +958,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1070,7 +1082,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return ret;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 09/11] commit: use generation number in remove_redundant()
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (7 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 15:37           ` Derrick Stolee
  2018-05-03 18:45           ` Jakub Narebski
  2018-05-01 12:47         ` [PATCH v5 10/11] merge: check config before loading commits Derrick Stolee
                           ` (2 subsequent siblings)
  11 siblings, 2 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

The static remove_redundant() method is used to filter a list
of commits by removing those that are reachable from another
commit in the list. This is used to remove all possible merge-
bases except a maximal, mutually independent set.

To determine whether these commits are independent, we use a number of
paint_down_to_common() walks and use the PARENT1, PARENT2 flags
to determine reachability. Since we only care about reachability
and not the full set of merge-bases between 'one' and 'twos', we
can use the 'min_generation' parameter to short-circuit the walk.

When no commit-graph exists, there is no change in behavior.

For a copy of the Linux repository, we measured the following
performance improvements:

git merge-base v3.3 v4.5

Before: 234 ms
 After: 208 ms
 Rel %: -11%

git merge-base v4.3 v4.5

Before: 102 ms
 After:  83 ms
 Rel %: -19%

The experiments above were chosen to demonstrate that we are
improving the filtering of the merge-base set. In the first
example, more time is spent walking the history to find the
set of merge bases before the remove_redundant() call. The
starting commits are closer together in the second example,
therefore more time is spent in remove_redundant(). The relative
change in performance differs as expected.

Reported-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 9875feec01..5064db4e61 100644
--- a/commit.c
+++ b/commit.c
@@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
 		parse_commit(array[i]);
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *common;
+		uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 		if (redundant[i])
 			continue;
@@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
 				continue;
 			filled_index[filled] = j;
 			work[filled++] = array[j];
+
+			if (array[j]->generation < min_generation)
+				min_generation = array[j]->generation;
 		}
-		common = paint_down_to_common(array[i], filled, work, 0);
+		common = paint_down_to_common(array[i], filled, work,
+					      min_generation);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 10/11] merge: check config before loading commits
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (8 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-01 12:47         ` [PATCH v5 11/11] commit-graph.txt: update design document Derrick Stolee
  2018-05-03 11:18         ` [PATCH v5 00/11] Compute and consume generation numbers Jakub Narebski
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.
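
The ordering check that triggers this BUG() can be sketched as a scan
over the generation numbers in the order commits leave the queue. The
pop order used in the note below is an illustrative guess, not a trace
from Git, and first_generation_skip() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/*
 * Return the index of the first popped commit whose generation exceeds
 * that of the commit popped before it, or -1 if there is none.
 */
long first_generation_skip(const uint32_t *popped, size_t nr)
{
	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
	size_t i;

	assert(popped != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (popped[i] > last_gen)
			return (long)i;
		last_gen = popped[i];
	}
	return -1;
}
```

If HEAD was loaded with "infinite" generation and is queued again
after commits with generations 8, 7 and 6 have been popped, the
sequence { INFINITY, 8, 7, 6, INFINITY } fails the check at index 4.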

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 7 ++++---
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
-
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
 	argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



* [PATCH v5 11/11] commit-graph.txt: update design document
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (9 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 10/11] merge: check config before loading commits Derrick Stolee
@ 2018-05-01 12:47         ` Derrick Stolee
  2018-05-03 11:18         ` [PATCH v5 00/11] Compute and consume generation numbers Jakub Narebski
  11 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.
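
How the three special values behave under the weaker reachability
condition can be sketched as follows. The macro values are the ones
given in the design document; generation_from_parents() and
cannot_reach() are hypothetical helpers written for this illustration
and are not functions in Git.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Generation for a commit given the largest generation among its
 * parents (0 when it has none), clamped at GENERATION_NUMBER_MAX.
 */
uint32_t generation_from_parents(uint32_t max_parent_generation)
{
	assert(max_parent_generation <= GENERATION_NUMBER_MAX);
	if (max_parent_generation >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_parent_generation + 1;
}

/* With N = gen_a and M = gen_b: if N < M, then A cannot reach B. */
int cannot_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

Equal values, including two commits at _ZERO or two at _MAX, give no
answer, so a walk has to continue in those cases.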

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 29 ++++++++++++++++++++----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..e1a883eb46 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --------------
 
@@ -98,18 +121,14 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
   operations are important candidates:
 
-    - paint_down_to_common()
     - 'log --topo-order'
+    - 'tag --merged'
 
 - Currently, parse_commit_gently() requires filling in the root tree
   object for a commit. This passes through lookup_tree() and consequently
-- 
2.17.0.39.g685157f7fb



* Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()
  2018-05-01 12:47         ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee
@ 2018-05-01 15:37           ` Derrick Stolee
  2018-05-03 18:45           ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-01 15:37 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: gitster, peff, jnareb, avarab



On 5/1/2018 8:47 AM, Derrick Stolee wrote:
> The static remove_redundant() method is used to filter a list
> of commits by removing those that are reachable from another
> commit in the list. This is used to remove all possible merge-
> bases except a maximal, mutually independent set.
>
> To determine whether these commits are independent, we use a number of
> paint_down_to_common() walks and use the PARENT1, PARENT2 flags
> to determine reachability. Since we only care about reachability
> and not the full set of merge-bases between 'one' and 'twos', we
> can use the 'min_generation' parameter to short-circuit the walk.
>
> When no commit-graph exists, there is no change in behavior.
>
> For a copy of the Linux repository, we measured the following
> performance improvements:
>
> git merge-base v3.3 v4.5
>
> Before: 234 ms
>   After: 208 ms
>   Rel %: -11%
>
> git merge-base v4.3 v4.5
>
> Before: 102 ms
>   After:  83 ms
>   Rel %: -19%
>
> The experiments above were chosen to demonstrate that we are
> improving the filtering of the merge-base set. In the first
> example, more time is spent walking the history to find the
> set of merge bases before the remove_redundant() call. The
> starting commits are closer together in the second example,
> therefore more time is spent in remove_redundant(). The relative
> change in performance differs as expected.
>
> Reported-by: Jakub Narebski <jnareb@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   commit.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 9875feec01..5064db4e61 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
>   		parse_commit(array[i]);
>   	for (i = 0; i < cnt; i++) {
>   		struct commit_list *common;
> +		uint32_t min_generation = GENERATION_NUMBER_INFINITY;

This initialization should be

     uint32_t min_generation = array[i]->generation;

since the assignment (using j) below skips the ith commit.
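
A minimal sketch of the corrected computation, modeled on plain arrays
rather than struct commit (walk_cutoff() is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Minimum generation over entry i and every other non-redundant entry.
 * Seeding with gen[i] covers the commit that the inner loop skips.
 */
uint32_t walk_cutoff(const uint32_t *gen, const int *redundant,
		     size_t cnt, size_t i)
{
	uint32_t min_generation;
	size_t j;

	assert(i < cnt);
	min_generation = gen[i];
	for (j = 0; j < cnt; j++) {
		if (j == i || redundant[j])
			continue;
		if (gen[j] < min_generation)
			min_generation = gen[j];
	}
	return min_generation;
}
```

With generations { 5, 9, 12 } and i = 0, this gives 5. Seeding with
GENERATION_NUMBER_INFINITY instead would give 9.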

>   
>   		if (redundant[i])
>   			continue;
> @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
>   				continue;
>   			filled_index[filled] = j;
>   			work[filled++] = array[j];
> +
> +			if (array[j]->generation < min_generation)
> +				min_generation = array[j]->generation;
>   		}
> -		common = paint_down_to_common(array[i], filled, work, 0);
> +		common = paint_down_to_common(array[i], filled, work,
> +					      min_generation);
>   		if (array[i]->object.flags & PARENT2)
>   			redundant[i] = 1;
>   		for (j = 0; j < filled; j++)



* Re: [PATCH v4 10/10] commit-graph.txt: update design document
  2018-05-01 12:00           ` Derrick Stolee
@ 2018-05-02  7:57             ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-05-02  7:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

> On 4/30/2018 7:32 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
[...]
>>>   - After computing and storing generation numbers, we must make graph
>>>     walks aware of generation numbers to gain the performance benefits they
>>>     enable. This will mostly be accomplished by swapping a commit-date-ordered
>>>     priority queue with one ordered by generation number. The following
>>> -  operations are important candidates:
>>> +  operation is an important candidate:
>>>   -    - paint_down_to_common()
>>>       - 'log --topo-order'
>>
>> Other possible candidates:
>>
>>         - remove_redundant() - see comment in previous patch
>>         - still_interesting() - where Git uses date slop to stop walking
>>           too far
>
> remove_redundant() will be included in v5, thanks.

Oh.  Nice.

I'll try to review the new patch in detail soon.

> Instead of "still_interesting()" I'll add "git tag --merged" as the
> candidate to consider, as discussed in [1].
>
> [1] https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u
>     "branch --contains / tag --merged inconsistency"

All right.  I have mentioned still_interesting() as a hint where
additional generation-number-based optimizations may lurk
(because that's where a heuristic based on dates is used, similarly to
how it was done in this series with paint_down_to_common()).

[...]
>> One important issue left is handling features that change view of
>> project history, and their interaction with commit-graph feature.
>>
>> What would happen, if we turn on commit-graph feature, generate commit
>> graph file, and then:
>>
>>    * use graft file or remove graft entries to cut history, or remove cut
>>      or join two [independent] histories.
>>    * use git-replace mechanism to do the same
>>    * in shallow clone, deepen or shorten the clone
>>
>> What would happen if, without re-generating the commit-graph file (assuming
>> that Git wouldn't do it for us), we run some feature that makes use of
>> commit-graph data:
>>
>>    - git branch --contains
>>    - git tag --contains
>>    - git rev-list A..B
>>
>
> The commit-graph is not supported in these scenarios (yet). grafts are
> specifically mentioned in the future work section.
>
> I'm not particularly interested in supporting these features, so they
> are good venues for other contributors to get involved in the
> commit-graph feature. Eventually, they will be blockers to making the
> commit-graph feature a "default" feature. That is when I will pay
> attention to these situations. For now, a user must opt-in to having a
> commit-graph file (and that same user has possibly opted in to these
> history modifying features).

Well, that is a sensible approach.  Get commit-graph features in working
condition, and worry about being able to turn it on by default later.

Nice to have it clarified.  I'll stop nagging about that, then ;-P

One issue: 'grafts' are mentioned in the future work section of the
technical documentation, but we don't have *any* warning about
commit-graph limitations in user-facing documentation, that is
git-commit-graph(1) manpage.

Best,
-- 
Jakub Narębski


* Re: [PATCH v4 09/10] merge: check config before loading commits
  2018-05-01 11:52           ` Derrick Stolee
@ 2018-05-02 11:41             ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-05-02 11:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 4/30/2018 6:54 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> Now that we use generation numbers from the commit-graph, we must
>>> ensure that all commits that exist in the commit-graph are loaded
>>> from that file instead of from the object database. Since the
>>> commit-graph file is only checked if core.commitGraph is true, we
>>> must check the default config before we load any commits.
>>>
>>> In the merge builtin, the config was checked after loading the HEAD
>>> commit. This was due to the use of the global 'branch' when checking
>>> merge-specific config settings.
>>>
>>> Move the config load to be between the initialization of 'branch' and
>>> the commit lookup.
>>
>> Sidenote: I wonder why reading config was postponed to later in the
>> command lifetime... I guess it was to avoid having to read config if
>> HEAD was invalid.
>
> The 'branch' does need to be loaded before the call to git_config (as
> I found out after moving the config call too early), so I suppose it
> was natural to pair that with resolving head_commit.

Right, so there was only a limited number of places where the call to
git_config() could be placed correctly. Now I wonder no more.

[...]
>>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>>> index a380419b65..77d85aefe7 100755
>>> --- a/t/t5318-commit-graph.sh
>>> +++ b/t/t5318-commit-graph.sh
>>> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
>>>   graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
>>>   graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
>>>   +test_expect_success 'perform fast-forward merge in full repo' '
>>> +	cd "$TRASH_DIRECTORY/full" &&
>>> +	git checkout -b merge-5-to-8 commits/5 &&
>>> +	git merge commits/8 &&
>>> +	git show-ref -s merge-5-to-8 >output &&
>>> +	git show-ref -s commits/8 >expect &&
>>> +	test_cmp expect output
>>> +'
>> All right.  (Though I wonder if this test catches all problems where
>> BUG("bad generation skip") could have been encountered.)
>
> We will never know until we have this series running in the wild (and
> even then, some features are very obscure) and enough people turn on
> the config setting.
>
> One goal of the "fsck and gc" series is to get this feature running
> during the rest of the test suite as much as possible, so we can get
> additional coverage. Also to get more experience from the community
> dogfooding the feature.

Sidenote: for two out of the three features that change the view of
history, we could also update the commit-graph automatically:
* the shortening or deepening of a shallow clone could also re-calculate
  the commit graph (or invalidate it)
* git-replace could check if the replacement modifies history, and if
  so, recalculate the commit graph (or invalidate it/check its validity)
* there is no such possibility for grafts, but they are deprecated anyway

-- 
Jakub Narębski

* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
  2018-05-01 11:47           ` Derrick Stolee
@ 2018-05-02 13:05             ` Jakub Narebski
  2018-05-02 13:42               ` Derrick Stolee
  0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-05-02 13:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 4/30/2018 6:19 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
[...]
>>> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>>   		struct commit_list *parents;
>>>   		int flags;
>>>   +		if (commit->generation > last_gen)
>>> +			BUG("bad generation skip");
>>> +		last_gen = commit->generation;
>> Shouldn't we provide more information about where the problem is to the
>> user, to make it easier to debug the repository / commit-graph data?
>>
>> Good to have this sanity check here.
>
> This BUG() _should_ only be seen by developers who add callers which
> do not load commits from the commit-graph file. There is a chance that
> there are cases not covered by this patch and the added tests,
> though. Hopefully we catch them all by dogfooding the feature before
> turning it on by default.
>
> I can add the following to help debug these bad situations:
>
> +			BUG("bad generation skip %d > %d at %s",
> +			    commit->generation, last_gen,
> +			    oid_to_hex(&commit->object.oid));

On one hand, after thinking about this a bit, I agree that this BUG() is
more about catching errors in Git code than in the repository.

On the other hand, the more detailed information could help determine
what the problem is (e.g. that "at <hex>" is at HEAD).

Hopefully we won't see which is which, as it would mean bugs in Git ;))
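
To make the sanity check concrete, here is a minimal stand-alone sketch
of the invariant being tested.  It is not Git's code: `struct
walk_commit` and `first_generation_skip()` are hypothetical names, and
where Git would call BUG() and abort, this sketch only prints the
diagnostic and returns the offending position.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu

/* Hypothetical stand-in for the two fields the check looks at. */
struct walk_commit {
	uint32_t generation;
	const char *hex; /* object name, as oid_to_hex() would print it */
};

/*
 * Visit commits in the order a generation-ordered priority queue
 * would pop them and verify that generation numbers never increase.
 * Returns the index of the first offending commit, or -1 if the
 * order is consistent.
 */
static int first_generation_skip(const struct walk_commit *popped, int nr)
{
	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
	int i;

	assert(popped || nr == 0);
	for (i = 0; i < nr; i++) {
		if (popped[i].generation > last_gen) {
			fprintf(stderr, "bad generation skip %u > %u at %s\n",
				(unsigned)popped[i].generation,
				(unsigned)last_gen, popped[i].hex);
			return i;
		}
		last_gen = popped[i].generation;
	}
	return -1;
}
```

With the commit name in the message, a report tells us right away
whether the offending commit is, say, HEAD (loaded before the config
was read) or something deeper in history.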

[...]
>>> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>>   			filled_index[filled] = j;
>>>   			work[filled++] = array[j];
>>>   		}
>>> -		common = paint_down_to_common(array[i], filled, work);
>>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>
>> Here we are interested not only if "one"/array[i] is reachable from
>> "twos"/work, but also if "twos" is reachable from "one".  Simple cutoff
>> only works in one way, though I wonder if we couldn't use cutoff being
>> minimum generation number of "one" and "twos" together.
>>
>> But that may be left for a separate commit (after checking that the
>> above is correct).
>>
>> Not as simple and obvious as paint_down_to_common() used in
>> in_merge_bases_any(), so it is all right.
>
> Thanks for reporting this. Since we are only concerned about
> reachability in this method, it is a good candidate to use
> min_generation. It is also subtle enough that we should leave it as a
> separate commit.

Thanks for checking this, and for the followup.

>                  Also, we can measure performance improvements
> separately, as I will mention in my commit message (but I'll copy it
> here):
>
>     For a copy of the Linux repository, we measured the following
>     performance improvements:
>
>     git merge-base v3.3 v4.5
>
>     Before: 234 ms
>      After: 208 ms
>      Rel %: -11%
>
>     git merge-base v4.3 v4.5
>
>     Before: 102 ms
>      After:  83 ms
>      Rel %: -19%
>
>     The experiments above were chosen to demonstrate that we are
>     improving the filtering of the merge-base set. In the first
>     example, more time is spent walking the history to find the
>     set of merge bases before the remove_redundant() call. The
>     starting commits are closer together in the second example,
>     therefore more time is spent in remove_redundant(). The relative
>     change in performance differs as expected.

Nice.

I was not expecting as much performance improvement as we got for the
--contains tests, because remove_redundant() is the final step in a
longer process dominated by the main calculations.  Still, nothing to
sneeze at.

Best regards,
-- 
Jakub Narębski


* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
  2018-05-02 13:05             ` Jakub Narebski
@ 2018-05-02 13:42               ` Derrick Stolee
  0 siblings, 0 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-05-02 13:42 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

On 5/2/2018 9:05 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>>      For a copy of the Linux repository, we measured the following
>>      performance improvements:
>>
>>      git merge-base v3.3 v4.5
>>
>>      Before: 234 ms
>>       After: 208 ms
>>       Rel %: -11%
>>
>>      git merge-base v4.3 v4.5
>>
>>      Before: 102 ms
>>       After:  83 ms
>>       Rel %: -19%
>>
>>      The experiments above were chosen to demonstrate that we are
>>      improving the filtering of the merge-base set. In the first
>>      example, more time is spent walking the history to find the
>>      set of merge bases before the remove_redundant() call. The
>>      starting commits are closer together in the second example,
>>      therefore more time is spent in remove_redundant(). The relative
>>      change in performance differs as expected.
> Nice.
>
> I was not expecting as much performance improvement as we got for the
> --contains tests, because remove_redundant() is the final step in a
> longer process dominated by the main calculations.  Still, nothing to
> sneeze at.

One reason these numbers are not too surprising is that 
remove_redundant() can demonstrate quadratic behavior. It is calculating 
pair-wise reachability by starting a walk at each of the candidates (in 
the worst case). In typical cases, the first walk marks many of the 
other candidates as redundant and we don't need to start walks from 
those commits.

A possible optimization could be to sort the candidates by descending
generation so that the first walk is likely to mark the rest as
redundant. But this may already be the case if the candidates are added
to the list in order of "discovery", which already simulates this
behavior.
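
As a rough illustration of that idea (a sketch only, not something in
this series; `struct cand_commit` and `sort_candidates()` are made-up
names), the sort itself would be small:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a commit; only the generation matters here. */
struct cand_commit {
	uint32_t generation;
};

/* qsort() comparator: higher generation numbers sort first. */
static int cmp_generation_desc(const void *a, const void *b)
{
	const struct cand_commit *ca = *(const struct cand_commit *const *)a;
	const struct cand_commit *cb = *(const struct cand_commit *const *)b;

	if (ca->generation < cb->generation)
		return 1;
	if (ca->generation > cb->generation)
		return -1;
	return 0;
}

/*
 * Reorder the candidates so that the first walk starts at the commit
 * with the highest generation, which has the best chance of reaching
 * (and so marking as redundant) the remaining candidates.
 */
static void sort_candidates(struct cand_commit **array, size_t cnt)
{
	assert(array || cnt == 0);
	qsort(array, cnt, sizeof(*array), cmp_generation_desc);
}
```

Whether this helps in practice would need measuring, since the existing
discovery order may already be close to this.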

Thanks,
-Stolee

* Re: [PATCH v4 03/10] commit-graph: compute generation numbers
  2018-05-01 12:10           ` Derrick Stolee
@ 2018-05-02 16:15             ` Jakub Narebski
  0 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-05-02 16:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 4/29/2018 5:08 AM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:

[...]
>> It is a bit strange to me that the code uses get_be32 for reading, but
>> htonl for writing.  Is Git tested on non little-endian machines, like
>> big-endian ppc64 or s390x, or on mixed-endian machines (or
>> selectable-endian machines with data endianness set to non
>> little-endian, like ia64)?  If not, could we use for example openSUSE
>> Build Service (https://build.opensuse.org/) for this?
>
> Since we are packing two values into 64 bits, I am using htonl() here
> to arrange the 30-bit generation number alongside the 34-bit commit
> date value, then writing with hashwrite(). The other 32-bit integers
> are written with hashwrite_be32() to avoid translating this data
> in-memory.

O.K., so you are using what is more effective and easier to use.
Nice to know, thanks for the information.
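
For readers following along, the layout described above can be sketched
as below.  This is a stand-alone approximation, not the commit-graph.c
code: the function names are hypothetical, the buffer stands in for what
hashwrite() would receive, and htonl()/ntohl() come from POSIX
<arpa/inet.h>.  The first 32-bit word holds the generation number in its
upper 30 bits and the top two bits of the 34-bit date in its lower two
bits; the second word holds the low 32 bits of the date.

```c
#include <arpa/inet.h> /* htonl(), ntohl(); POSIX */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/* Pack a 30-bit generation and a 34-bit date into 8 big-endian bytes. */
static void pack_generation_and_date(unsigned char out[8],
				     uint32_t generation, uint64_t date)
{
	uint32_t packed[2];

	assert(generation <= GENERATION_NUMBER_MAX);
	assert((date >> 34) == 0);

	packed[0] = htonl((uint32_t)((date >> 32) & 0x3));
	packed[0] |= htonl(generation << 2);
	packed[1] = htonl((uint32_t)date);
	memcpy(out, packed, 8);
}

/* Inverse of the above: recover both values from the 8 bytes. */
static void unpack_generation_and_date(const unsigned char in[8],
				       uint32_t *generation, uint64_t *date)
{
	uint32_t words[2], hi, lo;

	memcpy(words, in, 8);
	hi = ntohl(words[0]);
	lo = ntohl(words[1]);
	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

Because both halves go through htonl() before being OR-ed together, the
bytes in the buffer come out the same on little-endian and big-endian
hosts.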

[...]
>>>   +static void compute_generation_numbers(struct commit** commits,
>>> +				       int nr_commits)
>>> +{
[...]
>>> +	for (i = 0; i < nr_commits; i++) {
>>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>>> +			continue;

[...]

>>>   +	compute_generation_numbers(commits.list, commits.nr);
>>> +
>> Nice and simple.  All right.
>>
>> I guess that we do not pass "struct packed_commit_list commits" as
>> argument to compute_generation_numbers instead of "struct commit**
>> commits.list" and "int commits.nr" to compute_generation_numbers() to
>> keep the latter nice and generic?
>
> Good catch. There is no reason to not use packed_commit_list here.

Actually, now that v5 shows what using packed_commit_list looks like, in
my opinion it looks uglier.  And it might be easier to make a mistake.

Also, depending on how well the compiler is able to optimize it, the
version passing packed_commit_list as an argument has one more
indirection (following two pointers) in the loop.

Best,
-- 
Jakub Narębski

* Re: [PATCH v5 00/11] Compute and consume generation numbers
  2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (10 preceding siblings ...)
  2018-05-01 12:47         ` [PATCH v5 11/11] commit-graph.txt: update design document Derrick Stolee
@ 2018-05-03 11:18         ` Jakub Narebski
  11 siblings, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-05-03 11:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Derrick Stolee <dstolee@microsoft.com> writes:

> Most of the changes from v4 are cosmetic, but there is one new commit:
>
> 	commit: use generation number in remove_redundant()
>
> Other changes are non-functional, but do clarify things.

I wonder if our perf framework in t/perf could help here to show
performance gains for the whole series.  Though it may not include
the operations that are helped most by this one.

For the commit-graph feature it would be nice, if feasible, to see
changes in performance relative to the version before the series, both
with the feature enabled (to see the gains) and with the feature
disabled (to check that there are no performance regressions).

>
> Inter-diff from v4:

O.K., now on to commenting on the changes in the inter-diff.

> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index d9f2713efa..e1a883eb46 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -125,9 +125,10 @@ Future Work
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operation is an important candidate:
> +  operations are important candidates:
>
>      - 'log --topo-order'
> +    - 'tag --merged'
>
>  - Currently, parse_commit_gently() requires filling in the root tree
>    object for a commit. This passes through lookup_tree() and consequently

O.K., this is about discussion in "branch --contains / tag --merged
inconsistency" thread:

  https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u

> diff --git a/commit-graph.c b/commit-graph.c
> index aebd242def..a8c337dd77 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -248,6 +248,7 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>  static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
>  {
>         const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +       item->graph_pos = pos;
>         item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }
>

Minor bugfix.

> @@ -454,8 +455,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>                 else
>                         packedDate[0] = 0;
>
> -               if ((*list)->generation != GENERATION_NUMBER_INFINITY)
> -                       packedDate[0] |= htonl((*list)->generation << 2);
> +               packedDate[0] |= htonl((*list)->generation << 2);
>
>                 packedDate[1] = htonl((*list)->date);
>                 hashwrite(f, packedDate, 8);

Minor bugfix.

> @@ -589,18 +589,17 @@ static void close_reachable(struct packed_oid_list *oids)
>         }
>  }
>
> -static void compute_generation_numbers(struct commit** commits,
> -                                      int nr_commits)
> +static void compute_generation_numbers(struct packed_commit_list* commits)
>  {
>         int i;
>         struct commit_list *list = NULL;
>
> -       for (i = 0; i < nr_commits; i++) {
> -               if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> -                   commits[i]->generation != GENERATION_NUMBER_ZERO)
> +       for (i = 0; i < commits->nr; i++) {
> +               if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY &&
> +                   commits->list[i]->generation != GENERATION_NUMBER_ZERO)
>                         continue;
>
> -               commit_list_insert(commits[i], &list);
> +               commit_list_insert(commits->list[i], &list);
>                 while (list) {
>                         struct commit *current = list->item;
>                         struct commit_list *parent;

Refactoring: signature change from pair of struct commit** + int to
struct packed_commit_list*.

I think that it makes the code a bit uglier for no gain, but that is
just my personal opinion; it is a matter of taste.

> @@ -621,10 +620,10 @@ static void compute_generation_numbers(struct commit** commits,
>                         if (all_parents_computed) {
>                                 current->generation = max_generation + 1;
>                                 pop_commit(&list);
> -                       }
>
> -                       if (current->generation > GENERATION_NUMBER_MAX)
> -                               current->generation = GENERATION_NUMBER_MAX;
> +                               if (current->generation > GENERATION_NUMBER_MAX)
> +                                       current->generation = GENERATION_NUMBER_MAX;
> +                       }
>                 }
>         }
>  }

Bugfix (though it didn't result in wrong information being written out,
just in an inconsistent state in the middle of the computation).
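
To show where the clamp now sits, here is a self-contained sketch of the
computation on a toy DAG.  It is a simplification under stated
assumptions, not the commit-graph.c code: `struct gen_commit` is a
hypothetical type, an explicit array replaces the commit_list stack,
only GENERATION_NUMBER_ZERO is treated as "not computed yet", and the
input array is assumed to be closed under parents.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_ZERO 0
#define GENERATION_NUMBER_MAX 0x3FFFFFFF

struct gen_commit {
	uint32_t generation; /* GENERATION_NUMBER_ZERO: not computed yet */
	int nr_parents;
	struct gen_commit **parents;
};

/*
 * Iterative depth-first walk: a commit gets its number only once all
 * of its parents have theirs.  generation = 1 + max(parent generations),
 * so a root commit gets 1.
 */
static void compute_generations(struct gen_commit **commits, int nr)
{
	struct gen_commit **stack = malloc(sizeof(*stack) * (nr > 0 ? nr : 1));
	int i;

	assert(stack);
	for (i = 0; i < nr; i++) {
		int depth = 0;

		if (commits[i]->generation != GENERATION_NUMBER_ZERO)
			continue;

		stack[depth++] = commits[i];
		while (depth) {
			struct gen_commit *current = stack[depth - 1];
			uint32_t max_generation = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < current->nr_parents; p++) {
				struct gen_commit *parent = current->parents[p];

				if (parent->generation == GENERATION_NUMBER_ZERO) {
					/* Come back to 'current' after the parent. */
					all_parents_computed = 0;
					assert(depth < nr);
					stack[depth++] = parent;
					break;
				}
				if (parent->generation > max_generation)
					max_generation = parent->generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				depth--;

				/* Clamp only once the value is final. */
				if (current->generation > GENERATION_NUMBER_MAX)
					current->generation = GENERATION_NUMBER_MAX;
			}
		}
	}
	free(stack);
}
```

With the clamp inside the "all parents computed" branch, a commit that
is still waiting on a parent is never touched, so no half-computed value
is compared against GENERATION_NUMBER_MAX.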

> @@ -752,7 +751,7 @@ void write_commit_graph(const char *obj_dir,
>         if (commits.nr >= GRAPH_PARENT_MISSING)
>                 die(_("too many commits to write graph"));
>
> -       compute_generation_numbers(commits.list, commits.nr);
> +       compute_generation_numbers(&commits);
>
>         graph_name = get_commit_graph_filename(obj_dir);
>         fd = hold_lock_file_for_update(&lk, graph_name, 0);

The other side of signature change.

> diff --git a/commit.c b/commit.c
> index e2e16ea1a7..5064db4e61 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -835,7 +835,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>                 int flags;
>
>                 if (commit->generation > last_gen)
> -                       BUG("bad generation skip");
> +                       BUG("bad generation skip %8x > %8x at %s",
> +                           commit->generation, last_gen,
> +                           oid_to_hex(&commit->object.oid));
>                 last_gen = commit->generation;
>
>                 if (commit->generation < min_generation)

More detailed BUG() message, always nice to have.

> @@ -947,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
>                 parse_commit(array[i]);
>         for (i = 0; i < cnt; i++) {
>                 struct commit_list *common;
> +               uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>
>                 if (redundant[i])
>                         continue;
> @@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
>                                 continue;
>                         filled_index[filled] = j;
>                         work[filled++] = array[j];
> +
> +                       if (array[j]->generation < min_generation)
> +                               min_generation = array[j]->generation;
>                 }
> -               common = paint_down_to_common(array[i], filled, work, 0);
> +               common = paint_down_to_common(array[i], filled, work,
> +                                             min_generation);
>                 if (array[i]->object.flags & PARENT2)
>                         redundant[i] = 1;
>                 for (j = 0; j < filled; j++)

New commit in the series.  The change is quite short and gives
measurable performance gains (in the appropriate case).

> @@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>         for (i = 0; i < nr_reference; i++) {
>                 if (parse_commit(reference[i]))
>                         return ret;
> -               if (min_generation > reference[i]->generation)
> +               if (reference[i]->generation < min_generation)
>                         min_generation = reference[i]->generation;
>         }
>
>

Style change.

> -- >8 --
>
> Derrick Stolee (11):
>   ref-filter: fix outdated comment on in_commit_list
>   commit: add generation number to struct commmit
>   commit-graph: compute generation numbers
>   commit: use generations in paint_down_to_common()
>   commit-graph: always load commit-graph information
>   ref-filter: use generation number for --contains
>   commit: use generation numbers for in_merge_bases()
>   commit: add short-circuit to paint_down_to_common()
>   commit: use generation number in remove_redundant()
>   merge: check config before loading commits
>   commit-graph.txt: update design document

It looks like the series is maturing nicely.

Best,
-- 
Jakub Narębski

* Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()
  2018-05-01 12:47         ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee
  2018-05-01 15:37           ` Derrick Stolee
@ 2018-05-03 18:45           ` Jakub Narebski
  1 sibling, 0 replies; 162+ messages in thread
From: Jakub Narebski @ 2018-05-03 18:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Derrick Stolee, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> The static remove_redundant() method is used to filter a list
> of commits by removing those that are reachable from another
> commit in the list. This is used to remove all possible merge-
> bases except a maximal, mutually independent set.
>
> To determine these commits are independent, we use a number of
> paint_down_to_common() walks and use the PARENT1, PARENT2 flags
> to determine reachability. Since we only care about reachability
> and not the full set of merge-bases between 'one' and 'twos', we
> can use the 'min_generation' parameter to short-circuit the walk.
>
> When no commit-graph exists, there is no change in behavior.
>
> For a copy of the Linux repository, we measured the following
> performance improvements:
>
> git merge-base v3.3 v4.5
>
> Before: 234 ms
>  After: 208 ms
>  Rel %: -11%
>
> git merge-base v4.3 v4.5
>
> Before: 102 ms
>  After:  83 ms
>  Rel %: -19%
>
> The experiments above were chosen to demonstrate that we are
> improving the filtering of the merge-base set. In the first
> example, more time is spent walking the history to find the
> set of merge bases before the remove_redundant() call. The
> starting commits are closer together in the second example,
> therefore more time is spent in remove_redundant(). The relative
> change in performance differs as expected.
>
> Reported-by: Jakub Narebski <jnareb@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Good description.

> ---
>  commit.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>

Let me extend context a bit to make it easier to review.

> diff --git a/commit.c b/commit.c
> index 9875feec01..5064db4e61 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  		parse_commit(array[i]);
>  	for (i = 0; i < cnt; i++) {
>  		struct commit_list *common;
> +		uint32_t min_generation = GENERATION_NUMBER_INFINITY;

As you have noticed, and as is already fixed in 'pu', it should be

  +		uint32_t min_generation = array[i]->generation;

>  
>  		if (redundant[i])
>  			continue;
> @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
>  				continue;
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
> +
> +			if (array[j]->generation < min_generation)
> +				min_generation = array[j]->generation;

remove_redundant() checks if the i-th commit is reachable from commits
i+1..cnt, and vice versa, by checking the PARENT2 and PARENT1 flags,
respectively.

As you have noticed, this means that the min_generation cutoff should be
the minimum of array[i]->generation and all of array[j]->generation for
j=i+1..cnt.  There is no reason to go further down if we are interested
only in reachability, and not actually in merge bases.
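
Spelled out as a stand-alone helper, the cutoff I have in mind looks
like the sketch below.  The names are hypothetical (`struct rr_commit`,
`reachability_cutoff()`); 'one' and 'twos' play the same roles as the
first three arguments of paint_down_to_common().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a commit; only the generation matters here. */
struct rr_commit {
	uint32_t generation;
};

/*
 * Lowest generation at which any commit of interest lives.  A walk that
 * only needs to answer "is X reachable from Y?" for these commits can
 * stop once it drops below this value, because an ancestor always has
 * a generation number no higher than its descendant's.
 */
static uint32_t reachability_cutoff(const struct rr_commit *one,
				    struct rr_commit *const *twos, int n)
{
	uint32_t min_generation;
	int j;

	assert(one);
	min_generation = one->generation; /* start from 'one', not INFINITY */
	for (j = 0; j < n; j++) {
		if (twos[j]->generation < min_generation)
			min_generation = twos[j]->generation;
	}
	return min_generation;
}
```

Starting from GENERATION_NUMBER_INFINITY instead would ignore the
generation of 'one' itself, so the walk could stop before reaching it
from one of the 'twos'.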

>  		}
> -		common = paint_down_to_common(array[i], filled, work, 0);
> +		common = paint_down_to_common(array[i], filled, work,
> +					      min_generation);
>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)
   			if (work[j]->object.flags & PARENT1)
   				redundant[filled_index[j]] = 1;

Besides this issue, nice and simple speedup.  Good.

Best,
-- 
Jakub Narębski

Thread overview: 162+ messages
2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
2018-04-03 18:21   ` Jonathan Tan
2018-04-03 18:28     ` Jeff King
2018-04-03 18:32       ` Derrick Stolee
2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
2018-04-03 18:05   ` Brandon Williams
2018-04-03 18:28     ` Jeff King
2018-04-03 18:31       ` Derrick Stolee
2018-04-03 18:32       ` Brandon Williams
2018-04-03 18:44       ` Stefan Beller
2018-04-03 23:17       ` Ramsay Jones
2018-04-03 23:19         ` Jeff King
2018-04-03 18:24   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
2018-04-03 18:30   ` Jonathan Tan
2018-04-03 18:49     ` Stefan Beller
2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-03 18:31   ` Stefan Beller
2018-04-03 18:31   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
2018-04-03 19:01   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
2018-04-03 19:04   ` Jonathan Tan
2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
2018-04-03 18:03 ` Brandon Williams
2018-04-03 18:29   ` Derrick Stolee
2018-04-03 18:47     ` Jeff King
2018-04-03 19:05       ` Jeff King
2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-04 15:48             ` Derrick Stolee
2018-04-04 17:01               ` Brandon Williams
2018-04-04 18:24               ` Jeff King
2018-04-04 18:53                 ` Derrick Stolee
2018-04-04 18:59                   ` Jeff King
2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
2018-04-04 19:06             ` Derrick Stolee
2018-04-04 19:16               ` Jeff King
2018-04-04 19:22                 ` Derrick Stolee
2018-04-04 19:42                   ` Jeff King
2018-04-04 19:45                     ` Derrick Stolee
2018-04-04 19:46                       ` Jeff King
2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
2018-04-07 16:55 ` Jakub Narebski
2018-04-08  1:06   ` Derrick Stolee
2018-04-11 19:32     ` Jakub Narebski
2018-04-11 19:58       ` Derrick Stolee
2018-04-14 16:52         ` Jakub Narebski
2018-04-21 20:44           ` Jakub Narebski
2018-04-23 13:54             ` Derrick Stolee
2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
2018-04-11  2:12     ` Junio C Hamano
2018-04-11 12:49       ` Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
2018-04-09 17:59     ` Stefan Beller
2018-04-11  2:31     ` Junio C Hamano
2018-04-11 12:57       ` Derrick Stolee
2018-04-11 23:28         ` Junio C Hamano
2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
2018-04-11  2:51     ` Junio C Hamano
2018-04-11 13:02       ` Derrick Stolee
2018-04-11 18:49         ` Stefan Beller
2018-04-11 19:26         ` Eric Sunshine
2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
2018-04-11  3:02     ` Junio C Hamano
2018-04-11 13:24       ` Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
2018-04-12  9:12     ` Junio C Hamano
2018-04-12 11:35       ` Derrick Stolee
2018-04-13  9:53         ` Jakub Narebski
2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-18 14:31       ` Jakub Narebski
2018-04-18 14:46         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
2018-04-18 19:47       ` Jakub Narebski
2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
2018-04-18 21:02       ` Jakub Narebski
2018-04-23 14:22         ` Derrick Stolee
2018-04-24 18:56           ` Jakub Narebski
2018-04-25 14:11             ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-18 22:15       ` Jakub Narebski
2018-04-23 14:31         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-04-18 23:19       ` Jakub Narebski
2018-04-23 14:40         ` Derrick Stolee
2018-04-23 21:38           ` Jakub Narebski
2018-04-24 12:31             ` Derrick Stolee
2018-04-19  8:32       ` Jakub Narebski
2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
2018-04-17 17:50       ` Derrick Stolee
2018-04-19  0:02       ` Jakub Narebski
2018-04-23 14:49         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
2018-04-23 14:54       ` Derrick Stolee
2018-04-25 14:37     ` [PATCH v4 00/10] " Derrick Stolee
2018-04-25 14:37       ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
2018-04-28 17:54         ` Jakub Narebski
2018-04-25 14:37       ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee
2018-04-28 22:35         ` Jakub Narebski
2018-04-30 12:05           ` Derrick Stolee
2018-04-25 14:37       ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee
2018-04-26  2:35         ` Junio C Hamano
2018-04-26 12:58           ` Derrick Stolee
2018-04-26 13:49             ` Derrick Stolee
2018-04-29  9:08         ` Jakub Narebski
2018-05-01 12:10           ` Derrick Stolee
2018-05-02 16:15             ` Jakub Narebski
2018-04-25 14:37       ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-26  3:22         ` Junio C Hamano
2018-04-26  9:02           ` Jakub Narebski
2018-04-28 14:38             ` Jakub Narebski
2018-04-29 15:40         ` Jakub Narebski
2018-04-25 14:37       ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee
2018-04-29 22:14         ` Jakub Narebski
2018-05-01 12:19           ` Derrick Stolee
2018-04-29 22:18         ` Jakub Narebski
2018-04-25 14:37       ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee
2018-04-30 16:34         ` Jakub Narebski
2018-04-25 14:37       ` [PATCH v4 07/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-30 17:05         ` Jakub Narebski
2018-04-25 14:38       ` [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-04-30 22:19         ` Jakub Narebski
2018-05-01 11:47           ` Derrick Stolee
2018-05-02 13:05             ` Jakub Narebski
2018-05-02 13:42               ` Derrick Stolee
2018-04-25 14:38       ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee
2018-04-30 22:54         ` Jakub Narebski
2018-05-01 11:52           ` Derrick Stolee
2018-05-02 11:41             ` Jakub Narebski
2018-04-25 14:38       ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee
2018-04-30 23:32         ` Jakub Narebski
2018-05-01 12:00           ` Derrick Stolee
2018-05-02  7:57             ` Jakub Narebski
2018-04-25 14:40       ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee
2018-04-28 17:28         ` Jakub Narebski
2018-05-01 12:47       ` [PATCH v5 00/11] " Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 02/11] commit: add generation number to struct commmit Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 03/11] commit-graph: compute generation numbers Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 04/11] commit: use generations in paint_down_to_common() Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 05/11] commit-graph: always load commit-graph information Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 06/11] ref-filter: use generation number for --contains Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee
2018-05-01 15:37           ` Derrick Stolee
2018-05-03 18:45           ` Jakub Narebski
2018-05-01 12:47         ` [PATCH v5 10/11] merge: check config before loading commits Derrick Stolee
2018-05-01 12:47         ` [PATCH v5 11/11] commit-graph.txt: update design document Derrick Stolee
2018-05-03 11:18         ` [PATCH v5 00/11] Compute and consume generation numbers Jakub Narebski

