git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [PATCH 0/6] Compute and consume generation numbers
@ 2018-04-03 16:51 Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
                   ` (9 more replies)
  0 siblings, 10 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
  object.c: parse commit in graph first
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: sort by generation number in paint_down_to_common()
  commit.c: use generation number to stop merge-base walks
  commit-graph.txt: update design doc with generation numbers

 Documentation/technical/commit-graph.txt |  7 +---
 alloc.c                                  |  1 +
 commit-graph.c                           | 48 +++++++++++++++++++++
 commit.c                                 | 53 ++++++++++++++++++++----
 commit.h                                 |  7 +++-
 object.c                                 |  4 +-
 6 files changed, 104 insertions(+), 16 deletions(-)

-- 
2.17.0.20.g9f30ba16e1


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:21   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (!parse_commit_in_graph(commit) &&
+			    parse_commit_buffer(commit, buffer, size))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:05   ` Brandon Williams
  2018-04-03 18:24   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_UNDEF 0xFFFFFFFF
GENERATION_NUMBER_NONE 0

The first (_UNDEF) means the generation number has not been loaded or
computed. The second (_NONE) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..1a62e85ac3 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_UNDEF;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..3cadd386f3 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,8 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
 	struct commit *item;
@@ -24,6 +26,7 @@ struct commit {
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:30   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 commit.h       |  1 +
 2 files changed, 47 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..b80c8ad80e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_UNDEF &&
+		    commits[i]->generation != GENERATION_NUMBER_NONE)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_UNDEF ||
+				    parent->item->generation == GENERATION_NUMBER_NONE) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
diff --git a/commit.h b/commit.h
index 3cadd386f3..bc7a3186c5 100644
--- a/commit.h
+++ b/commit.h
@@ -11,6 +11,7 @@
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 18:31   ` Stefan Beller
  2018-04-03 18:31   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_UNDEF.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 19 ++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* newer commits with larger date first */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index bc7a3186c5..cb97b7636a 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 5/6] commit.c: use generation to halt paint walk
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 19:01   ` Jonathan Tan
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..858f4fdbc9 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-	int i;
-	for (i = 0; i < queue->nr; i++) {
-		struct commit *commit = queue->array[i].data;
-		if (!(commit->object.flags & STALE))
-			return 1;
+	if (min_gen != GENERATION_NUMBER_UNDEF) {
+		if (queue->nr > 0) {
+			struct commit *commit = queue->array[0].data;
+			return commit->generation >= min_gen;
+		}
+	} else {
+		int i;
+		for (i = 0; i < queue->nr; i++) {
+			struct commit *commit = queue->array[i].data;
+			if (!(commit->object.flags & STALE))
+				return 1;
+		}
 	}
+
 	return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_UNDEF;
+	uint32_t min_nonstale_gen = GENERATION_NUMBER_UNDEF;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		return result;
 	}
 	prio_queue_put(&queue, one);
+	if (one->generation < min_nonstale_gen)
+		min_nonstale_gen = one->generation;
 
 	for (i = 0; i < n; i++) {
 		twos[i]->object.flags |= PARENT2;
 		prio_queue_put(&queue, twos[i]);
+		if (twos[i]->generation < min_nonstale_gen)
+			min_nonstale_gen = twos[i]->generation;
 	}
 
-	while (queue_has_nonstale(&queue)) {
+	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+
+		last_gen = commit->generation;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 				return NULL;
 			p->object.flags |= flags;
 			prio_queue_put(&queue, p);
+
+			if (!(flags & STALE) &&
+			    p->generation < min_nonstale_gen)
+				min_nonstale_gen = p->generation;
 		}
 	}
 
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 6/6] commit-graph.txt: update future work
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-03 16:51 ` Derrick Stolee
  2018-04-03 19:04   ` Jonathan Tan
  2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw)
  To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..be68bee43d 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -98,17 +98,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-03 16:56 ` Derrick Stolee
  2018-04-03 18:03 ` Brandon Williams
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:56 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: avarab, sbeller, larsxschneider, peff

On 4/3/2018 12:51 PM, Derrick Stolee wrote:
> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a constant-time check in
> queue_has_nonstale() instead of the previous linear-time check.
>
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.
>
> This patch series depends on v7 of ds/commit-graph.
>
> Derrick Stolee (6):
>    object.c: parse commit in graph first
>    commit: add generation number to struct commmit
>    commit-graph: compute generation numbers
>    commit: sort by generation number in paint_down_to_common()
>    commit.c: use generation number to stop merge-base walks
>    commit-graph.txt: update design doc with generation numbers

This patch is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/5

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
@ 2018-04-03 18:03 ` Brandon Williams
  2018-04-03 18:29   ` Derrick Stolee
  2018-04-07 16:55 ` Jakub Narebski
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  9 siblings, 1 reply; 103+ messages in thread
From: Brandon Williams @ 2018-04-03 18:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On 04/03, Derrick Stolee wrote:
> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
> 
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).

Thanks for ensuring that this is defined and documented somewhere :)

> 
> This series makes the computation of generation numbers part of the
> commit-graph write process.
> 
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a constant-time check in
> queue_has_nonstale() instead of the previous linear-time check.
> 
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
> 
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

log --graph should benefit a lot more from this correct?  I know we've
talked a bit about negotiation and I wonder if these generation numbers
should be able to help out a little bit with that some day.

> 
> This patch series depends on v7 of ds/commit-graph.
> 
> Derrick Stolee (6):
>   object.c: parse commit in graph first
>   commit: add generation number to struct commmit
>   commit-graph: compute generation numbers
>   commit: sort by generation number in paint_down_to_common()
>   commit.c: use generation number to stop merge-base walks
>   commit-graph.txt: update design doc with generation numbers
> 
>  Documentation/technical/commit-graph.txt |  7 +---
>  alloc.c                                  |  1 +
>  commit-graph.c                           | 48 +++++++++++++++++++++
>  commit.c                                 | 53 ++++++++++++++++++++----
>  commit.h                                 |  7 +++-
>  object.c                                 |  4 +-
>  6 files changed, 104 insertions(+), 16 deletions(-)
> 
> -- 
> 2.17.0.20.g9f30ba16e1
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-03 18:05   ` Brandon Williams
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:24   ` Jonathan Tan
  1 sibling, 1 reply; 103+ messages in thread
From: Brandon Williams @ 2018-04-03 18:05 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On 04/03, Derrick Stolee wrote:
> The generation number of a commit is defined recursively as follows:
> 
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
> 
> Add a uint32_t generation field to struct commit so we can pass this

Is there any reason to believe this would be too small of a value in the
future?  Or is a 32 bit unsigned good enough?

> information to revision walks. We use two special values to signal
> the generation number is invalid:
> 
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
> 
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  alloc.c        | 1 +
>  commit-graph.c | 2 ++
>  commit.h       | 3 +++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/alloc.c b/alloc.c
> index cf4f8b61e1..1a62e85ac3 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -94,6 +94,7 @@ void *alloc_commit_node(void)
>  	c->object.type = OBJ_COMMIT;
>  	c->index = alloc_commit_index();
>  	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
> +	c->generation = GENERATION_NUMBER_UNDEF;
>  	return c;
>  }
>  
> diff --git a/commit-graph.c b/commit-graph.c
> index 1fc63d541b..d24b947525 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
>  	pptr = &item->parents;
>  
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/commit.h b/commit.h
> index e57ae4b583..3cadd386f3 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -10,6 +10,8 @@
>  #include "pretty.h"
>  
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0
>  
>  struct commit_list {
>  	struct commit *item;
> @@ -24,6 +26,7 @@ struct commit {
>  	struct commit_list *parents;
>  	struct tree *tree;
>  	uint32_t graph_pos;
> +	uint32_t generation;
>  };
>  
>  extern int save_commit_buffer;
> -- 
> 2.17.0.rc0
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-03 18:21   ` Jonathan Tan
  2018-04-03 18:28     ` Jeff King
  0 siblings, 1 reply; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:38 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
> 
> Before adding generation numbers to the commit-graph, we need to ensure
> that any commit that exists in the graph is loaded from the graph, so
> check parse_commit_in_graph() before calling parse_commit_buffer().
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Modifying parse_object_buffer() is the most pragmatic way to accomplish
this, but this also means that parse_object_buffer() now potentially
reads from the local object store (instead of only relying on what's in
memory and what's in the provided buffer). parse_object_buffer() is
called by several callers including in builtin/fsck.c. I would feel more
comfortable if the relevant [1] caller to parse_object_buffer() was
modified instead of parse_object_buffer(), but I'll let others give
their opinions too.

[1] The caller which, if modified, will result in the speedup to
the merge-base calculations in the Windows repository you describe in
your cover letter.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
  2018-04-03 18:05   ` Brandon Williams
@ 2018-04-03 18:24   ` Jonathan Tan
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:39 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> The generation number of a commit is defined recursively as follows:
> 
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
> 
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
> 
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
> 
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

This looks straightforward and correct, thanks. I think some of the
description above should appear as code comments.

> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0

I would include the description above here as documentation, and would
replace "was stored before generation numbers were computed" by "was
written by a version of Git that did not support generation numbers".

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:05   ` Brandon Williams
@ 2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
                         ` (3 more replies)
  0 siblings, 4 replies; 103+ messages in thread
From: Jeff King @ 2018-04-03 18:28 UTC (permalink / raw)
  To: Brandon Williams; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:

> On 04/03, Derrick Stolee wrote:
> > The generation number of a commit is defined recursively as follows:
> > 
> > * If a commit A has no parents, then the generation number of A is one.
> > * If a commit A has parents, then the generation number of A is one
> >   more than the maximum generation number among the parents of A.
> > 
> > Add a uint32_t generation field to struct commit so we can pass this
> 
> Is there any reason to believe this would be too small of a value in the
> future?  Or is a 32 bit unsigned good enough?

The linux kernel took ~10 years to produce 500k commits. Even assuming
those were all linear (and they're not), that gives us ~80,000 years of
leeway. So even if the pace of development speeds up or we have a
quicker project, it still seems we have a pretty reasonable safety
margin.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 18:21   ` Jonathan Tan
@ 2018-04-03 18:28     ` Jeff King
  2018-04-03 18:32       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-03 18:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:

> On Tue,  3 Apr 2018 12:51:38 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
> 
> > Most code paths load commits using lookup_commit() and then
> > parse_commit(). In some cases, including some branch lookups, the commit
> > is parsed using parse_object_buffer() which side-steps parse_commit() in
> > favor of parse_commit_buffer().
> > 
> > Before adding generation numbers to the commit-graph, we need to ensure
> > that any commit that exists in the graph is loaded from the graph, so
> > check parse_commit_in_graph() before calling parse_commit_buffer().
> > 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> 
> Modifying parse_object_buffer() is the most pragmatic way to accomplish
> this, but this also means that parse_object_buffer() now potentially
> reads from the local object store (instead of only relying on what's in
> memory and what's in the provided buffer). parse_object_buffer() is
> called by several callers including in builtin/fsck.c. I would feel more
> comfortable if the relevant [1] caller to parse_object_buffer() was
> modified instead of parse_object_buffer(), but I'll let others give
> their opinions too.

It's not just you. This seems like a really odd place to put it.
Especially because if we have the buffer to pass to this function, then
we'd already have incurred the cost to inflate the object.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:03 ` Brandon Williams
@ 2018-04-03 18:29   ` Derrick Stolee
  2018-04-03 18:47     ` Jeff King
  2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
  0 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:29 UTC (permalink / raw)
  To: Brandon Williams, Derrick Stolee
  Cc: git, avarab, sbeller, larsxschneider, peff

On 4/3/2018 2:03 PM, Brandon Williams wrote:
> On 04/03, Derrick Stolee wrote:
>> This is the first of several "small" patches that follow the serialized
>> Git commit graph patch (ds/commit-graph).
>>
>> As described in Documentation/technical/commit-graph.txt, the generation
>> number of a commit is one more than the maximum generation number among
>> its parents (trivially, a commit with no parents has generation number
>> one).
> Thanks for ensuring that this is defined and documented somewhere :)
>
>> This series makes the computation of generation numbers part of the
>> commit-graph write process.
>>
>> Finally, generation numbers are used to order commits in the priority
>> queue in paint_down_to_common(). This allows a constant-time check in
>> queue_has_nonstale() instead of the previous linear-time check.
>>
>> This does not have a significant performance benefit in repositories
>> of normal size, but in the Windows repository, some merge-base
>> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
>> an actual consumer of generation numbers as a starting point.
>>
>> A more substantial refactoring of revision.c is required before making
>> 'git log --graph' use generation numbers effectively.
> log --graph should benefit a lot more from this correct?  I know we've
> talked a bit about negotiation and I wonder if these generation numbers
> should be able to help out a little bit with that some day.

'log --graph' should be a HUGE speedup, when it is refactored. Since the 
topo-order can "stream" commits to the pager, it can be very responsive 
to return the graph in almost all conditions. (The case where generation 
numbers are not enough is when filters reduce the set of displayed 
commits to be very sparse, so many commits are walked anyway.)

If we have generic "can X reach Y?" queries, then we can also use 
generation numbers there to great effect (by not walking commits Z with 
gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" 
thread for ideas.

For negotiation, there are some things we can do here. VSTS uses 
generation numbers as a heuristic for determining "all wants connected 
to haves" which is a condition for halting negotiation. The idea is very 
simple, and I'd be happy to discuss it on a separate thread.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-03 18:30   ` Jonathan Tan
  2018-04-03 18:49     ` Stefan Beller
  0 siblings, 1 reply; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:40 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> +		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
> +				die("generation number %u is too large to store in commit-graph",
> +				    (*list)->generation);
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +		}

The die() should have "BUG:" if you agree with my comment below.

> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)

Style: space before **, not after.

> +			if (all_parents_computed) {
> +				current->generation = max_generation + 1;
> +				pop_commit(&list);
> +			}

I think the current->generation should be clamped to _MAX here. If we do, then
the die() I mentioned in my first comment will have "BUG:", since we are never
meant to write any number larger than _MAX in ->generation.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
@ 2018-04-03 18:31       ` Derrick Stolee
  2018-04-03 18:32       ` Brandon Williams
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Jeff King, Brandon Williams
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>    more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

That, and larger projects do not have linear histories. Despite having 
almost 2 million reachable commits, the Windows repository has maximum 
generation number ~100,000.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-03 18:31   ` Stefan Beller
  2018-04-03 18:31   ` Jonathan Tan
  1 sibling, 0 replies; 103+ messages in thread
From: Stefan Beller @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Lars Schneider, Jeff King

On Tue, Apr 3, 2018 at 9:51 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_UNDEF.
>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.


This patch (or later in this series) may want to touch
Documentation/technical/commit-graph.txt, that mentions this in
the section of Future Work:

- After computing and storing generation numbers, we must make graph
  walks aware of generation numbers to gain the performance benefits they
  enable. This will mostly be accomplished by swapping a commit-date-ordered
  priority queue with one ordered by generation number. The following
  operations are important candidates:

    - paint_down_to_common()
    - 'log --topo-order'

The paint down to common is only internal, not exposed to the user
for ordering, i.e. the topological ordering is still ordering commits in
a branch adjacent?

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 4/6] commit: use generations in paint_down_to_common()
  2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
  2018-04-03 18:31   ` Stefan Beller
@ 2018-04-03 18:31   ` Jonathan Tan
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 18:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:41 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;
> +
> +	/* newer commits with larger date first */
> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;
> +}

I think it would be clearer if you commented above the first block
"newer commits first", then on the second block, "use date as a
heuristic to determine newer commit".

Other than that, this looks good.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
@ 2018-04-03 18:32       ` Brandon Williams
  2018-04-03 18:44       ` Stefan Beller
  2018-04-03 23:17       ` Ramsay Jones
  3 siblings, 0 replies; 103+ messages in thread
From: Brandon Williams @ 2018-04-03 18:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 04/03, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
> 
> > On 04/03, Derrick Stolee wrote:
> > > The generation number of a commit is defined recursively as follows:
> > > 
> > > * If a commit A has no parents, then the generation number of A is one.
> > > * If a commit A has parents, then the generation number of A is one
> > >   more than the maximum generation number among the parents of A.
> > > 
> > > Add a uint32_t generation field to struct commit so we can pass this
> > 
> > Is there any reason to believe this would be too small of a value in the
> > future?  Or is a 32 bit unsigned good enough?
> 
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.
> 
> -Peff

I figured as much, but just wanted to check since the windows folks
seems to produce commits pretty quickly.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/6] object.c: parse commit in graph first
  2018-04-03 18:28     ` Jeff King
@ 2018-04-03 18:32       ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-03 18:32 UTC (permalink / raw)
  To: Jeff King, Jonathan Tan
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider



On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:
>
>> On Tue,  3 Apr 2018 12:51:38 -0400
>> Derrick Stolee <dstolee@microsoft.com> wrote:
>>
>>> Most code paths load commits using lookup_commit() and then
>>> parse_commit(). In some cases, including some branch lookups, the commit
>>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>>> favor of parse_commit_buffer().
>>>
>>> Before adding generation numbers to the commit-graph, we need to ensure
>>> that any commit that exists in the graph is loaded from the graph, so
>>> check parse_commit_in_graph() before calling parse_commit_buffer().
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> Modifying parse_object_buffer() is the most pragmatic way to accomplish
>> this, but this also means that parse_object_buffer() now potentially
>> reads from the local object store (instead of only relying on what's in
>> memory and what's in the provided buffer). parse_object_buffer() is
>> called by several callers including in builtin/fsck.c. I would feel more
>> comfortable if the relevant [1] caller to parse_object_buffer() was
>> modified instead of parse_object_buffer(), but I'll let others give
>> their opinions too.
> It's not just you. This seems like a really odd place to put it.
> Especially because if we have the buffer to pass to this function, then
> we'd already have incurred the cost to inflate the object.
>

OK. Thanks. I'll try to find the better place to put this check.

-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
  2018-04-03 18:31       ` Derrick Stolee
  2018-04-03 18:32       ` Brandon Williams
@ 2018-04-03 18:44       ` Stefan Beller
  2018-04-03 23:17       ` Ramsay Jones
  3 siblings, 0 replies; 103+ messages in thread
From: Stefan Beller @ 2018-04-03 18:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Brandon Williams, Derrick Stolee, git,
	Ævar Arnfjörð Bjarmason, Lars Schneider

On Tue, Apr 3, 2018 at 11:28 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>> > The generation number of a commit is defined recursively as follows:
>> >
>> > * If a commit A has no parents, then the generation number of A is one.
>> > * If a commit A has parents, then the generation number of A is one
>> >   more than the maximum generation number among the parents of A.
>> >
>> > Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
>
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not),

... which you meant in terms of DAG, where a linear history is the worst case
for generation numbers.

I first read it the other way round, as the best case w.r.t. timing

~/linux$ git log --oneline |wc -l
721223
$ git log --oneline --since 2012 |wc -l
421853
$ git log --oneline --since 2011 |wc -l
477155

The number of commits is growing exponentially, though the exponential
part is very small and the YoY growth can be estimated using linear
interpolation.

In linux, the release is a natural synchronization point IIUC as well
as on a regular schedule. So an interesting question to ask there would
be whether the delta in generation number goes up over time, or if the
DAG just gets wider (=more parallel)

> that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

Thanks for the estimate.
Stefan

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:29   ` Derrick Stolee
@ 2018-04-03 18:47     ` Jeff King
  2018-04-03 19:05       ` Jeff King
  2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
  1 sibling, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-03 18:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:

> If we have generic "can X reach Y?" queries, then we can also use generation
> numbers there to great effect (by not walking commits Z with gen(Z) <=
> gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> ideas.

I think the gist of it is the patch below. Which I hastily adapted from
the patch we run at GitHub that uses timestamps as a proxy. So it's
possible I completely flubbed the logic. I'm assuming unavailable
generation numbers are set to 0; the logic is actually a bit simpler if
they end up as (uint32_t)-1.

Assuming it works, that would cover for-each-ref and tag. You'd probably
want to drop the "with_commit_tag_algo" flag in ref-filter.h, and just
use always use it by default (and that would cover "git branch").

---
diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..6bea6173d1 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation && candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = -1;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation && c->generation < cutoff )
+			cutoff = c->generation;
+	}
+	if (cutoff == -1)
+		cutoff = 0;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 3/6] commit-graph: compute generation numbers
  2018-04-03 18:30   ` Jonathan Tan
@ 2018-04-03 18:49     ` Stefan Beller
  0 siblings, 0 replies; 103+ messages in thread
From: Stefan Beller @ 2018-04-03 18:49 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Lars Schneider, Jeff King

On Tue, Apr 3, 2018 at 11:30 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Tue,  3 Apr 2018 12:51:40 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
>
>> +             if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
>> +                     if ((*list)->generation > GENERATION_NUMBER_MAX)
>> +                             die("generation number %u is too large to store in commit-graph",
>> +                                 (*list)->generation);
>> +                     packedDate[0] |= htonl((*list)->generation << 2);
>> +             }
>
> The die() should have "BUG:" if you agree with my comment below.

I would remove the BUG/die() altogether and keep going.
(But do not write it out, i.e. warn and skip the next line)

A degraded commit graph with partial generation numbers is better
than Git refusing to write any part of the commit graph (which later on
will be part of many maintenance operations I would think, leading to
more immediate headache rather than "working but slightly slower")

>
>> +static void compute_generation_numbers(struct commit** commits,
>> +                                    int nr_commits)
>
> Style: space before **, not after.
>
>> +                     if (all_parents_computed) {
>> +                             current->generation = max_generation + 1;
>> +                             pop_commit(&list);
>> +                     }
>
> I think the current->generation should be clamped to _MAX here. If we do, then
> the die() I mentioned in my first comment will have "BUG:", since we are never
> meant to write any number larger than _MAX in ->generation.

When we clamp here, we'd have to treat the _MAX specially
in all our use cases or we'd encounter funny bugs due to miss ordered
commits later?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 5/6] commit.c: use generation to halt paint walk
  2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-03 19:01   ` Jonathan Tan
  0 siblings, 0 replies; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 19:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:42 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> -static int queue_has_nonstale(struct prio_queue *queue)
> +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
>  {
> -	int i;
> -	for (i = 0; i < queue->nr; i++) {
> -		struct commit *commit = queue->array[i].data;
> -		if (!(commit->object.flags & STALE))
> -			return 1;
> +	if (min_gen != GENERATION_NUMBER_UNDEF) {
> +		if (queue->nr > 0) {
> +			struct commit *commit = queue->array[0].data;
> +			return commit->generation >= min_gen;
> +		}

This only works if the prio_queue has
compare_commits_by_gen_then_commit_date. Also, I don't think that the
min_gen != GENERATION_NUMBER_UNDEF check is necessary. So I would write
this as:

  if (queue->compare == compare_commits_by_gen_then_commit_date &&
      queue->nr) {
    struct commit *commit = queue->array[0].data;
    return commit->generation >= min_gen;
  }
  for (i = 0 ...

If you'd rather not perform the comparison to
compare_commits_by_gen_then_commit_date every time you invoke
queue_has_nonstale(), that's fine with me too, but document somewhere
that queue_has_nonstale() only works if this comparison function is
used.

> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +
> +		last_gen = commit->generation;

last_gen seems to only be used to ensure that the priority queue returns
elements in the correct order - I think we can generally trust the
queue, and if we need to test it, we can do it elsewhere.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 6/6] commit-graph.txt: update future work
  2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-03 19:04   ` Jonathan Tan
  0 siblings, 0 replies; 103+ messages in thread
From: Jonathan Tan @ 2018-04-03 19:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue,  3 Apr 2018 12:51:43 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().

For completeness, I'll mention that I don't see any issues with this
patch, of course.

Thanks for this series.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:47     ` Jeff King
@ 2018-04-03 19:05       ` Jeff King
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-03 19:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 02:47:27PM -0400, Jeff King wrote:

> On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote:
> 
> > If we have generic "can X reach Y?" queries, then we can also use generation
> > numbers there to great effect (by not walking commits Z with gen(Z) <=
> > gen(Y)). Perhaps I should look at that "git branch --contains" thread for
> > ideas.
> 
> I think the gist of it is the patch below. Which I hastily adapted from
> the patch we run at GitHub that uses timestamps as a proxy. So it's
> possible I completely flubbed the logic. I'm assuming unavailable
> generation numbers are set to 0; the logic is actually a bit simpler if
> they end up as (uint32_t)-1.

Oh indeed, that is already the value of your UNDEF. So the patch is more
like this:

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
+	if (cutoff == GENERATION_NUMBER_UNDEF)
+		cutoff = GENERATION_NUMBER_NONE;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 18:28     ` Jeff King
                         ` (2 preceding siblings ...)
  2018-04-03 18:44       ` Stefan Beller
@ 2018-04-03 23:17       ` Ramsay Jones
  2018-04-03 23:19         ` Jeff King
  3 siblings, 1 reply; 103+ messages in thread
From: Ramsay Jones @ 2018-04-03 23:17 UTC (permalink / raw)
  To: Jeff King, Brandon Williams
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider



On 03/04/18 19:28, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
> 
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>   more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future?  Or is a 32 bit unsigned good enough?
> 
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

I didn't read the patches closely, but isn't it ~20,000 years?

Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 2/6] commit: add generation number to struct commmit
  2018-04-03 23:17       ` Ramsay Jones
@ 2018-04-03 23:19         ` Jeff King
  0 siblings, 0 replies; 103+ messages in thread
From: Jeff King @ 2018-04-03 23:19 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Wed, Apr 04, 2018 at 12:17:06AM +0100, Ramsay Jones wrote:

> >> Is there any reason to believe this would be too small of a value in the
> >> future?  Or is a 32 bit unsigned good enough?
> > 
> > The linux kernel took ~10 years to produce 500k commits. Even assuming
> > those were all linear (and they're not), that gives us ~80,000 years of
> > leeway. So even if the pace of development speeds up or we have a
> > quicker project, it still seems we have a pretty reasonable safety
> > margin.
> 
> I didn't read the patches closely, but isn't it ~20,000 years?
> 
> Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)

What, I'm supposed to read the patches before responding? Heresy.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-03 19:05       ` Jeff King
@ 2018-04-04 15:45         ` Derrick Stolee
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
  0 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
+	if (cutoff == GENERATION_NUMBER_UNDEF)
+		cutoff = GENERATION_NUMBER_NONE;
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-04 15:45           ` Derrick Stolee
  2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
  1 sibling, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 858f4fdbc9..2566cba79f 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.rc0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 17:01               ` Brandon Williams
  2018-04-04 18:24               ` Jeff King
  0 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 15:48 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 11:45 AM, Derrick Stolee wrote:
> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the target commits.
>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:
>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%
>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   commit.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 858f4fdbc9..2566cba79f 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>   {
>   	struct commit_list *bases;
>   	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
>   
>   	if (parse_commit(commit))
>   		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>   		if (parse_commit(reference[i]))
>   			return ret;
> +		if (min_generation > reference[i]->generation)
> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return 0;
>   
>   	bases = paint_down_to_common(commit, nr_reference, reference);
>   	if (commit->object.flags & PARENT2)

This patch may suffice to speed up 'git branch --contains' instead of 
needing to always use the 'git tag --contains' algorithm as considered 
in [1].

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180303051516.GE27689@sigill.intra.peff.net/
     Re: [PATCH 0/4] Speed up git tag --contains

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:48             ` Derrick Stolee
@ 2018-04-04 17:01               ` Brandon Williams
  2018-04-04 18:24               ` Jeff King
  1 sibling, 0 replies; 103+ messages in thread
From: Brandon Williams @ 2018-04-04 17:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider

On 04/04, Derrick Stolee wrote:
> On 4/4/2018 11:45 AM, Derrick Stolee wrote:
> > The containment algorithm for 'git branch --contains' is different
> > from that for 'git tag --contains' in that it uses is_descendant_of()
> > instead of contains_tag_algo(). The expensive portion of the branch
> > algorithm is computing merge bases.
> > 
> > When a commit-graph file exists with generation numbers computed,
> > we can avoid this merge-base calculation when the target commit has
> > a larger generation number than the target commits.
> > 
> > Performance tests were run on a copy of the Linux repository where
> > HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> > copied to branches and 'git branch --contains' was tested:
> > 
> > Before: 60.0s
> > After:   0.4s
> > Rel %: -99.3%

Now that is an impressive speedup.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
  2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-04 18:22           ` Jeff King
  2018-04-04 19:06             ` Derrick Stolee
  1 sibling, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-04 18:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote:

> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		parse_commit_or_die(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}
> +	if (cutoff == GENERATION_NUMBER_UNDEF)
> +		cutoff = GENERATION_NUMBER_NONE;

Hmm, on reflection, I'm not sure if this is right in the face of
multiple "want" commits, only some of which have generation numbers.  We
probably want to disable the cutoff if _any_ "want" commit doesn't have
a number.

There's also an obvious corner case where this won't kick in, and you'd
really like it to: recently added commits. E.g,. if I do this:

  git gc ;# imagine this writes generation numbers
  git pull
  git tag --contains HEAD

then HEAD isn't going to have a generation number. But this is the case
where we have the most to gain, since we could throw away all of the
ancient tags immediately upon seeing that their generation numbers are
way less than that of HEAD.

I wonder to what degree it's worth traversing to come up with a
generation number for the "want" commits. If we walked, say, 50 commits
to do it, you'd probably save a lot of work (since the alternative is
walking thousands of commits until you realize that some ancient "v1.0"
tag is not useful).

I'd actually go so far as to say that any amount of traversal is
generally going to be worth it to come up with the correct generation
cutoff here. You can come up with pathological cases where you only have
one really recent tag or something, but in practice every repository
where performance is a concern is going to end up with refs much further
back than it would take to reach the cutoff condition.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 15:48             ` Derrick Stolee
  2018-04-04 17:01               ` Brandon Williams
@ 2018-04-04 18:24               ` Jeff King
  2018-04-04 18:53                 ` Derrick Stolee
  1 sibling, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-04 18:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote:

> > diff --git a/commit.c b/commit.c
> > index 858f4fdbc9..2566cba79f 100644
> > --- a/commit.c
> > +++ b/commit.c
> > @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
> >   {
> >   	struct commit_list *bases;
> >   	int ret = 0, i;
> > +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
> >   	if (parse_commit(commit))
> >   		return ret;
> > -	for (i = 0; i < nr_reference; i++)
> > +	for (i = 0; i < nr_reference; i++) {
> >   		if (parse_commit(reference[i]))
> >   			return ret;
> > +		if (min_generation > reference[i]->generation)
> > +			min_generation = reference[i]->generation;
> > +	}
> > +
> > +	if (commit->generation > min_generation)
> > +		return 0;
> >   	bases = paint_down_to_common(commit, nr_reference, reference);
> >   	if (commit->object.flags & PARENT2)
> 
> This patch may suffice to speed up 'git branch --contains' instead of
> needing to always use the 'git tag --contains' algorithm as considered in
> [1].

I'd have to do some timings, but I suspect we may want to switch to the
"tag --contains" algorithm anyway. This still does N independent
merge-base operations, one per ref. So with enough refs, you're still
better off throwing it all into one big traversal.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 18:24               ` Jeff King
@ 2018-04-04 18:53                 ` Derrick Stolee
  2018-04-04 18:59                   ` Jeff King
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 18:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 2:24 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote:
>
>>> diff --git a/commit.c b/commit.c
>>> index 858f4fdbc9..2566cba79f 100644
>>> --- a/commit.c
>>> +++ b/commit.c
>>> @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>>    {
>>>    	struct commit_list *bases;
>>>    	int ret = 0, i;
>>> +	uint32_t min_generation = GENERATION_NUMBER_UNDEF;
>>>    	if (parse_commit(commit))
>>>    		return ret;
>>> -	for (i = 0; i < nr_reference; i++)
>>> +	for (i = 0; i < nr_reference; i++) {
>>>    		if (parse_commit(reference[i]))
>>>    			return ret;
>>> +		if (min_generation > reference[i]->generation)
>>> +			min_generation = reference[i]->generation;
>>> +	}
>>> +
>>> +	if (commit->generation > min_generation)
>>> +		return 0;
>>>    	bases = paint_down_to_common(commit, nr_reference, reference);
>>>    	if (commit->object.flags & PARENT2)
>> This patch may suffice to speed up 'git branch --contains' instead of
>> needing to always use the 'git tag --contains' algorithm as considered in
>> [1].

I guess I want to specify: the only reason to NOT switch to the tags 
algorithm is because it _may_ hurt existing cases in certain data shapes...

> I'd have to do some timings, but I suspect we may want to switch to the
> "tag --contains" algorithm anyway. This still does N independent
> merge-base operations, one per ref. So with enough refs, you're still
> better off throwing it all into one big traversal.

...and I suppose your timings are to find out if there are data shapes 
where the branch algorithm is faster. Perhaps that is impossible now 
that we have the generation number cutoff for the tag algorithm.

Since the branch algorithm checks generation numbers before triggering 
pain_down_to_common(), we will do N independent merge-base calculations, 
where N is the number of branches with large enough generation numbers 
(which is why my test does so well: most are below the target generation 
number). This doesn't help at all if none of the refs are in the graph.

The other thing to do is add a minimum generation for the walk in 
paint_down_to_common() so even if commit->generation <= min_generation 
we still only walk down to commit->generation instead of all merge 
bases. This is something we could change in a later patch.

Patches 7 and 8 seem to me like simple changes with no downside UNLESS 
we are deciding instead to delete the code I'm changing.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
  2018-04-04 18:53                 ` Derrick Stolee
@ 2018-04-04 18:59                   ` Jeff King
  0 siblings, 0 replies; 103+ messages in thread
From: Jeff King @ 2018-04-04 18:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 02:53:45PM -0400, Derrick Stolee wrote:

> > I'd have to do some timings, but I suspect we may want to switch to the
> > "tag --contains" algorithm anyway. This still does N independent
> > merge-base operations, one per ref. So with enough refs, you're still
> > better off throwing it all into one big traversal.
> 
> ...and I suppose your timings are to find out if there are data shapes where
> the branch algorithm is faster. Perhaps that is impossible now that we have
> the generation number cutoff for the tag algorithm.

Well, I wanted to show the opposite: that the branch algorithm can still
perform quite poorly. :)

I think with generation numbers that the tag algorithm should always
perform better, since you can't walk past a merge base when using a
cutoff. But it could definitely perform worse in a case where you don't
have generation numbers.

> Patches 7 and 8 seem to me like simple changes with no downside UNLESS we
> are deciding instead to delete the code I'm changing.

Yeah, I think they are strict improvements modulo the inverted UNDEF
logic I mentioned.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
@ 2018-04-04 19:06             ` Derrick Stolee
  2018-04-04 19:16               ` Jeff King
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:06 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 2:22 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote:
>
>> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   					      struct contains_cache *cache)
>>   {
>>   	struct contains_stack contains_stack = { 0, 0, NULL };
>> -	enum contains_result result = contains_test(candidate, want, cache);
>> +	enum contains_result result;
>> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
>> +	const struct commit_list *p;
>> +
>> +	for (p = want; p; p = p->next) {
>> +		struct commit *c = p->item;
>> +		parse_commit_or_die(c);
>> +		if (c->generation < cutoff)
>> +			cutoff = c->generation;
>> +	}

Now that you mention it, let me split out the portion you are probably 
talking about as incorrect:

>> +	if (cutoff == GENERATION_NUMBER_UNDEF)
>> +		cutoff = GENERATION_NUMBER_NONE;

You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we 
get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then 
our walk will be limited to commits NOT in the commit-graph (which we 
hope is small if proper hygiene is followed).

> Hmm, on reflection, I'm not sure if this is right in the face of
> multiple "want" commits, only some of which have generation numbers.  We
> probably want to disable the cutoff if _any_ "want" commit doesn't have
> a number.
>
> There's also an obvious corner case where this won't kick in, and you'd
> really like it to: recently added commits. E.g,. if I do this:
>
>    git gc ;# imagine this writes generation numbers
>    git pull
>    git tag --contains HEAD
>
> then HEAD isn't going to have a generation number. But this is the case
> where we have the most to gain, since we could throw away all of the
> ancient tags immediately upon seeing that their generation numbers are
> way less than that of HEAD.
>
> I wonder to what degree it's worth traversing to come up with a
> generation number for the "want" commits. If we walked, say, 50 commits
> to do it, you'd probably save a lot of work (since the alternative is
> walking thousands of commits until you realize that some ancient "v1.0"
> tag is not useful).
>
> I'd actually go so far as to say that any amount of traversal is
> generally going to be worth it to come up with the correct generation
> cutoff here. You can come up with pathological cases where you only have
> one really recent tag or something, but in practice every repository
> where performance is a concern is going to end up with refs much further
> back than it would take to reach the cutoff condition.

Perhaps there is some value in walking to find the correct cutoff value, 
but it is difficult to determine how far we are from commits with 
correct generation numbers _a priori_. I'd rather rely on the 
commit-graph being in a good state, not too far behind the refs. An 
added complexity of computing generation numbers dynamically is that we 
would need to add a dependence on the commit-graph file's existence at all.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:06             ` Derrick Stolee
@ 2018-04-04 19:16               ` Jeff King
  2018-04-04 19:22                 ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-04 19:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:

> > > @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
> > >   					      struct contains_cache *cache)
> > >   {
> > >   	struct contains_stack contains_stack = { 0, 0, NULL };
> > > -	enum contains_result result = contains_test(candidate, want, cache);
> > > +	enum contains_result result;
> > > +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
> > > +	const struct commit_list *p;
> > > +
> > > +	for (p = want; p; p = p->next) {
> > > +		struct commit *c = p->item;
> > > +		parse_commit_or_die(c);
> > > +		if (c->generation < cutoff)
> > > +			cutoff = c->generation;
> > > +	}
> 
> Now that you mention it, let me split out the portion you are probably
> talking about as incorrect:
> 
> > > +	if (cutoff == GENERATION_NUMBER_UNDEF)
> > > +		cutoff = GENERATION_NUMBER_NONE;
> 
> You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
> no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
> will be limited to commits NOT in the commit-graph (which we hope is small
> if proper hygiene is followed).

I think it's more than that. If we leave it at UNDEF, that's wrong,
because contains_test() compares:

  candidate->generation < cutoff

which would _always_ be true. In other words, we're saying that our
"want" has an insanely high generation number, and traversing can never
find it. Which is clearly wrong.

So we have to put it at "0", to say "you should always traverse, we
can't tell you that this is a dead end". So that part of the logic is
currently correct.

But what I was getting at is that the loop behavior can't just pick the
min cutoff. The min is effectively "0" if there's even a single ref for
which we don't have a generation number, because we cannot ever stop
traversing (we might get to that commit if we kept going).

(It's also possible I'm confused about how UNDEF and NONE are used; I'm
assuming commits for which we don't have a generation number available
would get UNDEF in their commit->generation field).

If you could make the assumption that when we have a generation for
commit X, then we have a generation for all of its ancestors, things get
easier. Because then if you hit commit X with a generation number and
want to compare it to a cutoff, you know that either:

  1. The cutoff is defined, in which case you can stop traversing if
     we've gone past the cutoff.

  2. The cutoff is undefined, in which case we cannot possibly reach
     our "want" by traversing. Even if it has a smaller generation
     number than us, it's on an unrelated line of development.

I don't know that the reachability property is explicitly promised by
your work, but it seems like it would be a natural fallout (after all,
you have to know the generation of each ancestor in order to compute the
later ones, so you're really just promising that you've actually stored
all the ones you've computed).

> > I wonder to what degree it's worth traversing to come up with a
> > generation number for the "want" commits. If we walked, say, 50 commits
> > to do it, you'd probably save a lot of work (since the alternative is
> > walking thousands of commits until you realize that some ancient "v1.0"
> > tag is not useful).
> > 
> > I'd actually go so far as to say that any amount of traversal is
> > generally going to be worth it to come up with the correct generation
> > cutoff here. You can come up with pathological cases where you only have
> > one really recent tag or something, but in practice every repository
> > where performance is a concern is going to end up with refs much further
> > back than it would take to reach the cutoff condition.
> 
> Perhaps there is some value in walking to find the correct cutoff value, but
> it is difficult to determine how far we are from commits with correct
> generation numbers _a priori_. I'd rather rely on the commit-graph being in
> a good state, not too far behind the refs. An added complexity of computing
> generation numbers dynamically is that we would need to add a dependence on
> the commit-graph file's existence at all.

If you could make the reachability assumption, I think this question
just goes away. As soon as you hit a commit with _any_ generation
number, you could quit traversing down that path.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:16               ` Jeff King
@ 2018-04-04 19:22                 ` Derrick Stolee
  2018-04-04 19:42                   ` Jeff King
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 3:16 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:
>
>>>> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>>>    					      struct contains_cache *cache)
>>>>    {
>>>>    	struct contains_stack contains_stack = { 0, 0, NULL };
>>>> -	enum contains_result result = contains_test(candidate, want, cache);
>>>> +	enum contains_result result;
>>>> +	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
>>>> +	const struct commit_list *p;
>>>> +
>>>> +	for (p = want; p; p = p->next) {
>>>> +		struct commit *c = p->item;
>>>> +		parse_commit_or_die(c);
>>>> +		if (c->generation < cutoff)
>>>> +			cutoff = c->generation;
>>>> +	}
>> Now that you mention it, let me split out the portion you are probably
>> talking about as incorrect:
>>
>>>> +	if (cutoff == GENERATION_NUMBER_UNDEF)
>>>> +		cutoff = GENERATION_NUMBER_NONE;
>> You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
>> no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
>> will be limited to commits NOT in the commit-graph (which we hope is small
>> if proper hygiene is followed).
> I think it's more than that. If we leave it at UNDEF, that's wrong,
> because contains_test() compares:
>
>    candidate->generation < cutoff
>
> which would _always_ be true. In other words, we're saying that our
> "want" has an insanely high generation number, and traversing can never
> find it. Which is clearly wrong.

That condition is not always true (which is why we use strict comparison 
instead of <=). If a commit is not in the commit-graph file, then its 
generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c:

void *alloc_commit_node(void)
{
         struct commit *c = alloc_node(&commit_state, sizeof(struct 
commit));
         c->object.type = OBJ_COMMIT;
         c->index = alloc_commit_index();
         c->graph_pos = COMMIT_NOT_FROM_GRAPH;
         c->generation = GENERATION_NUMBER_UNDEF;
         return c;
}


> So we have to put it at "0", to say "you should always traverse, we
> can't tell you that this is a dead end". So that part of the logic is
> currently correct.
>
> But what I was getting at is that the loop behavior can't just pick the
> min cutoff. The min is effectively "0" if there's even a single ref for
> which we don't have a generation number, because we cannot ever stop
> traversing (we might get to that commit if we kept going).
>
> (It's also possible I'm confused about how UNDEF and NONE are used; I'm
> assuming commits for which we don't have a generation number available
> would get UNDEF in their commit->generation field).

I think it is this case.

> If you could make the assumption that when we have a generation for
> commit X, then we have a generation for all of its ancestors, things get
> easier. Because then if you hit commit X with a generation number and
> want to compare it to a cutoff, you know that either:
>
>    1. The cutoff is defined, in which case you can stop traversing if
>       we've gone past the cutoff.
>
>    2. The cutoff is undefined, in which case we cannot possibly reach
>       our "want" by traversing. Even if it has a smaller generation
>       number than us, it's on an unrelated line of development.
>
> I don't know that the reachability property is explicitly promised by
> your work, but it seems like it would be a natural fallout (after all,
> you have to know the generation of each ancestor in order to compute the
> later ones, so you're really just promising that you've actually stored
> all the ones you've computed).

The commit-graph is closed under reachability, so if a commit has a 
generation number then it is in the graph and so are all its ancestors.

The reason for GENERATION_NUMBER_NONE is that the commit-graph file 
stores "0" for generation number until this patch. It still satisfies 
the condition that gen(A) < gen(B) if B can reach A, but also gives us a 
condition for "this commit still needs its generation number computed".

>
>>> I wonder to what degree it's worth traversing to come up with a
>>> generation number for the "want" commits. If we walked, say, 50 commits
>>> to do it, you'd probably save a lot of work (since the alternative is
>>> walking thousands of commits until you realize that some ancient "v1.0"
>>> tag is not useful).
>>>
>>> I'd actually go so far as to say that any amount of traversal is
>>> generally going to be worth it to come up with the correct generation
>>> cutoff here. You can come up with pathological cases where you only have
>>> one really recent tag or something, but in practice every repository
>>> where performance is a concern is going to end up with refs much further
>>> back than it would take to reach the cutoff condition.
>> Perhaps there is some value in walking to find the correct cutoff value, but
>> it is difficult to determine how far we are from commits with correct
>> generation numbers _a priori_. I'd rather rely on the commit-graph being in
>> a good state, not too far behind the refs. An added complexity of computing
>> generation numbers dynamically is that we would need to add a dependence on
>> the commit-graph file's existence at all.
> If you could make the reachability assumption, I think this question
> just goes away. As soon as you hit a commit with _any_ generation
> number, you could quit traversing down that path.
That is the idea. I should make this clearer in all of my commit messages.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:22                 ` Derrick Stolee
@ 2018-04-04 19:42                   ` Jeff King
  2018-04-04 19:45                     ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jeff King @ 2018-04-04 19:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:

> > I don't know that the reachability property is explicitly promised by
> > your work, but it seems like it would be a natural fallout (after all,
> > you have to know the generation of each ancestor in order to compute the
> > later ones, so you're really just promising that you've actually stored
> > all the ones you've computed).
> 
> The commit-graph is closed under reachability, so if a commit has a
> generation number then it is in the graph and so are all its ancestors.

OK, if we assume that it's closed, then I think we can effectively
ignore the UNDEF cases. They'll just work out. And then yes I'd agree
that the:

  if (cutoff == UNDEF)
    cutoff = NONE;

code is wrong. We'd want to keep it at UNDEF so we stop traversing at
any generation number.

> The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores
> "0" for generation number until this patch. It still satisfies the condition
> that gen(A) < gen(B) if B can reach A, but also gives us a condition for
> "this commit still needs its generation number computed".

OK. I thought at first that would yield wrong results when comparing
UNDEF to NONE, but I think for this kind of --contains traversal, it's
still OK (NONE is less than UNDEF, but we know that the UNDEF thing
cannot be found by traversing from a NONE).

> > If you could make the reachability assumption, I think this question
> > just goes away. As soon as you hit a commit with _any_ generation
> > number, you could quit traversing down that path.
> That is the idea. I should make this clearer in all of my commit messages.

Yes, please. :) And maybe in the documentation of the file format, if
it's not there (I didn't check). It's a very useful property, and we
want to make sure people making use of the graph know they can depend on
it.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:42                   ` Jeff King
@ 2018-04-04 19:45                     ` Derrick Stolee
  2018-04-04 19:46                       ` Jeff King
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-04 19:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On 4/4/2018 3:42 PM, Jeff King wrote:
> On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:
>
>> That is the idea. I should make this clearer in all of my commit messages.
> Yes, please. :) And maybe in the documentation of the file format, if
> it's not there (I didn't check). It's a very useful property, and we
> want to make sure people making use of the graph know they can depend on
> it.

For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion 
of generation numbers in Documentation/technical/commit-graph.txt (the 
design doc instead of the file format).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 7/6] ref-filter: use generation number for --contains
  2018-04-04 19:45                     ` Derrick Stolee
@ 2018-04-04 19:46                       ` Jeff King
  0 siblings, 0 replies; 103+ messages in thread
From: Jeff King @ 2018-04-04 19:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill

On Wed, Apr 04, 2018 at 03:45:30PM -0400, Derrick Stolee wrote:

> On 4/4/2018 3:42 PM, Jeff King wrote:
> > On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:
> > 
> > > That is the idea. I should make this clearer in all of my commit messages.
> > Yes, please. :) And maybe in the documentation of the file format, if
> > it's not there (I didn't check). It's a very useful property, and we
> > want to make sure people making use of the graph know they can depend on
> > it.
> 
> For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion of
> generation numbers in Documentation/technical/commit-graph.txt (the design
> doc instead of the file format).

Yeah, that makes sense. Thanks, and thanks for a thoughtful discussion.
The performance numbers are very exciting.

-Peff

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-04-03 18:03 ` Brandon Williams
@ 2018-04-07 16:55 ` Jakub Narebski
  2018-04-08  1:06   ` Derrick Stolee
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  9 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-07 16:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Jeff King

Hello,

Derrick Stolee <dstolee@microsoft.com> writes:

> This is the first of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used [...].
>
> This does not have a significant performance benefit in repositories
> of normal size, but in the Windows repository, some merge-base
> calculations improve from 3.1s to 2.9s. A modest speedup, but provides
> an actual consumer of generation numbers as a starting point.
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.

I have started working on Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using generation numbers (level
negative-cut filter), FELINE index (negative-cut filter) and min-post
intervals in some spanning tree (positive-cut filter, if I understand it
correctly the base of GRAIL method) in commit graphs.

Currently I am at the stage of reproducing results in FELINE paper:
"Reachability Queries in Very Large Graphs: A Fast Refined Online Search
Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
J. Zaki (2014).  This paper is available in the PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can be also run
locally) uses Python kernel, NetworkX librabry for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-03 18:29   ` Derrick Stolee
  2018-04-03 18:47     ` Jeff King
@ 2018-04-07 17:09     ` Jakub Narebski
  1 sibling, 0 replies; 103+ messages in thread
From: Jakub Narebski @ 2018-04-07 17:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller,
	larsxschneider, peff

Derrick Stolee <stolee@gmail.com> writes:

> On 4/3/2018 2:03 PM, Brandon Williams wrote:
>> On 04/03, Derrick Stolee wrote:
>>> This is the first of several "small" patches that follow the serialized
>>> Git commit graph patch (ds/commit-graph).
>>>
>>> As described in Documentation/technical/commit-graph.txt, the generation
>>> number of a commit is one more than the maximum generation number among
>>> its parents (trivially, a commit with no parents has generation number
>>> one).
[...]
>>> A more substantial refactoring of revision.c is required before making
>>> 'git log --graph' use generation numbers effectively.
>>
>> log --graph should benefit a lot more from this correct?  I know we've
>> talked a bit about negotiation and I wonder if these generation numbers
>> should be able to help out a little bit with that some day.
>
> 'log --graph' should be a HUGE speedup, when it is refactored. Since
> the topo-order can "stream" commits to the pager, it can be very
> responsive to return the graph in almost all conditions. (The case
> where generation numbers are not enough is when filters reduce the set
> of displayed commits to be very sparse, so many commits are walked
> anyway.)

I wonder if next big speedup would be to store [some] topological
ordering of commits in the commit graph... It could be done for example
in two chunks: a mapping to position in topological order, and list of
commits sorted in topological order.

Note also that FELINE index uses (or can use -- but it is supposedly the
optimal choice) position of vertex/node in topological order as one of
the two values in the pair that composes FELINE index.

> If we have generic "can X reach Y?" queries, then we can also use
> generation numbers there to great effect (by not walking commits Z
> with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch
> --contains" thread for ideas.

This is something that is shown in the Google Colab [Jupyter] Notebook
I have mentioned:

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

> For negotiation, there are some things we can do here. VSTS uses
> generation numbers as a heuristic for determining "all wants connected
> to haves" which is a condition for halting negotiation. The idea is
> very simple, and I'd be happy to discuss it on a separate thread.

Nice.  How much speedup it gives?

Best regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-07 16:55 ` Jakub Narebski
@ 2018-04-08  1:06   ` Derrick Stolee
  2018-04-11 19:32     ` Jakub Narebski
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-08  1:06 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Jeff King

On 4/7/2018 12:55 PM, Jakub Narebski wrote:
> Currently I am at the stage of reproducing results in FELINE paper:
> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
> J. Zaki (2014).  This paper is available in the PDF form at
> https://openproceedings.org/EDBT/2014/paper_166.pdf
>
> The Jupyter Notebook (which runs on Google cloud, but can be also run
> locally) uses Python kernel, NetworkX librabry for graph manipulation,
> and matplotlib (via NetworkX) for display.
>
> Available at:
> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>
> I hope that could be of help, or at least interesting

Let me know when you can give numbers (either raw performance or # of 
commits walked) for real-world Git commit graphs. The Linux repo is a 
good example to use for benchmarking, but I also use the Kotlin repo 
sometimes as it has over a million objects and over 250K commits.

Of course, the only important statistic at the end of the day is the 
end-to-end time of a 'git ...' command. Your investigations should 
inform whether it is worth prototyping the feature in the git codebase.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 00/10] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-04-07 16:55 ` Jakub Narebski
@ 2018-04-09 16:41 ` " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
                     ` (10 more replies)
  9 siblings, 11 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Thanks for the lively discussion of this patch series in v1!

I've incorporated the feedback from the previous round, added patches
[7/6] and [8/6], expanded the discussion of generation numbers in the
design document, and added another speedup for 'git branch --contains'.

One major difference: I renamed the macros from _UNDEF to _INFINITY and
_NONE to _ZERO. This communicates their value more clearly, since the
previous names were unclear about which was larger than the "real"
generation numbers.

Patch 2 includes a change to builtin/merge.c and a new test in
t5318-commit-graph.sh that exposes a problem I found when testing the
previous patch series on my box. The "BUG: bad generation skip" message
from "commit.c: use generation to halt paint walk" would halt a fast-
forward merge since the HEAD commit was loaded before the core.commitGraph
config setting was loaded. It is crucial that all commits that exist
in the commit-graph file are loaded from that file or else we will
lose our expected inequalities of generation numbers.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

Further, use generation numbers for '--contains' queries in 'git tag'
and 'git branch', providing a significant speedup (at least 95% for
some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (10):
  object.c: parse commit in graph first
  merge: check config before loading commits
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit.c: use generation to halt paint walk
  commit-graph.txt: update future work
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()

 Documentation/technical/commit-graph.txt | 50 +++++++++++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  5 +-
 commit-graph.c                           | 48 +++++++++++++++
 commit.c                                 | 78 ++++++++++++++++++++----
 commit.h                                 |  5 ++
 object.c                                 |  4 +-
 ref-filter.c                             | 24 ++++++--
 t/t5318-commit-graph.sh                  |  9 +++
 9 files changed, 197 insertions(+), 27 deletions(-)

-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 01/10] object.c: parse commit in graph first
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
@ 2018-04-09 16:41   ` Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (!parse_commit_in_graph(commit) &&
+			    parse_commit_buffer(commit, buffer, size))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 02/10] merge: check config before loading commits
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
@ 2018-04-09 16:41   ` Derrick Stolee
  2018-04-11  2:12     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

In anticipation of using generation numbers from the commit-graph,
we must ensure that all commits that exist in the commit-graph are
loaded from that file instead of from the object database. Since
the commit-graph file is only checked if core.commitGraph is true,
we must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch'
and the commit lookup. Also add a test to t5318-commit-graph.sh
that exercises this code path to prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 5 +++--
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index ee050a47f3..20897f8223 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
 
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 17:59     ` Stefan Beller
  2018-04-11  2:31     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
                     ` (7 subsequent siblings)
  10 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_ININITY 0xFFFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_ZERO) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..b91df315c5 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -24,6 +27,7 @@ struct commit {
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-11  2:51     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..5fd63acc31 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 05/10] commit: use generations in paint_down_to_common()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (3 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 19 ++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* newer commits with larger date first */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index b91df315c5..c440f56bf9 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (4 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-11  3:02     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..00bdc2ab21 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-	int i;
-	for (i = 0; i < queue->nr; i++) {
-		struct commit *commit = queue->array[i].data;
-		if (!(commit->object.flags & STALE))
-			return 1;
+	if (min_gen != GENERATION_NUMBER_INFINITY) {
+		if (queue->nr > 0) {
+			struct commit *commit = queue->array[0].data;
+			return commit->generation >= min_gen;
+		}
+	} else {
+		int i;
+		for (i = 0; i < queue->nr; i++) {
+			struct commit *commit = queue->array[i].data;
+			if (!(commit->object.flags & STALE))
+				return 1;
+		}
 	}
+
 	return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+	uint32_t min_nonstale_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		return result;
 	}
 	prio_queue_put(&queue, one);
+	if (one->generation < min_nonstale_gen)
+		min_nonstale_gen = one->generation;
 
 	for (i = 0; i < n; i++) {
 		twos[i]->object.flags |= PARENT2;
 		prio_queue_put(&queue, twos[i]);
+		if (twos[i]->generation < min_nonstale_gen)
+			min_nonstale_gen = twos[i]->generation;
 	}
 
-	while (queue_has_nonstale(&queue)) {
+	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
 		struct commit *commit = prio_queue_get(&queue);
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+
+		last_gen = commit->generation;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 				return NULL;
 			p->object.flags |= flags;
 			prio_queue_put(&queue, p);
+
+			if (!(flags & STALE) &&
+			    p->generation < min_nonstale_gen)
+				min_nonstale_gen = p->generation;
 		}
 	}
 
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (5 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-12  9:12     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the two
"special" generation numbers GENERATION_NUMBER_INFINITY and *_ZERO
interact with other generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 50 +++++++++++++++++++++---
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..a8df0ae9db 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,49 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N amd M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+Here is a diagram to visualize the shape of the full commit graph, and
+how different generation numbers relate:
+
+    +-----------------------------------------+
+    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
+    +-----------------------------------------+
+	    |            |      ^
+	    |            |      |
+	    |            +------+
+	    |         [gen(A) = gen(B)]
+	    V
+    +-------------------------------------+
+    | 0 < commit->generation < 0x40000000 |
+    +-------------------------------------+
+	    |            |      ^
+	    |            |      |
+	    |            +------+
+	    |        [gen(A) > gen(B)]
+	    V
+    +-------------------------------------+
+    | GENERATION_NUMBER_ZERO = 0          |
+    +-------------------------------------+
+			 |      ^
+			 |      |
+			 +------+
+		     [gen(A) = gen(B)]
+
 Design Details
 --------------
 
@@ -98,17 +141,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 08/10] ref-filter: use generation number for --contains
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (6 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..2f5e79b5de 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate,
 		return CONTAINS_YES;
 	}
 
-	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1634,7 +1648,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1648,7 +1662,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 09/10] commit: use generation numbers for in_merge_bases()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (7 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  10 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 00bdc2ab21..0b155dece8 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common()
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (8 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  10 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 0b155dece8..7348075e38 100644
--- a/commit.c
+++ b/commit.c
@@ -796,7 +796,9 @@ static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
@@ -830,6 +832,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 
 		last_gen = commit->generation;
 
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -882,7 +887,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -949,7 +954,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1073,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return 0;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 17:59     ` Stefan Beller
  2018-04-11  2:31     ` Junio C Hamano
  1 sibling, 0 replies; 103+ messages in thread
From: Stefan Beller @ 2018-04-09 17:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, larsxschneider, bmwill

On Mon, Apr 9, 2018 at 9:42 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_ININITY 0xFFFFFFFF

GENERATION_NUMBER_INFINITY

On disk we currently only store up to 2^30-1,
(2 bits fewer than MAX_UINT_32), but here we just take the maximum
value of what a uint32_t can store. That miss match should not be a
problem albeit aesthetically.

Once we run into scaling problems, we can just up to uint64_t in the code,
and defer the solution on disk to a new file format.

With both ZERO and _INFINITY we are at the border of uint
wrap-around, so we have to be very careful to not add/subtract
one and then compare. Just to watch out for when reviewing.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 02/10] merge: check config before loading commits
  2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
@ 2018-04-11  2:12     ` Junio C Hamano
  2018-04-11 12:49       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <dstolee@microsoft.com> writes:

> diff --git a/builtin/merge.c b/builtin/merge.c
> index ee050a47f3..20897f8223 100644
> --- a/builtin/merge.c
> +++ b/builtin/merge.c
> @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>  	if (branch)
>  		skip_prefix(branch, "refs/heads/", &branch);
> +	init_diff_ui_defaults();
> +	git_config(git_merge_config, NULL);
> +
>  	if (!branch || is_null_oid(&head_oid))
>  		head_commit = NULL;
>  	else
>  		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>  
> -	init_diff_ui_defaults();
> -	git_config(git_merge_config, NULL);

Wow, that's tricky.  git_merge_config() wants to know which "branch"
we are on, and this place is as early as we can move the call to
without breaking things.  Is this to allow parse_object() called
in lookup_commit_reference_gently() to know if we can rely on the
data cached in the commit-graph data?

> Move the config load to be between the initialization of 'branch'
> and the commit lookup. Also add a test to t5318-commit-graph.sh
> that exercises this code path to prevent a regression.

It is not clear to me how a successful merge of commits/8
demonstrates that reading the config earlier than before is
regression free.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index a380419b65..77d85aefe7 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
>  graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
>  
> +test_expect_success 'perform fast-forward merge in full repo' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git checkout -b merge-5-to-8 commits/5 &&
> +	git merge commits/8 &&
> +	git show-ref -s merge-5-to-8 >output &&
> +	git show-ref -s commits/8 >expect &&
> +	test_cmp expect output
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
  2018-04-09 17:59     ` Stefan Beller
@ 2018-04-11  2:31     ` Junio C Hamano
  2018-04-11 12:57       ` Derrick Stolee
  1 sibling, 1 reply; 103+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <dstolee@microsoft.com> writes:

> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_ININITY 0xFFFFFFFF
> GENERATION_NUMBER_ZERO 0
>
> The first (_INFINITY) means the generation number has not been loaded or
> computed. The second (_ZERO) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.

Should it also be possible for a caller to tell if a given commit
has too deep a history, i.e. we do not know its generation number
exactly, but we know it is larger than 1<<30?

It seems that we only have a 30-bit field in the file, so wouldn't
we need a special value defined in (e.g. "0") so that we can tell
that the commit has such a large generation number?  E.g.

> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

	if (!item->generation)
		item->generation = GENERATION_NUMBER_OVERFLOW;

when we read it from the file?

We obviously need to do something similar when assigning a
generation number to a child commit, perhaps like

	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)

	commit->generation = 1; /* assume no parent */
	for (p = commit->parents; p; p++) {
		uint32_t gen = p->item->generation + 1;

		if (gen >= GENERATION_NUMBER_OVERFLOW) {
			commit->generation = GENERATION_NUMBER_OVERFLOW;
			break;
		} else if (commit->generation < gen)
			commit->generation = gen;
	}
        
or something?  And then on the writing side you'd encode too large a
generation as '0'.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-11  2:51     ` Junio C Hamano
  2018-04-11 13:02       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Junio C Hamano @ 2018-04-11  2:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <dstolee@microsoft.com> writes:

> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
> +				die("generation number %u is too large to store in commit-graph",
> +				    (*list)->generation);
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +		}


How serious do we want this feature to be?  On one extreme, we could
be irresponsible and say it will be a problem for our descendants in
the future if their repositories have more than billion pearls on a
single strand, and the above certainly is a reasonable way to punt.
Those who actually encounter the problem will notice by Git dying
somewhere rather deep in the callchain.

Or we could say Git actually does support a history that is
arbitrarily long, even though such a deep portion of history will
not benefit from having generation numbers in commit-graph.

I've been assuming that our stance is the latter and that is why I
made noises about overflowing 30-bit generation field in my review
of the previous step.

In case we want to do the "we know this is very large, but we do not
know the exact value", we may actually want a mode where we can
pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
make sure that the code to handle overflow behaves sensibly.

> +	for (i = 0; i < nr_commits; i++) {
> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
> +			continue;
> +
> +		commit_list_insert(commits[i], &list);
> +		while (list) {
> +...
> +		}
> +	}

So we go over the list of commits just _once_ and make sure each of
them gets the generation assigned correctly by (conceptually
recursively but iteratively in implementation by using a commit
list) making sure that all its parents have generation assigned and
compute the generation for the commit, before moving to the next
one.  Which sounds correct.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
@ 2018-04-11  3:02     ` Junio C Hamano
  2018-04-11 13:24       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Junio C Hamano @ 2018-04-11  3:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <dstolee@microsoft.com> writes:

> @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		return result;
>  	}
>  	prio_queue_put(&queue, one);
> +	if (one->generation < min_nonstale_gen)
> +		min_nonstale_gen = one->generation;
>  
>  	for (i = 0; i < n; i++) {
>  		twos[i]->object.flags |= PARENT2;
>  		prio_queue_put(&queue, twos[i]);
> +		if (twos[i]->generation < min_nonstale_gen)
> +			min_nonstale_gen = twos[i]->generation;
>  	}
>  
> -	while (queue_has_nonstale(&queue)) {
> +	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
>  		struct commit *commit = prio_queue_get(&queue);
>  		struct commit_list *parents;
>  		int flags;
>  
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +
> +		last_gen = commit->generation;
> +
>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  				return NULL;
>  			p->object.flags |= flags;

Hmph.  Can a commit that used to be not stale (and contributed to
the current value of min_nonstale_gen) become stale here by getting
visited twice, invalidating the value in min_nonstale_gen?

>  			prio_queue_put(&queue, p);
> +
> +			if (!(flags & STALE) &&
> +			    p->generation < min_nonstale_gen)
> +				min_nonstale_gen = p->generation;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 02/10] merge: check config before loading commits
  2018-04-11  2:12     ` Junio C Hamano
@ 2018-04-11 12:49       ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-11 12:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:12 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> diff --git a/builtin/merge.c b/builtin/merge.c
>> index ee050a47f3..20897f8223 100644
>> --- a/builtin/merge.c
>> +++ b/builtin/merge.c
>> @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>>   	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
>>   	if (branch)
>>   		skip_prefix(branch, "refs/heads/", &branch);
>> +	init_diff_ui_defaults();
>> +	git_config(git_merge_config, NULL);
>> +
>>   	if (!branch || is_null_oid(&head_oid))
>>   		head_commit = NULL;
>>   	else
>>   		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
>>   
>> -	init_diff_ui_defaults();
>> -	git_config(git_merge_config, NULL);
> Wow, that's tricky.  git_merge_config() wants to know which "branch"
> we are on, and this place is as early as we can move the call to
> without breaking things.  Is this to allow parse_object() called
> in lookup_commit_reference_gently() to know if we can rely on the
> data cached in the commit-graph data?

When I saw the bug on my machine, I tracked the issue down to a call to 
parse_commit_in_graph() that skipped the graph check since 
core_commit_graph was not set. The call stack from this call is as follows:

* lookup_commit_or_die()
* lookup_commit_reference()
* lookup_commit_reference_gently()
* parse_object()
* parse_object_buffer()
* parse_commit_in_graph() [as introduced in PATCH 01/10]

>
>> Move the config load to be between the initialization of 'branch'
>> and the commit lookup. Also add a test to t5318-commit-graph.sh
>> that exercises this code path to prevent a regression.
> It is not clear to me how a successful merge of commits/8
> demonstrates that reading the config earlier than before is
> regression free.

I didn't want to introduce commits in an order that led to a commit 
failing tests, but if you drop the change to builtin/merge.c from this 
series, the tip commit will fail this test with "BUG: bad generation skip".

The reason for this failure is that commits/5 is loaded from HEAD from 
the object database, so its generation is marked as 
GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, 
the commit at merges/3 is loaded from the graph with generation 4. This 
triggers the BUG statement in paint_down_to_common(). That is why it is 
important to check a fast-forward merge.

In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were 
already testing 'git merge-base' to check the commit walk logic.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-11  2:31     ` Junio C Hamano
@ 2018-04-11 12:57       ` Derrick Stolee
  2018-04-11 23:28         ` Junio C Hamano
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-11 12:57 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:31 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The generation number of a commit is defined recursively as follows:
>>
>> * If a commit A has no parents, then the generation number of A is one.
>> * If a commit A has parents, then the generation number of A is one
>>    more than the maximum generation number among the parents of A.
>>
>> Add a uint32_t generation field to struct commit so we can pass this
>> information to revision walks. We use two special values to signal
>> the generation number is invalid:
>>
>> GENERATION_NUMBER_ININITY 0xFFFFFFFF
>> GENERATION_NUMBER_ZERO 0
>>
>> The first (_INFINITY) means the generation number has not been loaded or
>> computed. The second (_ZERO) means the generation number was loaded
>> from a commit graph file that was stored before generation numbers
>> were computed.
> Should it also be possible for a caller to tell if a given commit
> has too deep a history, i.e. we do not know its generation number
> exactly, but we know it is larger than 1<<30?
>
> It seems that we only have a 30-bit field in the file, so wouldn't
> we need a special value defined in (e.g. "0") so that we can tell
> that the commit has such a large generation number?  E.g.
>
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 	if (!item->generation)
> 		item->generation = GENERATION_NUMBER_OVERFLOW;
>
> when we read it from the file?
>
> We obviously need to do something similar when assigning a
> generation number to a child commit, perhaps like
>
> 	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)
>
> 	commit->generation = 1; /* assume no parent */
> 	for (p = commit->parents; p; p++) {
> 		uint32_t gen = p->item->generation + 1;
>
> 		if (gen >= GENERATION_NUMBER_OVERFLOW) {
> 			commit->generation = GENERATION_NUMBER_OVERFLOW;
> 			break;
> 		} else if (commit->generation < gen)
> 			commit->generation = gen;
> 	}
>          
> or something?  And then on the writing side you'd encode too large a
> generation as '0'.

You raise a very good point. How about we do a slightly different 
arrangement for these overflow commits?

Instead of storing the commits in the commit-graph file as "0" (which 
currently means "written by a version of git that did not compute 
generation numbers") we could let GENERATION_NUMBER_MAX be the maximum 
generation of a commit in the commit-graph, and if a commit would have 
larger generation, we collapse it down to that value.

It slightly complicates the diagram I made in 
Documentation/technical/commit-graph.txt, but it was already a bit of a 
simplification. Here is an updated diagram, but likely we will want to 
limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, 
since it is not a practical situation at the moment.

     +-----------------------------------------+
     | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
     +-----------------------------------------+
       |    |            |      ^
       |    |            |      |
       |    |            +------+
       |    |         [gen(A) = gen(B)]
       |    V
       |  +------------------------------------+
       |  | GENERATION_NUMBER_MAX = 0x3FFFFFFF |
       |  +------------------------------------+
       |    |            |      ^
       |    |            |      |
       |    |            +------+
       |    |         [gen(A) = gen(B)]
       V    V
     +-------------------------------------+
     | 0 < commit->generation < 0x3FFFFFFF |
     +-------------------------------------+
         |            |      ^
         |            |      |
         |            +------+
         |        [gen(A) > gen(B)]
         V
     +-------------------------------------+
     | GENERATION_NUMBER_ZERO = 0          |
     +-------------------------------------+
              |      ^
              |      |
              +------+
              [gen(A) = gen(B)]

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11  2:51     ` Junio C Hamano
@ 2018-04-11 13:02       ` Derrick Stolee
  2018-04-11 18:49         ` Stefan Beller
  2018-04-11 19:26         ` Eric Sunshine
  0 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-11 13:02 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 10:51 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
>> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
>> +				die("generation number %u is too large to store in commit-graph",
>> +				    (*list)->generation);
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +		}
>
> How serious do we want this feature to be?  On one extreme, we could
> be irresponsible and say it will be a problem for our descendants in
> the future if their repositories have more than billion pearls on a
> single strand, and the above certainly is a reasonable way to punt.
> Those who actually encounter the problem will notice by Git dying
> somewhere rather deep in the callchain.
>
> Or we could say Git actually does support a history that is
> arbitrarily long, even though such a deep portion of history will
> not benefit from having generation numbers in commit-graph.
>
> I've been assuming that our stance is the latter and that is why I
> made noises about overflowing 30-bit generation field in my review
> of the previous step.
>
> In case we want to do the "we know this is very large, but we do not
> know the exact value", we may actually want a mode where we can
> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
> make sure that the code to handle overflow behaves sensibly.

I agree. I wonder how we can effectively expose this value into a test. 
It's probably not sufficient to manually test using compiler flags ("-D 
GENERATION_NUMBER_MAX=8").

>
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
>> +
>> +		commit_list_insert(commits[i], &list);
>> +		while (list) {
>> +...
>> +		}
>> +	}
> So we go over the list of commits just _once_ and make sure each of
> them gets the generation assigned correctly by (conceptually
> recursively but iteratively in implementation by using a commit
> list) making sure that all its parents have generation assigned and
> compute the generation for the commit, before moving to the next
> one.  Which sounds correct.

Yes, we compute the generation number of a commit exactly once. We use 
the list as a stack so we do not have recursion limits during our 
depth-first search (DFS). We rely on the object cache to ensure we store 
the computed generation numbers, and computed generation numbers provide 
termination conditions to the DFS.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
  2018-04-11  3:02     ` Junio C Hamano
@ 2018-04-11 13:24       ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-11 13:24 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/10/2018 11:02 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>   		return result;
>>   	}
>>   	prio_queue_put(&queue, one);
>> +	if (one->generation < min_nonstale_gen)
>> +		min_nonstale_gen = one->generation;
>>   
>>   	for (i = 0; i < n; i++) {
>>   		twos[i]->object.flags |= PARENT2;
>>   		prio_queue_put(&queue, twos[i]);
>> +		if (twos[i]->generation < min_nonstale_gen)
>> +			min_nonstale_gen = twos[i]->generation;
>>   	}
>>   
>> -	while (queue_has_nonstale(&queue)) {
>> +	while (queue_has_nonstale(&queue, min_nonstale_gen)) {
>>   		struct commit *commit = prio_queue_get(&queue);
>>   		struct commit_list *parents;
>>   		int flags;
>>   
>> +		if (commit->generation > last_gen)
>> +			BUG("bad generation skip");
>> +
>> +		last_gen = commit->generation;
>> +
>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>   		if (flags == (PARENT1 | PARENT2)) {
>>   			if (!(commit->object.flags & RESULT)) {
>> @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>>   				return NULL;
>>   			p->object.flags |= flags;
> Hmph.  Can a commit that used to be not stale (and contributed to
> the current value of min_nonstale_gen) become stale here by getting
> visited twice, invalidating the value in min_nonstale_gen?

min_nonstale_gen can be "wrong" in the way you say, but fits the 
definition from the commit message:

"To properly take advantage of this condition, track the minimum 
generation number of a commit that **enters the queue** with nonstale 
status." (Emphasis added)

You make an excellent point about how this can be problematic. I was 
confused by the lack of clear performance benefits here, but I think 
that whatever benefits making queue_has_nonstale() be O(1) were removed 
by walking more commits than necessary.

Consider the following commit graph, where M is a parent of both A and 
B, S is a parent of M and B, and there is a large set of commits 
reachable from M with generation number larger than gen(S).

A    B
| __/|
|/   |
M    |
|\   |
. |  |
. |  |
. |_/
|/
S

Between A and B, the true merge base is M. Anything reachable from M is 
marked as stale. When S is added to the queue, it is only reachable from 
B, so it is non-stale. However, it is marked stale after M is walked. 
The old code would detect this as a termination condition, but the new 
code would not.

I think this data shape is actually common (not exactly, as it may be 
that some ancestor of M provides a second path to S) especially in the 
world of pull requests and users merging master into their topic branches.

I'll remove this commit in the next version, but use the new prototype 
for queue_has_nonstale() in "commit: add short-circuit to 
paint_down_to_common()" using the given 'min_generation' instead of 
'min_nonstale_gen'.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11 13:02       ` Derrick Stolee
@ 2018-04-11 18:49         ` Stefan Beller
  2018-04-11 19:26         ` Eric Sunshine
  1 sibling, 0 replies; 103+ messages in thread
From: Stefan Beller @ 2018-04-11 18:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab,
	larsxschneider, bmwill

On Wed, Apr 11, 2018 at 6:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/10/2018 10:51 PM, Junio C Hamano wrote:
>>
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> +               if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
>>> +                       if ((*list)->generation > GENERATION_NUMBER_MAX)
>>> +                               die("generation number %u is too large to
>>> store in commit-graph",
>>> +                                   (*list)->generation);
>>> +                       packedDate[0] |= htonl((*list)->generation << 2);
>>> +               }
>>
>>
>> How serious do we want this feature to be?  On one extreme, we could
>> be irresponsible and say it will be a problem for our descendants in
>> the future if their repositories have more than billion pearls on a
>> single strand, and the above certainly is a reasonable way to punt.
>> Those who actually encounter the problem will notice by Git dying
>> somewhere rather deep in the callchain.
>>
>> Or we could say Git actually does support a history that is
>> arbitrarily long, even though such a deep portion of history will
>> not benefit from having generation numbers in commit-graph.
>>
>> I've been assuming that our stance is the latter and that is why I
>> made noises about overflowing 30-bit generation field in my review
>> of the previous step.
>>
>> In case we want to do the "we know this is very large, but we do not
>> know the exact value", we may actually want a mode where we can
>> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
>> make sure that the code to handle overflow behaves sensibly.
>
>
> I agree. I wonder how we can effectively expose this value into a test. It's
> probably not sufficient to manually test using compiler flags ("-D
> GENERATION_NUMBER_MAX=8").

Would using an environment variable for this testing purpose be a good idea?

If we allow a user to pass in an arbitrary maximum, then we'd have to care about
generation numbers that are stored in the commit graph file larger than that
user specific maximum, though.

Looking through the output of "git grep getenv" we only have two instances
with _DEBUG, both in transport.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-11 13:02       ` Derrick Stolee
  2018-04-11 18:49         ` Stefan Beller
@ 2018-04-11 19:26         ` Eric Sunshine
  1 sibling, 0 replies; 103+ messages in thread
From: Eric Sunshine @ 2018-04-11 19:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab, sbeller,
	larsxschneider, bmwill

On Wed, Apr 11, 2018 at 9:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/10/2018 10:51 PM, Junio C Hamano wrote:
>> In case we want to do the "we know this is very large, but we do not
>> know the exact value", we may actually want a mode where we can
>> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
>> make sure that the code to handle overflow behaves sensibly.
>
> I agree. I wonder how we can effectively expose this value into a test. It's
> probably not sufficient to manually test using compiler flags ("-D
> GENERATION_NUMBER_MAX=8").

A few similar cases of tests needing to tweak some behavior do so by
environment variable. See, for instance, GIT_GETTEXT_POISON and
GIT_FSMONITOR_TEST.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-08  1:06   ` Derrick Stolee
@ 2018-04-11 19:32     ` Jakub Narebski
  2018-04-11 19:58       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-11 19:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Derrick Stolee <stolee@gmail.com> writes:

> On 4/7/2018 12:55 PM, Jakub Narebski wrote:
>> Currently I am at the stage of reproducing results in FELINE paper:
>> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
>> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
>> J. Zaki (2014).  This paper is available in the PDF form at
>> https://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The Jupyter Notebook (which runs on Google cloud, but can be also run
>> locally) uses Python kernel, NetworkX librabry for graph manipulation,
>> and matplotlib (via NetworkX) for display.
>>
>> Available at:
>> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>>
>> I hope that could be of help, or at least interesting
>
> Let me know when you can give numbers (either raw performance or # of
> commits walked) for real-world Git commit graphs. The Linux repo is a
> good example to use for benchmarking, but I also use the Kotlin repo
> sometimes as it has over a million objects and over 250K commits.

As I am curently converting git repository into commit graph, number of
objects doesn't matter.

Though Kotlin is nicely in largish size set, not as large as Linux
kernel which has 750K commits, but mich larger than git.git with 65K
commits.

> Of course, the only important statistic at the end of the day is the
> end-to-end time of a 'git ...' command. Your investigations should
> inform whether it is worth prototyping the feature in the git
> codebase.

What would you suggest as a good test that could imply performance?  The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 10000
randomly selected pairs of commits... but I don't think this is a good
benchmark.

I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
that has 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 10000 randomly selected pairs of commits:

  level-filter has    91 =  0.91% [all] false positives
  FELINE index has    78 =  0.78% [all] false positives
  FELINE index has 1.16667 less false positives than level filter

  min-post spanning-tree intervals has  3641 = 36.41% [all] false
  negatives

For git.git repository (https://github.com/git/git.git) that has 52950
nodes and 65887 edges the numbers are slighly more in FELINE index
favor (also out of 10000 random pairs):

  level-filter has   504 =  9.11% false positives
  FELINE index has   125 =  2.26% false positives
  FELINE index has 4.032 less false positives than level filter

This is for FELINE which does not use level / generatio-numbers filter.

Regards,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-11 19:32     ` Jakub Narebski
@ 2018-04-11 19:58       ` Derrick Stolee
  2018-04-14 16:52         ` Jakub Narebski
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-11 19:58 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

On 4/11/2018 3:32 PM, Jakub Narebski wrote:
> What would you suggest as a good test that could imply performance? The
> Google Colab notebook linked to above includes a function to count
> number of commits (nodes / vertices in the commit graph) walked,
> currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?
2. What is the set of merge-bases between X and Y?

And the thing to measure is a commit count. If possible, it would be 
good to count commits walked (commits whose parent list is enumerated) 
and commits inspected (commits that were listed as a parent of some 
walked commit). Walked commits require a commit parse -- albeit from the 
commit-graph instead of the ODB now -- while inspected commits only 
check the in-memory cache.

For git.git and Linux, I like to use the release tags as tests. They 
provide a realistic view of the linear history, and maintenance releases 
have their own history from the major releases.

> I have tried finding number of false positives for level (generation
> number) filter and for FELINE index, and number of false negatives for
> min-post intervals in the spanning tree (for DFS tree) for 10000
> randomly selected pairs of commits... but I don't think this is a good
> benchmark.

What is a false-positive? A case where gen(X) < gen(Y) but Y cannot 
reach X? I do not think that is a great benchmark, but I guess it is 
something to measure.

> I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
> that has 750832 nodes and 811733 edges, and 563747941392 possible
> directed pairs, we have for 10000 randomly selected pairs of commits:
>
>    level-filter has    91 =  0.91% [all] false positives
>    FELINE index has    78 =  0.78% [all] false positives
>    FELINE index has 1.16667 less false positives than level filter
>
>    min-post spanning-tree intervals has  3641 = 36.41% [all] false
>    negatives

Perhaps something you can do instead of sampling from N^2 commits in 
total is to select a pair of generations (say, G = 20000, G' = 20100) or 
regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150) and 
see how many false positives you see by testing all pairs (one from each 
level). The delta between the generations may need to be smaller to 
actually have a large proportion of unreachable pairs. Try different 
levels, since major version releases tend to "pinch" the commit graph to 
a common history.

> For git.git repository (https://github.com/git/git.git) that has 52950
> nodes and 65887 edges the numbers are slighly more in FELINE index
> favor (also out of 10000 random pairs):
>
>    level-filter has   504 =  9.11% false positives
>    FELINE index has   125 =  2.26% false positives
>    FELINE index has 4.032 less false positives than level filter
>
> This is for FELINE which does not use level / generatio-numbers filter.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-11 12:57       ` Derrick Stolee
@ 2018-04-11 23:28         ` Junio C Hamano
  0 siblings, 0 replies; 103+ messages in thread
From: Junio C Hamano @ 2018-04-11 23:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <stolee@gmail.com> writes:

> How about we do a slightly different
> arrangement for these overflow commits?
>
> Instead of storing the commits in the commit-graph file as "0" (which
> currently means "written by a version of git that did not compute
> generation numbers") we could let GENERATION_NUMBER_MAX be the maximum
> generation of a commit in the commit-graph, and if a commit would have
> larger generation, we collapse it down to that value.

Sure.  Any value we can tell that it is special is fine.  Thanks.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
@ 2018-04-12  9:12     ` Junio C Hamano
  2018-04-12 11:35       ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Junio C Hamano @ 2018-04-12  9:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, avarab\, sbeller\, larsxschneider\, bmwill\

Derrick Stolee <dstolee@microsoft.com> writes:

> +Here is a diagram to visualize the shape of the full commit graph, and
> +how different generation numbers relate:
> +
> +    +-----------------------------------------+
> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
> +    +-----------------------------------------+
> +	    |            |      ^
> +	    |            |      |
> +	    |            +------+
> +	    |         [gen(A) = gen(B)]
> +	    V
> +    +-------------------------------------+
> +    | 0 < commit->generation < 0x40000000 |
> +    +-------------------------------------+
> +	    |            |      ^
> +	    |            |      |
> +	    |            +------+
> +	    |        [gen(A) > gen(B)]
> +	    V
> +    +-------------------------------------+
> +    | GENERATION_NUMBER_ZERO = 0          |
> +    +-------------------------------------+
> +			 |      ^
> +			 |      |
> +			 +------+
> +		     [gen(A) = gen(B)]

It may be just me but all I can read out of the above is that
commit->generation may store 0xFFFFFFFF, a value between 0 and
0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
<cmp> gen(B)] is trying to say.  I am guessing "Two generation
numbers within the 'valid' range can be compared" is what the second
one is trying to say, but it is much less interesting to know that
two infinities compare equal than how generation numbers from
different classes compare, which cannot be depicted in the above
notation, I am afraid.  For example, don't we want to say that a
commit with INF can never be reached by a commit with a valid
generation number, or something like that?

>  Design Details
>  --------------
>  
> @@ -98,17 +141,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>  
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -
>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:

Good.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-12  9:12     ` Junio C Hamano
@ 2018-04-12 11:35       ` Derrick Stolee
  2018-04-13  9:53         ` Jakub Narebski
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-12 11:35 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill

On 4/12/2018 5:12 AM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> +Here is a diagram to visualize the shape of the full commit graph, and
>> +how different generation numbers relate:
>> +
>> +    +-----------------------------------------+
>> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
>> +    +-----------------------------------------+
>> +	    |            |      ^
>> +	    |            |      |
>> +	    |            +------+
>> +	    |         [gen(A) = gen(B)]
>> +	    V
>> +    +-------------------------------------+
>> +    | 0 < commit->generation < 0x40000000 |
>> +    +-------------------------------------+
>> +	    |            |      ^
>> +	    |            |      |
>> +	    |            +------+
>> +	    |        [gen(A) > gen(B)]
>> +	    V
>> +    +-------------------------------------+
>> +    | GENERATION_NUMBER_ZERO = 0          |
>> +    +-------------------------------------+
>> +			 |      ^
>> +			 |      |
>> +			 +------+
>> +		     [gen(A) = gen(B)]
> It may be just me but all I can read out of the above is that
> commit->generation may store 0xFFFFFFFF, a value between 0 and
> 0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
> <cmp> gen(B)] is trying to say.  I am guessing "Two generation
> numbers within the 'valid' range can be compared" is what the second
> one is trying to say, but it is much less interesting to know that
> two infinities compare equal than how generation numbers from
> different classes compare, which cannot be depicted in the above
> notation, I am afraid.  For example, don't we want to say that a
> commit with INF can never be reached by a commit with a valid
> generation number, or something like that?

My intention with the arrows was to demonstrate where parent 
relationships can go, and the generation-number relation between a 
commit A with parent B. Clearly, this diagram is less than helpful.

>
>>   Design Details
>>   --------------
>>   
>> @@ -98,17 +141,12 @@ Future Work
>>   - The 'commit-graph' subcommand does not have a "verify" mode that is
>>     necessary for integration with fsck.
>>   
>> -- The file format includes room for precomputed generation numbers. These
>> -  are not currently computed, so all generation numbers will be marked as
>> -  0 (or "uncomputed"). A later patch will include this calculation.
>> -
>>   - After computing and storing generation numbers, we must make graph
>>     walks aware of generation numbers to gain the performance benefits they
>>     enable. This will mostly be accomplished by swapping a commit-date-ordered
>>     priority queue with one ordered by generation number. The following
>> -  operations are important candidates:
>> +  operation is an important candidate:
> Good.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 07/10] commit-graph.txt: update future work
  2018-04-12 11:35       ` Derrick Stolee
@ 2018-04-13  9:53         ` Jakub Narebski
  0 siblings, 0 replies; 103+ messages in thread
From: Jakub Narebski @ 2018-04-13  9:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Stefan Beller,
	Lars Schneider, Brandon Williams

Derrick Stolee <stolee@gmail.com> writes:

> On 4/12/2018 5:12 AM, Junio C Hamano wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> +Here is a diagram to visualize the shape of the full commit graph, and
>>> +how different generation numbers relate:
>>> +
>>> +    +-----------------------------------------+
>>> +    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
>>> +    +-----------------------------------------+
>>> +	    |            |      ^
>>> +	    |            |      |
>>> +	    |            +------+
>>> +	    |         [gen(A) = gen(B)]
>>> +	    V
>>> +    +-------------------------------------+
>>> +    | 0 < commit->generation < 0x40000000 |
>>> +    +-------------------------------------+
>>> +	    |            |      ^
>>> +	    |            |      |
>>> +	    |            +------+
>>> +	    |        [gen(A) > gen(B)]
>>> +	    V
>>> +    +-------------------------------------+
>>> +    | GENERATION_NUMBER_ZERO = 0          |
>>> +    +-------------------------------------+
>>> +			 |      ^
>>> +			 |      |
>>> +			 +------+
>>> +		     [gen(A) = gen(B)]
>>
>> It may be just me but all I can read out of the above is that

It's not just you.

>> commit->generation may store 0xFFFFFFFF, a value between 0 and
>> 0x40000000, or 0.  I cannot quite tell what the notation [gen(A)
>> <cmp> gen(B)] is trying to say.  I am guessing "Two generation
>> numbers within the 'valid' range can be compared" is what the second
>> one is trying to say, but it is much less interesting to know that
>> two infinities compare equal than how generation numbers from
>> different classes compare, which cannot be depicted in the above
>> notation, I am afraid.  For example, don't we want to say that a
>> commit with INF can never be reached by a commit with a valid
>> generation number, or something like that?
>
> My intention with the arrows was to demonstrate where parent
> relationships can go, and the generation-number relation between a
> commit A with parent B. Clearly, this diagram is less than helpful.

Perhaps the following table would make the information clearer (perhaps
in addition to the above graph, but without "gen(A) {cmp} gen(B)"
arrows).

I assume that it is possible to have both GENERATION_NUMBER_ZERO and non
zero generation numbers in one repo, perhaps via alternates.  I also
assume that A != B, and that generation numbers (both set, and 0s) are
transitivelu closed under reachability.

gen(A) \   commit B ->   |                     gen(B)
        \-----\          |
commit A       \         | 0xFFFFFFFF | larger   | smaller | 0x00000000
----------------\--------+------------+----------+---------+------------
0xFFFFFFFF               | =            >          >         >
0 < larger  < 0x40000000 | < N          = n        >         >
0 < smaller < 0x40000000 | < N          < N        = n       >
0x00000000               | < N          < N        < N       =

The "<", "=", ">" denotes result of comparison between gen(A) and gen(B).

Generation numbers create a negative-cut filter: "N" and "n" denote
situation where we know from gen(A) and gen(B) that B is not reachable
from A.

As can be seen if we use gen(A) < gen(B) as cutoff, we don't need to
treat "infinity" and "zero" in a special way.


Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-11 19:58       ` Derrick Stolee
@ 2018-04-14 16:52         ` Jakub Narebski
  2018-04-21 20:44           ` Jakub Narebski
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-14 16:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Derrick Stolee <stolee@gmail.com> writes:
> On 4/11/2018 3:32 PM, Jakub Narebski wrote:

>> What would you suggest as a good test that could imply performance? The
>> Google Colab notebook linked to above includes a function to count
>> number of commits (nodes / vertices in the commit graph) walked,
>> currently in the worst case scenario.
>
> The two main questions to consider are:
>
> 1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

> 2. What is the set of merge-bases between X and Y?

I don't have an algorithm for that in the Google Colaboratory notebook.
Though I see that there exist algorithms for calculating lowest common
ancestors in DAGs...

I'll have to take a look how Git does that.

>
> And the thing to measure is a commit count. If possible, it would be
> good to count commits walked (commits whose parent list is enumerated)
> and commits inspected (commits that were listed as a parent of some
> walked commit). Walked commits require a commit parse -- albeit from
> the commit-graph instead of the ODB now -- while inspected commits
> only check the in-memory cache.

I don't quite see the distinction.  Whether we access generation number
of a commit (information about level of vertex in graph), or a parent
list (vertex successors / neighbours), it both needs accessing
commit-graph; well, accessing parents may be more costly for octopus
merges (due to having to go through EDGE chunk).

I can easily return the set of visited commits (vertices), or just size
of said set.

>
> For git.git and Linux, I like to use the release tags as tests. They
> provide a realistic view of the linear history, and maintenance
> releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

>> I have tried finding number of false positives for level (generation
>> number) filter and for FELINE index, and number of false negatives for
>> min-post intervals in the spanning tree (for DFS tree) for 10000
>> randomly selected pairs of commits... but I don't think this is a good
>> benchmark.
>
> What is a false-positive? A case where gen(X) < gen(Y) but Y cannot
> reach X?

Yes.  (And equivalent for FELINE index, which is a pair of integers).

> I do not think that is a great benchmark, but I guess it is
> something to measure.

I have simply used it to have something to compare.

>> I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
>> that has 750832 nodes and 811733 edges, and 563747941392 possible
>> directed pairs, we have for 10000 randomly selected pairs of commits:
>>
>>    level-filter has    91 =  0.91% [all] false positives
>>    FELINE index has    78 =  0.78% [all] false positives
>>    FELINE index has 1.16667 less false positives than level filter
>>
>>    min-post spanning-tree intervals has  3641 = 36.41% [all] false
>>    negatives
>
> Perhaps something you can do instead of sampling from N^2 commits in
> total is to select a pair of generations (say, G = 20000, G' = 20100)
> or regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150)
> and see how many false positives you see by testing all pairs (one
> from each level). The delta between the generations may need to be
> smaller to actually have a large proportion of unreachable pairs. Try
> different levels, since major version releases tend to "pinch" the
> commit graph to a common history.

That's a good idea.

>> For git.git repository (https://github.com/git/git.git) that has 52950
>> nodes and 65887 edges the numbers are slighly more in FELINE index
>> favor (also out of 10000 random pairs):
>>
>>    level-filter has   504 =  9.11% false positives
>>    FELINE index has   125 =  2.26% false positives
>>    FELINE index has 4.032 less false positives than level filter
>>
>> This is for FELINE which does not use level / generatio-numbers filter.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (9 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00   ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
                       ` (9 more replies)
  10 siblings, 10 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Thanks for all the help on v2. Here are a few changes between versions:

* Removed the constant-time check in queue_has_nonstale() due to the
  possibility of a performance hit and no evidence of a performance
  benefit in typical cases.

* Reordered the commits about loading commits from the commit-graph.
  This way it is easier to demonstrate the incorrect checks. On my
  machine, every commit compiles and the test suite passes, but patches
  6-8 have the bug that is fixed in patch 9 "merge: check config before
  loading commits".

* The interaction with parse_commit_in_graph() from parse_object() is
  replaced with a new 'check_graph' parameter in parse_commit_buffer().
  This allows us to fill in the graph_pos and generation values for
  commits that are parsed directly from a buffer. This keeps the existing
  behavior that a commit parsed this way should match its buffer.

* There was discussion about making GENERATION_NUMBER_MAX assignable by
  an environment variable so we could add tests that exercise the behavior
  of capping a generation at that value. Perhaps the code around this is
  simple enough that we do not need to add that complexity.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains), providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph.txt: update design document
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit-graph: always load commit-graph information
  merge: check config before loading commits

 Documentation/technical/commit-graph.txt | 30 +++++--
 alloc.c                                  |  1 +
 builtin/merge.c                          |  5 +-
 commit-graph.c                           | 99 +++++++++++++++++++-----
 commit-graph.h                           |  8 ++
 commit.c                                 | 54 +++++++++++--
 commit.h                                 |  7 +-
 object.c                                 |  2 +-
 ref-filter.c                             | 23 +++++-
 sha1_file.c                              |  2 +-
 t/t5318-commit-graph.sh                  |  9 +++
 11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 1/9] commit: add generation number to struct commmit
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_MAX 0x3FFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_INFINITY;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
 	struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 	 */
 	struct tree *maybe_tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 2/9] commit-graph: compute generation numbers
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..688d5b1801 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+
+			if (current->generation > GENERATION_NUMBER_MAX)
+				current->generation = GENERATION_NUMBER_MAX;
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
  2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 14:31       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
                       ` (6 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 20 +++++++++++++++++++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
+{
+	const struct commit *a = a_, *b = b_;
+
+	/* newer commits first */
+	if (a->generation < b->generation)
+		return 1;
+	else if (a->generation > b->generation)
+		return -1;
+
+	/* use date as a heuristic when generataions are equal */
+	if (a->date < b->date)
+		return 1;
+	else if (a->date > b->date)
+		return -1;
+	return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
 {
-	struct prio_queue queue = { compare_commits_by_commit_date };
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 4/9] commit-graph.txt: update design document
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (2 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 19:47       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
                       ` (5 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+    If A and B are commits with generation numbers N amd M, respectively,
+    and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --------------
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-    - paint_down_to_common()
     - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (3 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 21:02       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 ref-filter.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
-					  struct contains_cache *cache)
+					  struct contains_cache *cache,
+					  uint32_t cutoff)
 {
 	enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
 
 	/* Otherwise, we don't know; prepare to recurse */
 	parse_commit_or_die(candidate);
+
+	if (candidate->generation < cutoff)
+		return CONTAINS_NO;
+
 	return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
 
+	result = contains_test(candidate, want, cache, cutoff);
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
@@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		 * If we just popped the stack, parents->item has been marked,
 		 * therefore contains_test will return a meaningful yes/no.
 		 */
-		else switch (contains_test(parents->item, want, cache)) {
+		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
 			*contains_cache_at(cache, commit) = CONTAINS_YES;
 			contains_stack.nr--;
@@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		}
 	}
 	free(contains_stack.contains_stack);
-	return contains_test(candidate, want, cache);
+	return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (4 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 22:15       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a44899c733..bceb79c419 100644
--- a/commit.c
+++ b/commit.c
@@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return 0;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (5 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-18 23:19       ` Jakub Narebski
  2018-04-19  8:32       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
                       ` (2 subsequent siblings)
  9 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index bceb79c419..a70f120878 100644
--- a/commit.c
+++ b/commit.c
@@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+		last_gen = commit->generation;
+
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return 0;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (6 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-17 17:50       ` Derrick Stolee
  2018-04-19  0:02       ` Jakub Narebski
  2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
  9 siblings, 2 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
 commit-graph.h |  8 ++++++++
 commit.c       |  7 +++++--
 commit.h       |  2 +-
 object.c       |  2 +-
 sha1_file.c    |  2 +-
 6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 	return &commit_list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	uint32_t edge_value;
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
 
 	item->object.parsed = 1;
 	item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
+{
+	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		*pos = item->graph_pos;
+		return 1;
+	} else {
+		return bsearch_graph(commit_graph, &(item->object.oid), pos);
+	}
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+	uint32_t pos;
+
+	if (item->object.parsed)
+		return 0;
 	if (!core_commit_graph)
 		return 0;
-	if (item->object.parsed)
-		return 1;
-
 	prepare_commit_graph();
-	if (commit_graph) {
-		uint32_t pos;
-		int found;
-		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-			pos = item->graph_pos;
-			found = 1;
-		} else {
-			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
-		}
-
-		if (found)
-			return fill_commit_in_graph(item, commit_graph, pos);
-	}
-
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		return fill_commit_in_graph(item, commit_graph, pos);
 	return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+	uint32_t pos;
+	if (!core_commit_graph)
+		return;
+	prepare_commit_graph();
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
 {
 	struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
+	if (check_graph)
+		load_commit_graph_info(item);
+
 	return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return error("Object %s not a commit",
 			     oid_to_hex(&item->object.oid));
 	}
-	ret = parse_commit_buffer(item, buffer, size);
+	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
 		set_commit_buffer(item, buffer, size);
 		return 0;
diff --git a/commit.h b/commit.h
index 64436ff44e..b5afde1ae9 100644
--- a/commit.h
+++ b/commit.h
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
  */
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
diff --git a/object.c b/object.c
index e6ad3f61f0..efe4871325 100644
--- a/object.c
+++ b/object.c
@@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (parse_commit_buffer(commit, buffer, size, 1))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
diff --git a/sha1_file.c b/sha1_file.c
index 1b94f39c4c..0fd4f0b8b6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
 {
 	struct commit c;
 	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(&c, buf, size))
+	if (parse_commit_buffer(&c, buf, size, 0))
 		die("corrupt commit");
 }
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v3 9/9] merge: check config before loading commits
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (7 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-17 17:00     ` Derrick Stolee
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
  9 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw)
  To: git
  Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy, Derrick Stolee

Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/merge.c         | 5 +++--
 t/t5318-commit-graph.sh | 9 +++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..7e1da6c6ea 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
 	branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL);
 	if (branch)
 		skip_prefix(branch, "refs/heads/", &branch);
+	init_diff_ui_defaults();
+	git_config(git_merge_config, NULL);
+
 	if (!branch || is_null_oid(&head_oid))
 		head_commit = NULL;
 	else
 		head_commit = lookup_commit_or_die(&head_oid, "HEAD");
 
-	init_diff_ui_defaults();
-	git_config(git_merge_config, NULL);
 
 	if (branch_mergeoptions)
 		parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git checkout -b merge-5-to-8 commits/5 &&
+	git merge commits/8 &&
+	git show-ref -s merge-5-to-8 >output &&
+	git show-ref -s commits/8 >expect &&
+	test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-17 17:50       ` Derrick Stolee
  2018-04-19  0:02       ` Jakub Narebski
  1 sibling, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-17 17:50 UTC (permalink / raw)
  To: Derrick Stolee, git
  Cc: peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine,
	jonathantanmy

On 4/17/2018 1:00 PM, Derrick Stolee wrote:
> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
>
> With generation numbers in the commit-graph, we need to ensure that any
> commit that exists in the commit-graph file has its generation number
> loaded.
>
> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file. Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter. This avoids duplicate work when we already
> checked the graph in parse_commit_gently() or when simply checking the
> buffer contents in check_commit().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>   commit-graph.h |  8 ++++++++
>   commit.c       |  7 +++++--
>   commit.h       |  2 +-
>   object.c       |  2 +-
>   sha1_file.c    |  2 +-
>   6 files changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 688d5b1801..21e853c21a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>   	return &commit_list_insert(c, pptr)->next;
>   }
>   
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +}
> +
>   static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>   {
>   	uint32_t edge_value;
>   	uint32_t *parent_data_ptr;
>   	uint64_t date_low, date_high;
>   	struct commit_list **pptr;
> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
>   
>   	item->object.parsed = 1;
>   	item->graph_pos = pos;
> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>   	return 1;
>   }
>   
> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
> +{
> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		*pos = item->graph_pos;
> +		return 1;
> +	} else {
> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);

The reference to 'commit_graph' in the above line should be 'g'. Sorry!

> +	}
> +}
> +
>   int parse_commit_in_graph(struct commit *item)
>   {
> +	uint32_t pos;
> +
> +	if (item->object.parsed)
> +		return 0;
>   	if (!core_commit_graph)
>   		return 0;
> -	if (item->object.parsed)
> -		return 1;
> -
>   	prepare_commit_graph();
> -	if (commit_graph) {
> -		uint32_t pos;
> -		int found;
> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> -			pos = item->graph_pos;
> -			found = 1;
> -		} else {
> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
> -		}
> -
> -		if (found)
> -			return fill_commit_in_graph(item, commit_graph, pos);
> -	}
> -
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		return fill_commit_in_graph(item, commit_graph, pos);
>   	return 0;
>   }
>   
> +void load_commit_graph_info(struct commit *item)
> +{
> +	uint32_t pos;
> +	if (!core_commit_graph)
> +		return;
> +	prepare_commit_graph();
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		fill_commit_graph_info(item, commit_graph, pos);
> +}
> +
>   static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>   {
>   	struct object_id oid;
> diff --git a/commit-graph.h b/commit-graph.h
> index 260a468e73..96cccb10f3 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>    */
>   int parse_commit_in_graph(struct commit *item);
>   
> +/*
> + * It is possible that we loaded commit contents from the commit buffer,
> + * but we also want to ensure the commit-graph content is correctly
> + * checked and filled. Fill the graph_pos and generation members of
> + * the given commit.
> + */
> +void load_commit_graph_info(struct commit *item);
> +
>   struct tree *get_commit_tree_in_graph(const struct commit *c);
>   
>   struct commit_graph {
> diff --git a/commit.c b/commit.c
> index a70f120878..9ef6f699bd 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>   	return ret;
>   }
>   
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>   {
>   	const char *tail = buffer;
>   	const char *bufptr = buffer;
> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>   	}
>   	item->date = parse_commit_date(bufptr, tail);
>   
> +	if (check_graph)
> +		load_commit_graph_info(item);
> +
>   	return 0;
>   }
>   
> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>   		return error("Object %s not a commit",
>   			     oid_to_hex(&item->object.oid));
>   	}
> -	ret = parse_commit_buffer(item, buffer, size);
> +	ret = parse_commit_buffer(item, buffer, size, 0);
>   	if (save_commit_buffer && !ret) {
>   		set_commit_buffer(item, buffer, size);
>   		return 0;
> diff --git a/commit.h b/commit.h
> index 64436ff44e..b5afde1ae9 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>    */
>   struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>   
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>   int parse_commit_gently(struct commit *item, int quiet_on_missing);
>   static inline int parse_commit(struct commit *item)
>   {
> diff --git a/object.c b/object.c
> index e6ad3f61f0..efe4871325 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>   	} else if (type == OBJ_COMMIT) {
>   		struct commit *commit = lookup_commit(oid);
>   		if (commit) {
> -			if (parse_commit_buffer(commit, buffer, size))
> +			if (parse_commit_buffer(commit, buffer, size, 1))
>   				return NULL;
>   			if (!get_cached_commit_buffer(commit, NULL)) {
>   				set_commit_buffer(commit, buffer, size);
> diff --git a/sha1_file.c b/sha1_file.c
> index 1b94f39c4c..0fd4f0b8b6 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>   {
>   	struct commit c;
>   	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(&c, buf, size))
> +	if (parse_commit_buffer(&c, buf, size, 0))
>   		die("corrupt commit");
>   }
>   


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
@ 2018-04-18 14:31       ` Jakub Narebski
  2018-04-18 14:46         ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-18 14:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> Define compare_commits_by_gen_then_commit_date(), which uses generation
> numbers as a primary comparison and commit date to break ties (or as a
> comparison when both commits do not have computed generation numbers).
>
> Since the commit-graph file is closed under reachability, we know that
> all commits in the file have generation at most GENERATION_NUMBER_MAX
> which is less than GENERATION_NUMBER_INFINITY.
>
> This change does not affect the number of commits that are walked during
> the execution of paint_down_to_common(), only the order that those
> commits are inspected. In the case that commit dates violate topological
> order (i.e. a parent is "newer" than a child), the previous code could
> walk a commit twice: if a commit is reached with the PARENT1 bit, but
> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
> propagated to its parents. Using generation numbers avoids this extra
> effort, even if it is somewhat rare.

Does it mean that it gives no measureable performance improvements for
typical test cases?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 20 +++++++++++++++++++-
>  commit.h |  1 +
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 711f674c18..a44899c733 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
>  	return 0;
>  }
>  
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> +{
> +	const struct commit *a = a_, *b = b_;
> +
> +	/* newer commits first */
> +	if (a->generation < b->generation)
> +		return 1;
> +	else if (a->generation > b->generation)
> +		return -1;
> +
> +	/* use date as a heuristic when generataions are equal */

Very minor typo in above comment:

s/generataions/generations/

> +	if (a->date < b->date)
> +		return 1;
> +	else if (a->date > b->date)
> +		return -1;
> +	return 0;
> +}
> +
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>  {
>  	const struct commit *a = a_, *b = b_;
> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  /* all input commits in one and twos[] must have been parsed! */
>  static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>  {
> -	struct prio_queue queue = { compare_commits_by_commit_date };
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
>  
> diff --git a/commit.h b/commit.h
> index aac3b8c56f..64436ff44e 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>  extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>  
>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>  
>  LAST_ARG_MUST_BE_NULL
>  extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
  2018-04-18 14:31       ` Jakub Narebski
@ 2018-04-18 14:46         ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-18 14:46 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 10:31 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Define compare_commits_by_gen_then_commit_date(), which uses generation
>> numbers as a primary comparison and commit date to break ties (or as a
>> comparison when both commits do not have computed generation numbers).
>>
>> Since the commit-graph file is closed under reachability, we know that
>> all commits in the file have generation at most GENERATION_NUMBER_MAX
>> which is less than GENERATION_NUMBER_INFINITY.
>>
>> This change does not affect the number of commits that are walked during
>> the execution of paint_down_to_common(), only the order that those
>> commits are inspected. In the case that commit dates violate topological
>> order (i.e. a parent is "newer" than a child), the previous code could
>> walk a commit twice: if a commit is reached with the PARENT1 bit, but
>> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
>> propagated to its parents. Using generation numbers avoids this extra
>> effort, even if it is somewhat rare.
> Does it mean that it gives no measureable performance improvements for
> typical test cases?

Not in this commit. When we add the `min_generation` parameter in a 
later commit, we do get a significant performance boost (when we can 
supply a non-zero value to `min_generation`).

This step of using generation numbers for the priority is important for 
that commit, but on its own has limited value outside of the clock-skew 
case mentioned above.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit.c | 20 +++++++++++++++++++-
>>   commit.h |  1 +
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/commit.c b/commit.c
>> index 711f674c18..a44899c733 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
>>   	return 0;
>>   }
>>   
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
>> +{
>> +	const struct commit *a = a_, *b = b_;
>> +
>> +	/* newer commits first */
>> +	if (a->generation < b->generation)
>> +		return 1;
>> +	else if (a->generation > b->generation)
>> +		return -1;
>> +
>> +	/* use date as a heuristic when generataions are equal */
> Very minor typo in above comment:
>
> s/generataions/generations/

Good catch!

>
>> +	if (a->date < b->date)
>> +		return 1;
>> +	else if (a->date > b->date)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>>   int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
>>   {
>>   	const struct commit *a = a_, *b = b_;
>> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
>>   /* all input commits in one and twos[] must have been parsed! */
>>   static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
>>   {
>> -	struct prio_queue queue = { compare_commits_by_commit_date };
>> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>>   	struct commit_list *result = NULL;
>>   	int i;
>>   
>> diff --git a/commit.h b/commit.h
>> index aac3b8c56f..64436ff44e 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
>>   extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>>   
>>   int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>>   
>>   LAST_ARG_MUST_BE_NULL
>>   extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 4/9] commit-graph.txt: update design document
  2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
@ 2018-04-18 19:47       ` Jakub Narebski
  0 siblings, 0 replies; 103+ messages in thread
From: Jakub Narebski @ 2018-04-18 19:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().

All right.

>
> Expand the section on generation numbers to discuss how the three
> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
> _MAX interact with other generation numbers.

Very good.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 0550c6d0dc..d9f2713efa 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +in the commit-graph file. If a commit-graph file was written by a version
> +of Git that did not compute generation numbers, then those commits will
> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.

I have to wonder if there would be any relesed Git that do not compute
generation numbers...

On the other hand in case the user-visible view of the project history
changes, be it because shallow clone is shortened or deepened, or grafts
file is edited, or a commit object is replaced with another with
different parents - we can still use "commit-graph" data, just pretend
that generation numbers (which are invalid in altered history) are all
zero.  (I'll write about this idea in comments to later series.)

On the other hand with GENERATION_NUMBER_ZERO these series of patches
are self-contained and bisectable.

> +
> +Since the commit-graph file is closed under reachability, we can guarantee
> +the following weaker condition on all commits:

I have had to look up the contents of the whole file, but it turns out
that it is all right: "weaker condition" refers to earlier "N <= M".

Minor sidenote: if one would be extremly pedantic, one could say that
previous condition is incorrect, because it doesn't state explicitely
that commit A != commit B. ;-)

> +
> +    If A and B are commits with generation numbers N amd M, respectively,
> +    and N < M, then A cannot reach B.
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits, but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.
> +
> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

I wonder if something like the table I have proposed in v2 version of
this patch [1] would make it easier or harder to understand.

[1]: https://public-inbox.org/git/86a7u7mnzi.fsf@gmail.com/

Something like the following:

             |                      gen(B)
             |
gen(A)       | _INFINITY | _MAX     | larger   | smaller  | _ZERO
-------------+-----------+----------+----------+----------+--------
_INFINITY    | =         | >        | >        | >        | >
_MAX         | < N       | =        | >        | >        | >
larger       | < N       | < N      | = n      | >        | >
smaller      | < N       | < N      | < N      | = n      | >
_ZERO        | < N       | < N      | < N      | < N      | =

Here "n" and "N" denotes stronger condition, and "N" denotes weaker
condition.  We have _INFINITY > _MAX > larger > smaller > _ZERO.

> +
>  Design Details
>  --------------
>
> @@ -98,17 +121,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -
>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:
>
> -    - paint_down_to_common()
>      - 'log --topo-order'
>
>  - Currently, parse_commit_gently() requires filling in the root tree

Looks good.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
@ 2018-04-18 21:02       ` Jakub Narebski
  2018-04-23 14:22         ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-18 21:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Here I can offer only the cursory examination, as I don't know this area
of code in question.

Derrick Stolee <dstolee@microsoft.com> writes:

> A commit A can reach a commit B only if the generation number of A
> is larger than the generation number of B. This condition allows
> significantly short-circuiting commit-graph walks.
>
> Use generation number for 'git tag --contains' queries.
>
> On a copy of the Linux repository where HEAD is containd in v4.13
> but no earlier tag, the command 'git tag --contains HEAD' had the
> following peformance improvement:
>
> Before: 0.81s
> After:  0.04s
> Rel %:  -95%

A question: what is the performance after if the "commit-graph" feature
is disabled, or there is no commit-graph file?  Is there performance
regression in this case, or is the difference negligible?

>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  ref-filter.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index cffd8bf3ce..e2fea6d635 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  /*
>   * Test whether the candidate or one of its parents is contained in the list.
                                 ^^^^^^^^^^^^^^^^^^^^^

Sidenote: when examining the code after the change, I have noticed that
the above part of commit header for the comtains_test() function is no
longer entirely correct, as the function only checks the candidate
commit, and in no place it access its parents.

But that is not your problem.

>   * Do not recurse to find out, though, but return -1 if inconclusive.
>   */
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
> -					  struct contains_cache *cache)
> +					  struct contains_cache *cache,
> +					  uint32_t cutoff)
>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>  
> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
>  
>  	/* Otherwise, we don't know; prepare to recurse */
>  	parse_commit_or_die(candidate);
> +
> +	if (candidate->generation < cutoff)
> +		return CONTAINS_NO;
> +

Looks good to me.

The only [minor] question may be whether to define separate type for
generation numbers, and whether to future proof the tests - though the
latter would be almost certainly overengineering, and the former
probablt too.

>  	return CONTAINS_UNKNOWN;
>  }
>  
> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		parse_commit_or_die(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}

Sholdn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise here after the change contains_tag_algo() now parses
each commit in 'want', which I think was not done previously.

With commit-graph file parsing is [probably] cheap.  Without it, not
necessary.

But I might be worrying about nothing.

>  
> +	result = contains_test(candidate, want, cache, cutoff);

Other than the question about possible performace regression if
commit-graph data is not available, it looks good to me.

>  	if (result != CONTAINS_UNKNOWN)
>  		return result;
>  
> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		 * If we just popped the stack, parents->item has been marked,
>  		 * therefore contains_test will return a meaningful yes/no.
>  		 */
> -		else switch (contains_test(parents->item, want, cache)) {
> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>  		case CONTAINS_YES:
>  			*contains_cache_at(cache, commit) = CONTAINS_YES;
>  			contains_stack.nr--;
> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		}
>  	}
>  	free(contains_stack.contains_stack);
> -	return contains_test(candidate, want, cache);
> +	return contains_test(candidate, want, cache, cutoff);

Simple change. It looks good to me.

>  }
>  
>  static int commit_contains(struct ref_filter *filter, struct commit *commit,

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
@ 2018-04-18 22:15       ` Jakub Narebski
  2018-04-23 14:31         ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-18 22:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the target commits.

You have "target" twice in above paragraph; one of those should probably
be something else.

>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:
>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%

Nice...

>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

...especially for so small changes.

>
> diff --git a/commit.c b/commit.c
> index a44899c733..bceb79c419 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>  
>  	if (parse_commit(commit))
>  		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>  		if (parse_commit(reference[i]))
>  			return ret;
> +		if (min_generation > reference[i]->generation)
> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return 0;

Why not use "return ret;" instead of "return 0;", like the rest of the
code [cryptically] does, that is:

  +	if (commit->generation > min_generation)
  +		return ret;

>  
>  	bases = paint_down_to_common(commit, nr_reference, reference);
>  	if (commit->object.flags & PARENT2)

Otherwise, it looks good to me.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
@ 2018-04-18 23:19       ` Jakub Narebski
  2018-04-23 14:40         ` Derrick Stolee
  2018-04-19  8:32       ` Jakub Narebski
  1 sibling, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-18 23:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> When running 'git branch --contains', the in_merge_bases_many()
> method calls paint_down_to_common() to discover if a specific
> commit is reachable from a set of branches. Commits with lower
> generation number are not needed to correctly answer the
> containment query of in_merge_bases_many().

Right. This description is not entirely clear to me, but I don't have a
better proposal. Good enough, I guess.

>
> Add a new parameter, min_generation, to paint_down_to_common() that
> prevents walking commits with generation number strictly less than
> min_generation. If 0 is given, then there is no functional change.

Is it new parameter really needed, i.e. do you really need to change the
signature of this function?  See below for details.

>
> For in_merge_bases_many(), we can pass commit->generation as the
> cutoff,...

This is the only callsite that uses min_generation with non-zero value,
and it uses commit->generation to fill it... while commit itself is one
of exiting parameters.

> [...], and this saves time during 'git branch --contains' queries
> that would otherwise walk "around" the commit we are inspecting.

If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/

Do I have that right?  If so, it looks all right to me.

>
> For a copy of the Linux repository, where HEAD is checked out at
> v4.13~100, we get the following performance improvement for
> 'git branch --contains' over the previous commit:
>
> Before: 0.21s
> After:  0.13s
> Rel %: -38%

Nice.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/commit.c b/commit.c
> index bceb79c419..a70f120878 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  }
>  
>  /* all input commits in one and twos[] must have been parsed! */
> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
> +static struct commit_list *paint_down_to_common(struct commit *one, int n,
> +						struct commit **twos,
> +						int min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;

Do we really need to change the signature of paint_down_to_common(), or
would it be enough to create a local variable min_generation set
initially to one->generation.

 +      uint32_t min_generation = one->generation;
 +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;

>  
>  	one->object.flags |= PARENT1;
>  	if (!n) {
> @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		struct commit_list *parents;
>  		int flags;
>  
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +		last_gen = commit->generation;
> +
> +		if (commit->generation < min_generation)
> +			break;
> +

I think, after looking at the whole post-image code, that it is all
right.

>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>  
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);
>  
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);
>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)
> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  	if (commit->generation > min_generation)
>  		return 0;
>  
> -	bases = paint_down_to_common(commit, nr_reference, reference);
> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.

>  	if (commit->object.flags & PARENT2)
>  		ret = 1;
>  	clear_commit_marks(commit, all_flags);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
  2018-04-17 17:50       ` Derrick Stolee
@ 2018-04-19  0:02       ` Jakub Narebski
  2018-04-23 14:49         ` Derrick Stolee
  1 sibling, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-19  0:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
>
> With generation numbers in the commit-graph, we need to ensure that any
> commit that exists in the commit-graph file has its generation number
> loaded.

All right, that is nice explanation of the why behind this change.

>
> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file. Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter. This avoids duplicate work when we already
> checked the graph in parse_commit_gently() or when simply checking the
> buffer contents in check_commit().

Couldn't this 'check_graph' parameter be a global variable similar to
the 'commit_graph' variable?  Maybe I am not understanding it.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>  commit-graph.h |  8 ++++++++
>  commit.c       |  7 +++++--
>  commit.h       |  2 +-
>  object.c       |  2 +-
>  sha1_file.c    |  2 +-
>  6 files changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 688d5b1801..21e853c21a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>  	return &commit_list_insert(c, pptr)->next;
>  }
>  
> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +}
> +
>  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>  {
>  	uint32_t edge_value;
>  	uint32_t *parent_data_ptr;
>  	uint64_t date_low, date_high;
>  	struct commit_list **pptr;
> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;

I'm probably wrong, but isn't it unrelated change?

>  
>  	item->object.parsed = 1;
>  	item->graph_pos = pos;
> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	return 1;
>  }
>  
> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
> +{
> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		*pos = item->graph_pos;
> +		return 1;
> +	} else {
> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);
> +	}
> +}

All right (after the fix).

> +
>  int parse_commit_in_graph(struct commit *item)
>  {
> +	uint32_t pos;
> +
> +	if (item->object.parsed)
> +		return 0;
>  	if (!core_commit_graph)
>  		return 0;
> -	if (item->object.parsed)
> -		return 1;

Hmmm... previously the function returned 1 if item->object.parsed, now
it returns 0 for this situation.  I don't understand this change.

> -
>  	prepare_commit_graph();
> -	if (commit_graph) {
> -		uint32_t pos;
> -		int found;
> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> -			pos = item->graph_pos;
> -			found = 1;
> -		} else {
> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
> -		}
> -
> -		if (found)
> -			return fill_commit_in_graph(item, commit_graph, pos);
> -	}
> -
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		return fill_commit_in_graph(item, commit_graph, pos);

Nice refactoring.

>  	return 0;
>  }
>  
> +void load_commit_graph_info(struct commit *item)
> +{
> +	uint32_t pos;
> +	if (!core_commit_graph)
> +		return;
> +	prepare_commit_graph();
> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> +		fill_commit_graph_info(item, commit_graph, pos);
> +}

And the reason for the refactoring.

> +
>  static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>  {
>  	struct object_id oid;
> diff --git a/commit-graph.h b/commit-graph.h
> index 260a468e73..96cccb10f3 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>   */
>  int parse_commit_in_graph(struct commit *item);
>  
> +/*
> + * It is possible that we loaded commit contents from the commit buffer,
> + * but we also want to ensure the commit-graph content is correctly
> + * checked and filled. Fill the graph_pos and generation members of
> + * the given commit.
> + */
> +void load_commit_graph_info(struct commit *item);
> +
>  struct tree *get_commit_tree_in_graph(const struct commit *c);
>  
>  struct commit_graph {
> diff --git a/commit.c b/commit.c
> index a70f120878..9ef6f699bd 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>  	return ret;
>  }
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>  {
>  	const char *tail = buffer;
>  	const char *bufptr = buffer;
> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>  	}
>  	item->date = parse_commit_date(bufptr, tail);
>  
> +	if (check_graph)
> +		load_commit_graph_info(item);
> +
>  	return 0;
>  }
>  
> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return error("Object %s not a commit",
>  			     oid_to_hex(&item->object.oid));
>  	}
> -	ret = parse_commit_buffer(item, buffer, size);
> +	ret = parse_commit_buffer(item, buffer, size, 0);
>  	if (save_commit_buffer && !ret) {
>  		set_commit_buffer(item, buffer, size);
>  		return 0;
> diff --git a/commit.h b/commit.h
> index 64436ff44e..b5afde1ae9 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>   */
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>  
> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>  int parse_commit_gently(struct commit *item, int quiet_on_missing);
>  static inline int parse_commit(struct commit *item)
>  {
> diff --git a/object.c b/object.c
> index e6ad3f61f0..efe4871325 100644
> --- a/object.c
> +++ b/object.c
> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>  	} else if (type == OBJ_COMMIT) {
>  		struct commit *commit = lookup_commit(oid);
>  		if (commit) {
> -			if (parse_commit_buffer(commit, buffer, size))
> +			if (parse_commit_buffer(commit, buffer, size, 1))
>  				return NULL;
>  			if (!get_cached_commit_buffer(commit, NULL)) {
>  				set_commit_buffer(commit, buffer, size);
> diff --git a/sha1_file.c b/sha1_file.c
> index 1b94f39c4c..0fd4f0b8b6 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>  {
>  	struct commit c;
>  	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(&c, buf, size))
> +	if (parse_commit_buffer(&c, buf, size, 0))
>  		die("corrupt commit");
>  }

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
                       ` (8 preceding siblings ...)
  2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
@ 2018-04-19  0:04     ` Jakub Narebski
  2018-04-23 14:54       ` Derrick Stolee
  9 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-19  0:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> -- >8 --
>
> This is the one of several "small" patches that follow the serialized
> Git commit graph patch (ds/commit-graph) and lazy-loading trees
> (ds/lazy-load-trees).
>
> As described in Documentation/technical/commit-graph.txt, the generation
> number of a commit is one more than the maximum generation number among
> its parents (trivially, a commit with no parents has generation number
> one). This section is expanded to describe the interaction with special
> generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
> file) and *_ZERO (commits in a commit-graph file written before generation
> numbers were implemented).
>
> This series makes the computation of generation numbers part of the
> commit-graph write process.
>
> Finally, generation numbers are used to order commits in the priority
> queue in paint_down_to_common(). This allows a short-circuit mechanism
> to improve performance of `git branch --contains`.
>
> Further, use generation numbers for 'git tag --contains', providing a
> significant speedup (at least 95% for some cases).
>
> A more substantial refactoring of revision.c is required before making
> 'git log --graph' use generation numbers effectively.
>
> This patch series is build on ds/lazy-load-trees.
>
> Derrick Stolee (9):
>   commit: add generation number to struct commmit

Nice and short patch. Looks good to me.

>   commit-graph: compute generation numbers

Another quite easy to understand patch. LGTM.

>   commit: use generations in paint_down_to_common()

Nice and short patch; minor typo in comment in code.
Otherwise it looks good to me.

>   commit-graph.txt: update design document

I see that diagram got removed in this version; maybe it could be
replaced with relationship table?

Anyway, it looks good to me.

>   ref-filter: use generation number for --contains

A question: how performance looks like after the change if commit-graph
is not available?

>   commit: use generation numbers for in_merge_bases()

Possible typo in the commit message, and stylistic inconsistence in
in_merge_bases() - though actually more clear than existing code.

Short, simple, and gives good performance improvenebts.

>   commit: add short-circuit to paint_down_to_common()

Looks good to me; ignore [mostly] what I have written in response to the
patch in question.

>   commit-graph: always load commit-graph information

Looks all right; question: parameter or one more global variable.

>   merge: check config before loading commits

This looks good to me.

>
>  Documentation/technical/commit-graph.txt | 30 +++++--
>  alloc.c                                  |  1 +
>  builtin/merge.c                          |  5 +-
>  commit-graph.c                           | 99 +++++++++++++++++++-----
>  commit-graph.h                           |  8 ++
>  commit.c                                 | 54 +++++++++++--
>  commit.h                                 |  7 +-
>  object.c                                 |  2 +-
>  ref-filter.c                             | 23 +++++-
>  sha1_file.c                              |  2 +-
>  t/t5318-commit-graph.sh                  |  9 +++
>  11 files changed, 199 insertions(+), 41 deletions(-)
>
>
> base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
  2018-04-18 23:19       ` Jakub Narebski
@ 2018-04-19  8:32       ` Jakub Narebski
  1 sibling, 0 replies; 103+ messages in thread
From: Jakub Narebski @ 2018-04-19  8:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git\, peff\, stolee\, avarab\, sbeller\, larsxschneider\, bmwill\,
	gitster\, sunshine\, jonathantanmy\

Derrick Stolee <dstolee@microsoft.com> writes:

> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>  
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);
>  
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);
>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)

Wouldn't it be better and more readable to create a symbolic name for
this 0, for example:

  -	list = paint_down_to_common(one, n, twos);
  +	list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF);

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-14 16:52         ` Jakub Narebski
@ 2018-04-21 20:44           ` Jakub Narebski
  2018-04-23 13:54             ` Derrick Stolee
  0 siblings, 1 reply; 103+ messages in thread
From: Jakub Narebski @ 2018-04-21 20:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

Jakub Narebski <jnareb@gmail.com> writes:
> Derrick Stolee <stolee@gmail.com> writes:
>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>
>>> What would you suggest as a good test that could imply performance? The
>>> Google Colab notebook linked to above includes a function to count
>>> number of commits (nodes / vertices in the commit graph) walked,
>>> currently in the worst case scenario.
>>
>> The two main questions to consider are:
>>
>> 1. Can X reach Y?
>
> That is easy to do.  The function generic_is_reachable() does
> that... though using direct translation of the pseudocode for
> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
> doesn't check if vertex was already visited was not good idea for large
> graphs such as Linux kernel commit graph, oops.  That is why
> generic_is_reachable_large() was created.
[...]

>> And the thing to measure is a commit count. If possible, it would be
>> good to count commits walked (commits whose parent list is enumerated)
>> and commits inspected (commits that were listed as a parent of some
>> walked commit). Walked commits require a commit parse -- albeit from
>> the commit-graph instead of the ODB now -- while inspected commits
>> only check the in-memory cache.
[...]
>>
>> For git.git and Linux, I like to use the release tags as tests. They
>> provide a realistic view of the linear history, and maintenance
>> releases have their own history from the major releases.
>
> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
> FELINE index does not bring any improvements over using just level
> (generation number) filter.  But that may be caused by narrowing od
> commit DAG around releases.
>
> I try do do the same between commits in wide part, with many commits
> with the same level (same generation number) both for source and for
> target commit.  Though this may be unfair to level filter, though...
>
>
> Note however that FELINE index is not unabiguous, like generation
> numbers are (modulo decision whether to start at 0 or at 1); it depends
> on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
 * src = 468165c1d = v2.17.0
 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

 * 468165c1d has level 18418 which it shares with 6 commits
 * 66d2e04ec has level 14776 which it shares with 93 commits
 * gen(468165c1d) - gen(66d2e04ec) = 3642

 algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
 -----------+---------+--------+----------+---------+----------+-----------+
 naive      | 48865   | 39599  | 244      | 9200    |          |           |
 level      |  3086   |  2492  | 113      |  528    | 285      |           |
 FELINE     |   283   |   216  |  68      |    0    |          |  25       |
 lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
 -----------+---------+--------+----------+---------+----------+-----------+
 lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
  note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
  were used for negative-cut; note that those are not disjoint; node can
  be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post intervals, bloom filter for changed files, etc.) be
moved into separate thread?
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 0/6] Compute and consume generation numbers
  2018-04-21 20:44           ` Jakub Narebski
@ 2018-04-23 13:54             ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 13:54 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Lars Schneider, Jeff King

On 4/21/2018 4:44 PM, Jakub Narebski wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>> Derrick Stolee <stolee@gmail.com> writes:
>>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>>>> What would you suggest as a good test that could imply performance? The
>>>> Google Colab notebook linked to above includes a function to count
>>>> number of commits (nodes / vertices in the commit graph) walked,
>>>> currently in the worst case scenario.
>>> The two main questions to consider are:
>>>
>>> 1. Can X reach Y?
>> That is easy to do.  The function generic_is_reachable() does
>> that... though using direct translation of the pseudocode for
>> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
>> doesn't check if vertex was already visited was not good idea for large
>> graphs such as Linux kernel commit graph, oops.  That is why
>> generic_is_reachable_large() was created.
> [...]
>
>>> And the thing to measure is a commit count. If possible, it would be
>>> good to count commits walked (commits whose parent list is enumerated)
>>> and commits inspected (commits that were listed as a parent of some
>>> walked commit). Walked commits require a commit parse -- albeit from
>>> the commit-graph instead of the ODB now -- while inspected commits
>>> only check the in-memory cache.
> [...]
>>> For git.git and Linux, I like to use the release tags as tests. They
>>> provide a realistic view of the linear history, and maintenance
>>> releases have their own history from the major releases.
>> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
>> FELINE index does not bring any improvements over using just level
>> (generation number) filter.  But that may be caused by narrowing od
>> commit DAG around releases.
>>
>> I try do do the same between commits in wide part, with many commits
>> with the same level (same generation number) both for source and for
>> target commit.  Though this may be unfair to level filter, though...
>>
>>
>> Note however that FELINE index is not unabiguous, like generation
>> numbers are (modulo decision whether to start at 0 or at 1); it depends
>> on the topological ordering chosen for the X elements.
> One can now test reachability on git.git repository; there is a form
> where one can plug source and destination revisions at
> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1
>
> I have tried the case that is quite unfair to the generation numbers
> filter, namely the check between one of recent tags, and one commit that
> shares generation number among largest number of other commits.
>
> Here level = generation number-1 (as it starts at 0 for root commit, not
> 1).
>
> The results are:
>   * src = 468165c1d = v2.17.0
>   * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec
>
>   * 468165c1d has level 18418 which it shares with 6 commits
>   * 66d2e04ec has level 14776 which it shares with 93 commits
>   * gen(468165c1d) - gen(66d2e04ec) = 3642
>
>   algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
>   -----------+---------+--------+----------+---------+----------+-----------+
>   naive      | 48865   | 39599  | 244      | 9200    |          |           |
>   level      |  3086   |  2492  | 113      |  528    | 285      |           |
>   FELINE     |   283   |   216  |  68      |    0    |          |  25       |
>   lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
>   -----------+---------+--------+----------+---------+----------+-----------+
>   lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |
>
> Here we have:
> * 'naive' implementation means simple DFS walk, without any filters (cut-offs)
> * 'level' means using levels / generation numbers based negative-cut filter
> * 'FELINE' means using FELINE index based negative-cut filter
> * 'lev+FELINE' means combining generation numbers filter with FELINE filter
> * 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
>    note that the code does not walk the path after cut, but it is easy to do
>
> The stats have the following meaning:
> * 'access' means accessing the node
> * 'walk' is actual walking the node
> * 'maxdepth' is maximum depth of the stack used for DFS
> * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
>    were used for negative-cut; note that those are not disjoint; node can
>    be rejected by both level filter and FELINE filter
>
> For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
> the results are the same, with 5 commits accessed and 6 walked compared
> to 61574 accessed in naive algorithm.
>
> The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
> (different child-less commits) and 9 roots, and has average clustering
> coefficient 0.000409217.

Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.

> P.S. Would it be better to move the discussion about possible extensions
> to the commit-graph in the form of new chunks (topological order, FELINE
> index, min-post intervals, bloom filter for changed files, etc.) be
> moved into separate thread?

Yes. I think we've exhausted this thought experiment and future 
discussion should revolve around actual implementations in Git with 
end-to-end performance times. The computation time for computing the 
FELINE index should be included in that discussion.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
  2018-04-18 21:02       ` Jakub Narebski
@ 2018-04-23 14:22         ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:22 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 5:02 PM, Jakub Narebski wrote:
> Here I can offer only the cursory examination, as I don't know this area
> of code in question.
>
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> A commit A can reach a commit B only if the generation number of A
>> is larger than the generation number of B. This condition allows
>> significantly short-circuiting commit-graph walks.
>>
>> Use generation number for 'git tag --contains' queries.
>>
>> On a copy of the Linux repository where HEAD is containd in v4.13
>> but no earlier tag, the command 'git tag --contains HEAD' had the
>> following peformance improvement:
>>
>> Before: 0.81s
>> After:  0.04s
>> Rel %:  -95%
> A question: what is the performance after if the "commit-graph" feature
> is disabled, or there is no commit-graph file?  Is there performance
> regression in this case, or is the difference negligible?

Negligible, since we are adding a small number of integer comparisons 
and the main cost is in commit parsing. More on commit parsing in 
response to your comments below.

>
>> Helped-by: Jeff King <peff@peff.net>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   ref-filter.c | 23 +++++++++++++++++++----
>>   1 file changed, 19 insertions(+), 4 deletions(-)
>>
>> diff --git a/ref-filter.c b/ref-filter.c
>> index cffd8bf3ce..e2fea6d635 100644
>> --- a/ref-filter.c
>> +++ b/ref-filter.c
>> @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>>   /*
>>    * Test whether the candidate or one of its parents is contained in the list.
>                                   ^^^^^^^^^^^^^^^^^^^^^
>
> Sidenote: when examining the code after the change, I have noticed that
> the above part of commit header for the comtains_test() function is no
> longer entirely correct, as the function only checks the candidate
> commit, and in no place it access its parents.
>
> But that is not your problem.

I'll add a commit in the next version that fixes this comment before I 
make any changes to the method.

>
>>    * Do not recurse to find out, though, but return -1 if inconclusive.
>>    */
>>   static enum contains_result contains_test(struct commit *candidate,
>>   					  const struct commit_list *want,
>> -					  struct contains_cache *cache)
>> +					  struct contains_cache *cache,
>> +					  uint32_t cutoff)
>>   {
>>   	enum contains_result *cached = contains_cache_at(cache, candidate);
>>   
>> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
>>   
>>   	/* Otherwise, we don't know; prepare to recurse */
>>   	parse_commit_or_die(candidate);
>> +
>> +	if (candidate->generation < cutoff)
>> +		return CONTAINS_NO;
>> +
> Looks good to me.
>
> The only [minor] question may be whether to define separate type for
> generation numbers, and whether to future proof the tests - though the
> latter would be almost certainly overengineering, and the former
> probablt too.

If we have multiple notions of generation, then we can refactor all 
references to the "generation" member.

>
>>   	return CONTAINS_UNKNOWN;
>>   }
>>   
>> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   					      struct contains_cache *cache)
>>   {
>>   	struct contains_stack contains_stack = { 0, 0, NULL };
>> -	enum contains_result result = contains_test(candidate, want, cache);
>> +	enum contains_result result;
>> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
>> +	const struct commit_list *p;
>> +
>> +	for (p = want; p; p = p->next) {
>> +		struct commit *c = p->item;
>> +		parse_commit_or_die(c);
>> +		if (c->generation < cutoff)
>> +			cutoff = c->generation;
>> +	}
> Sholdn't the above be made conditional on the ability to get generation
> numbers from the commit-graph file (feature is turned on and file
> exists)?  Otherwise here after the change contains_tag_algo() now parses
> each commit in 'want', which I think was not done previously.
>
> With commit-graph file parsing is [probably] cheap.  Without it, not
> necessary.
>
> But I might be worrying about nothing.

Not nothing. This parses the "wants" when we previously did not parse 
the wants. Further: this parsing happens before we do the simple check 
of comparing the OID of the candidate against the wants.

The question is: are these parsed commits significant compared to the 
walk that will parse many more commits? It is certainly possible.

One way to fix this is to call 'prepare_commit_graph()' directly and 
then test that 'commit_graph' is non-null before performing any parses. 
I'm not thrilled with how that couples the commit-graph implementation 
to this feature, but that may be necessary to avoid regressions in the 
non-commit-graph case.

>
>>   
>> +	result = contains_test(candidate, want, cache, cutoff);
> Other than the question about possible performace regression if
> commit-graph data is not available, it looks good to me.
>
>>   	if (result != CONTAINS_UNKNOWN)
>>   		return result;
>>   
>> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   		 * If we just popped the stack, parents->item has been marked,
>>   		 * therefore contains_test will return a meaningful yes/no.
>>   		 */
>> -		else switch (contains_test(parents->item, want, cache)) {
>> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>>   		case CONTAINS_YES:
>>   			*contains_cache_at(cache, commit) = CONTAINS_YES;
>>   			contains_stack.nr--;
>> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>>   		}
>>   	}
>>   	free(contains_stack.contains_stack);
>> -	return contains_test(candidate, want, cache);
>> +	return contains_test(candidate, want, cache, cutoff);
> Simple change. It looks good to me.
>
>>   }
>>   
>>   static int commit_contains(struct ref_filter *filter, struct commit *commit,


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
  2018-04-18 22:15       ` Jakub Narebski
@ 2018-04-23 14:31         ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:31 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 6:15 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The containment algorithm for 'git branch --contains' is different
>> from that for 'git tag --contains' in that it uses is_descendant_of()
>> instead of contains_tag_algo(). The expensive portion of the branch
>> algorithm is computing merge bases.
>>
>> When a commit-graph file exists with generation numbers computed,
>> we can avoid this merge-base calculation when the target commit has
>> a larger generation number than the target commits.
> You have "target" twice in above paragraph; one of those should probably
> be something else.

Thanks. Second "target" should be "initial".

> [...]
>> +
>> +	if (commit->generation > min_generation)
>> +		return 0;
> Why not use "return ret;" instead of "return 0;", like the rest of the
> code [cryptically] does, that is:
>
>    +	if (commit->generation > min_generation)
>    +		return ret;

Sure.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-18 23:19       ` Jakub Narebski
@ 2018-04-23 14:40         ` Derrick Stolee
  2018-04-23 21:38           ` Jakub Narebski
  0 siblings, 1 reply; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:40 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 7:19 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
[...]
>> [...], and this saves time during 'git branch --contains' queries
>> that would otherwise walk "around" the commit we are inspecting.
> If I understand the code properly, what happens is that we can now
> short-circuit if all commits that are left are lower than the target
> commit.
>
> This is because max-order priority queue is used: if the commit with
> maximum generation number is below generation number of target commit,
> then target commit is not reachable from any commit in the priority
> queue (all of which has generation number less or equal than the commit
> at head of queue, i.e. all are same level or deeper); compare what I
> have written in [1]
>
> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/
>
> Do I have that right?  If so, it looks all right to me.

Yes, the priority queue needs to compare via generation number first or 
there will be errors. This is why we could not use commit time before.

>
>> For a copy of the Linux repository, where HEAD is checked out at
>> v4.13~100, we get the following performance improvement for
>> 'git branch --contains' over the previous commit:
>>
>> Before: 0.21s
>> After:  0.13s
>> Rel %: -38%
> [...]
>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>   		if (flags == (PARENT1 | PARENT2)) {
>>   			if (!(commit->object.flags & RESULT)) {
>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>   			return NULL;
>>   	}
>>   
>> -	list = paint_down_to_common(one, n, twos);
>> +	list = paint_down_to_common(one, n, twos, 0);
>>   
>>   	while (list) {
>>   		struct commit *commit = pop_commit(&list);
>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>   			filled_index[filled] = j;
>>   			work[filled++] = array[j];
>>   		}
>> -		common = paint_down_to_common(array[i], filled, work);
>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>   		if (array[i]->object.flags & PARENT2)
>>   			redundant[i] = 1;
>>   		for (j = 0; j < filled; j++)
>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>   	if (commit->generation > min_generation)
>>   		return 0;
>>   
>> -	bases = paint_down_to_common(commit, nr_reference, reference);
>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
> Is it the only case where we would call paint_down_to_common() with
> non-zero last parameter?  Would we always use commit->generation where
> commit is the first parameter of paint_down_to_common()?
>
> If both are true and will remain true, then in my humble opinion it is
> not necessary to change the signature of this function.

We need to change the signature some way, but maybe the way I chose is 
not the best.

To elaborate: paint_down_to_common() is used for multiple purposes. The 
caller here that supplies 'commit->generation' is used only to compute 
reachability (by testing if the flag PARENT2 exists on the commit, then 
clears all flags). The other callers expect the full walk down to the 
common commits, and keeps those PARENT1, PARENT2, and STALE flags for 
future use (such as reporting merge bases). Usually the call to 
paint_down_to_common() is followed by a revision walk that only halts 
when reaching root commits or commits with both PARENT1 and PARENT2 
flags on, so always short-circuiting on generations would break the 
functionality; this is confirmed by the t5318-commit-graph.sh.

An alternative to the signature change is to add a boolean parameter 
"use_cutoff" or something, that specifies "don't walk beyond the 
commit". This may give a more of a clear description of what it will do 
with the generation value, but since we are already performing 
generation comparisons before calling paint_down_to_common() I find this 
simple enough.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
  2018-04-19  0:02       ` Jakub Narebski
@ 2018-04-23 14:49         ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:49 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 8:02 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Most code paths load commits using lookup_commit() and then
>> parse_commit(). In some cases, including some branch lookups, the commit
>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>> favor of parse_commit_buffer().
>>
>> With generation numbers in the commit-graph, we need to ensure that any
>> commit that exists in the commit-graph file has its generation number
>> loaded.
> All right, that is nice explanation of the why behind this change.
>
>> Create new load_commit_graph_info() method to fill in the information
>> for a commit that exists only in the commit-graph file. Call it from
>> parse_commit_buffer() after loading the other commit information from
>> the given buffer. Only fill this information when specified by the
>> 'check_graph' parameter. This avoids duplicate work when we already
>> checked the graph in parse_commit_gently() or when simply checking the
>> buffer contents in check_commit().
> Couldn't this 'check_graph' parameter be a global variable similar to
> the 'commit_graph' variable?  Maybe I am not understanding it.

See the two callers at the bottom of the patch. They have different 
purposes: one needs to fill in a valid commit struct, the other needs to 
check the commit buffer is valid (then throws away the struct). They 
have different values for 'check_graph'. Also, in parse_commit_gently() 
we check parse_commit_in_graph() before we call parse_commit_buffer, so 
we do not want to repeat work; in the case of a valid commit-graph file, 
but the commit is not in the commit-graph, we would repeat our binary 
search for the same commit.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------
>>   commit-graph.h |  8 ++++++++
>>   commit.c       |  7 +++++--
>>   commit.h       |  2 +-
>>   object.c       |  2 +-
>>   sha1_file.c    |  2 +-
>>   6 files changed, 49 insertions(+), 23 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 688d5b1801..21e853c21a 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>>   	return &commit_list_insert(c, pptr)->next;
>>   }
>>   
>> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
>> +{
>> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
>> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>> +}
>> +
>>   static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>>   {
>>   	uint32_t edge_value;
>>   	uint32_t *parent_data_ptr;
>>   	uint64_t date_low, date_high;
>>   	struct commit_list **pptr;
>> -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
>> +	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
> I'm probably wrong, but isn't it unrelated change?

You're right. I saw this while I was in here, and there was a similar 
comment on this change in a different patch. Probably best to keep these 
cleanup things in a separate commit.

>>   	item->object.parsed = 1;
>>   	item->graph_pos = pos;
>> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>>   	return 1;
>>   }
>>   
>> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
>> +{
>> +	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> +		*pos = item->graph_pos;
>> +		return 1;
>> +	} else {
>> +		return bsearch_graph(commit_graph, &(item->object.oid), pos);
>> +	}
>> +}
> All right (after the fix).
>
>> +
>>   int parse_commit_in_graph(struct commit *item)
>>   {
>> +	uint32_t pos;
>> +
>> +	if (item->object.parsed)
>> +		return 0;
>>   	if (!core_commit_graph)
>>   		return 0;
>> -	if (item->object.parsed)
>> -		return 1;
> Hmmm... previously the function returned 1 if item->object.parsed, now
> it returns 0 for this situation.  I don't understand this change.

The good news is that this change is unimportant (the only caller is 
parse_commit_gently() which checks item->object.parsed before calling 
parse_commit_in_graph()). I wonder why I reordered those things, anyway. 
I'll revert to simplify the patch.

>
>> -
>>   	prepare_commit_graph();
>> -	if (commit_graph) {
>> -		uint32_t pos;
>> -		int found;
>> -		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
>> -			pos = item->graph_pos;
>> -			found = 1;
>> -		} else {
>> -			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
>> -		}
>> -
>> -		if (found)
>> -			return fill_commit_in_graph(item, commit_graph, pos);
>> -	}
>> -
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		return fill_commit_in_graph(item, commit_graph, pos);
> Nice refactoring.
>
>>   	return 0;
>>   }
>>   
>> +void load_commit_graph_info(struct commit *item)
>> +{
>> +	uint32_t pos;
>> +	if (!core_commit_graph)
>> +		return;
>> +	prepare_commit_graph();
>> +	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> +		fill_commit_graph_info(item, commit_graph, pos);
>> +}
> And the reason for the refactoring.
>
>> +
>>   static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>>   {
>>   	struct object_id oid;
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 260a468e73..96cccb10f3 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
>>    */
>>   int parse_commit_in_graph(struct commit *item);
>>   
>> +/*
>> + * It is possible that we loaded commit contents from the commit buffer,
>> + * but we also want to ensure the commit-graph content is correctly
>> + * checked and filled. Fill the graph_pos and generation members of
>> + * the given commit.
>> + */
>> +void load_commit_graph_info(struct commit *item);
>> +
>>   struct tree *get_commit_tree_in_graph(const struct commit *c);
>>   
>>   struct commit_graph {
>> diff --git a/commit.c b/commit.c
>> index a70f120878..9ef6f699bd 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>>   	return ret;
>>   }
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
>>   {
>>   	const char *tail = buffer;
>>   	const char *bufptr = buffer;
>> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>>   	}
>>   	item->date = parse_commit_date(bufptr, tail);
>>   
>> +	if (check_graph)
>> +		load_commit_graph_info(item);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>>   		return error("Object %s not a commit",
>>   			     oid_to_hex(&item->object.oid));
>>   	}
>> -	ret = parse_commit_buffer(item, buffer, size);
>> +	ret = parse_commit_buffer(item, buffer, size, 0);
>>   	if (save_commit_buffer && !ret) {
>>   		set_commit_buffer(item, buffer, size);
>>   		return 0;
>> diff --git a/commit.h b/commit.h
>> index 64436ff44e..b5afde1ae9 100644
>> --- a/commit.h
>> +++ b/commit.h
>> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>>    */
>>   struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>>   
>> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
>> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
>>   int parse_commit_gently(struct commit *item, int quiet_on_missing);
>>   static inline int parse_commit(struct commit *item)
>>   {
>> diff --git a/object.c b/object.c
>> index e6ad3f61f0..efe4871325 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
>>   	} else if (type == OBJ_COMMIT) {
>>   		struct commit *commit = lookup_commit(oid);
>>   		if (commit) {
>> -			if (parse_commit_buffer(commit, buffer, size))
>> +			if (parse_commit_buffer(commit, buffer, size, 1))
>>   				return NULL;
>>   			if (!get_cached_commit_buffer(commit, NULL)) {
>>   				set_commit_buffer(commit, buffer, size);
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 1b94f39c4c..0fd4f0b8b6 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
>>   {
>>   	struct commit c;
>>   	memset(&c, 0, sizeof(c));
>> -	if (parse_commit_buffer(&c, buf, size))
>> +	if (parse_commit_buffer(&c, buf, size, 0))
>>   		die("corrupt commit");
>>   }


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 0/9] Compute and consume generation numbers
  2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
@ 2018-04-23 14:54       ` Derrick Stolee
  0 siblings, 0 replies; 103+ messages in thread
From: Derrick Stolee @ 2018-04-23 14:54 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster,
	sunshine, jonathantanmy

On 4/18/2018 8:04 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> -- >8 --
>>
>> This is the one of several "small" patches that follow the serialized
>> Git commit graph patch (ds/commit-graph) and lazy-loading trees
>> (ds/lazy-load-trees).
>>
>> As described in Documentation/technical/commit-graph.txt, the generation
>> number of a commit is one more than the maximum generation number among
>> its parents (trivially, a commit with no parents has generation number
>> one). This section is expanded to describe the interaction with special
>> generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
>> file) and *_ZERO (commits in a commit-graph file written before generation
>> numbers were implemented).
>>
>> This series makes the computation of generation numbers part of the
>> commit-graph write process.
>>
>> Finally, generation numbers are used to order commits in the priority
>> queue in paint_down_to_common(). This allows a short-circuit mechanism
>> to improve performance of `git branch --contains`.
>>
>> Further, use generation numbers for 'git tag --contains', providing a
>> significant speedup (at least 95% for some cases).
>>
>> A more substantial refactoring of revision.c is required before making
>> 'git log --graph' use generation numbers effectively.
>>
>> This patch series is build on ds/lazy-load-trees.
>>
>> Derrick Stolee (9):
>>    commit: add generation number to struct commmit
> Nice and short patch. Looks good to me.
>
>>    commit-graph: compute generation numbers
> Another quite easy to understand patch. LGTM.
>
>>    commit: use generations in paint_down_to_common()
> Nice and short patch; minor typo in comment in code.
> Otherwise it looks good to me.
>
>>    commit-graph.txt: update design document
> I see that diagram got removed in this version; maybe it could be
> replaced with relationship table?
>
> Anyway, it looks good to me.

The diagrams and tables seemed to cause more confusion than clarity. I 
think the reader should create their own mental model from the 
definitions and description and we should avoid trying to make a summary.

>
>>    ref-filter: use generation number for --contains
> A question: how performance looks like after the change if commit-graph
> is not available?

The performance issue is minor, but will be fixed in v4.

>
>>    commit: use generation numbers for in_merge_bases()
> Possible typo in the commit message, and stylistic inconsistence in
> in_merge_bases() - though actually more clear than existing code.
>
> Short, simple, and gives good performance improvenebts.
>
>>    commit: add short-circuit to paint_down_to_common()
> Looks good to me; ignore [mostly] what I have written in response to the
> patch in question.
>
>>    commit-graph: always load commit-graph information
> Looks all right; question: parameter or one more global variable.

I responded to say that the global variable approach is incorrect. 
Parameter is important to functionality and performance.

>
>>    merge: check config before loading commits
> This looks good to me.
>
>>   Documentation/technical/commit-graph.txt | 30 +++++--
>>   alloc.c                                  |  1 +
>>   builtin/merge.c                          |  5 +-
>>   commit-graph.c                           | 99 +++++++++++++++++++-----
>>   commit-graph.h                           |  8 ++
>>   commit.c                                 | 54 +++++++++++--
>>   commit.h                                 |  7 +-
>>   object.c                                 |  2 +-
>>   ref-filter.c                             | 23 +++++-
>>   sha1_file.c                              |  2 +-
>>   t/t5318-commit-graph.sh                  |  9 +++
>>   11 files changed, 199 insertions(+), 41 deletions(-)
>>
>>
>> base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
  2018-04-23 14:40         ` Derrick Stolee
@ 2018-04-23 21:38           ` Jakub Narebski
  0 siblings, 0 replies; 103+ messages in thread
From: Jakub Narebski @ 2018-04-23 21:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, peff\, avarab\, sbeller\, larsxschneider\,
	bmwill\, gitster\, sunshine\, jonathantanmy\

Derrick Stolee <stolee@gmail.com> writes:

> On 4/18/2018 7:19 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
> [...]
>>> [...], and this saves time during 'git branch --contains' queries
>>> that would otherwise walk "around" the commit we are inspecting.
>>>
>> If I understand the code properly, what happens is that we can now
>> short-circuit if all commits that are left are lower than the target
>> commit.
>>
>> This is because max-order priority queue is used: if the commit with
>> maximum generation number is below generation number of target commit,
>> then target commit is not reachable from any commit in the priority
>> queue (all of which has generation number less or equal than the commit
>> at head of queue, i.e. all are same level or deeper); compare what I
>> have written in [1]
>>
>> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/
>>
>> Do I have that right?  If so, it looks all right to me.
>
> Yes, the priority queue needs to compare via generation number first
> or there will be errors. This is why we could not use commit time
> before.

I was more concerned about getting right the order in the priority queue
(does it return minimal or maximal generation number).

I understand that the cutoff could not be used without generation
numbers because of the possibility of clock skew - using cutoff on dates
could lead to wrong results.

>>> For a copy of the Linux repository, where HEAD is checked out at
>>> v4.13~100, we get the following performance improvement for
>>> 'git branch --contains' over the previous commit:
>>>
>>> Before: 0.21s
>>> After:  0.13s
>>> Rel %: -38%
>> [...]
>>>   		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>>>   		if (flags == (PARENT1 | PARENT2)) {
>>>   			if (!(commit->object.flags & RESULT)) {
>>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>>>   			return NULL;
>>>   	}
>>>   -	list = paint_down_to_common(one, n, twos);
>>> +	list = paint_down_to_common(one, n, twos, 0);
>>>     	while (list) {
>>>   		struct commit *commit = pop_commit(&list);
>>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
>>>   			filled_index[filled] = j;
>>>   			work[filled++] = array[j];
>>>   		}
>>> -		common = paint_down_to_common(array[i], filled, work);
>>> +		common = paint_down_to_common(array[i], filled, work, 0);
>>>   		if (array[i]->object.flags & PARENT2)
>>>   			redundant[i] = 1;
>>>   		for (j = 0; j < filled; j++)
>>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>>>   	if (commit->generation > min_generation)
>>>   		return 0;
>>>   -	bases = paint_down_to_common(commit, nr_reference, reference);
>>> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>>
>> Is it the only case where we would call paint_down_to_common() with
>> non-zero last parameter?  Would we always use commit->generation where
>> commit is the first parameter of paint_down_to_common()?
>>
>> If both are true and will remain true, then in my humble opinion it is
>> not necessary to change the signature of this function.
>
> We need to change the signature some way, but maybe the way I chose is
> not the best.

No, after taking longer I think the new signature is a good choice.

> To elaborate: paint_down_to_common() is used for multiple
> purposes. The caller here that supplies 'commit->generation' is used
> only to compute reachability (by testing if the flag PARENT2 exists on
> the commit, then clears all flags). The other callers expect the full
> walk down to the common commits, and keeps those PARENT1, PARENT2, and
> STALE flags for future use (such as reporting merge bases). Usually
> the call to paint_down_to_common() is followed by a revision walk that
> only halts when reaching root commits or commits with both PARENT1 and
> PARENT2 flags on, so always short-circuiting on generations would
> break the functionality; this is confirmed by the
> t5318-commit-graph.sh.

Right.

I have realized that just after sending the email.  I'm sorry about this.

>
> An alternative to the signature change is to add a boolean parameter
> "use_cutoff" or something, that specifies "don't walk beyond the
> commit". This may give a more of a clear description of what it will
> do with the generation value, but since we are already performing
> generation comparisons before calling paint_down_to_common() I find
> this simple enough.

Two things:

1. The signature proposed in the patch is more generic.  The cutoff does
   not need to be equal to the generation number of the commit, though
   currently it always (all of one time the new mechanism is used) is.

   So now I think the new signature of paint_down_to_common() is all
   right as it is proposed here.

2. The way generation numbers are defined (with 0 being a special case,
   and generation numbers starting from 1 for parent-less commits), and
   the way they are compared (using strict comparison, to avoid having
   to special-case _ZERO, _MAX and _INFINITY generation numbers) the
   cutoff of 0 means no cutoff.

   On the other hand cutoff of 0 can be understood as meaning no cutoff
   as a special case.

   It could be made more clear to use (as I proposed elsewhere in this
   thread) symbolic name for this no-cutoff case via preprocessor
   constants or enums, e.g. GENERATION_NO_CUTOFF:

    @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
      			return NULL;
      	}
      -	list = paint_down_to_common(one, n, twos);
    +	list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF);
        	while (list) {
      		struct commit *commit = pop_commit(&list);
    @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
      			filled_index[filled] = j;
      			work[filled++] = array[j];
      		}
    -		common = paint_down_to_common(array[i], filled, work);
    +		common = paint_down_to_common(array[i], filled, work, GENERATION_NO_CUTOFF);
      		if (array[i]->object.flags & PARENT2)
      			redundant[i] = 1;
      		for (j = 0; j < filled; j++)


   But whether it makes code more readable, or less readable, is a
   matter of opinion and taste.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, back to index

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
2018-04-03 16:51 ` [PATCH 1/6] object.c: parse commit in graph first Derrick Stolee
2018-04-03 18:21   ` Jonathan Tan
2018-04-03 18:28     ` Jeff King
2018-04-03 18:32       ` Derrick Stolee
2018-04-03 16:51 ` [PATCH 2/6] commit: add generation number to struct commmit Derrick Stolee
2018-04-03 18:05   ` Brandon Williams
2018-04-03 18:28     ` Jeff King
2018-04-03 18:31       ` Derrick Stolee
2018-04-03 18:32       ` Brandon Williams
2018-04-03 18:44       ` Stefan Beller
2018-04-03 23:17       ` Ramsay Jones
2018-04-03 23:19         ` Jeff King
2018-04-03 18:24   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee
2018-04-03 18:30   ` Jonathan Tan
2018-04-03 18:49     ` Stefan Beller
2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-03 18:31   ` Stefan Beller
2018-04-03 18:31   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee
2018-04-03 19:01   ` Jonathan Tan
2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee
2018-04-03 19:04   ` Jonathan Tan
2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
2018-04-03 18:03 ` Brandon Williams
2018-04-03 18:29   ` Derrick Stolee
2018-04-03 18:47     ` Jeff King
2018-04-03 19:05       ` Jeff King
2018-04-04 15:45         ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee
2018-04-04 15:45           ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-04 15:48             ` Derrick Stolee
2018-04-04 17:01               ` Brandon Williams
2018-04-04 18:24               ` Jeff King
2018-04-04 18:53                 ` Derrick Stolee
2018-04-04 18:59                   ` Jeff King
2018-04-04 18:22           ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King
2018-04-04 19:06             ` Derrick Stolee
2018-04-04 19:16               ` Jeff King
2018-04-04 19:22                 ` Derrick Stolee
2018-04-04 19:42                   ` Jeff King
2018-04-04 19:45                     ` Derrick Stolee
2018-04-04 19:46                       ` Jeff King
2018-04-07 17:09     ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski
2018-04-07 16:55 ` Jakub Narebski
2018-04-08  1:06   ` Derrick Stolee
2018-04-11 19:32     ` Jakub Narebski
2018-04-11 19:58       ` Derrick Stolee
2018-04-14 16:52         ` Jakub Narebski
2018-04-21 20:44           ` Jakub Narebski
2018-04-23 13:54             ` Derrick Stolee
2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
2018-04-09 16:41   ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee
2018-04-11  2:12     ` Junio C Hamano
2018-04-11 12:49       ` Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
2018-04-09 17:59     ` Stefan Beller
2018-04-11  2:31     ` Junio C Hamano
2018-04-11 12:57       ` Derrick Stolee
2018-04-11 23:28         ` Junio C Hamano
2018-04-09 16:42   ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee
2018-04-11  2:51     ` Junio C Hamano
2018-04-11 13:02       ` Derrick Stolee
2018-04-11 18:49         ` Stefan Beller
2018-04-11 19:26         ` Eric Sunshine
2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee
2018-04-11  3:02     ` Junio C Hamano
2018-04-11 13:24       ` Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee
2018-04-12  9:12     ` Junio C Hamano
2018-04-12 11:35       ` Derrick Stolee
2018-04-13  9:53         ` Jakub Narebski
2018-04-09 16:42   ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-09 16:42   ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-04-17 17:00   ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee
2018-04-18 14:31       ` Jakub Narebski
2018-04-18 14:46         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee
2018-04-18 19:47       ` Jakub Narebski
2018-04-17 17:00     ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee
2018-04-18 21:02       ` Jakub Narebski
2018-04-23 14:22         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee
2018-04-18 22:15       ` Jakub Narebski
2018-04-23 14:31         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee
2018-04-18 23:19       ` Jakub Narebski
2018-04-23 14:40         ` Derrick Stolee
2018-04-23 21:38           ` Jakub Narebski
2018-04-19  8:32       ` Jakub Narebski
2018-04-17 17:00     ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee
2018-04-17 17:50       ` Derrick Stolee
2018-04-19  0:02       ` Jakub Narebski
2018-04-23 14:49         ` Derrick Stolee
2018-04-17 17:00     ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee
2018-04-19  0:04     ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski
2018-04-23 14:54       ` Derrick Stolee

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox