git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/17] [RFC] Commit-graph: Write incremental files
@ 2019-05-08 15:53 Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 01/17] commit-graph: fix the_repository reference Derrick Stolee via GitGitGadget
                   ` (18 more replies)
  0 siblings, 19 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano

This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. If this is a good direction, then
it would replace ds/commit-graph-format-v2.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt) [1].

Instead, create a "stack" of commit-graphs where the existing commit-graph 
file is "level 0" and other levels are in files 
$OBJDIR/info/commit-graphs/commit-graph-N for positive N. Each level is
closed under reachability with its lower levels, and the idea of "graph
position" now considers the concatenation of the commit orders from each
level. See PATCH 12 for more details.
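
To make the concatenated "graph position" concrete, here is a minimal sketch
in C. It assumes a hypothetical array level_sizes[] holding the number of
commits in each level, with level 0 first. The function name and the array
are illustrative only and do not appear in the series; PATCH 12 remains the
authoritative description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of this series: translate a global
 * graph position into a (level, index-within-level) pair, given the
 * commit count of each level with level 0 first.
 *
 * Returns 0 on success, -1 if the position is past the end of the stack.
 */
static int split_graph_position(const uint32_t *level_sizes, size_t nr_levels,
				uint32_t pos, size_t *level, uint32_t *local)
{
	size_t i;

	assert(level_sizes && level && local);

	for (i = 0; i < nr_levels; i++) {
		if (pos < level_sizes[i]) {
			*level = i;
			*local = pos;
			return 0;
		}
		/* Skip past every commit stored in this level. */
		pos -= level_sizes[i];
	}
	return -1;
}
```

With levels of 5, 3, and 2 commits, position 5 would be the first commit of
level 1, since positions 0-4 belong to level 0.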

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.
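
The merge decision described above can be sketched as follows. This reads
the two conditions literally (both must hold) and uses commit counts as a
stand-in for expected file size, which is an assumption on my part. The
macro and function names are hypothetical and are not taken from the
patches.

```c
#include <stdint.h>

/* Threshold quoted in the second condition above. */
#define SPLIT_MIN_MERGE_COMMITS 64000

/*
 * Hypothetical helper, not part of this series: decide whether the
 * commits about to be written should be merged with the tip of the
 * stack instead of becoming a new level.
 *
 * nr_new: number of commits in the file we are about to write
 * nr_tip: number of commits in the current tip of the stack
 *
 * Assumption: commit counts approximate file sizes.
 */
static int should_merge_with_tip(uint64_t nr_new, uint64_t nr_tip)
{
	/* Condition 1: new file is more than half the size of the tip. */
	int more_than_half = 2 * nr_new > nr_tip;
	/* Condition 2: new file contains more than 64,000 commits. */
	int above_floor = nr_new > SPLIT_MIN_MERGE_COMMITS;

	return more_than_half && above_floor;
}
```

For example, 100,000 new commits on top of a 150,000-commit tip would merge,
while 1,000 new commits on top of a 1,500-commit tip would not, because the
second condition fails.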

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

git commit-graph write --stdin-commits:         8m 12s
git commit-graph write --stdin-commits --split:    45s

The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.

This series is long because I felt the need to refactor write_commit_graph()
before making such a sweeping change to the format.

 * Patches 1-4: these are small changes which either fix issues or just
   provide clean-up. These are mostly borrowed from
   ds/commit-graph-format-v2.

 * Patches 5-11: these provide a non-functional refactor of
   write_commit_graph() into several methods using a "struct
   write_commit_graph_context" to share state across the methods.

 * Patches 12-16: Implement the split commit-graph feature.

 * Patch 17: Demonstrate the value by writing a split commit-graph during
   git fetch when the new config setting fetch.writeCommitGraph is true.

TODO: There are several things missing that need to be added before this
series is ready for full review and merging:

 1. The documentation for git commit-graph needs updating for the --split
    option.

 2. We likely want config settings for the merge strategy. This is mentioned
    in the design doc, and could be saved for later.

 3. We want to update the git commit-graph verify subcommand to understand
    the commit-graph stack and optionally only verify the tip of the stack.
    This allows faster (amortized) verification if we are verifying
    immediately after writes and trusting the files at rest.

 4. It would be helpful to add a new optional chunk that contains the
    trailing hash for the lower level of the commit-graph stack. This chunk
    would only be for the commit-graph-N files, and would provide a simple
    way to check that the stack is valid on read, in case we are still
    worried about other processes reading/writing in the wrong order.

 5. Currently, --split essentially implies --append since we either (a)
    don't change the existing stack and only add commits, or (b) add all
    existing commits while merging files. However, if you use --append
    with --split, the append logic will trigger a merge with the current tip
    (at minimum). Some care should be taken to make this clearer.

Thanks,
-Stolee

[1] https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
    commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (17):
  commit-graph: fix the_repository reference
  commit-graph: return with errors during write
  commit-graph: collapse parameters into flags
  commit-graph: remove Future Work section
  commit-graph: create write_commit_graph_context
  commit-graph: extract fill_oids_from_packs()
  commit-graph: extract fill_oids_from_commit_hex()
  commit-graph: extract fill_oids_from_all_packs()
  commit-graph: extract count_distinct_commits()
  commit-graph: extract copy_oids_to_commits()
  commit-graph: extract write_commit_graph_file()
  Documentation: describe split commit-graphs
  commit-graph: lay groundwork for incremental files
  commit-graph: load split commit-graph files
  commit-graph: write split commit-graph files
  commit-graph: add --split option
  fetch: add fetch.writeCommitGraph config setting

 Documentation/technical/commit-graph.txt | 157 +++-
 builtin/commit-graph.c                   |  31 +-
 builtin/commit.c                         |   5 +-
 builtin/fetch.c                          |  17 +
 builtin/gc.c                             |   7 +-
 commit-graph.c                           | 946 ++++++++++++++++-------
 commit-graph.h                           |  19 +-
 commit.c                                 |   2 +-
 t/t5318-commit-graph.sh                  |  28 +-
 9 files changed, 887 insertions(+), 325 deletions(-)


base-commit: 93b4405ffe4ad9308740e7c1c71383bfc369baaa
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/184
-- 
gitgitgadget


* [PATCH 01/17] commit-graph: fix the_repository reference
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 02/17] commit-graph: return with errors during write Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The parse_commit_buffer() method takes a repository pointer, so it
should not refer to the_repository anymore.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a5333c7ac6..e4d1233226 100644
--- a/commit.c
+++ b/commit.c
@@ -443,7 +443,7 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	item->date = parse_commit_date(bufptr, tail);
 
 	if (check_graph)
-		load_commit_graph_info(the_repository, item);
+		load_commit_graph_info(r, item);
 
 	return 0;
 }
-- 
gitgitgadget



* [PATCH 02/17] commit-graph: return with errors during write
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 01/17] commit-graph: fix the_repository reference Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 04/17] commit-graph: remove Future Work section Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method uses die() to report failure and
exit when confronted with an unexpected condition. This use of
die() in a library function is incorrect and is now replaced by
error() statements and an int return type.

Now that we use 'goto cleanup' to jump to the terminal condition
on an error, we have new paths that could lead to uninitialized
values. New initializers are added to correct for this.

The builtins 'commit-graph', 'gc', and 'commit' call these methods,
so update them to check the return value.
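
As a generic illustration of the pattern described above (report the error,
set a result code, and jump to a single cleanup label), here is a minimal
standalone sketch. It does not use any git internals: report_error() and
write_numbers() are hypothetical stand-ins, and the pointers are initialized
up front so that the cleanup path is safe from every jump.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a library-style error reporter. */
static int report_error(const char *msg)
{
	fprintf(stderr, "error: %s\n", msg);
	return -1;
}

/*
 * Hypothetical library function: writes 0..count-1 to 'path'.
 * Returns 0 on success and 1 on failure instead of exiting,
 * so the caller decides how to react.
 */
static int write_numbers(const char *path, int count)
{
	int *list = NULL;	/* initialized so cleanup can always free() */
	FILE *out = NULL;	/* initialized so cleanup can test it */
	int res = 0;
	int i;

	if (count < 0) {
		report_error("count must not be negative");
		res = 1;
		goto cleanup;
	}

	/* Allocate one extra slot so a zero count never requests 0 bytes. */
	list = malloc(sizeof(*list) * ((size_t)count + 1));
	if (!list) {
		report_error("out of memory");
		res = 1;
		goto cleanup;
	}
	for (i = 0; i < count; i++)
		list[i] = i;

	out = fopen(path, "w");
	if (!out) {
		report_error("unable to open output file");
		res = 1;
		goto cleanup;
	}
	for (i = 0; i < count; i++)
		fprintf(out, "%d\n", list[i]);

cleanup:
	if (out)
		fclose(out);
	free(list);
	return res;
}
```

A caller can then propagate the failure, in the same spirit as the builtins
updated in this patch, by returning nonzero when write_numbers() does.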

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c | 19 +++++++------
 builtin/commit.c       |  5 ++--
 builtin/gc.c           |  7 ++---
 commit-graph.c         | 60 +++++++++++++++++++++++++++++-------------
 commit-graph.h         | 10 +++----
 5 files changed, 62 insertions(+), 39 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 537fdfd0f0..2e86251f02 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -141,6 +141,7 @@ static int graph_write(int argc, const char **argv)
 	struct string_list *pack_indexes = NULL;
 	struct string_list *commit_hex = NULL;
 	struct string_list lines;
+	int result;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
@@ -168,10 +169,8 @@ static int graph_write(int argc, const char **argv)
 
 	read_replace_refs = 0;
 
-	if (opts.reachable) {
-		write_commit_graph_reachable(opts.obj_dir, opts.append, 1);
-		return 0;
-	}
+	if (opts.reachable)
+		return write_commit_graph_reachable(opts.obj_dir, opts.append, 1);
 
 	string_list_init(&lines, 0);
 	if (opts.stdin_packs || opts.stdin_commits) {
@@ -188,14 +187,14 @@ static int graph_write(int argc, const char **argv)
 		UNLEAK(buf);
 	}
 
-	write_commit_graph(opts.obj_dir,
-			   pack_indexes,
-			   commit_hex,
-			   opts.append,
-			   1);
+	result = write_commit_graph(opts.obj_dir,
+				    pack_indexes,
+				    commit_hex,
+				    opts.append,
+				    1);
 
 	UNLEAK(lines);
-	return 0;
+	return result;
 }
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
diff --git a/builtin/commit.c b/builtin/commit.c
index 2986553d5f..b9ea7222fa 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1669,8 +1669,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "new_index file. Check that disk is not full and quota is\n"
 		      "not exceeded, and then \"git reset HEAD\" to recover."));
 
-	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
-		write_commit_graph_reachable(get_object_directory(), 0, 0);
+	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
+	    write_commit_graph_reachable(get_object_directory(), 0, 0))
+		return 1;
 
 	repo_rerere(the_repository, 0);
 	run_command_v_opt(argv_gc_auto, RUN_GIT_CMD);
diff --git a/builtin/gc.c b/builtin/gc.c
index 020f725acc..3984addf73 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -664,9 +664,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		clean_pack_garbage();
 	}
 
-	if (gc_write_commit_graph)
-		write_commit_graph_reachable(get_object_directory(), 0,
-					     !quiet && !daemonized);
+	if (gc_write_commit_graph &&
+	    write_commit_graph_reachable(get_object_directory(), 0,
+					 !quiet && !daemonized))
+		return 1;
 
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
diff --git a/commit-graph.c b/commit-graph.c
index 66865acbd7..ee487a364b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -851,27 +851,30 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-void write_commit_graph_reachable(const char *obj_dir, int append,
-				  int report_progress)
+int write_commit_graph_reachable(const char *obj_dir, int append,
+				 int report_progress)
 {
 	struct string_list list = STRING_LIST_INIT_DUP;
+	int result;
 
 	for_each_ref(add_ref_to_list, &list);
-	write_commit_graph(obj_dir, NULL, &list, append, report_progress);
+	result = write_commit_graph(obj_dir, NULL, &list,
+				    append, report_progress);
 
 	string_list_clear(&list, 0);
+	return result;
 }
 
-void write_commit_graph(const char *obj_dir,
-			struct string_list *pack_indexes,
-			struct string_list *commit_hex,
-			int append, int report_progress)
+int write_commit_graph(const char *obj_dir,
+		       struct string_list *pack_indexes,
+		       struct string_list *commit_hex,
+		       int append, int report_progress)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
 	struct hashfile *f;
 	uint32_t i, count_distinct = 0;
-	char *graph_name;
+	char *graph_name = NULL;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
@@ -883,15 +886,17 @@ void write_commit_graph(const char *obj_dir,
 	uint64_t progress_cnt = 0;
 	struct strbuf progress_title = STRBUF_INIT;
 	unsigned long approx_nr_objects;
+	int res = 0;
 
 	if (!commit_graph_compatible(the_repository))
-		return;
+		return 0;
 
 	oids.nr = 0;
 	approx_nr_objects = approximate_object_count();
 	oids.alloc = approx_nr_objects / 32;
 	oids.progress = NULL;
 	oids.progress_done = 0;
+	commits.list = NULL;
 
 	if (append) {
 		prepare_commit_graph_one(the_repository, obj_dir);
@@ -932,10 +937,16 @@ void write_commit_graph(const char *obj_dir,
 			strbuf_setlen(&packname, dirlen);
 			strbuf_addstr(&packname, pack_indexes->items[i].string);
 			p = add_packed_git(packname.buf, packname.len, 1);
-			if (!p)
-				die(_("error adding pack %s"), packname.buf);
-			if (open_pack_index(p))
-				die(_("error opening index for %s"), packname.buf);
+			if (!p) {
+				error(_("error adding pack %s"), packname.buf);
+				res = 1;
+				goto cleanup;
+			}
+			if (open_pack_index(p)) {
+				error(_("error opening index for %s"), packname.buf);
+				res = 1;
+				goto cleanup;
+			}
 			for_each_object_in_pack(p, add_packed_commits, &oids,
 						FOR_EACH_OBJECT_PACK_ORDER);
 			close_pack(p);
@@ -1006,8 +1017,11 @@ void write_commit_graph(const char *obj_dir,
 	}
 	stop_progress(&progress);
 
-	if (count_distinct >= GRAPH_EDGE_LAST_MASK)
-		die(_("the commit graph format cannot write %d commits"), count_distinct);
+	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
+		error(_("the commit graph format cannot write %d commits"), count_distinct);
+		res = 1;
+		goto cleanup;
+	}
 
 	commits.nr = 0;
 	commits.alloc = count_distinct;
@@ -1039,16 +1053,21 @@ void write_commit_graph(const char *obj_dir,
 	num_chunks = num_extra_edges ? 4 : 3;
 	stop_progress(&progress);
 
-	if (commits.nr >= GRAPH_EDGE_LAST_MASK)
-		die(_("too many commits to write graph"));
+	if (commits.nr >= GRAPH_EDGE_LAST_MASK) {
+		error(_("too many commits to write graph"));
+		res = 1;
+		goto cleanup;
+	}
 
 	compute_generation_numbers(&commits, report_progress);
 
 	graph_name = get_commit_graph_filename(obj_dir);
 	if (safe_create_leading_directories(graph_name)) {
 		UNLEAK(graph_name);
-		die_errno(_("unable to create leading directories of %s"),
-			  graph_name);
+		error(_("unable to create leading directories of %s"),
+			graph_name);
+		res = errno;
+		goto cleanup;
 	}
 
 	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
@@ -1107,9 +1126,12 @@ void write_commit_graph(const char *obj_dir,
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
+cleanup:
 	free(graph_name);
 	free(commits.list);
 	free(oids.list);
+
+	return res;
 }
 
 #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
diff --git a/commit-graph.h b/commit-graph.h
index 7dfb8c896f..d15670bf46 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -65,12 +65,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
  */
 int generation_numbers_enabled(struct repository *r);
 
-void write_commit_graph_reachable(const char *obj_dir, int append,
+int write_commit_graph_reachable(const char *obj_dir, int append,
 				  int report_progress);
-void write_commit_graph(const char *obj_dir,
-			struct string_list *pack_indexes,
-			struct string_list *commit_hex,
-			int append, int report_progress);
+int write_commit_graph(const char *obj_dir,
+		       struct string_list *pack_indexes,
+		       struct string_list *commit_hex,
+		       int append, int report_progress);
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);
 
-- 
gitgitgadget



* [PATCH 04/17] commit-graph: remove Future Work section
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 01/17] commit-graph: fix the_repository reference Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 02/17] commit-graph: return with errors during write Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 03/17] commit-graph: collapse parameters into flags Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph feature began with a long list of planned
benefits, most of which are now complete. The future work
section has only a few items left.

As for making more algorithms aware of generation numbers,
some are only waiting for generation number v2 to ensure the
performance matches the existing behavior using commit date.

It is unlikely that we will ever send a commit-graph file
as part of the protocol, since we would need to verify the
data, and that is as expensive as writing a commit-graph from
scratch. If we want to start trusting remote content, then
that item can be investigated again.

While there is more work to be done on the feature, having
a section of the docs devoted to a TODO list is wasteful and
hard to keep up-to-date.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 7805b0968c..fb53341d5e 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,23 +127,6 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
-Future Work
------------
-
-- After computing and storing generation numbers, we must make graph
-  walks aware of generation numbers to gain the performance benefits they
-  enable. This will mostly be accomplished by swapping a commit-date-ordered
-  priority queue with one ordered by generation number. The following
-  operations are important candidates:
-
-    - 'log --topo-order'
-    - 'tag --merged'
-
-- A server could provide a commit-graph file as part of the network protocol
-  to avoid extra calculations by clients. This feature is only of benefit if
-  the user is willing to trust the file, because verifying the file is correct
-  is as hard as computing it from scratch.
-
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH 03/17] commit-graph: collapse parameters into flags
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 04/17] commit-graph: remove Future Work section Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 05/17] commit-graph: create write_commit_graph_context Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() and write_commit_graph_reachable() methods
currently take two boolean parameters: 'append' and 'report_progress'.
We will soon expand the possible options to send to these methods, so
instead of complicating the parameter list, first simplify it.

Collapse these parameters into a 'flags' parameter, and adjust the
callers to provide flags as necessary.
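
A small standalone sketch of the idea: two booleans are packed into one
'flags' word and unpacked again on the other side. The two macro names match
the ones introduced in the diff below; the helper functions are hypothetical
and exist only for illustration.

```c
/* Same names and values as the macros added to commit-graph.h below. */
#define COMMIT_GRAPH_APPEND     (1 << 0)
#define COMMIT_GRAPH_PROGRESS   (1 << 1)

/* Hypothetical helper: build a flags word from the two old booleans. */
static unsigned int make_write_flags(int append, int report_progress)
{
	unsigned int flags = 0;

	if (append)
		flags |= COMMIT_GRAPH_APPEND;
	if (report_progress)
		flags |= COMMIT_GRAPH_PROGRESS;
	return flags;
}

/*
 * Hypothetical helper: test one flag and normalize the answer to 0 or 1.
 * The "!!" matters if the value is later stored somewhere narrow, such
 * as a one-bit field, where a raw value of 2 would be truncated to 0.
 */
static int flag_is_set(unsigned int flags, unsigned int bit)
{
	return !!(flags & bit);
}
```

Adding a future option then only requires a new bit definition rather than
another parameter at every call site.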

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c | 8 +++++---
 builtin/commit.c       | 2 +-
 builtin/gc.c           | 4 ++--
 commit-graph.c         | 9 +++++----
 commit-graph.h         | 8 +++++---
 5 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 2e86251f02..828b1a713f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -142,6 +142,7 @@ static int graph_write(int argc, const char **argv)
 	struct string_list *commit_hex = NULL;
 	struct string_list lines;
 	int result;
+	int flags = COMMIT_GRAPH_PROGRESS;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
@@ -166,11 +167,13 @@ static int graph_write(int argc, const char **argv)
 		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
+	if (opts.append)
+		flags |= COMMIT_GRAPH_APPEND;
 
 	read_replace_refs = 0;
 
 	if (opts.reachable)
-		return write_commit_graph_reachable(opts.obj_dir, opts.append, 1);
+		return write_commit_graph_reachable(opts.obj_dir, flags);
 
 	string_list_init(&lines, 0);
 	if (opts.stdin_packs || opts.stdin_commits) {
@@ -190,8 +193,7 @@ static int graph_write(int argc, const char **argv)
 	result = write_commit_graph(opts.obj_dir,
 				    pack_indexes,
 				    commit_hex,
-				    opts.append,
-				    1);
+				    flags);
 
 	UNLEAK(lines);
 	return result;
diff --git a/builtin/commit.c b/builtin/commit.c
index b9ea7222fa..b001ef565d 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1670,7 +1670,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "not exceeded, and then \"git reset HEAD\" to recover."));
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(get_object_directory(), 0, 0))
+	    write_commit_graph_reachable(get_object_directory(), 0))
 		return 1;
 
 	repo_rerere(the_repository, 0);
diff --git a/builtin/gc.c b/builtin/gc.c
index 3984addf73..df2573f124 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -665,8 +665,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	}
 
 	if (gc_write_commit_graph &&
-	    write_commit_graph_reachable(get_object_directory(), 0,
-					 !quiet && !daemonized))
+	    write_commit_graph_reachable(get_object_directory(),
+					 !quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0))
 		return 1;
 
 	if (auto_gc && too_many_loose_objects())
diff --git a/commit-graph.c b/commit-graph.c
index ee487a364b..8bbd50658c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -851,15 +851,14 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-int write_commit_graph_reachable(const char *obj_dir, int append,
-				 int report_progress)
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags)
 {
 	struct string_list list = STRING_LIST_INIT_DUP;
 	int result;
 
 	for_each_ref(add_ref_to_list, &list);
 	result = write_commit_graph(obj_dir, NULL, &list,
-				    append, report_progress);
+				    flags);
 
 	string_list_clear(&list, 0);
 	return result;
@@ -868,7 +867,7 @@ int write_commit_graph_reachable(const char *obj_dir, int append,
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       int append, int report_progress)
+		       unsigned int flags)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -887,6 +886,8 @@ int write_commit_graph(const char *obj_dir,
 	struct strbuf progress_title = STRBUF_INIT;
 	unsigned long approx_nr_objects;
 	int res = 0;
+	int append = flags & COMMIT_GRAPH_APPEND;
+	int report_progress = flags & COMMIT_GRAPH_PROGRESS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
diff --git a/commit-graph.h b/commit-graph.h
index d15670bf46..70f4caf0c7 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -65,12 +65,14 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
  */
 int generation_numbers_enabled(struct repository *r);
 
-int write_commit_graph_reachable(const char *obj_dir, int append,
-				  int report_progress);
+#define COMMIT_GRAPH_APPEND     (1 << 0)
+#define COMMIT_GRAPH_PROGRESS   (1 << 1)
+
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       int append, int report_progress);
+		       unsigned int flags);
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);
 
-- 
gitgitgadget



* [PATCH 05/17] commit-graph: create write_commit_graph_context
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 03/17] commit-graph: collapse parameters into flags Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 06/17] commit-graph: extract fill_oids_from_packs() Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too large and complex. To simplify
it, we should extract several small methods. However, we will risk
repeating a lot of declarations related to progress indicators and
object id or commit lists.

Create a new write_commit_graph_context struct that contains the
core data structures used in this process. Replace the other local
variables with the values inside the context object. Following this
change, we will start to lift code segments wholesale out of the
write_commit_graph() method and into their own methods.
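
The general shape of this refactor, reduced to a toy example: helpers that
previously each needed a list, a length, and a progress counter instead take
one context pointer. The struct and function names here are hypothetical
and far simpler than the real write_commit_graph_context in the diff below.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified context: shared state for all helpers. */
struct toy_write_context {
	const uint32_t *items;	/* data every helper walks */
	size_t nr;		/* number of entries in items */
	uint64_t progress_cnt;	/* shared progress counter */
};

/* Each helper takes only the context instead of a long parameter list. */
static uint64_t toy_sum_items(struct toy_write_context *ctx)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < ctx->nr; i++) {
		ctx->progress_cnt++;
		sum += ctx->items[i];
	}
	return sum;
}

static uint32_t toy_max_item(struct toy_write_context *ctx)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < ctx->nr; i++) {
		ctx->progress_cnt++;
		if (ctx->items[i] > max)
			max = ctx->items[i];
	}
	return max;
}
```

Because the counter lives in the context, both helpers advance the same
progress value without passing a pointer to it separately.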

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 390 ++++++++++++++++++++++++-------------------------
 1 file changed, 194 insertions(+), 196 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8bbd50658c..58f0f0ae34 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -518,14 +518,38 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+struct packed_oid_list {
+	struct object_id *list;
+	int nr;
+	int alloc;
+};
+
+struct write_commit_graph_context {
+	struct repository *r;
+	const char *obj_dir;
+	char *graph_name;
+	struct packed_oid_list oids;
+	struct packed_commit_list commits;
+	int num_extra_edges;
+	unsigned long approx_nr_objects;
+	struct progress *progress;
+	int progress_done;
+	uint64_t progress_cnt;
+	unsigned append:1,
+		 report_progress:1;
+};
+
 static void write_graph_chunk_fanout(struct hashfile *f,
-				     struct commit **commits,
-				     int nr_commits,
-				     struct progress *progress,
-				     uint64_t *progress_cnt)
+				     struct write_commit_graph_context *ctx)
 {
 	int i, count = 0;
-	struct commit **list = commits;
+	struct commit **list = ctx->commits.list;
 
 	/*
 	 * Write the first-level table (the list is sorted,
@@ -533,10 +557,10 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 	 * having to do eight extra binary search iterations).
 	 */
 	for (i = 0; i < 256; i++) {
-		while (count < nr_commits) {
+		while (count < ctx->commits.nr) {
 			if ((*list)->object.oid.hash[0] != i)
 				break;
-			display_progress(progress, ++*progress_cnt);
+			display_progress(ctx->progress, ++ctx->progress_cnt);
 			count++;
 			list++;
 		}
@@ -546,14 +570,12 @@ static void write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
-				   struct commit **commits, int nr_commits,
-				   struct progress *progress,
-				   uint64_t *progress_cnt)
+				   struct write_commit_graph_context *ctx)
 {
-	struct commit **list = commits;
+	struct commit **list = ctx->commits.list;
 	int count;
-	for (count = 0; count < nr_commits; count++, list++) {
-		display_progress(progress, ++*progress_cnt);
+	for (count = 0; count < ctx->commits.nr; count++, list++) {
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
 	}
 }
@@ -565,19 +587,17 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static void write_graph_chunk_data(struct hashfile *f, int hash_len,
-				   struct commit **commits, int nr_commits,
-				   struct progress *progress,
-				   uint64_t *progress_cnt)
+				   struct write_commit_graph_context *ctx)
 {
-	struct commit **list = commits;
-	struct commit **last = commits + nr_commits;
+	struct commit **list = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
 
 	while (list < last) {
 		struct commit_list *parent;
 		int edge_value;
 		uint32_t packedDate[2];
-		display_progress(progress, ++*progress_cnt);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 
 		parse_commit_no_graph(*list);
 		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);
@@ -588,8 +608,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			edge_value = GRAPH_PARENT_NONE;
 		else {
 			edge_value = sha1_pos(parent->item->object.oid.hash,
-					      commits,
-					      nr_commits,
+					      ctx->commits.list,
+					      ctx->commits.nr,
 					      commit_to_sha1);
 
 			if (edge_value < 0)
@@ -609,8 +629,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 			edge_value = GRAPH_EXTRA_EDGES_NEEDED | num_extra_edges;
 		else {
 			edge_value = sha1_pos(parent->item->object.oid.hash,
-					      commits,
-					      nr_commits,
+					      ctx->commits.list,
+					      ctx->commits.nr,
 					      commit_to_sha1);
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
@@ -642,19 +662,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 }
 
 static void write_graph_chunk_extra_edges(struct hashfile *f,
-					  struct commit **commits,
-					  int nr_commits,
-					  struct progress *progress,
-					  uint64_t *progress_cnt)
+					  struct write_commit_graph_context *ctx)
 {
-	struct commit **list = commits;
-	struct commit **last = commits + nr_commits;
+	struct commit **list = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
 
 	while (list < last) {
 		int num_parents = 0;
 
-		display_progress(progress, ++*progress_cnt);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 
 		for (parent = (*list)->parents; num_parents < 3 && parent;
 		     parent = parent->next)
@@ -668,8 +685,8 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 		/* Since num_parents > 2, this initializer is safe. */
 		for (parent = (*list)->parents->next; parent; parent = parent->next) {
 			int edge_value = sha1_pos(parent->item->object.oid.hash,
-						  commits,
-						  nr_commits,
+						  ctx->commits.list,
+						  ctx->commits.nr,
 						  commit_to_sha1);
 
 			if (edge_value < 0)
@@ -693,125 +710,111 @@ static int commit_compare(const void *_a, const void *_b)
 	return oidcmp(a, b);
 }
 
-struct packed_commit_list {
-	struct commit **list;
-	int nr;
-	int alloc;
-};
-
-struct packed_oid_list {
-	struct object_id *list;
-	int nr;
-	int alloc;
-	struct progress *progress;
-	int progress_done;
-};
-
 static int add_packed_commits(const struct object_id *oid,
 			      struct packed_git *pack,
 			      uint32_t pos,
 			      void *data)
 {
-	struct packed_oid_list *list = (struct packed_oid_list*)data;
+	struct write_commit_graph_context *ctx = (struct write_commit_graph_context*)data;
 	enum object_type type;
 	off_t offset = nth_packed_object_offset(pack, pos);
 	struct object_info oi = OBJECT_INFO_INIT;
 
-	if (list->progress)
-		display_progress(list->progress, ++list->progress_done);
+	if (ctx->progress)
+		display_progress(ctx->progress, ++ctx->progress_done);
 
 	oi.typep = &type;
-	if (packed_object_info(the_repository, pack, offset, &oi) < 0)
+	if (packed_object_info(ctx->r, pack, offset, &oi) < 0)
 		die(_("unable to get type of object %s"), oid_to_hex(oid));
 
 	if (type != OBJ_COMMIT)
 		return 0;
 
-	ALLOC_GROW(list->list, list->nr + 1, list->alloc);
-	oidcpy(&(list->list[list->nr]), oid);
-	list->nr++;
+	ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+	oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
+	ctx->oids.nr++;
 
 	return 0;
 }
 
-static void add_missing_parents(struct packed_oid_list *oids, struct commit *commit)
+static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
 {
 	struct commit_list *parent;
 	for (parent = commit->parents; parent; parent = parent->next) {
 		if (!(parent->item->object.flags & UNINTERESTING)) {
-			ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
-			oidcpy(&oids->list[oids->nr], &(parent->item->object.oid));
-			oids->nr++;
+			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+			oidcpy(&ctx->oids.list[ctx->oids.nr], &(parent->item->object.oid));
+			ctx->oids.nr++;
 			parent->item->object.flags |= UNINTERESTING;
 		}
 	}
 }
 
-static void close_reachable(struct packed_oid_list *oids, int report_progress)
+static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	struct progress *progress = NULL;
 
-	if (report_progress)
-		progress = start_delayed_progress(
-			_("Loading known commits in commit graph"), oids->nr);
-	for (i = 0; i < oids->nr; i++) {
-		display_progress(progress, i + 1);
-		commit = lookup_commit(the_repository, &oids->list[i]);
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("Loading known commits in commit graph"),
+					ctx->oids.nr);
+	for (i = 0; i < ctx->oids.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 		if (commit)
 			commit->object.flags |= UNINTERESTING;
 	}
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 
 	/*
-	 * As this loop runs, oids->nr may grow, but not more
+	 * As this loop runs, ctx->oids.nr may grow, but not more
 	 * than the number of missing commits in the reachable
 	 * closure.
 	 */
-	if (report_progress)
-		progress = start_delayed_progress(
-			_("Expanding reachable commits in commit graph"), oids->nr);
-	for (i = 0; i < oids->nr; i++) {
-		display_progress(progress, i + 1);
-		commit = lookup_commit(the_repository, &oids->list[i]);
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("Expanding reachable commits in commit graph"),
+					ctx->oids.nr);
+	for (i = 0; i < ctx->oids.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
 		if (commit && !parse_commit_no_graph(commit))
-			add_missing_parents(oids, commit);
+			add_missing_parents(ctx, commit);
 	}
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 
-	if (report_progress)
-		progress = start_delayed_progress(
-			_("Clearing commit marks in commit graph"), oids->nr);
-	for (i = 0; i < oids->nr; i++) {
-		display_progress(progress, i + 1);
-		commit = lookup_commit(the_repository, &oids->list[i]);
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("Clearing commit marks in commit graph"),
+					ctx->oids.nr);
+	for (i = 0; i < ctx->oids.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
 		if (commit)
 			commit->object.flags &= ~UNINTERESTING;
 	}
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 }
 
-static void compute_generation_numbers(struct packed_commit_list* commits,
-				       int report_progress)
+static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit_list *list = NULL;
-	struct progress *progress = NULL;
 
-	if (report_progress)
-		progress = start_progress(
-			_("Computing commit graph generation numbers"),
-			commits->nr);
-	for (i = 0; i < commits->nr; i++) {
-		display_progress(progress, i + 1);
-		if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY &&
-		    commits->list[i]->generation != GENERATION_NUMBER_ZERO)
+	if (ctx->report_progress)
+		ctx->progress = start_progress(
+					_("Computing commit graph generation numbers"),
+					ctx->commits.nr);
+	for (i = 0; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		if (ctx->commits.list[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    ctx->commits.list[i]->generation != GENERATION_NUMBER_ZERO)
 			continue;
 
-		commit_list_insert(commits->list[i], &list);
+		commit_list_insert(ctx->commits.list[i], &list);
 		while (list) {
 			struct commit *current = list->item;
 			struct commit_list *parent;
@@ -838,7 +841,7 @@ static void compute_generation_numbers(struct packed_commit_list* commits,
 			}
 		}
 	}
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 }
 
 static int add_ref_to_list(const char *refname,
@@ -869,8 +872,7 @@ int write_commit_graph(const char *obj_dir,
 		       struct string_list *commit_hex,
 		       unsigned int flags)
 {
-	struct packed_oid_list oids;
-	struct packed_commit_list commits;
+	struct write_commit_graph_context *ctx;
 	struct hashfile *f;
 	uint32_t i, count_distinct = 0;
 	char *graph_name = NULL;
@@ -878,44 +880,38 @@ int write_commit_graph(const char *obj_dir,
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
 	int num_chunks;
-	int num_extra_edges;
 	struct commit_list *parent;
-	struct progress *progress = NULL;
 	const unsigned hashsz = the_hash_algo->rawsz;
-	uint64_t progress_cnt = 0;
 	struct strbuf progress_title = STRBUF_INIT;
-	unsigned long approx_nr_objects;
 	int res = 0;
-	int append = flags & COMMIT_GRAPH_APPEND;
-	int report_progress = flags & COMMIT_GRAPH_PROGRESS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
 
-	oids.nr = 0;
-	approx_nr_objects = approximate_object_count();
-	oids.alloc = approx_nr_objects / 32;
-	oids.progress = NULL;
-	oids.progress_done = 0;
-	commits.list = NULL;
-
-	if (append) {
-		prepare_commit_graph_one(the_repository, obj_dir);
-		if (the_repository->objects->commit_graph)
-			oids.alloc += the_repository->objects->commit_graph->num_commits;
+	ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
+	ctx->r = the_repository;
+	ctx->obj_dir = obj_dir;
+	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
+	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+
+	ctx->approx_nr_objects = approximate_object_count();
+	ctx->oids.alloc = ctx->approx_nr_objects / 32;
+
+	if (ctx->append) {
+		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
+		if (ctx->r->objects->commit_graph)
+			ctx->oids.alloc += ctx->r->objects->commit_graph->num_commits;
 	}
 
-	if (oids.alloc < 1024)
-		oids.alloc = 1024;
-	ALLOC_ARRAY(oids.list, oids.alloc);
-
-	if (append && the_repository->objects->commit_graph) {
-		struct commit_graph *commit_graph =
-			the_repository->objects->commit_graph;
-		for (i = 0; i < commit_graph->num_commits; i++) {
-			const unsigned char *hash = commit_graph->chunk_oid_lookup +
-				commit_graph->hash_len * i;
-			hashcpy(oids.list[oids.nr++].hash, hash);
+	if (ctx->oids.alloc < 1024)
+		ctx->oids.alloc = 1024;
+	ALLOC_ARRAY(ctx->oids.list, ctx->oids.alloc);
+
+	if (ctx->append && ctx->r->objects->commit_graph) {
+		struct commit_graph *g = ctx->r->objects->commit_graph;
+		for (i = 0; i < g->num_commits; i++) {
+			const unsigned char *hash = g->chunk_oid_lookup + g->hash_len * i;
+			hashcpy(ctx->oids.list[ctx->oids.nr++].hash, hash);
 		}
 	}
 
@@ -924,14 +920,14 @@ int write_commit_graph(const char *obj_dir,
 		int dirlen;
 		strbuf_addf(&packname, "%s/pack/", obj_dir);
 		dirlen = packname.len;
-		if (report_progress) {
+		if (ctx->report_progress) {
 			strbuf_addf(&progress_title,
 				    Q_("Finding commits for commit graph in %d pack",
 				       "Finding commits for commit graph in %d packs",
 				       pack_indexes->nr),
 				    pack_indexes->nr);
-			oids.progress = start_delayed_progress(progress_title.buf, 0);
-			oids.progress_done = 0;
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+			ctx->progress_done = 0;
 		}
 		for (i = 0; i < pack_indexes->nr; i++) {
 			struct packed_git *p;
@@ -948,75 +944,76 @@ int write_commit_graph(const char *obj_dir,
 				res = 1;
 				goto cleanup;
 			}
-			for_each_object_in_pack(p, add_packed_commits, &oids,
+			for_each_object_in_pack(p, add_packed_commits, ctx,
 						FOR_EACH_OBJECT_PACK_ORDER);
 			close_pack(p);
 			free(p);
 		}
-		stop_progress(&oids.progress);
+		stop_progress(&ctx->progress);
 		strbuf_reset(&progress_title);
 		strbuf_release(&packname);
 	}
 
 	if (commit_hex) {
-		if (report_progress) {
+		if (ctx->report_progress) {
 			strbuf_addf(&progress_title,
 				    Q_("Finding commits for commit graph from %d ref",
 				       "Finding commits for commit graph from %d refs",
 				       commit_hex->nr),
 				    commit_hex->nr);
-			progress = start_delayed_progress(progress_title.buf,
-							  commit_hex->nr);
+			ctx->progress = start_delayed_progress(
+						progress_title.buf,
+						commit_hex->nr);
 		}
 		for (i = 0; i < commit_hex->nr; i++) {
 			const char *end;
 			struct object_id oid;
 			struct commit *result;
 
-			display_progress(progress, i + 1);
+			display_progress(ctx->progress, i + 1);
 			if (commit_hex->items[i].string &&
 			    parse_oid_hex(commit_hex->items[i].string, &oid, &end))
 				continue;
 
-			result = lookup_commit_reference_gently(the_repository, &oid, 1);
+			result = lookup_commit_reference_gently(ctx->r, &oid, 1);
 
 			if (result) {
-				ALLOC_GROW(oids.list, oids.nr + 1, oids.alloc);
-				oidcpy(&oids.list[oids.nr], &(result->object.oid));
-				oids.nr++;
+				ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+				oidcpy(&ctx->oids.list[ctx->oids.nr], &(result->object.oid));
+				ctx->oids.nr++;
 			}
 		}
-		stop_progress(&progress);
+		stop_progress(&ctx->progress);
 		strbuf_reset(&progress_title);
 	}
 
 	if (!pack_indexes && !commit_hex) {
-		if (report_progress)
-			oids.progress = start_delayed_progress(
+		if (ctx->report_progress)
+			ctx->progress = start_delayed_progress(
 				_("Finding commits for commit graph among packed objects"),
-				approx_nr_objects);
-		for_each_packed_object(add_packed_commits, &oids,
+				ctx->approx_nr_objects);
+		for_each_packed_object(add_packed_commits, ctx,
 				       FOR_EACH_OBJECT_PACK_ORDER);
-		if (oids.progress_done < approx_nr_objects)
-			display_progress(oids.progress, approx_nr_objects);
-		stop_progress(&oids.progress);
+		if (ctx->progress_done < ctx->approx_nr_objects)
+			display_progress(ctx->progress, ctx->approx_nr_objects);
+		stop_progress(&ctx->progress);
 	}
 
-	close_reachable(&oids, report_progress);
+	close_reachable(ctx);
 
-	if (report_progress)
-		progress = start_delayed_progress(
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
 			_("Counting distinct commits in commit graph"),
-			oids.nr);
-	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(oids.list, oids.nr, commit_compare);
+			ctx->oids.nr);
+	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
+	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
 	count_distinct = 1;
-	for (i = 1; i < oids.nr; i++) {
-		display_progress(progress, i + 1);
-		if (!oideq(&oids.list[i - 1], &oids.list[i]))
+	for (i = 1; i < ctx->oids.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			count_distinct++;
 	}
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 
 	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
 		error(_("the commit graph format cannot write %d commits"), count_distinct);
@@ -1024,54 +1021,54 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
-	commits.nr = 0;
-	commits.alloc = count_distinct;
-	ALLOC_ARRAY(commits.list, commits.alloc);
+	ctx->commits.alloc = count_distinct;
+	ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
 
-	num_extra_edges = 0;
-	if (report_progress)
-		progress = start_delayed_progress(
+	ctx->num_extra_edges = 0;
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
 			_("Finding extra edges in commit graph"),
-			oids.nr);
-	for (i = 0; i < oids.nr; i++) {
+			ctx->oids.nr);
+	for (i = 0; i < ctx->oids.nr; i++) {
 		int num_parents = 0;
-		display_progress(progress, i + 1);
-		if (i > 0 && oideq(&oids.list[i - 1], &oids.list[i]))
+		display_progress(ctx->progress, i + 1);
+		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
-		commits.list[commits.nr] = lookup_commit(the_repository, &oids.list[i]);
-		parse_commit_no_graph(commits.list[commits.nr]);
+		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
-		for (parent = commits.list[commits.nr]->parents;
+		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
 		     parent; parent = parent->next)
 			num_parents++;
 
 		if (num_parents > 2)
-			num_extra_edges += num_parents - 1;
+			ctx->num_extra_edges += num_parents - 1;
 
-		commits.nr++;
+		ctx->commits.nr++;
 	}
-	num_chunks = num_extra_edges ? 4 : 3;
-	stop_progress(&progress);
+	stop_progress(&ctx->progress);
 
-	if (commits.nr >= GRAPH_EDGE_LAST_MASK) {
+	if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
 		error(_("too many commits to write graph"));
 		res = 1;
 		goto cleanup;
 	}
 
-	compute_generation_numbers(&commits, report_progress);
+	compute_generation_numbers(ctx);
 
-	graph_name = get_commit_graph_filename(obj_dir);
-	if (safe_create_leading_directories(graph_name)) {
-		UNLEAK(graph_name);
+	num_chunks = ctx->num_extra_edges ? 4 : 3;
+
+	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	if (safe_create_leading_directories(ctx->graph_name)) {
+		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
-			graph_name);
+			ctx->graph_name);
 		res = errno;
 		goto cleanup;
 	}
 
-	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
+	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
@@ -1084,7 +1081,7 @@ int write_commit_graph(const char *obj_dir,
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (num_extra_edges)
+	if (ctx->num_extra_edges)
 		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
 	else
 		chunk_ids[3] = 0;
@@ -1092,9 +1089,9 @@ int write_commit_graph(const char *obj_dir,
 
 	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
-	chunk_offsets[2] = chunk_offsets[1] + hashsz * commits.nr;
-	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * num_extra_edges;
+	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
+	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
@@ -1105,32 +1102,33 @@ int write_commit_graph(const char *obj_dir,
 		hashwrite(f, chunk_write, 12);
 	}
 
-	if (report_progress) {
+	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
 			       num_chunks),
 			    num_chunks);
-		progress = start_delayed_progress(
+		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * commits.nr);
+			num_chunks * ctx->commits.nr);
 	}
-	write_graph_chunk_fanout(f, commits.list, commits.nr, progress, &progress_cnt);
-	write_graph_chunk_oids(f, hashsz, commits.list, commits.nr, progress, &progress_cnt);
-	write_graph_chunk_data(f, hashsz, commits.list, commits.nr, progress, &progress_cnt);
-	if (num_extra_edges)
-		write_graph_chunk_extra_edges(f, commits.list, commits.nr, progress, &progress_cnt);
-	stop_progress(&progress);
+	write_graph_chunk_fanout(f, ctx);
+	write_graph_chunk_oids(f, hashsz, ctx);
+	write_graph_chunk_data(f, hashsz, ctx);
+	if (ctx->num_extra_edges)
+		write_graph_chunk_extra_edges(f, ctx);
+	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
-	close_commit_graph(the_repository);
+	close_commit_graph(ctx->r);
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
 cleanup:
 	free(graph_name);
-	free(commits.list);
-	free(oids.list);
+	free(ctx->commits.list);
+	free(ctx->oids.list);
+	free(ctx);
 
 	return res;
 }
-- 
gitgitgadget


* [PATCH 06/17] commit-graph: extract fill_oids_from_packs()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 05/17] commit-graph: create write_commit_graph_context Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 07/17] commit-graph: extract fill_oids_from_commit_hex() Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract fill_oids_from_packs(), which reads the given pack-file
list and fills the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
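The shape of this extraction can be sketched outside of git. The
example below is only an illustration under stated assumptions:
toy_ctx, toy_fill_from_packs() and toy_write() are hypothetical names,
the "empty name" failure stands in for add_packed_git() or
open_pack_index() failing, and plain malloc-family calls replace
strbuf. It shows a helper that returns non-zero on error and a caller
that forwards that value to its own cleanup label. Unlike the helper in
this patch, the sketch frees its scratch buffer on the error path too.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context; the real write_commit_graph_context has more fields. */
struct toy_ctx {
	const char *obj_dir;
	int nr_seen;
};

/*
 * Builds "<obj_dir>/pack/<name>" for each name and counts it.
 * Returns 0 on success and 1 on the first empty name (or on allocation
 * failure); the scratch buffer is freed on every path.
 */
static int toy_fill_from_packs(struct toy_ctx *ctx,
			       const char **names, size_t nr)
{
	size_t i;
	int res = 0;
	char *path = NULL;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(ctx->obj_dir) + strlen("/pack/") +
			     strlen(names[i]) + 1;
		char *tmp = realloc(path, len);

		if (!tmp) {
			res = 1;
			goto done;
		}
		path = tmp;
		snprintf(path, len, "%s/pack/%s", ctx->obj_dir, names[i]);

		if (!*names[i]) {
			fprintf(stderr, "error adding pack %s\n", path);
			res = 1;
			goto done;
		}
		ctx->nr_seen++;
	}

done:
	free(path);
	return res;
}

/* Caller side: forward the helper's result to the shared cleanup label. */
static int toy_write(const char *obj_dir, const char **names, size_t nr)
{
	struct toy_ctx ctx = { obj_dir, 0 };
	int res = 0;

	if (names) {
		if ((res = toy_fill_from_packs(&ctx, names, nr)))
			goto cleanup;
	}
	/* Later stages of the write would run here. */

cleanup:
	return res;
}
```

Calling toy_write() with a NULL list skips the helper entirely, which
mirrors the "if (pack_indexes)" guard kept in write_commit_graph().
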
 commit-graph.c | 83 ++++++++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 36 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 58f0f0ae34..80c7069aaa 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -867,6 +867,51 @@ int write_commit_graph_reachable(const char *obj_dir, unsigned int flags)
 	return result;
 }
 
+static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
+				struct string_list *pack_indexes)
+{
+	uint32_t i;
+	struct strbuf progress_title = STRBUF_INIT;
+	struct strbuf packname = STRBUF_INIT;
+	int dirlen;
+
+	strbuf_addf(&packname, "%s/pack/", ctx->obj_dir);
+	dirlen = packname.len;
+	if (ctx->report_progress) {
+		strbuf_addf(&progress_title,
+			    Q_("Finding commits for commit graph in %d pack",
+			       "Finding commits for commit graph in %d packs",
+			       pack_indexes->nr),
+			    pack_indexes->nr);
+		ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		ctx->progress_done = 0;
+	}
+	for (i = 0; i < pack_indexes->nr; i++) {
+		struct packed_git *p;
+		strbuf_setlen(&packname, dirlen);
+		strbuf_addstr(&packname, pack_indexes->items[i].string);
+		p = add_packed_git(packname.buf, packname.len, 1);
+		if (!p) {
+			error(_("error adding pack %s"), packname.buf);
+			return 1;
+		}
+		if (open_pack_index(p)) {
+			error(_("error opening index for %s"), packname.buf);
+			return 1;
+		}
+		for_each_object_in_pack(p, add_packed_commits, ctx,
+					FOR_EACH_OBJECT_PACK_ORDER);
+		close_pack(p);
+		free(p);
+	}
+
+	stop_progress(&ctx->progress);
+	strbuf_release(&progress_title);
+	strbuf_release(&packname);
+
+	return 0;
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -916,42 +961,8 @@ int write_commit_graph(const char *obj_dir,
 	}
 
 	if (pack_indexes) {
-		struct strbuf packname = STRBUF_INIT;
-		int dirlen;
-		strbuf_addf(&packname, "%s/pack/", obj_dir);
-		dirlen = packname.len;
-		if (ctx->report_progress) {
-			strbuf_addf(&progress_title,
-				    Q_("Finding commits for commit graph in %d pack",
-				       "Finding commits for commit graph in %d packs",
-				       pack_indexes->nr),
-				    pack_indexes->nr);
-			ctx->progress = start_delayed_progress(progress_title.buf, 0);
-			ctx->progress_done = 0;
-		}
-		for (i = 0; i < pack_indexes->nr; i++) {
-			struct packed_git *p;
-			strbuf_setlen(&packname, dirlen);
-			strbuf_addstr(&packname, pack_indexes->items[i].string);
-			p = add_packed_git(packname.buf, packname.len, 1);
-			if (!p) {
-				error(_("error adding pack %s"), packname.buf);
-				res = 1;
-				goto cleanup;
-			}
-			if (open_pack_index(p)) {
-				error(_("error opening index for %s"), packname.buf);
-				res = 1;
-				goto cleanup;
-			}
-			for_each_object_in_pack(p, add_packed_commits, ctx,
-						FOR_EACH_OBJECT_PACK_ORDER);
-			close_pack(p);
-			free(p);
-		}
-		stop_progress(&ctx->progress);
-		strbuf_reset(&progress_title);
-		strbuf_release(&packname);
+		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
+			goto cleanup;
 	}
 
 	if (commit_hex) {
-- 
gitgitgadget


* [PATCH 07/17] commit-graph: extract fill_oids_from_commit_hex()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 06/17] commit-graph: extract fill_oids_from_packs() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 08/17] commit-graph: extract fill_oids_from_all_packs() Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract fill_oids_from_commit_hex() that reads the given commit
id list and fills the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
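As a rough illustration of the loop being extracted, the sketch below
parses a list of hex strings and appends the valid ones to a growable
array, silently skipping anything malformed. Everything named toy_* is
hypothetical. Two simplifications are assumptions of the sketch, not
properties of the patch: the parser insists on exactly 40 hex digits
(parse_oid_hex() also reports where parsing ended), and there is no
equivalent of lookup_commit_reference_gently(), so nothing checks that
the id names a commit.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_RAWSZ 20

/* Hypothetical stand-in for struct object_id. */
struct toy_oid {
	unsigned char hash[TOY_RAWSZ];
};

struct toy_oid_list {
	struct toy_oid *list;
	size_t nr, alloc;
};

static int toy_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Returns 0 on success, -1 unless hex is exactly 40 hex digits. */
static int toy_parse_hex(const char *hex, struct toy_oid *oid)
{
	size_t i;

	if (strlen(hex) != 2 * TOY_RAWSZ)
		return -1;
	for (i = 0; i < TOY_RAWSZ; i++) {
		int hi = toy_hexval(hex[2 * i]);
		int lo = toy_hexval(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		oid->hash[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

/* Appends every parsable entry to out; returns -1 only if allocation fails. */
static int toy_fill_from_hex(struct toy_oid_list *out,
			     const char **hex, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		struct toy_oid oid;

		if (!hex[i] || toy_parse_hex(hex[i], &oid))
			continue; /* skip missing or malformed input */

		if (out->nr == out->alloc) {
			size_t alloc = out->alloc ? 2 * out->alloc : 16;
			struct toy_oid *tmp =
				realloc(out->list, alloc * sizeof(*tmp));

			if (!tmp)
				return -1;
			out->list = tmp;
			out->alloc = alloc;
		}
		out->list[out->nr++] = oid;
	}
	return 0;
}
```

Skipping bad input instead of failing matches the "continue" in the
extracted loop; a caller that wants strict behavior would have to check
the inputs before calling.
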
 commit-graph.c | 72 ++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 32 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 80c7069aaa..fb25280df1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -912,6 +912,44 @@ static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
 	return 0;
 }
 
+static void fill_oids_from_commit_hex(struct write_commit_graph_context *ctx,
+				      struct string_list *commit_hex)
+{
+	uint32_t i;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	if (ctx->report_progress) {
+		strbuf_addf(&progress_title,
+			    Q_("Finding commits for commit graph from %d ref",
+			       "Finding commits for commit graph from %d refs",
+			       commit_hex->nr),
+			    commit_hex->nr);
+		ctx->progress = start_delayed_progress(
+					progress_title.buf,
+					commit_hex->nr);
+	}
+	for (i = 0; i < commit_hex->nr; i++) {
+		const char *end;
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+		if (commit_hex->items[i].string &&
+		    parse_oid_hex(commit_hex->items[i].string, &oid, &end))
+			continue;
+
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
+			oidcpy(&ctx->oids.list[ctx->oids.nr], &(result->object.oid));
+			ctx->oids.nr++;
+		}
+	}
+	stop_progress(&ctx->progress);
+	strbuf_release(&progress_title);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -965,38 +1003,8 @@ int write_commit_graph(const char *obj_dir,
 			goto cleanup;
 	}
 
-	if (commit_hex) {
-		if (ctx->report_progress) {
-			strbuf_addf(&progress_title,
-				    Q_("Finding commits for commit graph from %d ref",
-				       "Finding commits for commit graph from %d refs",
-				       commit_hex->nr),
-				    commit_hex->nr);
-			ctx->progress = start_delayed_progress(
-						progress_title.buf,
-						commit_hex->nr);
-		}
-		for (i = 0; i < commit_hex->nr; i++) {
-			const char *end;
-			struct object_id oid;
-			struct commit *result;
-
-			display_progress(ctx->progress, i + 1);
-			if (commit_hex->items[i].string &&
-			    parse_oid_hex(commit_hex->items[i].string, &oid, &end))
-				continue;
-
-			result = lookup_commit_reference_gently(ctx->r, &oid, 1);
-
-			if (result) {
-				ALLOC_GROW(ctx->oids.list, ctx->oids.nr + 1, ctx->oids.alloc);
-				oidcpy(&ctx->oids.list[ctx->oids.nr], &(result->object.oid));
-				ctx->oids.nr++;
-			}
-		}
-		stop_progress(&ctx->progress);
-		strbuf_reset(&progress_title);
-	}
+	if (commit_hex)
+		fill_oids_from_commit_hex(ctx, commit_hex);
 
 	if (!pack_indexes && !commit_hex) {
 		if (ctx->report_progress)
-- 
gitgitgadget


* [PATCH 08/17] commit-graph: extract fill_oids_from_all_packs()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 07/17] commit-graph: extract fill_oids_from_commit_hex() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 10/17] commit-graph: extract copy_oids_to_commits() Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract fill_oids_from_all_packs() that reads all pack-files
for commits and fills the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fb25280df1..730d529815 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -950,6 +950,19 @@ static void fill_oids_from_commit_hex(struct write_commit_graph_context *ctx,
 	strbuf_release(&progress_title);
 }
 
+static void fill_oids_from_all_packs(struct write_commit_graph_context *ctx)
+{
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+			_("Finding commits for commit graph among packed objects"),
+			ctx->approx_nr_objects);
+	for_each_packed_object(add_packed_commits, ctx,
+			       FOR_EACH_OBJECT_PACK_ORDER);
+	if (ctx->progress_done < ctx->approx_nr_objects)
+		display_progress(ctx->progress, ctx->approx_nr_objects);
+	stop_progress(&ctx->progress);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1006,17 +1019,8 @@ int write_commit_graph(const char *obj_dir,
 	if (commit_hex)
 		fill_oids_from_commit_hex(ctx, commit_hex);
 
-	if (!pack_indexes && !commit_hex) {
-		if (ctx->report_progress)
-			ctx->progress = start_delayed_progress(
-				_("Finding commits for commit graph among packed objects"),
-				ctx->approx_nr_objects);
-		for_each_packed_object(add_packed_commits, ctx,
-				       FOR_EACH_OBJECT_PACK_ORDER);
-		if (ctx->progress_done < ctx->approx_nr_objects)
-			display_progress(ctx->progress, ctx->approx_nr_objects);
-		stop_progress(&ctx->progress);
-	}
+	if (!pack_indexes && !commit_hex)
+		fill_oids_from_all_packs(ctx);
 
 	close_reachable(ctx);
 
-- 
gitgitgadget


* [PATCH 09/17] commit-graph: extract count_distinct_commits()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 10/17] commit-graph: extract copy_oids_to_commits() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 11/17] commit-graph: extract write_commit_graph_file() Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract count_distinct_commits(), which sorts the oids list, then
iterates through it to count the distinct entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
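The sort-then-scan idea is easy to try in isolation. The sketch below
is a standalone approximation, not the code in this patch: toy_oid and
toy_count_distinct() are hypothetical names, qsort() with memcmp()
stands in for QSORT() with commit_compare(), and progress reporting is
left out. One deliberate difference is that the sketch returns 0 for an
empty list, while the extracted function starts its counter at 1.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_HASH_LEN 20

/* Hypothetical stand-in for struct object_id. */
struct toy_oid {
	unsigned char hash[TOY_HASH_LEN];
};

static int toy_oid_cmp(const void *a, const void *b)
{
	return memcmp(((const struct toy_oid *)a)->hash,
		      ((const struct toy_oid *)b)->hash, TOY_HASH_LEN);
}

/* Sorts list in place and returns the number of distinct entries. */
static uint32_t toy_count_distinct(struct toy_oid *list, uint32_t nr)
{
	uint32_t i, count = 1;

	if (!nr)
		return 0;

	qsort(list, nr, sizeof(*list), toy_oid_cmp);

	/* After sorting, duplicates are adjacent: count each change of value. */
	for (i = 1; i < nr; i++)
		if (toy_oid_cmp(&list[i - 1], &list[i]))
			count++;

	return count;
}
```

Because the list stays sorted afterwards, a later pass can skip
duplicates with the same adjacent comparison, which is what the
"i > 0 && oideq(...)" check in write_commit_graph() relies on.
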
 commit-graph.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 730d529815..f7419c919b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -963,6 +963,27 @@ static void fill_oids_from_all_packs(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, count_distinct = 1;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+			_("Counting distinct commits in commit graph"),
+			ctx->oids.nr);
+	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
+	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+
+	for (i = 1; i < ctx->oids.nr; i++) {
+		display_progress(ctx->progress, i + 1);
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+			count_distinct++;
+	}
+	stop_progress(&ctx->progress);
+
+	return count_distinct;
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1024,19 +1045,7 @@ int write_commit_graph(const char *obj_dir,
 
 	close_reachable(ctx);
 
-	if (ctx->report_progress)
-		ctx->progress = start_delayed_progress(
-			_("Counting distinct commits in commit graph"),
-			ctx->oids.nr);
-	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
-	count_distinct = 1;
-	for (i = 1; i < ctx->oids.nr; i++) {
-		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
-			count_distinct++;
-	}
-	stop_progress(&ctx->progress);
+	count_distinct = count_distinct_commits(ctx);
 
 	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
 		error(_("the commit graph format cannot write %d commits"), count_distinct);
-- 
gitgitgadget



* [PATCH 10/17] commit-graph: extract copy_oids_to_commits()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 08/17] commit-graph: extract fill_oids_from_all_packs() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 09/17] commit-graph: extract count_distinct_commits() Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract copy_oids_to_commits(), which fills the commits list
with the distinct commits from the oids list. During this loop,
it also counts the number of "extra" edges from octopus merges.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
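The "extra" edge accounting can be looked at on its own. The sketch
below is a simplified illustration, not the patch's code:
`struct parent_node` and `extra_edges_for()` are hypothetical names, and
the list type only loosely mirrors `struct commit_list`. It applies the
same rule as the loop being moved: a commit with more than two parents
contributes `num_parents - 1` extra edges, and every other commit
contributes none.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical singly linked parent list. */
struct parent_node {
	struct parent_node *next;
};

/* Number of extra-edge entries a single commit would need. */
static uint32_t extra_edges_for(const struct parent_node *parents)
{
	uint32_t num_parents = 0;
	const struct parent_node *p;

	for (p = parents; p; p = p->next)
		num_parents++;

	/* Only merges with three or more parents contribute extra edges. */
	return num_parents > 2 ? num_parents - 1 : 0;
}
```

Summing this value over all distinct commits gives the quantity the
patch stores in `ctx->num_extra_edges`.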
 commit-graph.c | 57 ++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f7419c919b..16cdd7afb2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -984,6 +984,37 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 	return count_distinct;
 }
 
+static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i;
+	struct commit_list *parent;
+
+	ctx->num_extra_edges = 0;
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+			_("Finding extra edges in commit graph"),
+			ctx->oids.nr);
+	for (i = 0; i < ctx->oids.nr; i++) {
+		int num_parents = 0;
+		display_progress(ctx->progress, i + 1);
+		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+			continue;
+
+		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
+
+		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
+		     parent; parent = parent->next)
+			num_parents++;
+
+		if (num_parents > 2)
+			ctx->num_extra_edges += num_parents - 1;
+
+		ctx->commits.nr++;
+	}
+	stop_progress(&ctx->progress);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -997,7 +1028,6 @@ int write_commit_graph(const char *obj_dir,
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
 	int num_chunks;
-	struct commit_list *parent;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int res = 0;
@@ -1056,30 +1086,7 @@ int write_commit_graph(const char *obj_dir,
 	ctx->commits.alloc = count_distinct;
 	ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc);
 
-	ctx->num_extra_edges = 0;
-	if (ctx->report_progress)
-		ctx->progress = start_delayed_progress(
-			_("Finding extra edges in commit graph"),
-			ctx->oids.nr);
-	for (i = 0; i < ctx->oids.nr; i++) {
-		int num_parents = 0;
-		display_progress(ctx->progress, i + 1);
-		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
-			continue;
-
-		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
-		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
-
-		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
-		     parent; parent = parent->next)
-			num_parents++;
-
-		if (num_parents > 2)
-			ctx->num_extra_edges += num_parents - 1;
-
-		ctx->commits.nr++;
-	}
-	stop_progress(&ctx->progress);
+	copy_oids_to_commits(ctx);
 
 	if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
 		error(_("too many commits to write graph"));
-- 
gitgitgadget



* [PATCH 11/17] commit-graph: extract write_commit_graph_file()
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 09/17] commit-graph: extract count_distinct_commits() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 12/17] Documentation: describe split commit-graphs Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The write_commit_graph() method is too complex, so we are
extracting methods one by one.

Extract write_commit_graph_file() that takes all of the information
in the context struct and writes the data to a commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
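The chunk offset arithmetic that moves into write_commit_graph_file()
may be easier to follow as a standalone function. This is a sketch under
stated assumptions, not the patch's code: `compute_chunk_offsets()` is a
hypothetical helper, and the constants are my reading of the diff (an
8-byte header, 12-byte chunk lookup entries as written by the loop, and
a fanout table of 256 four-byte entries for GRAPH_FANOUT_SIZE).

```c
#include <stdint.h>

#define HEADER_SIZE 8         /* 4-byte signature + 4 single-byte fields */
#define CHUNK_LOOKUP_WIDTH 12 /* 4-byte chunk id + 8-byte offset */
#define FANOUT_SIZE (4 * 256) /* assumed value of GRAPH_FANOUT_SIZE */

/*
 * Fill offsets[0..4] with the start of the fanout, OID lookup, commit
 * data and extra edge chunks, followed by the end of the last chunk.
 * The lookup table has num_chunks + 1 entries because it ends with a
 * terminating entry.
 */
static void compute_chunk_offsets(uint64_t offsets[5], int num_chunks,
				  uint32_t hashsz, uint32_t nr_commits,
				  uint32_t nr_extra_edges)
{
	offsets[0] = HEADER_SIZE +
		     (uint64_t)(num_chunks + 1) * CHUNK_LOOKUP_WIDTH;
	offsets[1] = offsets[0] + FANOUT_SIZE;
	offsets[2] = offsets[1] + (uint64_t)hashsz * nr_commits;
	offsets[3] = offsets[2] + (uint64_t)(hashsz + 16) * nr_commits;
	offsets[4] = offsets[3] + 4 * (uint64_t)nr_extra_edges;
}
```

With three chunks, a 20-byte hash and ten commits, the fanout would
start at byte 56 and the commit data at byte 1280.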
 commit-graph.c | 155 +++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 75 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 16cdd7afb2..7723156964 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1015,21 +1015,91 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-int write_commit_graph(const char *obj_dir,
-		       struct string_list *pack_indexes,
-		       struct string_list *commit_hex,
-		       unsigned int flags)
+static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
-	struct write_commit_graph_context *ctx;
+	uint32_t i;
 	struct hashfile *f;
-	uint32_t i, count_distinct = 0;
-	char *graph_name = NULL;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
-	int num_chunks;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
+	int num_chunks = ctx->num_extra_edges ? 4 : 3;
+
+	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	if (safe_create_leading_directories(ctx->graph_name)) {
+		UNLEAK(ctx->graph_name);
+		error(_("unable to create leading directories of %s"),
+			ctx->graph_name);
+		return errno;
+	}
+
+	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
+	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+
+	hashwrite_be32(f, GRAPH_SIGNATURE);
+
+	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, 0); /* unused padding byte */
+
+	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	if (ctx->num_extra_edges)
+		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
+	else
+		chunk_ids[3] = 0;
+	chunk_ids[4] = 0;
+
+	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
+	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
+	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
+
+	for (i = 0; i <= num_chunks; i++) {
+		uint32_t chunk_write[3];
+
+		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
+		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		hashwrite(f, chunk_write, 12);
+	}
+
+	if (ctx->report_progress) {
+		strbuf_addf(&progress_title,
+			    Q_("Writing out commit graph in %d pass",
+			       "Writing out commit graph in %d passes",
+			       num_chunks),
+			    num_chunks);
+		ctx->progress = start_delayed_progress(
+			progress_title.buf,
+			num_chunks * ctx->commits.nr);
+	}
+	write_graph_chunk_fanout(f, ctx);
+	write_graph_chunk_oids(f, hashsz, ctx);
+	write_graph_chunk_data(f, hashsz, ctx);
+	if (ctx->num_extra_edges)
+		write_graph_chunk_extra_edges(f, ctx);
+	stop_progress(&ctx->progress);
+	strbuf_release(&progress_title);
+
+	close_commit_graph(ctx->r);
+	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	commit_lock_file(&lk);
+
+	return 0;
+}
+
+int write_commit_graph(const char *obj_dir,
+		       struct string_list *pack_indexes,
+		       struct string_list *commit_hex,
+		       unsigned int flags)
+{
+	struct write_commit_graph_context *ctx;
+	uint32_t i, count_distinct = 0;
 	int res = 0;
 
 	if (!commit_graph_compatible(the_repository))
@@ -1096,75 +1166,10 @@ int write_commit_graph(const char *obj_dir,
 
 	compute_generation_numbers(ctx);
 
-	num_chunks = ctx->num_extra_edges ? 4 : 3;
-
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
-	if (safe_create_leading_directories(ctx->graph_name)) {
-		UNLEAK(ctx->graph_name);
-		error(_("unable to create leading directories of %s"),
-			ctx->graph_name);
-		res = errno;
-		goto cleanup;
-	}
-
-	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
-	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
-
-	hashwrite_be32(f, GRAPH_SIGNATURE);
-
-	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0); /* unused padding byte */
-
-	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
-	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
-	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (ctx->num_extra_edges)
-		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
-	else
-		chunk_ids[3] = 0;
-	chunk_ids[4] = 0;
-
-	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
-	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
-	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
-
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunk_ids[i]);
-		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
-		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-	}
-
-	if (ctx->report_progress) {
-		strbuf_addf(&progress_title,
-			    Q_("Writing out commit graph in %d pass",
-			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
-		ctx->progress = start_delayed_progress(
-			progress_title.buf,
-			num_chunks * ctx->commits.nr);
-	}
-	write_graph_chunk_fanout(f, ctx);
-	write_graph_chunk_oids(f, hashsz, ctx);
-	write_graph_chunk_data(f, hashsz, ctx);
-	if (ctx->num_extra_edges)
-		write_graph_chunk_extra_edges(f, ctx);
-	stop_progress(&ctx->progress);
-	strbuf_release(&progress_title);
-
-	close_commit_graph(ctx->r);
-	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
-	commit_lock_file(&lk);
+	res = write_commit_graph_file(ctx);
 
 cleanup:
-	free(graph_name);
+	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
 	free(ctx);
-- 
gitgitgadget



* [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 11/17] commit-graph: extract write_commit_graph_file() Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 17:20   ` SZEDER Gábor
  2019-05-08 15:53 ` [PATCH 13/17] commit-graph: lay groundwork for incremental files Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The design for the split commit-graphs uses file names to force
a "stack" of commit-graph files. This allows incremental writes
without updating the file format.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
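The documentation added by this patch defines a commit's "graph
position" through the intervals [0, X0), [X0, X0 + X1), and so on. As a
rough illustration of that arithmetic, the sketch below maps a global
position back to a level and a position local to that level. The
function name `locate_graph_position()` and the array-of-counts
representation are assumptions made for the example; the patches in this
series use a linked chain of `struct commit_graph` instead.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds the commit counts X0, X1, ... of each level, bottom
 * first. On success, store the level index and the position local to
 * that level and return 0. Return -1 when pos lies beyond the top of
 * the stack.
 */
static int locate_graph_position(const uint32_t *counts, size_t nr_levels,
				 uint64_t pos, size_t *level, uint32_t *local)
{
	uint64_t start = 0; /* first global position of the current level */
	size_t i;

	for (i = 0; i < nr_levels; i++) {
		if (pos < start + counts[i]) {
			*level = i;
			*local = (uint32_t)(pos - start);
			return 0;
		}
		start += counts[i];
	}
	return -1;
}
```

For counts {100, 20, 5}, global position 123 resolves to level 2, local
position 3, which matches the (X0 + X1 + i) rule with i = 3.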
 Documentation/technical/commit-graph.txt | 142 +++++++++++++++++++++++
 1 file changed, 142 insertions(+)
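
The "Merge Strategy" section in the diff below can be sketched as a
small loop. This follows the wording of the documentation (merge when
either condition holds; the cover letter phrases it as both conditions),
and it makes two assumptions that are not in the patch: commit counts
stand in for file sizes, and levels hold disjoint sets of commits so
that counts can simply be added. `choose_merge_target()` is a
hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SPLIT_COMMITS 64000

/*
 * counts[0..nr_levels-1]: commits per existing level, bottom first.
 * new_commits: commits about to be written.
 *
 * Return the number of existing levels left untouched below the file
 * that gets written, and store that file's commit count in
 * *merged_commits. The decision cascades: after absorbing a level, the
 * enlarged set is compared against the next level down.
 */
static size_t choose_merge_target(const uint32_t *counts, size_t nr_levels,
				  uint64_t new_commits,
				  uint64_t *merged_commits)
{
	size_t keep = nr_levels;

	while (keep > 0 &&
	       (2 * new_commits >= counts[keep - 1] ||
		new_commits > MAX_SPLIT_COMMITS)) {
		new_commits += counts[keep - 1];
		keep--;
	}

	*merged_commits = new_commits;
	return keep;
}
```

Under this reading, once the pending set exceeds MAX_SPLIT_COMMITS the
cascade continues all the way down to the base file, since merging only
makes the set larger.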

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..ca1661d2d8 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,148 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Split Commit Graphs
+-------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over
+time, the number of commits added by a fetch operation is much smaller than
+the number of commits in the full history. The split commit-graph feature
+allows for fast writes of new commit data without rewriting the entire
+commit history -- at least, most of the time.
+
+## File Layout
+
+A split commit-graph uses multiple files, and we use a fixed naming
+convention to organize these files. The base commit-graph file is the
+same: `$OBJDIR/info/commit-graph`. The rest of the commit-graph files have
+the format `$OBJDIR/info/commit-graphs/commit-graph-<N>` where N is a
+positive integer. The integers must start at 1 and grow sequentially
+to form a stack of files.
+
+Each `commit-graph-<N>` file has the same format as the `commit-graph`
+file, including a lexicographic list of commit ids. The only difference
+is that this list is considered to be concatenated to the list from
+the lower commit-graphs. As an example, consider this diagram of three
+files:
+
+ +-----------------------+
+ |  commit-graph-2       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  commit-graph-1       |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  commit-graph         |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `commit-graph`, X1 be the number
+of commits in commit-graph-1, and X2 be the number of commits in
+commit-graph-2. If a commit appears in position i in `commit-graph-2`,
+then we interpret this as being the commit in position (X0 + X1 + i),
+and that will be used as its "graph position". The commits in
+commit-graph-2 use these positions to refer to their parents, which
+may be in commit-graph-1 or commit-graph. We can navigate to an
+arbitrary commit in position j by checking its containment in the
+intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
+
+When Git reads from these files, it starts by acquiring a read handle
+on the `commit-graph` file. On success, it continues acquiring read
+handles on the `commit-graph-<N>` files in increasing order. This
+order is important for how we replace the files.
+
+## Merging commit-graph files
+
+If we only added a `commit-graph-<N>` file on every write, we would
+run into a linear search problem through many commit-graph files.
+Instead, we use a merge strategy to decide when the stack should
+collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits
+are added, it is determined by the merge strategy that the files
+should collapse to `commit-graph-1`. Thus, the new commits, the
+commits in `commit-graph-2` and the commits in `commit-graph-1`
+should be combined into a new `commit-graph-1` file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |  commit-graph-2       |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |  commit-graph-1       |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                   commit-graph-1.lock
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  commit-graph         |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined, sorted
+and we write the contents to the `commit-graph-1.lock` file.
+When the file is flushed and ready to swap to `commit-graph-1`,
+we first unlink the files above our target file. This unlinking
+is done from the top of the stack, the reverse direction that
+another process would use to read the stack.
+
+During this time window, another process trying to read the
+commit-graph stack could read `commit-graph-1` before the swap
+but try to read `commit-graph-2` after it is unlinked. That
+process would then believe that this stack is complete, but
+will miss out on the performance benefits of the commits in
+`commit-graph-2`. For this reason, the stack above the
+`commit-graph` file should be small.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the
+commit-graph stack of height N, we default to creating
+a new file at level N + 1. We then decide to merge
+with the Nth level if one of two conditions holds:
+
+  1. The expected file size for level N + 1 is at
+     least half the file size for level N.
+
+  2. Level N + 1 contains more than MAX_SPLIT_COMMITS
+     commits (64,000 commits).
+
+This decision cascades down the levels: when we
+merge a level we create a new set of commits that
+is then compared to the next level.
+
+The first condition bounds the number of levels
+to be logarithmic in the total number of commits.
+The second condition bounds the total number of
+commits in a `commit-graph-N` file and not in
+the `commit-graph` file, preventing significant
+performance issues when the stack merges and another
+process only partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple,
+64,000 for the maximum number of commits) could be
+extracted into config settings for full flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH 13/17] commit-graph: lay groundwork for incremental files
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 12/17] Documentation: describe split commit-graphs Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 14/17] commit-graph: load split commit-graph files Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
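This patch has no description, so here is a hedged sketch of the idea as
I read the diff: each graph records how many commits live in the graphs
beneath it, lookups by id walk from the tip toward the base, and lookups
by position step down until the position falls inside the current
graph. The types and names below (`struct graph_level`,
`search_level()`, `find_in_chain()`, `id_at_position()`) are hypothetical
simplifications that use small integers in place of object ids. They are
not the functions from commit-graph.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct commit_graph. */
struct graph_level {
	const uint32_t *ids;          /* sorted ids stored in this level */
	uint32_t num_commits;         /* number of ids in this level */
	uint32_t num_commits_in_base; /* total ids in all lower levels */
	const struct graph_level *base_graph;
};

/* Binary search within one level; plays the role of bsearch_graph(). */
static int search_level(const struct graph_level *g, uint32_t id,
			uint32_t *pos)
{
	uint32_t lo = 0, hi = g->num_commits;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (g->ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (g->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Walk from the tip toward the base; report the global position. */
static int find_in_chain(const struct graph_level *g, uint32_t id,
			 uint32_t *pos)
{
	uint32_t local;

	for (; g; g = g->base_graph) {
		if (search_level(g, id, &local)) {
			*pos = local + g->num_commits_in_base;
			return 1;
		}
	}
	return 0;
}

/* Opposite direction: resolve a global position to the stored id. */
static int id_at_position(const struct graph_level *g, uint32_t pos,
			  uint32_t *id)
{
	if (!g || pos >= g->num_commits + g->num_commits_in_base)
		return 0;

	/* Positions below num_commits_in_base belong to a lower level. */
	while (pos < g->num_commits_in_base)
		g = g->base_graph;

	*id = g->ids[pos - g->num_commits_in_base];
	return 1;
}
```
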
 commit-graph.c | 66 ++++++++++++++++++++++++++++++++++++++++++++------
 commit-graph.h |  4 +++
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7723156964..f790f44a9c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -371,6 +371,25 @@ static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t
 			    g->chunk_oid_lookup, g->hash_len, pos);
 }
 
+static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
+{
+	if (!g)
+		BUG("NULL commit-graph");
+
+	if (pos < g->num_commits_in_base) {
+		load_oid_from_graph(g->base_graph, pos, oid);
+		return;
+	}
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	pos -= g->num_commits_in_base;
+
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * pos);
+}
+
 static struct commit_list **insert_parent_or_die(struct repository *r,
 						 struct commit_graph *g,
 						 uint64_t pos,
@@ -379,10 +398,10 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	struct commit *c;
 	struct object_id oid;
 
-	if (pos >= g->num_commits)
+	if (pos >= g->num_commits + g->num_commits_in_base)
 		die("invalid parent position %"PRIu64, pos);
 
-	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	load_oid_from_graph(g, pos, &oid);
 	c = lookup_commit(r, &oid);
 	if (!c)
 		die(_("could not find commit %s"), oid_to_hex(&oid));
@@ -393,7 +412,7 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
 	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
-	item->graph_pos = pos;
+	item->graph_pos = pos + g->num_commits_in_base;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
@@ -405,10 +424,25 @@ static int fill_commit_in_graph(struct repository *r,
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data;
 
-	item->object.parsed = 1;
+	if (pos < g->num_commits_in_base)
+		return fill_commit_in_graph(r, item, g->base_graph, pos);
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	/*
+	 * Store the "full" position, but then use the
+	 * "local" position for the rest of the calculation.
+	 */
 	item->graph_pos = pos;
+	pos -= g->num_commits_in_base;
+
+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+
+	item->object.parsed = 1;
 
 	item->maybe_tree = NULL;
 
@@ -452,7 +486,18 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(g, &(item->object.oid), pos);
+		struct commit_graph *cur_g = g;
+		uint32_t pos_in_g;
+
+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &pos_in_g))
+			cur_g = cur_g->base_graph;
+
+		if (cur_g) {
+			*pos = pos_in_g + cur_g->num_commits_in_base;
+			return 1;
+		}
+
+		return 0;
 	}
 }
 
@@ -492,8 +537,13 @@ static struct tree *load_tree_for_commit(struct repository *r,
 					 struct commit *c)
 {
 	struct object_id oid;
-	const unsigned char *commit_data = g->chunk_commit_data +
-					   GRAPH_DATA_WIDTH * (c->graph_pos);
+	const unsigned char *commit_data;
+
+	if (c->graph_pos < g->num_commits_in_base)
+		return load_tree_for_commit(r, g->base_graph, c);
+
+	commit_data = g->chunk_commit_data +
+			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
 
 	hashcpy(oid.hash, commit_data);
 	c->maybe_tree = lookup_tree(r, &oid);
diff --git a/commit-graph.h b/commit-graph.h
index 70f4caf0c7..170920720d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,10 +48,14 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 
+	uint32_t num_commits_in_base;
+	struct commit_graph *base_graph;
+
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
+	const unsigned char *chunk_base_graph;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget



* [PATCH 14/17] commit-graph: load split commit-graph files
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 13/17] commit-graph: lay groundwork for incremental files Derrick Stolee via GitGitGadget
@ 2019-05-08 15:53 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:54 ` [PATCH 15/17] commit-graph: write " Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Starting with commit-graph, load commit-graph files in a
sequence as follows:

  commit-graph
  commit-graph-1
  commit-graph-2
  ...
  commit-graph-N

This loads N + 1 files in order.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
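To make the load order concrete, the sketch below shows the two pieces
of bookkeeping involved: building the path for each level and
accumulating the number of commits below each level. It is illustrative
only. `split_graph_path()` and `fill_base_counts()` are hypothetical
helpers that use `snprintf()` rather than git's `xstrfmt()`, and the
sketch does not open or parse any files.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the path of the given level into buf. Level 0 is the base
 * commit-graph file; higher levels live in the commit-graphs/
 * directory. Returns the snprintf() result.
 */
static int split_graph_path(char *buf, size_t len, const char *obj_dir,
			    uint32_t level)
{
	if (!level)
		return snprintf(buf, len, "%s/info/commit-graph", obj_dir);
	return snprintf(buf, len,
			"%s/info/commit-graphs/commit-graph-%" PRIu32,
			obj_dir, level);
}

/*
 * Given the commit count of each level (bottom first), compute the
 * value each level would see as num_commits_in_base: the running total
 * of all levels beneath it.
 */
static void fill_base_counts(const uint32_t *num_commits, uint32_t *in_base,
			     size_t nr_levels)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < nr_levels; i++) {
		in_base[i] = total;
		total += num_commits[i];
	}
}
```
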
 commit-graph.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f790f44a9c..5f6193277a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,12 @@ char *get_commit_graph_filename(const char *obj_dir)
 	return xstrfmt("%s/info/commit-graph", obj_dir);
 }
 
+static char *get_split_graph_filename(const char *obj_dir,
+				      uint32_t split_count)
+{
+	return xstrfmt("%s/info/commit-graphs/commit-graph-%"PRIu32, obj_dir, split_count);
+}
+
 static uint8_t oid_version(void)
 {
 	return 1;
@@ -289,15 +295,31 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 {
 	char *graph_name;
+	uint32_t split_count = 1;
+	struct commit_graph *g;
 
 	if (r->objects->commit_graph)
 		return;
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	r->objects->commit_graph =
-		load_commit_graph_one(graph_name);
-
+	g = load_commit_graph_one(graph_name);
 	FREE_AND_NULL(graph_name);
+
+	while (g) {
+		g->base_graph = r->objects->commit_graph;
+
+		if (g->base_graph)
+			g->num_commits_in_base = g->base_graph->num_commits +
+						 g->base_graph->num_commits_in_base;
+
+		r->objects->commit_graph = g;
+
+		graph_name = get_split_graph_filename(obj_dir, split_count);
+		g = load_commit_graph_one(graph_name);
+		FREE_AND_NULL(graph_name);
+
+		split_count++;
+	}
 }
 
 /*
@@ -411,8 +433,15 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
-	item->graph_pos = pos + g->num_commits_in_base;
+	const unsigned char *commit_data;
+
+	if (pos < g->num_commits_in_base) {
+		fill_commit_graph_info(item, g->base_graph, pos);
+		return;
+	}
+
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * (pos - g->num_commits_in_base);
+	item->graph_pos = pos;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
 
-- 
gitgitgadget



* [PATCH 15/17] commit-graph: write split commit-graph files
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2019-05-08 15:53 ` [PATCH 14/17] commit-graph: load split commit-graph files Derrick Stolee via GitGitGadget
@ 2019-05-08 15:54 ` " Derrick Stolee via GitGitGadget
  2019-05-08 15:54 ` [PATCH 16/17] commit-graph: add --split option Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:54 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
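This patch also has no description. The block that repeats three times
in the chunk writers appears to translate a parent reference into a
global graph position: a parent found among the commits being written
is shifted by the number of commits in the lower levels, and a parent
not found there is looked up in the base graph instead. The sketch below
isolates that decision. `parent_edge_value()` and `EDGE_MISSING` are
hypothetical names, and the two lookups are passed in as precomputed
results rather than performed.

```c
#include <stdint.h>

#define EDGE_MISSING (-1)

/*
 * local_index: index of the parent among the commits being written,
 *              or negative if it is not part of the new file.
 * base_pos:    global position of the parent in the lower levels, or
 *              negative if it is not there either.
 * new_num_commits_in_base: commits in all levels below the new file.
 */
static int64_t parent_edge_value(int local_index, int64_t base_pos,
				 uint32_t new_num_commits_in_base)
{
	if (local_index >= 0)
		return (int64_t)local_index + new_num_commits_in_base;
	if (base_pos >= 0)
		return base_pos;
	return EDGE_MISSING; /* the patch treats this case as a BUG() */
}
```
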
 commit-graph.c          | 248 +++++++++++++++++++++++++++++++++++++++-
 commit-graph.h          |   1 +
 t/t5318-commit-graph.sh |   2 +-
 3 files changed, 244 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5f6193277a..44448aabe4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -40,6 +40,8 @@
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
+#define MAX_SPLIT_COMMITS 64000
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -619,9 +621,14 @@ struct write_commit_graph_context {
 	unsigned long approx_nr_objects;
 	struct progress *progress;
 	int progress_done;
+	int num_commit_graphs_before;
+	int num_commit_graphs_after;
+	uint32_t new_num_commits_in_base;
+	struct commit_graph *new_base_graph;
 	uint64_t progress_cnt;
 	unsigned append:1,
-		 report_progress:1;
+		 report_progress:1,
+		 split:1;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -691,6 +698,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.nr,
 					      commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -711,6 +728,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.list,
 					      ctx->commits.nr,
 					      commit_to_sha1);
+
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -768,6 +796,16 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 						  ctx->commits.nr,
 						  commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -782,7 +820,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
-static int commit_compare(const void *_a, const void *_b)
+static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
 	const struct object_id *b = (const struct object_id *)_b;
@@ -859,7 +897,13 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
-		if (commit && !parse_commit_no_graph(commit))
+		if (!commit)
+			continue;
+		if (ctx->split) {
+			if (!parse_commit(commit) &&
+			    commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+				add_missing_parents(ctx, commit);
+		} else if (!parse_commit_no_graph(commit))
 			add_missing_parents(ctx, commit);
 	}
 	stop_progress(&ctx->progress);
@@ -1051,12 +1095,20 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 			_("Counting distinct commits in commit graph"),
 			ctx->oids.nr);
 	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) {
+			if (ctx->split) {
+				struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+				if (!c || c->graph_pos != COMMIT_NOT_FROM_GRAPH)
+					continue;
+			}
+
 			count_distinct++;
+		}
 	}
 	stop_progress(&ctx->progress);
 
@@ -1079,7 +1131,13 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
+		ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
 		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+		if (ctx->split &&
+		    ctx->commits.list[ctx->commits.nr]->graph_pos != COMMIT_NOT_FROM_GRAPH)
+			continue;
+
 		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
 		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
@@ -1105,7 +1163,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = ctx->num_extra_edges ? 4 : 3;
 
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	if (ctx->num_commit_graphs_after > 1)
+		ctx->graph_name = get_split_graph_filename(
+					ctx->obj_dir,
+					ctx->num_commit_graphs_after - 1);
+	else
+		ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+
 	if (safe_create_leading_directories(ctx->graph_name)) {
 		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
@@ -1167,11 +1231,166 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	close_commit_graph(ctx->r);
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
+	while (ctx->num_commit_graphs_before > ctx->num_commit_graphs_after) {
+		char *graph_name = get_split_graph_filename(
+					ctx->obj_dir,
+					--ctx->num_commit_graphs_before);
+		unlink(graph_name);
+		free(graph_name);
+	}
+
 	commit_lock_file(&lk);
 
 	return 0;
 }
 
+static size_t expected_commit_graph_size(size_t num_commits)
+{
+	return GRAPH_HEADER_SIZE + GRAPH_FANOUT_SIZE + 6 * GRAPH_CHUNKLOOKUP_WIDTH +
+		(num_commits + 1) * (GRAPH_DATA_WIDTH + the_hash_algo->rawsz);
+}
+
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t num_commits = ctx->commits.nr;
+	size_t expected_size = expected_commit_graph_size(num_commits);
+
+	ctx->num_commit_graphs_before = 0;
+	while (g) {
+		ctx->num_commit_graphs_before++;
+		g = g->base_graph;
+	}
+
+	g = ctx->r->objects->commit_graph;
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	while (g && (g->data_len <= 2 * expected_size || num_commits > MAX_SPLIT_COMMITS)) {
+		num_commits += g->num_commits;
+		expected_size = expected_commit_graph_size(num_commits);
+		g = g->base_graph;
+		ctx->num_commit_graphs_after--;
+	}
+}
+
+static void merge_commit_graph(struct write_commit_graph_context *ctx,
+			       struct commit_graph *g)
+{
+	uint32_t i;
+	uint32_t offset = g->num_commits_in_base;
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+
+		load_oid_from_graph(g, i + offset, &oid);
+
+		/* only add commits if they still exist in the repo */
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
+			ctx->commits.list[ctx->commits.nr] = result;
+			ctx->commits.nr++;
+		}
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit **)_a;
+	const struct commit *b = *(const struct commit **)_b;
+	return oidcmp(&a->object.oid, &b->object.oid);
+}
+
+static void deduplicate_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, num_parents, last_distinct = 0, duplicates = 0;
+	struct commit_list *parent;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("De-duplicating merged commits"),
+					ctx->commits.nr);
+
+	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+
+	ctx->num_extra_edges = 0;
+	for (i = 1; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i);
+
+		if (oideq(&ctx->commits.list[last_distinct]->object.oid,
+			  &ctx->commits.list[i]->object.oid)) {
+			duplicates++;
+		} else {
+			if (duplicates)
+				ctx->commits.list[last_distinct + 1] = ctx->commits.list[i];
+			last_distinct++;
+
+			num_parents = 0;
+			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
+				num_parents++;
+
+			if (num_parents > 2)
+				ctx->num_extra_edges += num_parents - 2;
+		}
+	}
+
+	ctx->commits.nr -= duplicates;
+	stop_progress(&ctx->progress);
+}
+
+static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t current_graph_number = ctx->num_commit_graphs_before;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
+		current_graph_number--;
+
+		if (ctx->report_progress) {
+			if (current_graph_number)
+				strbuf_addf(&progress_title,
+					    _("Merging commit-graph-%d"),
+					    current_graph_number);
+			else
+				strbuf_addstr(&progress_title,
+					      _("Merging commit-graph"));
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		}
+
+		merge_commit_graph(ctx, g);
+		stop_progress(&ctx->progress);
+		strbuf_release(&progress_title);
+
+		g = g->base_graph;
+	}
+
+	if (g) {
+		ctx->new_base_graph = g;
+		ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+	}
+
+	deduplicate_commits(ctx);
+}
+
+static void collapse_all_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+
+	ctx->num_commit_graphs_after = 1;
+	ctx->num_commit_graphs_before = 0;
+
+	while (g) {
+		ctx->num_commit_graphs_before++;
+		g = g->base_graph;
+	}
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1189,10 +1408,17 @@ int write_commit_graph(const char *obj_dir,
 	ctx->obj_dir = obj_dir;
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+
+	if (ctx->split)
+		prepare_commit_graph(ctx->r);
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
+	if (ctx->split && ctx->oids.alloc > MAX_SPLIT_COMMITS)
+		ctx->oids.alloc = MAX_SPLIT_COMMITS;
+
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
 		if (ctx->r->objects->commit_graph)
@@ -1243,6 +1469,16 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
+
+		merge_commit_graphs(ctx);
+	} else
+		collapse_all_commit_graphs(ctx);
+
 	compute_generation_numbers(ctx);
 
 	res = write_commit_graph_file(ctx);
diff --git a/commit-graph.h b/commit-graph.h
index 170920720d..7a39ae2278 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -71,6 +71,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
+#define COMMIT_GRAPH_SPLIT      (1 << 2)
 
 int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index e80c1cac02..d621608500 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -20,7 +20,7 @@ test_expect_success 'verify graph with no graph file' '
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
-	test_path_is_file info/commit-graph
+	test_path_is_missing info/commit-graph
 '
 
 test_expect_success 'create commits and repack' '
-- 
gitgitgadget



* [PATCH 16/17] commit-graph: add --split option
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2019-05-08 15:54 ` [PATCH 15/17] commit-graph: write " Derrick Stolee via GitGitGadget
@ 2019-05-08 15:54 ` Derrick Stolee via GitGitGadget
  2019-05-08 15:54 ` [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:54 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c  | 10 +++++++---
 t/t5318-commit-graph.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 828b1a713f..c2c07d3917 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -35,9 +35,9 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int split;
 } opts;
 
-
 static int graph_verify(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -156,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "split", &opts.split,
+			N_("allow writing an incremental commit-graph file")),
 		OPT_END(),
 	};
 
@@ -169,6 +171,8 @@ static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
+	if (opts.split)
+		flags |= COMMIT_GRAPH_SPLIT;
 
 	read_replace_refs = 0;
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d621608500..9dfd4cc9b1 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -260,6 +260,32 @@ test_expect_success 'check that gc computes commit-graph' '
 	test_cmp_bin commit-graph-after-gc $objdir/info/commit-graph
 '
 
+test_expect_success 'write split commit-graph' '
+	cd "$TRASH_DIRECTORY" &&
+	git clone full split &&
+	cd split &&
+	git config core.commitGraph true &&
+	for i in $(test_seq 1 20); do
+		test_commit padding-$i
+	done &&
+	git commit-graph write --reachable &&
+	test_commit split-commit &&
+	git branch -f split-commit &&
+	git commit-graph write --reachable --split &&
+	test_path_is_file .git/objects/info/commit-graphs/commit-graph-1
+'
+
+graph_git_behavior 'split graph, split-commit vs merge 1' bare split-commit merge/1
+
+test_expect_success 'collapse split commit-graph' '
+	cd "$TRASH_DIRECTORY/split" &&
+	git commit-graph write --reachable &&
+	test_path_is_missing .git/objects/info/commit-graphs/commit-graph-1 &&
+	test_path_is_file .git/objects/info/commit-graph
+'
+
+graph_git_behavior 'collapsed graph, split-commit vs merge 1' bare split-commit merge/1
+
 test_expect_success 'replace-objects invalidates commit-graph' '
 	cd "$TRASH_DIRECTORY" &&
 	test_when_finished rm -rf replace &&
-- 
gitgitgadget



* [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2019-05-08 15:54 ` [PATCH 16/17] commit-graph: add --split option Derrick Stolee via GitGitGadget
@ 2019-05-08 15:54 ` Derrick Stolee via GitGitGadget
  2019-05-09  8:07   ` Ævar Arnfjörð Bjarmason
  2019-05-08 19:27 ` [PATCH 00/17] [RFC] Commit-graph: Write incremental files Ævar Arnfjörð Bjarmason
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  18 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:54 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/fetch.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index b620fd54b4..cf0944bad5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -23,6 +23,7 @@
 #include "packfile.h"
 #include "list-objects-filter-options.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 static const char * const builtin_fetch_usage[] = {
 	N_("git fetch [<options>] [<repository> [<refspec>...]]"),
@@ -62,6 +63,7 @@ static const char *submodule_prefix = "";
 static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
 static int recurse_submodules_default = RECURSE_SUBMODULES_ON_DEMAND;
 static int shown_url = 0;
+static int fetch_write_commit_graph = 0;
 static struct refspec refmap = REFSPEC_INIT_FETCH;
 static struct list_objects_filter_options filter_options;
 static struct string_list server_options = STRING_LIST_INIT_DUP;
@@ -79,6 +81,11 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writecommitgraph")) {
+		fetch_write_commit_graph = git_config_bool(k, v);
+		return 0;
+	}
+
 	if (!strcmp(k, "submodule.recurse")) {
 		int r = git_config_bool(k, v) ?
 			RECURSE_SUBMODULES_ON : RECURSE_SUBMODULES_OFF;
@@ -1670,6 +1677,16 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 
 	string_list_clear(&list, 0);
 
+	if (fetch_write_commit_graph) {
+		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
+
+		if (progress)
+			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
+
+		write_commit_graph_reachable(get_object_directory(),
+					     commit_graph_flags);
+	}
+
 	close_all_packs(the_repository->objects);
 
 	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
-- 
gitgitgadget


* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 15:53 ` [PATCH 12/17] Documentation: describe split commit-graphs Derrick Stolee via GitGitGadget
@ 2019-05-08 17:20   ` SZEDER Gábor
  2019-05-08 19:00     ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: SZEDER Gábor @ 2019-05-08 17:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, Junio C Hamano,
	Derrick Stolee

On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The design for the split commit-graphs uses file names to force
> a "stack" of commit-graph files. This allows incremental writes
> without updating the file format.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 142 +++++++++++++++++++++++
>  1 file changed, 142 insertions(+)
> 
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index fb53341d5e..ca1661d2d8 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -127,6 +127,148 @@ Design Details
>    helpful for these clones, anyway. The commit-graph will not be read or
>    written when shallow commits are present.
>  
> +Split Commit Graphs
> +-------------------
> +
> +Typically, repos grow with near-constant velocity (commits per day). Over
> +time, the number of commits added by a fetch operation is much smaller than
> +the number of commits in the full history. The split commit-graph feature
> +allows for fast writes of new commit data without rewriting the entire
> +commit history -- at least, most of the time.
> +
> +## File Layout
> +
> +A split commit-graph uses multiple files, and we use a fixed naming
> +convention to organize these files. The base commit-graph file is the
> +same: `$OBJDIR/info/commit-graph`. The rest of the commit-graph files have
> +the format `$OBJDIR/info/commit-graphs/commit-graph-<N>` where N is a
> +positive integer. The integers must start at 1 and grow sequentially
> +to form a stack of files.
> +
> +Each `commit-graph-<N>` file has the same format as the `commit-graph`
> +file, including a lexicographic list of commit ids. The only difference
> +is that this list is considered to be concatenated to the list from
> +the lower commit-graphs. As an example, consider this diagram of three
> +files:
> +
> + +-----------------------+
> + |  commit-graph-2       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |  commit-graph-1       |
> + |                       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  commit-graph         |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +Let X0 be the number of commits in `commit-graph`, X1 be the number
> +of commits in commit-graph-1, and X2 be the number of commits in
> +commit-graph-2. If a commit appears in position i in `commit-graph-2`,
> +then we interpret this as being the commit in position (X0 + X1 + i),
> +and that will be used as its "graph position". The commits in
> +commit-graph-2 use these positions to refer to their parents, which
> +may be in commit-graph-1 or commit-graph. We can navigate to an
> +arbitrary commit in position j by checking its containment in the
> +intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
> +
> +When Git reads from these files, it starts by acquiring a read handle
> +on the `commit-graph` file. On success, it continues acquiring read
> +handles on the `commit-graph-<N>` files in increasing order. This
> +order is important for how we replace the files.
> +
> +## Merging commit-graph files
> +
> +If we only added a `commit-graph-<N>` file on every write, we would
> +run into a linear search problem through many commit-graph files.
> +Instead, we use a merge strategy to decide when the stack should
> +collapse some number of levels.
> +
> +The diagram below shows such a collapse. As a set of new commits
> +are added, it is determined by the merge strategy that the files
> +should collapse to `commit-graph-1`. Thus, the new commits, the
> +commits in `commit-graph-2` and the commits in `commit-graph-1`
> +should be combined into a new `commit-graph-1` file.
> +
> +			    +---------------------+
> +			    |                     |
> +			    |    (new commits)    |
> +			    |                     |
> +			    +---------------------+
> +			    |                     |
> + +-----------------------+  +---------------------+
> + |  commit-graph-2       |->|                     |
> + +-----------------------+  +---------------------+
> +	  |                 |                     |
> + +-----------------------+  +---------------------+
> + |                       |  |                     |
> + |  commit-graph-1       |->|                     |
> + |                       |  |                     |
> + +-----------------------+  +---------------------+
> +	  |                   commit-graph-1.lock
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  commit-graph         |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +During this process, the commits to write are combined, sorted
> +and we write the contents to the `commit-graph-1.lock` file.
> +When the file is flushed and ready to swap to `commit-graph-1`,
> +we first unlink the files above our target file. This unlinking
> +is done from the top of the stack, the reverse direction that
> +another process would use to read the stack.
> +
> +During this time window, another process trying to read the
> +commit-graph stack could read `commit-graph-1` before the swap
> +but try to read `commit-graph-2` after it is unlinked. That
> +process would then believe that this stack is complete, but
> +will miss out on the performance benefits of the commits in
> +`commit-graph-2`. For this reason, the stack above the
> +`commit-graph` file should be small.

Consider the following sequence of events:

  1. There are three commit-graph files in the repository.

  2. A git process opens the base commit-graph and commit-graph-1 for
     reading.  It doesn't yet open commit-graph-2, because the (for
     argument's sake not very fair) scheduler takes the CPU away.

  3. Meanwhile, a 'git fetch', well, fetches from a remote, and
     upon noticing that it got a lot of commits it decides to collapse
     commit-graph-1 and -2 and the new commits, writing a brand new
     commit-graph-1.

  4. A second fetch fetches from a second remote, and writes
     commit-graph-2 (no collapsing this time).

  5. Now the crappy scheduler finally decides that it's time to wake
     up the waiting git process from step 2, which then finds the new
     commit-graph-2 file and opens it for reading.

  6. At this point this poor git process has file handles for:
  
     - the base commit-graph file, which is unchanged.

     - the old commit-graph-1 which has since been replaced, and does
       not yet contain info about the old commit-graph-2 or the
       commits received in the first fetch.

     - the new commit-graph-2, containing info only about commits
       received in the second fetch, and whose parents' graph
       positions point either to the base commit-graph (good, since
       unchanged) or to the new commit-graph-1 (uh-oh).

What happens next?  If this process tries to access the parent of a
commit from commit-graph-2, and the metadata about this parent is in
the new commit-graph-1, then I expect all kinds of weird bugs.

But will a git process ever try to access a commit that didn't yet
exist in the repository when it started opening the commit-graph
files?
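
To make the hazard concrete, here is a rough sketch of how I understand
a global graph position gets resolved to a file in the stack, following
the intervals [0, X0), [X0, X0 + X1), ... from the quoted text.  The
names (struct layer, resolve_position) are made up for illustration and
are not taken from the patch, which walks g->base_graph instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one file in the stack; not git's struct. */
struct layer {
	uint32_t num_commits;	/* commits stored in this file (X_k) */
};

/*
 * Map a global graph position to (layer index, position within layer)
 * by walking the intervals [0, X0), [X0, X0 + X1), ...
 * Layers are ordered bottom-up: stack[0] is the base commit-graph.
 * Returns 0 on success, -1 if pos is past the end of the stack.
 */
static int resolve_position(const struct layer *stack, size_t nr,
			    uint32_t pos, size_t *layer_out,
			    uint32_t *local_out)
{
	uint32_t base = 0;	/* commits in all layers below stack[k] */
	size_t k;

	assert(stack || !nr);
	for (k = 0; k < nr; k++) {
		/* base <= pos holds here, so the subtraction cannot wrap */
		if (pos - base < stack[k].num_commits) {
			*layer_out = k;
			*local_out = pos - base;
			return 0;
		}
		base += stack[k].num_commits;
	}
	return -1;
}
```

With made-up sizes X0 = 100, an old commit-graph-1 of 10 commits and a
new commit-graph-1 of 25 commits, a parent stored at position 112 in the
new commit-graph-2 means "entry 12 of the new commit-graph-1".  A reader
holding the old commit-graph-1 would resolve the same position into
commit-graph-2 itself, which is the kind of silent mix-up I am worried
about.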



> +## Merge Strategy
> +
> +When writing a set of commits that do not exist in the
> +commit-graph stack of height N, we default to creating
> +a new file at level N + 1. We then decide to merge
> +with the Nth level if one of two conditions holds:
> +
> +  1. The expected file size for level N + 1 is at
> +     least half the file size for level N.
> +
> +  2. Level N + 1 contains more than MAX_SPLIT_COMMITS
> +     commits (64,000 commits).
> +
> +This decision cascades down the levels: when we
> +merge a level we create a new set of commits that
> +then compares to the next level.
> +
> +The first condition bounds the number of levels
> +to be logarithmic in the total number of commits.
> +The second condition bounds the total number of
> +commits in a `commit-graph-N` file and not in
> +the `commit-graph` file, preventing significant
> +performance issues when the stack merges and another
> +process only partially reads the previous stack.
> +
> +The merge strategy values (2 for the size multiple,
> +64,000 for the maximum number of commits) could be
> +extracted into config settings for full flexibility.
> +
>  Related Links
>  -------------
>  [0] https://bugs.chromium.org/p/git/issues/detail?id=8
> -- 
> gitgitgadget
> 

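For what it's worth, this is how I read the cascading decision, written
as a standalone sketch modelled on the loop in
split_graph_merge_strategy() from the write patch.  Everything here is
an assumption made for illustration: struct layer_info,
layers_after_write() and BYTES_PER_COMMIT are invented names, and the
size estimate ignores the header, fanout and chunk table that the real
expected_commit_graph_size() adds.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SPLIT_COMMITS 64000
/* Assumed per-commit cost: roughly one data row plus one object id. */
#define BYTES_PER_COMMIT 56

/* Hypothetical description of one existing file in the stack. */
struct layer_info {
	size_t data_len;	/* file size in bytes */
	uint32_t num_commits;	/* commits stored in the file */
};

/* Simplified size estimate; the real one also counts fixed overhead. */
static size_t expected_size(uint32_t num_commits)
{
	return ((size_t)num_commits + 1) * BYTES_PER_COMMIT;
}

/*
 * Given the existing stack ordered from the tip down to the base and
 * the number of new commits, return how many files the stack has after
 * the write.  A level is absorbed when it is at most twice the expected
 * size of what we are about to write, or when we are already writing
 * more than MAX_SPLIT_COMMITS commits.
 */
static size_t layers_after_write(const struct layer_info *top_down,
				 size_t nr, uint32_t new_commits)
{
	uint32_t num_commits = new_commits;
	size_t expected = expected_size(num_commits);
	size_t after = nr + 1;	/* default: push one new level */
	size_t k;

	for (k = 0; k < nr; k++) {
		if (top_down[k].data_len > 2 * expected &&
		    num_commits <= MAX_SPLIT_COMMITS)
			break;
		num_commits += top_down[k].num_commits;
		expected = expected_size(num_commits);
		after--;
	}
	return after;
}
```

If I read the loop correctly, once the commit count exceeds
MAX_SPLIT_COMMITS it can only grow as levels are absorbed, so the second
condition collapses the whole stack into the base file rather than only
the next level down.  That may well be intended, but the text could say
so.
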

* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 17:20   ` SZEDER Gábor
@ 2019-05-08 19:00     ` Derrick Stolee
  2019-05-08 20:11       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee @ 2019-05-08 19:00 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, Junio C Hamano,
	Derrick Stolee

On 5/8/2019 1:20 PM, SZEDER Gábor wrote:
> On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
> Consider the following sequence of events:
> 
>   1. There are three commit-graph files in the repository.
> 
>   2. A git process opens the base commit-graph and commit-graph-1 for
>      reading.  It doesn't yet open commit-graph-2, because the (for
>      argument's sake not very fair) scheduler takes the CPU away.
> 
>   3. Meanwhile, a 'git fetch', well, fetches from a remote, and
>      upon noticing that it got a lot of commits it decides to collapse
>      commit-graph-1 and -2 and the new commits, writing a brand new
>      commit-graph-1.
> 
>   4. A second fetch fetches from a second remote, and writes
>      commit-graph-2 (no collapsing this time).
> 
>   5. Now the crappy scheduler finally decides that it's time to wake
>      up the waiting git process from step 2, which then finds the new
>      commit-graph-2 file and opens it for reading.
> 
>   6. At this point this poor git process has file handles for:
>   
>      - the base commit-graph file, which is unchanged.
> 
>      - the old commit-graph-1 which has since been replaced, and does
>        not yet contain info about the old commit-graph-2 or the
>        commits received in the first fetch.
> 
>      - the new commit-graph-2, containing info only about commits
>        received in the second fetch, and whose parents' graph
>        positions point either to the base commit-graph (good, since
>        unchanged) or to the new commit-graph-1 (uh-oh).
> 
> What happens next?  If this process tries to access the parent of a
> commit from commit-graph-2, and the metadata about this parent is in
> the new commit-graph-1, then I expect all kinds of weird bugs.
> 
> But will a git process ever try to access a commit that didn't yet
> exist in the repository when it started opening the commit-graph
> files?

I'll ignore the improbability of this turn of events (two writes happening
during the span of trying to read two files) and focus on the fact that
we can prevent issues here using the 4th TODO item in my cover letter:

 4. It would be helpful to add a new optional chunk that contains the
    trailing hash for the lower level of the commit-graph stack. This chunk
    would only be for the commit-graph-N files, and would provide a simple
    way to check that the stack is valid on read, in case we are still
    worried about other processes reading/writing in the wrong order.

If we have this chunk -- you have convinced me that we need it -- then we
could ignore the "new" commit-graph-2 because its base graph hash does not
match. We can continue without dying because we can always parse the "missing"
commits from the packs.
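
As a rough sketch of the read-side check (the chunk does not exist yet,
so struct graph_file, its fields and usable_levels() are all
hypothetical names, and HASH_LEN assumes SHA-1), the reader would keep
only the prefix of the stack whose recorded base hashes line up:

```c
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20	/* assumed: SHA-1 */

/* Hypothetical view of one opened file in the stack. */
struct graph_file {
	/* checksum stored at the end of this file */
	unsigned char trailing_hash[HASH_LEN];
	/* proposed chunk: trailing hash of the level below (unused at level 0) */
	unsigned char base_hash[HASH_LEN];
};

/*
 * stack[0] is the base commit-graph, stack[k] is commit-graph-<k>.
 * Return how many levels, counted from the bottom, form a consistent
 * chain.  Levels at or above the first mismatch are ignored.
 */
static size_t usable_levels(const struct graph_file *stack, size_t nr)
{
	size_t k;

	if (!nr)
		return 0;
	for (k = 1; k < nr; k++)
		if (memcmp(stack[k].base_hash,
			   stack[k - 1].trailing_hash, HASH_LEN))
			break;
	return k;
}
```

In your scenario the reader holds the old commit-graph-1 while the new
commit-graph-2 records the hash of the new commit-graph-1, so the check
stops at two usable levels.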

Thanks,
-Stolee


* Re: [PATCH 00/17] [RFC] Commit-graph: Write incremental files
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2019-05-08 15:54 ` [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting Derrick Stolee via GitGitGadget
@ 2019-05-08 19:27 ` Ævar Arnfjörð Bjarmason
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  18 siblings, 0 replies; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-08 19:27 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, git, jrnieder, steadmon, Junio C Hamano


On Wed, May 08 2019, Derrick Stolee via GitGitGadget wrote:

> This patch series is marked as RFC quality because it is missing some key
> features and tests, but hopefully starts a concrete discussion of how the
> incremental commit-graph writes can work. If this is a good direction, then
> it would replace ds/commit-graph-format-v2.

I have some comments on 12/17 that I'll put there.

I think it would be best to start by submitting 1-11 for inclusion so we
can get minor cleanups/refactoring out of the way. I've only skimmed
those patches, but they seem to be obviously correct, although the diff
move detection (and with -w) doesn't help much with them.

This next bit sounds petty, but I honestly don't mean it that way :)

One minor thing I want to note is 04/17. The change itself I 100% agree
on (in-tree docs are bad places for TODO lists), but the commit message
*still* says that a "verify" is just as slow as "write", even though I
noted a ~40% difference in [1].

Do I care about that tiny isolated thing? Nope. But I *do* think it's
indicative of a general thing that could be improved in these RFC
iterations that I found a bit frustrating in reading through
it. I.e. you're getting some of the "C[omments]", but then there's
re-rolled patches that don't address those things.

What we say in the commit message for 4/17 obviously doesn't matter much
at all. But there's other outstanding feedback on the last iteration
that from reading this one still mostly/entirely applies.

So I'll just leave this reply at "I have a lot of comments", but that
they're still sitting there.

1. https://public-inbox.org/git/87o94mql0a.fsf@evledraar.gmail.com/


* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 19:00     ` Derrick Stolee
@ 2019-05-08 20:11       ` Ævar Arnfjörð Bjarmason
  2019-05-09  4:49         ` Junio C Hamano
  2019-05-09 13:45         ` Derrick Stolee
  0 siblings, 2 replies; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-08 20:11 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee


On Wed, May 08 2019, Derrick Stolee wrote:

> On 5/8/2019 1:20 PM, SZEDER Gábor wrote:
>> On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>> Consider the following sequence of events:
>>
>>   1. There are three commit-graph files in the repository.
>>
>>   2. A git process opens the base commit-graph and commit-graph-1 for
>>      reading.  It doesn't yet open commit-graph-2, because the (for
>>      argument's sake not very fair) scheduler takes the CPU away.
>>
>>   3. Meanwhile, a 'git fetch', well, fetches from a remote, and
>>      upon noticing that it got a lot of commits it decides to collapse
>>      commit-graph-1 and -2 and the new commits, writing a brand new
>>      commit-graph-1.
>>
>>   4. A second fetch fetches from a second remote, and writes
>>      commit-graph-2 (no collapsing this time).
>>
>>   5. Now the crappy scheduler finally decides that it's time to wake
>>      up the waiting git process from step 2, which then finds the new
>>      commit-graph-2 file and opens it for reading.
>>
>>   6. At this point this poor git process has file handles for:
>>
>>      - the base commit-graph file, which is unchanged.
>>
>>      - the old commit-graph-1 which has since been replaced, and does
>>        not yet contain info about the old commit-graph-2 or the
>>        commits received in the first fetch.
>>
>>      - the new commit-graph-2, containing info only about commits
>>        received in the second fetch, and whose parents' graph
>>        positions point either to the base commit-graph (good, since
>>        unchanged) or to the new commit-graph-1 (uh-oh).
>>
>> What happens next?  If this process tries to access the parent of a
>> commit from commit-graph-2, and the metadata about this parent is in
>> the new commit-graph-1, then I expect all kinds of weird bugs.
>>
>> But will a git process ever try to access a commit that didn't yet
>> exist in the repository when it started opening the commit-graph
>> files?
>
> I'll ignore the improbability of this turn of events (two writes happening
> during the span of trying to read two files) and focus on the fact that
> we can prevent issues here using the 4th TODO item in my cover letter:

FWIW the sort of scenario SZEDER is describing is something I deal with
in production a lot. It doesn't require an unfair scheduler, just that
you have differently nice(1)'d processes accessing the same repo.

So if you have batch "cron" processes their IO scheduling follows their
nice(1) scheduling. It's not atypical to e.g. have some background
thingy sit for seconds or even minutes on an I/O syscall while the
kernel decides everyone else has right of way, since you nice'd that not
caring if it finishes in 10 seconds or 10 hours.

>  4. It would be helpful to add a new optional chunk that contains the
>     trailing hash for the lower level of the commit-graph stack. This chunk
>     would only be for the commit-graph-N files, and would provide a simple
>     way to check that the stack is valid on read, in case we are still
>     worried about other processes reading/writing in the wrong order.
>
> If we have this chunk -- you have convinced me that we need it -- then we
> could ignore the "new" commit-graph-2 because its base graph hash does not
> match. We can continue without dying because we can always parse the "missing"
> commits from the packs.

Having read the actual code for this & what the format will look like
(last time I didn't have that context[1]) I really don't get why we
don't save ourselves a lot of trouble and do away with this "N" in the
filename idea from the start.

It's just a slight change to the iteration in
prepare_commit_graph_one().

Instead of looping through files N at a time we'd have a discovery step
where we'd need to open() all the files, see which ones say "my parent
has hash X", and then create a list of those hashes in order to read a
bunch of commit-graph-<HASH> files.

Is that a bit painful? Sure, but way less painful than dealing with the
caveats I'd mentioned in [1] and SZEDER details here.

And the obvious thing would be to save ourselves most of that work every
time we read by writing a .git/objects/commit-graphs/info on
"commit-graph write", which would be the <HASH> of the end of the
"latest" chain. We could also have some side-index listing the whole
"current" chain in order (but I'm more paranoid about locking/updates to
such a thing, maybe we could put it in the last file in a new chunk
....).

If we didn't have such side-indexes then the way the current loading
works it would need to traverse the files back to front, and *then* load
them front to back (as it does now), so that's a slight pain,
obviously. But not a big deal.
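
A minimal sketch of that back-to-front discovery, assuming each
commit-graph-<HASH> file records the hash of the layer below it. The
struct, the helper names and the MAX_CHAIN bound are hypothetical and
only illustrate the traversal; nothing here is the actual
commit-graph.c code.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHAIN 16 /* hypothetical bound; also stops on cyclic links */

/* Hypothetical in-memory stand-in for one commit-graph-<HASH> file. */
struct graph_file {
	const char *hash;      /* trailing hash that names this file */
	const char *base_hash; /* hash of the layer below; NULL for the base */
};

static const struct graph_file *find_graph(const struct graph_file *files,
					   size_t nr, const char *hash)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(files[i].hash, hash))
			return &files[i];
	return NULL;
}

/*
 * Walk from the tip back to the base, then reverse the result so that
 * chain[0] is the base and the last entry is the tip. Returns the chain
 * length, or -1 if a link is missing or the chain exceeds MAX_CHAIN.
 */
static int resolve_chain(const struct graph_file *files, size_t nr,
			 const char *tip_hash,
			 const struct graph_file **chain)
{
	const char *want = tip_hash;
	int len = 0, i;

	while (want) {
		const struct graph_file *g = find_graph(files, nr, want);

		if (!g || len == MAX_CHAIN)
			return -1;
		chain[len++] = g;
		want = g->base_hash;
	}
	for (i = 0; i < len / 2; i++) {
		const struct graph_file *tmp = chain[i];

		chain[i] = chain[len - 1 - i];
		chain[len - 1 - i] = tmp;
	}
	return len;
}
```

With a side-index that names the tip, only the files on the chain need
to be opened; without one, every file in the directory has to be
inspected to build the lookup table first.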

With commit-graph-<HASH> all these unlink() race conditions go away,
partial reads due to concurrent graph writing becomes a non-issue (we'd
just leave the old files, "gc" deals with them later..), no need to
carefully fsync() files/dirs etc as we need to carefully juggle N and
N+1 files.

It also becomes easy to "chain" graphs across repos e.g. via
alternates. Say in the scenario github/gitlab have where they have a
"main" repo and other objects on another delta island.

In that case the repo would have a local "tip" file with the last link
in its chain, some of which would then refer back to <HASHes> in other
"parent" alternates.

As long as such a setup has a "gc" process that's not overly eager about
pruning old stuff and considers that constellation of repos as a whole
that should just work. You can freely optimize and rewrite graphs across
repos, just be careful about unlinking old stuff.

I don't see how it would work with commit-graph-N without a *lot* of
painful orchestration (where e.g. you *must* guarantee that the parent
repo ends in N, all child repos start at N+1).

> This allows incremental writes without updating the file format.

FWIW this is some of what I was talking about in [2]. In
ds/commit-graph-format-v2 I had feedback to the effect[3] that the
particular way in which you proposed to update the format (changing the
header) wouldn't be worth it, since old clients dealt with it so badly.

But as noted in [3] I see zero reason for why we can't update the
existing format, we just add new chunks. That allows us to add any new
data in backwards-compatible ways.

I see nothing wrong with a solution that has split files in principle,
just with the currently proposed commit-graph-N way of doing that.

I just wonder if we're looking at a "Y" solution to an "X-Y" problem
where "X" was unduly dismissed. If updating the format was a non-issue
(which seems to me to be the case), what then?

I imagine we'd still have just a "commit-graph" file, to write a new one
we'd "cp" that one, then munge that existing file to write something new
to the end and "mv" it in-place. It seems to me a (sane) split-file plan
is better, but I'm not in your head, maybe it was much better for
reasons I'm not imagining before I apparently talked you out of changing
the format itself :)

1. https://public-inbox.org/git/87bm0jirjj.fsf@evledraar.gmail.com/
2. https://public-inbox.org/git/87r298hhlc.fsf@evledraar.gmail.com/
3. https://public-inbox.org/git/87lfzprkfc.fsf@evledraar.gmail.com/


* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 20:11       ` Ævar Arnfjörð Bjarmason
@ 2019-05-09  4:49         ` Junio C Hamano
  2019-05-09 12:25           ` Derrick Stolee
  2019-05-09 13:45         ` Derrick Stolee
  1 sibling, 1 reply; 136+ messages in thread
From: Junio C Hamano @ 2019-05-09  4:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, SZEDER Gábor,
	Derrick Stolee via GitGitGadget, git, peff, git, jrnieder,
	steadmon, Derrick Stolee

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> With commit-graph-<HASH> all these unlink() race conditions go away,
> partial reads due to concurrent graph writing becomes a non-issue (we'd
> just leave the old files, "gc" deals with them later..), no need to
> carefully fsync() files/dirs etc as we need to carefully juggle N and
> N+1 files.

The above would give a nice course correction to be in line with the
rest of the system, like how split index knows about and chains to
its base.  Thanks for a dose of sanity.




* Re: [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting
  2019-05-08 15:54 ` [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting Derrick Stolee via GitGitGadget
@ 2019-05-09  8:07   ` Ævar Arnfjörð Bjarmason
  2019-05-09 14:21     ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-09  8:07 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee


On Wed, May 08 2019, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/fetch.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index b620fd54b4..cf0944bad5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -23,6 +23,7 @@
>  #include "packfile.h"
>  #include "list-objects-filter-options.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>
>  static const char * const builtin_fetch_usage[] = {
>  	N_("git fetch [<options>] [<repository> [<refspec>...]]"),
> @@ -62,6 +63,7 @@ static const char *submodule_prefix = "";
>  static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
>  static int recurse_submodules_default = RECURSE_SUBMODULES_ON_DEMAND;
>  static int shown_url = 0;
> +static int fetch_write_commit_graph = 0;
>  static struct refspec refmap = REFSPEC_INIT_FETCH;
>  static struct list_objects_filter_options filter_options;
>  static struct string_list server_options = STRING_LIST_INIT_DUP;
> @@ -79,6 +81,11 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
>  		return 0;
>  	}
>
> +	if (!strcmp(k, "fetch.writecommitgraph")) {
> +		fetch_write_commit_graph = git_config_bool(k, v);
> +		return 0;
> +	}
> +
>  	if (!strcmp(k, "submodule.recurse")) {
>  		int r = git_config_bool(k, v) ?
>  			RECURSE_SUBMODULES_ON : RECURSE_SUBMODULES_OFF;
> @@ -1670,6 +1677,16 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>
>  	string_list_clear(&list, 0);
>
> +	if (fetch_write_commit_graph) {
> +		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
> +
> +		if (progress)
> +			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
> +
> +		write_commit_graph_reachable(get_object_directory(),
> +					     commit_graph_flags);
> +	}
> +
>  	close_all_packs(the_repository->objects);
>
>  	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);

I'm keen in general to refactor "git gc --auto" a bit so it moves away
from being an all-or-nothing to something where we can do an
"incremental" gc.

I'm happy to do that work, the main thing it's been blocked on is not
having some fast easy-to-lookup heuristic for one of those "incremental"
things.

The two obvious candidates are the commit-graph (I mainly wanted this on
"gc --auto" after clone), and pack-refs (but doing that is more
expensive).

So rather than have this patch I'd like to, as noted in 00/17, get the
refactoring bits of the commit-graph in first.

Then some version of my WIP patch in
https://public-inbox.org/git/87lfzprkfc.fsf@evledraar.gmail.com/ where
we'd note the number of objects we had when we did the last commit-graph
in the graph itself.

Then "gc --auto" would look at that, then approximate_object_count(),
and have some percentage threshold for doing a "do some of the gc"
task, which would just be a small change to need_to_gc() to make it
return/populate a "what needs to be done" rather than "yes/no".
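
A rough sketch of what a "what needs to be done" return value could
look like. The flag names, the struct and the function are all
hypothetical, not the existing need_to_gc() interface, and the
percentage and loose-object limit are passed in by the caller rather
than read from config.

```c
/* Hypothetical task flags; not the existing need_to_gc() interface. */
#define GC_TASK_COMMIT_GRAPH (1u << 0)
#define GC_TASK_REPACK       (1u << 1)

struct gc_state {
	unsigned long objects_now;           /* e.g. what approximate_object_count() reports */
	unsigned long objects_at_last_graph; /* count recorded in the graph at last write */
	unsigned long loose_objects;
};

/*
 * Return a bitmask of tasks instead of a yes/no answer. The graph task
 * fires once the object count has grown by at least graph_pct percent
 * since the graph was last written; a recorded count of zero (no graph
 * yet) fires as soon as there is any object at all.
 */
static unsigned int gc_tasks_needed(const struct gc_state *s,
				    unsigned int graph_pct,
				    unsigned long loose_limit)
{
	unsigned int tasks = 0;
	unsigned long long grown = 0;

	if (s->objects_now > s->objects_at_last_graph)
		grown = s->objects_now - s->objects_at_last_graph;

	if (grown &&
	    grown * 100 >= (unsigned long long)graph_pct * s->objects_at_last_graph)
		tasks |= GC_TASK_COMMIT_GRAPH;
	if (s->loose_objects > loose_limit)
		tasks |= GC_TASK_REPACK;
	return tasks;
}
```

A caller would then run only the tasks whose bits are set, so a cheap
graph write does not have to wait for a full repack to become due.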

That would give you what you want here, but also be a more general
solution. E.g. we'd write the graph on "clone" once "gc --auto" was
called there, as well as on "fetch".


* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-09  4:49         ` Junio C Hamano
@ 2019-05-09 12:25           ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-05-09 12:25 UTC (permalink / raw)
  To: Junio C Hamano, Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Derrick Stolee

On 5/9/2019 12:49 AM, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
>> With commit-graph-<HASH> all these unlink() race conditions go away,
>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>> just leave the old files, "gc" deals with them later..), no need to
>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>> N+1 files.
> 
> The above would give a nice course correction to be in line with the
> rest of the system, like how split index knows about and chains to
> its base.  Thanks for a dose of sanity.

I'm working on a detailed response to Ævar's ideas, to be sure we are
talking about the same thing, because the original motivation for the
commit-graph format v2 was to allow the 'commit-graph' file to point
to a chain of base files by a list of hashes (like the split index
does). The current proposal was created in response to an unwillingness
to break the file format for the 'commit-graph' file.

-Stolee


* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-08 20:11       ` Ævar Arnfjörð Bjarmason
  2019-05-09  4:49         ` Junio C Hamano
@ 2019-05-09 13:45         ` Derrick Stolee
  2019-05-09 15:48           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 136+ messages in thread
From: Derrick Stolee @ 2019-05-09 13:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 5/8/2019 4:11 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, May 08 2019, Derrick Stolee wrote:
>
>> I'll ignore the improbability of this turn of events (two writes happening
>> during the span of trying to read two files) and focus on the fact that
>> we can prevent issues here using the 4th TODO item in my cover letter:
> 
> FWIW the sort of scenario SZEDER is describing is something I deal with
> in production a lot. It doesn't require an unfair scheduler, just that
> you have differently nice(1)'d processes accessing the same repo.
> 
> So if you have batch "cron" processes their IO scheduling follows their
> nice(1) scheduling. It's not atypical to e.g. have some background
> thingy sit for seconds or even minutes on an I/O syscall while the
> kernel decides everyone else has right of way, since you nice'd that not
> caring if it finishes in 10 seconds or 10 hours.

Thanks. I learned something today.

I still don't think this is a problem, with the idea below.

>>  4. It would be helpful to add a new optional chunk that contains the
>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>     would only be for the commit-graph-N files, and would provide a simple
>>     way to check that the stack is valid on read, in case we are still
>>     worried about other processes reading/writing in the wrong order.
>>
>> If we have this chunk -- you have convinced me that we need it -- then we
>> could ignore the "new" commit-graph-2 because its base graph hash does not
>> match. We can continue without dying because we can always parse the "missing"
>> commits from the packs.

So, let's set that idea aside. You have other concerns.

> Instead of looping through files N at a time we'd have a discovery step
> where we'd need to open() all the files, see which ones say "my parent
> has hash X", and then create a list of those hashes in order to read a
> bunch of commit-graph-<HASH> files.
> 
> Is that a bit painful? Sure, but way less painful than dealing with the
> caveats I'd mentioned in [1] and SZEDER details here.

I don't see how this step is less painful than the one I am describing.
You'll need to be a bit more specific to convince me.

I'll try to be specific with a few ideas that have been thrown around,
so we can compare and contrast (and point out where I am misunderstanding
what you are trying to say).

Option 0 (this series): commit-graph-N
--------------------------------------

On read, we look for the 'info/commit-graph' file and acquire a read handle.
We set that commit_graph struct as our tip commit-graph. Then, for each
increasing N (until we fail) we acquire a read handle on
'info/commit-graphs/commit-graph-N' and check that its base hash matches
our current tip commit-graph. If the file doesn't exist, or the base
hash doesn't match, then we stop and continue with our current tip graph.

On write, use a 'mv' to swap our .lock file with whatever level we are
merging, THEN we can unlink() the higher layers in decreasing order. (This
"mv-then-unlink" order is different than what is implemented by this
series, but is enabled by the chunk containing the base graph hash.)
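
The read side of Option 0 can be sketched roughly as follows. This is
an illustration only: the struct and function are hypothetical, the
"files" are in-memory stand-ins, and the base-hash field models the
chunk proposed in TODO item 4, which this series does not yet write.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one opened file in the stack. */
struct layer {
	const char *hash;      /* trailing hash of this file */
	const char *base_hash; /* hash stored in the proposed base chunk;
				  NULL for level 0 */
};

/*
 * levels[0] models 'info/commit-graph' and levels[N] models
 * 'info/commit-graphs/commit-graph-N'; a NULL entry means the file
 * does not exist. Returns how many levels form a consistent stack,
 * stopping at the first missing file or base-hash mismatch.
 */
static size_t usable_levels(const struct layer **levels, size_t nr)
{
	size_t n;

	if (!nr || !levels[0])
		return 0;
	for (n = 1; n < nr; n++) {
		const struct layer *cur = levels[n];

		if (!cur || !cur->base_hash ||
		    strcmp(cur->base_hash, levels[n - 1]->hash))
			break;
	}
	return n;
}
```

In the interleaving SZEDER describes, the reader holds the old
commit-graph-1 and then opens a commit-graph-2 whose base hash names
the new commit-graph-1, so the loop stops after two levels and the
remaining commits are parsed from the packs.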

Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
--------------------------------------------------------------

On read, we load the 'info/commit-graph' file and inspect the byte saying
how many base files we have. We load their hashes from the base file chunk
and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
to ignore anything "above" the failure in the chain (and specifically this
includes the 'commit-graph' file), or consider reloading the commit-graph
file altogether and hope it works this time. [Note: if an older version
of Git doesn't understand the incremental file format, it will fail to
properly understand the graph positions and either fail with an
"invalid parent position" error, or worse give garbage results.]

On write, if we are creating a new layer to our chain, we need to _copy_
the existing commit-graph file to a graph-{hash}.graph file before renaming
the .lock file. If we are merging layers, then we either (a) clean up the
dangling chains after moving our commit-graph file, or (b) have something
like 'gc' clean up the files later. I think using 'gc' for this is not
a good idea, since I would expect these files to be written and merged
much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
up the dangling chains leads to our concurrency issues. Further, this
'copy the base' no longer keeps our large base file at rest.

Option 2: grovel commit-graphs directory for graph-{hash}.graph
---------------------------------------------------------------

On read, we load the 'info/commit-graph' file and assume it is never
an incremental file. Then, scan the 'info/commit-graphs' directory
for 'graph-{hash}.graph' files and open them _all_ up to construct
a "graph forest" (each graph has a single parent, given by a chunk
specifying its base graph hash). If we don't have an instance of a
graph with a given hash, then ignore any graphs pointing to that hash.
We now have a decision to make: which leaf of this forest should we
use as our tip commit-graph? That could be given by the
commit-graphs/info file. But what happens when we have timing issues
around scanning the directory and the commit-graphs/info file? Do
we fall back to modified time?

On write, if we are not merging, then we just create a new
graph-{hash}.graph file. If we are merging, but still have a base graph,
then create a new graph-{hash}.graph file. Finally, if we are merging
all layers, then we rename our .lock file to 'info/commit-graph'.
To clean up, we need to grovel the directory to look for graph-{hash}.graph
files whose base chains no longer match the new, "best" chain and unlink()
them. This clean-up step can happen at any time.

--[end description of options]--

Did I accurately describe the options we are considering?

Option 1 was the design I was planning, and I think it matches how the
split-index feature works. Please correct me if I am missing something.
It _requires_ updating the file format version. But it also has a flaw
that the other options do not have: the copy of the base file. One
thing I want to enable is for whatever machinery is handling these
file writes to run a 'verify' immediately after, and have that be fast
most of the time. With a model that changes only the "tip" file, we
can verify only the new files and have confidence that the base file
did not change. I think options 0 and 2 both improve in this direction.
 
> With commit-graph-<HASH> all these unlink() race conditions go away,
> partial reads due to concurrent graph writing becomes a non-issue (we'd
> just leave the old files, "gc" deals with them later..), no need to
> carefully fsync() files/dirs etc as we need to carefully juggle N and
> N+1 files.

Calling this a non-issue is an exaggeration, especially if you are
claiming we need to be robust to multi-hour gaps between reading files.

> It also becomes easy to "chain" graphs across repos e.g. via
> alternates. Say in the scenario github/gitlab have where they have a
> "main" repo and other objects on another delta island.
> 
> In that case the repo would have a local "tip" file with the last link
> in its chain, some of which would then refer back to <HASHes> in other
> "parent" alternates.
> 
> As long as such a setup has a "gc" process that's not overly eager about
> pruning old stuff and considers that constellation of repos as a whole
> that should just work. You can freely optimize and rewrite graphs across
> repos, just be careful about unlinking old stuff.
> 
> I don't see how it would work with commit-graph-N without a *lot* of
> painful orchestration (where e.g. you *must* guarantee that the parent
> repo ends in N, all child repos start at N+1).

You're right that Option 0 does not work in this model where some graph
information is stored in an alternate _and_ more information is stored
outside the alternate. My perspective is biased, because I consider the
alternate to be "almost everything" and the local object store to be
small. But in a fork network, this is not always the case. I appreciate
your feedback for this environment, and I've always hoped that someone
with server experience would come and say "this feature is great, but
we need X, Y, and Z to make best use of it in our environment. Here's
a patch that moves us in that direction!" At least you are doing the
next-best thing: stopping me from making mistakes that would block
adoption.

So let's consider how Option 2 would work in this "multi-tip" case.
Each object directory would have some number of graph files, and one
'commit-graphs/info' file pointing to some hash. When we read, we
try to pick the info file that is "closest" to us.

This does create some complications that I don't think you gave enough
attention to. These may be solvable, but they are non-trivial:

* When we 'gc' the "core" repo, we need to enumerate all of the
  "leaf" repos to check their tip commit-graph files and make a
  decision if we should keep their bases around or delete those tips.
  Perhaps I'm over-stating the difficulty here, since we need to do
  something similar to find still-reachable objects, anyway. But if
  we are doing that reachability calculation, then why are we not
  just putting all of the commit-graph data in the core repo? Note
  that we don't get the same value as delta islands because this data
  isn't being shared across the protocol. The issue with storing all
  graph data in the core repo is that the core repo doesn't actually
  have all of the commits, which makes 'verify' on the graph a bit
  annoying.

* If we choose a local tip instead of the "core" tip, then that chain
  of commit-graphs can be far behind the core repo. In the world where
  a fork moves only at the speed of a single developer, but the core
  project moves quickly, then computing a merge base with the core's
  master branch becomes slow as our local chain doesn't contain most
  of the commits.

* We can't take all of the "core" chain _and_ the local chain, because
  the concept of "graph position" no longer makes sense. The only way
  I see out of this is to make the graph position two-dimensional:
  commit -> (index of tip chain, position in that chain). Perhaps this
  is a valuable thing to do in the future? Or perhaps, we shouldn't
  have incremental chains spanning object directories and instead
  introduce "parents-by-ref" where we mark some parents as included
  by object id instead of by graph position. This would allow the
  core repo to gc without caring about the external repos. It also
  wouldn't care about how the graph files are stored (Option 0 would
  work, as graph chains would not cross object store boundaries) and
  more closely resembles the independence of the pack-files in each
  object store. The "parents-by-ref" would require updating the
  file format version.
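
To make the "graph position" point concrete, here is a small sketch of
how a single concatenated position resolves to a layer when there is
exactly one chain. The types and the function name are hypothetical;
the only assumption is that each layer knows how many commits it
stores.

```c
#include <stddef.h>

/* Hypothetical per-layer summary: how many commits the layer stores. */
struct chain_layer {
	unsigned int num_commits;
};

struct layer_pos {
	size_t layer;       /* index into the chain, 0 is the base */
	unsigned int local; /* position inside that layer */
};

/*
 * Translate a position in the concatenated commit order into a
 * (layer, local position) pair. Returns 0 on success, or -1 if pos
 * lies past the end of the chain.
 */
static int split_graph_pos(const struct chain_layer *chain, size_t nr,
			   unsigned int pos, struct layer_pos *out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pos < chain[i].num_commits) {
			out->layer = i;
			out->local = pos;
			return 0;
		}
		pos -= chain[i].num_commits;
	}
	return -1;
}
```

The mapping is only well defined because the layers have one total
order. With both a "core" chain and a local chain loaded, the same
integer would be ambiguous, which is why a second coordinate (or
parents stored by object id) would be needed.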

--[end discussion of incremental files]--

I'll move forward applying your existing feedback on patches 1-11 and
submit as a full series to replace ds/commit-graph-format-v2. We can
work on reviewing that code while we continue to think critically on
the topic of incremental files.

Thanks,
-Stolee




* Re: [PATCH 17/17] fetch: add fetch.writeCommitGraph config setting
  2019-05-09  8:07   ` Ævar Arnfjörð Bjarmason
@ 2019-05-09 14:21     ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-05-09 14:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, peff, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 5/9/2019 4:07 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> So rather than have this patch I'd like to as noted in 00/17 get the
> refactoring bits of the commit-graph in first.

Refactor-only series coming soon.

> Then some version of my WIP patch in
> https://public-inbox.org/git/87lfzprkfc.fsf@evledraar.gmail.com/ where
> we'd note the number of objects we had when we did the last commit-graph
> in the graph itself.
> 
> Then "gc --auto" would look at that, then approximate_object_count(),
> and have some percentage threshold for doing a "do some of the gc"
> task, which would just be a small change to need_to_gc() to make it
> return/populate a "what needs to be done" rather than "yes/no".
> 
> That would give you what you want here, but also be a more general
> solution. E.g. we'd write the graph on "clone" once "gc --auto" was
> called there, as well as on "fetch".

I think this "gc --auto" idea is solid, but I also think there is value
in writing a commit-graph after _every_ fetch, not just one big enough
to trigger this new gc behavior. Perhaps your new metadata in the
commit-graph file could store multiple values for "number of objects since
X was cleaned up" where X is in { packs, reflog, commit-graph, etc. } and 
GC could consider each maintenance task independently. _That_ would make
this patch unnecessary.

Thanks,
-Stolee



* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-09 13:45         ` Derrick Stolee
@ 2019-05-09 15:48           ` Ævar Arnfjörð Bjarmason
  2019-05-09 17:08             ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-09 15:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee


On Thu, May 09 2019, Derrick Stolee wrote:

Not a very detailed reply since I've gotta run soon, but figured I'd
send something (also given our timezone difference).

> On 5/8/2019 4:11 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, May 08 2019, Derrick Stolee wrote:
>>
>>> I'll ignore the improbability of this turn of events (two writes happening
>>> during the span of trying to read two files) and focus on the fact that
>>> we can prevent issues here using the 4th TODO item in my cover letter:
>>
>> FWIW the sort of scenario SZEDER is describing is something I deal with
>> in production a lot. It doesn't require an unfair scheduler, just that
>> you have differently nice(1)'d processes accessing the same repo.
>>
>> So if you have batch "cron" processes their IO scheduling follows their
>> nice(1) scheduling. It's not atypical to e.g. have some background
>> thingy sit for seconds or even minutes on an I/O syscall while the
>> kernel decides everyone else has right of way, since you nice'd that not
>> caring if it finishes in 10 seconds or 10 hours.
>
> Thanks. I learned something today.
>
> I still don't think this is a problem, with the idea below.
>
>>>  4. It would be helpful to add a new optional chunk that contains the
>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>     would only be for the commit-graph-N files, and would provide a simple
>>>     way to check that the stack is valid on read, in case we are still
>>>     worried about other processes reading/writing in the wrong order.
>>>
>>> If we have this chunk -- you have convinced me that we need it -- then we
>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>> match. We can continue without dying because we can always parse the "missing"
>>> commits from the packs.
>
> So, let's set that idea aside. You have other concerns.
>
>> Instead of looping through files N at a time we'd have a discovery step
>> where we'd need to open() all the files, see which ones say "my parent
>> has hash X", and then create a list of those hashes in order to read a
>> bunch of commit-graph-<HASH> files.
>>
>> Is that a bit painful? Sure, but way less painful than dealing with the
>> caveats I'd mentioned in [1] and SZEDER details here.
>
> I don't see how this step is less painful than the one I am describing.
> You'll need to be a bit more specific to convince me.
>
> I'll try to be specific with a few ideas that have been thrown around,
> so we can compare and contrast (and point out where I am misunderstanding
> what you are trying to say).
>
> Option 0 (this series): commit-graph-N
> --------------------------------------
>
> On read, we look for the 'info/commit-graph' file and acquire a read handle.
> We set that commit_graph struct as our tip commit-graph. Then, for each
> increasing N (until we fail) we acquire a read handle on
> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
> our current tip commit-graph. If the file doesn't exist, or the base
> hash doesn't match, then we stop and continue with our current tip graph.

This "base hash" is something I may have missed. I *thought* we'd
happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
and if (e.g. due to fsync/dir entry issues noted earlier) we got the
"wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?

I was trying (and failing) to make the test suite vomit by hacking up
the code to load "commit-graph-1" first, then "commit-graph", so maybe I
ran into that safeguard, whatever it is. I've just seen how we carry the
"num_commits_in_base" forward, not some expected checksum...

> On write, use a 'mv' to swap our .lock file with whatever level we are
> merging, THEN we can unlink() the higher layers in decreasing order. (This
> "mv-then-unlink" order is different than what is implemented by this
> series, but is enabled by the chunk containing the base graph hash.)

So one of the things that make me paranoid about this (as noted
upthread/earlier) is that at least on POSIX filesystems just because you
observe that you create and fsync "foo" and "bar" in that order doesn't
mean that a concurrent reader of that directory will get the updates in
that order. I.e. file update != dir entry update.

It can be made to work, and core.fsyncObjectFiles is a partial solution,
but better to avoid it entirely. Also the failure with
core.fsyncObjectFiles is much more graceful, since in that case we don't
have a "deadbeef[...]" loose object *yet* but will (presumably) see the
*right* thing Real Soon Now since it's content-addressable, whereas with
this design we might see the *wrong* thing. So this would be the first
area of git (that I know of) where we'd be sensitive to a combination of
syncing dir entries and file content in order.

> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
> --------------------------------------------------------------
>
> On read, we load the 'info/commit-graph' file and inspect the byte saying
> how many base files we have. We load their hashes from the base file chunk
> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
> to ignore anything "above" the failure in the chain (and specifically this
> includes the 'commit-graph' file), or consider reloading the commit-graph
> file altogether and hope it works this time. [Note: if an older version
> of Git doesn't understand the incremental file format, it will fail to
> properly understand the graph positions and either fail with an
> "invalid parent position" error, or worse give garbage results.]

Right, except as noted before I'd add (at least per my understanding so
far) "s/needs format v2//". I.e. we could stick this data in new v1
backwards-compatible chunks, which also addresses the "older version..."
caveat.

I.e. we'd always guarantee that the "commit-graph" file was
understandable/valid for older versions (even though eventually their
view on it would be "oh, this has no data", when really it's all in a new
chunk they don't grok).

> On write, if we are creating a new layer to our chain, we need to _copy_
> the existing commit-graph file to a graph-{hash}.graph file before renaming
> the .lock file. If we are merging layers, then we either (a) clean up the
> dangling chains after moving our commit-graph file, or (b) have something
> like 'gc' clean up the files later. I think using 'gc' for this is not
> a good idea, since I would expect these files to be written and merged
> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
> up the dangling chains leads to our concurrency issues. Further, this
> 'copy the base' no longer keeps our large base file at rest.

So aside from the specifics of implementation both this and the
commit-graph-N way of doing it involve juggling a sequence of chunks
that "point upwards", which IMO has worse caveats than "point
downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
new stuff, and there's no (sane) way in option 0 to split commit-graphs
across repos, and no way at all in this scenario (if I understand it
correctly...).

> Option 2: grovel commit-graphs directory for graph-{hash}.graph
> ---------------------------------------------------------------
>
> On read, we load the 'info/commit-graph' file and assume it is never
> an incremental file. Then, scan the 'info/commit-graphs' directory
> for 'graph-{hash}.graph' files and open them _all_ up to construct
> a "graph forest" (each graph has a single parent, given by a chunk
> specifying its base graph hash). If we don't have an instance of a
> graph with a given hash, then ignore any graphs pointing to that hash.
> We now have a decision to make: which leaf of this forest should we
> use as our tip commit-graph? That could be given by the
> commit-graphs/info file. But what happens when we have timing issues
> around scanning the directory and the commit-graphs/info file? Do
> we fall back to modified time?

I think in a worst-case scenario we shrug and just pick whatever chain
looks most recent/largest or whatever, but in practice I think it's a
non-issue.
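
As a rough illustration of that fallback, here is a minimal Python sketch
(not Git's C code). It assumes each graph file records its base hash in a
chunk, models that as a plain dict, and uses "longest intact chain" as a
stand-in for "most recent/largest or whatever". The names chain_for and
pick_tip are hypothetical.

```python
from typing import Dict, List, Optional


def chain_for(tip: str, base_of: Dict[str, Optional[str]]) -> Optional[List[str]]:
    """Return the chain from base to tip, or None if any link is missing."""
    chain: List[str] = []
    seen = set()
    cur: Optional[str] = tip
    while cur is not None:
        if cur not in base_of or cur in seen:  # missing file, or a cycle
            return None
        seen.add(cur)
        chain.append(cur)
        cur = base_of[cur]  # None marks a graph with no base
    chain.reverse()
    return chain


def pick_tip(base_of: Dict[str, Optional[str]],
             preferred: Optional[str] = None) -> List[str]:
    """Use the recorded tip if its chain is intact, else the longest intact chain."""
    if preferred is not None:
        chain = chain_for(preferred, base_of)
        if chain is not None:
            return chain
    candidates = [c for c in (chain_for(h, base_of) for h in sorted(base_of)) if c]
    return max(candidates, key=len, default=[])
```

Here "preferred" plays the role of the hash named in commit-graphs/info;
graphs whose base is absent are ignored, as in the Option 2 description.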

Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
not doing the incremental write in "fetch", but instead just rely on "gc
--auto" *and* to address your "I would expect these files to be written
and merged much more frequently" we'd just (and I'm keen to hack this
up) teach "gc --auto" such an incremental mode.

This means that within a single repo all updates to the commit-graph
will go through the gc.lock, whereas with your current 17/17 we'd
potentially have a "commit-graph write" racing a concurrent "gc", with
both aiming to update the commit-graph.

> On write, if we are not merging, then we just create a new
> graph-{hash}.graph file. If we are merging, but still have a base graph,
> then create a new graph-{hash}.graph file. Finally, if we are merging
> all layers, then we rename our .lock file to 'info/commit-graph'.
> To clean up, we need to grovel the directory to look for graph-{hash}.graph
> files whose base chains no longer match the new, "best" chain and unlink()
> them. This clean-up step can happen at any time.
>
> --[end description of options]--
>
> Did I accurately describe the options we are considering?
>
> Option 1 was the design I was planning, and I think it matches how the
> split-index feature works. Please correct me if I am missing something.
> It _requires_ updating the file format version. But it also has a flaw
> that the other options do not have: the copy of the base file. One
> thing I want to enable is for whatever machinery is handling these
> file writes to run a 'verify' immediately after, and have that be fast
> most of the time. With a model that changes only the "tip" file, we
> can verify only the new files and have confidence that the base file
> did not change. I think options 0 and 2 both improve in this direction.
>
>> With commit-graph-<HASH> all these unlink() race conditions go away,
>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>> just leave the old files, "gc" deals with them later..), no need to
>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>> N+1 files.
>
> Calling this a non-issue is an exaggeration, especially if you are
> claiming we need to be robust to multi-hour gaps between reading files.

We always have a race, but they're very different races.

With "N" files (option #0) we'd have races of the type where within the
same few milliseconds of a commit-graph being merged/updated, a concurrent
command (e.g. "tag --contains") would either be much slower (couldn't
get one of the incremental graphs it expected), or produce the wrong
answer (this may be wrong on my part, see my "base hash" comment above).

Whereas if I run something with "ionice -c 3" I could possibly hang for
however many hours/days/weeks we wait until another "gc" comes along and
unlinks those old files, but if I'm running it like that I'm not
expecting it to be fast, so it's OK if the files went away, and it won't
ever get the wrong file (since the filenames are hash-addressable).

>> It also becomes easy to "chain" graphs across repos e.g. via
>> alternates. Say in the scenario github/gitlab have where they have a
>> "main" repo and other objects on another delta island.
>>
>> In that case the repo would have a local "tip" file with the last link
>> in its chain, some of which would then refer back to <HASHes> in other
>> "parent" alternates.
>>
>> As long as such a setup has a "gc" process that's not overly eager about
>> pruning old stuff and considers that constellation of repos as a whole
>> that should just work. You can freely optimize and rewrite graphs across
>> repos, just be careful about unlinking old stuff.
>>
>> I don't see how it would work with commit-graph-N without a *lot* of
>> painful orchestration (where e.g. you *must* guarantee that the parent
>> repo ends in N, all child repos start at N+1).
>
> You're right that Option 0 does not work in this model where some graph
> information is stored in an alternate _and_ more information is stored
> outside the alternate. My perspective is biased, because I consider the
> alternate to be "almost everything" and the local object store to be
> small. But in a fork network, this is not always the case. I appreciate
> your feedback for this environment, and I've always hoped that someone
> with server experience would come and say "this feature is great, but
> we need X, Y, and Z to make best use of it in our environment. Here's
> a patch that moves us in that direction!" At least you are doing the
> next-best thing: stopping me from making mistakes that would block
> adoption.

I'm happy to write some patches, just want to talk about it first (and
if I'm lucky convince you to write them for me :) ).

One example is the git-for-windows commit-graph is 5.6MB, the git.git
one is 3.1MB. Should GitHub just stick that all in the one "parent"
graph? Maybe. But it's nice to have the flexibility of stacking them.

There's also more disconnected cases, e.g. I have some "staging" boxes
where I have a cronjob running around re-pointing clones of a big
monorepo to a shared "alternates" store where I guarantee objects are
only ever added, never removed.

It would be nice to have a way to provide a commit-graph there that's
"stable" that clients could point to, and they'd just generate the
difference.

I.e. now I have a shared .git/objects which contains gigabytes, a
crapload of stuff in /home where .git/objects is 10-50MB, and each one
has a commit-graph that's around the same 50-100MB size (since it needs
to contain the metadata for the full set).

> So let's consider how Option 2 would work in this "multi-tip" case.
> Each object directory would have some number of graph files, and one
> 'commit-graphs/info' file pointing to some hash. When we read, we
> try to pick the info file that is "closest" to us.
>
> This does create some complications that I don't think you gave enough
> attention to. These may be solvable, but they are non-trivial:
>
> * When we 'gc' the "core" repo, we need to enumerate all of the
>   "leaf" repos to check their tip commit-graph files and make a
>   decision if we should keep their bases around or delete those tips.
>   Perhaps I'm over-stating the difficulty here, since we need to do
>   something similar to find still-reachable objects, anyway. But if
>   we are doing that reachability calculation, then why are we not
>   just putting all of the commit-graph data in the core repo? Note
>   that we don't get the same value as delta islands because this data
>   isn't being shared across the protocol. The issue with storing all
>   graph data in the core repo is that the core repo doesn't actually
>   have all of the commits, which makes 'verify' on the graph a bit
>   annoying.

Yeah I think similar to "alternates" it would be annoying to have a case
where a given repo has metadata on objects it doesn't have, and there's
cases (see the "staging" case I mentioned above) where that "parent"
repo won't have access to those things.

> * If we choose a local tip instead of the "core" tip, then that chain
>   of commit-graphs can be far behind the core repo. In the world where
>   a fork moves only at the speed of a single developer, but the core
>   project moves quickly, then computing a merge base with the core's
>   master branch becomes slow as our local chain doesn't contain most
>   of the commits.

That's a good point and a case where pointing "upwards" or just having
commit-graphs in the "base" repo is better, i.e. the "fork" has almost
no objects.

But solvable by triggering "gc" on these child projects so their
commit-graph keeps being re-pointed to a later version.

And we'd have the reverse problem with git-for-windows, wouldn't we?
I.e. the fork is "far ahead".

> * We can't take all of the "core" chain _and_ the local chain, because
>   the concept of "graph position" no longer makes sense. The only way
>   I see out of this is to make the graph position two-dimensional:
>   commit -> (index of tip chain, position in that chain). Perhaps this
>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>   have incremental chains spanning object directories and instead
>   introduce "parents-by-ref" where we mark some parents as included
>   by object id instead of by graph position. This would allow the
>   core repo to gc without caring about the external repos. It also
>   wouldn't care about how the graph files are stored (Option 0 would
>   work, as graph chains would not cross object store boundaries) and
>   more closely resembles the independence of the pack-files in each
>   object store. The "parents-by-ref" would require updating the
>   file format version.

The parent repo can "gc" without caring/inspecting "child" repos with
the "point downwards" of option #2, as long as it promises to retain its
old commit-graph files for some retention period of X, and the "child"
repos promise to "gc" (and re-point to new graphs if necessary) at a rate
that's faster than that.

This makes it easy to e.g. say "we retain old commit-graph files for 2
weeks", and "we re-gc everything in cron weekly".

It would work best if we can also pull this trick on the "base"
commit-graph file, which I believe we could do in a backwards-compatible
way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
is the current "base".
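
The retention rule above could be sketched roughly as follows. This is a
minimal Python illustration, not a proposed implementation: it assumes the
graph-{hash}.graph naming from Option 2, uses file mtime as the age, and
the function name prune_old_graphs is hypothetical.

```python
import os
import time
from typing import Iterable, List, Optional


def prune_old_graphs(graph_dir: str, live_hashes: Iterable[str],
                     retention_seconds: float,
                     now: Optional[float] = None) -> List[str]:
    """Unlink graph files outside the current chain once they exceed the
    retention period. Returns the names that were removed."""
    now = time.time() if now is None else now
    keep = {f"graph-{h}.graph" for h in live_hashes}
    removed = []
    for name in sorted(os.listdir(graph_dir)):
        if not (name.startswith("graph-") and name.endswith(".graph")):
            continue  # not a graph file, leave it alone
        if name in keep:
            continue  # still part of the chain we point at
        path = os.path.join(graph_dir, name)
        if now - os.stat(path).st_mtime > retention_seconds:
            os.unlink(path)
            removed.append(name)
    return removed
```

With retention_seconds set to two weeks and "child" repos re-pointing
weekly, a child should never find that its base graph has been unlinked.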

> --[end discussion of incremental files]--
>
> I'll move forward applying your existing feedback on patches 1-11 and
> submit as a full series to replace ds/commit-graph-format-v2. We can
> work on reviewing that code while we continue to think critically on
> the topic of incremental files.
>
> Thanks,
> -Stolee

* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-09 15:48           ` Ævar Arnfjörð Bjarmason
@ 2019-05-09 17:08             ` Derrick Stolee
  2019-05-09 21:45               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee @ 2019-05-09 17:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 5/9/2019 11:48 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, May 09 2019, Derrick Stolee wrote:
> [snip]
>>
>> I still don't think this is a problem, with the idea below.
>>
>>>>  4. It would be helpful to add a new optional chunk that contains the
>>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>>     would only be for the commit-graph-N files, and would provide a simple
>>>>     way to check that the stack is valid on read, in case we are still
>>>>     worried about other processes reading/writing in the wrong order.
>>>>
>>>> If we have this chunk -- you have convinced me that we need it -- then we
>>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>>> match. We can continue without dying because we can always parse the "missing"
>>>> commits from the packs.
>>
>> So, let's set that idea aside. You have other concerns.
>>
>>> Instead of looping through files N at a time we'd have a discovery step
>>> where we'd need to open() all the files, see which ones say "my parent
>>> hash hash X", and then create a list of those hashes in order to read a
>>> bunch of commit-graph-<HASH> files.
>>>
>>> Is that a bit painful? Sure, but way less painful than dealing with the
>>> caveats I'd mentioned in [1] and SZEDER details here.
>>
>> I don't see how this step is less painful than the one I am describing.
>> You'll need to be a bit more specific to convince me.
>>
>> I'll try to be specific with a few ideas that have been thrown around,
>> so we can compare and contrast (and point out where I am misunderstanding
>> what you are trying to say).
>>
>> Option 0 (this series): commit-graph-N
>> --------------------------------------
>>
>> On read, we look for the 'info/commit-graph' file and acquire a read handle.
>> We set that commit_graph struct as our tip commit-graph. Then, for each
>> increasing N (until we fail) we acquire a read handle on
>> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
>> our current tip commit-graph. If the file doesn't exist, or the base
>> hash doesn't match, then we stop and continue with our current tip graph.
> 
> This "base hash" is something I may have missed. I *thought* we'd
> happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
> and if (e.g. due to fsync/dir entry issues noted earlier) we got the
> "wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?
>
> I was trying (and failing) to make the test suite vomit by hacking up
> the code to load "commit-graph-1" first, then "commit-graph", so maybe I
> ran into that safeguard, whatever it is. I've just seen how we carry the
> "num_commits_in_base" forward, not some expected checksum...

It's not implemented in the patch series code, but included as an idea
in the cover letter, and the discussion above convinced me it should be
required. The discussion here assumes that is part of the design.
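
To make the intended read logic concrete, here is a minimal Python sketch
of the Option 0 stack load with the planned base-hash check. It is only an
illustration under assumptions: files are modeled as an in-memory dict, and
the GraphFile type and load_stack name are hypothetical, not part of the
series.

```python
from typing import Dict, List, NamedTuple, Optional


class GraphFile(NamedTuple):
    trailing_hash: str        # checksum at the end of this file
    base_hash: Optional[str]  # value of the proposed base-hash chunk


def load_stack(files: Dict[str, GraphFile]) -> List[str]:
    """Return the names of the layers forming a valid stack, base first."""
    tip = files.get("info/commit-graph")
    if tip is None:
        return []
    stack = ["info/commit-graph"]
    n = 1
    while True:
        name = f"info/commit-graphs/commit-graph-{n}"
        layer = files.get(name)
        # Stop at the first missing file or base-hash mismatch and keep
        # the current tip; missing commits can be parsed from the packs.
        if layer is None or layer.base_hash != tip.trailing_hash:
            return stack
        stack.append(name)
        tip = layer
        n += 1
```

A stale or mismatched commit-graph-2 is simply ignored here, so a reader
gets a shorter stack rather than wrong data.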

>> On write, use a 'mv' to swap our .lock file with whatever level we are
>> merging, THEN can unlink() the higher layers in decreasing order. (This
>> "mv-then-unlink" order is different than what is implemented by this
>> series, but is enabled by the chunk containing the base graph hash.)
> 
> So one of the things that make me paranoid about this (as noted
> upthread/earlier) is that at least on POSIX filesystems just because you
> observe that you create and fsync "foo" and "bar" in that order doesn't
> mean that a concurrent reader of that directory will get the updates in
> that order. I.e. file update != dir entry update.

How far apart can these concurrency issues happen in the file system?
One benefit to Option 0 is that there is only one file _write_ that matters.
The other options require at least two writes.
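
A minimal Python sketch of that "mv-then-unlink" order, for illustration
only. It assumes the merged level is at least 1 (level 0 lives at
'info/commit-graph'), that the .lock file is on the same filesystem, and
that readers apply the base-hash check; level_path and replace_levels are
hypothetical names.

```python
import os


def level_path(graph_dir: str, level: int) -> str:
    return os.path.join(graph_dir, f"commit-graph-{level}")


def replace_levels(graph_dir: str, lock_path: str,
                   merge_level: int, old_top: int) -> None:
    """Install the merged file at merge_level, then drop the layers above it."""
    # Step 1: the single rename that matters. The rename itself is
    # atomic on POSIX, so the path names either the old or the new file.
    os.replace(lock_path, level_path(graph_dir, merge_level))
    # Step 2: unlink stale higher layers, highest first. A reader that
    # still sees one of them would reject it on a base-hash mismatch.
    for level in range(old_top, merge_level, -1):
        try:
            os.unlink(level_path(graph_dir, level))
        except FileNotFoundError:
            pass  # another process already cleaned it up
```

This does not address how quickly directory-entry updates become visible
to other readers; it only shows the order of operations.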

> It can be made to work, and core.fsyncObjectFiles is a partial solution,
> but better to avoid it entirely. Also the failure with
> core.fsyncObjectFiles is much more graceful, since in that case we don't
> have a "deadbeef[...]" loose object *yet* but will (presumably) see the
> *right* thing Real Soon Now since it's content-addressable, whereas with
> this design we might see the *wrong* thing. So this would be the first
> area of git (that I know of) where we'd be sensitive to a combination of
> syncing dir entries and file content in order.
> 
>> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
>> --------------------------------------------------------------
>>
>> On read, we load the 'info/commit-graph' file and inspect the byte saying
>> how many base files we have. We load their hashes from the base file chunk
>> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
>> to ignore anything "above" the failure in the chain (and specifically this
>> includes the 'commit-graph' file), or consider reloading the commit-graph
>> file altogether and hope it works this time. [Note: if an older version
>> of Git doesn't understand the incremental file format, it will fail to
>> properly understand the graph positions and either fail with an
>> "invalid parent position" error, or worse give garbage results.]
> 
> Right, except as noted before I'd add (at least per my understanding so
> far) "s/needs format v2//". I.e. we could stick this data in new v1
> backwards-compatible chunks, which also addresses the "older version..."
> caveat.
> 
> I.e. we'd always guarantee that the "commit-graph" file was
> understandable/valid for older versions (even though eventually their
> view on it would be "oh, this has no data", when really it's all in a new
> chunk they don't grok).

I see. Something that I gather from your paragraph above but cannot find
written explicitly anywhere is "write a commit-graph with zero commits,
and it contains extra information that we can use to find the rest of the
commits." In essence, the entire purpose of the file would be to contain
the information of the 'commit-graphs/info' file from Option 2. It would
mean we don't need to "copy then move" because the tip commit-graph file
has no content to persist (except for the first incremental write).

That would prevent breaking old clients, but they would also not have any
commit-graph data to use. Option 0 and Option 2 can leave a valid v1
commit-graph file with the majority of the commit data. This is only a
performance issue for a narrow case, so maybe that's worth ignoring.

(See Ævar's idea about symlinks at the bottom of this message for more.)

>> On write, if we are creating a new layer to our chain, we need to _copy_
>> the existing commit-graph file to a graph-{hash}.graph file before renaming
>> the .lock file. If we are merging layers, then we either (a) clean up the
>> dangling chains after moving our commit-graph file, or (b) have something
>> like 'gc' clean up the files later. I think using 'gc' for this is not
>> a good idea, since I would expect these files to be written and merged
>> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
>> up the dangling chains leads to our concurrency issues. Further, this
>> 'copy the base' no longer keeps our large base file at rest.
> 
> So aside from the specifics of implementation both this and the
> commit-graph-N way of doing it involves juggling a sequence of chunks
> that "point upwards", which IMO has worse caveats than "point
> downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
> new stuff, and there's no (sane) way in option 0 to split commit-graphs
> across repos, and no way at all in this scenario (if I understand it
> correctly...).

I don't understand the value of rewriting base files to point upwards.
The whole point is to not change the base files.

>> Option 2: grovel commit-graphs directory for graph-{hash}.graph
>> ---------------------------------------------------------------
>>
>> On read, we load the 'info/commit-graph' file and assume it is never
>> an incremental file. Then, scan the 'info/commit-graphs' directory
>> for 'graph-{hash}.graph' files and open them _all_ up to construct
>> a "graph forest" (each graph has a single parent, given by a chunk
>> specifying its base graph hash). If we don't have an instance of a
>> graph with a given hash, then ignore any graphs pointing to that hash.
>> We now have a decision to make: which leaf of this forest should we
>> use as our tip commit-graph? That could be given by the
>> commit-graphs/info file. But what happens when we have timing issues
>> around scanning the directory and the commit-graphs/info file? Do
>> we fall back to modified time?
> 
> I think in a worst-case scenario we shrug and just pick whatever chain
> looks most recent/largest or whatever, but in practice I think it's a
> non-issue.
> 
> Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
> not doing the incremental write in "fetch", but instead just rely on "gc
> --auto" *and* to address your "I would expect these files to be written
> and merged much more frequently" we'd just (and I'm keen to hack this
> up) teach "gc --auto" such an incremental mode.

I look forward to this incremental mode, and its "partial maintenance".

> This means that within a single repo all updates to the commit-graph
> will go through the gc.lock, whereas with your current 17/17 we'd
> potentially have a "commit-graph write" racing a concurrent "gc", with
> both aiming to update the commit-graph.

Except for writes that happen in `git commit-graph write`, of course.

VFS for Git never runs 'gc' and instead maintains the commit-graph directly.

Should the commit-graph builtin take a gc lock on write?

>> On write, if we are not merging, then we just create a new
>> graph-{hash}.graph file. If we are merging, but still have a base graph,
>> then create a new graph-{hash}.graph file. Finally, if we are merging
>> all layers, then we rename our .lock file to 'info/commit-graph'.
>> To clean up, we need to grovel the directory to look for graph-{hash}.graph
>> files whose base chains no longer match the new, "best" chain and unlink()
>> them. This clean-up step can happen at any time.
>>
>> --[end description of options]--
>>
>> Did I accurately describe the options we are considering?
>>
>> Option 1 was the design I was planning, and I think it matches how the
>> split-index feature works. Please correct me if I am missing something.
>> It _requires_ updating the file format version. But it also has a flaw
>> that the other options do not have: the copy of the base file. One
>> thing I want to enable is for whatever machinery is handling these
>> file writes to run a 'verify' immediately after, and have that be fast
>> most of the time. With a model that changes only the "tip" file, we
>> can verify only the new files and have confidence that the base file
>> did not change. I think options 0 and 2 both improve in this direction.
>>
>>> With commit-graph-<HASH> all these unlink() race conditions go away,
>>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>>> just leave the old files, "gc" deals with them later..), no need to
>>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>>> N+1 files.
>>
>> Calling this a non-issue is an exaggeration, especially if you are
>> claiming we need to be robust to multi-hour gaps between reading files.
> 
> We always have a race, but they're very different races.
> 
> With "N" files (option #0) we'd have races of the type where within the
> same few milliseconds of a commit-graph being merged/updated a
> e.g. concurrent "tag --contains" would either be much slower (couldn't
> get one of the incremental graphs it expected), or produce the wrong
> answer (this may be wrong on my part, see my "base hash" comment above).

With the "base hash" feature (planned, not implemented) we won't be wrong.

With the "max number of commits in the incremental files" feature, we
would limit the performance issues from missing an update. BUT, this also
implies we are rewriting the base commit-graph more often. With a limit
of 64,000 commits, that's still only once every few weeks in the Windows
OS repo.

> Whereas if I run something with "ionice -c 3" I could possibly hang for
> however many hours/days/weeks we wait until another "gc" comes along and
> unlinks those old files, but if I'm running it like that I'm not
> expecting it to be fast, so it's OK if the files went away, and it won't
> ever get the wrong file (since the filenames are hash-addressable).

How do we recover from this situation? Have no commit-graph at all? Or
do we try again?

>>> It also becomes easy to "chain" graphs across repos e.g. via
>>> alternates. Say in the scenario github/gitlab have where they have a
>>> "main" repo and other objects on another delta island.
>>>
>>> In that case the repo would have a local "tip" file with the last link
>>> in its chain, some of which would then refer back to <HASHes> in other
>>> "parent" alternates.
>>>
>>> As long as such a setup has a "gc" process that's not overly eager about
>>> pruning old stuff and considers that constellation of repos as a whole
>>> that should just work. You can freely optimize and rewrite graphs across
>>> repos, just be careful about unlinking old stuff.
>>>
>>> I don't see how it would work with commit-graph-N without a *lot* of
>>> painful orchestration (where e.g. you *must* guarantee that the parent
>>> repo ends in N, all child repos start at N+1).
>>
>> You're right that Option 0 does not work in this model where some graph
>> information is stored in an alternate _and_ more information is stored
>> outside the alternate. My perspective is biased, because I consider the
>> alternate to be "almost everything" and the local object store to be
>> small. But in a fork network, this is not always the case. I appreciate
>> your feedback for this environment, and I've always hoped that someone
>> with server experience would come and say "this feature is great, but
>> we need X, Y, and Z to make best use of it in our environment. Here's
>> a patch that moves us in that direction!" At least you are doing the
>> next-best thing: stopping me from making mistakes that would block
>> adoption.
> 
> I'm happy to write some patches, just want to talk about it first (and
> if I'm lucky convince you to write them for me :) ).
> 
> One example is the git-for-windows commit-graph is 5.6MB, the git.git
> one is 3.1MB. Should GitHub just stick that all in the one "parent"
> graph, maybe. But nice to have the flexibility of stacking them.
> 
> There's also more disconnected cases, e.g. I have some "staging" boxes
> where I have a cronjob running around re-pointing clones of a big
> monorepo to a shared "alternates" store where I guarantee objects are
> only ever added, never removed.
> 
> It would be nice to have a way to provide a commit-graph there that's
> "stable" that clients could point to, and they'd just generate the
> difference.
> 
> I.e. now I have a shared .git/objects which contains gigabytes, a
> crapload of stuff in /home where .git/objects is 10-50MB, and each one
> has a commit-graph that's around the same 50-100MB size (since it needs
> to contain the metadata for the full set).
>
>> So let's consider how Option 2 would work in this "multi-tip" case.
>> Each object directory would have some number of graph files, and one
>> 'commit-graphs/info' file pointing to some hash. When we read, we
>> try to pick the info file that is "closest" to us.
>>
>> This does create some complications that I don't think you gave enough
>> attention to. These may be solvable, but they are non-trivial:
>>
>> * When we 'gc' the "core" repo, we need to enumerate all of the
>>   "leaf" repos to check their tip commit-graph files and make a
>>   decision if we should keep their bases around or delete those tips.
>>   Perhaps I'm over-stating the difficulty here, since we need to do
>>   something similar to find still-reachable objects, anyway. But if
>>   we are doing that reachability calculation, then why are we not
>>   just putting all of the commit-graph data in the core repo? Note
>>   that we don't get the same value as delta islands because this data
>>   isn't being shared across the protocol. The issue with storing all
>>   graph data in the core repo is that the core repo doesn't actually
>>   have all of the commits, which makes 'verify' on the graph a bit
>>   annoying.
> 
> Yeah I think similar to "alternates" it would be annoying to have a case
> where a given repo has metadata on objects it doesn't have, and there's
> cases (see the "staging" case I mentioned above) where that "parent"
> repo won't have access to those things.
> 
>> * If we choose a local tip instead of the "core" tip, then that chain
>>   of commit-graphs can be far behind the core repo. In the world where
>>   a fork moves only at the speed of a single developer, but the core
>>   project moves quickly, then computing a merge base with the core's
>>   master branch becomes slow as our local chain doesn't contain most
>>   of the commits.
> 
> That's a good point and a case where pointing "upwards" or just having
> commit-graphs in the "base" repo is better, i.e. the "fork" has almost
> no objects.

Is your idea of "upwards" different than mine? I think of pointing to
a base file as pointing "down", and the opposite would be "up". In the
case of a fork network or multiple repos using an alternate, pointing
upwards would not even be well-defined.

> But solvable by triggering "gc" on these child projects so their
> commit-graph keeps being re-pointed to a later version.
> 
> And we'd have the reverse problem with a git-for-windows wouldn't we?
> I.e. the fork is "far ahead".

This is the quintessential example for why we can't have a single chain
of commit-graphs long-term. It deviates from most fork networks enough
that we can't say "just take the base repo's commit-graph" but typical
fork networks can't say "just take my local commit-graph chain". The
two-dimensional graph position would be valuable to help both shapes.

>> * We can't take all of the "core" chain _and_ the local chain, because
>>   the concept of "graph position" no longer makes sense. The only way
>>   I see out of this is to make the graph position two-dimensional:
>>   commit -> (index of tip chain, position in that chain). Perhaps this
>>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>>   have incremental chains spanning object directories and instead
>>   introduce "parents-by-ref" where we mark some parents as included
>>   by object id instead of by graph position. This would allow the
>>   core repo to gc without caring about the external repos. It also
>>   wouldn't care about how the graph files are stored (Option 0 would
>>   work, as graph chains would not cross object store boundaries) and
>>   more closely resembles the independence of the pack-files in each
>>   object store. The "parents-by-ref" would require updating the
>>   file format version.
> 
> The parent repo can "gc" without caring/inspecting "child" repos with
> the "point downwards" of option #2, as long as it promises to retain its
> old commit-graph files for some retention period of X, and the "child"
> repos promise to "gc" (and re-point to new graphs if necessary) at rate
> that's faster than that.
> 
> This makes it easy to e.g. say "we retain old commit-graph files for 2
> weeks", and "we re-gc everything in cron weekly".

Here, I think, is the most crucial point of why Option 2 may be worth the
added complexity over Option 0. Option 0 _requires_ that the files be
replaced immediately on a new write, while Option 2 provides a way to
leave old files around and be cleaned up later.

But how should we actually perform this cleanup? I would imagine a
'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
run would perform the same logic, but we need a way to do this outside
of 'gc'. It needs to use the modified time as an indicator, since we
could run 'git commit-graph write' twice an hour before our two-week
cleanup job and need to keep our hour-old stale file. Perhaps the
'git commit-graph gc' subcommand could take a '--window' parameter that
can be 0, while 'git gc' uses a config setting.

The decision can then be: "is this file not in our graph chain and
older than <window> from now?"
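
A minimal sketch of that decision, in Python for brevity rather than
Git's C. The function name, the 'graph-<hash>.graph' naming (taken from
the Option 2 description) and the use of mtime in seconds are all
assumptions for illustration, not an existing Git interface:

```python
import os
import time


def stale_graph_files(graph_dir, chain_hashes, window_seconds, now=None):
    """Return graph files that are not in the chain and older than the window.

    Hypothetical helper: 'chain_hashes' is the list of hashes in the current
    graph chain, 'window_seconds' is the proposed <window>.
    """
    now = time.time() if now is None else now
    keep = {f"graph-{h}.graph" for h in chain_hashes}
    stale = []
    for name in sorted(os.listdir(graph_dir)):
        # Only consider layer files; skip anything in the current chain.
        if not (name.startswith("graph-") and name.endswith(".graph")):
            continue
        if name in keep:
            continue
        path = os.path.join(graph_dir, name)
        age = now - os.stat(path).st_mtime
        # '>=' so that a window of 0 removes every file outside the chain.
        if age >= window_seconds:
            stale.append(path)
    return stale
```

With a window of 0 every file outside the chain is reported, which is
the behavior wanted right after a "write" followed by a "verify".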

But also, I expect the stale commit-graph files will pile up quickly.
We rebuild the commit-graph file roughly every hour. I would write
our maintenance to call these subcommands in order with no delay:

    "write" -> "verify --shallow" -> "gc --window=0"

(Here, "verify --shallow" would only verify the tip of the
commit-graph chain.)

> It would work best if we can also pull this trick on the "base"
> commit-graph file, which I believe we could do in a backwards-compatible
> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
> is the current "base".

Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
and let the symlink 'commit-graph' point to the base. Then, old clients
would load a full commit-graph and new clients would get the full chain.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-09 17:08             ` Derrick Stolee
@ 2019-05-09 21:45               ` Ævar Arnfjörð Bjarmason
  2019-05-10 12:44                 ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-09 21:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee


On Thu, May 09 2019, Derrick Stolee wrote:

> On 5/9/2019 11:48 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Thu, May 09 2019, Derrick Stolee wrote:
>> [snip]
>>>
>>> I still don't think this is a problem, with the idea below.
>>>
>>>>>  4. It would be helpful to add a new optional chunk that contains the
>>>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>>>     would only be for the commit-graph-N files, and would provide a simple
>>>>>     way to check that the stack is valid on read, in case we are still
>>>>>     worried about other processes reading/writing in the wrong order.
>>>>>
>>>>> If we have this chunk -- you have convinced me that we need it -- then we
>>>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>>>> match. We can continue without dying because we can always parse the "missing"
>>>>> commits from the packs.
>>>
>>> So, let's set that idea aside. You have other concerns.
>>>
>>>> Instead of looping through files N at a time we'd have a discovery step
>>>> where we'd need to open() all the files, see which ones say "my parent
>>>> hash is X", and then create a list of those hashes in order to read a
>>>> bunch of commit-graph-<HASH> files.
>>>>
>>>> Is that a bit painful? Sure, but way less painful than dealing with the
>>>> caveats I'd mentioned in [1] and SZEDER details here.
>>>
>>> I don't see how this step is less painful than the one I am describing.
>>> You'll need to be a bit more specific to convince me.
>>>
>>> I'll try to be specific with a few ideas that have been thrown around,
>>> so we can compare and contrast (and point out where I am misunderstanding
>>> what you are trying to say).
>>>
>>> Option 0 (this series): commit-graph-N
>>> --------------------------------------
>>>
>>> On read, we look for the 'info/commit-graph' file and acquire a read handle.
>>> We set that commit_graph struct as our tip commit-graph. Then, for each
>>> increasing N (until we fail) we acquire a read handle on
>>> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
>>> our current tip commit-graph. If the file doesn't exist, or the base
>>> hash doesn't match, then we stop and continue with our current tip graph.
>>
>> This "base hash" is something I may have missed. I *thought* we'd
>> happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
>> and (e.g. due to fsync/dir entry issues noted earlier) if we got the
>> "wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?
>>
>> I was trying (and failing) to make the test suite vomit by hacking up
>> the code to load "commit-graph-1" first, then "commit-graph", so maybe I
>> ran into that safeguard, whatever it is. I've just seen how we carry the
>> "num_commits_in_base" forward, not some expected checksum...
>
> It's not implemented in the patch series code, but included as an idea
> in the cover letter, and the discussion above convinced me it should be
> required. The discussion here assumes that is part of the design.

Got it. Hashing like that would definitely mitigate the "wrong file"
failure scenario, although as discussed leaving some others...
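
For concreteness, the Option 0 read path with that check could be
sketched like this (Python, not Git's C). `read_layer` is a hypothetical
callback standing in for opening 'commit-graph' (level 0) or
'commit-graph-N' and returning the pair (base hash, own trailing hash);
the base-hash chunk is only proposed in the cover letter, not
implemented:

```python
def load_chain(read_layer):
    """Walk level 0, 1, 2, ... and stop at the first missing or mismatched file.

    read_layer(n) is assumed to return (base_hash, own_hash) for level n,
    or None if that file does not exist.
    """
    base = read_layer(0)
    if base is None:
        return []
    chain = [base[1]]
    n = 1
    while True:
        layer = read_layer(n)
        # A missing file or a wrong base hash ends the chain; the commits
        # above it can still be parsed from the packs.
        if layer is None or layer[0] != chain[-1]:
            break
        chain.append(layer[1])
        n += 1
    return chain
```

A stale commit-graph-2 whose base hash names an older commit-graph-1 is
simply ignored here rather than being read as if it belonged to the
stack.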

>>> On write, use a 'mv' to swap our .lock file with whatever level we are
>>> merging, THEN can unlink() the higher layers in decreasing order. (This
>>> "mv-then-unlink" order is different than what is implemented by this
>>> series, but is enabled by the chunk containing the base graph hash.)
>>
>> So one of the things that make me paranoid about this (as noted
>> upthread/earlier) is that at least on POSIX filesystems just because you
>> observe that you create and fsync "foo" and "bar" in that order doesn't
>> mean that a concurrent reader of that directory will get the updates in
>> that order. I.e. file update != dir entry update.
>
> How far apart can these concurrency issues happen in the file system?
> One benefit to Option 0 is that there is only one file _write_ that matters.
> The other options require at least two writes.

You need to write() and then fsync()/close() to be guaranteed that the
data was written to the file.

As an aside I see that the current commit-graph uses CSUM_FSYNC (but
without CSUM_CLOSE, probably doesn't matter), I thought it
didn't. Maybe we should remove that unless core.fsyncObjectFiles=true
(or have another "looser" fsync config).

So writing is cheap, but it's asking the OS to sync to disk that
generally hurts. As noted in the discussions when core.fsyncObjectFiles
was introduced some FSs are really stupid about it, so it's better to
avoid these caveats if you can.

As noted in the fsync(2) manpage on Linux (ditto POSIX) an fsync to a
*file* doesn't guarantee that the directory entry is updated. So we'd
also need to opendir() the containing directory and fsync that FD as we
juggle these renames/replaces.
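
A rough sketch of that sequence, in Python rather than Git's C and
assuming a POSIX system where a directory can be opened read-only and
passed to fsync(). `durable_replace` and the ".lock" suffix handling are
made up for illustration:

```python
import os


def durable_replace(directory, name, data):
    """Write 'data' to directory/name via a .lock file, syncing file and dir."""
    tmp = os.path.join(directory, name + ".lock")
    final = os.path.join(directory, name)
    # "x" fails if the lock file already exists, like O_CREAT|O_EXCL.
    with open(tmp, "xb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # file contents reach the disk
    os.replace(tmp, final)        # atomic rename over the old file
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)           # directory entry reaches the disk
    finally:
        os.close(dirfd)
```

That is two syncs per file we replace, which is the cost I'd like us to
be able to skip.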

But stepping away from strict POSIX for a second, many users of git use
the object store over something like NFS. It's very handy to be able to
mount such a volume with e.g. lookupcache=positive
(https://linux.die.net/man/5/nfs).

I.e. if you request a commit-graph-<HASH> you don't need a round-trip to
the server to see if it's still there.

So it's not that we can't do it, but rather that there's some usage
patterns that are much friendlier to performance & caching than
others. Just throwing things at the FS and having it flush stuff in its
own time is ideal, and e.g. commercial FS appliances have a lot of
performance knobs you can use if you have access patterns like what we
have in objects/pack/.

I.e. "any file here you can cache forever, they never change (they
disappear, but no hurry!), but new ones might show up".

That sort of thing becomes more true with other dumber FS-alike things,
e.g. if we go for some "grab a graph from the server" implementation
such a thing is way simpler as a list of hashes that can be cached
forever (http proxies et al) than N files sensitive to their update
sequence.

>> It can be made to work, and core.fsyncObjectFiles is a partial solution,
>> but better to avoid it entirely. Also the failure with
>> core.fsyncObjectFiles is much more graceful, since in that case we don't
>> have a "deadbeef[...]" loose object *yet* but will (presumably) see the
>> *right* thing Real Soon Now since it's content-addressable, whereas with
>> this design we might see the *wrong* thing. So this would be the first
>> area of git (that I know of) where we'd be sensitive to a combination of
>> syncing dir entries and file content in order.
>>
>>> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
>>> --------------------------------------------------------------
>>>
>>> On read, we load the 'info/commit-graph' file and inspect the byte saying
>>> how many base files we have. We load their hashes from the base file chunk
>>> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
>>> to ignore anything "above" the failure in the chain (and specifically this
>>> includes the 'commit-graph' file), or consider reloading the commit-graph
>>> file altogether and hope it works this time. [Note: if an older version
>>> of Git doesn't understand the incremental file format, it will fail to
>>> properly understand the graph positions and either fail with an
>>> "invalid parent position" error, or worse give garbage results.]
>>
>> Right, except as noted before I'd add (at least per my understanding so
>> far) "s/needs format v2//". I.e. we could stick this data in new v1
>> backwards-compatible chunks, which also addresses the "older version..."
>> caveat.
>>
>> I.e. we'd always guarantee that the "commit-graph" file was
>> understandable/valid for older versions (even though eventually their
>> view on it would be "oh, this has no data", when really it's all in a new
>> chunk they don't grok).
>
> I see. Something that I gather from your paragraph above but cannot find
> written explicitly anywhere is "write a commit-graph with zero commits,
> and it contains extra information that we can use to find the rest of the
> commits." In essence, the entire purpose of the file would be to contain
> the information of the 'commit-graphs/info' file from Option 2. It would
> mean we don't need to "copy then move" because the tip commit-graph file
> has no content to persist (except for the first incremental write).
>
> That would prevent breaking old clients, but they would also not have any
> commit-graph data to use. Option 0 and Option 2 can leave a valid v1
> commit-graph file with the majority of the commit data. This is only a
> performance issue for a narrow case, so maybe that's worth ignoring.

Indeed. As noted in the thread about the v1->v2 format, if we just append
new chunks we have leeway like that. It's completely OK if we choose to
write no data that old clients can use, so they're not helped by the
optimization; that's more graceful than them dying on a format change.

And yes, we could stick the proposed commit-graphs/info "index" in a
chunk there. I've got a preference for a \n-delimited list similar to
objects/info/packs, but that's mostly for aesthetic reasons.

> (See Ævar's idea about symlinks at the bottom of this message for more.)
>
>>> On write, if we are creating a new layer to our chain, we need to _copy_
>>> the existing commit-graph file to a graph-{hash}.graph file before renaming
>>> the .lock file. If we are merging layers, then we either (a) clean up the
>>> dangling chains after moving our commit-graph file, or (b) have something
>>> like 'gc' clean up the files later. I think using 'gc' for this is not
>>> a good idea, since I would expect these files to be written and merged
>>> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
>>> up the dangling chains leads to our concurrency issues. Further, this
>>> 'copy the base' no longer keeps our large base file at rest.
>>
>> So aside from the specifics of implementation both this and the
>> commit-graph-N way of doing it involves juggling a sequence of chunks
>> that "point upwards", which IMO has worse caveats than "point
>> downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
>> new stuff, and there's no (sane) way in option 0 to split commit-graphs
>> across repos, and no way at all in this scenario (if I understand it
>> correctly...).
>
> I don't understand the value of rewriting base files to point upwards.
> The whole point is to not change the base files.

Yeah I don't think it makes sense, but maybe I misread the
description...

>>> Option 2: grovel commit-graphs directory for graph-{hash}.graph
>>> ---------------------------------------------------------------
>>>
>>> On read, we load the 'info/commit-graph' file and assume it is never
>>> an incremental file. Then, scan the 'info/commit-graphs' directory
>>> for 'graph-{hash}.graph' files and open them _all_ up to construct
>>> a "graph forest" (each graph has a single parent, given by a chunk
>>> specifying its base graph hash). If we don't have an instance of a
>>> graph with a given hash, then ignore any graphs pointing to that hash.
>>> We now have a decision to make: which leaf of this forest should we
>>> use as our tip commit-graph? That could be given by the
>>> commit-graphs/info file. But what happens when we have timing issues
>>> around scanning the directory and the commit-graphs/info file? Do
>>> we fall back to modified time?
>>
>> I think in a worst-case scenario we shrug and just pick whatever chain
>> looks most recent/largest or whatever, but in practice I think it's a
>> non-issue.
>>
>> Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
>> not doing the incremental write in "fetch", but instead just rely on "gc
>> --auto" *and* to address your "I would expect these files to be written
>> and merged much more frequently" we'd just (and I'm keen to hack this
>> up) teach "gc --auto" such an incremental mode.
>
> I look forward to this incremental mode, and its "partial maintenance".
>
>> This means that within a single repo all updates to the commit-graph
>> will go through the gc.lock, whereas with your current 17/17 we'd
>> potentially have a "commit-graph write" racing a concurrent "gc", with
>> both aiming to update the commit-graph.
>
> Except for writes that happen in `git commit-graph write`, of course.
>
> VFS for Git never runs 'gc' and instead maintains the commit-graph directly.
>
> Should the commit-graph builtin take a gc lock on write?

If we make the file management dumb enough it shouldn't matter, we'd
always have a lock for updating the "meta" file, but if we didn't we'd
just have some harmless duplicate work.

I think a lock in "gc" is enough, it's our default, so it should be
sensible. People doing custom GC generally do/want their own management
of that, no need for us to hold their hand beyond the point where
e.g. concurrent "pack-refs" might corrupt refs if the ref update itself
isn't locked.

>>> On write, if we are not merging, then we just create a new
>>> graph-{hash}.graph file. If we are merging, but still have a base graph,
>>> then create a new graph-{hash}.graph file. Finally, if we are merging
>>> all layers, then we rename our .lock file to 'info/commit-graph'.
>>> To clean up, we need to grovel the directory to look for graph-{hash}.graph
>>> files whose base chains no longer match the new, "best" chain and unlink()
>>> them. This clean-up step can happen at any time.
>>>
>>> --[end description of options]--
>>>
>>> Did I accurately describe the options we are considering?
>>>
>>> Option 1 was the design I was planning, and I think it matches how the
>>> split-index feature works. Please correct me if I am missing something.
>>> It _requires_ updating the file format version. But it also has a flaw
>>> that the other options do not have: the copy of the base file. One
>>> thing I want to enable is for whatever machinery is handling these
>>> file writes to run a 'verify' immediately after, and have that be fast
>>> most of the time. With a model that changes only the "tip" file, we
>>> can verify only the new files and have confidence that the base file
>>> did not change. I think options 0 and 2 both improve in this direction.
>>>
>>>> With commit-graph-<HASH> all these unlink() race conditions go away,
>>>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>>>> just leave the old files, "gc" deals with them later..), no need to
>>>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>>>> N+1 files.
>>>
>>> Calling this a non-issue is an exaggeration, especially if you are
>>> claiming we need to be robust to multi-hour gaps between reading files.
>>
>> We always have a race, but they're very different races.
>>
>> With "N" files (option #0) we'd have races of the type where within the
>> same few milliseconds of a commit-graph being merged/updated, e.g. a
>> concurrent "tag --contains" would either be much slower (couldn't
>> get one of the incremental graphs it expected), or produce the wrong
>> answer (this may be wrong on my part, see my "base hash" comment above).
>
> With the "base hash" feature (planned, not implemented) we won't be wrong.

*nod*

> With the "max number of commits in the incremental files" feature, we
> would limit the performance issues from missing an update. BUT, this also
> implies we are rewriting the base commit-graph more often. With a limit
> of 64,000 commits, that's still only once every few weeks in the Windows
> OS repo.
>
>> Whereas if I run something with "ionice -c 3" I could possibly hang for
>> however many hours/days/weeks we wait until another "gc" comes along and
>> unlinks those old files, but if I'm running it like that I'm not
>> expecting it to be fast, so it's OK if the files went away, and it won't
>> ever get the wrong file (since the filenames are hash-addressable).
>
> How do we recover from this situation? Have no commit-graph at all? Or
> do we try again?

I think just no commit-graph at all is fine. If it lazily took you N
hours from reading the "commit-graphs/info" to getting around to looking
at now-disappeared files it sucks to be you.

>>>> It also becomes easy to "chain" graphs across repos e.g. via
>>>> alternates. Say in the scenario github/gitlab have where they have a
>>>> "main" repo and other objects on another delta island.
>>>>
>>>> In that case the repo would have a local "tip" file with the last link
>>>> in its chain, some of which would then refer back to <HASHes> in other
>>>> "parent" alternates.
>>>>
>>>> As long as such a setup has a "gc" process that's not overly eager about
>>>> pruning old stuff and considers that constellation of repos as a whole
>>>> that should just work. You can freely optimize and rewrite graphs across
>>>> repos, just be careful about unlinking old stuff.
>>>>
>>>> I don't see how it would work with commit-graph-N without a *lot* of
>>>> painful orchestration (where e.g. you *must* guarantee that the parent
>>>> repo ends in N, all child repos start at N+1).
>>>
>>> You're right that Option 0 does not work in this model where some graph
>>> information is stored in an alternate _and_ more information is stored
>>> outside the alternate. My perspective is biased, because I consider the
>>> alternate to be "almost everything" and the local object store to be
>>> small. But in a fork network, this is not always the case. I appreciate
>>> your feedback for this environment, and I've always hoped that someone
>>> with server experience would come and say "this feature is great, but
>>> we need X, Y, and Z to make best use of it in our environment. Here's
>>> a patch that moves us in that direction!" At least you are doing the
>>> next-best thing: stopping me from making mistakes that would block
>>> adoption.
>>
>> I'm happy to write some patches, just want to talk about it first (and
>> if I'm lucky convince you to write them for me :) ).
>>
>> One example is the git-for-windows commit-graph is 5.6MB, the git.git
>> one is 3.1MB. Should GitHub just stick that all in the one "parent"
>> graph, maybe. But nice to have the flexibility of stacking them.
>>
>> There's also more disconnected cases, e.g. I have some "staging" boxes
>> where I have a cronjob running around re-pointing clones of a big
>> monorepo to a shared "alternates" store where I guarantee objects are
>> only ever added, never removed.
>>
>> It would be nice to have a way to provide a commit-graph there that's
>> "stable" that clients could point to, and they'd just generate the
>> difference.
>>
>> I.e. now I have a shared .git/objects which contains gigabytes, a
>> crapload of stuff in /home where .git/objects is 10-50MB, and each one
>> has a commit-graph that's around the same 50-100MB size (since it needs
>> to contain the metadata for the full set).
>>
>>> So let's consider how Option 2 would work in this "multi-tip" case.
>>> Each object directory would have some number of graph files, and one
>>> 'commit-graphs/info' file pointing to some hash. When we read, we
>>> try to pick the info file that is "closest" to us.
>>>
>>> This does create some complications that I don't think you gave enough
>>> attention to. These may be solvable, but they are non-trivial:
>>>
>>> * When we 'gc' the "core" repo, we need to enumerate all of the
>>>   "leaf" repos to check their tip commit-graph files and make a
>>>   decision if we should keep their bases around or delete those tips.
>>>   Perhaps I'm over-stating the difficulty here, since we need to do
>>>   something similar to find still-reachable objects, anyway. But if
>>>   we are doing that reachability calculation, then why are we not
>>>   just putting all of the commit-graph data in the core repo? Note
>>>   that we don't get the same value as delta islands because this data
>>>   isn't being shared across the protocol. The issue with storing all
>>>   graph data in the core repo is that the core repo doesn't actually
>>>   have all of the commits, which makes 'verify' on the graph a bit
>>>   annoying.
>>
>> Yeah I think similar to "alternates" it would be annoying to have a case
>> where a given repo has metadata on objects it doesn't have, and there's
>> cases (see the "staging" case I mentioned above) where that "parent"
>> repo won't have access to those things.
>>
>>> * If we choose a local tip instead of the "core" tip, then that chain
>>>   of commit-graphs can be far behind the core repo. In the world where
>>>   a fork moves only at the speed of a single developer, but the core
>>>   project moves quickly, then computing a merge base with the core's
>>>   master branch becomes slow as our local chain doesn't contain most
>>>   of the commits.
>>
>> That's a good point and a case where pointing "upwards" or just having
>> commit-graphs in the "base" repo is better, i.e. the "fork" has almost
>> no objects.
>
> Is your idea of "upwards" different than mine? I think of pointing to
> a base file as pointing "down", and the opposite would be "up". In the
> case of a fork network or multiple repos using an alternate, pointing
> upwards would not even be well-defined.

We've got the same idea, but as noted above maybe I misunderstood the
"option 1" description.

>> But solvable by triggering "gc" on these child projects so their
>> commit-graph keeps being re-pointed to a later version.
>>
>> And we'd have the reverse problem with a git-for-windows wouldn't we?
>> I.e. the fork is "far ahead".
>
> This is the quintessential example for why we can't have a single chain
> of commit-graphs long-term. It deviates from most fork networks enough
> that we can't say "just take the base repo's commit-graph" but typical
> fork networks can't say "just take my local commit-graph chain". The
> two-dimensional graph position would be valuable to help both shapes.

Indeed. I just thought about e.g. a [branch|tag] --contains for the
current repo, but for things like "how far ahead of gitster/pu is my GFW
topic?" one graph for the network is needed.

If we can help it it would be useful to not unduly box the user in and
offer them flexibility to choose. I.e. some (e.g. my staging server
use-case) might want base+local repo and never need 2x local repos
vs. each other, whereas github/gitlab might need one giant graph etc.

>>> * We can't take all of the "core" chain _and_ the local chain, because
>>>   the concept of "graph position" no longer makes sense. The only way
>>>   I see out of this is to make the graph position two-dimensional:
>>>   commit -> (index of tip chain, position in that chain). Perhaps this
>>>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>>>   have incremental chains spanning object directories and instead
>>>   introduce "parents-by-ref" where we mark some parents as included
>>>   by object id instead of by graph position. This would allow the
>>>   core repo to gc without caring about the external repos. It also
>>>   wouldn't care about how the graph files are stored (Option 0 would
>>>   work, as graph chains would not cross object store boundaries) and
>>>   more closely resembles the independence of the pack-files in each
>>>   object store. The "parents-by-ref" would require updating the
>>>   file format version.
>>
>> The parent repo can "gc" without caring/inspecting "child" repos with
>> the "point downwards" of option #2, as long as it promises to retain its
>> old commit-graph files for some retention period of X, and the "child"
>> repos promise to "gc" (and re-point to new graphs if necessary) at rate
>> that's faster than that.
>>
>> This makes it easy to e.g. say "we retain old commit-graph files for 2
>> weeks", and "we re-gc everything in cron weekly".
>
> Here, I think, is the most crucial point of why Option 2 may be worth the
> added complexity over Option 0. Option 0 _requires_ that the files be
> replaced immediately on a new write, while Option 2 provides a way to
> leave old files around and be cleaned up later.
>
> But how should we actually perform this cleanup? I would imagine a
> 'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
> run would perform the same logic, but we need a way to do this outside
> of 'gc'. It needs to use the modified time as an indicator, since we
> could run 'git commit-graph write' twice an hour before our two-week
> cleanup job and need to keep our hour-old stale file. Perhaps the
> 'git commit-graph gc' subcommand could take a '--window' parameter that
> can be 0, while 'git gc' uses a config setting.
>
> The decision can then be: "is this file not in our graph chain and
> older than <window> from now?"
>
> But also, I expect the stale commit-graph files will pile up quickly.
> We rebuild the commit-graph file roughly every hour. I would write
> our maintenance to call these subcommands in order with no delay:
>
>     "write" -> "verify --shallow" -> "gc --window=0"
>
> (Here, "verify --shallow" would only verify the tip of the
> commit-graph chain.)

I think a sane default would be to just unlink() the old ones as soon as
we're done writing the new ones & writing the commit-graphs/info file
saying "here's your current ones".

I.e. as noted in what I said about fsync above it's not that we need to
keep the old ones around, but avoiding the tight dance with N & N+1
updates, and being friendlier to stuff like lookupcache=positive.

Users with more advanced use-cases (e.g. cross-repo graphs) could then
always increase such an expiry.

>> It would work best if we can also pull this trick on the "base"
>> commit-graph file, which I believe we could do in a backwards-compatible
>> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
>> is the current "base".
>
> Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
> and let the symlink 'commit-graph' point to the base. Then, old clients
> would load a full commit-graph and new clients would get the full chain.

How's the Windows support for symlinks? We don't symlink anything in
.git/objects/ ourselves now (but see [1]).

On *nix just manually symlinking it works fine (you need to go out of
your way not to support it, which we didn't).

So something like this would be desirable:

    $ tree -a .git/objects/info/
    .git/objects/info/
    ├── commit-graph -> commit-graphs/commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
    ├── commit-graphs
    │   ├── commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
    │   ├── commit-graph-988881adc9fc3655077dc2d4d757d480b5ea0e11
    │   └── info
    └── packs
    $ cat .git/objects/info/commit-graphs/info
    2492e0ef38643d4cb6369f76443e6a814d616258
    988881adc9fc3655077dc2d4d757d480b5ea0e11

I.e. create new ones as needed, and when done record in the "info" file
the sequence they should be read in, and symlink "commit-graph" to
whatever the latest "base" is as a courtesy to old clients (or not, or
eventually don't bother).
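
A reader for that layout could be as simple as the following sketch
(Python, not Git's C). `resolve_chain` is a made-up name, and stopping at
the first missing file is just one possible policy; ignoring the graph
entirely would be simpler still:

```python
import os


def resolve_chain(graphs_dir):
    """Return the commit-graph files named by the 'info' file, base first.

    'graphs_dir' is assumed to be .git/objects/info/commit-graphs.
    """
    with open(os.path.join(graphs_dir, "info")) as f:
        hashes = [line.strip() for line in f if line.strip()]
    paths = []
    for h in hashes:
        path = os.path.join(graphs_dir, f"commit-graph-{h}")
        if not os.path.exists(path):
            # Assumed policy: a missing layer invalidates everything above it.
            break
        paths.append(path)
    return paths
```

Note that there is no directory scan and no dependence on the order in
which files were created, only one small file read followed by direct
opens.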

1. https://public-inbox.org/git/20190502144829.4394-1-matheus.bernardino@usp.br/

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 12/17] Documentation: describe split commit-graphs
  2019-05-09 21:45               ` Ævar Arnfjörð Bjarmason
@ 2019-05-10 12:44                 ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-05-10 12:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee via GitGitGadget, git, peff,
	git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 5/9/2019 5:45 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, May 09 2019, Derrick Stolee wrote:
> 
>> How far apart can these concurrency issues happen in the file system?
>> One benefit to Option 0 is that there is only one file _write_ that matters.
>> The other options require at least two writes.
> 
> You need to write() and then fsync()/close() to be guaranteed that the
> data was written to the file.
> 
> As an aside I see that the current commit-graph uses CSUM_FSYNC (but
> without CSUM_CLOSE, probably doesn't matter), I thought it
> didn't. Maybe we should remove that unless core.fsyncObjectFiles=true
> (or have another "looser" fsync config).
> 
> So writing is cheap, but it's asking the OS to sync to disk that
> generally hurts. As noted in the discussions when core.fsyncObjectFiles
> was introduced some FSs are really stupid about it, so it's better to
> avoid these caveats if you can.
> 
> As noted in the fsync(2) manpage on Linux (ditto POSIX) an fsync to a
> *file* doesn't guarantee that the directory entry is updated. So we'd
> also need to opendir() the containing directory and fsync that FD as we
> juggle these renames/replaces.

I think any plan should not rely on directory scanning for this reason.
Starting from a known file (info/commit-graphs/head?) and then navigating
directly to files would avoid these issues, right?

>> That would prevent breaking old clients, but they would also not have any
>> commit-graph data to use. Option 0 and Option 2 can leave a valid v1
>> commit-graph file with the majority of the commit data. This is only a
>> performance issue for a narrow case, so maybe that's worth ignoring.
> 
> Indeed. As noted in the thread about the v1->v2 format if we just append
> new chunks we have leeway like that, i.e. it's completely OK if we
> choose to from the POV of old clients write no data to them so they're
> not helped by the optimization, that's a more graceful way than them
> dying on a format change.
> 
> And yes, we could stick the proposed commit-graphs/info "index" in a
> chunk there. I've got a preference for a \n-delimited list similar to
> objects/info/packs, but that's mostly for aesthetic reasons.

I think a \n-delimited list would be better because using the commit-graph
format with zero commits creates at least a kilobyte of data for a
zero-valued fanout table. If we relax the format to say "the head file
has zero commits, so the fanout, commit oids, and commit data chunks
should not exist" then we would be fine, and get versioning "for free".
But the file serves a different purpose.

>>> Whereas if I run something with "ionice -c 3" I could possibly hang for
>>> however many hours/days/weeks we wait until another "gc" comes along and
>>> unlinks those old files, but if I'm running it like that I'm not
>>> expecting it to be fast, so it's OK if the files went away, and it won't
>>> ever get the wrong file (since the filenames are hash-addressable).
>>
>> How do we recover from this situation? Have no commit-graph at all? Or
>> do we try again?
> 
> I think just no commit-graph at all is fine. If it lazily took you N
> hours from reading the "commit-graphs/info" to getting around to looking
> at now-disappeared files it sucks to be you.

Sounds good. We can revisit if this actually starts hurting any real
scenarios.

>>> And we'd have the reverse problem with a git-for-windows wouldn't we?
>>> I.e. the fork is "far ahead".
>>
>> This is the quintessential example for why we can't have a single chain
>> of commit-graphs long-term. It deviates from most fork networks enough
>> that we can't say "just take the base repo's commit-graph" but typical
>> fork networks can't say "just take my local commit-graph chain". The
>> two-dimensional graph position would be valuable to help both shapes.
> 
> Indeed. I just thought about e.g. a [branch|tag] --contains for the
> current repo, but for things like "how ahead of gitster/pu is my GFW
> topic?" one graph for the network is needed.
> 
> If we can help it it would be useful to not unduly box the user in and
> offer them flexibility to choose. I.e. some (e.g. my staging server
> use-case) might want base+local repo and never need 2x local repos
> v.s. each other, whereas github/gitlab might need one giant graph etc.

I made a note to follow up with the two-dimensional graph position [1].

[1] https://github.com/microsoft/git/issues/138

>>> This makes it easy to e.g. say "we retain old commit-graph files for 2
>>> weeks", and "we re-gc everything in cron weekly".
>>
>> Here, I think, is the most crucial point of why Option 2 may be worth the
>> added complexity over Option 0. Option 0 _requires_ that the files be
>> replaced immediately on a new write, while Option 2 provides a way to
>> leave old files around and be cleaned up later.
>>
>> But how should we actually perform this cleanup? I would imagine a
>> 'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
>> run would perform the same logic, but we need a way to do this outside
>> of 'gc'. It needs to use the modified time as an indicator, since we
>> could run 'git commit-graph write' twice an hour before our two-week
>> cleanup job and need to keep our hour-old stale file. Perhaps the
>> 'git commit-graph gc' subcommand could take a '--window' parameter that
>> can be 0, while 'git gc' uses a config setting.
>>
>> The decision can then be: "is this file not in our graph chain and
>> older than <window> from now?"
>>
>> But also, I expect the stale commit-graph files will pile up quickly.
>> We rebuild the commit-graph file roughly every hour. I would write
>> our maintenance to call these subcommands in order with no delay:
>>
>>     "write" -> "verify --shallow" -> "gc --window=0"
>>
>> (Here, "verify --shallow" would only verify the tip of the
>> commit-graph chain.)
> 
> I think a sane default would be to just unlink() the old ones as soon as
> we're done writing the new ones & writing the commit-graphs/info file
> saying "here's your current ones".
> 
> I.e. as noted in what I said about fsync above it's not that we need to
> keep the old ones around, but avoiding the tight dance with N & N+1
> updates, and being friendlier to stuff like lookupcache=positive.
> 
> Users with more advanced use-cases (e.g. cross-repo graphs) could then
> always increase such an expiry.

Noted. Thanks.

>>> It would work best if we can also pull this trick on the "base"
>>> commit-graph file, which I believe we could do in a backwards-compatible
>>> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
>>> is the current "base".
>>
>> Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
>> and let the symlink 'commit-graph' point to the base. Then, old clients
>> would load a full commit-graph and new clients would get the full chain.
> 
> How's the Windows support for symlinks? We don't symlink anything in
> .git/objects/ ourselves now (but see[1]).

Maybe this won't work in all scenarios, but it could be left as a future
enhancement if anyone cares. The "support two versions" scenario is rare
enough that we need not build something specifically for it, but common
enough that we should not _break_ it.

> On *nix just manually symlinking it works fine (you need to go out of
> your way not to support it, which we didn't).
> 
> So something like this would be desirable:
> 
>     $ tree -a .git/objects/info/
>     .git/objects/info/
>     ├── commit-graph -> commit-graphs/commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
>     ├── commit-graphs
>     │   ├── commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
>     │   ├── commit-graph-988881adc9fc3655077dc2d4d757d480b5ea0e11
>     │   └── info
>     └── packs
>     $ cat .git/objects/info/commit-graphs/info
>     2492e0ef38643d4cb6369f76443e6a814d616258
>     988881adc9fc3655077dc2d4d757d480b5ea0e11
> 
> I.e. create new ones as needed, and when done say what sequence they
> should be read in in the "info" file, and symlink "commit-graph" to
> whatever the latest "base" is as a courtesy to old clients (or not, or
> eventually don't bother).

OK. Glad that the idea would work.

Thanks,
-Stolee



* [PATCH v2 00/11] [RFC] Commit-graph: Write incremental files
  2019-05-08 15:53 [PATCH 00/17] [RFC] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2019-05-08 19:27 ` [PATCH 00/17] [RFC] Commit-graph: Write incremental files Ævar Arnfjörð Bjarmason
@ 2019-05-22 19:53 ` " Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 01/11] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
                     ` (11 more replies)
  18 siblings, 12 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano

This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. This version takes the design
suggestions from the earlier discussion and tries to work out most of the
concerns.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder with name graph-{hash}.graph. The
list of hashes is given by the commit-graph-chain file, and also in a "base
graph chunk" in the commit-graph format. As we read a chain, we can verify
that each listed hash matches the trailing hash of the commit-graph file we
read, and that the hashes of the lower levels are the ones that file expects.
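
To make the layout concrete, here is a minimal sketch (not code from this
series) of turning the contents of the commit-graph-chain file into the
graph-{hash}.graph paths described above. The helper name
sketch_chain_paths() and the SKETCH_PATH_MAX limit are made up for
illustration, and the sketch assumes the chain file holds one hex hash per
line, lowest level first:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical upper bound on a path length, chosen for this sketch only. */
#define SKETCH_PATH_MAX 512

/*
 * Walk the text of a commit-graph-chain file (one hex hash per line,
 * lowest level first) and format the path of each graph file.
 * Returns the number of paths written to 'paths'.
 */
static size_t sketch_chain_paths(const char *chain, const char *objdir,
				 char paths[][SKETCH_PATH_MAX], size_t max_paths)
{
	size_t count = 0;

	while (*chain && count < max_paths) {
		size_t len = strcspn(chain, "\n"); /* length of this line */

		if (len > 0) {
			int n = snprintf(paths[count], SKETCH_PATH_MAX,
					 "%s/info/commit-graphs/graph-%.*s.graph",
					 objdir, (int)len, chain);
			if (n < 0 || n >= SKETCH_PATH_MAX)
				break; /* path did not fit; stop early */
			count++;
		}
		chain += len;
		if (*chain == '\n')
			chain++; /* skip the line terminator */
	}
	return count;
}
```

A reader could then open the resulting paths in order and compare each
file's trailing hash with the hash in its name.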

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.
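
As a rough illustration of the merge decision (one possible reading of the
two conditions, not the code in this series), the sketch below uses commit
counts as a stand-in for file size and assumes both conditions must hold
before a level is merged down. The names levels_kept_after_merge(),
SKETCH_SIZE_MULT and SKETCH_MAX_COMMITS are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed constants: "more than half the tip" and "more than 64,000". */
#define SKETCH_SIZE_MULT 2
#define SKETCH_MAX_COMMITS 64000

/*
 * levels[0] is the base of the chain and levels[n - 1] is the tip; each
 * entry is the number of commits in that level. *new_commits is the
 * number of commits about to be written. While the new file would be
 * more than half the size of the current tip AND would hold more than
 * SKETCH_MAX_COMMITS commits, fold the tip into the new file.
 * Returns how many existing levels remain below the new file.
 */
static size_t levels_kept_after_merge(const uint32_t *levels, size_t n,
				      uint32_t *new_commits)
{
	while (n > 0 &&
	       (uint64_t)levels[n - 1] <
			(uint64_t)SKETCH_SIZE_MULT * *new_commits &&
	       *new_commits > SKETCH_MAX_COMMITS) {
		*new_commits += levels[n - 1]; /* absorb the tip level */
		n--;
	}
	return n;
}
```

For example, with levels of 1,000,000 and 100,000 commits, writing 70,000
new commits would absorb the 100,000-commit tip under this reading, giving
a two-level chain whose new tip has 170,000 commits.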

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

git commit-graph write --stdin-commits:         8m 12s
git commit-graph write --stdin-commits --split:    48s

The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.

Here are a few points that still need to be addressed before this is ready
for full review:

 * The merge strategy values should be extracted into config options.

 * If we have a commit-graph chain and someone writes without "--split" it
   will make a new commit-graph file and not clean up the old files.

 * We need to update 'git commit-graph verify' to understand the chains, and
   test that it catches the new problems. It would be good to have a
   '--shallow' option to only verify the tip file, as if we run that after
   every write we can have some confidence that the files at rest are still
   valid and we only need to check the smaller file. (This is the main
   reason this is a priority to the VFS for Git team.)

This is based on ds/commit-graph-write-refactor.

Thanks,
-Stolee

[1] 
https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (11):
  commit-graph: document commit-graph chains
  commit-graph: prepare for commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: load commit-graph chains
  commit-graph: add base graphs chunk
  commit-graph: rearrange chunk count logic
  commit-graph: write commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: merge commit-graph chains
  commit-graph: allow cross-alternate chains
  commit-graph: expire commit-graph files

 .../technical/commit-graph-format.txt         |  11 +-
 Documentation/technical/commit-graph.txt      | 195 +++++
 builtin/commit-graph.c                        |  10 +-
 commit-graph.c                                | 734 +++++++++++++++++-
 commit-graph.h                                |   7 +
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5323-split-commit-graph.sh                 | 172 ++++
 7 files changed, 1088 insertions(+), 43 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh


base-commit: 8520d7fc7c6edd4d71582c69a873436029b6cb1b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Range-diff vs v1:

  1:  0be7713a25 <  -:  ---------- commit-graph: fix the_repository reference
  2:  86b18f15ba <  -:  ---------- commit-graph: return with errors during write
  3:  9299a7fe25 <  -:  ---------- commit-graph: collapse parameters into flags
  4:  f9b719fc7a <  -:  ---------- commit-graph: remove Future Work section
  5:  74e40970e0 <  -:  ---------- commit-graph: create write_commit_graph_context
  6:  54817ef50b <  -:  ---------- commit-graph: extract fill_oids_from_packs()
  7:  cf792d38ed <  -:  ---------- commit-graph: extract fill_oids_from_commit_hex()
  8:  aaae85f1ec <  -:  ---------- commit-graph: extract fill_oids_from_all_packs()
  9:  9d434dc38c <  -:  ---------- commit-graph: extract count_distinct_commits()
 10:  ebd665468e <  -:  ---------- commit-graph: extract copy_oids_to_commits()
 11:  3eee3667cf <  -:  ---------- commit-graph: extract write_commit_graph_file()
 12:  7bbe8d9150 <  -:  ---------- Documentation: describe split commit-graphs
  -:  ---------- >  1:  a423afbfdd commit-graph: document commit-graph chains
 13:  9d0e966a3d !  2:  249668fc92 commit-graph: lay groundwork for incremental files
     @@ -1,6 +1,26 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    commit-graph: lay groundwork for incremental files
     +    commit-graph: prepare for commit-graph chains
     +
     +    To prepare for a chain of commit-graph files, augment the
     +    commit_graph struct to point to a base commit_graph. As we load
     +    commits from the graph, we may actually want to read from a base
     +    file according to the graph position.
     +
     +    The "graph position" of a commit is given by concatenating the
     +    lexicographic commit orders from each of the commit-graph files in
     +    the chain. This means that we must distinguish two values:
     +
     +     * lexicographic index: the position within the lexicographic
     +       order in a single commit-graph file.
     +
     +     * graph position: the position within the concatenated order
     +       of multiple commit-graph files
     +
     +    Given the lexicographic index of a commit in a graph, we can
     +    compute the graph position by adding the number of commits in
     +    the lower-level graphs. To find the lexicographic index of
     +    a commit, we subtract the number of commits in lower-level graphs.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ -13,21 +33,21 @@
       
      +static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
      +{
     ++	uint32_t lex_index;
     ++
      +	if (!g)
      +		BUG("NULL commit-graph");
      +
     -+	if (pos < g->num_commits_in_base) {
     -+		load_oid_from_graph(g->base_graph, pos, oid);
     -+		return;
     -+	}
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	if (pos >= g->num_commits + g->num_commits_in_base)
      +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
      +		    pos, g->num_commits, g->num_commits_in_base);
      +
     -+	pos -= g->num_commits_in_base;
     ++	lex_index = pos - g->num_commits_in_base;
      +
     -+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * pos);
     ++	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
      +}
      +
       static struct commit_list **insert_parent_or_die(struct repository *r,
     @@ -47,24 +67,32 @@
       	if (!c)
       		die(_("could not find commit %s"), oid_to_hex(&oid));
      @@
     + 
       static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
       {
     - 	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
     --	item->graph_pos = pos;
     -+	item->graph_pos = pos + g->num_commits_in_base;
     +-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
     ++	const unsigned char *commit_data;
     ++	uint32_t lex_index;
     ++
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
     ++
     ++	lex_index = pos - g->num_commits_in_base;
     ++	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
     + 	item->graph_pos = pos;
       	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
       }
     - 
      @@
       	uint32_t *parent_data_ptr;
       	uint64_t date_low, date_high;
       	struct commit_list **pptr;
      -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
      +	const unsigned char *commit_data;
     ++	uint32_t lex_index;
       
      -	item->object.parsed = 1;
     -+	if (pos < g->num_commits_in_base)
     -+		return fill_commit_in_graph(r, item, g->base_graph, pos);
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	if (pos >= g->num_commits + g->num_commits_in_base)
      +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
     @@ -75,9 +103,9 @@
      +	 * "local" position for the rest of the calculation.
      +	 */
       	item->graph_pos = pos;
     -+	pos -= g->num_commits_in_base;
     ++	lex_index = pos - g->num_commits_in_base;
      +
     -+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
     ++	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
      +
      +	item->object.parsed = 1;
       
     @@ -89,13 +117,13 @@
       	} else {
      -		return bsearch_graph(g, &(item->object.oid), pos);
      +		struct commit_graph *cur_g = g;
     -+		uint32_t pos_in_g;
     ++		uint32_t lex_index;
      +
     -+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &pos_in_g))
     ++		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
      +			cur_g = cur_g->base_graph;
      +
      +		if (cur_g) {
     -+			*pos = pos_in_g + cur_g->num_commits_in_base;
     ++			*pos = lex_index + cur_g->num_commits_in_base;
      +			return 1;
      +		}
      +
     @@ -111,8 +139,8 @@
      -					   GRAPH_DATA_WIDTH * (c->graph_pos);
      +	const unsigned char *commit_data;
      +
     -+	if (c->graph_pos < g->num_commits_in_base)
     -+		return load_tree_for_commit(r, g->base_graph, c);
     ++	while (c->graph_pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	commit_data = g->chunk_commit_data +
      +			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
     @@ -133,8 +161,3 @@
       	const uint32_t *chunk_oid_fanout;
       	const unsigned char *chunk_oid_lookup;
       	const unsigned char *chunk_commit_data;
     - 	const unsigned char *chunk_extra_edges;
     -+	const unsigned char *chunk_base_graph;
     - };
     - 
     - struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
 14:  4436c8f4f1 <  -:  ---------- commit-graph: load split commit-graph files
 15:  aa4a096813 <  -:  ---------- commit-graph: write split commit-graph files
  -:  ---------- >  3:  809fa7ad80 commit-graph: rename commit_compare to oid_compare
  -:  ---------- >  4:  a8c0b47c8a commit-graph: load commit-graph chains
  -:  ---------- >  5:  4fefd0a654 commit-graph: add base graphs chunk
  -:  ---------- >  6:  a595a1eb65 commit-graph: rearrange chunk count logic
  -:  ---------- >  7:  9cbfb656b3 commit-graph: write commit-graph chains
 16:  7c5bc06d14 !  8:  5ad14f574b commit-graph: add --split option
     @@ -1,6 +1,16 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    commit-graph: add --split option
     +    commit-graph: add --split option to builtin
     +
     +    Add a new "--split" option to the 'git commit-graph write' subcommand. This
     +    option allows the optional behavior of writing a commit-graph chain.
     +
     +    The current behavior will add a tip commit-graph containing any commits that
     +    are not in the existing commit-graph or commit-graph chain. Later changes
     +    will allow merging the chain and expiring out-dated files.
     +
     +    Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
     +    behavior.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ -55,39 +65,130 @@
       	read_replace_refs = 0;
       
      
     - diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
     - --- a/t/t5318-commit-graph.sh
     - +++ b/t/t5318-commit-graph.sh
     + diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     + new file mode 100755
     + --- /dev/null
     + +++ b/t/t5323-split-commit-graph.sh
      @@
     - 	test_cmp_bin commit-graph-after-gc $objdir/info/commit-graph
     - '
     - 
     -+test_expect_success 'write split commit-graph' '
     -+	cd "$TRASH_DIRECTORY" &&
     -+	git clone full split &&
     -+	cd split &&
     ++#!/bin/sh
     ++
     ++test_description='split commit graph'
     ++. ./test-lib.sh
     ++
     ++GIT_TEST_COMMIT_GRAPH=0
     ++
     ++test_expect_success 'setup repo' '
     ++	git init &&
      +	git config core.commitGraph true &&
     -+	for i in $(test_seq 1 20); do
     -+		test_commit padding-$i
     ++	infodir=".git/objects/info" &&
     ++	graphdir="$infodir/commit-graphs" &&
     ++	test_oid_init
     ++'
     ++
     ++graph_read_expect() {
     ++	NUM_BASE=0
     ++	if test ! -z $2
     ++	then
     ++		NUM_BASE=$2
     ++	fi
     ++	cat >expect <<- EOF
     ++	header: 43475048 1 1 3 $NUM_BASE
     ++	num_commits: $1
     ++	chunks: oid_fanout oid_lookup commit_metadata
     ++	EOF
     ++	git commit-graph read >output &&
     ++	test_cmp expect output
     ++}
     ++
     ++test_expect_success 'create commits and write commit-graph' '
     ++	for i in $(test_seq 3)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
      +	done &&
      +	git commit-graph write --reachable &&
     -+	test_commit split-commit &&
     -+	git branch -f split-commit &&
     -+	git commit-graph write --reachable --split &&
     -+	test_path_is_file .git/objects/info/commit-graphs/commit-graph-1
     ++	test_path_is_file $infodir/commit-graph &&
     ++	graph_read_expect 3
      +'
      +
     -+graph_git_behavior 'split graph, split-commit vs merge 1' bare split-commit merge/1
     ++graph_git_two_modes() {
     ++	git -c core.commitGraph=true $1 >output
     ++	git -c core.commitGraph=false $1 >expect
     ++	test_cmp expect output
     ++}
     ++
     ++graph_git_behavior() {
     ++	MSG=$1
     ++	BRANCH=$2
     ++	COMPARE=$3
     ++	test_expect_success "check normal git operations: $MSG" '
     ++		graph_git_two_modes "log --oneline $BRANCH" &&
     ++		graph_git_two_modes "log --topo-order $BRANCH" &&
     ++		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
     ++		graph_git_two_modes "branch -vv" &&
     ++		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
     ++	'
     ++}
      +
     -+test_expect_success 'collapse split commit-graph' '
     -+	cd "$TRASH_DIRECTORY/split" &&
     ++graph_git_behavior 'graph exists' commits/3 commits/1
     ++
     ++verify_chain_files_exist() {
     ++	for hash in $(cat $1/commit-graph-chain)
     ++	do
     ++		test_path_is_file $1/graph-$hash.graph
     ++	done
     ++}
     ++
     ++test_expect_success 'add more commits, and write a new base graph' '
     ++	git reset --hard commits/1 &&
     ++	for i in $(test_seq 4 5)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
     ++	done &&
     ++	git reset --hard commits/2 &&
     ++	for i in $(test_seq 6 10)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
     ++	done &&
     ++	git reset --hard commits/2 &&
     ++	git merge commits/4 &&
     ++	git branch merge/1 &&
     ++	git reset --hard commits/4 &&
     ++	git merge commits/6 &&
     ++	git branch merge/2 &&
      +	git commit-graph write --reachable &&
     -+	test_path_is_missing .git/objects/info/commit-graphs/commit-graph-1 &&
     -+	test_path_is_file .git/objects/info/commit-graph
     ++	graph_read_expect 12
     ++'
     ++
     ++test_expect_success 'add three more commits, write a tip graph' '
     ++	git reset --hard commits/3 &&
     ++	git merge merge/1 &&
     ++	git merge commits/5 &&
     ++	git merge merge/2 &&
     ++	git branch merge/3 &&
     ++	git commit-graph write --reachable --split &&
     ++	test_path_is_missing $infodir/commit-graph &&
     ++	test_path_is_file $graphdir/commit-graph-chain &&
     ++	ls $graphdir/graph-*.graph >graph-files &&
     ++	test_line_count = 2 graph-files &&
     ++	verify_chain_files_exist $graphdir
     ++'
     ++
     ++graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
     ++
     ++test_expect_success 'add one commit, write a tip graph' '
     ++	test_commit 11 &&
     ++	git branch commits/11 &&
     ++	git commit-graph write --reachable --split &&
     ++	test_path_is_missing $infodir/commit-graph &&
     ++	test_path_is_file $graphdir/commit-graph-chain &&
     ++	ls $graphdir/graph-*.graph >graph-files &&
     ++	test_line_count = 3 graph-files &&
     ++	verify_chain_files_exist $graphdir
      +'
      +
     -+graph_git_behavior 'collapsed graph, split-commit vs merge 1' bare split-commit merge/1
     ++graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
      +
     - test_expect_success 'replace-objects invalidates commit-graph' '
     - 	cd "$TRASH_DIRECTORY" &&
     - 	test_when_finished rm -rf replace &&
     ++test_done
 17:  3c52385e56 <  -:  ---------- fetch: add fetch.writeCommitGraph config setting
  -:  ---------- >  9:  9567daa0b8 commit-graph: merge commit-graph chains
  -:  ---------- > 10:  4cfe19a933 commit-graph: allow cross-alternate chains
  -:  ---------- > 11:  72fc0a1f17 commit-graph: expire commit-graph files

-- 
gitgitgadget


* [PATCH v2 01/11] commit-graph: document commit-graph chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 02/11] commit-graph: prepare for " Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a basic description of commit-graph chains. More details about the
feature will be added as we add functionality. This introduction gives a
high-level overview of the goals of the feature and the basic layout of
commit-graph chains.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..1dca3bd8fe 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,65 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Commit Graph Chains
+-------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over time,
+the number of commits added by a fetch operation is much smaller than the
+number of commits in the full history. By creating a "chain" of commit-graphs,
+we enable fast writes of new commit data without rewriting the entire commit
+history -- at least, most of the time.
+
+## File Layout
+
+A commit-graph chain uses multiple files, and we use a fixed naming convention
+to organize these files. Each commit-graph file has a name
+`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
+valued hash stored in the footer of that file (which is a hash of the file's
+contents before that hash). For a chain of commit-graph files, a plain-text
+file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
+hashes for the files in order from "lowest" to "highest".
+
+For example, if the `commit-graph-chain` file contains the lines
+
+```
+	{hash0}
+	{hash1}
+	{hash2}
+```
+
+then the commit-graph chain looks like the following diagram:
+
+ +-----------------------+
+ |  graph-{hash2}.graph  |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  graph-{hash1}.graph  |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}.graph  |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
+commits in `graph-{hash1}.graph`, and X2 be the number of commits in
+`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
+then we interpret this as being the commit in position (X0 + X1 + i), and that
+will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
+positions to refer to their parents, which may be in `graph-{hash1}.graph` or
+`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
+its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
+X2).
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH v2 02/11] commit-graph: prepare for commit-graph chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 01/11] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` " Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 03/11] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To prepare for a chain of commit-graph files, augment the
commit_graph struct to point to a base commit_graph. As we load
commits from the graph, we may actually want to read from a base
file according to the graph position.

The "graph position" of a commit is given by concatenating the
lexicographic commit orders from each of the commit-graph files in
the chain. This means that we must distinguish two values:

 * lexicographic index: the position within the lexicographic
   order in a single commit-graph file.

 * graph position: the position within the concatenated order
   of multiple commit-graph files.

Given the lexicographic index of a commit in a graph, we can
compute the graph position by adding the number of commits in
the lower-level graphs. To find the lexicographic index of
a commit, we subtract the number of commits in lower-level graphs
from its graph position.
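
The two conversions can be sketched outside of git as follows. This is
only an illustration: "struct layer" and both function names are
hypothetical stand-ins, not the commit_graph struct or helpers touched
by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, pared-down stand-in for struct commit_graph. */
struct layer {
	uint32_t num_commits;         /* commits stored in this file */
	uint32_t num_commits_in_base; /* commits in all lower layers */
	struct layer *base;           /* next lower layer, or NULL */
};

/* Lexicographic index within `g` -> graph position in the chain. */
static uint32_t lex_to_graph_pos(const struct layer *g, uint32_t lex_index)
{
	assert(lex_index < g->num_commits);
	return lex_index + g->num_commits_in_base;
}

/*
 * Graph position -> owning layer and lexicographic index.
 * Returns NULL if `pos` lies beyond the tip layer.
 */
static const struct layer *graph_pos_to_lex(const struct layer *tip,
					    uint32_t pos,
					    uint32_t *lex_index)
{
	const struct layer *g = tip;

	if (!g || pos >= g->num_commits + g->num_commits_in_base)
		return NULL;

	/* The base-most layer has num_commits_in_base == 0, so this stops. */
	while (pos < g->num_commits_in_base)
		g = g->base;

	*lex_index = pos - g->num_commits_in_base;
	return g;
}
```

With layers of 5, 3 and 4 commits, graph position 6 resolves to the
middle layer at lexicographic index 1, since 6 - 5 = 1.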

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------
 commit-graph.h |  3 ++
 2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7723156964..3afedcd7f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -371,6 +371,25 @@ static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t
 			    g->chunk_oid_lookup, g->hash_len, pos);
 }
 
+static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
+{
+	uint32_t lex_index;
+
+	if (!g)
+		BUG("NULL commit-graph");
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	lex_index = pos - g->num_commits_in_base;
+
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
+}
+
 static struct commit_list **insert_parent_or_die(struct repository *r,
 						 struct commit_graph *g,
 						 uint64_t pos,
@@ -379,10 +398,10 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	struct commit *c;
 	struct object_id oid;
 
-	if (pos >= g->num_commits)
+	if (pos >= g->num_commits + g->num_commits_in_base)
 		die("invalid parent position %"PRIu64, pos);
 
-	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	load_oid_from_graph(g, pos, &oid);
 	c = lookup_commit(r, &oid);
 	if (!c)
 		die(_("could not find commit %s"), oid_to_hex(&oid));
@@ -392,7 +411,14 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	lex_index = pos - g->num_commits_in_base;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 	item->graph_pos = pos;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -405,10 +431,26 @@ static int fill_commit_in_graph(struct repository *r,
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
 
-	item->object.parsed = 1;
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	/*
+	 * Store the "full" position, but then use the
+	 * "local" position for the rest of the calculation.
+	 */
 	item->graph_pos = pos;
+	lex_index = pos - g->num_commits_in_base;
+
+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+
+	item->object.parsed = 1;
 
 	item->maybe_tree = NULL;
 
@@ -452,7 +494,18 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(g, &(item->object.oid), pos);
+		struct commit_graph *cur_g = g;
+		uint32_t lex_index;
+
+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
+			cur_g = cur_g->base_graph;
+
+		if (cur_g) {
+			*pos = lex_index + cur_g->num_commits_in_base;
+			return 1;
+		}
+
+		return 0;
 	}
 }
 
@@ -492,8 +545,13 @@ static struct tree *load_tree_for_commit(struct repository *r,
 					 struct commit *c)
 {
 	struct object_id oid;
-	const unsigned char *commit_data = g->chunk_commit_data +
-					   GRAPH_DATA_WIDTH * (c->graph_pos);
+	const unsigned char *commit_data;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	commit_data = g->chunk_commit_data +
+			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
 
 	hashcpy(oid.hash, commit_data);
 	c->maybe_tree = lookup_tree(r, &oid);
diff --git a/commit-graph.h b/commit-graph.h
index 70f4caf0c7..f9fe32ebe3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,9 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 
+	uint32_t num_commits_in_base;
+	struct commit_graph *base_graph;
+
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
-- 
gitgitgadget



* [PATCH v2 03/11] commit-graph: rename commit_compare to oid_compare
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 01/11] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 02/11] commit-graph: prepare for " Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 04/11] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The helper function commit_compare() actually compares object_id
structs, not commits. A future change to commit-graph.c will need
to sort commit structs, so rename this function in advance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3afedcd7f5..e2f438f6a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -761,7 +761,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
-static int commit_compare(const void *_a, const void *_b)
+static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
 	const struct object_id *b = (const struct object_id *)_b;
@@ -1030,7 +1030,7 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 			_("Counting distinct commits in commit graph"),
 			ctx->oids.nr);
 	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-- 
gitgitgadget



* [PATCH v2 04/11] commit-graph: load commit-graph chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 03/11] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 05/11] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Prepare the logic for reading a chain of commit-graphs.

First, look for a file at $OBJDIR/info/commit-graph. If it exists,
then use that file and stop.

Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain.
If this file exists, then load the hash values as line-separated values in that
file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i]
in that file. The file is given in order, so the first hash corresponds to the
"base" file and the final hash corresponds to the "tip" file.

This implementation assumes that all of the graph-{hash}.graph files are in
the same object directory as the commit-graph-chain file. This will be updated
in a future change. This change is purposefully simple so we can isolate the
different concerns.
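
As a rough, standalone sketch of the chain-file format described above
(one hex hash per line, base first), the parsing step might look like
this. HEXSZ, parse_chain() and split_graph_path() are names invented
for the example; the patch itself uses strbuf_getline_lf() and
get_oid_hex(), and SHA-1 hex length is assumed here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HEXSZ 40 /* assumption: SHA-1, 40 hex digits per hash */

/* Build the path of one layer; returns -1 if `out` is too small. */
static int split_graph_path(char *out, size_t outsz,
			    const char *obj_dir, const char *hex)
{
	int n;

	assert(out && obj_dir && hex);
	n = snprintf(out, outsz, "%s/info/commit-graphs/graph-%s.graph",
		     obj_dir, hex);
	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/*
 * Copy each line of the chain file contents into `hashes`, base first.
 * Returns the number of layers, or -1 on a malformed line or overflow.
 */
static int parse_chain(const char *contents,
		       char (*hashes)[HEXSZ + 1], int max)
{
	int count = 0;

	while (*contents) {
		size_t len = strcspn(contents, "\n");
		size_t i;

		if (len != HEXSZ || count >= max)
			return -1;
		for (i = 0; i < len; i++)
			if (!isxdigit((unsigned char)contents[i]))
				return -1;

		memcpy(hashes[count], contents, HEXSZ);
		hashes[count][HEXSZ] = '\0';
		count++;

		contents += len;
		if (*contents == '\n')
			contents++;
	}
	return count;
}
```

The first entry of the result names the "base" file and the last entry
names the "tip" file, matching the order of the chain file.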

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e2f438f6a3..70e44393b8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,19 @@ char *get_commit_graph_filename(const char *obj_dir)
 	return xstrfmt("%s/info/commit-graph", obj_dir);
 }
 
+static char *get_split_graph_filename(const char *obj_dir,
+				      const char *oid_hex)
+{
+	return xstrfmt("%s/info/commit-graphs/graph-%s.graph",
+		       obj_dir,
+		       oid_hex);
+}
+
+static char *get_chain_filename(const char *obj_dir)
+{
+	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", obj_dir);
+}
+
 static uint8_t oid_version(void)
 {
 	return 1;
@@ -286,18 +299,93 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 	return load_commit_graph_one_fd_st(fd, &st);
 }
 
+static int prepare_commit_graph_v1(struct repository *r, const char *obj_dir)
+{
+	char *graph_name = get_commit_graph_filename(obj_dir);
+	r->objects->commit_graph = load_commit_graph_one(graph_name);
+	free(graph_name);
+
+	return r->objects->commit_graph ? 0 : -1;
+}
+
+static int add_graph_to_chain(struct commit_graph *g,
+			      struct commit_graph *chain,
+			      struct object_id *oids,
+			      int n)
+{
+	struct commit_graph *cur_g = chain;
+
+	while (n) {
+		n--;
+		cur_g = cur_g->base_graph;
+	}
+
+	g->base_graph = chain;
+
+	if (chain)
+		g->num_commits_in_base = chain->num_commits + chain->num_commits_in_base;
+
+	return 1;
+}
+
+static void prepare_commit_graph_chain(struct repository *r, const char *obj_dir)
+{
+	struct strbuf line = STRBUF_INIT;
+	struct stat st;
+	struct object_id *oids;
+	int i = 0, valid = 1;
+	char *chain_name = get_chain_filename(obj_dir);
+	FILE *fp;
+
+	if (stat(chain_name, &st))
+		return;
+
+	if (st.st_size <= the_hash_algo->hexsz)
+		return;
+
+	fp = fopen(chain_name, "r");
+	free(chain_name);
+
+	if (!fp)
+		return;
+
+	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
+
+	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
+		char *graph_name;
+		struct commit_graph *g;
+
+		if (get_oid_hex(line.buf, &oids[i])) {
+			warning(_("invalid commit-graph chain: line '%s' not a hash"),
+				line.buf);
+			valid = 0;
+			break;
+		}
+
+		graph_name = get_split_graph_filename(obj_dir, line.buf);
+		g = load_commit_graph_one(graph_name);
+		free(graph_name);
+
+		if (g && add_graph_to_chain(g, r->objects->commit_graph, oids, i))
+			r->objects->commit_graph = g;
+		else
+			valid = 0;
+	}
+
+	free(oids);
+	fclose(fp);
+}
+
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 {
-	char *graph_name;
 
 	if (r->objects->commit_graph)
 		return;
 
-	graph_name = get_commit_graph_filename(obj_dir);
-	r->objects->commit_graph =
-		load_commit_graph_one(graph_name);
+	if (!prepare_commit_graph_v1(r, obj_dir))
+		return;
 
-	FREE_AND_NULL(graph_name);
+	prepare_commit_graph_chain(r, obj_dir);
 }
 
 /*
-- 
gitgitgadget



* [PATCH v2 05/11] commit-graph: add base graphs chunk
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 04/11] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 06/11] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To quickly verify a commit-graph chain is valid on load, we will
read from the new "Base Graphs Chunk" of each file in the chain.
This will prevent accidentally loading incorrect data from manually
editing the commit-graph-chain file or renaming graph-{hash}.graph
files.

The commit_graph struct already had an object_id struct "oid", but
it was never initialized or used. Add a line to read the hash from
the end of the commit-graph file and into the oid member.
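
The check against the Base Graphs Chunk amounts to comparing raw
hashes entry by entry. A minimal sketch, assuming the chain file
hashes have already been converted to raw bytes; base_chunk_matches()
is a hypothetical helper, not a function added by this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * chunk: Base Graphs Chunk of the layer being added (NULL if absent)
 * chain: `n` raw hashes taken from the chain file, base first
 *
 * Returns 1 if the layer agrees with the chain below it, else 0.
 */
static int base_chunk_matches(const unsigned char *chunk,
			      const unsigned char *chain,
			      size_t hash_len, size_t n)
{
	size_t i;

	assert(hash_len > 0);

	if (!n)
		return 1; /* the base-most layer needs no chunk */
	if (!chunk)
		return 0; /* higher layers must carry the chunk */

	for (i = 0; i < n; i++)
		if (memcmp(chunk + i * hash_len,
			   chain + i * hash_len, hash_len))
			return 0;
	return 1;
}
```

A renamed graph-{hash}.graph file or a hand-edited chain file would
make at least one of these comparisons fail.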

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .../technical/commit-graph-format.txt         | 11 ++++++++--
 commit-graph.c                                | 22 +++++++++++++++++++
 commit-graph.h                                |  1 +
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 16452a0504..a4f17441ae 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -44,8 +44,9 @@ HEADER:
 
   1-byte number (C) of "chunks"
 
-  1-byte (reserved for later use)
-     Current clients should ignore this value.
+  1-byte number (B) of base commit-graphs
+      We infer the length (H*B) of the Base Graphs chunk
+      from this value.
 
 CHUNK LOOKUP:
 
@@ -92,6 +93,12 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
+      This list of H-byte hashes describes a set of B commit-graph files that
+      form a commit-graph chain. The graph position for the ith commit in this
+      file's OID Lookup chunk is equal to i plus the number of commits in all
+      base graphs.  If B is non-zero, this chunk must exist.
+
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
diff --git a/commit-graph.c b/commit-graph.c
index 70e44393b8..060897fff0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -22,6 +22,7 @@
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -262,6 +263,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			else
 				graph->chunk_extra_edges = data + chunk_offset;
 			break;
+
+		case GRAPH_CHUNKID_BASE:
+			if (graph->chunk_base_graphs)
+				chunk_repeated = 1;
+			else
+				graph->chunk_base_graphs = data + chunk_offset;
 		}
 
 		if (chunk_repeated) {
@@ -280,6 +287,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		last_chunk_offset = chunk_offset;
 	}
 
+	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
+
 	if (verify_commit_graph_lite(graph))
 		return NULL;
 
@@ -315,8 +324,21 @@ static int add_graph_to_chain(struct commit_graph *g,
 {
 	struct commit_graph *cur_g = chain;
 
+	if (n && !g->chunk_base_graphs) {
+		warning(_("commit-graph has no base graphs chunk"));
+		return 0;
+	}
+
 	while (n) {
 		n--;
+
+		if (!oideq(&oids[n], &cur_g->oid) ||
+		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {
+			warning(_("commit-graph chain does not match"));
+			return 0;
+		}
+
+
 		cur_g = cur_g->base_graph;
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index f9fe32ebe3..80f4917ddb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -55,6 +55,7 @@ struct commit_graph {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
+	const unsigned char *chunk_base_graphs;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget



* [PATCH v2 06/11] commit-graph: rearrange chunk count logic
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 05/11] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 07/11] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The number of chunks in a commit-graph file can change depending on
whether we need the Extra Edges Chunk. We are going to add more optional
chunks, and it will be helpful to rearrange this logic around the chunk
count before doing so.

Specifically, we need to finalize the number of chunks before writing
the commit-graph header. Further, we need to fill out the chunk
lookup table dynamically; using "num_chunks" as the running index
while appending optional chunks makes it easier to add more optional
chunks in the future.
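
The idea can be sketched in isolation as a two-step process: collect
chunk ids and sizes first, then derive the offsets once the count is
final. The struct and function names below are made up for the
example, and the constants (8-byte header, 12-byte lookup entries) are
assumed to mirror the values used in commit-graph.c.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHUNKS 5
#define HEADER_SIZE 8
#define CHUNK_LOOKUP_WIDTH 12 /* 4-byte id + 8-byte offset */

struct chunk_table {
	uint32_t ids[MAX_CHUNKS + 1];     /* ids[nr] is the 0 terminator */
	uint64_t sizes[MAX_CHUNKS];
	uint64_t offsets[MAX_CHUNKS + 1]; /* offsets[nr] is the end of data */
	int nr;
};

/* Append one chunk; optional chunks are simply not added. */
static void add_chunk(struct chunk_table *t, uint32_t id, uint64_t size)
{
	assert(t->nr < MAX_CHUNKS);
	t->ids[t->nr] = id;
	t->sizes[t->nr] = size;
	t->nr++;
}

/* Offsets depend on the final chunk count, so compute them last. */
static void finalize_chunks(struct chunk_table *t)
{
	int i;

	t->ids[t->nr] = 0;
	t->offsets[0] = HEADER_SIZE +
			(uint64_t)(t->nr + 1) * CHUNK_LOOKUP_WIDTH;
	for (i = 0; i < t->nr; i++)
		t->offsets[i + 1] = t->offsets[i] + t->sizes[i];
}
```

Only after finalize_chunks() is the chunk count known, which is why
the header has to be written after this step rather than before it.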

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 060897fff0..b8a1444217 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1192,7 +1192,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[5];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = ctx->num_extra_edges ? 4 : 3;
+	int num_chunks = 3;
 
 	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
@@ -1205,27 +1205,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
-	hashwrite_be32(f, GRAPH_SIGNATURE);
-
-	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0); /* unused padding byte */
-
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (ctx->num_extra_edges)
-		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
-	else
-		chunk_ids[3] = 0;
-	chunk_ids[4] = 0;
+	if (ctx->num_extra_edges) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		num_chunks++;
+	}
+
+	chunk_ids[num_chunks] = 0;
 
 	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
+
+	num_chunks = 3;
+	if (ctx->num_extra_edges) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						4 * ctx->num_extra_edges;
+		num_chunks++;
+	}
+
+	hashwrite_be32(f, GRAPH_SIGNATURE);
+
+	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, 0);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
-- 
gitgitgadget



* [PATCH v2 07/11] commit-graph: write commit-graph chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 06/11] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 08/11] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Extend write_commit_graph() to write a commit-graph chain when given the
COMMIT_GRAPH_SPLIT flag.

This implementation is purposefully simplistic in how it creates a new
chain. The commits not already in the chain are added to a new tip
commit-graph file.

Much of the logic around writing a graph-{hash}.graph file and updating
the commit-graph-chain file is the same as the commit-graph file case.
However, there are several places where we need to do some extra logic
in the split case.

Track the list of graph filenames before and after the planned write.
This will be more important when we start merging graph files, but it
also allows us to upgrade our commit-graph file to the appropriate
graph-{hash}.graph file when we upgrade to a chain of commit-graphs.

Note that we use the eighth byte of the commit-graph header to store the
number of base graph files. This determines the length of the base
graphs chunk.
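
For illustration, here is how a reader could pull that byte out of the
8-byte header and infer the chunk length from it. This is a sketch
with hypothetical names (struct graph_header, parse_header(),
base_chunk_len()); the real parsing lives in parse_commit_graph().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct graph_header {
	uint32_t signature;      /* bytes 1-4, big-endian ("CGPH") */
	uint8_t version;         /* byte 5 */
	uint8_t hash_version;    /* byte 6 */
	uint8_t num_chunks;      /* byte 7 */
	uint8_t num_base_graphs; /* byte 8: number of base graph files */
};

/* Returns 0 on success, -1 if the buffer is shorter than the header. */
static int parse_header(const unsigned char *buf, size_t len,
			struct graph_header *h)
{
	assert(buf && h);

	if (len < 8)
		return -1;

	h->signature = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	h->version = buf[4];
	h->hash_version = buf[5];
	h->num_chunks = buf[6];
	h->num_base_graphs = buf[7];
	return 0;
}

/* Length in bytes of the base graphs chunk: one raw hash per base file. */
static size_t base_chunk_len(const struct graph_header *h, size_t hash_len)
{
	return hash_len * h->num_base_graphs;
}
```

Since the count is a single byte, a chain described this way can
reference at most 255 base graph files.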

A subtle change of behavior with the new logic is that we do not write a
commit-graph if our commit list is empty. This extends to the typical
case, which is reflected in t5318-commit-graph.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 286 ++++++++++++++++++++++++++++++++++++++--
 commit-graph.h          |   2 +
 t/t5318-commit-graph.sh |   2 +-
 3 files changed, 278 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b8a1444217..fb972c0bc2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -300,12 +300,18 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 
 	struct stat st;
 	int fd;
+	struct commit_graph *g;
 	int open_ok = open_commit_graph(graph_file, &fd, &st);
 
 	if (!open_ok)
 		return NULL;
 
-	return load_commit_graph_one_fd_st(fd, &st);
+	g = load_commit_graph_one_fd_st(fd, &st);
+
+	if (g)
+		g->filename = xstrdup(graph_file);
+
+	return g;
 }
 
 static int prepare_commit_graph_v1(struct repository *r, const char *obj_dir)
@@ -709,8 +715,19 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+
+	char *base_graph_name;
+	int num_commit_graphs_before;
+	int num_commit_graphs_after;
+	char **commit_graph_filenames_before;
+	char **commit_graph_filenames_after;
+	char **commit_graph_hash_after;
+	uint32_t new_num_commits_in_base;
+	struct commit_graph *new_base_graph;
+
 	unsigned append:1,
-		 report_progress:1;
+		 report_progress:1,
+		 split:1;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -780,6 +797,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.nr,
 					      commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -800,6 +827,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.list,
 					      ctx->commits.nr,
 					      commit_to_sha1);
+
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -857,6 +895,16 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 						  ctx->commits.nr,
 						  commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -948,7 +996,13 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
-		if (commit && !parse_commit_no_graph(commit))
+		if (!commit)
+			continue;
+		if (ctx->split) {
+			if (!parse_commit(commit) &&
+			    commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+				add_missing_parents(ctx, commit);
+		} else if (!parse_commit_no_graph(commit))
 			add_missing_parents(ctx, commit);
 	}
 	stop_progress(&ctx->progress);
@@ -1144,8 +1198,16 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) {
+			if (ctx->split) {
+				struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+				if (!c || c->graph_pos != COMMIT_NOT_FROM_GRAPH)
+					continue;
+			}
+
 			count_distinct++;
+		}
 	}
 	stop_progress(&ctx->progress);
 
@@ -1168,7 +1230,13 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
+		ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
 		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+		if (ctx->split &&
+		    ctx->commits.list[ctx->commits.nr]->graph_pos != COMMIT_NOT_FROM_GRAPH)
+			continue;
+
 		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
 		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
@@ -1183,18 +1251,86 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static int write_graph_chunk_base_1(struct hashfile *f,
+				    struct commit_graph *g)
+{
+	int num = 0;
+
+	if (!g)
+		return 0;
+
+	num = write_graph_chunk_base_1(f, g->base_graph);
+	hashwrite(f, g->oid.hash, the_hash_algo->rawsz);
+	return num + 1;
+}
+
+static int write_graph_chunk_base(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
+{
+	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
+
+	if (num != ctx->num_commit_graphs_after - 1) {
+		error(_("failed to write correct number of base graph ids"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t i;
+
+	ctx->new_base_graph = g;
+	ctx->base_graph_name = xstrdup(g->filename);
+	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	if (ctx->num_commit_graphs_before)
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
+			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
+
+	i = ctx->num_commit_graphs_before - 1;
+
+	while (g) {
+		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+		i--;
+		g = g->base_graph;
+	}
+}
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[5];
-	uint64_t chunk_offsets[5];
+	uint32_t chunk_ids[6];
+	uint64_t chunk_offsets[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	struct object_id file_hash;
+
+	if (ctx->split) {
+		struct strbuf tmp_file = STRBUF_INIT;
+
+		strbuf_addf(&tmp_file,
+			    "%s/info/commit-graphs/tmp_graph_XXXXXX",
+			    ctx->obj_dir);
+		ctx->graph_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	}
 
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
 		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
@@ -1202,8 +1338,23 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		return errno;
 	}
 
-	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
-	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	if (ctx->split) {
+		char *lock_name = get_chain_filename(ctx->obj_dir);
+
+		hold_lock_file_for_update(&lk, lock_name, LOCK_DIE_ON_ERROR);
+
+		fd = git_mkstemp_mode(ctx->graph_name, 0444);
+		if (fd < 0) {
+			error(_("unable to create '%s'"), ctx->graph_name);
+			return -1;
+		}
+
+		f = hashfd(fd, ctx->graph_name);
+	} else {
+		hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
+		fd = lk.tempfile->fd;
+		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
@@ -1212,6 +1363,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		num_chunks++;
+	}
 
 	chunk_ids[num_chunks] = 0;
 
@@ -1226,13 +1381,18 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						hashsz * (ctx->num_commit_graphs_after - 1);
+		num_chunks++;
+	}
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0);
+	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
@@ -1258,11 +1418,67 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->num_commit_graphs_after > 1 &&
+	    write_graph_chunk_base(f, ctx)) {
+		return -1;
+	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
+	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
+		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
+		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+
+		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2] = new_base_name;
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
+	}
+
 	close_commit_graph(ctx->r);
-	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
+	if (ctx->split) {
+		FILE *chainf = fdopen_lock_file(&lk, "w");
+		char *final_graph_name;
+		int result;
+
+		close(fd);
+
+		if (!chainf) {
+			error(_("unable to open commit-graph chain file"));
+			return -1;
+		}
+
+		if (ctx->base_graph_name) {
+			result = rename(ctx->base_graph_name,
+					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+
+			if (result) {
+				error(_("failed to rename base commit-graph file"));
+				return -1;
+			}
+		} else {
+			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
+			unlink(graph_name);
+		}
+
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1] = xstrdup(oid_to_hex(&file_hash));
+		final_graph_name = get_split_graph_filename(ctx->obj_dir,
+					ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 1] = final_graph_name;
+
+		result = rename(ctx->graph_name, final_graph_name);
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++)
+			fprintf(lk.tempfile->fp, "%s\n", ctx->commit_graph_hash_after[i]);
+
+		if (result) {
+			error(_("failed to rename temporary commit-graph file"));
+			return -1;
+		}
+	}
+
 	commit_lock_file(&lk);
 
 	return 0;
@@ -1285,6 +1501,30 @@ int write_commit_graph(const char *obj_dir,
 	ctx->obj_dir = obj_dir;
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+
+	if (ctx->split) {
+		struct commit_graph *g;
+		prepare_commit_graph(ctx->r);
+
+		g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			ctx->num_commit_graphs_before++;
+			g = g->base_graph;
+		}
+
+		if (ctx->num_commit_graphs_before) {
+			ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
+			i = ctx->num_commit_graphs_before;
+			g = ctx->r->objects->commit_graph;
+
+			while (g) {
+				ctx->commit_graph_filenames_before[--i] = xstrdup(g->filename);
+				g = g->base_graph;
+			}
+		}
+	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
@@ -1339,6 +1579,14 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split)
+		init_commit_graph_chain(ctx);
+	else
+		ctx->num_commit_graphs_after = 1;
+
 	compute_generation_numbers(ctx);
 
 	res = write_commit_graph_file(ctx);
@@ -1347,6 +1595,21 @@ int write_commit_graph(const char *obj_dir,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+
+	if (ctx->commit_graph_filenames_after) {
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
 	free(ctx);
 
 	return res;
@@ -1534,5 +1797,6 @@ void free_commit_graph(struct commit_graph *g)
 		g->data = NULL;
 		close(g->graph_fd);
 	}
+	free(g->filename);
 	free(g);
 }
diff --git a/commit-graph.h b/commit-graph.h
index 80f4917ddb..5c48c4f66a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -47,6 +47,7 @@ struct commit_graph {
 	unsigned char num_chunks;
 	uint32_t num_commits;
 	struct object_id oid;
+	char *filename;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
@@ -71,6 +72,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
+#define COMMIT_GRAPH_SPLIT      (1 << 2)
 
 int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3b6fd0d728..063f906b3e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -20,7 +20,7 @@ test_expect_success 'verify graph with no graph file' '
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
-	test_path_is_file info/commit-graph
+	test_path_is_missing info/commit-graph
 '
 
 test_expect_success 'close with correct error on bad input' '
-- 
gitgitgadget



* [PATCH v2 08/11] commit-graph: add --split option to builtin
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 07/11] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-27 11:28     ` SZEDER Gábor
  2019-05-22 19:53   ` [PATCH v2 09/11] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new "--split" option to the 'git commit-graph write' subcommand. This
option enables writing a commit-graph chain.

The current behavior will add a tip commit-graph containing any commits that
are not in the existing commit-graph or commit-graph chain. Later changes
will allow merging the chain and expiring outdated files.

Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
behavior.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c        |  10 ++-
 t/t5323-split-commit-graph.sh | 122 ++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 3 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 828b1a713f..c2c07d3917 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -35,9 +35,9 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int split;
 } opts;
 
-
 static int graph_verify(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -156,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "split", &opts.split,
+			N_("allow writing an incremental commit-graph file")),
 		OPT_END(),
 	};
 
@@ -169,6 +171,8 @@ static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
+	if (opts.split)
+		flags |= COMMIT_GRAPH_SPLIT;
 
 	read_replace_refs = 0;
 
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
new file mode 100755
index 0000000000..96704b9f5b
--- /dev/null
+++ b/t/t5323-split-commit-graph.sh
@@ -0,0 +1,122 @@
+#!/bin/sh
+
+test_description='split commit graph'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+graph_read_expect() {
+	NUM_BASE=0
+	if test -n "$2"
+	then
+		NUM_BASE=$2
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 3 $NUM_BASE
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	git commit-graph read >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'create commits and write commit-graph' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git commit-graph write --reachable &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect 3
+'
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output &&
+	git -c core.commitGraph=false $1 >expect &&
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'graph exists' commits/3 commits/1
+
+verify_chain_files_exist() {
+	for hash in $(cat $1/commit-graph-chain)
+	do
+		test_path_is_file $1/graph-$hash.graph
+	done
+}
+
+test_expect_success 'add more commits, and write a new base graph' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 12
+'
+
+test_expect_success 'add three more commits, write a tip graph' '
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
+
+test_expect_success 'add one commit, write a tip graph' '
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 3 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
+
+test_done
-- 
gitgitgadget



* [PATCH v2 09/11] commit-graph: merge commit-graph chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 08/11] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-23  0:43     ` Ævar Arnfjörð Bjarmason
  2019-05-22 19:53   ` [PATCH v2 10/11] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
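
To make the O(G log N) cost concrete, here is a minimal in-memory sketch of
a chain lookup. It is not the code in commit-graph.c: `struct layer`,
`find_in_layer` and `find_in_chain` are hypothetical names, and plain
integers stand in for object IDs. Each layer is binary-searched, and the
graph position is the position within the layer plus the number of commits
in all lower layers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory layer; integers stand in for object IDs. */
struct layer {
	const uint32_t *ids;          /* sorted ascending */
	uint32_t num_commits;         /* commits stored in this layer */
	uint32_t num_commits_in_base; /* commits in all lower layers */
	const struct layer *base;     /* next layer down, NULL at the bottom */
};

/* Binary search inside one layer: O(log n). */
static int find_in_layer(const struct layer *l, uint32_t id, uint32_t *pos)
{
	uint32_t lo = 0, hi = l->num_commits;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (l->ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (l->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Walk from the tip to the base: G layers, one binary search each. */
static int find_in_chain(const struct layer *tip, uint32_t id,
			 uint32_t *graph_pos)
{
	const struct layer *l;

	assert(graph_pos != NULL);
	for (l = tip; l; l = l->base) {
		uint32_t pos;

		if (find_in_layer(l, id, &pos)) {
			/* position in the concatenated order of all layers */
			*graph_pos = pos + l->num_commits_in_base;
			return 1;
		}
	}
	return 0;
}
```

A commit that is missing from every layer costs the full G binary searches,
which is the case that suffers most as the chain grows.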

To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:

  1. If the number of commits we are adding is at least half the number
     of commits in the graph below, then merge with that graph.

  2. If we are writing more than 64,000 commits into a single graph,
     then merge with all lower graphs.

The numeric values in the conditions above are currently constant, but
can become config options in a future update.
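
As a rough illustration of how the two conditions cascade, the sketch below
models the chain as an array of per-layer commit counts. The function name
`layers_after_write` and the array representation are assumptions made for
this example; the constants mirror the values quoted above, and the
comparison follows the loop in split_graph_merge_strategy() in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COMMITS 64000u /* condition 2: maximum commits in a split layer */
#define SIZE_MULT 2u       /* condition 1: "half", written as a multiple */

/*
 * layer_sizes[0] is the base layer, layer_sizes[num_layers - 1] the tip.
 * Returns the number of layers after the write, counting the new one.
 * *merged_commits receives the commit count of the new layer before
 * de-duplication.
 */
static size_t layers_after_write(const uint32_t *layer_sizes,
				 size_t num_layers, uint32_t new_commits,
				 uint64_t *merged_commits)
{
	uint64_t n = new_commits;
	size_t keep = num_layers;

	assert(merged_commits != NULL);

	/* Absorb layers from the tip down while either condition holds. */
	while (keep > 0 &&
	       (layer_sizes[keep - 1] <= SIZE_MULT * n || n > MAX_COMMITS)) {
		n += layer_sizes[keep - 1];
		keep--;
	}

	*merged_commits = n;
	return keep + 1;
}
```

With layers of 12, 3 and 1 commits and one new commit, this sketch absorbs
the top two layers (1 <= 2, then 3 <= 4) and stops at the base (12 > 10),
leaving a chain of two. That appears to match the two-line
commit-graph-chain expected by the "write a merged graph" test below.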

As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.

After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt |  81 ++++++++++
 commit-graph.c                           | 190 +++++++++++++++++++----
 t/t5323-split-commit-graph.sh            |  13 ++
 3 files changed, 251 insertions(+), 33 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 1dca3bd8fe..083fff9927 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -186,6 +186,87 @@ positions to refer to their parents, which may be in `graph-{hash1}.graph` or
 its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
 X2).
 
+Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
+specifying the hashes of all files in the lower layers. In the above example,
+`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
+`{hash0}` and `{hash1}`.
+
+## Merging commit-graph files
+
+If we only added a new commit-graph file on every write, we would run into a
+linear search problem through many commit-graph files.  Instead, we use a merge
+strategy to decide when the stack should collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits is added, it
+is determined by the merge strategy that the files should collapse to
+`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
+the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
+file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |  graph-{hash2}        |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |  graph-{hash1}        |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                  tmp_graphXXX
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}        |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined and sorted, then written
+to a temporary file, all while holding a `commit-graph-chain.lock`
+lock-file.  When the file is flushed, we rename it to `graph-{hash3}`
+according to the computed `{hash3}`. Finally, we write the new chain data to
+`commit-graph-chain.lock`:
+
+```
+	{hash3}
+	{hash0}
+```
+
+We then close the lock-file.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the commit-graph stack of
+height N, we default to creating a new file at level N + 1. We then decide to
+merge with the Nth level if one of two conditions holds:
+
+  1. The expected file size for level N + 1 is at least half the file size for
+     level N.
+
+  2. Level N + 1 contains more than MAX_SPLIT_COMMITS commits (64,000
+     commits).
+
+This decision cascades down the levels: when we merge a level we create a new
+set of commits that then compares to the next level.
+
+The first condition bounds the number of levels to be logarithmic in the total
+number of commits.  The second condition bounds the total number of commits in
+a `graph-{hashN}` file and not in the `commit-graph` file, preventing
+significant performance issues when the stack merges and another process only
+partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple, 64,000 for the maximum
+number of commits) could be extracted into config settings for full
+flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
diff --git a/commit-graph.c b/commit-graph.c
index fb972c0bc2..e9784cd559 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1277,36 +1277,6 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
-{
-	struct commit_graph *g = ctx->r->objects->commit_graph;
-	uint32_t i;
-
-	ctx->new_base_graph = g;
-	ctx->base_graph_name = xstrdup(g->filename);
-	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
-
-	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
-
-	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
-	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
-
-	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
-		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
-
-	if (ctx->num_commit_graphs_before)
-		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
-			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
-
-	i = ctx->num_commit_graphs_before - 1;
-
-	while (g) {
-		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
-		i--;
-		g = g->base_graph;
-	}
-}
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
@@ -1484,6 +1454,155 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
+static int split_strategy_max_commits = 64000;
+static float split_strategy_size_mult = 2.0f;
+
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t num_commits = ctx->commits.nr;
+	uint32_t i;
+
+	g = ctx->r->objects->commit_graph;
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
+		     num_commits > split_strategy_max_commits)) {
+		num_commits += g->num_commits;
+		g = g->base_graph;
+
+		ctx->num_commit_graphs_after--;
+	}
+
+	ctx->new_base_graph = g;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_after &&
+		    i < ctx->num_commit_graphs_before; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	i = ctx->num_commit_graphs_before - 1;
+	g = ctx->r->objects->commit_graph;
+
+	while (g) {
+		if (i < ctx->num_commit_graphs_after)
+			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+
+		i--;
+		g = g->base_graph;
+	}
+}
+
+static void merge_commit_graph(struct write_commit_graph_context *ctx,
+			       struct commit_graph *g)
+{
+	uint32_t i;
+	uint32_t offset = g->num_commits_in_base;
+
+	ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+
+		load_oid_from_graph(g, i + offset, &oid);
+
+		/* only add commits if they still exist in the repo */
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ctx->commits.list[ctx->commits.nr] = result;
+			ctx->commits.nr++;
+		}
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit **)_a;
+	const struct commit *b = *(const struct commit **)_b;
+	return oidcmp(&a->object.oid, &b->object.oid);
+}
+
+static void deduplicate_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, num_parents, last_distinct = 0, duplicates = 0;
+	struct commit_list *parent;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("De-duplicating merged commits"),
+					ctx->commits.nr);
+
+	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+
+	ctx->num_extra_edges = 0;
+	for (i = 1; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i);
+
+		if (oideq(&ctx->commits.list[last_distinct]->object.oid,
+			  &ctx->commits.list[i]->object.oid)) {
+			duplicates++;
+		} else {
+			if (duplicates)
+				ctx->commits.list[last_distinct + 1] = ctx->commits.list[i];
+			last_distinct++;
+
+			num_parents = 0;
+			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
+				num_parents++;
+
+			if (num_parents > 2)
+				ctx->num_extra_edges += num_parents - 2;
+		}
+	}
+
+	ctx->commits.nr -= duplicates;
+	stop_progress(&ctx->progress);
+}
+
+static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t current_graph_number = ctx->num_commit_graphs_before;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
+		current_graph_number--;
+
+		if (ctx->report_progress) {
+			if (current_graph_number)
+				strbuf_addf(&progress_title,
+					    _("Merging commit-graph-%d"),
+					    current_graph_number);
+			else
+				strbuf_addstr(&progress_title,
+					      _("Merging commit-graph"));
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		}
+
+		merge_commit_graph(ctx, g);
+		stop_progress(&ctx->progress);
+		strbuf_release(&progress_title);
+
+		g = g->base_graph;
+	}
+
+	if (g) {
+		ctx->new_base_graph = g;
+		ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+	}
+
+	if (ctx->new_base_graph)
+		ctx->base_graph_name = xstrdup(ctx->new_base_graph->filename);
+
+	deduplicate_commits(ctx);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1529,6 +1648,9 @@ int write_commit_graph(const char *obj_dir,
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
+	if (ctx->split && ctx->oids.alloc > split_strategy_max_commits)
+		ctx->oids.alloc = split_strategy_max_commits;
+
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
 		if (ctx->r->objects->commit_graph)
@@ -1582,9 +1704,11 @@ int write_commit_graph(const char *obj_dir,
 	if (!ctx->commits.nr)
 		goto cleanup;
 
-	if (ctx->split)
-		init_commit_graph_chain(ctx);
-	else
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
+
+		merge_commit_graphs(ctx);
+	} else
 		ctx->num_commit_graphs_after = 1;
 
 	compute_generation_numbers(ctx);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index 96704b9f5b..3902fd9aee 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -119,4 +119,17 @@ test_expect_success 'add one commit, write a tip graph' '
 
 graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
 
+test_expect_success 'add one commit, write a merged graph' '
+	test_commit 12 &&
+	git branch commits/12 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 2 $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 4 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v2 10/11] commit-graph: allow cross-alternate chains
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 09/11] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-05-22 19:53   ` [PATCH v2 11/11] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an environment like a fork network, it is helpful to have a
commit-graph chain that spans both the base repo and the fork repo. The
fork is usually a small set of data on top of the large repo, but
sometimes the fork is much larger. For example, git-for-windows/git has
almost twice as many commits as git/git because it rebases its
commits on every major version update.

To allow cross-alternate commit-graph chains, we need a few pieces:

1. When looking for a graph-{hash}.graph file, check all alternates.

2. When merging commit-graph chains, do not merge across alternates.

3. When writing a new commit-graph chain based on a commit-graph file
   in another object directory, do not allow success if the base file
   has the name "commit-graph" instead of
   "commit-graphs/graph-{hash}.graph".
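
The first piece can be pictured with a small in-memory model. This is only a
sketch under stated assumptions: `struct obj_dir`, `find_layer_dir` and
`resolve_chain` are hypothetical names, a list of hashes stands in for the
graph-{hash}.graph files on disk, and treating a missing layer as a failure
is a choice made for the example rather than something this patch spells
out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one object directory and the layers it holds. */
struct obj_dir {
	const char *path;
	const char *const *hashes; /* one entry per graph-{hash}.graph file */
	size_t num_hashes;
};

/* Return the first directory (local first, then alternates) holding hash. */
static const struct obj_dir *find_layer_dir(const struct obj_dir *dirs,
					    size_t num_dirs,
					    const char *hash)
{
	size_t i, j;

	assert(hash != NULL);
	for (i = 0; i < num_dirs; i++)
		for (j = 0; j < dirs[i].num_hashes; j++)
			if (!strcmp(dirs[i].hashes[j], hash))
				return &dirs[i];
	return NULL;
}

/*
 * Resolve every hash in the chain to its owning directory.
 * Returns 1 on success, 0 as soon as a layer cannot be found anywhere.
 */
static int resolve_chain(const char *const *chain, size_t chain_len,
			 const struct obj_dir *dirs, size_t num_dirs,
			 const struct obj_dir **owners)
{
	size_t i;

	for (i = 0; i < chain_len; i++) {
		owners[i] = find_layer_dir(dirs, num_dirs, chain[i]);
		if (!owners[i])
			return 0;
	}
	return 1;
}
```

Recording the owning directory per layer is what lets the merge step stop at
the first layer that lives in a different object directory, as the second
piece requires.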

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 40 +++++++++++++++++++++
 commit-graph.c                           | 46 ++++++++++++++++++------
 commit-graph.h                           |  1 +
 t/t5323-split-commit-graph.sh            | 37 +++++++++++++++++++
 4 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 083fff9927..b6fe8b2321 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -267,6 +267,42 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Chains across multiple object directories
+
+In a repo with alternates, we look for the `commit-graph-chain` file starting
+in the local object directory and then in each alternate. The first file that
+exists defines our chain. As we look for the `graph-{hash}` files for
+each `{hash}` in the chain file, we search the object directories in the same
+order.
+
+This allows commit-graphs to be split across multiple forks in a fork network.
+The typical case is a large "base" repo with many smaller forks.
+
+As the base repo advances, it will likely update and merge its commit-graph
+chain more frequently than the forks. If a fork updates their commit-graph after
+the base repo, then it should "reparent" the commit-graph chain onto the new
+chain in the base repo. When reading each `graph-{hash}` file, we track
+the object directory containing it. During a write of a new commit-graph file,
+we check for any changes in the source object directory and read the
+`commit-graph-chain` file for that source and create a new file based on those
+files. During this "reparent" operation, we necessarily need to collapse all
+levels in the fork, as all of the files are invalid against the new base file.
+
+It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
+files in this scenario. It falls to the user to define the proper settings for
+their custom environment:
+
+ 1. When merging levels in the base repo, the unreferenced files may still be
+    referenced by chains from fork repos.
+
+ 2. The expiry time should be set to a length of time such that every fork has
+    time to recompute their commit-graph chain to "reparent" onto the new base
+    file(s).
+
+ 3. If the commit-graph chain is updated in the base, the fork will not have
+    access to the new chain until its chain is updated to reference those files.
+    (This may change in the future [5].)
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
@@ -293,3 +329,7 @@ Related Links
 
 [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
     A patch to remove the ahead-behind calculation from 'status'.
+
+[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
+    A discussion of a "two-dimensional graph position" that can allow reading
+    multiple commit-graph chains at the same time.
diff --git a/commit-graph.c b/commit-graph.c
index e9784cd559..0d8b942e2b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,9 @@ static int prepare_commit_graph_v1(struct repository *r, const char *obj_dir)
 	r->objects->commit_graph = load_commit_graph_one(graph_name);
 	free(graph_name);
 
+	if (r->objects->commit_graph)
+		r->objects->commit_graph->obj_dir = obj_dir;
+
 	return r->objects->commit_graph ? 0 : -1;
 }
 
@@ -380,8 +383,7 @@ static void prepare_commit_graph_chain(struct repository *r, const char *obj_dir
 	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
 
 	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
-		char *graph_name;
-		struct commit_graph *g;
+		struct object_directory *odb;
 
 		if (get_oid_hex(line.buf, &oids[i])) {
 			warning(_("invalid commit-graph chain: line '%s' not a hash"),
@@ -390,14 +392,23 @@ static void prepare_commit_graph_chain(struct repository *r, const char *obj_dir
 			break;
 		}
 
-		graph_name = get_split_graph_filename(obj_dir, line.buf);
-		g = load_commit_graph_one(graph_name);
-		free(graph_name);
+		for (odb = r->objects->odb; odb; odb = odb->next) {
+			char *graph_name = get_split_graph_filename(odb->path, line.buf);
+			struct commit_graph *g = load_commit_graph_one(graph_name);
 
-		if (g && add_graph_to_chain(g, r->objects->commit_graph, oids, i))
-			r->objects->commit_graph = g;
-		else
-			valid = 0;
+			free(graph_name);
+
+			if (g) {
+				g->obj_dir = odb->path;
+
+				if (add_graph_to_chain(g, r->objects->commit_graph, oids, i))
+					r->objects->commit_graph = g;
+				else
+					valid = 0;
+
+				break;
+			}
+		}
 	}
 
 	free(oids);
@@ -1397,7 +1408,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
 		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
-		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+		char *new_base_name = get_split_graph_filename(ctx->new_base_graph->obj_dir, new_base_hash);
 
 		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
 		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
@@ -1468,6 +1479,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
 		     num_commits > split_strategy_max_commits)) {
+		if (strcmp(g->obj_dir, ctx->obj_dir))
+			break;
+
 		num_commits += g->num_commits;
 		g = g->base_graph;
 
@@ -1476,6 +1490,18 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	ctx->new_base_graph = g;
 
+	if (ctx->num_commit_graphs_after == 2) {
+		char *old_graph_name = get_commit_graph_filename(g->obj_dir);
+
+		if (!strcmp(g->filename, old_graph_name) &&
+		    strcmp(g->obj_dir, ctx->obj_dir)) {
+			ctx->num_commit_graphs_after = 1;
+			ctx->new_base_graph = NULL;
+		}
+
+		free(old_graph_name);
+	}
+
 	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
 	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
 
diff --git a/commit-graph.h b/commit-graph.h
index 5c48c4f66a..10466bc064 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,7 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 	char *filename;
+	const char *obj_dir;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index 3902fd9aee..ed5cac8617 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -90,6 +90,21 @@ test_expect_success 'add more commits, and write a new base graph' '
 	graph_read_expect 12
 '
 
+test_expect_success 'fork and fail to base a chain on a commit-graph file' '
+	test_when_finished rm -rf fork &&
+	git clone . fork &&
+	(
+		cd fork &&
+		rm .git/objects/info/commit-graph &&
+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
+		test_commit new-commit &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir
+	)
+'
+
 test_expect_success 'add three more commits, write a tip graph' '
 	git reset --hard commits/3 &&
 	git merge merge/1 &&
@@ -132,4 +147,26 @@ test_expect_success 'add one commit, write a merged graph' '
 
 graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
 
+test_expect_success 'create fork and chain across alternate' '
+	git clone . fork &&
+	(
+		cd fork &&
+		git config core.commitGraph true &&
+		rm -rf $graphdir &&
+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
+		test_commit 13 &&
+		git branch commits/13 &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files &&
+		git -c core.commitGraph=true  rev-list HEAD >expect &&
+		git -c core.commitGraph=false rev-list HEAD >actual &&
+		test_cmp expect actual
+	)
+'
+
+graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v2 11/11] commit-graph: expire commit-graph files
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 10/11] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
@ 2019-05-22 19:53   ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-22 19:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we merge commit-graph files in a commit-graph chain, we should clean
up the files that are no longer used.

This change introduces an 'expire_window' value to the context, which is
always zero (for now). We then check the modified time of each
graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and
unlink the files that are older than the expire_window.

Since this is always zero, this immediately clears all unused graph
files. We will update the value to match a config setting in a future
change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 15 +++++
 commit-graph.c                           | 71 ++++++++++++++++++++++++
 t/t5323-split-commit-graph.sh            |  2 +-
 3 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index b6fe8b2321..4ecec54148 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -267,6 +267,21 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Deleting graph-{hash} files
+
+After a new tip file is written, some `graph-{hash}` files may no longer
+be part of a chain. It is important to remove these files from disk, eventually.
+The main reason to delay removal is that another process could read the
+`commit-graph-chain` file before it is rewritten, but then look for the
+`graph-{hash}` files after they are deleted.
+
+To allow holding old split commit-graphs for a while after they are unreferenced,
+we update the modified times of the files when they become unreferenced. Then,
+we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
+files whose modified times are older than a given expiry window. This window
+defaults to zero, but can be changed using command-line arguments or a config
+setting.
+
 ## Chains across multiple object directories
 
 In a repo with alternates, we look for the `commit-graph-chain` file starting
diff --git a/commit-graph.c b/commit-graph.c
index 0d8b942e2b..69c4f83c2a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -739,6 +739,8 @@ struct write_commit_graph_context {
 	unsigned append:1,
 		 report_progress:1,
 		 split:1;
+
+	time_t expire_window;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1629,6 +1631,70 @@ static void merge_commit_graphs(struct write_commit_graph_context *ctx)
 	deduplicate_commits(ctx);
 }
 
+static void mark_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	uint32_t i;
+	time_t now = time(NULL);
+
+	for (i = ctx->num_commit_graphs_after - 1; i < ctx->num_commit_graphs_before; i++) {
+		struct stat st;
+		struct utimbuf updated_time;
+
+		stat(ctx->commit_graph_filenames_before[i], &st);
+
+		updated_time.actime = st.st_atime;
+		updated_time.modtime = now;
+		utime(ctx->commit_graph_filenames_before[i], &updated_time);
+	}
+}
+
+static void expire_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct strbuf path = STRBUF_INIT;
+	DIR *dir;
+	struct dirent *de;
+	size_t dirnamelen;
+	time_t expire_time = time(NULL) - ctx->expire_window;
+
+	strbuf_addstr(&path, ctx->obj_dir);
+	strbuf_addstr(&path, "/info/commit-graphs");
+	dir = opendir(path.buf);
+
+	if (!dir) {
+		strbuf_release(&path);
+		return;
+	}
+
+	strbuf_addch(&path, '/');
+	dirnamelen = path.len;
+	while ((de = readdir(dir)) != NULL) {
+		struct stat st;
+		uint32_t i, found = 0;
+
+		strbuf_setlen(&path, dirnamelen);
+		strbuf_addstr(&path, de->d_name);
+
+		stat(path.buf, &st);
+
+		if (st.st_mtime > expire_time)
+			continue;
+		if (path.len < 6 || strcmp(path.buf + path.len - 6, ".graph"))
+			continue;
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			if (!strcmp(ctx->commit_graph_filenames_after[i],
+				    path.buf)) {
+				found = 1;
+				break;
+			}
+		}
+
+		if (!found)
+			unlink(path.buf);
+
+	}
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1741,6 +1807,11 @@ int write_commit_graph(const char *obj_dir,
 
 	res = write_commit_graph_file(ctx);
 
+	if (ctx->split) {
+		mark_commit_graphs(ctx);
+		expire_commit_graphs(ctx);
+	}
+
 cleanup:
 	free(ctx->graph_name);
 	free(ctx->commits.list);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index ed5cac8617..6215efcac5 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -141,7 +141,7 @@ test_expect_success 'add one commit, write a merged graph' '
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 2 $graphdir/commit-graph-chain &&
 	ls $graphdir/graph-*.graph >graph-files &&
-	test_line_count = 4 graph-files &&
+	test_line_count = 2 graph-files &&
 	verify_chain_files_exist $graphdir
 '
 
-- 
gitgitgadget


* Re: [PATCH v2 09/11] commit-graph: merge commit-graph chains
  2019-05-22 19:53   ` [PATCH v2 09/11] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-05-23  0:43     ` Ævar Arnfjörð Bjarmason
  2019-05-23 13:00       ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-23  0:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee


On Wed, May 22 2019, Derrick Stolee via GitGitGadget wrote:

> To keep lookups fast, but also keep most incremental writes fast, create
> a strategy for merging levels of the commit-graph chain. The strategy is
> detailed in the commit-graph design document, but is summarized by these
> two conditions:
>
>   1. If the number of commits we are adding is more than half the number
>      of commits in the graph below, then merge with that graph.
>
>   2. If we are writing more than 64,000 commits into a single graph,
>      then merge with all lower graphs.
>
> The numeric values in the conditions above are currently constant, but
> can become config options in a future update.
> [...]
> +## Merge Strategy
> +
> +When writing a set of commits that do not exist in the commit-graph stack of
> +height N, we default to creating a new file at level N + 1. We then decide to
> +merge with the Nth level if one of two conditions hold:
> +
> +  1. The expected file size for level N + 1 is at least half the file size for
> +     level N.
> +
> +  2. Level N + 1 contains more than MAX_SPLIT_COMMITS commits (64,0000
> +     commits).
> +
> +This decision cascades down the levels: when we merge a level we create a new
> +set of commits that then compares to the next level.
> +
> +The first condition bounds the number of levels to be logarithmic in the total
> +number of commits.  The second condition bounds the total number of commits in
> +a `graph-{hashN}` file and not in the `commit-graph` file, preventing
> +significant performance issues when the stack merges and another process only
> +partially reads the previous stack.
> +
> +The merge strategy values (2 for the size multiple, 64,000 for the maximum
> +number of commits) could be extracted into config settings for full
> +flexibility.

As noted, this can become configurable, so it's no big deal. But is there
any reason for this 64K limit anymore?

While with the default expiry of 0sec we can still get that race, it
seems unlikely in practice, as the "commit-graph write" process would
write a new manifest at the end, then go and unlink() the old files.

So maybe at this point we could make this even dumber with something
that behaves like gc.autoPackLimit? I.e. keep writing new graphs, and
then coalesce them all (or maybe not the "base" graph, like
gc.bigPackThreshold)?

Also: These docs refer to MAX_SPLIT_COMMITS, but in v2 it's now a
"split_strategy_max_commits" variable instead.


* Re: [PATCH v2 09/11] commit-graph: merge commit-graph chains
  2019-05-23  0:43     ` Ævar Arnfjörð Bjarmason
@ 2019-05-23 13:00       ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-05-23 13:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, peff, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 5/22/2019 8:43 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, May 22 2019, Derrick Stolee via GitGitGadget wrote:
> 
>> To keep lookups fast, but also keep most incremental writes fast, create
>> a strategy for merging levels of the commit-graph chain. The strategy is
>> detailed in the commit-graph design document, but is summarized by these
>> two conditions:
>>
>>   1. If the number of commits we are adding is more than half the number
>>      of commits in the graph below, then merge with that graph.
>>
>>   2. If we are writing more than 64,000 commits into a single graph,
>>      then merge with all lower graphs.
>>
>> The numeric values in the conditions above are currently constant, but
>> can become config options in a future update.
>> [...]
>> +## Merge Strategy
>> +
>> +When writing a set of commits that do not exist in the commit-graph stack of
>> +height N, we default to creating a new file at level N + 1. We then decide to
>> +merge with the Nth level if one of two conditions hold:
>> +
>> +  1. The expected file size for level N + 1 is at least half the file size for
>> +     level N.
>> +
>> +  2. Level N + 1 contains more than MAX_SPLIT_COMMITS commits (64,0000
>> +     commits).
>> +
>> +This decision cascades down the levels: when we merge a level we create a new
>> +set of commits that then compares to the next level.
>> +
>> +The first condition bounds the number of levels to be logarithmic in the total
>> +number of commits.  The second condition bounds the total number of commits in
>> +a `graph-{hashN}` file and not in the `commit-graph` file, preventing
>> +significant performance issues when the stack merges and another process only
>> +partially reads the previous stack.
>> +
>> +The merge strategy values (2 for the size multiple, 64,000 for the maximum
>> +number of commits) could be extracted into config settings for full
>> +flexibility.
> 
> As noted, this can become configurable, so it's no big deal. But is there
> any reason for this 64K limit anymore?

There may not be an important reason to include it by default. Whatever config
option we use could have special values:

 * -1: No maximum commit limit (default)
 *  0: Never write more than one level.

I would personally set a limit somewhere around 64,000 to prevent long chains
and to keep processes from hitting a concurrency issue where their run-time
commit-graph is effectively 100,000 commits behind.

> While with the default expiry of 0sec we can still get that race, it
> seems unlikely in practice, as the "commit-graph write" process would
> write a new manifest at the end, then go and unlink() the old files.

The process reading an old manifest can still only succeed partially as it
reads the chain from bottom-to-top. Succeeding gracefully in this case is
built-in, but should have a test! (noted)

But I don't think the concurrency window is any more lenient than before.
We now have the flexibility to set a non-zero expiry window. That was
impossible in the other model.
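
To make the window concrete, here is a minimal sketch of the age check that
expire_commit_graphs() performs in PATCH v2 11/11. The helper name
graph_file_expired() and its parameters are hypothetical and only for
illustration. The real function also skips any file that is still listed in
the new chain, which this sketch leaves out.

```c
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper, not part of the patch: returns 1 when a file in
 * $OBJDIR/info/commit-graphs is a *.graph file whose modified time is
 * at or before (now - expire_window), and 0 otherwise.
 */
static int graph_file_expired(const char *name, time_t mtime,
			      time_t now, time_t expire_window)
{
	size_t len = strlen(name);

	/* Only graph-{hash}.graph files are candidates for removal. */
	if (len < 6 || strcmp(name + len - 6, ".graph"))
		return 0;

	/* Mirrors "if (st.st_mtime > expire_time) continue;" in the patch. */
	return mtime <= now - expire_window;
}
```

With a window of zero, a file whose modified time was just bumped by
mark_commit_graphs() is already eligible, which is the immediate cleanup
described in the commit message. A non-zero window leaves the file on disk
until it has gone that many seconds without being touched.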

> So maybe at this point we could make this even dumber with something
> that behaves like gc.autoPackLimit? I.e. keep writing new graphs, and
> then coalesce them all (or maybe not the "base" graph, like
> gc.bigPackThreshold)?

I will look into these existing settings and try to include similar options
here. Better to stay consistent when we can.

> Also: These docs refer to MAX_SPLIT_COMMITS, but in v2 it's now a
> "split_strategy_max_commits" variable instead.

Thanks for catching my out-of-date documentation.

-Stolee



* Re: [PATCH v2 08/11] commit-graph: add --split option to builtin
  2019-05-22 19:53   ` [PATCH v2 08/11] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
@ 2019-05-27 11:28     ` SZEDER Gábor
  0 siblings, 0 replies; 136+ messages in thread
From: SZEDER Gábor @ 2019-05-27 11:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, Junio C Hamano,
	Derrick Stolee

On Wed, May 22, 2019 at 12:53:27PM -0700, Derrick Stolee via GitGitGadget wrote:
> diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
> new file mode 100755
> index 0000000000..96704b9f5b
> --- /dev/null
> +++ b/t/t5323-split-commit-graph.sh
> @@ -0,0 +1,122 @@
> +#!/bin/sh
> +
> +test_description='split commit graph'
> +. ./test-lib.sh
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +
> +test_expect_success 'setup repo' '
> +	git init &&
> +	git config core.commitGraph true &&
> +	infodir=".git/objects/info" &&
> +	graphdir="$infodir/commit-graphs" &&
> +	test_oid_init
> +'
> +
> +graph_read_expect() {
> +	NUM_BASE=0
> +	if test ! -z $2
> +	then
> +		NUM_BASE=$2
> +	fi
> +	cat >expect <<- EOF
> +	header: 43475048 1 1 3 $NUM_BASE
> +	num_commits: $1
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	git commit-graph read >output &&
> +	test_cmp expect output
> +}
> +
> +test_expect_success 'create commits and write commit-graph' '
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i
> +	done &&

Please add a "|| return 1" at the end of the for loop's body, i.e.

  for ....
  do
        this &&
        that || return 1
  done

because for loops continue iterating even when the commands in their
body fail, potentially hiding errors.

This applies to the other three for loops below as well.

> +	git commit-graph write --reachable &&
> +	test_path_is_file $infodir/commit-graph &&
> +	graph_read_expect 3
> +'
> +
> +graph_git_two_modes() {
> +	git -c core.commitGraph=true $1 >output
> +	git -c core.commitGraph=false $1 >expect
> +	test_cmp expect output
> +}
> +
> +graph_git_behavior() {
> +	MSG=$1
> +	BRANCH=$2
> +	COMPARE=$3
> +	test_expect_success "check normal git operations: $MSG" '
> +		graph_git_two_modes "log --oneline $BRANCH" &&
> +		graph_git_two_modes "log --topo-order $BRANCH" &&
> +		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
> +		graph_git_two_modes "branch -vv" &&
> +		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
> +	'
> +}
> +
> +graph_git_behavior 'graph exists' commits/3 commits/1
> +
> +verify_chain_files_exist() {
> +	for hash in $(cat $1/commit-graph-chain)
> +	do
> +		test_path_is_file $1/graph-$hash.graph
> +	done
> +}
> +
> +test_expect_success 'add more commits, and write a new base graph' '
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i
> +	done &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	git commit-graph write --reachable &&
> +	graph_read_expect 12
> +'
> +
> +test_expect_success 'add three more commits, write a tip graph' '
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'
> +
> +graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
> +
> +test_expect_success 'add one commit, write a tip graph' '
> +	test_commit 11 &&
> +	git branch commits/11 &&
> +	git commit-graph write --reachable --split &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 3 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'
> +
> +graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
> +
> +test_done
> -- 
> gitgitgadget
> 


* [PATCH v3 00/14] Commit-graph: Write incremental files
  2019-05-22 19:53 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2019-05-22 19:53   ` [PATCH v2 11/11] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03   ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
                       ` (14 more replies)
  11 siblings, 15 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano

This version is now ready for review.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder with name graph-{hash}.graph. The
list of hashes is given by the commit-graph-chain file, and also in a "base
graph chunk" in the commit-graph format. As we read a chain, we can verify
that each hash matches the trailing hash of the commit-graph file we read,
and that each hash below a level is the one expected by that graph file.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy either
of two conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.
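
As a rough illustration of how the two conditions cascade down the chain,
here is a standalone sketch modeled on the loop in
split_graph_merge_strategy(). The function name levels_kept_after_write(),
the array representation of the chain, and the two macros are assumptions
made for this example. The values 2 and 64,000 are the ones discussed in
this thread, and in this version the commit limit is opt-in rather than
always on.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants: the size multiple and the commit limit. */
#define SIZE_MULT 2
#define MAX_COMMITS 64000

/*
 * levels[0] is the base of the chain and levels[n - 1] is the tip; each
 * entry is the number of commits in that file. Returns how many existing
 * levels remain below the file that will hold new_commits plus every
 * level merged into it.
 */
static size_t levels_kept_after_write(const uint32_t *levels, size_t n,
				      uint32_t new_commits)
{
	uint64_t num_commits = new_commits;

	/* Merge while the tip is small relative to us, or we are too big. */
	while (n > 0 &&
	       (levels[n - 1] <= SIZE_MULT * num_commits ||
		num_commits > MAX_COMMITS)) {
		num_commits += levels[n - 1];
		n--;
	}

	return n;
}
```

For a chain of 100,000, 1,000 and 10 commits, writing 3 new commits adds a
fourth level, writing 5 merges only with the 10-commit tip, and writing
70,000 collapses the whole chain into a single file.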

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

write --stdin-commits:         8m 12s
write --stdin-commits --split:    28s
write --split && verify --shallow: 60s

Updates in V3:

 * git commit-graph verify now works on commit-graph chains. We do a simple
   test to check the behavior of a new --shallow option.

 * When someone writes a flat commit-graph, we now expire the old chain
   according to the expire time.

 * The "max commits" limit is no longer enabled by default, but instead is
   enabled by a --max-commits=<n> option. Ignored if n=0.

This is based on ds/commit-graph-write-refactor.

Thanks, -Stolee

[1] 
https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (14):
  commit-graph: document commit-graph chains
  commit-graph: prepare for commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: load commit-graph chains
  commit-graph: add base graphs chunk
  commit-graph: rearrange chunk count logic
  commit-graph: write commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: merge commit-graph chains
  commit-graph: allow cross-alternate chains
  commit-graph: expire commit-graph files
  commit-graph: create options for split files
  commit-graph: verify chains with --shallow mode
  commit-graph: clean up chains after flattened write

 Documentation/git-commit-graph.txt            |  26 +-
 .../technical/commit-graph-format.txt         |  11 +-
 Documentation/technical/commit-graph.txt      | 195 +++++
 builtin/commit-graph.c                        |  53 +-
 builtin/commit.c                              |   2 +-
 builtin/gc.c                                  |   3 +-
 commit-graph.c                                | 780 +++++++++++++++++-
 commit-graph.h                                |  25 +-
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5323-split-commit-graph.sh                 | 240 ++++++
 10 files changed, 1270 insertions(+), 67 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh


base-commit: 8520d7fc7c6edd4d71582c69a873436029b6cb1b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Range-diff vs v2:

  1:  a423afbfdd =  1:  b184919255 commit-graph: document commit-graph chains
  2:  249668fc92 =  2:  d0dc154a27 commit-graph: prepare for commit-graph chains
  3:  809fa7ad80 =  3:  f35b04224a commit-graph: rename commit_compare to oid_compare
  4:  a8c0b47c8a !  4:  ca670536df commit-graph: load commit-graph chains
     @@ -47,13 +47,13 @@
       	return load_commit_graph_one_fd_st(fd, &st);
       }
       
     -+static int prepare_commit_graph_v1(struct repository *r, const char *obj_dir)
     ++static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
      +{
      +	char *graph_name = get_commit_graph_filename(obj_dir);
     -+	r->objects->commit_graph = load_commit_graph_one(graph_name);
     ++	struct commit_graph *g = load_commit_graph_one(graph_name);
      +	free(graph_name);
      +
     -+	return r->objects->commit_graph ? 0 : -1;
     ++	return g;
      +}
      +
      +static int add_graph_to_chain(struct commit_graph *g,
     @@ -76,8 +76,9 @@
      +	return 1;
      +}
      +
     -+static void prepare_commit_graph_chain(struct repository *r, const char *obj_dir)
     ++static struct commit_graph *load_commit_graph_chain(struct repository *r, const char *obj_dir)
      +{
     ++	struct commit_graph *graph_chain = NULL;
      +	struct strbuf line = STRBUF_INIT;
      +	struct stat st;
      +	struct object_id *oids;
     @@ -85,17 +86,21 @@
      +	char *chain_name = get_chain_filename(obj_dir);
      +	FILE *fp;
      +
     -+	if (stat(chain_name, &st))
     -+		return;
     ++	if (stat(chain_name, &st)) {
     ++		free(chain_name);
     ++		return NULL;
     ++	}
      +
     -+	if (st.st_size <= the_hash_algo->hexsz)
     -+		return;
     ++	if (st.st_size <= the_hash_algo->hexsz) {
     ++		free(chain_name);
     ++		return NULL;
     ++	}
      +
      +	fp = fopen(chain_name, "r");
      +	free(chain_name);
      +
      +	if (!fp)
     -+		return;
     ++		return NULL;
      +
      +	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
      +
     @@ -114,14 +119,26 @@
      +		g = load_commit_graph_one(graph_name);
      +		free(graph_name);
      +
     -+		if (g && add_graph_to_chain(g, r->objects->commit_graph, oids, i))
     -+			r->objects->commit_graph = g;
     ++		if (g && add_graph_to_chain(g, graph_chain, oids, i))
     ++			graph_chain = g;
      +		else
      +			valid = 0;
      +	}
      +
      +	free(oids);
      +	fclose(fp);
     ++
     ++	return graph_chain;
     ++}
     ++
     ++static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
     ++{
     ++	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
     ++
     ++	if (!g)
     ++		g = load_commit_graph_chain(r, obj_dir);
     ++
     ++	return g;
      +}
      +
       static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
     @@ -134,11 +151,9 @@
      -	graph_name = get_commit_graph_filename(obj_dir);
      -	r->objects->commit_graph =
      -		load_commit_graph_one(graph_name);
     -+	if (!prepare_commit_graph_v1(r, obj_dir))
     -+		return;
     - 
     +-
      -	FREE_AND_NULL(graph_name);
     -+	prepare_commit_graph_chain(r, obj_dir);
     ++	r->objects->commit_graph = read_commit_graph_one(r, obj_dir);
       }
       
       /*
  5:  4fefd0a654 =  5:  df44cbc1bf commit-graph: add base graphs chunk
  6:  a595a1eb65 =  6:  e65f9e841d commit-graph: rearrange chunk count logic
  7:  9cbfb656b3 !  7:  fe0aa343cd commit-graph: write commit-graph chains
     @@ -51,7 +51,7 @@
      +	return g;
       }
       
     - static int prepare_commit_graph_v1(struct repository *r, const char *obj_dir)
     + static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
      @@
       	struct progress *progress;
       	int progress_done;
  8:  5ad14f574b !  8:  4f4ccc8062 commit-graph: add --split option to builtin
     @@ -104,7 +104,7 @@
      +	for i in $(test_seq 3)
      +	do
      +		test_commit $i &&
     -+		git branch commits/$i
     ++		git branch commits/$i || return 1
      +	done &&
      +	git commit-graph write --reachable &&
      +	test_path_is_file $infodir/commit-graph &&
     @@ -135,7 +135,7 @@
      +verify_chain_files_exist() {
      +	for hash in $(cat $1/commit-graph-chain)
      +	do
     -+		test_path_is_file $1/graph-$hash.graph
     ++		test_path_is_file $1/graph-$hash.graph || return 1
      +	done
      +}
      +
     @@ -144,13 +144,13 @@
      +	for i in $(test_seq 4 5)
      +	do
      +		test_commit $i &&
     -+		git branch commits/$i
     ++		git branch commits/$i || return 1
      +	done &&
      +	git reset --hard commits/2 &&
      +	for i in $(test_seq 6 10)
      +	do
      +		test_commit $i &&
     -+		git branch commits/$i
     ++		git branch commits/$i || return 1
      +	done &&
      +	git reset --hard commits/2 &&
      +	git merge commits/4 &&
  9:  9567daa0b8 !  9:  87fb895fe4 commit-graph: merge commit-graph chains
     @@ -105,8 +105,7 @@
      +  1. The expected file size for level N + 1 is at least half the file size for
      +     level N.
      +
     -+  2. Level N + 1 contains more than MAX_SPLIT_COMMITS commits (64,0000
     -+     commits).
     ++  2. Level N + 1 contains more than 64,0000 commits.
      +
      +This decision cascades down the levels: when we merge a level we create a new
      +set of commits that then compares to the next level.
     @@ -290,13 +289,7 @@
      +		current_graph_number--;
      +
      +		if (ctx->report_progress) {
     -+			if (current_graph_number)
     -+				strbuf_addf(&progress_title,
     -+					    _("Merging commit-graph-%d"),
     -+					    current_graph_number);
     -+			else
     -+				strbuf_addstr(&progress_title,
     -+					      _("Merging commit-graph"));
     ++			strbuf_addstr(&progress_title, _("Merging commit-graph"));
      +			ctx->progress = start_delayed_progress(progress_title.buf, 0);
      +		}
      +
 10:  4cfe19a933 ! 10:  5cfd653d24 commit-graph: allow cross-alternate chains
     @@ -81,13 +81,13 @@
       --- a/commit-graph.c
       +++ b/commit-graph.c
      @@
     - 	r->objects->commit_graph = load_commit_graph_one(graph_name);
     + 	struct commit_graph *g = load_commit_graph_one(graph_name);
       	free(graph_name);
       
     -+	if (r->objects->commit_graph)
     -+		r->objects->commit_graph->obj_dir = obj_dir;
     ++	if (g)
     ++		g->obj_dir = obj_dir;
      +
     - 	return r->objects->commit_graph ? 0 : -1;
     + 	return g;
       }
       
      @@
     @@ -111,8 +111,8 @@
      +			char *graph_name = get_split_graph_filename(odb->path, line.buf);
      +			struct commit_graph *g = load_commit_graph_one(graph_name);
       
     --		if (g && add_graph_to_chain(g, r->objects->commit_graph, oids, i))
     --			r->objects->commit_graph = g;
     +-		if (g && add_graph_to_chain(g, graph_chain, oids, i))
     +-			graph_chain = g;
      -		else
      -			valid = 0;
      +			free(graph_name);
     @@ -120,8 +120,8 @@
      +			if (g) {
      +				g->obj_dir = odb->path;
      +
     -+				if (add_graph_to_chain(g, r->objects->commit_graph, oids, i))
     -+					r->objects->commit_graph = g;
     ++				if (add_graph_to_chain(g, graph_chain, oids, i))
     ++					graph_chain = g;
      +				else
      +					valid = 0;
      +
 11:  72fc0a1f17 ! 11:  18d612be9e commit-graph: expire commit-graph files
     @@ -45,15 +45,6 @@
       diff --git a/commit-graph.c b/commit-graph.c
       --- a/commit-graph.c
       +++ b/commit-graph.c
     -@@
     - 	unsigned append:1,
     - 		 report_progress:1,
     - 		 split:1;
     -+
     -+	time_t expire_window;
     - };
     - 
     - static void write_graph_chunk_fanout(struct hashfile *f,
      @@
       	deduplicate_commits(ctx);
       }
     @@ -81,7 +72,7 @@
      +	DIR *dir;
      +	struct dirent *de;
      +	size_t dirnamelen;
     -+	time_t expire_time = time(NULL) - ctx->expire_window;
     ++	time_t expire_time = time(NULL);
      +
      +	strbuf_addstr(&path, ctx->obj_dir);
      +	strbuf_addstr(&path, "/info/commit-graphs");
  -:  ---------- > 12:  4de4bfba64 commit-graph: create options for split files
  -:  ---------- > 13:  fe91ff5fca commit-graph: verify chains with --shallow mode
  -:  ---------- > 14:  ca41bf08d0 commit-graph: clean up chains after flattened write

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-05 17:22       ` Junio C Hamano
  2019-06-06 12:10       ` Philip Oakley
  2019-06-03 16:03     ` [PATCH v3 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
                       ` (13 subsequent siblings)
  14 siblings, 2 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a basic description of commit-graph chains. More details about the
feature will be added as we add functionality. This introduction gives a
high-level overview of the goals of the feature and the basic layout of
commit-graph chains.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..1dca3bd8fe 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,65 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Commit Graphs Chains
+--------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over time,
+the number of commits added by a fetch operation is much smaller than the
+number of commits in the full history. By creating a "chain" of commit-graphs,
+we enable fast writes of new commit data without rewriting the entire commit
+history -- at least, most of the time.
+
+## File Layout
+
+A commit-graph chain uses multiple files, and we use a fixed naming convention
+to organize these files. Each commit-graph file has a name
+`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
+valued hash stored in the footer of that file (which is a hash of the file's
+contents before that hash). For a chain of commit-graph files, a plain-text
+file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
+hashes for the files in order from "lowest" to "highest".
+
+For example, if the `commit-graph-chain` file contains the lines
+
+```
+	{hash0}
+	{hash1}
+	{hash2}
+```
+
+then the commit-graph chain looks like the following diagram:
+
+ +-----------------------+
+ |  graph-{hash2}.graph  |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  graph-{hash1}.graph  |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}.graph  |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
+commits in `graph-{hash1}.graph`, and X2 be the number of commits in
+`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
+then we interpret this as being the commit in position (X0 + X1 + i), and that
+will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
+positions to refer to their parents, which may be in `graph-{hash1}.graph` or
+`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
+its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
+X2).
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH v3 02/14] commit-graph: prepare for commit-graph chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` " Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 03/14] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
                       ` (12 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To prepare for a chain of commit-graph files, augment the
commit_graph struct to point to a base commit_graph. As we load
commits from the graph, we may actually want to read from a base
file according to the graph position.

The "graph position" of a commit is given by concatenating the
lexicographic commit orders from each of the commit-graph files in
the chain. This means that we must distinguish two values:

 * lexicographic index : the position within the lexicographic
   order in a single commit-graph file.

 * graph position: the position within the concatenated order
   of multiple commit-graph files.

Given the lexicographic index of a commit in a graph, we can
compute the graph position by adding the number of commits in
the lower-level graphs. To find the lexicographic index of
a commit, we subtract the number of commits in lower-level graphs.
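
The conversion described above can be sketched in isolation. This is a
minimal illustration, not the code from the patch: `struct toy_graph`,
`toy_graph_position()` and `toy_find_layer()` are hypothetical names that
only mirror the `num_commits`, `num_commits_in_base` and `base_graph`
members added to `struct commit_graph`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct commit_graph. */
struct toy_graph {
	uint32_t num_commits;         /* commits stored in this file */
	uint32_t num_commits_in_base; /* commits in all lower-level files */
	struct toy_graph *base_graph; /* next lower level, or NULL */
};

/* Graph position of the commit at lexicographic index lex_index in g. */
uint32_t toy_graph_position(const struct toy_graph *g, uint32_t lex_index)
{
	return lex_index + g->num_commits_in_base;
}

/*
 * Walk down from the tip to the layer that holds graph position pos.
 * On success, return that layer and store the lexicographic index in
 * *lex_index. Return NULL if pos is out of range for the chain.
 */
const struct toy_graph *toy_find_layer(const struct toy_graph *tip,
				       uint32_t pos, uint32_t *lex_index)
{
	const struct toy_graph *g = tip;

	if (!g || pos >= g->num_commits + g->num_commits_in_base)
		return NULL;

	while (g && pos < g->num_commits_in_base)
		g = g->base_graph;

	if (!g)
		return NULL;

	*lex_index = pos - g->num_commits_in_base;
	return g;
}
```

With layers of 100, 20 and 5 commits, graph position 110 resolves to
lexicographic index 10 in the middle layer.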

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------
 commit-graph.h |  3 ++
 2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7723156964..3afedcd7f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -371,6 +371,25 @@ static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t
 			    g->chunk_oid_lookup, g->hash_len, pos);
 }
 
+static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
+{
+	uint32_t lex_index;
+
+	if (!g)
+		BUG("NULL commit-graph");
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	lex_index = pos - g->num_commits_in_base;
+
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
+}
+
 static struct commit_list **insert_parent_or_die(struct repository *r,
 						 struct commit_graph *g,
 						 uint64_t pos,
@@ -379,10 +398,10 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	struct commit *c;
 	struct object_id oid;
 
-	if (pos >= g->num_commits)
+	if (pos >= g->num_commits + g->num_commits_in_base)
 		die("invalid parent position %"PRIu64, pos);
 
-	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	load_oid_from_graph(g, pos, &oid);
 	c = lookup_commit(r, &oid);
 	if (!c)
 		die(_("could not find commit %s"), oid_to_hex(&oid));
@@ -392,7 +411,14 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	lex_index = pos - g->num_commits_in_base;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 	item->graph_pos = pos;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -405,10 +431,26 @@ static int fill_commit_in_graph(struct repository *r,
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
 
-	item->object.parsed = 1;
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	/*
+	 * Store the "full" position, but then use the
+	 * "local" position for the rest of the calculation.
+	 */
 	item->graph_pos = pos;
+	lex_index = pos - g->num_commits_in_base;
+
+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+
+	item->object.parsed = 1;
 
 	item->maybe_tree = NULL;
 
@@ -452,7 +494,18 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(g, &(item->object.oid), pos);
+		struct commit_graph *cur_g = g;
+		uint32_t lex_index;
+
+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
+			cur_g = cur_g->base_graph;
+
+		if (cur_g) {
+			*pos = lex_index + cur_g->num_commits_in_base;
+			return 1;
+		}
+
+		return 0;
 	}
 }
 
@@ -492,8 +545,13 @@ static struct tree *load_tree_for_commit(struct repository *r,
 					 struct commit *c)
 {
 	struct object_id oid;
-	const unsigned char *commit_data = g->chunk_commit_data +
-					   GRAPH_DATA_WIDTH * (c->graph_pos);
+	const unsigned char *commit_data;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	commit_data = g->chunk_commit_data +
+			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
 
 	hashcpy(oid.hash, commit_data);
 	c->maybe_tree = lookup_tree(r, &oid);
diff --git a/commit-graph.h b/commit-graph.h
index 70f4caf0c7..f9fe32ebe3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,9 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 
+	uint32_t num_commits_in_base;
+	struct commit_graph *base_graph;
+
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
-- 
gitgitgadget



* [PATCH v3 03/14] commit-graph: rename commit_compare to oid_compare
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 04/14] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
                       ` (11 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The helper function commit_compare() actually compares object_id
structs, not commits. A future change to commit-graph.c will need
to sort commit structs, so rename this function in advance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3afedcd7f5..e2f438f6a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -761,7 +761,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
-static int commit_compare(const void *_a, const void *_b)
+static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
 	const struct object_id *b = (const struct object_id *)_b;
@@ -1030,7 +1030,7 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 			_("Counting distinct commits in commit graph"),
 			ctx->oids.nr);
 	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-- 
gitgitgadget



* [PATCH v3 04/14] commit-graph: load commit-graph chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 03/14] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 06/14] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
                       ` (10 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Prepare the logic for reading a chain of commit-graphs.

First, look for a file at $OBJDIR/info/commit-graph. If it exists,
then use that file and stop.

Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain.
If this file exists, then load the hash values as line-separated values in that
file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i]
in that file. The file is given in order, so the first hash corresponds to the
"base" file and the final hash corresponds to the "tip" file.

This implementation assumes that all of the graph-{hash}.graph files are in
the same object directory as the commit-graph-chain file. This will be updated
in a future change. This change is purposefully simple so we can isolate the
different concerns.
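
As a rough illustration of the lookup described above, the sketch below
validates one line of the chain file and builds the matching
graph-{hash}.graph path using only standard C. The helper names and the
fixed 40-character hash length are assumptions for the example (the patch
itself relies on get_oid_hex() and xstrfmt()), and the length check here
is stricter than what the patch does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define TOY_HEXSZ 40 /* assumption: SHA-1, 40 hex characters per line */

/* Return 1 if line is exactly TOY_HEXSZ hex digits, 0 otherwise. */
int toy_is_hash_line(const char *line)
{
	size_t i;

	if (strlen(line) != TOY_HEXSZ)
		return 0;
	for (i = 0; i < TOY_HEXSZ; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	return 1;
}

/*
 * Write "<obj_dir>/info/commit-graphs/graph-<hex>.graph" into out.
 * Return 1 on success, 0 if the result would not fit in outlen bytes.
 */
int toy_split_graph_path(char *out, size_t outlen,
			 const char *obj_dir, const char *hex)
{
	int n = snprintf(out, outlen, "%s/info/commit-graphs/graph-%s.graph",
			 obj_dir, hex);

	return n >= 0 && (size_t)n < outlen;
}
```

A reader would call toy_is_hash_line() on each line of the chain file, in
order, and stop at the first line that fails, which matches the "invalid
chain" handling in the patch.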

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 108 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e2f438f6a3..3ed930159e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,19 @@ char *get_commit_graph_filename(const char *obj_dir)
 	return xstrfmt("%s/info/commit-graph", obj_dir);
 }
 
+static char *get_split_graph_filename(const char *obj_dir,
+				      const char *oid_hex)
+{
+	return xstrfmt("%s/info/commit-graphs/graph-%s.graph",
+		       obj_dir,
+		       oid_hex);
+}
+
+static char *get_chain_filename(const char *obj_dir)
+{
+	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", obj_dir);
+}
+
 static uint8_t oid_version(void)
 {
 	return 1;
@@ -286,18 +299,107 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 	return load_commit_graph_one_fd_st(fd, &st);
 }
 
+static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
+{
+	char *graph_name = get_commit_graph_filename(obj_dir);
+	struct commit_graph *g = load_commit_graph_one(graph_name);
+	free(graph_name);
+
+	return g;
+}
+
+static int add_graph_to_chain(struct commit_graph *g,
+			      struct commit_graph *chain,
+			      struct object_id *oids,
+			      int n)
+{
+	struct commit_graph *cur_g = chain;
+
+	while (n) {
+		n--;
+		cur_g = cur_g->base_graph;
+	}
+
+	g->base_graph = chain;
+
+	if (chain)
+		g->num_commits_in_base = chain->num_commits + chain->num_commits_in_base;
+
+	return 1;
+}
+
+static struct commit_graph *load_commit_graph_chain(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *graph_chain = NULL;
+	struct strbuf line = STRBUF_INIT;
+	struct stat st;
+	struct object_id *oids;
+	int i = 0, valid = 1;
+	char *chain_name = get_chain_filename(obj_dir);
+	FILE *fp;
+
+	if (stat(chain_name, &st)) {
+		free(chain_name);
+		return NULL;
+	}
+
+	if (st.st_size <= the_hash_algo->hexsz) {
+		free(chain_name);
+		return NULL;
+	}
+
+	fp = fopen(chain_name, "r");
+	free(chain_name);
+
+	if (!fp)
+		return NULL;
+
+	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
+
+	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
+		char *graph_name;
+		struct commit_graph *g;
+
+		if (get_oid_hex(line.buf, &oids[i])) {
+			warning(_("invalid commit-graph chain: line '%s' not a hash"),
+				line.buf);
+			valid = 0;
+			break;
+		}
+
+		graph_name = get_split_graph_filename(obj_dir, line.buf);
+		g = load_commit_graph_one(graph_name);
+		free(graph_name);
+
+		if (g && add_graph_to_chain(g, graph_chain, oids, i))
+			graph_chain = g;
+		else
+			valid = 0;
+	}
+
+	free(oids);
+	fclose(fp);
+
+	return graph_chain;
+}
+
+static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
+
+	if (!g)
+		g = load_commit_graph_chain(r, obj_dir);
+
+	return g;
+}
+
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 {
-	char *graph_name;
 
 	if (r->objects->commit_graph)
 		return;
 
-	graph_name = get_commit_graph_filename(obj_dir);
-	r->objects->commit_graph =
-		load_commit_graph_one(graph_name);
-
-	FREE_AND_NULL(graph_name);
+	r->objects->commit_graph = read_commit_graph_one(r, obj_dir);
 }
 
 /*
-- 
gitgitgadget



* [PATCH v3 05/14] commit-graph: add base graphs chunk
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 06/14] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 07/14] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To quickly verify a commit-graph chain is valid on load, we will
read from the new "Base Graphs Chunk" of each file in the chain.
This will prevent accidentally loading incorrect data from manually
editing the commit-graph-chain file or renaming graph-{hash}.graph
files.

The commit_graph struct already had an object_id struct "oid", but
it was never initialized or used. Add a line to read the hash from
the end of the commit-graph file and into the oid member.
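
The core of the check can be sketched as a byte-wise comparison between
the hashes listed in the commit-graph-chain file and the hashes stored in
a file's base graphs chunk. This is a simplified illustration with a
hypothetical helper name; the patch additionally compares each hash with
the oid of the graph already loaded at that level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * chain_hashes: raw hashes read from the chain file, lowest level first.
 * base_chunk:   contents of the base graphs chunk of the file being added.
 * Both hold at least num_base entries of hash_len bytes each.
 * Return 1 if the first num_base entries agree, 0 otherwise.
 */
int toy_base_chunk_matches(const unsigned char *chain_hashes,
			   const unsigned char *base_chunk,
			   size_t num_base, size_t hash_len)
{
	size_t i;

	for (i = 0; i < num_base; i++) {
		const unsigned char *want = chain_hashes + i * hash_len;
		const unsigned char *have = base_chunk + i * hash_len;

		if (memcmp(want, have, hash_len))
			return 0;
	}
	return 1;
}
```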

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .../technical/commit-graph-format.txt         | 11 ++++++++--
 commit-graph.c                                | 22 +++++++++++++++++++
 commit-graph.h                                |  1 +
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 16452a0504..a4f17441ae 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -44,8 +44,9 @@ HEADER:
 
   1-byte number (C) of "chunks"
 
-  1-byte (reserved for later use)
-     Current clients should ignore this value.
+  1-byte number (B) of base commit-graphs
+      We infer the length (H*B) of the Base Graphs chunk
+      from this value.
 
 CHUNK LOOKUP:
 
@@ -92,6 +93,12 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
+      This list of H-byte hashes describes a set of B commit-graph files that
+      form a commit-graph chain. The graph position for the ith commit in this
+      file's OID Lookup chunk is equal to i plus the number of commits in all
+      base graphs.  If B is non-zero, this chunk must exist.
+
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
diff --git a/commit-graph.c b/commit-graph.c
index 3ed930159e..909c841db5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -22,6 +22,7 @@
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -262,6 +263,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			else
 				graph->chunk_extra_edges = data + chunk_offset;
 			break;
+
+		case GRAPH_CHUNKID_BASE:
+			if (graph->chunk_base_graphs)
+				chunk_repeated = 1;
+			else
+				graph->chunk_base_graphs = data + chunk_offset;
 		}
 
 		if (chunk_repeated) {
@@ -280,6 +287,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		last_chunk_offset = chunk_offset;
 	}
 
+	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
+
 	if (verify_commit_graph_lite(graph))
 		return NULL;
 
@@ -315,8 +324,21 @@ static int add_graph_to_chain(struct commit_graph *g,
 {
 	struct commit_graph *cur_g = chain;
 
+	if (n && !g->chunk_base_graphs) {
+		warning(_("commit-graph has no base graphs chunk"));
+		return 0;
+	}
+
 	while (n) {
 		n--;
+
+		if (!oideq(&oids[n], &cur_g->oid) ||
+		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {
+			warning(_("commit-graph chain does not match"));
+			return 0;
+		}
+
+
 		cur_g = cur_g->base_graph;
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index f9fe32ebe3..80f4917ddb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -55,6 +55,7 @@ struct commit_graph {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
+	const unsigned char *chunk_base_graphs;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget



* [PATCH v3 06/14] commit-graph: rearrange chunk count logic
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 04/14] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 05/14] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The number of chunks in a commit-graph file can change depending on
whether we need the Extra Edges Chunk. We are going to add more optional
chunks, and it will be helpful to rearrange this logic around the chunk
count before doing so.

Specifically, we need to finalize the number of chunks before writing
the commit-graph header. Further, we need to fill out the chunk lookup
table dynamically; using "num_chunks" as the running index while adding
optional chunks makes it easy to add more optional chunks in the future.
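
One way to picture the ordering constraint is a two-step layout: first
settle the list of chunks, then derive every offset from the final count.
The sketch below is an illustration only; `struct toy_chunk` and
`toy_layout_chunks()` are hypothetical, and the constants assume an
8-byte header and 12-byte chunk lookup entries as in the offset
arithmetic of the diff.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HEADER_SIZE 8        /* assumption: signature + 4 header bytes */
#define TOY_CHUNKLOOKUP_WIDTH 12 /* assumption: 4-byte id + 8-byte offset */

/* One entry per chunk that will be written, in file order. */
struct toy_chunk {
	uint32_t id;
	uint64_t size;
};

/*
 * Fill offsets[0..num_chunks]. offsets[i] is where chunk i starts and
 * offsets[num_chunks] is where the chunk data ends. The first offset
 * depends on the final chunk count, because the lookup table holds
 * num_chunks entries plus a terminating entry.
 */
void toy_layout_chunks(const struct toy_chunk *chunks, int num_chunks,
		       uint64_t *offsets)
{
	int i;

	offsets[0] = TOY_HEADER_SIZE +
		     (uint64_t)(num_chunks + 1) * TOY_CHUNKLOOKUP_WIDTH;
	for (i = 0; i < num_chunks; i++)
		offsets[i + 1] = offsets[i] + chunks[i].size;
}
```

Adding another optional chunk then only means appending one more entry to
the array before calling the layout function.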

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 909c841db5..80df6d6d9d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1206,7 +1206,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[5];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = ctx->num_extra_edges ? 4 : 3;
+	int num_chunks = 3;
 
 	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
@@ -1219,27 +1219,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
-	hashwrite_be32(f, GRAPH_SIGNATURE);
-
-	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0); /* unused padding byte */
-
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (ctx->num_extra_edges)
-		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
-	else
-		chunk_ids[3] = 0;
-	chunk_ids[4] = 0;
+	if (ctx->num_extra_edges) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		num_chunks++;
+	}
+
+	chunk_ids[num_chunks] = 0;
 
 	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
+
+	num_chunks = 3;
+	if (ctx->num_extra_edges) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						4 * ctx->num_extra_edges;
+		num_chunks++;
+	}
+
+	hashwrite_be32(f, GRAPH_SIGNATURE);
+
+	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, 0);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
-- 
gitgitgadget



* [PATCH v3 07/14] commit-graph: write commit-graph chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 05/14] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 08/14] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Extend write_commit_graph() to write a commit-graph chain when given the
COMMIT_GRAPH_SPLIT flag.

This implementation is purposefully simplistic in how it creates a new
chain. The commits not already in the chain are added to a new tip
commit-graph file.

Much of the logic around writing a graph-{hash}.graph file and updating
the commit-graph-chain file is the same as the commit-graph file case.
However, there are several places where we need to do some extra logic
in the split case.

Track the list of graph filenames before and after the planned write.
This will be more important when we start merging graph files, but it
also allows us to upgrade our commit-graph file to the appropriate
graph-{hash}.graph file when we upgrade to a chain of commit-graphs.

Note that we use the eighth byte of the commit-graph header to store the
number of base graph files. This determines the length of the base
graphs chunk.
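
As a small worked example of that relationship, the sketch below reads
the eighth header byte (index 7) and multiplies it by the hash length to
get the size of the base graphs chunk. The helper name is hypothetical,
and the header layout assumed here is the one from the format document:
a 4-byte signature followed by four single-byte fields.

```c
#include <assert.h>
#include <stddef.h>

/*
 * header:   the first 8 bytes of a commit-graph file.
 * hash_len: H, the raw hash length in bytes (20 for SHA-1).
 * Return the length in bytes of the base graphs chunk, H * B, where B is
 * the number of base graphs stored in the eighth header byte.
 */
size_t toy_base_chunk_len(const unsigned char *header, size_t hash_len)
{
	size_t num_base = header[7];

	return hash_len * num_base;
}
```

A file at the bottom of a chain stores B = 0, so the computed length is
zero and the chunk is absent.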

A subtle change of behavior with the new logic is that we do not write a
commit-graph if our commit list is empty. This extends to the typical
case, which is reflected in t5318-commit-graph.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 286 ++++++++++++++++++++++++++++++++++++++--
 commit-graph.h          |   2 +
 t/t5318-commit-graph.sh |   2 +-
 3 files changed, 278 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 80df6d6d9d..7d3e001479 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -300,12 +300,18 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 
 	struct stat st;
 	int fd;
+	struct commit_graph *g;
 	int open_ok = open_commit_graph(graph_file, &fd, &st);
 
 	if (!open_ok)
 		return NULL;
 
-	return load_commit_graph_one_fd_st(fd, &st);
+	g = load_commit_graph_one_fd_st(fd, &st);
+
+	if (g)
+		g->filename = xstrdup(graph_file);
+
+	return g;
 }
 
 static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
@@ -723,8 +729,19 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+
+	char *base_graph_name;
+	int num_commit_graphs_before;
+	int num_commit_graphs_after;
+	char **commit_graph_filenames_before;
+	char **commit_graph_filenames_after;
+	char **commit_graph_hash_after;
+	uint32_t new_num_commits_in_base;
+	struct commit_graph *new_base_graph;
+
 	unsigned append:1,
-		 report_progress:1;
+		 report_progress:1,
+		 split:1;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -794,6 +811,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.nr,
 					      commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -814,6 +841,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.list,
 					      ctx->commits.nr,
 					      commit_to_sha1);
+
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -871,6 +909,16 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 						  ctx->commits.nr,
 						  commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -962,7 +1010,13 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
-		if (commit && !parse_commit_no_graph(commit))
+		if (!commit)
+			continue;
+		if (ctx->split) {
+			if (!parse_commit(commit) &&
+			    commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+				add_missing_parents(ctx, commit);
+		} else if (!parse_commit_no_graph(commit))
 			add_missing_parents(ctx, commit);
 	}
 	stop_progress(&ctx->progress);
@@ -1158,8 +1212,16 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) {
+			if (ctx->split) {
+				struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+				if (!c || c->graph_pos != COMMIT_NOT_FROM_GRAPH)
+					continue;
+			}
+
 			count_distinct++;
+		}
 	}
 	stop_progress(&ctx->progress);
 
@@ -1182,7 +1244,13 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
+		ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
 		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+		if (ctx->split &&
+		    ctx->commits.list[ctx->commits.nr]->graph_pos != COMMIT_NOT_FROM_GRAPH)
+			continue;
+
 		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
 		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
@@ -1197,18 +1265,86 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static int write_graph_chunk_base_1(struct hashfile *f,
+				    struct commit_graph *g)
+{
+	int num = 0;
+
+	if (!g)
+		return 0;
+
+	num = write_graph_chunk_base_1(f, g->base_graph);
+	hashwrite(f, g->oid.hash, the_hash_algo->rawsz);
+	return num + 1;
+}
+
+static int write_graph_chunk_base(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
+{
+	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
+
+	if (num != ctx->num_commit_graphs_after - 1) {
+		error(_("failed to write correct number of base graph ids"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t i;
+
+	ctx->new_base_graph = g;
+	ctx->base_graph_name = xstrdup(g->filename);
+	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	if (ctx->num_commit_graphs_before)
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
+			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
+
+	i = ctx->num_commit_graphs_before - 1;
+
+	while (g) {
+		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+		i--;
+		g = g->base_graph;
+	}
+}
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[5];
-	uint64_t chunk_offsets[5];
+	uint32_t chunk_ids[6];
+	uint64_t chunk_offsets[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	struct object_id file_hash;
+
+	if (ctx->split) {
+		struct strbuf tmp_file = STRBUF_INIT;
+
+		strbuf_addf(&tmp_file,
+			    "%s/info/commit-graphs/tmp_graph_XXXXXX",
+			    ctx->obj_dir);
+		ctx->graph_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	}
 
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
 		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
@@ -1216,8 +1352,23 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		return errno;
 	}
 
-	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
-	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	if (ctx->split) {
+		char *lock_name = get_chain_filename(ctx->obj_dir);
+
+		hold_lock_file_for_update(&lk, lock_name, LOCK_DIE_ON_ERROR);
+
+		fd = git_mkstemp_mode(ctx->graph_name, 0444);
+		if (fd < 0) {
+			error(_("unable to create '%s'"), ctx->graph_name);
+			return -1;
+		}
+
+		f = hashfd(fd, ctx->graph_name);
+	} else {
+		hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
+		fd = lk.tempfile->fd;
+		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
@@ -1226,6 +1377,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		num_chunks++;
+	}
 
 	chunk_ids[num_chunks] = 0;
 
@@ -1240,13 +1395,18 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						hashsz * (ctx->num_commit_graphs_after - 1);
+		num_chunks++;
+	}
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0);
+	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
@@ -1272,11 +1432,67 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->num_commit_graphs_after > 1 &&
+	    write_graph_chunk_base(f, ctx)) {
+		return -1;
+	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
+	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
+		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
+		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+
+		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2] = new_base_name;
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
+	}
+
 	close_commit_graph(ctx->r);
-	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
+	if (ctx->split) {
+		FILE *chainf = fdopen_lock_file(&lk, "w");
+		char *final_graph_name;
+		int result;
+
+		close(fd);
+
+		if (!chainf) {
+			error(_("unable to open commit-graph chain file"));
+			return -1;
+		}
+
+		if (ctx->base_graph_name) {
+			result = rename(ctx->base_graph_name,
+					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+
+			if (result) {
+				error(_("failed to rename base commit-graph file"));
+				return -1;
+			}
+		} else {
+			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
+			unlink(graph_name);
+		}
+
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1] = xstrdup(oid_to_hex(&file_hash));
+		final_graph_name = get_split_graph_filename(ctx->obj_dir,
+					ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 1] = final_graph_name;
+
+		result = rename(ctx->graph_name, final_graph_name);
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++)
+			fprintf(lk.tempfile->fp, "%s\n", ctx->commit_graph_hash_after[i]);
+
+		if (result) {
+			error(_("failed to rename temporary commit-graph file"));
+			return -1;
+		}
+	}
+
 	commit_lock_file(&lk);
 
 	return 0;
@@ -1299,6 +1515,30 @@ int write_commit_graph(const char *obj_dir,
 	ctx->obj_dir = obj_dir;
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+
+	if (ctx->split) {
+		struct commit_graph *g;
+		prepare_commit_graph(ctx->r);
+
+		g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			ctx->num_commit_graphs_before++;
+			g = g->base_graph;
+		}
+
+		if (ctx->num_commit_graphs_before) {
+			ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
+			i = ctx->num_commit_graphs_before;
+			g = ctx->r->objects->commit_graph;
+
+			while (g) {
+				ctx->commit_graph_filenames_before[--i] = xstrdup(g->filename);
+				g = g->base_graph;
+			}
+		}
+	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
@@ -1353,6 +1593,14 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split)
+		init_commit_graph_chain(ctx);
+	else
+		ctx->num_commit_graphs_after = 1;
+
 	compute_generation_numbers(ctx);
 
 	res = write_commit_graph_file(ctx);
@@ -1361,6 +1609,21 @@ int write_commit_graph(const char *obj_dir,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+
+	if (ctx->commit_graph_filenames_after) {
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
 	free(ctx);
 
 	return res;
@@ -1548,5 +1811,6 @@ void free_commit_graph(struct commit_graph *g)
 		g->data = NULL;
 		close(g->graph_fd);
 	}
+	free(g->filename);
 	free(g);
 }
diff --git a/commit-graph.h b/commit-graph.h
index 80f4917ddb..5c48c4f66a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -47,6 +47,7 @@ struct commit_graph {
 	unsigned char num_chunks;
 	uint32_t num_commits;
 	struct object_id oid;
+	char *filename;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
@@ -71,6 +72,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
+#define COMMIT_GRAPH_SPLIT      (1 << 2)
 
 int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3b6fd0d728..063f906b3e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -20,7 +20,7 @@ test_expect_success 'verify graph with no graph file' '
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
-	test_path_is_file info/commit-graph
+	test_path_is_missing info/commit-graph
 '
 
 test_expect_success 'close with correct error on bad input' '
-- 
gitgitgadget


* [PATCH v3 08/14] commit-graph: add --split option to builtin
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 07/14] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 09/14] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new "--split" option to the 'git commit-graph write' subcommand. This
option enables writing a commit-graph chain.

The current behavior will add a tip commit-graph containing any commits that
are not in the existing commit-graph or commit-graph chain. Later changes
will allow merging the chain and expiring outdated files.

Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
behavior.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c        |  10 ++-
 t/t5323-split-commit-graph.sh | 122 ++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 3 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 828b1a713f..c2c07d3917 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -35,9 +35,9 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int split;
 } opts;
 
-
 static int graph_verify(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -156,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "split", &opts.split,
+			N_("allow writing an incremental commit-graph file")),
 		OPT_END(),
 	};
 
@@ -169,6 +171,8 @@ static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
+	if (opts.split)
+		flags |= COMMIT_GRAPH_SPLIT;
 
 	read_replace_refs = 0;
 
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
new file mode 100755
index 0000000000..ccd24bd22b
--- /dev/null
+++ b/t/t5323-split-commit-graph.sh
@@ -0,0 +1,122 @@
+#!/bin/sh
+
+test_description='split commit graph'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+graph_read_expect() {
+	NUM_BASE=0
+	if test ! -z $2
+	then
+		NUM_BASE=$2
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 3 $NUM_BASE
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	git commit-graph read >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'create commits and write commit-graph' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect 3
+'
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output
+	git -c core.commitGraph=false $1 >expect
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'graph exists' commits/3 commits/1
+
+verify_chain_files_exist() {
+	for hash in $(cat $1/commit-graph-chain)
+	do
+		test_path_is_file $1/graph-$hash.graph || return 1
+	done
+}
+
+test_expect_success 'add more commits, and write a new base graph' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 12
+'
+
+test_expect_success 'add three more commits, write a tip graph' '
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
+
+test_expect_success 'add one commit, write a tip graph' '
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 3 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
+
+test_done
-- 
gitgitgadget


* [PATCH v3 09/14] commit-graph: merge commit-graph chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 08/14] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 10/14] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
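
To make the O(G log N) cost concrete, here is a minimal sketch of a chain
lookup. It is not the code in commit-graph.c: `struct layer`, `layer_find()`
and `chain_find()` are hypothetical names, and commits are modeled as sorted
32-bit ids instead of object ids. Each layer costs one binary search, and
the walk from tip to base may visit every layer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one layer of the chain. */
struct layer {
	const uint32_t *ids;      /* commit ids in this layer, sorted ascending */
	size_t nr;                /* number of commits in this layer */
	const struct layer *base; /* next lower layer, or NULL for the base */
};

/* Binary search within a single layer: O(log n) comparisons. */
static int layer_find(const struct layer *l, uint32_t id, size_t *pos)
{
	size_t lo = 0, hi = l->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (l->ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (l->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/*
 * Walk the chain from tip to base, running one binary search per layer.
 * *probes receives the number of layers searched, which is at most G.
 */
static int chain_find(const struct layer *tip, uint32_t id, int *probes)
{
	const struct layer *l;
	size_t pos;

	*probes = 0;
	for (l = tip; l; l = l->base) {
		(*probes)++;
		if (layer_find(l, id, &pos))
			return 1;
	}
	return 0;
}
```

A commit that lives in the base layer, or is absent altogether, pays for a
search in every layer above it, which is why the number of layers needs to
stay small.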

To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:

  1. If the number of commits we are adding is more than half the number
     of commits in the graph below, then merge with that graph.

  2. If we are writing more than 64,000 commits into a single graph,
     then merge with all lower graphs.

The numeric values in the conditions above are currently constant, but
can become config options in a future update.
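
As an illustration, the two conditions can be sketched as a loop over the
per-layer commit counts. This mirrors the loop in
split_graph_merge_strategy() from this patch, but it is only a sketch: the
function name `layers_kept_after_write()` and the flat array of counts are
assumptions made to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decide how many existing layers survive a write.
 *
 * layer_commits[0] is the base layer and layer_commits[nr_layers - 1] is
 * the tip. new_commits is the number of commits about to be written.
 * Returns the number of existing layers left untouched; the new file is
 * written on top of them. *merged_commits receives the number of commits
 * that end up in the new file.
 */
static size_t layers_kept_after_write(const uint32_t *layer_commits,
				      size_t nr_layers,
				      uint32_t new_commits,
				      uint32_t max_commits,
				      float size_mult,
				      uint32_t *merged_commits)
{
	size_t kept = nr_layers;
	uint32_t num_commits = new_commits;

	/*
	 * Merge with the layer below while either condition holds:
	 *   1. that layer is at most size_mult times the commits being
	 *      written, or
	 *   2. the commits being written already exceed max_commits.
	 */
	while (kept > 0 &&
	       (layer_commits[kept - 1] <= size_mult * num_commits ||
		num_commits > max_commits)) {
		num_commits += layer_commits[kept - 1];
		kept--;
	}

	*merged_commits = num_commits;
	return kept;
}
```

For example, assuming layers of 12, 3 and 1 commits, a size multiple of 2
and a limit of 64,000, writing one new commit merges the top two layers
(1 <= 2 * 1, then 3 <= 2 * 2) and stops at the base (12 > 2 * 5). That
leaves a two-layer chain whose tip holds 5 commits.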

As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.

After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt |  80 ++++++++++
 commit-graph.c                           | 184 +++++++++++++++++++----
 t/t5323-split-commit-graph.sh            |  13 ++
 3 files changed, 244 insertions(+), 33 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 1dca3bd8fe..d9c6253b0a 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -186,6 +186,86 @@ positions to refer to their parents, which may be in `graph-{hash1}.graph` or
 its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
 X2).
 
+Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
+specifying the hashes of all files in the lower layers. In the above example,
+`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
+`{hash0}` and `{hash1}`.
+
+## Merging commit-graph files
+
+If we only added a new commit-graph file on every write, we would run into a
+linear search problem through many commit-graph files.  Instead, we use a merge
+strategy to decide when the stack should collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits is added, it
+is determined by the merge strategy that the files should collapse to
+`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
+the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
+file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |  graph-{hash2}        |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |  graph-{hash1}        |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                  tmp_graphXXX
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}        |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined and sorted, and the
+contents are written to a temporary file, all while holding a `commit-graph-chain.lock`
+lock-file.  When the file is flushed, we rename it to `graph-{hash3}`
+according to the computed `{hash3}`. Finally, we write the new chain data to
+`commit-graph-chain.lock`:
+
+```
+	{hash3}
+	{hash0}
+```
+
+We then close the lock-file.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the commit-graph stack of
+height N, we default to creating a new file at level N + 1. We then decide to
+merge with the Nth level if one of two conditions holds:
+
+  1. The expected file size for level N + 1 is at least half the file size for
+     level N.
+
+  2. Level N + 1 contains more than 64,000 commits.
+
+This decision cascades down the levels: when we merge a level we create a new
+set of commits that then compares to the next level.
+
+The first condition bounds the number of levels to be logarithmic in the total
+number of commits.  The second condition bounds the total number of commits in
+a `graph-{hashN}` file and not in the `commit-graph` file, preventing
+significant performance issues when the stack merges and another process only
+partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple, 64,000 for the maximum
+number of commits) could be extracted into config settings for full
+flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
diff --git a/commit-graph.c b/commit-graph.c
index 7d3e001479..bfef1b6960 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1291,36 +1291,6 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
-{
-	struct commit_graph *g = ctx->r->objects->commit_graph;
-	uint32_t i;
-
-	ctx->new_base_graph = g;
-	ctx->base_graph_name = xstrdup(g->filename);
-	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
-
-	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
-
-	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
-	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
-
-	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
-		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
-
-	if (ctx->num_commit_graphs_before)
-		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
-			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
-
-	i = ctx->num_commit_graphs_before - 1;
-
-	while (g) {
-		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
-		i--;
-		g = g->base_graph;
-	}
-}
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
@@ -1498,6 +1468,149 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
+static int split_strategy_max_commits = 64000;
+static float split_strategy_size_mult = 2.0f;
+
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t num_commits = ctx->commits.nr;
+	uint32_t i;
+
+	g = ctx->r->objects->commit_graph;
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
+		     num_commits > split_strategy_max_commits)) {
+		num_commits += g->num_commits;
+		g = g->base_graph;
+
+		ctx->num_commit_graphs_after--;
+	}
+
+	ctx->new_base_graph = g;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_after &&
+		    i < ctx->num_commit_graphs_before; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	i = ctx->num_commit_graphs_before - 1;
+	g = ctx->r->objects->commit_graph;
+
+	while (g) {
+		if (i < ctx->num_commit_graphs_after)
+			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+
+		i--;
+		g = g->base_graph;
+	}
+}
+
+static void merge_commit_graph(struct write_commit_graph_context *ctx,
+			       struct commit_graph *g)
+{
+	uint32_t i;
+	uint32_t offset = g->num_commits_in_base;
+
+	ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+
+		load_oid_from_graph(g, i + offset, &oid);
+
+		/* only add commits if they still exist in the repo */
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ctx->commits.list[ctx->commits.nr] = result;
+			ctx->commits.nr++;
+		}
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit **)_a;
+	const struct commit *b = *(const struct commit **)_b;
+	return oidcmp(&a->object.oid, &b->object.oid);
+}
+
+static void deduplicate_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, num_parents, last_distinct = 0, duplicates = 0;
+	struct commit_list *parent;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("De-duplicating merged commits"),
+					ctx->commits.nr);
+
+	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+
+	ctx->num_extra_edges = 0;
+	for (i = 1; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i);
+
+		if (oideq(&ctx->commits.list[last_distinct]->object.oid,
+			  &ctx->commits.list[i]->object.oid)) {
+			duplicates++;
+		} else {
+			if (duplicates)
+				ctx->commits.list[last_distinct + 1] = ctx->commits.list[i];
+			last_distinct++;
+
+			num_parents = 0;
+			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
+				num_parents++;
+
+			if (num_parents > 2)
+				ctx->num_extra_edges += num_parents - 2;
+		}
+	}
+
+	ctx->commits.nr -= duplicates;
+	stop_progress(&ctx->progress);
+}
+
+static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t current_graph_number = ctx->num_commit_graphs_before;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
+		current_graph_number--;
+
+		if (ctx->report_progress) {
+			strbuf_addstr(&progress_title, _("Merging commit-graph"));
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		}
+
+		merge_commit_graph(ctx, g);
+		stop_progress(&ctx->progress);
+		strbuf_release(&progress_title);
+
+		g = g->base_graph;
+	}
+
+	if (g) {
+		ctx->new_base_graph = g;
+		ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+	}
+
+	if (ctx->new_base_graph)
+		ctx->base_graph_name = xstrdup(ctx->new_base_graph->filename);
+
+	deduplicate_commits(ctx);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1543,6 +1656,9 @@ int write_commit_graph(const char *obj_dir,
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
+	if (ctx->split && ctx->oids.alloc > split_strategy_max_commits)
+		ctx->oids.alloc = split_strategy_max_commits;
+
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
 		if (ctx->r->objects->commit_graph)
@@ -1596,9 +1712,11 @@ int write_commit_graph(const char *obj_dir,
 	if (!ctx->commits.nr)
 		goto cleanup;
 
-	if (ctx->split)
-		init_commit_graph_chain(ctx);
-	else
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
+
+		merge_commit_graphs(ctx);
+	} else
 		ctx->num_commit_graphs_after = 1;
 
 	compute_generation_numbers(ctx);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index ccd24bd22b..5cb5663a30 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -119,4 +119,17 @@ test_expect_success 'add one commit, write a tip graph' '
 
 graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
 
+test_expect_success 'add one commit, write a merged graph' '
+	test_commit 12 &&
+	git branch commits/12 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 2 $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 4 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v3 10/14] commit-graph: allow cross-alternate chains
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 09/14] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:03     ` [PATCH v3 11/14] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an environment like a fork network, it is helpful to have a
commit-graph chain that spans both the base repo and the fork repo. The
fork is usually a small set of data on top of the large repo, but
sometimes the fork is much larger. For example, git-for-windows/git has
almost twice as many commits as git/git because it rebases its
commits on every major version update.

To allow cross-alternate commit-graph chains, we need a few pieces:

1. When looking for a graph-{hash}.graph file, check all alternates.

2. When merging commit-graph chains, do not merge across alternates.

3. When writing a new commit-graph chain based on a commit-graph file
   in another object directory, do not allow success if the base file
   has the name "commit-graph" instead of
   "commit-graphs/graph-{hash}.graph".

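As a rough illustration of piece 1, the lookup order can be modeled in
memory. This is a sketch only: `struct odb_model`, `find_graph_odb()` and
`split_graph_path()` are hypothetical names, not functions from
commit-graph.c, and the path layout is assumed from the
"$OBJDIR/info/commit-graphs/graph-{hash}.graph" naming used in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory model of one object directory. */
struct odb_model {
	const char *path;          /* object directory path */
	const char *const *hashes; /* hashes that have a graph-{hash}.graph file */
	size_t nr;                 /* number of entries in hashes */
};

/*
 * Return the index of the first object directory that holds
 * graph-{hash}.graph, or -1 if none does. odbs[0] models the local
 * directory; the remaining entries model the alternates, in order.
 */
int find_graph_odb(const struct odb_model *odbs, size_t nr_odbs,
		   const char *hash)
{
	size_t i, j;

	assert(hash);
	for (i = 0; i < nr_odbs; i++)
		for (j = 0; j < odbs[i].nr; j++)
			if (!strcmp(odbs[i].hashes[j], hash))
				return (int)i;
	return -1;
}

/* Format the assumed on-disk name; returns the snprintf() result. */
int split_graph_path(char *buf, size_t len, const char *obj_dir,
		     const char *hash)
{
	return snprintf(buf, len, "%s/info/commit-graphs/graph-%s.graph",
			obj_dir, hash);
}
```

A chain loader would record the winning directory next to each level (the
patch stores it in `g->obj_dir`), so that a later merge can stop at the
first level that lives in a different object directory.
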
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 40 +++++++++++++++++++++
 commit-graph.c                           | 46 ++++++++++++++++++------
 commit-graph.h                           |  1 +
 t/t5323-split-commit-graph.sh            | 37 +++++++++++++++++++
 4 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index d9c6253b0a..473032e476 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -266,6 +266,42 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Chains across multiple object directories
+
+In a repo with alternates, we look for the `commit-graph-chain` file starting
+in the local object directory and then in each alternate. The first file that
+exists defines our chain. As we look for the `graph-{hash}` files for
+each `{hash}` in the chain file, we follow the same pattern for the host
+directories.
+
+This allows commit-graphs to be split across multiple forks in a fork network.
+The typical case is a large "base" repo with many smaller forks.
+
+As the base repo advances, it will likely update and merge its commit-graph
+chain more frequently than the forks. If a fork updates their commit-graph after
+the base repo, then it should "reparent" the commit-graph chain onto the new
+chain in the base repo. When reading each `graph-{hash}` file, we track
+the object directory containing it. During a write of a new commit-graph file,
+we check for any changes in the source object directory and read the
+`commit-graph-chain` file for that source and create a new file based on those
+files. During this "reparent" operation, we necessarily need to collapse all
+levels in the fork, as all of the files are invalid against the new base file.
+
+It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
+files in this scenario. It falls to the user to define the proper settings for
+their custom environment:
+
+ 1. When merging levels in the base repo, the unreferenced files may still be
+    referenced by chains from fork repos.
+
+ 2. The expiry time should be set to a length of time such that every fork has
+    time to recompute their commit-graph chain to "reparent" onto the new base
+    file(s).
+
+ 3. If the commit-graph chain is updated in the base, the fork will not have
+    access to the new chain until its chain is updated to reference those files.
+    (This may change in the future [5].)
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
@@ -292,3 +328,7 @@ Related Links
 
 [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
     A patch to remove the ahead-behind calculation from 'status'.
+
+[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
+    A discussion of a "two-dimensional graph position" that can allow reading
+    multiple commit-graph chains at the same time.
diff --git a/commit-graph.c b/commit-graph.c
index bfef1b6960..0b7c186a5d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,9 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r, const cha
 	struct commit_graph *g = load_commit_graph_one(graph_name);
 	free(graph_name);
 
+	if (g)
+		g->obj_dir = obj_dir;
+
 	return g;
 }
 
@@ -385,8 +388,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
 
 	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
-		char *graph_name;
-		struct commit_graph *g;
+		struct object_directory *odb;
 
 		if (get_oid_hex(line.buf, &oids[i])) {
 			warning(_("invalid commit-graph chain: line '%s' not a hash"),
@@ -395,14 +397,23 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 			break;
 		}
 
-		graph_name = get_split_graph_filename(obj_dir, line.buf);
-		g = load_commit_graph_one(graph_name);
-		free(graph_name);
+		for (odb = r->objects->odb; odb; odb = odb->next) {
+			char *graph_name = get_split_graph_filename(odb->path, line.buf);
+			struct commit_graph *g = load_commit_graph_one(graph_name);
 
-		if (g && add_graph_to_chain(g, graph_chain, oids, i))
-			graph_chain = g;
-		else
-			valid = 0;
+			free(graph_name);
+
+			if (g) {
+				g->obj_dir = odb->path;
+
+				if (add_graph_to_chain(g, graph_chain, oids, i))
+					graph_chain = g;
+				else
+					valid = 0;
+
+				break;
+			}
+		}
 	}
 
 	free(oids);
@@ -1411,7 +1422,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
 		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
-		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+		char *new_base_name = get_split_graph_filename(ctx->new_base_graph->obj_dir, new_base_hash);
 
 		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
 		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
@@ -1482,6 +1493,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
 		     num_commits > split_strategy_max_commits)) {
+		if (strcmp(g->obj_dir, ctx->obj_dir))
+			break;
+
 		num_commits += g->num_commits;
 		g = g->base_graph;
 
@@ -1490,6 +1504,18 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	ctx->new_base_graph = g;
 
+	if (ctx->num_commit_graphs_after == 2) {
+		char *old_graph_name = get_commit_graph_filename(g->obj_dir);
+
+		if (!strcmp(g->filename, old_graph_name) &&
+		    strcmp(g->obj_dir, ctx->obj_dir)) {
+			ctx->num_commit_graphs_after = 1;
+			ctx->new_base_graph = NULL;
+		}
+
+		free(old_graph_name);
+	}
+
 	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
 	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
 
diff --git a/commit-graph.h b/commit-graph.h
index 5c48c4f66a..10466bc064 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,7 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 	char *filename;
+	const char *obj_dir;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index 5cb5663a30..fad27dc6e3 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -90,6 +90,21 @@ test_expect_success 'add more commits, and write a new base graph' '
 	graph_read_expect 12
 '
 
+test_expect_success 'fork and fail to base a chain on a commit-graph file' '
+	test_when_finished rm -rf fork &&
+	git clone . fork &&
+	(
+		cd fork &&
+		rm .git/objects/info/commit-graph &&
+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
+		test_commit new-commit &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir
+	)
+'
+
 test_expect_success 'add three more commits, write a tip graph' '
 	git reset --hard commits/3 &&
 	git merge merge/1 &&
@@ -132,4 +147,26 @@ test_expect_success 'add one commit, write a merged graph' '
 
 graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
 
+test_expect_success 'create fork and chain across alternate' '
+	git clone . fork &&
+	(
+		cd fork &&
+		git config core.commitGraph true &&
+		rm -rf $graphdir &&
+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
+		test_commit 13 &&
+		git branch commits/13 &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files &&
+		git -c core.commitGraph=true  rev-list HEAD >expect &&
+		git -c core.commitGraph=false rev-list HEAD >actual &&
+		test_cmp expect actual
+	)
+'
+
+graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v3 11/14] commit-graph: expire commit-graph files
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 10/14] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
@ 2019-06-03 16:03     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:04     ` [PATCH v3 12/14] commit-graph: create options for split files Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:03 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we merge commit-graph files in a commit-graph chain, we should clean
up the files that are no longer used.

This change introduces an 'expiry_window' value to the context, which is
always zero (for now). We then check the modified time of each
graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and
unlink the files that are older than the expiry_window.

Since this is always zero, this immediately clears all unused graph
files. We will update the value to match a config setting in a future
change.
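
The per-file decision made while scanning the directory can be sketched
as a pure function. This is an illustration under assumptions, not the
code in this patch: `has_graph_suffix()` and `should_expire()` are
hypothetical helpers, and the chain is modeled as a plain array of file
names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Return 1 if name ends in ".graph", 0 otherwise. */
int has_graph_suffix(const char *name)
{
	size_t len = strlen(name);

	return len >= 6 && !strcmp(name + len - 6, ".graph");
}

/*
 * Decide whether a file found in the commit-graphs directory may be
 * unlinked. A file is kept if it was modified after expire_time, if it
 * is not a .graph file, or if the new chain still references it.
 */
int should_expire(const char *name, time_t mtime, time_t expire_time,
		  const char *const *in_chain, size_t nr_in_chain)
{
	size_t i;

	assert(name);
	if (mtime > expire_time)
		return 0;
	if (!has_graph_suffix(name))
		return 0;
	for (i = 0; i < nr_in_chain; i++)
		if (!strcmp(in_chain[i], name))
			return 0;
	return 1;
}
```

With a window of zero, `expire_time` is simply the current time, so every
unreferenced .graph file qualifies for removal as soon as the new chain
is written.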

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 15 ++++++
 commit-graph.c                           | 69 ++++++++++++++++++++++++
 t/t5323-split-commit-graph.sh            |  2 +-
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 473032e476..aed4350a59 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -266,6 +266,21 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Deleting graph-{hash} files
+
+After a new tip file is written, some `graph-{hash}` files may no longer
+be part of a chain. It is important to remove these files from disk, eventually.
+The main reason to delay removal is that another process could read the
+`commit-graph-chain` file before it is rewritten, but then look for the
+`graph-{hash}` files after they are deleted.
+
+To allow holding old split commit-graphs for a while after they are unreferenced,
+we update the modified times of the files when they become unreferenced. Then,
+we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
+files whose modified times are older than a given expiry window. This window
+defaults to zero, but can be changed using command-line arguments or a config
+setting.
+
 ## Chains across multiple object directories
 
 In a repo with alternates, we look for the `commit-graph-chain` file starting
diff --git a/commit-graph.c b/commit-graph.c
index 0b7c186a5d..4cbced7ff4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1637,6 +1637,70 @@ static void merge_commit_graphs(struct write_commit_graph_context *ctx)
 	deduplicate_commits(ctx);
 }
 
+static void mark_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	uint32_t i;
+	time_t now = time(NULL);
+
+	for (i = ctx->num_commit_graphs_after - 1; i < ctx->num_commit_graphs_before; i++) {
+		struct stat st;
+		struct utimbuf updated_time;
+
+		stat(ctx->commit_graph_filenames_before[i], &st);
+
+		updated_time.actime = st.st_atime;
+		updated_time.modtime = now;
+		utime(ctx->commit_graph_filenames_before[i], &updated_time);
+	}
+}
+
+static void expire_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct strbuf path = STRBUF_INIT;
+	DIR *dir;
+	struct dirent *de;
+	size_t dirnamelen;
+	time_t expire_time = time(NULL);
+
+	strbuf_addstr(&path, ctx->obj_dir);
+	strbuf_addstr(&path, "/info/commit-graphs");
+	dir = opendir(path.buf);
+
+	if (!dir) {
+		strbuf_release(&path);
+		return;
+	}
+
+	strbuf_addch(&path, '/');
+	dirnamelen = path.len;
+	while ((de = readdir(dir)) != NULL) {
+		struct stat st;
+		uint32_t i, found = 0;
+
+		strbuf_setlen(&path, dirnamelen);
+		strbuf_addstr(&path, de->d_name);
+
+		stat(path.buf, &st);
+
+		if (st.st_mtime > expire_time)
+			continue;
+		if (path.len < 6 || strcmp(path.buf + path.len - 6, ".graph"))
+			continue;
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			if (!strcmp(ctx->commit_graph_filenames_after[i],
+				    path.buf)) {
+				found = 1;
+				break;
+			}
+		}
+
+		if (!found)
+			unlink(path.buf);
+
+	}
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1749,6 +1813,11 @@ int write_commit_graph(const char *obj_dir,
 
 	res = write_commit_graph_file(ctx);
 
+	if (ctx->split) {
+		mark_commit_graphs(ctx);
+		expire_commit_graphs(ctx);
+	}
+
 cleanup:
 	free(ctx->graph_name);
 	free(ctx->commits.list);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index fad27dc6e3..ae45329651 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -141,7 +141,7 @@ test_expect_success 'add one commit, write a merged graph' '
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 2 $graphdir/commit-graph-chain &&
 	ls $graphdir/graph-*.graph >graph-files &&
-	test_line_count = 4 graph-files &&
+	test_line_count = 2 graph-files &&
 	verify_chain_files_exist $graphdir
 '
 
-- 
gitgitgadget



* [PATCH v3 12/14] commit-graph: create options for split files
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (10 preceding siblings ...)
  2019-06-03 16:03     ` [PATCH v3 11/14] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
@ 2019-06-03 16:04     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:04     ` [PATCH v3 13/14] commit-graph: verify chains with --shallow mode Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:04 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The split commit-graph feature is now fully implemented, but needs
some more run-time configurability. Allow direct callers to 'git
commit-graph write --split' to specify the values used in the
merge strategy and the expire time.

Update the documentation to specify these values.
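
How the two merge-strategy values interact can be sketched with a small
standalone model. The names `struct level` and `levels_after_write()`
are hypothetical and exist only for this illustration; the loop mirrors
the cascade described in the documentation but ignores the
cross-alternate check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one level of the chain; not Git's struct. */
struct level {
	uint32_t num_commits;
};

/*
 * Return how many levels the chain has after writing new_commits on top
 * of an existing chain. levels[0] is the base and levels[nr_levels - 1]
 * is the current tip. A level is merged into the new file when it has at
 * most size_multiple times as many commits as the new file, or when
 * max_commits is non-zero and the new file exceeds it. Each merge grows
 * the new file, so the decision cascades down toward the base.
 */
size_t levels_after_write(const struct level *levels, size_t nr_levels,
			  uint32_t new_commits,
			  int size_multiple, int max_commits)
{
	size_t kept = nr_levels;
	uint64_t num_commits = new_commits;

	assert(size_multiple > 0 && max_commits >= 0);

	while (kept > 0 &&
	       (levels[kept - 1].num_commits <=
			(uint64_t)size_multiple * num_commits ||
		(max_commits && num_commits > (uint64_t)max_commits))) {
		num_commits += levels[kept - 1].num_commits;
		kept--;
	}

	/* The surviving levels plus the newly written file. */
	return kept + 1;
}
```

For a chain of 1000 and 10 commits, adding one commit with a size
multiple of 2 leaves three levels, while a size multiple of 10 merges the
tip and leaves two.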

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt       | 21 +++++++++++++-
 Documentation/technical/commit-graph.txt |  7 +++--
 builtin/commit-graph.c                   | 20 +++++++++++---
 builtin/commit.c                         |  2 +-
 builtin/gc.c                             |  3 +-
 commit-graph.c                           | 35 ++++++++++++++++--------
 commit-graph.h                           | 12 ++++++--
 t/t5323-split-commit-graph.sh            | 35 ++++++++++++++++++++++++
 8 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 624470e198..365e145e82 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -26,7 +26,7 @@ OPTIONS
 	Use given directory for the location of packfiles and commit-graph
 	file. This parameter exists to specify the location of an alternate
 	that only has the objects directory, not a full `.git` directory. The
-	commit-graph file is expected to be at `<dir>/info/commit-graph` and
+	commit-graph file is expected to be in the `<dir>/info` directory and
 	the packfiles are expected to be in `<dir>/pack`.
 
 
@@ -51,6 +51,25 @@ or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
++
+With the `--split` option, write the commit-graph as a chain of multiple
+commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
+not already in the commit-graph are added in a new "tip" file. This file
+is merged with the existing file if the following merge conditions are
+met:
++
+* If `--size-multiple=<X>` is not specified, let `X` equal 2. If the new
+tip file would have `N` commits and the previous tip has `M` commits and
+`X` times `N` is greater than `M`, instead merge the two files into a
+single file.
++
+* If `--max-commits=<M>` is specified with `M` a positive integer, and the
+new tip file would have more than `M` commits, then instead merge the new
+tip with the previous tip.
++
+Finally, if `--expire-time=<datetime>` is not specified, let `datetime`
+be the current time. After writing the split commit-graph, delete all
+unused commit-graph files whose modified times are older than `datetime`.
 
 'read'::
 
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index aed4350a59..729fbcb32f 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -248,10 +248,11 @@ When writing a set of commits that do not exist in the commit-graph stack of
 height N, we default to creating a new file at level N + 1. We then decide to
 merge with the Nth level if one of two conditions hold:
 
-  1. The expected file size for level N + 1 is at least half the file size for
-     level N.
+  1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
+     level N is less than X times the number of commits in level N + 1.
 
-  2. Level N + 1 contains more than 64,0000 commits.
+  2. `--max-commits=<C>` is specified with non-zero C and the number of commits
+     in level N + 1 is more than C commits.
 
 This decision cascades down the levels: when we merge a level we create a new
 set of commits that then compares to the next level.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index c2c07d3917..18e3b61fb6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
@@ -135,6 +135,7 @@ static int graph_read(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
+struct split_commit_graph_opts split_opts;
 
 static int graph_write(int argc, const char **argv)
 {
@@ -158,9 +159,19 @@ static int graph_write(int argc, const char **argv)
 			N_("include all commits already in the commit-graph file")),
 		OPT_BOOL(0, "split", &opts.split,
 			N_("allow writing an incremental commit-graph file")),
+		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+			N_("maximum number of commits in a non-base split commit-graph")),
+		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+			N_("maximum ratio between two levels of a split commit-graph")),
+		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
+	split_opts.size_multiple = 2;
+	split_opts.max_commits = 0;
+	split_opts.expire_time = 0;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
@@ -177,7 +188,7 @@ static int graph_write(int argc, const char **argv)
 	read_replace_refs = 0;
 
 	if (opts.reachable)
-		return write_commit_graph_reachable(opts.obj_dir, flags);
+		return write_commit_graph_reachable(opts.obj_dir, flags, &split_opts);
 
 	string_list_init(&lines, 0);
 	if (opts.stdin_packs || opts.stdin_commits) {
@@ -197,7 +208,8 @@ static int graph_write(int argc, const char **argv)
 	result = write_commit_graph(opts.obj_dir,
 				    pack_indexes,
 				    commit_hex,
-				    flags);
+				    flags,
+				    &split_opts);
 
 	UNLEAK(lines);
 	return result;
diff --git a/builtin/commit.c b/builtin/commit.c
index b001ef565d..9216e9c043 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1670,7 +1670,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "not exceeded, and then \"git reset HEAD\" to recover."));
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(get_object_directory(), 0))
+	    write_commit_graph_reachable(get_object_directory(), 0, NULL))
 		return 1;
 
 	repo_rerere(the_repository, 0);
diff --git a/builtin/gc.c b/builtin/gc.c
index df2573f124..2ab590ffd4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -666,7 +666,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	if (gc_write_commit_graph &&
 	    write_commit_graph_reachable(get_object_directory(),
-					 !quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0))
+					 !quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0,
+					 NULL))
 		return 1;
 
 	if (auto_gc && too_many_loose_objects())
diff --git a/commit-graph.c b/commit-graph.c
index 4cbced7ff4..9d1e7393e4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -753,6 +753,8 @@ struct write_commit_graph_context {
 	unsigned append:1,
 		 report_progress:1,
 		 split:1;
+
+	const struct split_commit_graph_opts *split_opts;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1101,14 +1103,15 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-int write_commit_graph_reachable(const char *obj_dir, unsigned int flags)
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags,
+				 const struct split_commit_graph_opts *split_opts)
 {
 	struct string_list list = STRING_LIST_INIT_DUP;
 	int result;
 
 	for_each_ref(add_ref_to_list, &list);
 	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags);
+				    flags, split_opts);
 
 	string_list_clear(&list, 0);
 	return result;
@@ -1479,20 +1482,25 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
-static int split_strategy_max_commits = 64000;
-static float split_strategy_size_mult = 2.0f;
-
 static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 {
 	struct commit_graph *g = ctx->r->objects->commit_graph;
 	uint32_t num_commits = ctx->commits.nr;
 	uint32_t i;
 
+	int max_commits = 0;
+	int size_mult = 2;
+
+	if (ctx->split_opts) {
+		max_commits = ctx->split_opts->max_commits;
+		size_mult = ctx->split_opts->size_multiple;
+	}
+
 	g = ctx->r->objects->commit_graph;
 	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
 
-	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
-		     num_commits > split_strategy_max_commits)) {
+	while (g && (g->num_commits <= size_mult * num_commits ||
+		    (max_commits && num_commits > max_commits))) {
 		if (strcmp(g->obj_dir, ctx->obj_dir))
 			break;
 
@@ -1660,7 +1668,10 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	DIR *dir;
 	struct dirent *de;
 	size_t dirnamelen;
-	time_t expire_time = time(NULL);
+	timestamp_t expire_time = time(NULL);
+
+	if (ctx->split_opts && ctx->split_opts->expire_time)
+		expire_time = ctx->split_opts->expire_time;
 
 	strbuf_addstr(&path, ctx->obj_dir);
 	strbuf_addstr(&path, "/info/commit-graphs");
@@ -1704,7 +1715,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       unsigned int flags)
+		       unsigned int flags,
+		       const struct split_commit_graph_opts *split_opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -1719,6 +1731,7 @@ int write_commit_graph(const char *obj_dir,
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+	ctx->split_opts = split_opts;
 
 	if (ctx->split) {
 		struct commit_graph *g;
@@ -1746,8 +1759,8 @@ int write_commit_graph(const char *obj_dir,
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && ctx->oids.alloc > split_strategy_max_commits)
-		ctx->oids.alloc = split_strategy_max_commits;
+	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
+		ctx->oids.alloc = split_opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
diff --git a/commit-graph.h b/commit-graph.h
index 10466bc064..194acab2b7 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,19 @@ int generation_numbers_enabled(struct repository *r);
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
 #define COMMIT_GRAPH_SPLIT      (1 << 2)
 
-int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
+struct split_commit_graph_opts {
+	int size_multiple;
+	int max_commits;
+	timestamp_t expire_time;
+};
+
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags,
+				 const struct split_commit_graph_opts *split_opts);
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       unsigned int flags);
+		       unsigned int flags,
+		       const struct split_commit_graph_opts *split_opts);
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);
 
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index ae45329651..2f9f1bf0dd 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -169,4 +169,39 @@ test_expect_success 'create fork and chain across alternate' '
 
 graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
 
+test_expect_success 'test merge strategy constants' '
+	git clone . merge-2 &&
+	(
+		cd merge-2 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --size-multiple=2 &&
+		test_line_count = 3 $graphdir/commit-graph-chain
+
+	) &&
+	git clone . merge-10 &&
+	(
+		cd merge-10 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --size-multiple=10 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files
+	) &&
+	git clone . merge-10-expire &&
+	(
+		cd merge-10-expire &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 15 &&
+		git commit-graph write --reachable --split --size-multiple=10 --expire-time=1980-01-01 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 3 graph-files
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v3 13/14] commit-graph: verify chains with --shallow mode
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (11 preceding siblings ...)
  2019-06-03 16:04     ` [PATCH v3 12/14] commit-graph: create options for split files Derrick Stolee via GitGitGadget
@ 2019-06-03 16:04     ` Derrick Stolee via GitGitGadget
  2019-06-03 16:04     ` [PATCH v3 14/14] commit-graph: clean up chains after flattened write Derrick Stolee via GitGitGadget
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:04 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we wrote a commit-graph chain, we only modified the tip file in
the chain. It is valuable to verify what we wrote without wasting
time checking files we did not write.

Add a '--shallow' option to the 'git commit-graph verify' subcommand
and check that it does not read the base graph in a two-file chain.

Making the verify subcommand read from a chain of commit-graphs takes
some rearranging of the builtin code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  5 ++++-
 builtin/commit-graph.c             | 27 +++++++++++++++++++--------
 commit-graph.c                     | 15 ++++++++++++---
 commit-graph.h                     |  6 ++++--
 t/t5323-split-commit-graph.sh      | 21 +++++++++++++++++++++
 5 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 365e145e82..eb5e7865f0 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph read' [--object-dir <dir>]
-'git commit-graph verify' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>] [--shallow]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -80,6 +80,9 @@ Used for debugging purposes.
 
 Read the commit-graph file and verify its contents against the object
 database. Used to check for corrupted data.
++
+With the `--shallow` option, only check the tip commit-graph file in
+a chain of split commit-graphs.
 
 
 EXAMPLES
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 18e3b61fb6..7cde1e1aaa 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -5,17 +5,18 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit-graph.h"
+#include "object-store.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
-	N_("git commit-graph verify [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>] [--shallow]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
 static const char * const builtin_commit_graph_verify_usage[] = {
-	N_("git commit-graph verify [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>] [--shallow]"),
 	NULL
 };
 
@@ -36,6 +37,7 @@ static struct opts_commit_graph {
 	int stdin_commits;
 	int append;
 	int split;
+	int shallow;
 } opts;
 
 static int graph_verify(int argc, const char **argv)
@@ -45,11 +47,14 @@ static int graph_verify(int argc, const char **argv)
 	int open_ok;
 	int fd;
 	struct stat st;
+	int flags = 0;
 
 	static struct option builtin_commit_graph_verify_options[] = {
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			   N_("dir"),
 			   N_("The object directory to store the graph")),
+		OPT_BOOL(0, "shallow", &opts.shallow,
+			 N_("if the commit-graph is split, only verify the tip file")),
 		OPT_END(),
 	};
 
@@ -59,21 +64,27 @@ static int graph_verify(int argc, const char **argv)
 
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
+	if (opts.shallow)
+		flags |= COMMIT_GRAPH_VERIFY_SHALLOW;
 
 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	open_ok = open_commit_graph(graph_name, &fd, &st);
-	if (!open_ok && errno == ENOENT)
-		return 0;
-	if (!open_ok)
+	if (!open_ok && errno != ENOENT)
 		die_errno(_("Could not open commit-graph '%s'"), graph_name);
-	graph = load_commit_graph_one_fd_st(fd, &st);
+
 	FREE_AND_NULL(graph_name);
 
+	if (open_ok)
+		graph = load_commit_graph_one_fd_st(fd, &st);
+	else
+		graph = read_commit_graph_one(the_repository, opts.obj_dir);
+
+	/* Return failure if open_ok predicted success */
 	if (!graph)
-		return 1;
+		return !!open_ok;
 
 	UNLEAK(graph);
-	return verify_commit_graph(the_repository, graph);
+	return verify_commit_graph(the_repository, graph, flags);
 }
 
 static int graph_read(int argc, const char **argv)
diff --git a/commit-graph.c b/commit-graph.c
index 9d1e7393e4..f7459d40f2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -422,7 +422,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 	return graph_chain;
 }
 
-static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
+struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
 {
 	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
 
@@ -1872,7 +1872,7 @@ static void graph_report(const char *fmt, ...)
 #define GENERATION_ZERO_EXISTS 1
 #define GENERATION_NUMBER_EXISTS 2
 
-int verify_commit_graph(struct repository *r, struct commit_graph *g)
+int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 {
 	uint32_t i, cur_fanout_pos = 0;
 	struct object_id prev_oid, cur_oid, checksum;
@@ -1880,6 +1880,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 	struct hashfile *f;
 	int devnull;
 	struct progress *progress = NULL;
+	int local_error = 0;
 
 	if (!g) {
 		graph_report("no commit-graph file loaded");
@@ -1974,6 +1975,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 				break;
 			}
 
+			/* parse parent in case it is in a base graph */
+			parse_commit_in_graph_one(r, g, graph_parents->item);
+
 			if (!oideq(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
 				graph_report(_("commit-graph parent for %s is %s != %s"),
 					     oid_to_hex(&cur_oid),
@@ -2025,7 +2029,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 	}
 	stop_progress(&progress);
 
-	return verify_commit_graph_error;
+	local_error = verify_commit_graph_error;
+
+	if (!(flags & COMMIT_GRAPH_VERIFY_SHALLOW) && g->base_graph)
+		local_error |= verify_commit_graph(r, g->base_graph, flags);
+
+	return local_error;
 }
 
 void free_commit_graph(struct commit_graph *g)
diff --git a/commit-graph.h b/commit-graph.h
index 194acab2b7..84e5e91fc6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -61,7 +61,7 @@ struct commit_graph {
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-
+struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir);
 struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 					size_t graph_size);
 
@@ -89,7 +89,9 @@ int write_commit_graph(const char *obj_dir,
 		       unsigned int flags,
 		       const struct split_commit_graph_opts *split_opts);
 
-int verify_commit_graph(struct repository *r, struct commit_graph *g);
+#define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
+
+int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
 
 void close_commit_graph(struct repository *);
 void free_commit_graph(struct commit_graph *);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index 2f9f1bf0dd..b70dc90706 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -204,4 +204,25 @@ test_expect_success 'test merge stragety constants' '
 	)
 '
 
+corrupt_file() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'verify shallow' '
+	git clone . verify &&
+	(
+		cd verify &&
+		git commit-graph verify &&
+		base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
+		corrupt_file "$base_file" 1760 "\01" &&
+		git commit-graph verify --shallow &&
+		test_must_fail git commit-graph verify 2>test_err &&
+		grep -v "^+" test_err >err &&
+		test_i18ngrep "incorrect checksum" err
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 136+ messages in thread

* [PATCH v3 14/14] commit-graph: clean up chains after flattened write
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (12 preceding siblings ...)
  2019-06-03 16:04     ` [PATCH v3 13/14] commit-graph: verify chains with --shallow mode Derrick Stolee via GitGitGadget
@ 2019-06-03 16:04     ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  14 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-03 16:04 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we write a commit-graph file without the split option, then
we write to $OBJDIR/info/commit-graph and start to ignore
the chains in $OBJDIR/info/commit-graphs/.

Unlink the commit-graph-chain file and expire the graph-{hash}.graph
files in $OBJDIR/info/commit-graphs/ during every write.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c                | 12 +++++++++---
 t/t5323-split-commit-graph.sh | 12 ++++++++++++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f7459d40f2..291a3c60ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1672,6 +1672,12 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 
 	if (ctx->split_opts && ctx->split_opts->expire_time)
 		expire_time -= ctx->split_opts->expire_time;
+	if (!ctx->split) {
+		char *chain_file_name = get_chain_filename(ctx->obj_dir);
+		unlink(chain_file_name);
+		free(chain_file_name);
+		ctx->num_commit_graphs_after = 0;
+	}
 
 	strbuf_addstr(&path, ctx->obj_dir);
 	strbuf_addstr(&path, "/info/commit-graphs");
@@ -1826,10 +1832,10 @@ int write_commit_graph(const char *obj_dir,
 
 	res = write_commit_graph_file(ctx);
 
-	if (ctx->split) {
+	if (ctx->split)
 		mark_commit_graphs(ctx);
-		expire_commit_graphs(ctx);
-	}
+
+	expire_commit_graphs(ctx);
 
 cleanup:
 	free(ctx->graph_name);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index b70dc90706..9a19de9e80 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -204,6 +204,18 @@ test_expect_success 'test merge stragety constants' '
 	)
 '
 
+test_expect_success 'remove commit-graph-chain file after flattening' '
+	git clone . flatten &&
+	(
+		cd flatten &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		git commit-graph write --reachable &&
+		test_path_is_missing $graphdir/commit-graph-chain &&
+		ls $graphdir >graph-files &&
+		test_line_count = 0 graph-files
+	)
+'
+
 corrupt_file() {
 	file=$1
 	pos=$2
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-03 16:03     ` [PATCH v3 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-05 17:22       ` Junio C Hamano
  2019-06-05 18:09         ` Derrick Stolee
  2019-06-06 12:10       ` Philip Oakley
  1 sibling, 1 reply; 136+ messages in thread
From: Junio C Hamano @ 2019-06-05 17:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a basic description of commit-graph chains. More details about the
> feature will be added as we add functionality. This introduction gives a
> high-level overview to the goals of the feature and the basic layout of
> commit-graph chains.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index fb53341d5e..1dca3bd8fe 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -127,6 +127,65 @@ Design Details
>    helpful for these clones, anyway. The commit-graph will not be read or
>    written when shallow commits are present.
>  
> +Commit Graphs Chains
> +--------------------
> +
> +Typically, repos grow with near-constant velocity (commits per day). Over time,
> +the number of commits added by a fetch operation is much smaller than the
> +number of commits in the full history. By creating a "chain" of commit-graphs,
> +we enable fast writes of new commit data without rewriting the entire commit
> +history -- at least, most of the time.
> +
> +## File Layout
> +
> +A commit-graph chain uses multiple files, and we use a fixed naming convention
> +to organize these files. Each commit-graph file has a name
> +`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
> +valued hash stored in the footer of that file (which is a hash of the file's
> +contents before that hash). For a chain of commit-graph files, a plain-text
> +file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
> +hashes for the files in order from "lowest" to "highest".
> +
> +For example, if the `commit-graph-chain` file contains the lines
> +
> +```
> +	{hash0}
> +	{hash1}
> +	{hash2}
> +```
> +
> +then the commit-graph chain looks like the following diagram:
> +
> + +-----------------------+
> + |  graph-{hash2}.graph  |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |  graph-{hash1}.graph  |
> + |                       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  graph-{hash0}.graph  |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
> +commits in `graph-{hash1}.graph`, and X2 be the number of commits in
> +`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
> +then we interpret this as being the commit in position (X0 + X1 + i), and that
> +will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
> +positions to refer to their parents, which may be in `graph-{hash1}.graph` or
> +`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
> +its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
> +X2).

One thing that I fail to read from the above is what it means for
graphs to be inside a single chain.  What is the significance for a
graph file graph-{hash1}.graph to be between graph-{hash0}.graph and
graph-{hash2}.graph?   For example, is any of the following true?

 - For a commit in graph-{hash1}.graph file, if graph-{hash0}.graph
   or any other graph files lower in the position in the chain were
   unavailable, information on some ancestor of that commit may not
   be available.

 - Even if graph-{hash2}.graph or any other graph files higher in
   the position in the chain gets lost, information on a commit in
   graph-{hash1}.graph file or any of its ancestors is not affected.

Another thing I've assumed to be true but cannot read from the above
description is that the hashes in `commit-graph-chain` file, other
than the newest one, are merely redundant information, and each
graph file records the hash of its "previous" graph file (i.e. the
one that used to be the youngest before it got created).


   

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-05 17:22       ` Junio C Hamano
@ 2019-06-05 18:09         ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-06-05 18:09 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, Derrick Stolee

On 6/5/2019 1:22 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add a basic description of commit-graph chains. More details about the
>> feature will be added as we add functionality. This introduction gives a
>> high-level overview to the goals of the feature and the basic layout of
>> commit-graph chains.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
>>  1 file changed, 59 insertions(+)
>>
>> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
>> index fb53341d5e..1dca3bd8fe 100644
>> --- a/Documentation/technical/commit-graph.txt
>> +++ b/Documentation/technical/commit-graph.txt
>> @@ -127,6 +127,65 @@ Design Details
>>    helpful for these clones, anyway. The commit-graph will not be read or
>>    written when shallow commits are present.
>>  
>> +Commit Graphs Chains
>> +--------------------
>> +
>> +Typically, repos grow with near-constant velocity (commits per day). Over time,
>> +the number of commits added by a fetch operation is much smaller than the
>> +number of commits in the full history. By creating a "chain" of commit-graphs,
>> +we enable fast writes of new commit data without rewriting the entire commit
>> +history -- at least, most of the time.
>> +
>> +## File Layout
>> +
>> +A commit-graph chain uses multiple files, and we use a fixed naming convention
>> +to organize these files. Each commit-graph file has a name
>> +`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
>> +valued hash stored in the footer of that file (which is a hash of the file's
>> +contents before that hash). For a chain of commit-graph files, a plain-text
>> +file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
>> +hashes for the files in order from "lowest" to "highest".
>> +
>> +For example, if the `commit-graph-chain` file contains the lines
>> +
>> +```
>> +	{hash0}
>> +	{hash1}
>> +	{hash2}
>> +```
>> +
>> +then the commit-graph chain looks like the following diagram:
>> +
>> + +-----------------------+
>> + |  graph-{hash2}.graph  |
>> + +-----------------------+
>> +	  |
>> + +-----------------------+
>> + |                       |
>> + |  graph-{hash1}.graph  |
>> + |                       |
>> + +-----------------------+
>> +	  |
>> + +-----------------------+
>> + |                       |
>> + |                       |
>> + |                       |
>> + |  graph-{hash0}.graph  |
>> + |                       |
>> + |                       |
>> + |                       |
>> + +-----------------------+
>> +
>> +Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
>> +commits in `graph-{hash1}.graph`, and X2 be the number of commits in
>> +`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
>> +then we interpret this as being the commit in position (X0 + X1 + i), and that
>> +will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
>> +positions to refer to their parents, which may be in `graph-{hash1}.graph` or
>> +`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
>> +its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
>> +X2).
> 
> One thing that I fail to read from the above is what it means for
> graphs to be inside a single chain.  What is the significance for a
> graph file graph-{hash1}.graph to be between graph-{hash0}.graph and
> graph-{hash2}.graph?   For example, is any of the following true?
> 
>  - For a commit in graph-{hash1}.graph file, if graph-{hash0}.graph
>    or any other graph files lower in the position in the chain were
>    unavailable, information on some ancestor of that commit may not
>    be available.

Not only that, but if graph-{hash0}.graph is unavailable, then the
graph-{hash1}.graph file is _unusable_. For example, we don't track
the number of commits in the base graph, so we could not accurately
find the commit-parent links even within graph-{hash1}.graph.

>  - Even if graph-{hash2}.graph or any other graph files higher in
>    the position in the chain gets lost, information on a commit in
>    graph-{hash1}.graph file or any of its ancestors is not affected.

This is correct. As we "build from the bottom" we can stop if we fail
to find a file, and all the data we already accessed is still valid.

> Another thing I've assumed to be true but cannot read from the above
> description is that the hashes in `commit-graph-chain` file, other
> than the newest one, are merely redundant information, and each
> graph file records the hash of its "previous" graph file (i.e. the
> one that used to be the youngest before it got created).

If the entire chain is available, then we only really need the tip hash.
However, having the entire chain in this file allows the "building from
the bottom" pattern so we can get some value even if the tip was removed
from under us. Since we expect the base file to change infrequently, this
should cover a large number of commits.

I can try to make this pattern more clear in a future revision, assuming
we stick with the pattern. It remains unclear if this strategy as a whole
has been accepted as a good direction.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-03 16:03     ` [PATCH v3 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
  2019-06-05 17:22       ` Junio C Hamano
@ 2019-06-06 12:10       ` Philip Oakley
  2019-06-06 17:09         ` Derrick Stolee
  1 sibling, 1 reply; 136+ messages in thread
From: Philip Oakley @ 2019-06-06 12:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

Hi Derrick,

On 03/06/2019 17:03, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a basic description of commit-graph chains.
Not really your problem, but I did notice that we don't actually explain 
what we mean here by a commit graph (before we start chaining them), and 
the distinction between the generic concept and the specific implementation.

If I understand it correctly, the regular DAG (directed acyclic graph) 
already inherently contains the commit graph, showing the parent(s) of 
each commit. Hence, why do we need another? (which then needs explaining 
the what/why/how)

So, in one sense, another commit chain is potentially duplicated 
redundant data. What hasn't been surfaced (for the reader coming later) 
is probably that accessing the DAG commit graph can be (a) slow, (b) one 
way (no child relationships), and (c) accesses large amounts of other 
data that isn't relevant to the task at hand.

So the commit graph (implementation) is [I think] a fast, compact, 
sorted(?), list of commit oids that provides two way linkage through the 
commit graph (?) to allow fast queries within the Git codebase.

The commit graph is normally considered immutable; however, the DAG
commit graph can be extended by new commits, trimmed by branch deletion,
rebasing, forced push, etc., or even reorganised via 'replace' or graft
commits, all of which must then be reflected in the commit graph
(implementation).

It just felt that there is a gap between the high-level DAG, explained
in the glossary, and the commit-graph, which perhaps
technical/commit-graph.txt ought to summarise.

--
Philip
>   More details about the
> feature will be added as we add functionality. This introduction gives a
> high-level overview to the goals of the feature and the basic layout of
> commit-graph chains.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
>   1 file changed, 59 insertions(+)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index fb53341d5e..1dca3bd8fe 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -127,6 +127,65 @@ Design Details
>     helpful for these clones, anyway. The commit-graph will not be read or
>     written when shallow commits are present.
>   
> +Commit Graphs Chains
> +--------------------
> +
> +Typically, repos grow with near-constant velocity (commits per day). Over time,
> +the number of commits added by a fetch operation is much smaller than the
> +number of commits in the full history. By creating a "chain" of commit-graphs,
> +we enable fast writes of new commit data without rewriting the entire commit
> +history -- at least, most of the time.
> +
> +## File Layout
> +
> +A commit-graph chain uses multiple files, and we use a fixed naming convention
> +to organize these files. Each commit-graph file has a name
> +`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
> +valued hash stored in the footer of that file (which is a hash of the file's
> +contents before that hash). For a chain of commit-graph files, a plain-text
> +file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
> +hashes for the files in order from "lowest" to "highest".
> +
> +For example, if the `commit-graph-chain` file contains the lines
> +
> +```
> +	{hash0}
> +	{hash1}
> +	{hash2}
> +```
> +
> +then the commit-graph chain looks like the following diagram:
> +
> + +-----------------------+
> + |  graph-{hash2}.graph  |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |  graph-{hash1}.graph  |
> + |                       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  graph-{hash0}.graph  |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
> +commits in `graph-{hash1}.graph`, and X2 be the number of commits in
> +`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
> +then we interpret this as being the commit in position (X0 + X1 + i), and that
> +will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
> +positions to refer to their parents, which may be in `graph-{hash1}.graph` or
> +`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
> +its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
> +X2).
> +
>   Related Links
>   -------------
>   [0] https://bugs.chromium.org/p/git/issues/detail?id=8


^ permalink raw reply	[flat|nested] 136+ messages in thread

* [PATCH v4 00/14] Commit-graph: Write incremental files
  2019-06-03 16:03   ` [PATCH v3 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                       ` (13 preceding siblings ...)
  2019-06-03 16:04     ` [PATCH v3 14/14] commit-graph: clean up chains after flattened write Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15     ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
                         ` (15 more replies)
  14 siblings, 16 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano

This version is now ready for review.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder, with names of the form
graph-{hash}.graph. The list of hashes is given by the commit-graph-chain
file, and also by a "base graph chunk" in the commit-graph format. As we
read a chain, we can verify that each hash matches the trailing hash of the
commit-graph file we read, and that the hashes of the levels below a file
are the ones that file expects.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

write --stdin-commits:              8m 12s
write --stdin-commits --split:         28s
write --split && verify --shallow:     60s

Updates in V3:

 * git commit-graph verify now works on commit-graph chains. We do a simple
   test to check the behavior of a new --shallow option.

 * When someone writes a flat commit-graph, we now expire the old chain
   according to the expire time.

 * The "max commits" limit is no longer enabled by default, but instead is
   enabled by a --max-commits=<n> option. Ignored if n=0.

Updates in V4:

Johannes pointed out some test failures on the Windows platform. We found
that the tests were not running on Windows in the gitgitgadget PR builds,
which is now resolved.

 * We need to close commit-graphs recursively down the chain. This prevented
   an unlink() from working because of an open handle.

 * Creating the alternates file used a path-specification that didn't work
   on Windows.

 * Renaming a file to the same name failed, but is probably related to the
   unlink() error mentioned above.

This is based on ds/commit-graph-write-refactor.

Thanks,
-Stolee

[1] https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (14):
  commit-graph: document commit-graph chains
  commit-graph: prepare for commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: load commit-graph chains
  commit-graph: add base graphs chunk
  commit-graph: rearrange chunk count logic
  commit-graph: write commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: merge commit-graph chains
  commit-graph: allow cross-alternate chains
  commit-graph: expire commit-graph files
  commit-graph: create options for split files
  commit-graph: verify chains with --shallow mode
  commit-graph: clean up chains after flattened write

 Documentation/git-commit-graph.txt            |  26 +-
 .../technical/commit-graph-format.txt         |  11 +-
 Documentation/technical/commit-graph.txt      | 195 +++++
 builtin/commit-graph.c                        |  53 +-
 builtin/commit.c                              |   2 +-
 builtin/gc.c                                  |   3 +-
 commit-graph.c                                | 794 +++++++++++++++++-
 commit-graph.h                                |  25 +-
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5323-split-commit-graph.sh                 | 240 ++++++
 10 files changed, 1283 insertions(+), 68 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh


base-commit: 8520d7fc7c6edd4d71582c69a873436029b6cb1b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Range-diff vs v3:

  1:  b184919255 =  1:  b184919255 commit-graph: document commit-graph chains
  2:  d0dc154a27 =  2:  d0dc154a27 commit-graph: prepare for commit-graph chains
  3:  f35b04224a =  3:  f35b04224a commit-graph: rename commit_compare to oid_compare
  4:  ca670536df =  4:  ca670536df commit-graph: load commit-graph chains
  5:  df44cbc1bf =  5:  df44cbc1bf commit-graph: add base graphs chunk
  6:  e65f9e841d =  6:  e65f9e841d commit-graph: rearrange chunk count logic
  7:  fe0aa343cd =  7:  fe0aa343cd commit-graph: write commit-graph chains
  8:  4f4ccc8062 !  8:  c42e683ef6 commit-graph: add --split option to builtin
     @@ -12,6 +12,7 @@
          Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
          behavior.
      
     +    Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
     @@ -65,6 +66,32 @@
       	read_replace_refs = 0;
       
      
     + diff --git a/commit-graph.c b/commit-graph.c
     + --- a/commit-graph.c
     + +++ b/commit-graph.c
     +@@
     + 		}
     + 
     + 		if (ctx->base_graph_name) {
     +-			result = rename(ctx->base_graph_name,
     +-					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
     ++			const char *dest = ctx->commit_graph_filenames_after[
     ++						ctx->num_commit_graphs_after - 2];
     + 
     +-			if (result) {
     +-				error(_("failed to rename base commit-graph file"));
     +-				return -1;
     ++			if (strcmp(ctx->base_graph_name, dest)) {
     ++				result = rename(ctx->base_graph_name, dest);
     ++
     ++				if (result) {
     ++					error(_("failed to rename base commit-graph file"));
     ++					return -1;
     ++				}
     + 			}
     + 		} else {
     + 			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
     +
       diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
       new file mode 100755
       --- /dev/null
  9:  87fb895fe4 =  9:  d065758454 commit-graph: merge commit-graph chains
 10:  5cfd653d24 ! 10:  62b3fca582 commit-graph: allow cross-alternate chains
     @@ -195,7 +195,7 @@
      +	(
      +		cd fork &&
      +		rm .git/objects/info/commit-graph &&
     -+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
     ++		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
      +		test_commit new-commit &&
      +		git commit-graph write --reachable --split &&
      +		test_path_is_file $graphdir/commit-graph-chain &&
     @@ -217,7 +217,7 @@
      +		cd fork &&
      +		git config core.commitGraph true &&
      +		rm -rf $graphdir &&
     -+		echo "$TRASH_DIRECTORY/.git/objects" >.git/objects/info/alternates &&
     ++		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
      +		test_commit 13 &&
      +		git branch commits/13 &&
      +		git commit-graph write --reachable --split &&
 11:  18d612be9e ! 11:  b5aeeed909 commit-graph: expire commit-graph files
     @@ -45,6 +45,26 @@
       diff --git a/commit-graph.c b/commit-graph.c
       --- a/commit-graph.c
       +++ b/commit-graph.c
     +@@
     + 	return !!first_generation;
     + }
     + 
     ++static void close_commit_graph_one(struct commit_graph *g)
     ++{
     ++	if (!g)
     ++		return;
     ++
     ++	close_commit_graph_one(g->base_graph);
     ++	free_commit_graph(g);
     ++}
     ++
     + void close_commit_graph(struct repository *r)
     + {
     +-	free_commit_graph(r->objects->commit_graph);
     ++	close_commit_graph_one(r->objects->commit_graph);
     + 	r->objects->commit_graph = NULL;
     + }
     + 
      @@
       	deduplicate_commits(ctx);
       }
     @@ -109,7 +129,6 @@
      +
      +		if (!found)
      +			unlink(path.buf);
     -+
      +	}
      +}
      +
 12:  4de4bfba64 = 12:  ac5586a20f commit-graph: create options for split files
 13:  fe91ff5fca = 13:  548ec69d01 commit-graph: verify chains with --shallow mode
 14:  ca41bf08d0 = 14:  6084bbd164 commit-graph: clean up chains after flattened write

-- 
gitgitgadget


* [PATCH v4 01/14] commit-graph: document commit-graph chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
                         ` (14 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a basic description of commit-graph chains. More details about the
feature will be added as we add functionality. This introduction gives a
high-level overview of the goals of the feature and the basic layout of
commit-graph chains.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..1dca3bd8fe 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,65 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Commit Graphs Chains
+--------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over time,
+the number of commits added by a fetch operation is much smaller than the
+number of commits in the full history. By creating a "chain" of commit-graphs,
+we enable fast writes of new commit data without rewriting the entire commit
+history -- at least, most of the time.
+
+## File Layout
+
+A commit-graph chain uses multiple files, and we use a fixed naming convention
+to organize these files. Each commit-graph file has a name
+`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
+valued hash stored in the footer of that file (which is a hash of the file's
+contents before that hash). For a chain of commit-graph files, a plain-text
+file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
+hashes for the files in order from "lowest" to "highest".
+
+For example, if the `commit-graph-chain` file contains the lines
+
+```
+	{hash0}
+	{hash1}
+	{hash2}
+```
+
+then the commit-graph chain looks like the following diagram:
+
+ +-----------------------+
+ |  graph-{hash2}.graph  |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  graph-{hash1}.graph  |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}.graph  |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
+commits in `graph-{hash1}.graph`, and X2 be the number of commits in
+`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
+then we interpret this as being the commit in position (X0 + X1 + i), and that
+will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
+positions to refer to their parents, which may be in `graph-{hash1}.graph` or
+`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
+its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
+X2).
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH v4 02/14] commit-graph: prepare for commit-graph chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` " Derrick Stolee via GitGitGadget
  2019-06-06 15:19         ` Philip Oakley
  2019-06-06 21:28         ` Junio C Hamano
  2019-06-06 14:15       ` [PATCH v4 03/14] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
                         ` (13 subsequent siblings)
  15 siblings, 2 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To prepare for a chain of commit-graph files, augment the
commit_graph struct to point to a base commit_graph. As we load
commits from the graph, we may actually want to read from a base
file according to the graph position.

The "graph position" of a commit is given by concatenating the
lexicographic commit orders from each of the commit-graph files in
the chain. This means that we must distinguish two values:

 * lexicographic index: the position within the lexicographic
   order in a single commit-graph file.

 * graph position: the position within the concatenated order
   of multiple commit-graph files.

Given the lexicographic index of a commit in a graph, we can
compute the graph position by adding the number of commits in
the lower-level graphs. To recover the lexicographic index from
a graph position, we subtract the number of commits in the
lower-level graphs.
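
As a rough illustration of the two conversions described above, here is a
minimal standalone sketch. It is not the code from this patch: "struct layer"
is a hypothetical, stripped-down stand-in for struct commit_graph, and the
function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct commit_graph. */
struct layer {
	uint32_t num_commits;          /* commits stored in this file */
	uint32_t num_commits_in_base;  /* commits in all lower layers */
	const struct layer *base;      /* next lower layer, or NULL */
};

/*
 * Resolve a graph position to the layer that stores it and write the
 * lexicographic index within that layer to *lex_index.
 * Returns NULL if pos is beyond the tip layer.
 */
static const struct layer *find_layer(const struct layer *tip, uint32_t pos,
				      uint32_t *lex_index)
{
	assert(tip && lex_index);

	if (pos >= tip->num_commits_in_base + tip->num_commits)
		return NULL;

	/* Walk down until pos falls inside the current layer's interval. */
	while (pos < tip->num_commits_in_base)
		tip = tip->base;

	*lex_index = pos - tip->num_commits_in_base;
	return tip;
}

/* Inverse direction: lexicographic index in a layer -> graph position. */
static uint32_t graph_position(const struct layer *g, uint32_t lex_index)
{
	return g->num_commits_in_base + lex_index;
}
```

With layers of 100, 20 and 5 commits, graph position 100 resolves to index 0
of the middle layer, and index 4 of the tip layer maps back to position 124.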

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------
 commit-graph.h |  3 ++
 2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7723156964..3afedcd7f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -371,6 +371,25 @@ static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t
 			    g->chunk_oid_lookup, g->hash_len, pos);
 }
 
+static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
+{
+	uint32_t lex_index;
+
+	if (!g)
+		BUG("NULL commit-graph");
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	lex_index = pos - g->num_commits_in_base;
+
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
+}
+
 static struct commit_list **insert_parent_or_die(struct repository *r,
 						 struct commit_graph *g,
 						 uint64_t pos,
@@ -379,10 +398,10 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 	struct commit *c;
 	struct object_id oid;
 
-	if (pos >= g->num_commits)
+	if (pos >= g->num_commits + g->num_commits_in_base)
 		die("invalid parent position %"PRIu64, pos);
 
-	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	load_oid_from_graph(g, pos, &oid);
 	c = lookup_commit(r, &oid);
 	if (!c)
 		die(_("could not find commit %s"), oid_to_hex(&oid));
@@ -392,7 +411,14 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	lex_index = pos - g->num_commits_in_base;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 	item->graph_pos = pos;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -405,10 +431,26 @@ static int fill_commit_in_graph(struct repository *r,
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
 
-	item->object.parsed = 1;
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
+		    pos, g->num_commits, g->num_commits_in_base);
+
+	/*
+	 * Store the "full" position, but then use the
+	 * "local" position for the rest of the calculation.
+	 */
 	item->graph_pos = pos;
+	lex_index = pos - g->num_commits_in_base;
+
+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+
+	item->object.parsed = 1;
 
 	item->maybe_tree = NULL;
 
@@ -452,7 +494,18 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(g, &(item->object.oid), pos);
+		struct commit_graph *cur_g = g;
+		uint32_t lex_index;
+
+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
+			cur_g = cur_g->base_graph;
+
+		if (cur_g) {
+			*pos = lex_index + cur_g->num_commits_in_base;
+			return 1;
+		}
+
+		return 0;
 	}
 }
 
@@ -492,8 +545,13 @@ static struct tree *load_tree_for_commit(struct repository *r,
 					 struct commit *c)
 {
 	struct object_id oid;
-	const unsigned char *commit_data = g->chunk_commit_data +
-					   GRAPH_DATA_WIDTH * (c->graph_pos);
+	const unsigned char *commit_data;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	commit_data = g->chunk_commit_data +
+			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
 
 	hashcpy(oid.hash, commit_data);
 	c->maybe_tree = lookup_tree(r, &oid);
diff --git a/commit-graph.h b/commit-graph.h
index 70f4caf0c7..f9fe32ebe3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,9 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 
+	uint32_t num_commits_in_base;
+	struct commit_graph *base_graph;
+
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
-- 
gitgitgadget



* [PATCH v4 03/14] commit-graph: rename commit_compare to oid_compare
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 01/14] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 04/14] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
                         ` (12 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The helper function commit_compare() actually compares object_id
structs, not commits. A future change to commit-graph.c will need
to sort commit structs, so rename this function in advance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3afedcd7f5..e2f438f6a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -761,7 +761,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
-static int commit_compare(const void *_a, const void *_b)
+static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
 	const struct object_id *b = (const struct object_id *)_b;
@@ -1030,7 +1030,7 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 			_("Counting distinct commits in commit graph"),
 			ctx->oids.nr);
 	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-- 
gitgitgadget



* [PATCH v4 04/14] commit-graph: load commit-graph chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 03/14] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 22:20         ` Junio C Hamano
  2019-06-06 14:15       ` [PATCH v4 05/14] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
                         ` (11 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Prepare the logic for reading a chain of commit-graphs.

First, look for a file at $OBJDIR/info/commit-graph. If it exists,
then use that file and stop.

Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain.
If this file exists, then load the hash values as line-separated values in that
file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i]
in that file. The file is given in order, so the first hash corresponds to the
"base" file and the final hash corresponds to the "tip" file.

This implementation assumes that all of the graph-{hash}.graph files are in
the same object directory as the commit-graph-chain file. This will be updated
in a future change. This change is purposefully simple so we can isolate the
different concerns.
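
To make the naming convention concrete, the following standalone sketch maps
one line of the commit-graph-chain file to the path of the layer it names. It
is only an illustration: layer_path() and HEXSZ are names made up for this
example, and it assumes 40-character SHA-1 hex hashes, whereas the patch takes
the length from the_hash_algo and parses with get_oid_hex().

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HEXSZ 40 /* assumption: SHA-1 hex length */

/*
 * Build the path of one layer from one line of the chain file
 * (without its trailing newline).
 * Returns 0 on success, -1 if the line is not a hex hash or the
 * resulting path does not fit in out.
 */
static int layer_path(const char *obj_dir, const char *line,
		      char *out, size_t outlen)
{
	size_t i;
	int n;

	assert(obj_dir && line && out);

	if (strlen(line) != HEXSZ)
		return -1;
	for (i = 0; i < HEXSZ; i++)
		if (!isxdigit((unsigned char)line[i]))
			return -1;

	n = snprintf(out, outlen, "%s/info/commit-graphs/graph-%s.graph",
		     obj_dir, line);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A reader would call this once per line, in file order, so that the first
successful path is the "base" file and the last one is the "tip" file.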

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 108 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e2f438f6a3..3ed930159e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,19 @@ char *get_commit_graph_filename(const char *obj_dir)
 	return xstrfmt("%s/info/commit-graph", obj_dir);
 }
 
+static char *get_split_graph_filename(const char *obj_dir,
+				      const char *oid_hex)
+{
+	return xstrfmt("%s/info/commit-graphs/graph-%s.graph",
+		       obj_dir,
+		       oid_hex);
+}
+
+static char *get_chain_filename(const char *obj_dir)
+{
+	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", obj_dir);
+}
+
 static uint8_t oid_version(void)
 {
 	return 1;
@@ -286,18 +299,107 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 	return load_commit_graph_one_fd_st(fd, &st);
 }
 
+static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
+{
+	char *graph_name = get_commit_graph_filename(obj_dir);
+	struct commit_graph *g = load_commit_graph_one(graph_name);
+	free(graph_name);
+
+	return g;
+}
+
+static int add_graph_to_chain(struct commit_graph *g,
+			      struct commit_graph *chain,
+			      struct object_id *oids,
+			      int n)
+{
+	struct commit_graph *cur_g = chain;
+
+	while (n) {
+		n--;
+		cur_g = cur_g->base_graph;
+	}
+
+	g->base_graph = chain;
+
+	if (chain)
+		g->num_commits_in_base = chain->num_commits + chain->num_commits_in_base;
+
+	return 1;
+}
+
+static struct commit_graph *load_commit_graph_chain(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *graph_chain = NULL;
+	struct strbuf line = STRBUF_INIT;
+	struct stat st;
+	struct object_id *oids;
+	int i = 0, valid = 1;
+	char *chain_name = get_chain_filename(obj_dir);
+	FILE *fp;
+
+	if (stat(chain_name, &st)) {
+		free(chain_name);
+		return NULL;
+	}
+
+	if (st.st_size <= the_hash_algo->hexsz) {
+		free(chain_name);
+		return NULL;
+	}
+
+	fp = fopen(chain_name, "r");
+	free(chain_name);
+
+	if (!fp)
+		return NULL;
+
+	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
+
+	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
+		char *graph_name;
+		struct commit_graph *g;
+
+		if (get_oid_hex(line.buf, &oids[i])) {
+			warning(_("invalid commit-graph chain: line '%s' not a hash"),
+				line.buf);
+			valid = 0;
+			break;
+		}
+
+		graph_name = get_split_graph_filename(obj_dir, line.buf);
+		g = load_commit_graph_one(graph_name);
+		free(graph_name);
+
+		if (g && add_graph_to_chain(g, graph_chain, oids, i))
+			graph_chain = g;
+		else
+			valid = 0;
+	}
+
+	free(oids);
+	fclose(fp);
+
+	return graph_chain;
+}
+
+static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
+
+	if (!g)
+		g = load_commit_graph_chain(r, obj_dir);
+
+	return g;
+}
+
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 {
-	char *graph_name;
 
 	if (r->objects->commit_graph)
 		return;
 
-	graph_name = get_commit_graph_filename(obj_dir);
-	r->objects->commit_graph =
-		load_commit_graph_one(graph_name);
-
-	FREE_AND_NULL(graph_name);
+	r->objects->commit_graph = read_commit_graph_one(r, obj_dir);
 }
 
 /*
-- 
gitgitgadget



* [PATCH v4 05/14] commit-graph: add base graphs chunk
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 04/14] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-07 18:15         ` Junio C Hamano
  2019-06-06 14:15       ` [PATCH v4 06/14] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
                         ` (10 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To quickly verify that a commit-graph chain is valid on load, we
read the new "Base Graphs Chunk" of each file in the chain. This
prevents accidentally loading incorrect data after someone manually
edits the commit-graph-chain file or renames graph-{hash}.graph
files.

The commit_graph struct already had an object_id struct "oid", but
it was never initialized or used. Add a line to read the hash from
the end of the commit-graph file and into the oid member.
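
The check can be pictured with the simplified sketch below. It is not the
patch's add_graph_to_chain(): base_chunk_matches() and RAWSZ are hypothetical
names, the hashes are assumed to be 20-byte SHA-1 values, and the sketch only
compares the chain's hashes against the candidate layer's Base Graphs chunk.
The patch additionally compares each hash with the trailer hash of the
already-loaded layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RAWSZ 20 /* assumption: raw SHA-1 length */

/*
 * chain_hashes: n raw hashes of the layers already loaded, base first,
 *               stored back to back (n * RAWSZ bytes).
 * base_chunk:   the candidate layer's Base Graphs chunk (n * RAWSZ bytes),
 *               or NULL if the file has no such chunk.
 * Returns 1 if the candidate may be stacked on top of the chain, else 0.
 */
static int base_chunk_matches(const unsigned char *chain_hashes, size_t n,
			      const unsigned char *base_chunk)
{
	size_t i;

	assert(chain_hashes || !n);

	if (!n)
		return 1; /* the base layer has nothing below it to verify */
	if (!base_chunk)
		return 0; /* a non-base layer must carry the chunk */

	for (i = 0; i < n; i++)
		if (memcmp(chain_hashes + i * RAWSZ,
			   base_chunk + i * RAWSZ, RAWSZ))
			return 0;
	return 1;
}
```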

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .../technical/commit-graph-format.txt         | 11 ++++++++--
 commit-graph.c                                | 22 +++++++++++++++++++
 commit-graph.h                                |  1 +
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 16452a0504..a4f17441ae 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -44,8 +44,9 @@ HEADER:
 
   1-byte number (C) of "chunks"
 
-  1-byte (reserved for later use)
-     Current clients should ignore this value.
+  1-byte number (B) of base commit-graphs
+      We infer the length (H*B) of the Base Graphs chunk
+      from this value.
 
 CHUNK LOOKUP:
 
@@ -92,6 +93,12 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
+      This list of H-byte hashes describe a set of B commit-graph files that
+      form a commit-graph chain. The graph position for the ith commit in this
+      file's OID Lookup chunk is equal to i plus the number of commits in all
+      base graphs.  If B is non-zero, this chunk must exist.
+
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
diff --git a/commit-graph.c b/commit-graph.c
index 3ed930159e..909c841db5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -22,6 +22,7 @@
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -262,6 +263,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			else
 				graph->chunk_extra_edges = data + chunk_offset;
 			break;
+
+		case GRAPH_CHUNKID_BASE:
+			if (graph->chunk_base_graphs)
+				chunk_repeated = 1;
+			else
+				graph->chunk_base_graphs = data + chunk_offset;
 		}
 
 		if (chunk_repeated) {
@@ -280,6 +287,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		last_chunk_offset = chunk_offset;
 	}
 
+	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
+
 	if (verify_commit_graph_lite(graph))
 		return NULL;
 
@@ -315,8 +324,21 @@ static int add_graph_to_chain(struct commit_graph *g,
 {
 	struct commit_graph *cur_g = chain;
 
+	if (n && !g->chunk_base_graphs) {
+		warning(_("commit-graph has no base graphs chunk"));
+		return 0;
+	}
+
 	while (n) {
 		n--;
+
+		if (!oideq(&oids[n], &cur_g->oid) ||
+		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {
+			warning(_("commit-graph chain does not match"));
+			return 0;
+		}
+
+
 		cur_g = cur_g->base_graph;
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index f9fe32ebe3..80f4917ddb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -55,6 +55,7 @@ struct commit_graph {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
+	const unsigned char *chunk_base_graphs;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget



* [PATCH v4 06/14] commit-graph: rearrange chunk count logic
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 05/14] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-07 18:23         ` Junio C Hamano
  2019-06-06 14:15       ` [PATCH v4 07/14] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
                         ` (9 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The number of chunks in a commit-graph file can change depending on
whether we need the Extra Edges Chunk. We are going to add more optional
chunks, and it will be helpful to rearrange this logic around the chunk
count before doing so.

Specifically, we need to finalize the number of chunks before writing
the commit-graph header. Further, we need to fill out the chunk lookup
table dynamically; using "num_chunks" as a running index while we add
optional chunks makes it easier to add more optional chunks in the future.
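
One way to picture the ordering constraint is the standalone sketch below. It
is a variant for illustration, not the code in this patch: plan_chunks() and
struct chunk_table are hypothetical, and the constants (8-byte header, 12-byte
lookup entries, 1024-byte fanout) are assumed to mirror the ones in
commit-graph.c. The chunk list is built first with a running count, and only
then are the offsets computed, since the first offset depends on the final
count.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNKID_OIDFANOUT   0x4f494446u /* "OIDF" */
#define CHUNKID_OIDLOOKUP   0x4f49444cu /* "OIDL" */
#define CHUNKID_DATA        0x43444154u /* "CDAT" */
#define CHUNKID_EXTRAEDGES  0x45444745u /* "EDGE" */

#define MAX_CHUNKS        4
#define HEADER_SIZE       8
#define CHUNKLOOKUP_WIDTH 12
#define FANOUT_SIZE       (4 * 256)

struct chunk_table {
	int num_chunks;
	uint32_t ids[MAX_CHUNKS + 1];     /* terminated by id 0 */
	uint64_t offsets[MAX_CHUNKS + 1]; /* offsets[num_chunks] = end of data */
};

static void plan_chunks(struct chunk_table *t, uint32_t hashsz,
			uint32_t num_commits, uint32_t num_extra_edges)
{
	uint64_t sizes[MAX_CHUNKS];
	int n = 0, i;

	assert(t);

	/* Step 1: decide which chunks exist, counting as we go. */
	t->ids[n] = CHUNKID_OIDFANOUT;
	sizes[n++] = FANOUT_SIZE;
	t->ids[n] = CHUNKID_OIDLOOKUP;
	sizes[n++] = (uint64_t)hashsz * num_commits;
	t->ids[n] = CHUNKID_DATA;
	sizes[n++] = (uint64_t)(hashsz + 16) * num_commits;
	if (num_extra_edges) {
		t->ids[n] = CHUNKID_EXTRAEDGES;
		sizes[n++] = 4 * (uint64_t)num_extra_edges;
	}
	t->ids[n] = 0;
	t->num_chunks = n;

	/* Step 2: the count is final, so the lookup table size is known. */
	t->offsets[0] = HEADER_SIZE + (uint64_t)(n + 1) * CHUNKLOOKUP_WIDTH;
	for (i = 0; i < n; i++)
		t->offsets[i + 1] = t->offsets[i] + sizes[i];
}
```

Adding another optional chunk in this style is one more conditional block in
step 1 (plus a larger MAX_CHUNKS); step 2 does not change.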

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 909c841db5..80df6d6d9d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1206,7 +1206,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[5];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = ctx->num_extra_edges ? 4 : 3;
+	int num_chunks = 3;
 
 	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
@@ -1219,27 +1219,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
-	hashwrite_be32(f, GRAPH_SIGNATURE);
-
-	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0); /* unused padding byte */
-
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (ctx->num_extra_edges)
-		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
-	else
-		chunk_ids[3] = 0;
-	chunk_ids[4] = 0;
+	if (ctx->num_extra_edges) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		num_chunks++;
+	}
+
+	chunk_ids[num_chunks] = 0;
 
 	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
+
+	num_chunks = 3;
+	if (ctx->num_extra_edges) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						4 * ctx->num_extra_edges;
+		num_chunks++;
+	}
+
+	hashwrite_be32(f, GRAPH_SIGNATURE);
+
+	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, 0);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
-- 
gitgitgadget



* [PATCH v4 07/14] commit-graph: write commit-graph chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 06/14] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 08/14] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Extend write_commit_graph() to write a commit-graph chain when given the
COMMIT_GRAPH_SPLIT flag.

This implementation is purposefully simplistic in how it creates a new
chain. The commits not already in the chain are added to a new tip
commit-graph file.

Much of the logic around writing a graph-{hash}.graph file and updating
the commit-graph-chain file is the same as the commit-graph file case.
However, there are several places where we need to do some extra logic
in the split case.

Track the list of graph filenames before and after the planned write.
This will be more important when we start merging graph files, but it
also allows us to upgrade our commit-graph file to the appropriate
graph-{hash}.graph file when we upgrade to a chain of commit-graphs.

Note that we use the eighth byte of the commit-graph header to store the
number of base graph files. This determines the length of the base
graphs chunk.
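
As a rough illustration of that header layout (a sketch only, not the code in
commit-graph.c), the eight header bytes and the resulting size of the base
graphs chunk could be computed as below. The helper names
sketch_write_header() and sketch_base_chunk_size() are hypothetical; the
signature value and the field order follow the hashwrite_*() calls in this
patch, and the format version is assumed to be 1, as in the test's expected
header line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fill the 8-byte commit-graph header.
 * Bytes 0-3: signature "CGPH" (big-endian 0x43475048)
 * Byte 4:    format version (assumed to be 1)
 * Byte 5:    hash version
 * Byte 6:    number of chunks
 * Byte 7:    number of base graph files (0 for a non-split file)
 */
static size_t sketch_write_header(uint8_t out[8], uint8_t hash_version,
				  uint8_t num_chunks, unsigned num_base_graphs)
{
	const uint32_t signature = 0x43475048;

	/* The count is stored in a single byte. */
	assert(num_base_graphs <= 255);

	out[0] = (uint8_t)(signature >> 24);
	out[1] = (uint8_t)(signature >> 16);
	out[2] = (uint8_t)(signature >> 8);
	out[3] = (uint8_t)signature;
	out[4] = 1;
	out[5] = hash_version;
	out[6] = num_chunks;
	out[7] = (uint8_t)num_base_graphs;
	return 8;
}

/* Hypothetical helper: the base graphs chunk holds one raw hash per base. */
static size_t sketch_base_chunk_size(unsigned num_base_graphs, size_t hash_len)
{
	return (size_t)num_base_graphs * hash_len;
}
```

With two base graphs and a 20-byte hash, the last header byte would be 2 and
the base graphs chunk would occupy 40 bytes.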

A subtle change of behavior with the new logic is that we do not write a
commit-graph if our commit list is empty. This extends to the typical
case, which is reflected in t5318-commit-graph.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 286 ++++++++++++++++++++++++++++++++++++++--
 commit-graph.h          |   2 +
 t/t5318-commit-graph.sh |   2 +-
 3 files changed, 278 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 80df6d6d9d..7d3e001479 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -300,12 +300,18 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 
 	struct stat st;
 	int fd;
+	struct commit_graph *g;
 	int open_ok = open_commit_graph(graph_file, &fd, &st);
 
 	if (!open_ok)
 		return NULL;
 
-	return load_commit_graph_one_fd_st(fd, &st);
+	g = load_commit_graph_one_fd_st(fd, &st);
+
+	if (g)
+		g->filename = xstrdup(graph_file);
+
+	return g;
 }
 
 static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
@@ -723,8 +729,19 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+
+	char *base_graph_name;
+	int num_commit_graphs_before;
+	int num_commit_graphs_after;
+	char **commit_graph_filenames_before;
+	char **commit_graph_filenames_after;
+	char **commit_graph_hash_after;
+	uint32_t new_num_commits_in_base;
+	struct commit_graph *new_base_graph;
+
 	unsigned append:1,
-		 report_progress:1;
+		 report_progress:1,
+		 split:1;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -794,6 +811,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.nr,
 					      commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -814,6 +841,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.list,
 					      ctx->commits.nr,
 					      commit_to_sha1);
+
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -871,6 +909,16 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 						  ctx->commits.nr,
 						  commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -962,7 +1010,13 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
-		if (commit && !parse_commit_no_graph(commit))
+		if (!commit)
+			continue;
+		if (ctx->split) {
+			if (!parse_commit(commit) &&
+			    commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+				add_missing_parents(ctx, commit);
+		} else if (!parse_commit_no_graph(commit))
 			add_missing_parents(ctx, commit);
 	}
 	stop_progress(&ctx->progress);
@@ -1158,8 +1212,16 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) {
+			if (ctx->split) {
+				struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+				if (!c || c->graph_pos != COMMIT_NOT_FROM_GRAPH)
+					continue;
+			}
+
 			count_distinct++;
+		}
 	}
 	stop_progress(&ctx->progress);
 
@@ -1182,7 +1244,13 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
+		ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
 		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+		if (ctx->split &&
+		    ctx->commits.list[ctx->commits.nr]->graph_pos != COMMIT_NOT_FROM_GRAPH)
+			continue;
+
 		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
 		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
@@ -1197,18 +1265,86 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static int write_graph_chunk_base_1(struct hashfile *f,
+				    struct commit_graph *g)
+{
+	int num = 0;
+
+	if (!g)
+		return 0;
+
+	num = write_graph_chunk_base_1(f, g->base_graph);
+	hashwrite(f, g->oid.hash, the_hash_algo->rawsz);
+	return num + 1;
+}
+
+static int write_graph_chunk_base(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
+{
+	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
+
+	if (num != ctx->num_commit_graphs_after - 1) {
+		error(_("failed to write correct number of base graph ids"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t i;
+
+	ctx->new_base_graph = g;
+	ctx->base_graph_name = xstrdup(g->filename);
+	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	if (ctx->num_commit_graphs_before)
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
+			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
+
+	i = ctx->num_commit_graphs_before - 1;
+
+	while (g) {
+		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+		i--;
+		g = g->base_graph;
+	}
+}
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[5];
-	uint64_t chunk_offsets[5];
+	uint32_t chunk_ids[6];
+	uint64_t chunk_offsets[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	struct object_id file_hash;
+
+	if (ctx->split) {
+		struct strbuf tmp_file = STRBUF_INIT;
+
+		strbuf_addf(&tmp_file,
+			    "%s/info/commit-graphs/tmp_graph_XXXXXX",
+			    ctx->obj_dir);
+		ctx->graph_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	}
 
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
 		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
@@ -1216,8 +1352,23 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		return errno;
 	}
 
-	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
-	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	if (ctx->split) {
+		char *lock_name = get_chain_filename(ctx->obj_dir);
+
+		hold_lock_file_for_update(&lk, lock_name, LOCK_DIE_ON_ERROR);
+
+		fd = git_mkstemp_mode(ctx->graph_name, 0444);
+		if (fd < 0) {
+			error(_("unable to create '%s'"), ctx->graph_name);
+			return -1;
+		}
+
+		f = hashfd(fd, ctx->graph_name);
+	} else {
+		hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
+		fd = lk.tempfile->fd;
+		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
@@ -1226,6 +1377,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		num_chunks++;
+	}
 
 	chunk_ids[num_chunks] = 0;
 
@@ -1240,13 +1395,18 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						hashsz * (ctx->num_commit_graphs_after - 1);
+		num_chunks++;
+	}
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0);
+	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
@@ -1272,11 +1432,67 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->num_commit_graphs_after > 1 &&
+	    write_graph_chunk_base(f, ctx)) {
+		return -1;
+	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
+	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
+		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
+		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+
+		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2] = new_base_name;
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
+	}
+
 	close_commit_graph(ctx->r);
-	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
+	if (ctx->split) {
+		FILE *chainf = fdopen_lock_file(&lk, "w");
+		char *final_graph_name;
+		int result;
+
+		close(fd);
+
+		if (!chainf) {
+			error(_("unable to open commit-graph chain file"));
+			return -1;
+		}
+
+		if (ctx->base_graph_name) {
+			result = rename(ctx->base_graph_name,
+					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+
+			if (result) {
+				error(_("failed to rename base commit-graph file"));
+				return -1;
+			}
+		} else {
+			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
+			unlink(graph_name);
+		}
+
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1] = xstrdup(oid_to_hex(&file_hash));
+		final_graph_name = get_split_graph_filename(ctx->obj_dir,
+					ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 1] = final_graph_name;
+
+		result = rename(ctx->graph_name, final_graph_name);
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++)
+			fprintf(lk.tempfile->fp, "%s\n", ctx->commit_graph_hash_after[i]);
+
+		if (result) {
+			error(_("failed to rename temporary commit-graph file"));
+			return -1;
+		}
+	}
+
 	commit_lock_file(&lk);
 
 	return 0;
@@ -1299,6 +1515,30 @@ int write_commit_graph(const char *obj_dir,
 	ctx->obj_dir = obj_dir;
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+
+	if (ctx->split) {
+		struct commit_graph *g;
+		prepare_commit_graph(ctx->r);
+
+		g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			ctx->num_commit_graphs_before++;
+			g = g->base_graph;
+		}
+
+		if (ctx->num_commit_graphs_before) {
+			ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
+			i = ctx->num_commit_graphs_before;
+			g = ctx->r->objects->commit_graph;
+
+			while (g) {
+				ctx->commit_graph_filenames_before[--i] = xstrdup(g->filename);
+				g = g->base_graph;
+			}
+		}
+	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
@@ -1353,6 +1593,14 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split)
+		init_commit_graph_chain(ctx);
+	else
+		ctx->num_commit_graphs_after = 1;
+
 	compute_generation_numbers(ctx);
 
 	res = write_commit_graph_file(ctx);
@@ -1361,6 +1609,21 @@ int write_commit_graph(const char *obj_dir,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+
+	if (ctx->commit_graph_filenames_after) {
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
 	free(ctx);
 
 	return res;
@@ -1548,5 +1811,6 @@ void free_commit_graph(struct commit_graph *g)
 		g->data = NULL;
 		close(g->graph_fd);
 	}
+	free(g->filename);
 	free(g);
 }
diff --git a/commit-graph.h b/commit-graph.h
index 80f4917ddb..5c48c4f66a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -47,6 +47,7 @@ struct commit_graph {
 	unsigned char num_chunks;
 	uint32_t num_commits;
 	struct object_id oid;
+	char *filename;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
@@ -71,6 +72,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
+#define COMMIT_GRAPH_SPLIT      (1 << 2)
 
 int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3b6fd0d728..063f906b3e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -20,7 +20,7 @@ test_expect_success 'verify graph with no graph file' '
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
-	test_path_is_file info/commit-graph
+	test_path_is_missing info/commit-graph
 '
 
 test_expect_success 'close with correct error on bad input' '
-- 
gitgitgadget



* [PATCH v4 08/14] commit-graph: add --split option to builtin
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 07/14] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-07 21:57         ` Junio C Hamano
  2019-06-06 14:15       ` [PATCH v4 09/14] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new "--split" option to the 'git commit-graph write' subcommand. This
option enables writing a commit-graph chain instead of a single file.

The current behavior will add a tip commit-graph containing any commits that
are not in the existing commit-graph or commit-graph chain. Later changes
will allow merging the chain and expiring outdated files.

Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
behavior.

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c        |  10 ++-
 commit-graph.c                |  14 ++--
 t/t5323-split-commit-graph.sh | 122 ++++++++++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 8 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 828b1a713f..c2c07d3917 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -35,9 +35,9 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int split;
 } opts;
 
-
 static int graph_verify(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -156,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "split", &opts.split,
+			N_("allow writing an incremental commit-graph file")),
 		OPT_END(),
 	};
 
@@ -169,6 +171,8 @@ static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
+	if (opts.split)
+		flags |= COMMIT_GRAPH_SPLIT;
 
 	read_replace_refs = 0;
 
diff --git a/commit-graph.c b/commit-graph.c
index 7d3e001479..9b4acc0ac9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1465,12 +1465,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		}
 
 		if (ctx->base_graph_name) {
-			result = rename(ctx->base_graph_name,
-					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+			const char *dest = ctx->commit_graph_filenames_after[
+						ctx->num_commit_graphs_after - 2];
 
-			if (result) {
-				error(_("failed to rename base commit-graph file"));
-				return -1;
+			if (strcmp(ctx->base_graph_name, dest)) {
+				result = rename(ctx->base_graph_name, dest);
+
+				if (result) {
+					error(_("failed to rename base commit-graph file"));
+					return -1;
+				}
 			}
 		} else {
 			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
new file mode 100755
index 0000000000..ccd24bd22b
--- /dev/null
+++ b/t/t5323-split-commit-graph.sh
@@ -0,0 +1,122 @@
+#!/bin/sh
+
+test_description='split commit graph'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+graph_read_expect() {
+	NUM_BASE=0
+	if test ! -z $2
+	then
+		NUM_BASE=$2
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 3 $NUM_BASE
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	git commit-graph read >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'create commits and write commit-graph' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect 3
+'
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output
+	git -c core.commitGraph=false $1 >expect
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'graph exists' commits/3 commits/1
+
+verify_chain_files_exist() {
+	for hash in $(cat $1/commit-graph-chain)
+	do
+		test_path_is_file $1/graph-$hash.graph || return 1
+	done
+}
+
+test_expect_success 'add more commits, and write a new base graph' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 12
+'
+
+test_expect_success 'add three more commits, write a tip graph' '
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
+
+test_expect_success 'add one commit, write a tip graph' '
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 3 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
+
+test_done
-- 
gitgitgadget



* [PATCH v4 09/14] commit-graph: merge commit-graph chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 08/14] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 10/14] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
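
To make the O(G log N) cost concrete, here is a simplified sketch of a lookup
across a chain. It is not the lookup code in commit-graph.c: commits are
represented by plain integers instead of object ids, and the names
sketch_layer and sketch_find() are hypothetical. Each layer is binary
searched in turn, from the tip down to the base, and the reported position
is the index within the concatenation of all layers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one graph file: a sorted list of ids. */
struct sketch_layer {
	const uint32_t *ids;	/* sorted ascending */
	size_t nr;
};

/*
 * layers[0] is the base and layers[num_layers - 1] is the tip.
 * Returns 1 and sets *pos to the position in the concatenated order
 * if the id is found, otherwise returns 0.
 */
static int sketch_find(const struct sketch_layer *layers, size_t num_layers,
		       uint32_t id, size_t *pos)
{
	size_t offset = 0, g;

	assert(pos != NULL);

	/* Total number of entries in all layers. */
	for (g = 0; g < num_layers; g++)
		offset += layers[g].nr;

	/* One binary search per layer: G searches of O(log N) each. */
	for (g = num_layers; g > 0; g--) {
		const struct sketch_layer *l = &layers[g - 1];
		size_t lo = 0, hi = l->nr;

		/* offset now counts the entries in all lower layers. */
		offset -= l->nr;

		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;

			if (l->ids[mid] == id) {
				*pos = offset + mid;
				return 1;
			}
			if (l->ids[mid] < id)
				lo = mid + 1;
			else
				hi = mid;
		}
	}
	return 0;
}
```

A miss is the worst case, because every layer must be searched before the
lookup can fail. That is why the number of layers needs to stay small.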

To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:

  1. If the number of commits we are adding is more than half the number
     of commits in the graph below, then merge with that graph.

  2. If we are writing more than 64,000 commits into a single graph,
     then merge with all lower graphs.

The numeric values in the conditions above are currently constant, but
can become config options in a future update.
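
The cascade implied by the two conditions can be sketched as follows. This
is a simplified model, not the function added by this patch: layer sizes are
plain commit counts in an array, and the names sketch_levels_kept(),
SKETCH_MAX_COMMITS and SKETCH_SIZE_MULT are hypothetical. The constants
mirror the values quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_COMMITS 64000
#define SKETCH_SIZE_MULT 2

/*
 * level_sizes[0] is the base and level_sizes[num_levels - 1] is the tip.
 * Returns how many existing levels stay untouched below the new file.
 * If merged_total is non-NULL, it receives the number of commits that
 * the new file would hold before de-duplication.
 */
static size_t sketch_levels_kept(const uint32_t *level_sizes,
				 size_t num_levels, uint32_t new_commits,
				 uint64_t *merged_total)
{
	uint64_t total = new_commits;
	size_t kept = num_levels;

	assert(level_sizes != NULL || num_levels == 0);

	/*
	 * Merge with the level below while the pending set is at least
	 * half its size, or while the pending set exceeds the maximum.
	 */
	while (kept > 0 &&
	       (level_sizes[kept - 1] <= SKETCH_SIZE_MULT * total ||
		total > SKETCH_MAX_COMMITS)) {
		total += level_sizes[kept - 1];
		kept--;
	}

	if (merged_total)
		*merged_total = total;
	return kept;
}
```

For a chain with 100,000, 1,000 and 10 commits (base to tip), adding 3
commits creates a new tip and keeps all three levels. Adding 600 commits
merges the two upper levels into a new 1,610-commit file on top of the base.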

As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.

After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt |  80 ++++++++++
 commit-graph.c                           | 184 +++++++++++++++++++----
 t/t5323-split-commit-graph.sh            |  13 ++
 3 files changed, 244 insertions(+), 33 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 1dca3bd8fe..d9c6253b0a 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -186,6 +186,86 @@ positions to refer to their parents, which may be in `graph-{hash1}.graph` or
 its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
 X2).
 
+Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
+specifying the hashes of all files in the lower layers. In the above example,
+`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
+`{hash0}` and `{hash1}`.
+
+## Merging commit-graph files
+
+If we only added a new commit-graph file on every write, we would run into a
+linear search problem through many commit-graph files.  Instead, we use a merge
+strategy to decide when the stack should collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits is added, the
+merge strategy determines that the files should collapse down to
+`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
+the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
+file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |  graph-{hash2}        |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |  graph-{hash1}        |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                  tmp_graphXXX
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}        |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined, sorted and we write the
+contents to a temporary file, all while holding a `commit-graph-chain.lock`
+lock-file.  When the file is flushed, we rename it to `graph-{hash3}`
+according to the computed `{hash3}`. Finally, we write the new chain data to
+`commit-graph-chain.lock`:
+
+```
+	{hash0}
+	{hash3}
+```
+
+We then close the lock-file.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the commit-graph stack of
+height N, we default to creating a new file at level N + 1. We then decide to
+merge with the Nth level if one of two conditions holds:
+
+  1. The expected file size for level N + 1 is at least half the file size for
+     level N.
+
+  2. Level N + 1 contains more than 64,000 commits.
+
+This decision cascades down the levels: when we merge a level we create a new
+set of commits that then compares to the next level.
+
+The first condition bounds the number of levels to be logarithmic in the total
+number of commits.  The second condition bounds the total number of commits in
+a `graph-{hashN}` file and not in the `commit-graph` file, preventing
+significant performance issues when the stack merges and another process only
+partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple, 64,000 for the maximum
+number of commits) could be extracted into config settings for full
+flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
diff --git a/commit-graph.c b/commit-graph.c
index 9b4acc0ac9..59c1067e70 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1291,36 +1291,6 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
-{
-	struct commit_graph *g = ctx->r->objects->commit_graph;
-	uint32_t i;
-
-	ctx->new_base_graph = g;
-	ctx->base_graph_name = xstrdup(g->filename);
-	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
-
-	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
-
-	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
-	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
-
-	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
-		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
-
-	if (ctx->num_commit_graphs_before)
-		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
-			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
-
-	i = ctx->num_commit_graphs_before - 1;
-
-	while (g) {
-		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
-		i--;
-		g = g->base_graph;
-	}
-}
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
@@ -1502,6 +1472,149 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
+static int split_strategy_max_commits = 64000;
+static float split_strategy_size_mult = 2.0f;
+
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t num_commits = ctx->commits.nr;
+	uint32_t i;
+
+	g = ctx->r->objects->commit_graph;
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
+		     num_commits > split_strategy_max_commits)) {
+		num_commits += g->num_commits;
+		g = g->base_graph;
+
+		ctx->num_commit_graphs_after--;
+	}
+
+	ctx->new_base_graph = g;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_after &&
+		    i < ctx->num_commit_graphs_before; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	i = ctx->num_commit_graphs_before - 1;
+	g = ctx->r->objects->commit_graph;
+
+	while (g) {
+		if (i < ctx->num_commit_graphs_after)
+			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+
+		i--;
+		g = g->base_graph;
+	}
+}
+
+static void merge_commit_graph(struct write_commit_graph_context *ctx,
+			       struct commit_graph *g)
+{
+	uint32_t i;
+	uint32_t offset = g->num_commits_in_base;
+
+	ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+
+		load_oid_from_graph(g, i + offset, &oid);
+
+		/* only add commits if they still exist in the repo */
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ctx->commits.list[ctx->commits.nr] = result;
+			ctx->commits.nr++;
+		}
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit **)_a;
+	const struct commit *b = *(const struct commit **)_b;
+	return oidcmp(&a->object.oid, &b->object.oid);
+}
+
+static void deduplicate_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, num_parents, last_distinct = 0, duplicates = 0;
+	struct commit_list *parent;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("De-duplicating merged commits"),
+					ctx->commits.nr);
+
+	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+
+	ctx->num_extra_edges = 0;
+	for (i = 1; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i);
+
+		if (oideq(&ctx->commits.list[last_distinct]->object.oid,
+			  &ctx->commits.list[i]->object.oid)) {
+			duplicates++;
+		} else {
+			if (duplicates)
+				ctx->commits.list[last_distinct + 1] = ctx->commits.list[i];
+			last_distinct++;
+
+			num_parents = 0;
+			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
+				num_parents++;
+
+			if (num_parents > 2)
+				ctx->num_extra_edges += num_parents - 2;
+		}
+	}
+
+	ctx->commits.nr -= duplicates;
+	stop_progress(&ctx->progress);
+}
+
+static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t current_graph_number = ctx->num_commit_graphs_before;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
+		current_graph_number--;
+
+		if (ctx->report_progress) {
+			strbuf_addstr(&progress_title, _("Merging commit-graph"));
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		}
+
+		merge_commit_graph(ctx, g);
+		stop_progress(&ctx->progress);
+		strbuf_release(&progress_title);
+
+		g = g->base_graph;
+	}
+
+	if (g) {
+		ctx->new_base_graph = g;
+		ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+	}
+
+	if (ctx->new_base_graph)
+		ctx->base_graph_name = xstrdup(ctx->new_base_graph->filename);
+
+	deduplicate_commits(ctx);
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1547,6 +1660,9 @@ int write_commit_graph(const char *obj_dir,
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
+	if (ctx->split && ctx->oids.alloc > split_strategy_max_commits)
+		ctx->oids.alloc = split_strategy_max_commits;
+
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
 		if (ctx->r->objects->commit_graph)
@@ -1600,9 +1716,11 @@ int write_commit_graph(const char *obj_dir,
 	if (!ctx->commits.nr)
 		goto cleanup;
 
-	if (ctx->split)
-		init_commit_graph_chain(ctx);
-	else
+	if (ctx->split) {
+		split_graph_merge_strategy(ctx);
+
+		merge_commit_graphs(ctx);
+	} else
 		ctx->num_commit_graphs_after = 1;
 
 	compute_generation_numbers(ctx);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index ccd24bd22b..5cb5663a30 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -119,4 +119,17 @@ test_expect_success 'add one commit, write a tip graph' '
 
 graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
 
+test_expect_success 'add one commit, write a merged graph' '
+	test_commit 12 &&
+	git branch commits/12 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 2 $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 4 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v4 10/14] commit-graph: allow cross-alternate chains
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 09/14] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 17:00         ` Philip Oakley
  2019-06-06 14:15       ` [PATCH v4 11/14] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an environment like a fork network, it is helpful to have a
commit-graph chain that spans both the base repo and the fork repo. The
fork is usually a small set of data on top of the large repo, but
sometimes the fork is much larger. For example, git-for-windows/git has
almost twice as many commits as git/git because it rebases its
commits on every major version update.

To allow cross-alternate commit-graph chains, we need a few pieces:

1. When looking for a graph-{hash}.graph file, check all alternates.

2. When merging commit-graph chains, do not merge across alternates.

3. When writing a new commit-graph chain based on a commit-graph file
   in another object directory, do not allow success if the base file
   has the name "commit-graph" instead of
   "commit-graphs/graph-{hash}.graph".
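
As a rough illustration of piece 1 (not part of this patch), the lookup
order can be modeled in memory. The `toy_odb` struct and
`find_graph_owner()` below are hypothetical names used only for this
sketch; the real code walks `r->objects->odb` and tries to load each
candidate file from disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy stand-in for an object directory (hypothetical, for illustration):
 * its path plus the hashes of the graph-{hash}.graph files it holds.
 */
struct toy_odb {
	const char *path;
	const char *const *hashes;
	size_t nr;
};

/*
 * Return the path of the first object directory that holds the given
 * hash, or NULL if none does. The caller lists the local object
 * directory first and the alternates after it, so the first match wins.
 */
static const char *find_graph_owner(const struct toy_odb *odbs,
				    size_t nr_odbs, const char *hash)
{
	size_t i, j;

	assert(hash);

	for (i = 0; i < nr_odbs; i++)
		for (j = 0; j < odbs[i].nr; j++)
			if (!strcmp(odbs[i].hashes[j], hash))
				return odbs[i].path;

	return NULL;
}
```

Because the local object directory comes first in the list, a file that
exists in both the fork and the base resolves to the fork's copy.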

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 40 +++++++++++++++++++++
 commit-graph.c                           | 46 ++++++++++++++++++------
 commit-graph.h                           |  1 +
 t/t5323-split-commit-graph.sh            | 37 +++++++++++++++++++
 4 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index d9c6253b0a..473032e476 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -266,6 +266,42 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Chains across multiple object directories
+
+In a repo with alternates, we look for the `commit-graph-chain` file starting
+in the local object directory and then in each alternate. The first file that
+exists defines our chain. As we look for the `graph-{hash}` files for
+each `{hash}` in the chain file, we follow the same pattern for the host
+directories.
+
+This allows commit-graphs to be split across multiple forks in a fork network.
+The typical case is a large "base" repo with many smaller forks.
+
+As the base repo advances, it will likely update and merge its commit-graph
+chain more frequently than the forks. If a fork updates their commit-graph after
+the base repo, then it should "reparent" the commit-graph chain onto the new
+chain in the base repo. When reading each `graph-{hash}` file, we track
+the object directory containing it. During a write of a new commit-graph file,
+we check for any changes in the source object directory and read the
+`commit-graph-chain` file for that source and create a new file based on those
+files. During this "reparent" operation, we necessarily need to collapse all
+levels in the fork, as all of the files are invalid against the new base file.
+
+It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
+files in this scenario. It falls to the user to define the proper settings for
+their custom environment:
+
+ 1. When merging levels in the base repo, the unreferenced files may still be
+    referenced by chains from fork repos.
+
+ 2. The expiry time should be set to a length of time such that every fork has
+    time to recompute their commit-graph chain to "reparent" onto the new base
+    file(s).
+
+ 3. If the commit-graph chain is updated in the base, the fork will not have
+    access to the new chain until its chain is updated to reference those files.
+    (This may change in the future [5].)
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
@@ -292,3 +328,7 @@ Related Links
 
 [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
     A patch to remove the ahead-behind calculation from 'status'.
+
+[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
+    A discussion of a "two-dimensional graph position" that can allow reading
+    multiple commit-graph chains at the same time.
diff --git a/commit-graph.c b/commit-graph.c
index 59c1067e70..39d986bb29 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,9 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r, const cha
 	struct commit_graph *g = load_commit_graph_one(graph_name);
 	free(graph_name);
 
+	if (g)
+		g->obj_dir = obj_dir;
+
 	return g;
 }
 
@@ -385,8 +388,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
 
 	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
-		char *graph_name;
-		struct commit_graph *g;
+		struct object_directory *odb;
 
 		if (get_oid_hex(line.buf, &oids[i])) {
 			warning(_("invalid commit-graph chain: line '%s' not a hash"),
@@ -395,14 +397,23 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 			break;
 		}
 
-		graph_name = get_split_graph_filename(obj_dir, line.buf);
-		g = load_commit_graph_one(graph_name);
-		free(graph_name);
+		for (odb = r->objects->odb; odb; odb = odb->next) {
+			char *graph_name = get_split_graph_filename(odb->path, line.buf);
+			struct commit_graph *g = load_commit_graph_one(graph_name);
 
-		if (g && add_graph_to_chain(g, graph_chain, oids, i))
-			graph_chain = g;
-		else
-			valid = 0;
+			free(graph_name);
+
+			if (g) {
+				g->obj_dir = odb->path;
+
+				if (add_graph_to_chain(g, graph_chain, oids, i))
+					graph_chain = g;
+				else
+					valid = 0;
+
+				break;
+			}
+		}
 	}
 
 	free(oids);
@@ -1411,7 +1422,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
 		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
-		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+		char *new_base_name = get_split_graph_filename(ctx->new_base_graph->obj_dir, new_base_hash);
 
 		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
 		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
@@ -1486,6 +1497,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
 		     num_commits > split_strategy_max_commits)) {
+		if (strcmp(g->obj_dir, ctx->obj_dir))
+			break;
+
 		num_commits += g->num_commits;
 		g = g->base_graph;
 
@@ -1494,6 +1508,18 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 
 	ctx->new_base_graph = g;
 
+	if (ctx->num_commit_graphs_after == 2) {
+		char *old_graph_name = get_commit_graph_filename(g->obj_dir);
+
+		if (!strcmp(g->filename, old_graph_name) &&
+		    strcmp(g->obj_dir, ctx->obj_dir)) {
+			ctx->num_commit_graphs_after = 1;
+			ctx->new_base_graph = NULL;
+		}
+
+		free(old_graph_name);
+	}
+
 	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
 	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
 
diff --git a/commit-graph.h b/commit-graph.h
index 5c48c4f66a..10466bc064 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,7 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 	char *filename;
+	const char *obj_dir;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index 5cb5663a30..cd4d5f05b6 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -90,6 +90,21 @@ test_expect_success 'add more commits, and write a new base graph' '
 	graph_read_expect 12
 '
 
+test_expect_success 'fork and fail to base a chain on a commit-graph file' '
+	test_when_finished rm -rf fork &&
+	git clone . fork &&
+	(
+		cd fork &&
+		rm .git/objects/info/commit-graph &&
+		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
+		test_commit new-commit &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir
+	)
+'
+
 test_expect_success 'add three more commits, write a tip graph' '
 	git reset --hard commits/3 &&
 	git merge merge/1 &&
@@ -132,4 +147,26 @@ test_expect_success 'add one commit, write a merged graph' '
 
 graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
 
+test_expect_success 'create fork and chain across alternate' '
+	git clone . fork &&
+	(
+		cd fork &&
+		git config core.commitGraph true &&
+		rm -rf $graphdir &&
+		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
+		test_commit 13 &&
+		git branch commits/13 &&
+		git commit-graph write --reachable --split &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files &&
+		git -c core.commitGraph=true  rev-list HEAD >expect &&
+		git -c core.commitGraph=false rev-list HEAD >actual &&
+		test_cmp expect actual
+	)
+'
+
+graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
+
 test_done
-- 
gitgitgadget



* [PATCH v4 11/14] commit-graph: expire commit-graph files
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (9 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 10/14] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 12/14] commit-graph: create options for split files Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we merge commit-graph files in a commit-graph chain, we should clean
up the files that are no longer used.

This change introduces an 'expiry_window' value to the context, which is
always zero (for now). We then check the modified time of each
graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and
unlink the files that are older than the expiry_window.

Since this is always zero, this immediately clears all unused graph
files. We will update the value to match a config setting in a future
change.
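
The per-file decision can be sketched as a pure predicate. This is a
simplified model with hypothetical names (`has_graph_suffix()`,
`should_unlink()`); the patch itself makes the same checks on `stat()`
results inside `expire_commit_graphs()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Return 1 if the path ends in ".graph", 0 otherwise. */
static int has_graph_suffix(const char *path)
{
	size_t len = strlen(path);

	return len >= 6 && !strcmp(path + len - 6, ".graph");
}

/*
 * Decide whether a file in the commit-graphs directory may be removed.
 * 'chain' lists the filenames that are still part of the chain after
 * the write. Hypothetical helper, for illustration only.
 */
static int should_unlink(const char *path, time_t mtime, time_t expire_time,
			 const char *const *chain, size_t nr)
{
	size_t i;

	assert(path);

	if (mtime > expire_time)
		return 0;	/* touched after the cutoff: keep for now */
	if (!has_graph_suffix(path))
		return 0;	/* not a graph file, e.g. commit-graph-chain */

	for (i = 0; i < nr; i++)
		if (!strcmp(chain[i], path))
			return 0;	/* still referenced by the chain */

	return 1;
}
```

With the cutoff at the current time, every unreferenced graph file
qualifies as soon as the write finishes.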

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 15 +++++
 commit-graph.c                           | 79 +++++++++++++++++++++++-
 t/t5323-split-commit-graph.sh            |  2 +-
 3 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 473032e476..aed4350a59 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -266,6 +266,21 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Deleting graph-{hash} files
+
+After a new tip file is written, some `graph-{hash}` files may no longer
+be part of a chain. It is important to remove these files from disk, eventually.
+The main reason to delay removal is that another process could read the
+`commit-graph-chain` file before it is rewritten, but then look for the
+`graph-{hash}` files after they are deleted.
+
+To allow holding old split commit-graphs for a while after they are unreferenced,
+we update the modified times of the files when they become unreferenced. Then,
+we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
+files whose modified times are older than a given expiry window. This window
+defaults to zero, but can be changed using command-line arguments or a config
+setting.
+
 ## Chains across multiple object directories
 
 In a repo with alternates, we look for the `commit-graph-chain` file starting
diff --git a/commit-graph.c b/commit-graph.c
index 39d986bb29..6409e5ed8d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -500,9 +500,18 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+static void close_commit_graph_one(struct commit_graph *g)
+{
+	if (!g)
+		return;
+
+	close_commit_graph_one(g->base_graph);
+	free_commit_graph(g);
+}
+
 void close_commit_graph(struct repository *r)
 {
-	free_commit_graph(r->objects->commit_graph);
+	close_commit_graph_one(r->objects->commit_graph);
 	r->objects->commit_graph = NULL;
 }
 
@@ -1641,6 +1650,69 @@ static void merge_commit_graphs(struct write_commit_graph_context *ctx)
 	deduplicate_commits(ctx);
 }
 
+static void mark_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	uint32_t i;
+	time_t now = time(NULL);
+
+	for (i = ctx->num_commit_graphs_after - 1; i < ctx->num_commit_graphs_before; i++) {
+		struct stat st;
+		struct utimbuf updated_time;
+
+		stat(ctx->commit_graph_filenames_before[i], &st);
+
+		updated_time.actime = st.st_atime;
+		updated_time.modtime = now;
+		utime(ctx->commit_graph_filenames_before[i], &updated_time);
+	}
+}
+
+static void expire_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct strbuf path = STRBUF_INIT;
+	DIR *dir;
+	struct dirent *de;
+	size_t dirnamelen;
+	time_t expire_time = time(NULL);
+
+	strbuf_addstr(&path, ctx->obj_dir);
+	strbuf_addstr(&path, "/info/commit-graphs");
+	dir = opendir(path.buf);
+
+	if (!dir) {
+		strbuf_release(&path);
+		return;
+	}
+
+	strbuf_addch(&path, '/');
+	dirnamelen = path.len;
+	while ((de = readdir(dir)) != NULL) {
+		struct stat st;
+		uint32_t i, found = 0;
+
+		strbuf_setlen(&path, dirnamelen);
+		strbuf_addstr(&path, de->d_name);
+
+		stat(path.buf, &st);
+
+		if (st.st_mtime > expire_time)
+			continue;
+		if (path.len < 6 || strcmp(path.buf + path.len - 6, ".graph"))
+			continue;
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			if (!strcmp(ctx->commit_graph_filenames_after[i],
+				    path.buf)) {
+				found = 1;
+				break;
+			}
+		}
+
+		if (!found)
+			unlink(path.buf);
+	}
+}
+
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
@@ -1753,6 +1825,11 @@ int write_commit_graph(const char *obj_dir,
 
 	res = write_commit_graph_file(ctx);
 
+	if (ctx->split) {
+		mark_commit_graphs(ctx);
+		expire_commit_graphs(ctx);
+	}
+
 cleanup:
 	free(ctx->graph_name);
 	free(ctx->commits.list);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index cd4d5f05b6..c6bb685eb9 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -141,7 +141,7 @@ test_expect_success 'add one commit, write a merged graph' '
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 2 $graphdir/commit-graph-chain &&
 	ls $graphdir/graph-*.graph >graph-files &&
-	test_line_count = 4 graph-files &&
+	test_line_count = 2 graph-files &&
 	verify_chain_files_exist $graphdir
 '
 
-- 
gitgitgadget



* [PATCH v4 12/14] commit-graph: create options for split files
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (10 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 11/14] commit-graph: expire commit-graph files Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 18:41         ` Ramsay Jones
  2019-06-06 14:15       ` [PATCH v4 13/14] commit-graph: verify chains with --shallow mode Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  15 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The split commit-graph feature is now fully implemented, but needs
some more run-time configurability. Allow direct callers to 'git
commit-graph write --split' to specify the values used in the
merge strategy and the expire time.

Update the documentation to specify these values.
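
For reviewers who want to reason about the two merge values, here is a
minimal standalone sketch of the cascade. `levels_after_write()` is a
hypothetical helper, not code from this series, and it leaves out the
cross-alternate check that the real loop performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the commit counts of the existing levels (levels[0] is the base,
 * levels[n - 1] is the tip) and the number of new commits, return how
 * many files the chain would contain after the write. A max_commits of
 * zero disables the second condition. Illustration only.
 */
static size_t levels_after_write(const uint32_t *levels, size_t n,
				 uint32_t new_commits,
				 int size_multiple, int max_commits)
{
	size_t after = n + 1;		/* default: add a new tip on top */
	uint64_t pending = new_commits;	/* commits the new tip would hold */

	assert(levels || !n);
	assert(size_multiple >= 0 && max_commits >= 0);

	while (n > 0 &&
	       (levels[n - 1] <= (uint64_t)size_multiple * pending ||
		(max_commits && pending > (uint64_t)max_commits))) {
		pending += levels[n - 1];	/* absorb the current tip */
		n--;
		after--;
	}

	return after;
}
```

For example, with levels of 1000 and 10 commits, writing 5 new commits
with a size multiple of 2 merges the tip (10 <= 2 * 5) but not the base
(1000 > 2 * 15), leaving two files. In this model, once the pending
commit count exceeds a non-zero max-commits value it stays above it, so
that condition cascades all the way down to the base.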

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt       | 21 +++++++++++++-
 Documentation/technical/commit-graph.txt |  7 +++--
 builtin/commit-graph.c                   | 20 +++++++++++---
 builtin/commit.c                         |  2 +-
 builtin/gc.c                             |  3 +-
 commit-graph.c                           | 35 ++++++++++++++++--------
 commit-graph.h                           | 12 ++++++--
 t/t5323-split-commit-graph.sh            | 35 ++++++++++++++++++++++++
 8 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 624470e198..365e145e82 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -26,7 +26,7 @@ OPTIONS
 	Use given directory for the location of packfiles and commit-graph
 	file. This parameter exists to specify the location of an alternate
 	that only has the objects directory, not a full `.git` directory. The
-	commit-graph file is expected to be at `<dir>/info/commit-graph` and
+	commit-graph file is expected to be in the `<dir>/info` directory and
 	the packfiles are expected to be in `<dir>/pack`.
 
 
@@ -51,6 +51,25 @@ or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
++
+With the `--split` option, write the commit-graph as a chain of multiple
+commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
+not already in the commit-graph are added in a new "tip" file. This file
+is merged with the existing file if the following merge conditions are
+met:
++
+* If `--size-multiple=<X>` is not specified, let `X` equal 2. If the new
+tip file would have `N` commits and the previous tip has `M` commits and
+`X` times `N` is greater than `M`, instead merge the two files into a
+single file.
++
+* If `--max-commits=<M>` is specified with `M` a positive integer, and the
+new tip file would have more than `M` commits, then instead merge the new
+tip with the previous tip.
++
+Finally, if `--expire-time=<datetime>` is not specified, let `datetime`
+be the current time. After writing the split commit-graph, delete all
+unused commit-graph files whose modified times are older than `datetime`.
 
 'read'::
 
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index aed4350a59..729fbcb32f 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -248,10 +248,11 @@ When writing a set of commits that do not exist in the commit-graph stack of
 height N, we default to creating a new file at level N + 1. We then decide to
 merge with the Nth level if one of two conditions hold:
 
-  1. The expected file size for level N + 1 is at least half the file size for
-     level N.
+  1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
+     level N is less than X times the number of commits in level N + 1.
 
-  2. Level N + 1 contains more than 64,0000 commits.
+  2. `--max-commits=<C>` is specified with non-zero C and the number of commits
+     in level N + 1 is more than C commits.
 
 This decision cascades down the levels: when we merge a level we create a new
 set of commits that then compares to the next level.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index c2c07d3917..18e3b61fb6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
@@ -135,6 +135,7 @@ static int graph_read(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
+struct split_commit_graph_opts split_opts;
 
 static int graph_write(int argc, const char **argv)
 {
@@ -158,9 +159,19 @@ static int graph_write(int argc, const char **argv)
 			N_("include all commits already in the commit-graph file")),
 		OPT_BOOL(0, "split", &opts.split,
 			N_("allow writing an incremental commit-graph file")),
+		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+			N_("maximum number of commits in a non-base split commit-graph")),
+		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+			N_("maximum ratio between two levels of a split commit-graph")),
+		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
+	split_opts.size_multiple = 2;
+	split_opts.max_commits = 0;
+	split_opts.expire_time = 0;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
@@ -177,7 +188,7 @@ static int graph_write(int argc, const char **argv)
 	read_replace_refs = 0;
 
 	if (opts.reachable)
-		return write_commit_graph_reachable(opts.obj_dir, flags);
+		return write_commit_graph_reachable(opts.obj_dir, flags, &split_opts);
 
 	string_list_init(&lines, 0);
 	if (opts.stdin_packs || opts.stdin_commits) {
@@ -197,7 +208,8 @@ static int graph_write(int argc, const char **argv)
 	result = write_commit_graph(opts.obj_dir,
 				    pack_indexes,
 				    commit_hex,
-				    flags);
+				    flags,
+				    &split_opts);
 
 	UNLEAK(lines);
 	return result;
diff --git a/builtin/commit.c b/builtin/commit.c
index b001ef565d..9216e9c043 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1670,7 +1670,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "not exceeded, and then \"git reset HEAD\" to recover."));
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(get_object_directory(), 0))
+	    write_commit_graph_reachable(get_object_directory(), 0, NULL))
 		return 1;
 
 	repo_rerere(the_repository, 0);
diff --git a/builtin/gc.c b/builtin/gc.c
index df2573f124..2ab590ffd4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -666,7 +666,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	if (gc_write_commit_graph &&
 	    write_commit_graph_reachable(get_object_directory(),
-					 !quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0))
+					 !quiet && !daemonized ? COMMIT_GRAPH_PROGRESS : 0,
+					 NULL))
 		return 1;
 
 	if (auto_gc && too_many_loose_objects())
diff --git a/commit-graph.c b/commit-graph.c
index 6409e5ed8d..abbfc12f1f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -762,6 +762,8 @@ struct write_commit_graph_context {
 	unsigned append:1,
 		 report_progress:1,
 		 split:1;
+
+	const struct split_commit_graph_opts *split_opts;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1110,14 +1112,15 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-int write_commit_graph_reachable(const char *obj_dir, unsigned int flags)
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags,
+				 const struct split_commit_graph_opts *split_opts)
 {
 	struct string_list list = STRING_LIST_INIT_DUP;
 	int result;
 
 	for_each_ref(add_ref_to_list, &list);
 	result = write_commit_graph(obj_dir, NULL, &list,
-				    flags);
+				    flags, split_opts);
 
 	string_list_clear(&list, 0);
 	return result;
@@ -1492,20 +1495,25 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
-static int split_strategy_max_commits = 64000;
-static float split_strategy_size_mult = 2.0f;
-
 static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 {
 	struct commit_graph *g = ctx->r->objects->commit_graph;
 	uint32_t num_commits = ctx->commits.nr;
 	uint32_t i;
 
+	int max_commits = 0;
+	int size_mult = 2;
+
+	if (ctx->split_opts) {
+		max_commits = ctx->split_opts->max_commits;
+		size_mult = ctx->split_opts->size_multiple;
+	}
+
 	g = ctx->r->objects->commit_graph;
 	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
 
-	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
-		     num_commits > split_strategy_max_commits)) {
+	while (g && (g->num_commits <= size_mult * num_commits ||
+		    (max_commits && num_commits > max_commits))) {
 		if (strcmp(g->obj_dir, ctx->obj_dir))
 			break;
 
@@ -1673,7 +1681,10 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	DIR *dir;
 	struct dirent *de;
 	size_t dirnamelen;
-	time_t expire_time = time(NULL);
+	timestamp_t expire_time = time(NULL);
+
+	if (ctx->split_opts && ctx->split_opts->expire_time)
+		expire_time = ctx->split_opts->expire_time;
 
 	strbuf_addstr(&path, ctx->obj_dir);
 	strbuf_addstr(&path, "/info/commit-graphs");
@@ -1716,7 +1727,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       unsigned int flags)
+		       unsigned int flags,
+		       const struct split_commit_graph_opts *split_opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -1731,6 +1743,7 @@ int write_commit_graph(const char *obj_dir,
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+	ctx->split_opts = split_opts;
 
 	if (ctx->split) {
 		struct commit_graph *g;
@@ -1758,8 +1771,8 @@ int write_commit_graph(const char *obj_dir,
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && ctx->oids.alloc > split_strategy_max_commits)
-		ctx->oids.alloc = split_strategy_max_commits;
+	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
+		ctx->oids.alloc = split_opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->obj_dir);
diff --git a/commit-graph.h b/commit-graph.h
index 10466bc064..194acab2b7 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,19 @@ int generation_numbers_enabled(struct repository *r);
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
 #define COMMIT_GRAPH_SPLIT      (1 << 2)
 
-int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
+struct split_commit_graph_opts {
+	int size_multiple;
+	int max_commits;
+	timestamp_t expire_time;
+};
+
+int write_commit_graph_reachable(const char *obj_dir, unsigned int flags,
+				 const struct split_commit_graph_opts *split_opts);
 int write_commit_graph(const char *obj_dir,
 		       struct string_list *pack_indexes,
 		       struct string_list *commit_hex,
-		       unsigned int flags);
+		       unsigned int flags,
+		       const struct split_commit_graph_opts *split_opts);
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);
 
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index c6bb685eb9..a915ef388c 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -169,4 +169,39 @@ test_expect_success 'create fork and chain across alternate' '
 
 graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
 
+test_expect_success 'test merge stragety constants' '
+	git clone . merge-2 &&
+	(
+		cd merge-2 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --size-multiple=2 &&
+		test_line_count = 3 $graphdir/commit-graph-chain
+
+	) &&
+	git clone . merge-10 &&
+	(
+		cd merge-10 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --size-multiple=10 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files
+	) &&
+	git clone . merge-10-expire &&
+	(
+		cd merge-10-expire &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 15 &&
+		git commit-graph write --reachable --split --size-multiple=10 --expire-time=1980-01-01 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 3 graph-files
+	)
+'
+
 test_done
-- 
gitgitgadget

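Reader's note on the merge loop in write_commit_graph's split path,
as changed in the hunk above: a minimal standalone sketch, assuming the
chain is modelled as an array of per-layer commit counts ordered from base
to tip. The names layers_after_write, layer_sizes and new_commits are
hypothetical and not part of Git. The accumulation step follows the
documented behaviour that the decision "cascades down the levels", since
the loop body is not shown in full here, and the obj_dir check that stops
merging across alternates is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model, not Git's API: layer_sizes[0] is the base layer and
 * layer_sizes[n_layers - 1] is the current tip. Returns how many layers
 * the chain would have after writing new_commits commits.
 */
size_t layers_after_write(const uint32_t *layer_sizes, size_t n_layers,
			  uint32_t new_commits, int size_mult, int max_commits)
{
	uint64_t num_commits = new_commits;
	size_t remaining = n_layers;

	assert(size_mult > 0);

	/* Same two conditions as the while loop in the patch. */
	while (remaining > 0 &&
	       (layer_sizes[remaining - 1] <= (uint64_t)size_mult * num_commits ||
		(max_commits > 0 && num_commits > (uint64_t)max_commits))) {
		/* Merge the tip into the new layer, then look at the next one. */
		num_commits += layer_sizes[remaining - 1];
		remaining--;
	}

	/* Surviving layers plus the newly written one. */
	return remaining + 1;
}
```

With layers of 100 and 10 commits and one new commit, a size multiple of 2
leaves three layers, while a size multiple of 10 collapses everything into
one, which is the shape the merge-2 and merge-10 tests above check for.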


* [PATCH v4 13/14] commit-graph: verify chains with --shallow mode
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (11 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 12/14] commit-graph: create options for split files Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 14:15       ` [PATCH v4 14/14] commit-graph: clean up chains after flattened write Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we wrote a commit-graph chain, we only modified the tip file in
the chain. It is valuable to verify what we wrote, but not waste
time checking files we did not write.

Add a '--shallow' option to the 'git commit-graph verify' subcommand
and check that it does not read the base graph in a two-file chain.

Making the verify subcommand read from a chain of commit-graphs takes
some rearranging of the builtin code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
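Note for readers (not part of the commit message): the recursion added to
verify_commit_graph() below can be sketched in isolation as follows. This
is a simplified model under stated assumptions, not the code in the patch:
struct layer, verify_chain and VERIFY_SHALLOW are hypothetical stand-ins,
and a single "corrupt" flag replaces all of the real per-file checks.

```c
#include <assert.h>
#include <stddef.h>

#define VERIFY_SHALLOW (1 << 0) /* stand-in for COMMIT_GRAPH_VERIFY_SHALLOW */

/* Hypothetical layer: "corrupt" stands in for any failed check on one file. */
struct layer {
	int corrupt;
	struct layer *base;
};

/* Returns nonzero if any layer that was actually checked is corrupt. */
int verify_chain(const struct layer *tip, int flags)
{
	int local_error;

	assert(!(flags & ~VERIFY_SHALLOW)); /* only one flag is modelled */
	if (!tip)
		return 1; /* nothing loaded is treated as a failure here */

	local_error = tip->corrupt;

	/* Recurse into the base only when a full verification was requested. */
	if (!(flags & VERIFY_SHALLOW) && tip->base)
		local_error |= verify_chain(tip->base, flags);

	return local_error;
}
```

With a corrupt base under a clean tip, the shallow call reports success and
the full call reports failure, which is what the new 'verify shallow' test
expects from the real command.
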
 Documentation/git-commit-graph.txt |  5 ++++-
 builtin/commit-graph.c             | 27 +++++++++++++++++++--------
 commit-graph.c                     | 15 ++++++++++++---
 commit-graph.h                     |  6 ++++--
 t/t5323-split-commit-graph.sh      | 21 +++++++++++++++++++++
 5 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 365e145e82..eb5e7865f0 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph read' [--object-dir <dir>]
-'git commit-graph verify' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>] [--shallow]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -80,6 +80,9 @@ Used for debugging purposes.
 
 Read the commit-graph file and verify its contents against the object
 database. Used to check for corrupted data.
++
+With the `--shallow` option, only check the tip commit-graph file in
+a chain of split commit-graphs.
 
 
 EXAMPLES
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 18e3b61fb6..7cde1e1aaa 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -5,17 +5,18 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit-graph.h"
+#include "object-store.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
-	N_("git commit-graph verify [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>] [--shallow]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
 	NULL
 };
 
 static const char * const builtin_commit_graph_verify_usage[] = {
-	N_("git commit-graph verify [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>] [--shallow]"),
 	NULL
 };
 
@@ -36,6 +37,7 @@ static struct opts_commit_graph {
 	int stdin_commits;
 	int append;
 	int split;
+	int shallow;
 } opts;
 
 static int graph_verify(int argc, const char **argv)
@@ -45,11 +47,14 @@ static int graph_verify(int argc, const char **argv)
 	int open_ok;
 	int fd;
 	struct stat st;
+	int flags = 0;
 
 	static struct option builtin_commit_graph_verify_options[] = {
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			   N_("dir"),
 			   N_("The object directory to store the graph")),
+		OPT_BOOL(0, "shallow", &opts.shallow,
+			 N_("if the commit-graph is split, only verify the tip file")),
 		OPT_END(),
 	};
 
@@ -59,21 +64,27 @@ static int graph_verify(int argc, const char **argv)
 
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
+	if (opts.shallow)
+		flags |= COMMIT_GRAPH_VERIFY_SHALLOW;
 
 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	open_ok = open_commit_graph(graph_name, &fd, &st);
-	if (!open_ok && errno == ENOENT)
-		return 0;
-	if (!open_ok)
+	if (!open_ok && errno != ENOENT)
 		die_errno(_("Could not open commit-graph '%s'"), graph_name);
-	graph = load_commit_graph_one_fd_st(fd, &st);
+
 	FREE_AND_NULL(graph_name);
 
+	if (open_ok)
+		graph = load_commit_graph_one_fd_st(fd, &st);
+	else
+		graph = read_commit_graph_one(the_repository, opts.obj_dir);
+
+	/* Return failure if open_ok predicted success */
 	if (!graph)
-		return 1;
+		return !!open_ok;
 
 	UNLEAK(graph);
-	return verify_commit_graph(the_repository, graph);
+	return verify_commit_graph(the_repository, graph, flags);
 }
 
 static int graph_read(int argc, const char **argv)
diff --git a/commit-graph.c b/commit-graph.c
index abbfc12f1f..07856959c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -422,7 +422,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r, const
 	return graph_chain;
 }
 
-static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
+struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
 {
 	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
 
@@ -1884,7 +1884,7 @@ static void graph_report(const char *fmt, ...)
 #define GENERATION_ZERO_EXISTS 1
 #define GENERATION_NUMBER_EXISTS 2
 
-int verify_commit_graph(struct repository *r, struct commit_graph *g)
+int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 {
 	uint32_t i, cur_fanout_pos = 0;
 	struct object_id prev_oid, cur_oid, checksum;
@@ -1892,6 +1892,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 	struct hashfile *f;
 	int devnull;
 	struct progress *progress = NULL;
+	int local_error = 0;
 
 	if (!g) {
 		graph_report("no commit-graph file loaded");
@@ -1986,6 +1987,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 				break;
 			}
 
+			/* parse parent in case it is in a base graph */
+			parse_commit_in_graph_one(r, g, graph_parents->item);
+
 			if (!oideq(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
 				graph_report(_("commit-graph parent for %s is %s != %s"),
 					     oid_to_hex(&cur_oid),
@@ -2037,7 +2041,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 	}
 	stop_progress(&progress);
 
-	return verify_commit_graph_error;
+	local_error = verify_commit_graph_error;
+
+	if (!(flags & COMMIT_GRAPH_VERIFY_SHALLOW) && g->base_graph)
+		local_error |= verify_commit_graph(r, g->base_graph, flags);
+
+	return local_error;
 }
 
 void free_commit_graph(struct commit_graph *g)
diff --git a/commit-graph.h b/commit-graph.h
index 194acab2b7..84e5e91fc6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -61,7 +61,7 @@ struct commit_graph {
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-
+struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir);
 struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 					size_t graph_size);
 
@@ -89,7 +89,9 @@ int write_commit_graph(const char *obj_dir,
 		       unsigned int flags,
 		       const struct split_commit_graph_opts *split_opts);
 
-int verify_commit_graph(struct repository *r, struct commit_graph *g);
+#define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
+
+int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
 
 void close_commit_graph(struct repository *);
 void free_commit_graph(struct commit_graph *);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index a915ef388c..b2bc07d72c 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -204,4 +204,25 @@ test_expect_success 'test merge stragety constants' '
 	)
 '
 
+corrupt_file() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'verify shallow' '
+	git clone . verify &&
+	(
+		cd verify &&
+		git commit-graph verify &&
+		base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
+		corrupt_file "$base_file" 1760 "\01" &&
+		git commit-graph verify --shallow &&
+		test_must_fail git commit-graph verify 2>test_err &&
+		grep -v "^+" test_err >err &&
+		test_i18ngrep "incorrect checksum" err
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 14/14] commit-graph: clean up chains after flattened write
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (12 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 13/14] commit-graph: verify chains with --shallow mode Derrick Stolee via GitGitGadget
@ 2019-06-06 14:15       ` Derrick Stolee via GitGitGadget
  2019-06-06 16:57       ` [PATCH v4 00/14] Commit-graph: Write incremental files Junio C Hamano
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
  15 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-06 14:15 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we write a commit-graph file without the split option, then
we write to $OBJDIR/info/commit-graph and start to ignore
the chains in $OBJDIR/info/commit-graphs/.

Unlink the commit-graph-chain file and expire the graph-{hash}.graph
files in $OBJDIR/info/commit-graphs/ during every write.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c                | 12 +++++++++---
 t/t5323-split-commit-graph.sh | 12 ++++++++++++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 07856959c1..affa969e79 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1685,6 +1685,12 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 
 	if (ctx->split_opts && ctx->split_opts->expire_time)
 		expire_time -= ctx->split_opts->expire_time;
+	if (!ctx->split) {
+		char *chain_file_name = get_chain_filename(ctx->obj_dir);
+		unlink(chain_file_name);
+		free(chain_file_name);
+		ctx->num_commit_graphs_after = 0;
+	}
 
 	strbuf_addstr(&path, ctx->obj_dir);
 	strbuf_addstr(&path, "/info/commit-graphs");
@@ -1838,10 +1844,10 @@ int write_commit_graph(const char *obj_dir,
 
 	res = write_commit_graph_file(ctx);
 
-	if (ctx->split) {
+	if (ctx->split)
 		mark_commit_graphs(ctx);
-		expire_commit_graphs(ctx);
-	}
+
+	expire_commit_graphs(ctx);
 
 cleanup:
 	free(ctx->graph_name);
diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
index b2bc07d72c..bd2e90e512 100755
--- a/t/t5323-split-commit-graph.sh
+++ b/t/t5323-split-commit-graph.sh
@@ -204,6 +204,18 @@ test_expect_success 'test merge stragety constants' '
 	)
 '
 
+test_expect_success 'remove commit-graph-chain file after flattening' '
+	git clone . flatten &&
+	(
+		cd flatten &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		git commit-graph write --reachable &&
+		test_path_is_missing $graphdir/commit-graph-chain &&
+		ls $graphdir >graph-files &&
+		test_line_count = 0 graph-files
+	)
+'
+
 corrupt_file() {
 	file=$1
 	pos=$2
-- 
gitgitgadget


* Re: [PATCH v4 02/14] commit-graph: prepare for commit-graph chains
  2019-06-06 14:15       ` [PATCH v4 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
@ 2019-06-06 15:19         ` Philip Oakley
  2019-06-06 21:28         ` Junio C Hamano
  1 sibling, 0 replies; 136+ messages in thread
From: Philip Oakley @ 2019-06-06 15:19 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

Spelling nit in the commit message:
On 06/06/2019 15:15, Derrick Stolee via GitGitGadget wrote:
>   * graph position: the posiiton within the concatenated order
s/posiiton/position/
--
Philip


* Re: [PATCH v4 00/14] Commit-graph: Write incremental files
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (13 preceding siblings ...)
  2019-06-06 14:15       ` [PATCH v4 14/14] commit-graph: clean up chains after flattened write Derrick Stolee via GitGitGadget
@ 2019-06-06 16:57       ` Junio C Hamano
  2019-06-07 12:37         ` Derrick Stolee
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
  15 siblings, 1 reply; 136+ messages in thread
From: Junio C Hamano @ 2019-06-06 16:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is based on ds/commit-graph-write-refactor.
>
> Thanks, -Stolee
>
> [1] 
> https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
> commit-graph write: don't die if the existing graph is corrupt
>
> Derrick Stolee (14):
>   commit-graph: document commit-graph chains
>   commit-graph: prepare for commit-graph chains
>   commit-graph: rename commit_compare to oid_compare
>   commit-graph: load commit-graph chains
>   commit-graph: add base graphs chunk
>   commit-graph: rearrange chunk count logic
>   commit-graph: write commit-graph chains
>   commit-graph: add --split option to builtin
>   commit-graph: merge commit-graph chains
>   commit-graph: allow cross-alternate chains
>   commit-graph: expire commit-graph files
>   commit-graph: create options for split files
>   commit-graph: verify chains with --shallow mode
>   commit-graph: clean up chains after flattened write
>
>  Documentation/git-commit-graph.txt            |  26 +-
>  .../technical/commit-graph-format.txt         |  11 +-
>  Documentation/technical/commit-graph.txt      | 195 +++++
>  builtin/commit-graph.c                        |  53 +-
>  builtin/commit.c                              |   2 +-
>  builtin/gc.c                                  |   3 +-
>  commit-graph.c                                | 794 +++++++++++++++++-
>  commit-graph.h                                |  25 +-
>  t/t5318-commit-graph.sh                       |   2 +-
>  t/t5323-split-commit-graph.sh                 | 240 ++++++

This breaks test-lint, as t5323 is already taken in 'pu' by another
topic.  I tentatively moved it to 5234 for now.

Thanks.


* Re: [PATCH v4 10/14] commit-graph: allow cross-alternate chains
  2019-06-06 14:15       ` [PATCH v4 10/14] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
@ 2019-06-06 17:00         ` Philip Oakley
  0 siblings, 0 replies; 136+ messages in thread
From: Philip Oakley @ 2019-06-06 17:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee

Is this a spelling nit?
On 06/06/2019 15:15, Derrick Stolee via GitGitGadget wrote:
> 3. When writing a new commit-graph chain based on a commit-graph file
>     in another object directory, do not allow success if the base file
>     has of the name "commit-graph" instead of
>     "commit-graphs/graoh-{hash}.graph".
s/graoh-/graph-/    ?
--
Philip


* Re: [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-06 12:10       ` Philip Oakley
@ 2019-06-06 17:09         ` Derrick Stolee
  2019-06-06 21:59           ` Philip Oakley
  0 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee @ 2019-06-06 17:09 UTC (permalink / raw)
  To: Philip Oakley, Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

On 6/6/2019 8:10 AM, Philip Oakley wrote:
> Hi Derrick ,
> 
> On 03/06/2019 17:03, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add a basic description of commit-graph chains.
> Not really your problem, but I did notice that we don't actually explain what we mean here by a commit graph (before we start chaining them), and the distinction between the generic concept and the specific implementation.
> 
> If I understand it correctly, the regular DAG (directed acyclic graph) already inherently contains the commit graph, showing the parent(s) of each commit. Hence, why do we need another? (which then needs explaining the what/why/how)
> 
> So, in one sense, another commit chain is potentially duplicated redundant data. What hasn't been surfaced (for the reader coming later) is probably that accessing the DAG commit graph can be (a) slow, (b) one way (no child relationships), and (c) accesses large amounts of other data that isn't relevant to the task at hand.
> 
> So the commit graph (implementation) is [I think] a fast, compact, sorted(?), list of commit oids that provides two way linkage through the commit graph (?) to allow fast queries within the Git codebase.
> 
> The commit graph is normally considered immutable,

_Commits_ are immutable. The graph grows as commits are added.

This may be the crux of your confusion, since the commit-graph
file can become stale as commits are added by 'git commit' or
'git fetch'. The point of the incremental file format is to
update the commit-graph data without rewriting the entire thing
every time.

Does this help clarify what's going on?

> however the DAG commit graph can be extended by new commits, trimmed by branch deletion, rebasing, forced push, etc, or even reorganised via 'replace' or grafts commits, which must then be reflected in the commit graph (implementation).

These things create new commit objects, which would not be in
the commit-graph file until it is rewritten.

> It just felt that there is a gap between the high level DAG, explained in the glossary, and the commit-graph That perhaps the technical/commit-graph.txt ought to summarise.

I do think that technical/commit-graph.txt does summarize a lot
about the commit-graph _file_ and how that accelerates walks on
the high-level DAG. The added content in this patch does assume
a full understanding of the previous contents of that file.

Thanks,
-Stolee



* Re: [PATCH v4 12/14] commit-graph: create options for split files
  2019-06-06 14:15       ` [PATCH v4 12/14] commit-graph: create options for split files Derrick Stolee via GitGitGadget
@ 2019-06-06 18:41         ` Ramsay Jones
  0 siblings, 0 replies; 136+ messages in thread
From: Ramsay Jones @ 2019-06-06 18:41 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Junio C Hamano, Derrick Stolee



On 06/06/2019 15:15, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The split commit-graph feature is now fully implemented, but needs
> some more run-time configurability. Allow direct callers to 'git
> commit-graph write --split' to specify the values used in the
> merge strategy and the expire time.
> 
> Update the documentation to specify these values.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt       | 21 +++++++++++++-
>  Documentation/technical/commit-graph.txt |  7 +++--
>  builtin/commit-graph.c                   | 20 +++++++++++---
>  builtin/commit.c                         |  2 +-
>  builtin/gc.c                             |  3 +-
>  commit-graph.c                           | 35 ++++++++++++++++--------
>  commit-graph.h                           | 12 ++++++--
>  t/t5323-split-commit-graph.sh            | 35 ++++++++++++++++++++++++
>  8 files changed, 112 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 624470e198..365e145e82 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -26,7 +26,7 @@ OPTIONS
>  	Use given directory for the location of packfiles and commit-graph
>  	file. This parameter exists to specify the location of an alternate
>  	that only has the objects directory, not a full `.git` directory. The
> -	commit-graph file is expected to be at `<dir>/info/commit-graph` and
> +	commit-graph file is expected to be in the `<dir>/info` directory and
>  	the packfiles are expected to be in `<dir>/pack`.
>  
>  
> @@ -51,6 +51,25 @@ or `--stdin-packs`.)
>  +
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
> ++
> +With the `--split` option, write the commit-graph as a chain of multiple
> +commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
> +not already in the commit-graph are added in a new "tip" file. This file
> +is merged with the existing file if the following merge conditions are
> +met:
> ++
> +* If `--size-multiple=<X>` is not specified, let `X` equal 2. If the new
> +tip file would have `N` commits and the previous tip has `M` commits and
> +`X` times `N` is greater than  `M`, instead merge the two files into a
> +single file.
> ++
> +* If `--max-commits=<M>` is specified with `M` a positive integer, and the
> +new tip file would have more than `M` commits, then instead merge the new
> +tip with the previous tip.
> ++
> +Finally, if `--expire-time=<datetime>` is not specified, let `datetime`
> +be the current time. After writing the split commit-graph, delete all
> +unused commit-graph whose modified times are older than `datetime`.
>  
>  'read'::
>  
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index aed4350a59..729fbcb32f 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -248,10 +248,11 @@ When writing a set of commits that do not exist in the commit-graph stack of
>  height N, we default to creating a new file at level N + 1. We then decide to
>  merge with the Nth level if one of two conditions hold:
>  
> -  1. The expected file size for level N + 1 is at least half the file size for
> -     level N.
> +  1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
> +     level N is less than X times the number of commits in level N + 1.
>  
> -  2. Level N + 1 contains more than 64,0000 commits.
> +  2. `--max-commits=<C>` is specified with non-zero C and the number of commits
> +     in level N + 1 is more than C commits.
>  
>  This decision cascades down the levels: when we merge a level we create a new
>  set of commits that then compares to the next level.
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index c2c07d3917..18e3b61fb6 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--object-dir <objdir>]"),
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	N_("git commit-graph verify [--object-dir <objdir>]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
>  	NULL
>  };
>  
> @@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] <split options>"),
>  	NULL
>  };
>  
> @@ -135,6 +135,7 @@ static int graph_read(int argc, const char **argv)
>  }
>  
>  extern int read_replace_refs;
> +struct split_commit_graph_opts split_opts;

This 'split_opts' variable needs to be marked 'static'.

Thanks.

ATB,
Ramsay Jones



* Re: [PATCH v4 02/14] commit-graph: prepare for commit-graph chains
  2019-06-06 14:15       ` [PATCH v4 02/14] commit-graph: prepare for " Derrick Stolee via GitGitGadget
  2019-06-06 15:19         ` Philip Oakley
@ 2019-06-06 21:28         ` Junio C Hamano
  2019-06-07 12:44           ` Derrick Stolee
  1 sibling, 1 reply; 136+ messages in thread
From: Junio C Hamano @ 2019-06-06 21:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
> +{
> +	uint32_t lex_index;
> +
> +	if (!g)
> +		BUG("NULL commit-graph");
> +
> +	while (pos < g->num_commits_in_base)
> +		g = g->base_graph;

If a rogue caller calls this function with pos < 0, this loop would
eventually exhaust the chain and make g==NULL, I think.  Shouldn't a
similar assert exist upfront for "if (pos < 0)" or perhaps make pos
unsigned int instead?

> +	if (pos >= g->num_commits + g->num_commits_in_base)
> +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
> +		    pos, g->num_commits, g->num_commits_in_base);

Where does 'pos' typically come from?  Taken from a parent commit
field of a commit-graph file or something like that?

As this is a "BUG()" and not a "die()", the callers of this function
are responsible for making sure that, even if they are fed a set of
corrupt commit-graph files, they never feed 'pos' that is out of
bounds to this function.  The same is true for the other BUG() in
fill_commit_in_graph().

I am wondering if they have already sufficient protection, or if we
are better off having die() instead saying "corrupted commit graph
file" or something.  I dunno.
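
To make the two options concrete, here is a minimal standalone sketch,
assuming a simplified layer type rather than the real struct commit_graph;
find_layer and struct layer are hypothetical names. It takes the position
as an unsigned value, keeps an assert for the caller bug (a NULL graph),
and reports a position that the chain cannot satisfy by returning NULL so
that the caller can decide whether that deserves a die():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields used by the quoted loop. */
struct layer {
	uint32_t num_commits;         /* commits stored in this layer */
	uint32_t num_commits_in_base; /* commits in all layers below it */
	struct layer *base;
};

/* Returns the layer that holds graph position pos, or NULL if none does. */
const struct layer *find_layer(const struct layer *g, uint32_t pos)
{
	assert(g); /* programmer error, in the spirit of BUG("NULL commit-graph") */

	/* The extra "g &&" guards against a chain whose counts are inconsistent. */
	while (g && pos < g->num_commits_in_base)
		g = g->base;

	if (!g)
		return NULL; /* counts claim base commits, but no base layer exists */
	if (pos - g->num_commits_in_base >= g->num_commits)
		return NULL; /* position is beyond the tip of the chain */
	return g;
}
```

Because pos is unsigned, the loop stops at the bottom layer, where
num_commits_in_base is zero, instead of walking off the end of the chain.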



* Re: [PATCH v3 01/14] commit-graph: document commit-graph chains
  2019-06-06 17:09         ` Derrick Stolee
@ 2019-06-06 21:59           ` Philip Oakley
  0 siblings, 0 replies; 136+ messages in thread
From: Philip Oakley @ 2019-06-06 21:59 UTC (permalink / raw)
  To: Derrick Stolee, Derrick Stolee via GitGitGadget, git
  Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano, Derrick Stolee

Hi Stolee,

We may be talking at cross-purposes.
On 06/06/2019 18:09, Derrick Stolee wrote:
> On 6/6/2019 8:10 AM, Philip Oakley wrote:
>> Hi Derrick ,
>>
>> On 03/06/2019 17:03, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> Add a basic description of commit-graph chains.
>> Not really your problem, but I did notice that we don't actually explain what we mean here by a commit graph (before we start chaining them), and the distinction between the generic concept and the specific implementation.
The purpose of my comment is here. We have/had not explained why we need 
(another) commit graph, either within the man page, or the technical 
docs. It's an understanding gap.
>>
>> If I understand it correctly, the regular DAG (directed acyclic graph) already inherently contains the commit graph, showing the parent(s) of each commit. Hence, why do we need another? (which then needs explaining the what/why/how)
>>
>> So, in one sense, another commit chain is potentially duplicated redundant data. What hasn't been surfaced (for the reader coming later) is probably that accessing the DAG commit graph can be (a) slow, (b) one way (no child relationships), and (c) accesses large amounts of other data that isn't relevant to the task at hand.
>>
>> So the commit graph (implementation) is [I think] a fast, compact, sorted(?), list of commit oids that provides two way linkage through the commit graph (?) to allow fast queries within the Git codebase.
>>
>> The commit graph is normally considered immutable,
> _Commits_ are immutable. The graph grows as commits are added.
I was aware that individual commits are immutable. However the tips, 
grafts and replacements can change the topology of the graph (especially 
the grafts and replacements, hence the desire to have something that 
acts as a guide as to what, generally, is trying to be achieved).
>
> This may be the crux of your confusion, since the commit-graph
> file can become stale as commits are added by 'git commit' or
> 'git fetch'. The point of the incremental file format is to
> update the commit-graph data without rewriting the entire thing
> every time.
>
> Does this help clarify what's going on?
Only slightly, see below.
>
>> however the DAG commit graph can be extended by new commits, trimmed by branch deletion, rebasing, forced push, etc, or even reorganised via 'replace' or grafts commits, which must then be reflected in the commit graph (implementation).
> These things create new commit objects, which would not be in
> the commit-graph file until it is rewritten.
>
>> It just felt that there is a gap between the high level DAG, explained in the glossary, and the commit-graph, which perhaps the technical/commit-graph.txt ought to summarise.
> I do think that technical/commit-graph.txt does summarize a lot
> about the commit-graph _file_ and how that accelerates walks on
> the high-level DAG. The added content in this patch does assume
> a full understanding of the previous contents of that file.
The current (prior) documentation is a bit Catch 22 with regard to that 
assumed full understanding, hence my comment, including the "Not really 
your problem," bit.
>
> Thanks,
> -Stolee
>
Philip

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 04/14] commit-graph: load commit-graph chains
  2019-06-06 14:15       ` [PATCH v4 04/14] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-06 22:20         ` Junio C Hamano
  2019-06-07 12:53           ` Derrick Stolee
  0 siblings, 1 reply; 136+ messages in thread
From: Junio C Hamano @ 2019-06-06 22:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	if (stat(chain_name, &st)) {
> ...
> +	if (st.st_size <= the_hash_algo->hexsz) {
> ...
> +	fp = fopen(chain_name, "r");
> +	free(chain_name);
> +
> +	if (!fp)
> +		return NULL;

Checking for size before opening is an invitation for an unnecessary
race, isn't it?  Perhaps fopen() followed by fstat() is a better
alternative?

> +	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
> +
> +	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
> +		char *graph_name;
> +		struct commit_graph *g;

I am imagining an evil tester growing the file after you called
xcalloc() above ;-) Should we at least protect ourselves not to read
more than we planned to read originally?  I would imagine that the
ideal code organization would be more like

	valid = 1; have_read_all = 0;

	fopen();
	fstat(fp->fileno);
	count = st.st_size / hashsize;
	oids = xcalloc();

	for (i = 0; i < count; i++) {
        	if (getline() == EOF) {
			have_read_all = 1;
			break;
		}
		add one graph based on the line;
		if (error) {
			valid = 0;
			break;
		}
	}
	if (valid && i < count)
		die("file truncated while we are reading?");
	if (valid && !have_read_all)
		die("file grew while we are reading?");

if we really care, but even without going to that extreme, at least
we should refrain from reading more than we allocated.
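
For readers who want the ordering spelled out, here is a minimal,
self-contained sketch of "open first, fstat() the descriptor, then read no
more than was allocated". It is not the patch: read_chain_lines() and the
hexline type are hypothetical names, HEXSZ assumes SHA-1, and plain
fgets()/calloc() stand in for strbuf_getline_lf() and xcalloc().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define HEXSZ 40 /* assumption: hex length of a SHA-1 object id */

typedef char hexline[HEXSZ + 1];

/*
 * Read the chain file at 'path', storing at most as many hashes as the
 * size reported by fstat() has room for. Returns the number of lines
 * stored, or -1 on error. On success *out is a calloc'd array that the
 * caller must free.
 */
static long read_chain_lines(const char *path, hexline **out)
{
	struct stat st;
	char buf[HEXSZ + 2]; /* 40 hex digits + '\n' + NUL */
	long count, i;
	FILE *fp;

	assert(path && out);
	*out = NULL;

	fp = fopen(path, "r");
	if (!fp)
		return -1;

	/* Stat the descriptor we already hold, not the path. */
	if (fstat(fileno(fp), &st) || st.st_size <= HEXSZ) {
		fclose(fp);
		return -1;
	}

	count = (long)(st.st_size / (HEXSZ + 1));
	*out = calloc((size_t)count, sizeof(hexline));
	if (!*out) {
		fclose(fp);
		return -1;
	}

	/* The loop bound, not EOF, decides when we stop reading. */
	for (i = 0; i < count; i++) {
		if (!fgets(buf, sizeof(buf), fp))
			break; /* file shrank after fstat() */
		if (strlen(buf) != HEXSZ + 1 || buf[HEXSZ] != '\n') {
			free(*out);
			*out = NULL;
			fclose(fp);
			return -1; /* malformed line */
		}
		memcpy((*out)[i], buf, HEXSZ);
		(*out)[i][HEXSZ] = '\0';
	}

	fclose(fp);
	return i;
}
```

A caller can compare the return value with the count it expected if it
wants to report truncation. The sketch itself only guarantees that writes
stay inside the allocation even if the file grows after the fstat().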


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 00/14] Commit-graph: Write incremental files
  2019-06-06 16:57       ` [PATCH v4 00/14] Commit-graph: Write incremental files Junio C Hamano
@ 2019-06-07 12:37         ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-06-07 12:37 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin

On 6/6/2019 12:57 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>>  t/t5323-split-commit-graph.sh                 | 240 ++++++
> 
> This breaks test-lint, as t5323 is already taken in 'pu' by another
> topic.  I tentatively moved it to 5324 for now.

Sorry for not noticing this. I've moved the file to 5324 in my local
copy, so the next version will not hit this problem.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 02/14] commit-graph: prepare for commit-graph chains
  2019-06-06 21:28         ` Junio C Hamano
@ 2019-06-07 12:44           ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-06-07 12:44 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

On 6/6/2019 5:28 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
>> +{
>> +	uint32_t lex_index;
>> +
>> +	if (!g)
>> +		BUG("NULL commit-graph");
>> +
>> +	while (pos < g->num_commits_in_base)
>> +		g = g->base_graph;
> 
> If a rogue caller calls this function with pos < 0, this loop would
> eventually exhaust the chain and make g==NULL, I think.  Shouldn't a
> similar assert exist upfront for "if (pos < 0)" or perhaps make pos
> unsigned int instead?

This is a good point. I will use 'uint32_t' since the only caller is
insert_parent_or_die() which has an unsigned position. I did notice that
insert_parent_or_die() uses "uint64_t pos" but its only caller passes a
"uint32_t edge_value". The 32-bit value makes more sense because of the
built-in limits in the commit-graph format for number of commits. I'll
change insert_parent_or_die() to use 32-bits as well.

As for the while loop, it would also be good to rearrange the checks
as follows:

        while (g && pos < g->num_commits_in_base)
                g = g->base_graph;

        if (!g)
                BUG("NULL commit-graph");
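
For readers following along, the position lookup can be sketched on its
own. This is a minimal sketch, not the patch: struct graph_layer and
find_layer() are hypothetical stand-ins for struct commit_graph and
load_oid_from_graph(), and the function returns NULL on bad input where
the real code would BUG() or die().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct commit_graph used here. */
struct graph_layer {
	uint32_t num_commits;           /* commits stored in this layer */
	uint32_t num_commits_in_base;   /* commits in all layers below it */
	struct graph_layer *base_graph; /* next layer down, or NULL */
};

/*
 * Resolve a chain-wide graph position to the layer that owns it and to
 * the lexicographic index inside that layer. Returns NULL when the
 * position cannot be resolved.
 */
static struct graph_layer *find_layer(struct graph_layer *g, uint32_t pos,
				      uint32_t *lex_index)
{
	assert(lex_index);

	/* Walk down while the position belongs to a lower layer. */
	while (g && pos < g->num_commits_in_base)
		g = g->base_graph;

	if (!g)
		return NULL; /* chain ended before we found the owner */

	if (pos >= g->num_commits + g->num_commits_in_base)
		return NULL; /* beyond the top of this layer */

	*lex_index = pos - g->num_commits_in_base;
	return g;
}
```

Because pos is unsigned, the "pos < 0" case from the review cannot occur,
and the NULL check after the loop covers a chain whose counts are
inconsistent.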

> 
>> +	if (pos >= g->num_commits + g->num_commits_in_base)
>> +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
>> +		    pos, g->num_commits, g->num_commits_in_base);
> 
> Where does 'pos' typically come from?  Taken from a parent commit
> field of a commit-graph file or something like that?

It comes from the commit-graph file.

> As this is a "BUG()" and not a "die()", the callers of this function
> are responsible for making sure that, even if they are fed a set of
> corrupt commit-graph files, they never feed 'pos' that is out of
> bounds to this function.  The same is true for the other BUG() in
> fill_commit_in_graph().
>
> I am wondering if they have already sufficient protection, or if we
> are better off having die() instead saying "corrupted commit graph
> file" or something.  I dunno.

I can replace this with a die() that points to a corrupt commit-graph
file. Perhaps "BUG()" made more sense while I was developing the feature
and wanted to tell myself why the error condition happened. That doesn't
make sense any more now that the feature is working.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 04/14] commit-graph: load commit-graph chains
  2019-06-06 22:20         ` Junio C Hamano
@ 2019-06-07 12:53           ` Derrick Stolee
  0 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee @ 2019-06-07 12:53 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

On 6/6/2019 6:20 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +	if (stat(chain_name, &st)) {
>> ...
>> +	if (st.st_size <= the_hash_algo->hexsz) {
>> ...
>> +	fp = fopen(chain_name, "r");
>> +	free(chain_name);
>> +
>> +	if (!fp)
>> +		return NULL;
> 
> Checking for size before opening is an invitation for an unnecessary
> race, isn't it?  Perhaps fopen() followed by fstat() is a better
> alternative?
> 
>> +	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
>> +
>> +	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
>> +		char *graph_name;
>> +		struct commit_graph *g;
> 
> I am imagining an evil tester growing the file after you called
> xcalloc() above ;-) Should we at least protect ourselves not to read
> more than we planned to read originally?  I would imagine that the
> ideal code organization would be more like
> 
> 	valid = 1; have_read_all = 0;
> 
> 	fopen();
> 	fstat(fp->fileno);
> 	count = st.st_size / hashsize;
> 	oids = xcalloc();
> 
> 	for (i = 0; i < count; i++) {
>         	if (getline() == EOF) {
> 			have_read_all = 1;
> 			break;
> 		}
> 		add one graph based on the line;
> 		if (error) {
> 			valid = 0;
> 			break;
> 		}
> 	}
> 	if (valid && i < count)
> 		die("file truncated while we are reading?");
> 	if (valid && !have_read_all)
> 		die("file grew while we are reading?");
> 
> if we really care, but even without going to that extreme, at least
> we should refrain from reading more than we allocated.

Thanks! I clearly was not careful enough with this input, which should
have been easy to get right. I think all your points are valid. The
code looks much cleaner after rewriting it to care about counts and to
properly order the stat() call.

-Stolee

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 05/14] commit-graph: add base graphs chunk
  2019-06-06 14:15       ` [PATCH v4 05/14] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
@ 2019-06-07 18:15         ` Junio C Hamano
  0 siblings, 0 replies; 136+ messages in thread
From: Junio C Hamano @ 2019-06-07 18:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -  1-byte (reserved for later use)
> -     Current clients should ignore this value.
> +  1-byte number (B) of base commit-graphs
> +      We infer the length (H*B) of the Base Graphs chunk
> +      from this value.
>  
>  CHUNK LOOKUP:
>  
> @@ -92,6 +93,12 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> +      This list of H-byte hashes describe a set of B commit-graph files that
> +      form a commit-graph chain. The graph position for the ith commit in this
> +      file's OID Lookup chunk is equal to i plus the number of commits in all
> +      base graphs.  If B is non-zero, this chunk must exist.

Hmph, an obvious alternative design would be to make the base list
self describing without using the "reserved for future use" byte,
which would allow more than 256 bases (not that being able to use
300 bases is necessarily a useful feature) and also leave the
reserved byte unused.

It's not like being able to detect discrepancy (e.g. B!=0 but BASE
chunk is missing, and/or BASE chunk appears but B==0) adds value by
offering more protection against file corruption, so I am wondering
why it is a good idea to consume the reserved byte for this.

> +	if (n && !g->chunk_base_graphs) {
> +		warning(_("commit-graph has no base graphs chunk"));
> +		return 0;
> +	}
> +
>  	while (n) {
>  		n--;
> +
> +		if (!oideq(&oids[n], &cur_g->oid) ||
> +		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {

Here, load_commit_graph_chain() that goes over the on-disk chain
file that lists graph files called us with 'n', which can run up to
the number of graph files listed in that file---and that number can
be more than what is recorded in the graph-list chunk, in which case
we are over-reading with this hasheq(), right?
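
One way to avoid such an over-read is to compare n against the B byte
before touching the chunk. A minimal sketch under assumptions:
chain_matches_base_chunk() and its flat-buffer arguments are hypothetical,
HASH_LEN assumes SHA-1, and the check only guards the read. Whether
n < B should also be rejected is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: raw SHA-1 length (H in the format doc) */

/*
 * Compare the first n hashes listed in the chain file against the Base
 * Graphs chunk of one layer.
 *
 *   chain    - n * HASH_LEN bytes taken from the chain file
 *   chunk    - num_base * HASH_LEN bytes, or NULL if the chunk is absent
 *   num_base - the B value from the layer's header
 *
 * Returns 1 when every hash matches, 0 otherwise.
 */
static int chain_matches_base_chunk(const unsigned char *chain, size_t n,
				    const unsigned char *chunk,
				    size_t num_base)
{
	size_t i;

	assert(chain || !n);

	if (n && !chunk)
		return 0; /* layer claims bases but has no chunk */
	if (n > num_base)
		return 0; /* chain file lists more bases than the chunk holds */

	for (i = 0; i < n; i++)
		if (memcmp(chain + i * HASH_LEN, chunk + i * HASH_LEN, HASH_LEN))
			return 0;

	return 1;
}
```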

It seems that parse_commit_graph() only cares about the beginning of
each chunk, and a crafted graph file can record two chunks with a
gap in between, or two chunks that overlap, and nobody would notice.
Is that true?

Wasted space in the file between two chunks (i.e. a gap) is not
necessarily bad and may not be a warning-worthy thing, but two
chunks that overlap is probably not a good idea and worth noticing.
The only sanity check it seems to do is to forbid chunks of the same
kind from appearing twice.
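
A sanity check for overlaps might look like the following. Sketch only:
struct chunk_span and chunks_are_disjoint() are hypothetical, and the
"size" of each chunk is assumed to be the length implied by the header
(for example, hash length times number of commits for OID Lookup), since
the lookup table itself stores only offsets. Gaps are tolerated; chunks
that are out of order or run into their successor are rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One chunk as the reader understands it (hypothetical type). */
struct chunk_span {
	uint64_t offset; /* start, as recorded in the chunk lookup table */
	uint64_t size;   /* bytes the reader expects to consume */
};

/*
 * 'c' holds the chunks in table order and 'data_end' is the offset of
 * the terminating table entry. Returns 1 when every chunk ends at or
 * before the start of the next one, 0 otherwise.
 */
static int chunks_are_disjoint(const struct chunk_span *c, size_t nr,
			       uint64_t data_end)
{
	size_t i;

	assert(c || !nr);

	for (i = 0; i < nr; i++) {
		uint64_t limit = (i + 1 < nr) ? c[i + 1].offset : data_end;

		if (c[i].offset > limit)
			return 0; /* out of order, or past the end */
		if (c[i].size > limit - c[i].offset)
			return 0; /* runs into the next chunk */
	}
	return 1;
}
```

Comparing size against "limit - offset" rather than adding offset and size
keeps the check free of 64-bit overflow on crafted input.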

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v4 06/14] commit-graph: rearrange chunk count logic
  2019-06-06 14:15       ` [PATCH v4 06/14] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
@ 2019-06-07 18:23         ` Junio C Hamano
  0 siblings, 0 replies; 136+ messages in thread
From: Junio C Hamano @ 2019-06-07 18:23 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The number of chunks in a commit-graph file can change depending on
> whether we need the Extra Edges Chunk. We are going to add more optional
> chunks, and it will be helpful to rearrange this logic around the chunk
> count before doing so.
>
> Specifically, we need to finalize the number of chunks before writing
> the commit-graph header. Further, we also need to fill out the chunk
> lookup table dynamically; indexing it by "num_chunks" as each chunk is
> added makes it easy to add more optional chunks in the future.

Yup.  The resulting code may be slightly longer at this step than
before, but it certainly is easier to follow the logic.  Good.
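
The same idea can be taken one step further by recording each chunk's
size as it is added and deriving every offset in a single pass once the
count is final. A minimal sketch, assuming an 8-byte header and 12-byte
lookup entries (my reading of GRAPH_CHUNKLOOKUP_WIDTH); struct chunk_plan,
plan_chunk() and finalize_offsets() are hypothetical names, not functions
in commit-graph.c.

```c
#include <assert.h>
#include <stdint.h>

#define HEADER_SIZE 8
#define CHUNKLOOKUP_WIDTH 12 /* assumption: 4-byte id + 8-byte offset */
#define FANOUT_SIZE (4 * 256)
#define MAX_CHUNKS 5

struct chunk_plan {
	uint32_t id[MAX_CHUNKS + 1];     /* +1 for the terminating entry */
	uint64_t size[MAX_CHUNKS];
	uint64_t offset[MAX_CHUNKS + 1];
	int nr;
};

/* Register one chunk; offsets are not known yet. */
static void plan_chunk(struct chunk_plan *p, uint32_t id, uint64_t size)
{
	assert(p->nr < MAX_CHUNKS);
	p->id[p->nr] = id;
	p->size[p->nr] = size;
	p->nr++;
}

/* Once the chunk count is final, lay the chunks out back to back. */
static void finalize_offsets(struct chunk_plan *p)
{
	int i;
	uint64_t cur = HEADER_SIZE +
		       (uint64_t)(p->nr + 1) * CHUNKLOOKUP_WIDTH;

	for (i = 0; i < p->nr; i++) {
		p->offset[i] = cur;
		cur += p->size[i];
	}
	p->id[p->nr] = 0;       /* terminating table entry */
	p->offset[p->nr] = cur; /* marks the end of the last chunk */
}
```

With this shape an optional chunk costs one plan_chunk() call, and the
header can be written right after finalize_offsets() because p->nr no
longer changes.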


>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 909c841db5..80df6d6d9d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1206,7 +1206,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	uint64_t chunk_offsets[5];
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
> -	int num_chunks = ctx->num_extra_edges ? 4 : 3;
> +	int num_chunks = 3;
>  
>  	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
>  	if (safe_create_leading_directories(ctx->graph_name)) {
> @@ -1219,27 +1219,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
>  	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
>  
> -	hashwrite_be32(f, GRAPH_SIGNATURE);
> -
> -	hashwrite_u8(f, GRAPH_VERSION);
> -	hashwrite_u8(f, oid_version());
> -	hashwrite_u8(f, num_chunks);
> -	hashwrite_u8(f, 0); /* unused padding byte */
> -
>  	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>  	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> -	if (ctx->num_extra_edges)
> -		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
> -	else
> -		chunk_ids[3] = 0;
> -	chunk_ids[4] = 0;
> +	if (ctx->num_extra_edges) {
> +		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
> +		num_chunks++;
> +	}
> +
> +	chunk_ids[num_chunks] = 0;
>  
>  	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
>  	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>  	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
>  	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
> -	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
> +
> +	num_chunks = 3;
> +	if (ctx->num_extra_edges) {
> +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> +						4 * ctx->num_extra_edges;
> +		num_chunks++;
> +	}
> +
> +	hashwrite_be32(f, GRAPH_SIGNATURE);
> +
> +	hashwrite_u8(f, GRAPH_VERSION);
> +	hashwrite_u8(f, oid_version());
> +	hashwrite_u8(f, num_chunks);
> +	hashwrite_u8(f, 0);
>  
>  	for (i = 0; i <= num_chunks; i++) {
>  		uint32_t chunk_write[3];

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [PATCH v5 00/16] Commit-graph: Write incremental files
  2019-06-06 14:15     ` [PATCH v4 00/14] Commit-graph: Write incremental files Derrick Stolee via GitGitGadget
                         ` (14 preceding siblings ...)
  2019-06-06 16:57       ` [PATCH v4 00/14] Commit-graph: Write incremental files Junio C Hamano
@ 2019-06-07 18:38       ` " Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 01/16] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
                           ` (17 more replies)
  15 siblings, 18 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano

This version is now ready for review.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder with name graph-{hash}.graph. The
list of hashes is given by the commit-graph-chain file, and also in a "base
graph chunk" in the commit-graph format. As we read a chain, we can verify
that the hashes match the trailing hash of each commit-graph we read along
the way and each hash below a level is expected by that graph file.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the number of new commits we will write is
more than half the number of commits in the level at the tip of the stack.
This can be tweaked by the --size-multiple and --max-commits options.
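
A minimal sketch of the size rule, assuming a level is folded into the
pending write when it holds at most size_multiple times as many commits
as that write (with a multiple of 2, this is the "more than half the tip"
condition from the first cover letter). struct layer and pick_base() are
hypothetical names, and the --max-commits limit is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one level in the chain. */
struct layer {
	uint32_t num_commits;
	struct layer *base; /* next level down, or NULL */
};

/*
 * Starting from the tip, absorb every level that is small relative to
 * the pending write. Returns the level that stays as the base of the
 * new file (NULL when everything merges into one file) and updates
 * *num_new to the number of commits the new file will hold.
 */
static struct layer *pick_base(struct layer *tip, uint32_t *num_new,
			       uint32_t size_multiple)
{
	struct layer *g = tip;
	uint64_t n;

	assert(num_new);
	n = *num_new;

	while (g && (uint64_t)g->num_commits <= size_multiple * n) {
		n += g->num_commits; /* this level is rewritten too */
		g = g->base;
	}

	*num_new = (uint32_t)n;
	return g;
}
```

Because each absorbed level increases n, a merge can cascade: a small
write on top of several similarly sized levels may collapse all of them,
which is what keeps the chain short.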

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

write --stdin-commits:         8m 12s
write --stdin-commits --split:    28s
write --split && verify --shallow: 60s

Updates in V3:

 * git commit-graph verify now works on commit-graph chains. We do a simple
   test to check the behavior of a new --shallow option.

 * When someone writes a flat commit-graph, we now expire the old chain
   according to the expire time.

 * The "max commits" limit is no longer enabled by default, but instead is
   enabled by a --max-commits=<n> option. Ignored if n=0.

Updates in V4:

Johannes pointed out some test failures on the Windows platform. We found
that the tests were not running on Windows in the gitgitgadget PR builds,
which is now resolved.

 * We need to close commit-graphs recursively down the chain. This prevented
   an unlink() from working because of an open handle.

 * Creating the alternates file used a path-specification that didn't work
   on Windows.

 * Renaming a file to the same name failed, but is probably related to the
   unlink() error mentioned above.

Updates in V5:

 * Responding to multiple items of feedback. Thanks Philip, Junio, and
   Ramsay!

 * Used the test coverage report to find holes in the test coverage. While
   adding tests, I found a bug in octopus merges. The fix is in the rewrite
   of "deduplicate_commits()" as "sort_and_scan_merged_commits()" and
   covered by the new tests.

This is based on ds/commit-graph-write-refactor.

Thanks, -Stolee

[1] 
https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (16):
  commit-graph: document commit-graph chains
  commit-graph: prepare for commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: load commit-graph chains
  commit-graph: add base graphs chunk
  commit-graph: rearrange chunk count logic
  commit-graph: write commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: merge commit-graph chains
  commit-graph: allow cross-alternate chains
  commit-graph: expire commit-graph files
  commit-graph: create options for split files
  commit-graph: verify chains with --shallow mode
  commit-graph: clean up chains after flattened write
  commit-graph: test octopus merges with --split
  commit-graph: test --split across alternate without --split

 Documentation/git-commit-graph.txt            |  26 +-
 .../technical/commit-graph-format.txt         |  11 +-
 Documentation/technical/commit-graph.txt      | 195 +++++
 builtin/commit-graph.c                        |  53 +-
 builtin/commit.c                              |   2 +-
 builtin/gc.c                                  |   3 +-
 commit-graph.c                                | 799 +++++++++++++++++-
 commit-graph.h                                |  25 +-
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5324-split-commit-graph.sh                 | 319 +++++++
 10 files changed, 1365 insertions(+), 70 deletions(-)
 create mode 100755 t/t5324-split-commit-graph.sh


base-commit: 8520d7fc7c6edd4d71582c69a873436029b6cb1b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Range-diff vs v4:

  1:  b184919255 =  1:  7a0bfaaa6d commit-graph: document commit-graph chains
  2:  d0dc154a27 !  2:  ce139d80df commit-graph: prepare for commit-graph chains
     @@ -14,7 +14,7 @@
           * lexicographic index : the position within the lexicographic
             order in a single commit-graph file.
      
     -     * graph position: the posiiton within the concatenated order
     +     * graph position: the position within the concatenated order
             of multiple commit-graph files
      
          Given the lexicographic index of a commit in a graph, we can
     @@ -22,28 +22,53 @@
          the lower-level graphs. To find the lexicographic index of
          a commit, we subtract the number of commits in lower-level graphs.
      
     +    While here, change insert_parent_or_die() to take a uint32_t
     +    position, as that is the type used by its only caller and that
     +    makes more sense with the limits in the commit-graph format.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       diff --git a/commit-graph.c b/commit-graph.c
       --- a/commit-graph.c
       +++ b/commit-graph.c
     +@@
     + 	return !!first_generation;
     + }
     + 
     ++static void close_commit_graph_one(struct commit_graph *g)
     ++{
     ++	if (!g)
     ++		return;
     ++
     ++	close_commit_graph_one(g->base_graph);
     ++	free_commit_graph(g);
     ++}
     ++
     + void close_commit_graph(struct repository *r)
     + {
     +-	free_commit_graph(r->objects->commit_graph);
     ++	close_commit_graph_one(r->objects->commit_graph);
     + 	r->objects->commit_graph = NULL;
     + }
     + 
      @@
       			    g->chunk_oid_lookup, g->hash_len, pos);
       }
       
     -+static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
     ++static void load_oid_from_graph(struct commit_graph *g,
     ++				uint32_t pos,
     ++				struct object_id *oid)
      +{
      +	uint32_t lex_index;
      +
     ++	while (g && pos < g->num_commits_in_base)
     ++		g = g->base_graph;
     ++
      +	if (!g)
      +		BUG("NULL commit-graph");
      +
     -+	while (pos < g->num_commits_in_base)
     -+		g = g->base_graph;
     -+
      +	if (pos >= g->num_commits + g->num_commits_in_base)
     -+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
     -+		    pos, g->num_commits, g->num_commits_in_base);
     ++		die(_("invalid commit position. commit-graph is likely corrupt"));
      +
      +	lex_index = pos - g->num_commits_in_base;
      +
     @@ -52,14 +77,17 @@
      +
       static struct commit_list **insert_parent_or_die(struct repository *r,
       						 struct commit_graph *g,
     - 						 uint64_t pos,
     -@@
     +-						 uint64_t pos,
     ++						 uint32_t pos,
     + 						 struct commit_list **pptr)
     + {
       	struct commit *c;
       	struct object_id oid;
       
      -	if (pos >= g->num_commits)
     +-		die("invalid parent position %"PRIu64, pos);
      +	if (pos >= g->num_commits + g->num_commits_in_base)
     - 		die("invalid parent position %"PRIu64, pos);
     ++		die("invalid parent position %"PRIu32, pos);
       
      -	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
      +	load_oid_from_graph(g, pos, &oid);
     @@ -95,8 +123,7 @@
      +		g = g->base_graph;
      +
      +	if (pos >= g->num_commits + g->num_commits_in_base)
     -+		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
     -+		    pos, g->num_commits, g->num_commits_in_base);
     ++		die(_("invalid commit position. commit-graph is likely corrupt"));
      +
      +	/*
      +	 * Store the "full" position, but then use the
  3:  f35b04224a =  3:  2470d2b548 commit-graph: rename commit_compare to oid_compare
  4:  ca670536df !  4:  fc3423046b commit-graph: load commit-graph chains
     @@ -82,32 +82,30 @@
      +	struct strbuf line = STRBUF_INIT;
      +	struct stat st;
      +	struct object_id *oids;
     -+	int i = 0, valid = 1;
     ++	int i = 0, valid = 1, count;
      +	char *chain_name = get_chain_filename(obj_dir);
      +	FILE *fp;
     -+
     -+	if (stat(chain_name, &st)) {
     -+		free(chain_name);
     -+		return NULL;
     -+	}
     -+
     -+	if (st.st_size <= the_hash_algo->hexsz) {
     -+		free(chain_name);
     -+		return NULL;
     -+	}
     ++	int stat_res;
      +
      +	fp = fopen(chain_name, "r");
     ++	stat_res = stat(chain_name, &st);
      +	free(chain_name);
      +
     -+	if (!fp)
     ++	if (!fp ||
     ++	    stat_res ||
     ++	    st.st_size <= the_hash_algo->hexsz)
      +		return NULL;
      +
     -+	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
     ++	count = st.st_size / (the_hash_algo->hexsz + 1);
     ++	oids = xcalloc(count, sizeof(struct object_id));
      +
     -+	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
     ++	for (i = 0; i < count && valid; i++) {
      +		char *graph_name;
      +		struct commit_graph *g;
      +
     ++		if (strbuf_getline_lf(&line, fp) == EOF)
     ++			break;
     ++
      +		if (get_oid_hex(line.buf, &oids[i])) {
      +			warning(_("invalid commit-graph chain: line '%s' not a hash"),
      +				line.buf);
  5:  df44cbc1bf !  5:  d14c79f9d5 commit-graph: add base graphs chunk
     @@ -88,12 +88,12 @@
       	while (n) {
       		n--;
      +
     -+		if (!oideq(&oids[n], &cur_g->oid) ||
     ++		if (!cur_g ||
     ++		    !oideq(&oids[n], &cur_g->oid) ||
      +		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {
      +			warning(_("commit-graph chain does not match"));
      +			return 0;
      +		}
     -+
      +
       		cur_g = cur_g->base_graph;
       	}
  6:  e65f9e841d =  6:  5238bbbec3 commit-graph: rearrange chunk count logic
  7:  fe0aa343cd =  7:  02b0359571 commit-graph: write commit-graph chains
  8:  c42e683ef6 !  8:  a0330ebd2d commit-graph: add --split option to builtin
     @@ -9,7 +9,7 @@
          are not in the existing commit-graph or commit-graph chain. Later changes
          will allow merging the chain and expiring out-dated files.
      
     -    Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
     +    Add a new test script (t5324-split-commit-graph.sh) that demonstrates this
          behavior.
      
          Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
     @@ -92,10 +92,10 @@
       		} else {
       			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
       new file mode 100755
       --- /dev/null
     - +++ b/t/t5323-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
      +#!/bin/sh
      +
  9:  d065758454 !  9:  28eccfa52b commit-graph: merge commit-graph chains
     @@ -242,30 +242,27 @@
      +	return oidcmp(&a->object.oid, &b->object.oid);
      +}
      +
     -+static void deduplicate_commits(struct write_commit_graph_context *ctx)
     ++static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
      +{
     -+	uint32_t i, num_parents, last_distinct = 0, duplicates = 0;
     ++	uint32_t i, num_parents;
      +	struct commit_list *parent;
      +
      +	if (ctx->report_progress)
      +		ctx->progress = start_delayed_progress(
     -+					_("De-duplicating merged commits"),
     ++					_("Scanning merged commits"),
      +					ctx->commits.nr);
      +
      +	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
      +
      +	ctx->num_extra_edges = 0;
     -+	for (i = 1; i < ctx->commits.nr; i++) {
     ++	for (i = 0; i < ctx->commits.nr; i++) {
      +		display_progress(ctx->progress, i);
      +
     -+		if (oideq(&ctx->commits.list[last_distinct]->object.oid,
     ++		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
      +			  &ctx->commits.list[i]->object.oid)) {
     -+			duplicates++;
     ++			die(_("unexpected duplicate commit id %s"),
     ++			    oid_to_hex(&ctx->commits.list[i]->object.oid));
      +		} else {
     -+			if (duplicates)
     -+				ctx->commits.list[last_distinct + 1] = ctx->commits.list[i];
     -+			last_distinct++;
     -+
      +			num_parents = 0;
      +			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
      +				num_parents++;
     @@ -275,7 +272,6 @@
      +		}
      +	}
      +
     -+	ctx->commits.nr -= duplicates;
      +	stop_progress(&ctx->progress);
      +}
      +
     @@ -308,7 +304,7 @@
      +	if (ctx->new_base_graph)
      +		ctx->base_graph_name = xstrdup(ctx->new_base_graph->filename);
      +
     -+	deduplicate_commits(ctx);
     ++	sort_and_scan_merged_commits(ctx);
      +}
      +
       int write_commit_graph(const char *obj_dir,
     @@ -340,9 +336,9 @@
       
       	compute_generation_numbers(ctx);
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       
       graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
 10:  62b3fca582 ! 10:  2093bab5b1 commit-graph: allow cross-alternate chains
     @@ -18,7 +18,7 @@
          3. When writing a new commit-graph chain based on a commit-graph file
             in another object directory, do not allow success if the base file
             has of the name "commit-graph" instead of
     -       "commit-graphs/graoh-{hash}.graph".
     +       "commit-graphs/graph-{hash}.graph".
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ -91,15 +91,17 @@
       }
       
      @@
     - 	oids = xcalloc(st.st_size / (the_hash_algo->hexsz + 1), sizeof(struct object_id));
     + 	count = st.st_size / (the_hash_algo->hexsz + 1);
     + 	oids = xcalloc(count, sizeof(struct object_id));
       
     - 	while (strbuf_getline_lf(&line, fp) != EOF && valid) {
     +-	for (i = 0; i < count && valid; i++) {
      -		char *graph_name;
      -		struct commit_graph *g;
     ++	for (i = 0; i < count; i++) {
      +		struct object_directory *odb;
       
     - 		if (get_oid_hex(line.buf, &oids[i])) {
     - 			warning(_("invalid commit-graph chain: line '%s' not a hash"),
     + 		if (strbuf_getline_lf(&line, fp) == EOF)
     + 			break;
      @@
       			break;
       		}
     @@ -107,6 +109,7 @@
      -		graph_name = get_split_graph_filename(obj_dir, line.buf);
      -		g = load_commit_graph_one(graph_name);
      -		free(graph_name);
     ++		valid = 0;
      +		for (odb = r->objects->odb; odb; odb = odb->next) {
      +			char *graph_name = get_split_graph_filename(odb->path, line.buf);
      +			struct commit_graph *g = load_commit_graph_one(graph_name);
     @@ -120,13 +123,18 @@
      +			if (g) {
      +				g->obj_dir = odb->path;
      +
     -+				if (add_graph_to_chain(g, graph_chain, oids, i))
     ++				if (add_graph_to_chain(g, graph_chain, oids, i)) {
      +					graph_chain = g;
     -+				else
     -+					valid = 0;
     ++					valid = 1;
     ++				}
      +
      +				break;
      +			}
     ++		}
     ++
     ++		if (!valid) {
     ++			warning(_("unable to find all commit-graph files"));
     ++			break;
      +		}
       	}
       
     @@ -182,9 +190,9 @@
       	uint32_t num_commits_in_base;
       	struct commit_graph *base_graph;
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       	graph_read_expect 12
       '
     @@ -195,7 +203,7 @@
      +	(
      +		cd fork &&
      +		rm .git/objects/info/commit-graph &&
     -+		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
     ++		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
      +		test_commit new-commit &&
      +		git commit-graph write --reachable --split &&
      +		test_path_is_file $graphdir/commit-graph-chain &&
     @@ -217,7 +225,7 @@
      +		cd fork &&
      +		git config core.commitGraph true &&
      +		rm -rf $graphdir &&
     -+		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
     ++		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
      +		test_commit 13 &&
      +		git branch commits/13 &&
      +		git commit-graph write --reachable --split &&
 11:  b5aeeed909 ! 11:  554880e3d7 commit-graph: expire commit-graph files
     @@ -46,27 +46,7 @@
       --- a/commit-graph.c
       +++ b/commit-graph.c
      @@
     - 	return !!first_generation;
     - }
     - 
     -+static void close_commit_graph_one(struct commit_graph *g)
     -+{
     -+	if (!g)
     -+		return;
     -+
     -+	close_commit_graph_one(g->base_graph);
     -+	free_commit_graph(g);
     -+}
     -+
     - void close_commit_graph(struct repository *r)
     - {
     --	free_commit_graph(r->objects->commit_graph);
     -+	close_commit_graph_one(r->objects->commit_graph);
     - 	r->objects->commit_graph = NULL;
     - }
     - 
     -@@
     - 	deduplicate_commits(ctx);
     + 	sort_and_scan_merged_commits(ctx);
       }
       
      +static void mark_commit_graphs(struct write_commit_graph_context *ctx)
     @@ -129,6 +109,7 @@
      +
      +		if (!found)
      +			unlink(path.buf);
     ++
      +	}
      +}
      +
     @@ -148,9 +129,9 @@
       	free(ctx->graph_name);
       	free(ctx->commits.list);
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       	test_path_is_file $graphdir/commit-graph-chain &&
       	test_line_count = 2 $graphdir/commit-graph-chain &&
 12:  ac5586a20f ! 12:  66be8b03a8 commit-graph: create options for split files
     @@ -94,7 +94,7 @@
       }
       
       extern int read_replace_refs;
     -+struct split_commit_graph_opts split_opts;
     ++static struct split_commit_graph_opts split_opts;
       
       static int graph_write(int argc, const char **argv)
       {
     @@ -294,9 +294,9 @@
       int verify_commit_graph(struct repository *r, struct commit_graph *g);
       
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       
       graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
     @@ -333,6 +333,18 @@
      +		test_line_count = 1 $graphdir/commit-graph-chain &&
      +		ls $graphdir/graph-*.graph >graph-files &&
      +		test_line_count = 3 graph-files
     ++	) &&
     ++	git clone --no-hardlinks . max-commits &&
     ++	(
     ++		cd max-commits &&
     ++		git config core.commitGraph true &&
     ++		test_line_count = 2 $graphdir/commit-graph-chain &&
     ++		test_commit 16 &&
     ++		test_commit 17 &&
     ++		git commit-graph write --reachable --split --max-commits=1 &&
     ++		test_line_count = 1 $graphdir/commit-graph-chain &&
     ++		ls $graphdir/graph-*.graph >graph-files &&
     ++		test_line_count = 1 graph-files
      +	)
      +'
      +
 13:  548ec69d01 ! 13:  9fec4f9a36 commit-graph: verify chains with --shallow mode
     @@ -197,9 +197,9 @@
       void close_commit_graph(struct repository *);
       void free_commit_graph(struct commit_graph *);
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       	)
       '
     @@ -211,18 +211,59 @@
      +	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
      +}
      +
     -+test_expect_success 'verify shallow' '
     -+	git clone . verify &&
     ++test_expect_success 'verify hashes along chain, even in shallow' '
     ++	git clone --no-hardlinks . verify &&
      +	(
      +		cd verify &&
      +		git commit-graph verify &&
      +		base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
      +		corrupt_file "$base_file" 1760 "\01" &&
     ++		test_must_fail git commit-graph verify --shallow 2>test_err &&
     ++		grep -v "^+" test_err >err &&
     ++		test_i18ngrep "incorrect checksum" err
     ++	)
     ++'
     ++
     ++test_expect_success 'verify --shallow does not check base contents' '
     ++	git clone --no-hardlinks . verify-shallow &&
     ++	(
     ++		cd verify-shallow &&
     ++		git commit-graph verify &&
     ++		base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
     ++		corrupt_file "$base_file" 1000 "\01" &&
      +		git commit-graph verify --shallow &&
      +		test_must_fail git commit-graph verify 2>test_err &&
      +		grep -v "^+" test_err >err &&
      +		test_i18ngrep "incorrect checksum" err
      +	)
      +'
     ++
     ++test_expect_success 'warn on base graph chunk incorrect' '
     ++	git clone --no-hardlinks . base-chunk &&
     ++	(
     ++		cd base-chunk &&
     ++		git commit-graph verify &&
     ++		base_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
     ++		corrupt_file "$base_file" 1376 "\01" &&
     ++		git commit-graph verify --shallow 2>test_err &&
     ++		grep -v "^+" test_err >err &&
     ++		test_i18ngrep "commit-graph chain does not match" err
     ++	)
     ++'
     ++
     ++test_expect_success 'verify after commit-graph-chain corruption' '
     ++	git clone --no-hardlinks . verify-chain &&
     ++	(
     ++		cd verify-chain &&
     ++		corrupt_file "$graphdir/commit-graph-chain" 60 "G" &&
     ++		git commit-graph verify 2>test_err &&
     ++		grep -v "^+" test_err >err &&
     ++		test_i18ngrep "invalid commit-graph chain" err &&
     ++		corrupt_file "$graphdir/commit-graph-chain" 60 "A" &&
     ++		git commit-graph verify 2>test_err &&
     ++		grep -v "^+" test_err >err &&
     ++		test_i18ngrep "unable to find all commit-graph files" err
     ++	)
     ++'
      +
       test_done
 14:  6084bbd164 ! 14:  795ea36ff4 commit-graph: clean up chains after flattened write
     @@ -42,9 +42,9 @@
       cleanup:
       	free(ctx->graph_name);
      
     - diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     - --- a/t/t5323-split-commit-graph.sh
     - +++ b/t/t5323-split-commit-graph.sh
     + diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
     + --- a/t/t5324-split-commit-graph.sh
     + +++ b/t/t5324-split-commit-graph.sh
      @@
       	)
       '
  -:  ---------- > 15:  101792b92d commit-graph: test octopus merges with --split
  -:  ---------- > 16:  84a3ff7c61 commit-graph: test --split across alternate without --split

-- 
gitgitgadget


* [PATCH v5 01/16] commit-graph: document commit-graph chains
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 02/16] commit-graph: prepare for " Derrick Stolee via GitGitGadget
                           ` (16 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a basic description of commit-graph chains. More details about the
feature will be added as we add functionality. This introduction gives a
high-level overview to the goals of the feature and the basic layout of
commit-graph chains.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
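The interval arithmetic in the "File Layout" section can be sketched as
follows. This is only an illustration under stated assumptions: the helper
`find_layer()` and its `counts` array (X0, X1, X2, ... with the base layer
first) are hypothetical and do not appear anywhere in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of this series: given the commit
 * counts of each layer (base first), find which layer holds graph
 * position "pos" and the index of that commit within the layer.
 * Returns the layer number, or -1 if "pos" is past the end of the chain.
 */
static int find_layer(const uint32_t *counts, size_t nr_layers,
		      uint32_t pos, uint32_t *local_index)
{
	uint32_t start = 0; /* first graph position of the current layer */
	size_t i;

	for (i = 0; i < nr_layers; i++) {
		/* layer i covers the interval [start, start + counts[i]) */
		if (pos - start < counts[i]) {
			*local_index = pos - start;
			return (int)i;
		}
		start += counts[i];
	}

	return -1;
}
```

With counts {5, 3, 2}, position 6 falls in [5, 8), so it would resolve to
layer 1 with local index 1.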
 Documentation/technical/commit-graph.txt | 59 ++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..1dca3bd8fe 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,65 @@ Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Commit Graphs Chains
+--------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over time,
+the number of commits added by a fetch operation is much smaller than the
+number of commits in the full history. By creating a "chain" of commit-graphs,
+we enable fast writes of new commit data without rewriting the entire commit
+history -- at least, most of the time.
+
+## File Layout
+
+A commit-graph chain uses multiple files, and we use a fixed naming convention
+to organize these files. Each commit-graph file has a name
+`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
+valued hash stored in the footer of that file (which is a hash of the file's
+contents before that hash). For a chain of commit-graph files, a plain-text
+file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
+hashes for the files in order from "lowest" to "highest".
+
+For example, if the `commit-graph-chain` file contains the lines
+
+```
+	{hash0}
+	{hash1}
+	{hash2}
+```
+
+then the commit-graph chain looks like the following diagram:
+
+ +-----------------------+
+ |  graph-{hash2}.graph  |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  graph-{hash1}.graph  |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}.graph  |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
+commits in `graph-{hash1}.graph`, and X2 be the number of commits in
+`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
+then we interpret this as being the commit in position (X0 + X1 + i), and that
+will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
+positions to refer to their parents, which may be in `graph-{hash1}.graph` or
+`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
+its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
+X2).
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
-- 
gitgitgadget



* [PATCH v5 02/16] commit-graph: prepare for commit-graph chains
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 01/16] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` " Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 03/16] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
                           ` (15 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To prepare for a chain of commit-graph files, augment the
commit_graph struct to point to a base commit_graph. As we load
commits from the graph, we may actually want to read from a base
file according to the graph position.

The "graph position" of a commit is given by concatenating the
lexicographic commit orders from each of the commit-graph files in
the chain. This means that we must distinguish two values:

 * lexicographic index: the position within the lexicographic
   order in a single commit-graph file.

 * graph position: the position within the concatenated order
   of multiple commit-graph files.

Given the lexicographic index of a commit in a graph, we can
compute the graph position by adding the number of commits in
the lower-level graphs. To recover the lexicographic index from
a graph position, we subtract the number of commits in the
lower-level graphs.

While here, change insert_parent_or_die() to take a uint32_t
position, as that is the type used by its only caller and that
makes more sense with the limits in the commit-graph format.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
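A minimal sketch of the two conversions described above, assuming a toy
`struct layer` that keeps only the fields this patch adds to
`struct commit_graph`. The names `struct layer`, `graph_pos_from_lex()`
and `lex_from_graph_pos()` are made up for illustration; the real code
does the same walk inline in each reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct commit_graph; only the fields used here. */
struct layer {
	uint32_t num_commits;         /* commits stored in this file */
	uint32_t num_commits_in_base; /* commits in all lower layers */
	struct layer *base;           /* next lower layer, or NULL */
};

/* lexicographic index within "g" -> graph position in the chain */
static uint32_t graph_pos_from_lex(const struct layer *g, uint32_t lex_index)
{
	return lex_index + g->num_commits_in_base;
}

/*
 * graph position -> owning layer and lexicographic index.
 * Returns NULL if "pos" is out of range for the chain ending at "tip".
 */
static const struct layer *lex_from_graph_pos(const struct layer *tip,
					      uint32_t pos,
					      uint32_t *lex_index)
{
	const struct layer *g = tip;

	if (!g || pos >= g->num_commits + g->num_commits_in_base)
		return NULL;

	/* descend until the layer's interval starts at or before pos */
	while (g && pos < g->num_commits_in_base)
		g = g->base;

	if (!g)
		return NULL;

	*lex_index = pos - g->num_commits_in_base;
	return g;
}
```

Returning the owning layer together with the index keeps the caller from
repeating the walk when it needs to read chunk data from that layer.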
 commit-graph.c | 89 +++++++++++++++++++++++++++++++++++++++++++-------
 commit-graph.h |  3 ++
 2 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7723156964..8c3598037b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -359,9 +359,18 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+static void close_commit_graph_one(struct commit_graph *g)
+{
+	if (!g)
+		return;
+
+	close_commit_graph_one(g->base_graph);
+	free_commit_graph(g);
+}
+
 void close_commit_graph(struct repository *r)
 {
-	free_commit_graph(r->objects->commit_graph);
+	close_commit_graph_one(r->objects->commit_graph);
 	r->objects->commit_graph = NULL;
 }
 
@@ -371,18 +380,38 @@ static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t
 			    g->chunk_oid_lookup, g->hash_len, pos);
 }
 
+static void load_oid_from_graph(struct commit_graph *g,
+				uint32_t pos,
+				struct object_id *oid)
+{
+	uint32_t lex_index;
+
+	while (g && pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!g)
+		BUG("NULL commit-graph");
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
+	lex_index = pos - g->num_commits_in_base;
+
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
+}
+
 static struct commit_list **insert_parent_or_die(struct repository *r,
 						 struct commit_graph *g,
-						 uint64_t pos,
+						 uint32_t pos,
 						 struct commit_list **pptr)
 {
 	struct commit *c;
 	struct object_id oid;
 
-	if (pos >= g->num_commits)
-		die("invalid parent position %"PRIu64, pos);
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die("invalid parent position %"PRIu32, pos);
 
-	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	load_oid_from_graph(g, pos, &oid);
 	c = lookup_commit(r, &oid);
 	if (!c)
 		die(_("could not find commit %s"), oid_to_hex(&oid));
@@ -392,7 +421,14 @@ static struct commit_list **insert_parent_or_die(struct repository *r,
 
 static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
 {
-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
+
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	lex_index = pos - g->num_commits_in_base;
+	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
 	item->graph_pos = pos;
 	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
 }
@@ -405,10 +441,25 @@ static int fill_commit_in_graph(struct repository *r,
 	uint32_t *parent_data_ptr;
 	uint64_t date_low, date_high;
 	struct commit_list **pptr;
-	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+	const unsigned char *commit_data;
+	uint32_t lex_index;
 
-	item->object.parsed = 1;
+	while (pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (pos >= g->num_commits + g->num_commits_in_base)
+		die(_("invalid commit position. commit-graph is likely corrupt"));
+
+	/*
+	 * Store the "full" position, but then use the
+	 * "local" position for the rest of the calculation.
+	 */
 	item->graph_pos = pos;
+	lex_index = pos - g->num_commits_in_base;
+
+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
+
+	item->object.parsed = 1;
 
 	item->maybe_tree = NULL;
 
@@ -452,7 +503,18 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(g, &(item->object.oid), pos);
+		struct commit_graph *cur_g = g;
+		uint32_t lex_index;
+
+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
+			cur_g = cur_g->base_graph;
+
+		if (cur_g) {
+			*pos = lex_index + cur_g->num_commits_in_base;
+			return 1;
+		}
+
+		return 0;
 	}
 }
 
@@ -492,8 +554,13 @@ static struct tree *load_tree_for_commit(struct repository *r,
 					 struct commit *c)
 {
 	struct object_id oid;
-	const unsigned char *commit_data = g->chunk_commit_data +
-					   GRAPH_DATA_WIDTH * (c->graph_pos);
+	const unsigned char *commit_data;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	commit_data = g->chunk_commit_data +
+			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
 
 	hashcpy(oid.hash, commit_data);
 	c->maybe_tree = lookup_tree(r, &oid);
diff --git a/commit-graph.h b/commit-graph.h
index 70f4caf0c7..f9fe32ebe3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,6 +48,9 @@ struct commit_graph {
 	uint32_t num_commits;
 	struct object_id oid;
 
+	uint32_t num_commits_in_base;
+	struct commit_graph *base_graph;
+
 	const uint32_t *chunk_oid_fanout;
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
-- 
gitgitgadget



* [PATCH v5 04/16] commit-graph: load commit-graph chains
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (2 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 03/16] commit-graph: rename commit_compare to oid_compare Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-10 21:47           ` Junio C Hamano
  2019-06-07 18:38         ` [PATCH v5 05/16] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
                           ` (13 subsequent siblings)
  17 siblings, 1 reply; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Prepare the logic for reading a chain of commit-graphs.

First, look for a file at $OBJDIR/info/commit-graph. If it exists,
then use that file and stop.

Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain.
If this file exists, then load the hash values as line-separated values in that
file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i]
in that file. The file is given in order, so the first hash corresponds to the
"base" file and the final hash corresponds to the "tip" file.

This implementation assumes that all of the graph-{hash}.graph files are in
the same object directory as the commit-graph-chain file. This will be updated
in a future change. This change is purposefully simple so we can isolate the
different concerns.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
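To make the chain-file format concrete, here is a rough sketch of turning
one line of commit-graph-chain into a layer filename. It assumes SHA-1
(40 hex digits per line) and works on an in-memory copy of the file;
`is_hex_hash()` and `layer_path()` are hypothetical names and are stricter
than the patch, which relies on get_oid_hex().

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HEXSZ 40 /* assumption: SHA-1, 40 hex digits per line */

/* Return 1 if "line" is exactly HEXSZ hex digits, else 0. */
static int is_hex_hash(const char *line, size_t len)
{
	size_t i;

	if (len != HEXSZ)
		return 0;
	for (i = 0; i < len; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	return 1;
}

/*
 * Walk the newline-separated contents of a commit-graph-chain file
 * (base layer first) and write the path of layer "n" into "out".
 * Returns 0 on success, -1 if a line is malformed, the chain has
 * fewer than n + 1 lines, or "out" is too small.
 */
static int layer_path(const char *chain, const char *obj_dir, int n,
		      char *out, size_t outsz)
{
	const char *p = chain;
	int i;

	for (i = 0; *p; i++) {
		const char *eol = strchr(p, '\n');
		size_t len = eol ? (size_t)(eol - p) : strlen(p);

		if (!is_hex_hash(p, len))
			return -1;

		if (i == n) {
			int w = snprintf(out, outsz,
					 "%s/info/commit-graphs/graph-%.*s.graph",
					 obj_dir, (int)len, p);
			return (w < 0 || (size_t)w >= outsz) ? -1 : 0;
		}

		p = eol ? eol + 1 : p + len;
	}

	return -1;
}
```

Layer 0 is the "base" file and the last line names the "tip" file,
matching the order described in the commit message.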
 commit-graph.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 106 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 70d5889892..263c73282e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,19 @@ char *get_commit_graph_filename(const char *obj_dir)
 	return xstrfmt("%s/info/commit-graph", obj_dir);
 }
 
+static char *get_split_graph_filename(const char *obj_dir,
+				      const char *oid_hex)
+{
+	return xstrfmt("%s/info/commit-graphs/graph-%s.graph",
+		       obj_dir,
+		       oid_hex);
+}
+
+static char *get_chain_filename(const char *obj_dir)
+{
+	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", obj_dir);
+}
+
 static uint8_t oid_version(void)
 {
 	return 1;
@@ -286,18 +299,105 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 	return load_commit_graph_one_fd_st(fd, &st);
 }
 
+static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
+{
+	char *graph_name = get_commit_graph_filename(obj_dir);
+	struct commit_graph *g = load_commit_graph_one(graph_name);
+	free(graph_name);
+
+	return g;
+}
+
+static int add_graph_to_chain(struct commit_graph *g,
+			      struct commit_graph *chain,
+			      struct object_id *oids,
+			      int n)
+{
+	struct commit_graph *cur_g = chain;
+
+	while (n) {
+		n--;
+		cur_g = cur_g->base_graph;
+	}
+
+	g->base_graph = chain;
+
+	if (chain)
+		g->num_commits_in_base = chain->num_commits + chain->num_commits_in_base;
+
+	return 1;
+}
+
+static struct commit_graph *load_commit_graph_chain(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *graph_chain = NULL;
+	struct strbuf line = STRBUF_INIT;
+	struct stat st;
+	struct object_id *oids;
+	int i = 0, valid = 1, count;
+	char *chain_name = get_chain_filename(obj_dir);
+	FILE *fp;
+	int stat_res;
+
+	fp = fopen(chain_name, "r");
+	stat_res = stat(chain_name, &st);
+	free(chain_name);
+
+	if (!fp ||
+	    stat_res ||
+	    st.st_size <= the_hash_algo->hexsz)
+		return NULL;
+
+	count = st.st_size / (the_hash_algo->hexsz + 1);
+	oids = xcalloc(count, sizeof(struct object_id));
+
+	for (i = 0; i < count && valid; i++) {
+		char *graph_name;
+		struct commit_graph *g;
+
+		if (strbuf_getline_lf(&line, fp) == EOF)
+			break;
+
+		if (get_oid_hex(line.buf, &oids[i])) {
+			warning(_("invalid commit-graph chain: line '%s' not a hash"),
+				line.buf);
+			valid = 0;
+			break;
+		}
+
+		graph_name = get_split_graph_filename(obj_dir, line.buf);
+		g = load_commit_graph_one(graph_name);
+		free(graph_name);
+
+		if (g && add_graph_to_chain(g, graph_chain, oids, i))
+			graph_chain = g;
+		else
+			valid = 0;
+	}
+
+	free(oids);
+	fclose(fp);
+
+	return graph_chain;
+}
+
+static struct commit_graph *read_commit_graph_one(struct repository *r, const char *obj_dir)
+{
+	struct commit_graph *g = load_commit_graph_v1(r, obj_dir);
+
+	if (!g)
+		g = load_commit_graph_chain(r, obj_dir);
+
+	return g;
+}
+
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 {
-	char *graph_name;
 
 	if (r->objects->commit_graph)
 		return;
 
-	graph_name = get_commit_graph_filename(obj_dir);
-	r->objects->commit_graph =
-		load_commit_graph_one(graph_name);
-
-	FREE_AND_NULL(graph_name);
+	r->objects->commit_graph = read_commit_graph_one(r, obj_dir);
 }
 
 /*
-- 
gitgitgadget



* [PATCH v5 03/16] commit-graph: rename commit_compare to oid_compare
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 01/16] commit-graph: document commit-graph chains Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 02/16] commit-graph: prepare for " Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 04/16] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
                           ` (14 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The helper function commit_compare() actually compares object_id
structs, not commits. A future change to commit-graph.c will need
to sort commit structs, so rename this function in advance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8c3598037b..70d5889892 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -770,7 +770,7 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
-static int commit_compare(const void *_a, const void *_b)
+static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
 	const struct object_id *b = (const struct object_id *)_b;
@@ -1039,7 +1039,7 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 			_("Counting distinct commits in commit graph"),
 			ctx->oids.nr);
 	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
-	QSORT(ctx->oids.list, ctx->oids.nr, commit_compare);
+	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-- 
gitgitgadget



* [PATCH v5 05/16] commit-graph: add base graphs chunk
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (3 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 04/16] commit-graph: load commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 06/16] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
                           ` (12 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To quickly verify a commit-graph chain is valid on load, we will
read from the new "Base Graphs Chunk" of each file in the chain.
This will prevent accidentally loading incorrect data from manually
editing the commit-graph-chain file or renaming graph-{hash}.graph
files.

The commit_graph struct already had an object_id struct "oid", but
it was never initialized or used. Add a line to read the hash from
the end of the commit-graph file and into the oid member.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
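The check this patch adds to add_graph_to_chain() can be sketched on its
own as below. This is a simplified model, not the patch's code: it assumes
SHA-1 raw hashes (H = 20), stores the chain-file hashes as one flat byte
array, and uses the invented names `struct layer` and `chain_matches()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: SHA-1 raw hash size (H) */

/* Toy stand-in for struct commit_graph; only the fields used here. */
struct layer {
	unsigned char oid[HASH_LEN];      /* trailer hash of this file */
	const unsigned char *base_graphs; /* B * HASH_LEN bytes, or NULL */
	struct layer *base;               /* next lower layer, or NULL */
};

/*
 * Check layer "g", which is about to become layer "n" of the chain,
 * against the n hashes already read from the chain file ("oids",
 * base first, HASH_LEN bytes each) and against the loaded layers
 * below it ("chain" is the current tip). Returns 1 on a match.
 */
static int chain_matches(const struct layer *g, const struct layer *chain,
			 const unsigned char *oids, int n)
{
	const struct layer *cur = chain;

	/* a non-base layer must carry a Base Graphs chunk */
	if (n && !g->base_graphs)
		return 0;

	while (n) {
		const unsigned char *expect;

		n--;
		expect = oids + (size_t)HASH_LEN * n;

		/* chain file, loaded layer, and BASE chunk must all agree */
		if (!cur ||
		    memcmp(expect, cur->oid, HASH_LEN) ||
		    memcmp(expect, g->base_graphs + (size_t)HASH_LEN * n, HASH_LEN))
			return 0;

		cur = cur->base;
	}

	return 1;
}
```

Comparing all three sources is what lets the loader notice both a
hand-edited commit-graph-chain file and a renamed graph-{hash}.graph file.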
 .../technical/commit-graph-format.txt         | 11 ++++++++--
 commit-graph.c                                | 22 +++++++++++++++++++
 commit-graph.h                                |  1 +
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 16452a0504..a4f17441ae 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -44,8 +44,9 @@ HEADER:
 
   1-byte number (C) of "chunks"
 
-  1-byte (reserved for later use)
-     Current clients should ignore this value.
+  1-byte number (B) of base commit-graphs
+      We infer the length (H*B) of the Base Graphs chunk
+      from this value.
 
 CHUNK LOOKUP:
 
@@ -92,6 +93,12 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
+      This list of H-byte hashes describes a set of B commit-graph files that
+      form a commit-graph chain. The graph position for the ith commit in this
+      file's OID Lookup chunk is equal to i plus the number of commits in all
+      base graphs.  If B is non-zero, this chunk must exist.
+
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
diff --git a/commit-graph.c b/commit-graph.c
index 263c73282e..0ff1a4f379 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -22,6 +22,7 @@
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -262,6 +263,12 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 			else
 				graph->chunk_extra_edges = data + chunk_offset;
 			break;
+
+		case GRAPH_CHUNKID_BASE:
+			if (graph->chunk_base_graphs)
+				chunk_repeated = 1;
+			else
+				graph->chunk_base_graphs = data + chunk_offset;
 		}
 
 		if (chunk_repeated) {
@@ -280,6 +287,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		last_chunk_offset = chunk_offset;
 	}
 
+	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
+
 	if (verify_commit_graph_lite(graph))
 		return NULL;
 
@@ -315,8 +324,21 @@ static int add_graph_to_chain(struct commit_graph *g,
 {
 	struct commit_graph *cur_g = chain;
 
+	if (n && !g->chunk_base_graphs) {
+		warning(_("commit-graph has no base graphs chunk"));
+		return 0;
+	}
+
 	while (n) {
 		n--;
+
+		if (!cur_g ||
+		    !oideq(&oids[n], &cur_g->oid) ||
+		    !hasheq(oids[n].hash, g->chunk_base_graphs + g->hash_len * n)) {
+			warning(_("commit-graph chain does not match"));
+			return 0;
+		}
+
 		cur_g = cur_g->base_graph;
 	}
 
diff --git a/commit-graph.h b/commit-graph.h
index f9fe32ebe3..80f4917ddb 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -55,6 +55,7 @@ struct commit_graph {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
+	const unsigned char *chunk_base_graphs;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget



* [PATCH v5 07/16] commit-graph: write commit-graph chains
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (5 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 06/16] commit-graph: rearrange chunk count logic Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 08/16] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
                           ` (10 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Extend write_commit_graph() to write a commit-graph chain when given the
COMMIT_GRAPH_SPLIT flag.

This implementation is purposefully simplistic in how it creates a new
chain. The commits not already in the chain are added to a new tip
commit-graph file.

Much of the logic around writing a graph-{hash}.graph file and updating
the commit-graph-chain file is the same as the commit-graph file case.
However, there are several places where we need to do some extra logic
in the split case.

Track the list of graph filenames before and after the planned write.
This will be more important when we start merging graph files, but it
also allows us to upgrade our commit-graph file to the appropriate
graph-{hash}.graph file when we upgrade to a chain of commit-graphs.

Note that we use the eighth byte of the commit-graph header to store the
number of base graph files. This determines the length of the base
graphs chunk.
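
As an illustration of the header layout described above, here is a minimal
sketch (not code from this patch) that reads the 8-byte header and derives
the base graphs chunk length from its last byte. The struct and function
names are hypothetical; the field order follows the hashwrite_*() calls in
write_commit_graph_file().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical view of the 8-byte commit-graph header:
 * 4-byte signature, then version, hash version, chunk count,
 * and (with this patch) the number of base graph files.
 */
struct example_header {
	uint32_t signature;
	uint8_t version;
	uint8_t hash_version;
	uint8_t num_chunks;
	uint8_t num_base_graphs;
};

/* Decode the header from the first 8 bytes of a mapped file. */
static struct example_header example_parse_header(const unsigned char *data)
{
	struct example_header h;

	/* The signature is stored big-endian. */
	h.signature = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
		      ((uint32_t)data[2] << 8) | (uint32_t)data[3];
	h.version = data[4];
	h.hash_version = data[5];
	h.num_chunks = data[6];
	h.num_base_graphs = data[7]; /* the eighth byte */
	return h;
}

/* The base graphs chunk holds one raw hash per base graph file. */
static size_t example_base_chunk_len(const struct example_header *h,
				     size_t hash_len)
{
	return (size_t)h->num_base_graphs * hash_len;
}
```

With a 20-byte hash and two base graphs, the chunk would be 40 bytes long.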

A subtle change of behavior with the new logic is that we do not write a
commit-graph if our commit list is empty. This also applies to the
non-split case, which is reflected in t5318-commit-graph.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 286 ++++++++++++++++++++++++++++++++++++++--
 commit-graph.h          |   2 +
 t/t5318-commit-graph.sh |   2 +-
 3 files changed, 278 insertions(+), 12 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 5f25dff193..eb6d79567a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -300,12 +300,18 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file)
 
 	struct stat st;
 	int fd;
+	struct commit_graph *g;
 	int open_ok = open_commit_graph(graph_file, &fd, &st);
 
 	if (!open_ok)
 		return NULL;
 
-	return load_commit_graph_one_fd_st(fd, &st);
+	g = load_commit_graph_one_fd_st(fd, &st);
+
+	if (g)
+		g->filename = xstrdup(graph_file);
+
+	return g;
 }
 
 static struct commit_graph *load_commit_graph_v1(struct repository *r, const char *obj_dir)
@@ -730,8 +736,19 @@ struct write_commit_graph_context {
 	struct progress *progress;
 	int progress_done;
 	uint64_t progress_cnt;
+
+	char *base_graph_name;
+	int num_commit_graphs_before;
+	int num_commit_graphs_after;
+	char **commit_graph_filenames_before;
+	char **commit_graph_filenames_after;
+	char **commit_graph_hash_after;
+	uint32_t new_num_commits_in_base;
+	struct commit_graph *new_base_graph;
+
 	unsigned append:1,
-		 report_progress:1;
+		 report_progress:1,
+		 split:1;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -801,6 +818,16 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.nr,
 					      commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -821,6 +848,17 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 					      ctx->commits.list,
 					      ctx->commits.nr,
 					      commit_to_sha1);
+
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -878,6 +916,16 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 						  ctx->commits.nr,
 						  commit_to_sha1);
 
+			if (edge_value >= 0)
+				edge_value += ctx->new_num_commits_in_base;
+			else {
+				uint32_t pos;
+				if (find_commit_in_graph(parent->item,
+							 ctx->new_base_graph,
+							 &pos))
+					edge_value = pos;
+			}
+
 			if (edge_value < 0)
 				BUG("missing parent %s for commit %s",
 				    oid_to_hex(&parent->item->object.oid),
@@ -969,7 +1017,13 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
 
-		if (commit && !parse_commit_no_graph(commit))
+		if (!commit)
+			continue;
+		if (ctx->split) {
+			if (!parse_commit(commit) &&
+			    commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+				add_missing_parents(ctx, commit);
+		} else if (!parse_commit_no_graph(commit))
 			add_missing_parents(ctx, commit);
 	}
 	stop_progress(&ctx->progress);
@@ -1165,8 +1219,16 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 
 	for (i = 1; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
-		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
+		if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) {
+			if (ctx->split) {
+				struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+				if (!c || c->graph_pos != COMMIT_NOT_FROM_GRAPH)
+					continue;
+			}
+
 			count_distinct++;
+		}
 	}
 	stop_progress(&ctx->progress);
 
@@ -1189,7 +1251,13 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 		if (i > 0 && oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i]))
 			continue;
 
+		ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
 		ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.list[i]);
+
+		if (ctx->split &&
+		    ctx->commits.list[ctx->commits.nr]->graph_pos != COMMIT_NOT_FROM_GRAPH)
+			continue;
+
 		parse_commit_no_graph(ctx->commits.list[ctx->commits.nr]);
 
 		for (parent = ctx->commits.list[ctx->commits.nr]->parents;
@@ -1204,18 +1272,86 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static int write_graph_chunk_base_1(struct hashfile *f,
+				    struct commit_graph *g)
+{
+	int num = 0;
+
+	if (!g)
+		return 0;
+
+	num = write_graph_chunk_base_1(f, g->base_graph);
+	hashwrite(f, g->oid.hash, the_hash_algo->rawsz);
+	return num + 1;
+}
+
+static int write_graph_chunk_base(struct hashfile *f,
+				  struct write_commit_graph_context *ctx)
+{
+	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
+
+	if (num != ctx->num_commit_graphs_after - 1) {
+		error(_("failed to write correct number of base graph ids"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t i;
+
+	ctx->new_base_graph = g;
+	ctx->base_graph_name = xstrdup(g->filename);
+	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
+
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	if (ctx->num_commit_graphs_before)
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
+			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
+
+	i = ctx->num_commit_graphs_before - 1;
+
+	while (g) {
+		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+		i--;
+		g = g->base_graph;
+	}
+}
+
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[5];
-	uint64_t chunk_offsets[5];
+	uint32_t chunk_ids[6];
+	uint64_t chunk_offsets[6];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
+	struct object_id file_hash;
+
+	if (ctx->split) {
+		struct strbuf tmp_file = STRBUF_INIT;
+
+		strbuf_addf(&tmp_file,
+			    "%s/info/commit-graphs/tmp_graph_XXXXXX",
+			    ctx->obj_dir);
+		ctx->graph_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
+	}
 
-	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
 		UNLEAK(ctx->graph_name);
 		error(_("unable to create leading directories of %s"),
@@ -1223,8 +1359,23 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		return errno;
 	}
 
-	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
-	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	if (ctx->split) {
+		char *lock_name = get_chain_filename(ctx->obj_dir);
+
+		hold_lock_file_for_update(&lk, lock_name, LOCK_DIE_ON_ERROR);
+
+		fd = git_mkstemp_mode(ctx->graph_name, 0444);
+		if (fd < 0) {
+			error(_("unable to create '%s'"), ctx->graph_name);
+			return -1;
+		}
+
+		f = hashfd(fd, ctx->graph_name);
+	} else {
+		hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
+		fd = lk.tempfile->fd;
+		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
@@ -1233,6 +1384,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
+		num_chunks++;
+	}
 
 	chunk_ids[num_chunks] = 0;
 
@@ -1247,13 +1402,18 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->num_commit_graphs_after > 1) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						hashsz * (ctx->num_commit_graphs_after - 1);
+		num_chunks++;
+	}
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
 	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0);
+	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
@@ -1279,11 +1439,67 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->num_commit_graphs_after > 1 &&
+	    write_graph_chunk_base(f, ctx)) {
+		return -1;
+	}
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
 
+	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
+		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
+		char *new_base_name = get_split_graph_filename(ctx->obj_dir, new_base_hash);
+
+		free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+		free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2] = new_base_name;
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
+	}
+
 	close_commit_graph(ctx->r);
-	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
+	if (ctx->split) {
+		FILE *chainf = fdopen_lock_file(&lk, "w");
+		char *final_graph_name;
+		int result;
+
+		close(fd);
+
+		if (!chainf) {
+			error(_("unable to open commit-graph chain file"));
+			return -1;
+		}
+
+		if (ctx->base_graph_name) {
+			result = rename(ctx->base_graph_name,
+					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+
+			if (result) {
+				error(_("failed to rename base commit-graph file"));
+				return -1;
+			}
+		} else {
+			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
+			unlink(graph_name);
+		}
+
+		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1] = xstrdup(oid_to_hex(&file_hash));
+		final_graph_name = get_split_graph_filename(ctx->obj_dir,
+					ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1]);
+		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 1] = final_graph_name;
+
+		result = rename(ctx->graph_name, final_graph_name);
+
+		for (i = 0; i < ctx->num_commit_graphs_after; i++)
+			fprintf(lk.tempfile->fp, "%s\n", ctx->commit_graph_hash_after[i]);
+
+		if (result) {
+			error(_("failed to rename temporary commit-graph file"));
+			return -1;
+		}
+	}
+
 	commit_lock_file(&lk);
 
 	return 0;
@@ -1306,6 +1522,30 @@ int write_commit_graph(const char *obj_dir,
 	ctx->obj_dir = obj_dir;
 	ctx->append = flags & COMMIT_GRAPH_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_PROGRESS ? 1 : 0;
+	ctx->split = flags & COMMIT_GRAPH_SPLIT ? 1 : 0;
+
+	if (ctx->split) {
+		struct commit_graph *g;
+		prepare_commit_graph(ctx->r);
+
+		g = ctx->r->objects->commit_graph;
+
+		while (g) {
+			ctx->num_commit_graphs_before++;
+			g = g->base_graph;
+		}
+
+		if (ctx->num_commit_graphs_before) {
+			ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
+			i = ctx->num_commit_graphs_before;
+			g = ctx->r->objects->commit_graph;
+
+			while (g) {
+				ctx->commit_graph_filenames_before[--i] = xstrdup(g->filename);
+				g = g->base_graph;
+			}
+		}
+	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
@@ -1360,6 +1600,14 @@ int write_commit_graph(const char *obj_dir,
 		goto cleanup;
 	}
 
+	if (!ctx->commits.nr)
+		goto cleanup;
+
+	if (ctx->split)
+		init_commit_graph_chain(ctx);
+	else
+		ctx->num_commit_graphs_after = 1;
+
 	compute_generation_numbers(ctx);
 
 	res = write_commit_graph_file(ctx);
@@ -1368,6 +1616,21 @@ int write_commit_graph(const char *obj_dir,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+
+	if (ctx->commit_graph_filenames_after) {
+		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
+			free(ctx->commit_graph_filenames_after[i]);
+			free(ctx->commit_graph_hash_after[i]);
+		}
+
+		for (i = 0; i < ctx->num_commit_graphs_before; i++)
+			free(ctx->commit_graph_filenames_before[i]);
+
+		free(ctx->commit_graph_filenames_after);
+		free(ctx->commit_graph_filenames_before);
+		free(ctx->commit_graph_hash_after);
+	}
+
 	free(ctx);
 
 	return res;
@@ -1555,5 +1818,6 @@ void free_commit_graph(struct commit_graph *g)
 		g->data = NULL;
 		close(g->graph_fd);
 	}
+	free(g->filename);
 	free(g);
 }
diff --git a/commit-graph.h b/commit-graph.h
index 80f4917ddb..5c48c4f66a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -47,6 +47,7 @@ struct commit_graph {
 	unsigned char num_chunks;
 	uint32_t num_commits;
 	struct object_id oid;
+	char *filename;
 
 	uint32_t num_commits_in_base;
 	struct commit_graph *base_graph;
@@ -71,6 +72,7 @@ int generation_numbers_enabled(struct repository *r);
 
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
+#define COMMIT_GRAPH_SPLIT      (1 << 2)
 
 int write_commit_graph_reachable(const char *obj_dir, unsigned int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3b6fd0d728..063f906b3e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -20,7 +20,7 @@ test_expect_success 'verify graph with no graph file' '
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
-	test_path_is_file info/commit-graph
+	test_path_is_missing info/commit-graph
 '
 
 test_expect_success 'close with correct error on bad input' '
-- 
gitgitgadget



* [PATCH v5 06/16] commit-graph: rearrange chunk count logic
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (4 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 05/16] commit-graph: add base graphs chunk Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 07/16] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
                           ` (11 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The number of chunks in a commit-graph file can change depending on
whether we need the Extra Edges Chunk. We are going to add more optional
chunks, and it will be helpful to rearrange this logic around the chunk
count before doing so.

Specifically, we need to finalize the number of chunks before writing
the commit-graph header. Further, we need to fill out the chunk lookup
table dynamically; indexing by "num_chunks" as each optional chunk is
added makes it easy to add more optional chunks in the future.
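
The approach can be sketched as follows. This is a simplified standalone
illustration, not the patch itself: the struct, the function, and the
EX_* names are hypothetical, and the lookup-entry width (12 bytes) and
fanout size (4 * 256 bytes) are assumptions about the file format. The
chunk sizes follow the expressions used in write_commit_graph_file().

```c
#include <stdint.h>

#define EX_LOOKUP_WIDTH 12         /* assumed: 4-byte id + 8-byte offset */
#define EX_FANOUT_SIZE (4 * 256)   /* assumed: 256 four-byte counters */
#define EX_MAX_CHUNKS 5

#define EX_ID_OIDFANOUT  0x4f494446 /* "OIDF" */
#define EX_ID_OIDLOOKUP  0x4f49444c /* "OIDL" */
#define EX_ID_DATA       0x43444154 /* "CDAT" */
#define EX_ID_EXTRAEDGES 0x45444745 /* "EDGE" */

struct ex_chunk_table {
	int num_chunks;
	uint32_t ids[EX_MAX_CHUNKS + 1];     /* last entry is the 0 terminator */
	uint64_t offsets[EX_MAX_CHUNKS + 1]; /* last entry is the end offset */
};

static void ex_build_chunk_table(struct ex_chunk_table *t, uint32_t hashsz,
				 uint32_t num_commits,
				 uint32_t num_extra_edges)
{
	uint64_t sizes[EX_MAX_CHUNKS];
	int i, n = 0;

	/* Required chunks come first. */
	t->ids[n] = EX_ID_OIDFANOUT;
	sizes[n++] = EX_FANOUT_SIZE;
	t->ids[n] = EX_ID_OIDLOOKUP;
	sizes[n++] = (uint64_t)hashsz * num_commits;
	t->ids[n] = EX_ID_DATA;
	sizes[n++] = ((uint64_t)hashsz + 16) * num_commits;

	/* Each optional chunk bumps the count only when present. */
	if (num_extra_edges) {
		t->ids[n] = EX_ID_EXTRAEDGES;
		sizes[n++] = 4 * (uint64_t)num_extra_edges;
	}

	t->ids[n] = 0;
	t->num_chunks = n;

	/* Only now is the count final, so the header size is known. */
	t->offsets[0] = 8 + (uint64_t)(n + 1) * EX_LOOKUP_WIDTH;
	for (i = 0; i < n; i++)
		t->offsets[i + 1] = t->offsets[i] + sizes[i];
}
```

Adding another optional chunk in this style means adding one more
conditional block before the terminator, without touching the fixed
indices used by the required chunks.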

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0ff1a4f379..5f25dff193 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1213,7 +1213,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint64_t chunk_offsets[5];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = ctx->num_extra_edges ? 4 : 3;
+	int num_chunks = 3;
 
 	ctx->graph_name = get_commit_graph_filename(ctx->obj_dir);
 	if (safe_create_leading_directories(ctx->graph_name)) {
@@ -1226,27 +1226,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hold_lock_file_for_update(&lk, ctx->graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
-	hashwrite_be32(f, GRAPH_SIGNATURE);
-
-	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
-	hashwrite_u8(f, 0); /* unused padding byte */
-
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
 	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
 	chunk_ids[2] = GRAPH_CHUNKID_DATA;
-	if (ctx->num_extra_edges)
-		chunk_ids[3] = GRAPH_CHUNKID_EXTRAEDGES;
-	else
-		chunk_ids[3] = 0;
-	chunk_ids[4] = 0;
+	if (ctx->num_extra_edges) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
+		num_chunks++;
+	}
+
+	chunk_ids[num_chunks] = 0;
 
 	chunk_offsets[0] = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
 	chunk_offsets[2] = chunk_offsets[1] + hashsz * ctx->commits.nr;
 	chunk_offsets[3] = chunk_offsets[2] + (hashsz + 16) * ctx->commits.nr;
-	chunk_offsets[4] = chunk_offsets[3] + 4 * ctx->num_extra_edges;
+
+	num_chunks = 3;
+	if (ctx->num_extra_edges) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						4 * ctx->num_extra_edges;
+		num_chunks++;
+	}
+
+	hashwrite_be32(f, GRAPH_SIGNATURE);
+
+	hashwrite_u8(f, GRAPH_VERSION);
+	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, 0);
 
 	for (i = 0; i <= num_chunks; i++) {
 		uint32_t chunk_write[3];
-- 
gitgitgadget



* [PATCH v5 08/16] commit-graph: add --split option to builtin
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (6 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 07/16] commit-graph: write commit-graph chains Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 09/16] commit-graph: merge commit-graph chains Derrick Stolee via GitGitGadget
                           ` (9 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new "--split" option to the 'git commit-graph write' subcommand. This
option allows the optional behavior of writing a commit-graph chain.

The current behavior will add a tip commit-graph containing any commits that
are not in the existing commit-graph or commit-graph chain. Later changes
will allow merging the chain and expiring out-dated files.

Add a new test script (t5324-split-commit-graph.sh) that demonstrates this
behavior.

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c        |  10 ++-
 commit-graph.c                |  14 ++--
 t/t5324-split-commit-graph.sh | 122 ++++++++++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 8 deletions(-)
 create mode 100755 t/t5324-split-commit-graph.sh

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 828b1a713f..c2c07d3917 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,7 +10,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,7 +25,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -35,9 +35,9 @@ static struct opts_commit_graph {
 	int stdin_packs;
 	int stdin_commits;
 	int append;
+	int split;
 } opts;
 
-
 static int graph_verify(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -156,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "split", &opts.split,
+			N_("allow writing an incremental commit-graph file")),
 		OPT_END(),
 	};
 
@@ -169,6 +171,8 @@ static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 	if (opts.append)
 		flags |= COMMIT_GRAPH_APPEND;
+	if (opts.split)
+		flags |= COMMIT_GRAPH_SPLIT;
 
 	read_replace_refs = 0;
 
diff --git a/commit-graph.c b/commit-graph.c
index eb6d79567a..f94538c3df 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1472,12 +1472,16 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		}
 
 		if (ctx->base_graph_name) {
-			result = rename(ctx->base_graph_name,
-					ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
+			const char *dest = ctx->commit_graph_filenames_after[
+						ctx->num_commit_graphs_after - 2];
 
-			if (result) {
-				error(_("failed to rename base commit-graph file"));
-				return -1;
+			if (strcmp(ctx->base_graph_name, dest)) {
+				result = rename(ctx->base_graph_name, dest);
+
+				if (result) {
+					error(_("failed to rename base commit-graph file"));
+					return -1;
+				}
 			}
 		} else {
 			char *graph_name = get_commit_graph_filename(ctx->obj_dir);
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
new file mode 100755
index 0000000000..ccd24bd22b
--- /dev/null
+++ b/t/t5324-split-commit-graph.sh
@@ -0,0 +1,122 @@
+#!/bin/sh
+
+test_description='split commit graph'
+. ./test-lib.sh
+
+GIT_TEST_COMMIT_GRAPH=0
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+graph_read_expect() {
+	NUM_BASE=0
+	if test ! -z $2
+	then
+		NUM_BASE=$2
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 3 $NUM_BASE
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	git commit-graph read >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'create commits and write commit-graph' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect 3
+'
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output
+	git -c core.commitGraph=false $1 >expect
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'graph exists' commits/3 commits/1
+
+verify_chain_files_exist() {
+	for hash in $(cat $1/commit-graph-chain)
+	do
+		test_path_is_file $1/graph-$hash.graph || return 1
+	done
+}
+
+test_expect_success 'add more commits, and write a new base graph' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git commit-graph write --reachable &&
+	graph_read_expect 12
+'
+
+test_expect_success 'add three more commits, write a tip graph' '
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
+
+test_expect_success 'add one commit, write a tip graph' '
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 3 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
+
+test_done
-- 
gitgitgadget



* [PATCH v5 09/16] commit-graph: merge commit-graph chains
  2019-06-07 18:38       ` [PATCH v5 00/16] " Derrick Stolee via GitGitGadget
                           ` (7 preceding siblings ...)
  2019-06-07 18:38         ` [PATCH v5 08/16] commit-graph: add --split option to builtin Derrick Stolee via GitGitGadget
@ 2019-06-07 18:38         ` Derrick Stolee via GitGitGadget
  2019-06-07 18:38         ` [PATCH v5 10/16] commit-graph: allow cross-alternate chains Derrick Stolee via GitGitGadget
                           ` (8 subsequent siblings)
  17 siblings, 0 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-06-07 18:38 UTC (permalink / raw)
  To: git
  Cc: peff, avarab, git, jrnieder, steadmon, johannes.schindelin,
	philipoakley, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
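
To make the O(G log N) cost concrete, here is a rough standalone sketch of
such a lookup. It is not the code in commit-graph.c: the struct and
function names are hypothetical, and plain integers stand in for object
IDs. Each layer is searched with a binary search, and the layers are
visited one after another from the tip down to the base.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one layer of a commit-graph chain. */
struct ex_layer {
	const uint32_t *keys;          /* sorted ascending; stand-in for OIDs */
	uint32_t num_commits;          /* commits stored in this layer */
	uint32_t num_commits_in_base;  /* commits in all lower layers */
	const struct ex_layer *base;   /* next lower layer, or NULL */
};

/*
 * Search each layer in turn. On success, return 1 and store the
 * position in the concatenated order of all layers in *pos.
 */
static int ex_find_in_chain(const struct ex_layer *tip, uint32_t key,
			    uint32_t *pos)
{
	const struct ex_layer *g;

	for (g = tip; g; g = g->base) {        /* linear in G */
		uint32_t lo = 0, hi = g->num_commits;

		while (lo < hi) {              /* logarithmic in layer size */
			uint32_t mid = lo + (hi - lo) / 2;

			if (g->keys[mid] == key) {
				*pos = g->num_commits_in_base + mid;
				return 1;
			}
			if (g->keys[mid] < key)
				lo = mid + 1;
			else
				hi = mid;
		}
	}
	return 0;
}
```

A key that is missing from the chain, or stored in the base layer, pays
for a binary search in every layer above it, which is why the number of
layers needs to stay small.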

To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:

  1. If the number of commits we are adding is more than half the number
     of commits in the graph below, then merge with that graph.

  2. If we are writing more than 64,000 commits into a single graph,
     then merge with all lower graphs.

The numeric values in the conditions above are currently constant, but
can become config options in a future update.
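
The two conditions can be sketched as a single loop over the existing
layers. This is a simplified illustration under stated assumptions, not
the implementation in this patch: the function and macro names are
hypothetical, layer sizes are passed as a plain array with the tip first,
and the boundary case (a layer exactly twice the size of the pending
commits) is treated as a merge, which is one possible reading of the
first condition.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_SIZE_MULT 2       /* "half the size" expressed as a multiple */
#define EX_MAX_COMMITS 64000 /* threshold from the second condition */

/*
 * Return how many existing layers (counted from the tip) would be
 * merged with the new commits. layer_sizes[0] is the current tip.
 */
static size_t ex_layers_to_merge(const uint32_t *layer_sizes,
				 size_t num_layers, uint64_t new_commits)
{
	uint64_t pending = new_commits;
	size_t merged = 0;

	while (merged < num_layers &&
	       ((uint64_t)layer_sizes[merged] <= EX_SIZE_MULT * pending ||
		pending > EX_MAX_COMMITS)) {
		/* Merging a layer grows the set compared to the next one. */
		pending += layer_sizes[merged];
		merged++;
	}
	return merged;
}
```

Because the pending count grows with every merged layer, one merge can
trigger the next, and once the pending count exceeds the maximum the loop
continues down to the base. The sketch leaves out the check that merged
commits still exist in the object store.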

As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.

After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt |  80 ++++++++++
 commit-graph.c                           | 180 ++++++++++++++++++-----
 t/t5324-split-commit-graph.sh            |  13 ++
 3 files changed, 240 insertions(+), 33 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 1dca3bd8fe..d9c6253b0a 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -186,6 +186,86 @@ positions to refer to their parents, which may be in `graph-{hash1}.graph` or
 its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
 X2).
 
+Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
+specifying the hashes of all files in the lower layers. In the above example,
+`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
+`{hash0}` and `{hash1}`.
+
+## Merging commit-graph files
+
+If we only added a new commit-graph file on every write, we would run into a
+linear search problem through many commit-graph files.  Instead, we use a merge
+strategy to decide when the stack should collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits are added, it
+is determined by the merge strategy that the files should collapse to
+`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
+the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
+file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |     graph-{hash2}     |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |     graph-{hash1}     |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                  tmp_graphXXX
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  graph-{hash0}        |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined and sorted, and we write
+the contents to a temporary file, all while holding a `commit-graph-chain.lock`
+lock-file.  When the file is flushed, we rename it to `graph-{hash3}`
+according to the computed `{hash3}`. Finally, we write the new chain data to
+`commit-graph-chain.lock`:
+
+```
+	{hash3}
+	{hash0}
+```
+
+We then close the lock-file.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the commit-graph stack of
+height N, we default to creating a new file at level N + 1. We then decide to
+merge with the Nth level if either of two conditions holds:
+
+  1. The expected file size for level N + 1 is at least half the file size for
+     level N.
+
+  2. Level N + 1 contains more than 64,000 commits.
+
+This decision cascades down the levels: when we merge a level, the combined
+set of commits is then compared against the next level down.
+
+The first condition bounds the number of levels to be logarithmic in the total
+number of commits.  The second condition bounds the total number of commits in
+a `graph-{hashN}` file and not in the `commit-graph` file, preventing
+significant performance issues when the stack merges and another process only
+partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple, 64,000 for the maximum
+number of commits) could be extracted into config settings for full
+flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8
diff --git a/commit-graph.c b/commit-graph.c
index f94538c3df..288b5e0280 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1298,36 +1298,6 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-static void init_commit_graph_chain(struct write_commit_graph_context *ctx)
-{
-	struct commit_graph *g = ctx->r->objects->commit_graph;
-	uint32_t i;
-
-	ctx->new_base_graph = g;
-	ctx->base_graph_name = xstrdup(g->filename);
-	ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
-
-	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
-
-	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
-	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
-
-	for (i = 0; i < ctx->num_commit_graphs_before - 1; i++)
-		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
-
-	if (ctx->num_commit_graphs_before)
-		ctx->commit_graph_filenames_after[ctx->num_commit_graphs_before - 1] =
-			get_split_graph_filename(ctx->obj_dir, oid_to_hex(&g->oid));
-
-	i = ctx->num_commit_graphs_before - 1;
-
-	while (g) {
-		ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
-		i--;
-		g = g->base_graph;
-	}
-}
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
@@ -1509,6 +1479,145 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
+static int split_strategy_max_commits = 64000;
+static float split_strategy_size_mult = 2.0f;
+
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t num_commits = ctx->commits.nr;
+	uint32_t i;
+
+	g = ctx->r->objects->commit_graph;
+	ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
+
+	while (g && (g->num_commits <= split_strategy_size_mult * num_commits ||
+		     num_commits > split_strategy_max_commits)) {
+		num_commits += g->num_commits;
+		g = g->base_graph;
+
+		ctx->num_commit_graphs_after--;
+	}
+
+	ctx->new_base_graph = g;
+
+	ALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
+	ALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
+
+	for (i = 0; i < ctx->num_commit_graphs_after &&
+		    i < ctx->num_commit_graphs_before; i++)
+		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
+
+	i = ctx->num_commit_graphs_before - 1;
+	g = ctx->r->objects->commit_graph;
+
+	while (g) {
+		if (i < ctx->num_commit_graphs_after)
+			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+
+		i--;
+		g = g->base_graph;
+	}
+}
+
+static void merge_commit_graph(struct write_commit_graph_context *ctx,
+			       struct commit_graph *g)
+{
+	uint32_t i;
+	uint32_t offset = g->num_commits_in_base;
+
+	ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct object_id oid;
+		struct commit *result;
+
+		display_progress(ctx->progress, i + 1);
+
+		load_oid_from_graph(g, i + offset, &oid);
+
+		/* only add commits if they still exist in the repo */
+		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+
+		if (result) {
+			ctx->commits.list[ctx->commits.nr] = result;
+			ctx->commits.nr++;
+		}
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit **)_a;
+	const struct commit *b = *(const struct commit **)_b;
+	return oidcmp(&a->object.oid, &b->object.oid);
+}
+
+static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
+{
+	uint32_t i, num_parents;
+	struct commit_list *parent;
+
+	if (ctx->report_progress)
+		ctx->progress = start_delayed_progress(
+					_("Scanning merged commits"),
+					ctx->commits.nr);
+
+	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+
+	ctx->num_extra_edges = 0;
+	for (i = 0; i < ctx->commits.nr; i++) {
+		display_progress(ctx->progress, i);
+
+		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
+			  &ctx->commits.list[i]->object.oid)) {
+			die(_("unexpected duplicate commit id %s"),
+			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+		} else {
+			num_parents = 0;
+			for (parent = ctx->commits.list[i]->parents; parent; parent = parent->next)
+				num_parents++;
+
+			if (num_parents > 2)
+				ctx->num_extra_edges += num_parents - 2;
+		}
+	}
+
+	stop_progress(&ctx->progress);
+}
+
+static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+{
+	struct commit_graph *g = ctx->r->objects->commit_graph;
+	uint32_t current_graph_number = ctx->num_commit_graphs_before;
+	struct strbuf progress_title = STRBUF_INIT;
+
+	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
+		current_graph_number--;
+
+		if (ctx->report_progress) {
+			strbuf_addstr(&progress_title, _("Merging commit-graph"));
+			ctx->progress = start_delayed_progress(progress_title.buf, 0);
+		}
+
+		merge_commit_graph(ctx, g);
+		stop_progress(&ctx->prog