[PATCH 0/5] Add a new "sparse" tree walk algorithm

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/5] Add a new "sparse" tree walk algorithm
@ 2018-11-28 21:52 Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Derrick Stolee (5):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting

 Documentation/git-pack-objects.txt |   9 +-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |   9 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  51 ++++++++++-
 list-objects.h                     |   4 +-
 revision.c                         | 113 +++++++++++++++++++++++
 revision.h                         |   2 +
 t/t5322-pack-objects-sparse.sh     | 139 +++++++++++++++++++++++++++++
 10 files changed, 323 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/89
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/5] revision: add mark_tree_uninteresting_sparse
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
@ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 22 ++++++++++++++++++++++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *set)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/5] list-objects: consume sparse tree walk
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits.

The mark_edges_unintersting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bisect.c               |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c     |  2 +-
 http-push.c            |  2 +-
 list-objects.c         | 51 ++++++++++++++++++++++++++++++++++++++----
 list-objects.h         |  4 +++-
 6 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..9bb93d1640 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+		struct tree *tree = get_commit_tree(parent);
+		oidset_insert(set, &tree->object.oid);
+
+		if (!(parent->object.flags & UNINTERESTING))
+			continue;
+		tree->object.flags |= UNINTERESTING;
+
+		if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
+			parent->object.flags |= SHOWN;
+			show_edge(parent);
+		}
+	}
+}
+
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse)
 {
 	struct commit_list *list;
+	struct oidset set;
 	int i;
 
+	if (sparse)
+		oidset_init(&set, 16);
+
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *commit = list->item;
+		
+		if (sparse) {
+			struct tree *tree = get_commit_tree(commit);
+			
+			if (commit->object.flags & UNINTERESTING)
+				tree->object.flags |= UNINTERESTING;
 
-		if (commit->object.flags & UNINTERESTING) {
+			oidset_insert(&set, &tree->object.oid);
+			add_edge_parents(commit, revs, show_edge, &set);
+		} else if (commit->object.flags & UNINTERESTING) {
 			mark_tree_uninteresting(revs->repo,
 						get_commit_tree(commit));
 			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
 				commit->object.flags |= SHOWN;
 				show_edge(commit);
 			}
-			continue;
+		} else {
+			mark_edge_parents_uninteresting(commit, revs, show_edge);
 		}
-		mark_edge_parents_uninteresting(commit, revs, show_edge);
 	}
+
+	if (sparse) {
+		mark_trees_uninteresting_sparse(revs->repo, &set);
+		oidset_clear(&set);
+	}
+
 	if (revs->edge_hint_aggressive) {
 		for (i = 0; i < revs->cmdline.nr; i++) {
 			struct object *obj = revs->cmdline.rev[i].item;
diff --git a/list-objects.h b/list-objects.h
index ad40762926..a952680e46 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
 
 typedef void (*show_edge_fn)(struct commit *);
-void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse);
 
 struct oidset;
 struct list_objects_filter_options;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 3/5] pack-objects: add --sparse option
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget
  2018-11-28 22:11   ` Stefan Beller
  2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c             |   5 +-
 t/t5322-pack-objects-sparse.sh     | 115 +++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack. This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, 0);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 0000000000..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+# Demonstrate that both algorithms send "extra" objects because
+# they are not in the frontier.
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+	git checkout topic1 &&
+	echo change-3 >f3/f3/data.txt &&
+	git commit -a -m "Changed f3/f3/data.txt" &&
+	git rev-parse			\
+		topic1~1		\
+		topic1~1^{tree}		\
+		topic1^{tree}		\
+		topic1			\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt	\
+		topic1:f3		\
+		topic1:f3/f3		\
+		topic1:f3/f3/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f1 into f3' '
+	mkdir f3/f4 &&
+	cp -r f1/f1/* f3/f4 &&
+	git add f3/f4 &&
+	git commit -m "Copied f1/f1 to f3/f4" &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic1~1
+	EOF
+	git rev-parse		\
+		topic1		\
+		topic1^{tree}	\
+		topic1:f3 | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 4/5] revision: implement sparse algorithm
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget
  2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion unless the oidset contains some intersting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c                     | 111 ++++++++++++++++++++++++++++++---
 t/t5322-pack-objects-sparse.sh |  21 +++++--
 2 files changed, 116 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index 3a62c7c187..7e4bfe621a 100644
--- a/revision.c
+++ b/revision.c
@@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }

+struct paths_and_oids {
+	struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+	string_list_init(&po->list, 1);
+}
+
+static void paths_and_oids_clear(struct paths_and_oids *po)
+{
+	int i;
+	for (i = 0; i < po->list.nr; i++) {
+		oidset_clear(po->list.items[i].util);
+		free(po->list.items[i].util);
+	}
+
+	string_list_clear(&po->list, 0);
+}
+
+static void paths_and_oids_insert(struct paths_and_oids *po,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	struct string_list_item *item = string_list_insert(&po->list, path);
+	struct oidset *set;
+
+	if (!item->util) {
+		set = xcalloc(1, sizeof(struct oidset));
+		oidset_init(set, 16);
+		item->util = set;
+	} else {
+		set = item->util;
+	}
+
+	oidset_insert(set, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct paths_and_oids *po)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(po, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *set)
 {
+	int i;
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct paths_and_oids po;
 	struct object_id *oid;
 	struct oidset_iter iter;

 	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);

-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
 	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&po);
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &po);
+	}
+
+	for (i = 0; i < po.list.nr; i++)
+		mark_trees_uninteresting_sparse(
+			r, (struct oidset *)po.list.items[i].util);
+
+	paths_and_oids_clear(&po);
 }

 struct commit_stack {
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 81f6805bc3..45dba6e014 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_objects.txt sparse_objects.txt
 '

+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
 	git add f3/f4 &&
 	git commit -m "Copied f1/f1 to f3/f4" &&
-	cat >packinput.txt <<-EOF &&
+	cat >packinput.txt <<-EOF
 	topic1
 	^topic1~1
 	EOF
-	git rev-parse		\
-		topic1		\
-		topic1^{tree}	\
-		topic1:f3 | sort >expect_objects.txt
 '

 test_expect_success 'non-sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3 | sort >expect_objects.txt &&
 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
@@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
 '

 test_expect_success 'sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3		\
+		topic1:f3/f4		\
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	test_cmp expect_objects.txt sparse_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 5/5] pack-objects: create pack.useSparse setting
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget
  2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] pack-objects: add --sparse option
  2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2018-11-28 22:11   ` Stefan Beller
  2018-11-29 14:20     ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Beller @ 2018-11-28 22:11 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Junio C Hamano, Derrick Stolee

> +--sparse::
> +       Use the "sparse" algorithm to determine which objects to include in
> +       the pack. This can have significant performance benefits when computing
> +       a pack to send a small change. However, it is possible that extra
> +       objects are added to the pack-file if the included commits contain
> +       certain types of direct renames.

As a user, where do I find a discussion of these walking algorithms?
(i.e. how can I tell if I can really expect to gain performance benefits,
what are the tradeoffs? "If it were strictly better than the original,
it would be on by default" might be a thought of a user)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
@ 2018-11-28 22:18 ` Ævar Arnfjörð Bjarmason
  2018-11-29  4:05   ` Derrick Stolee
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 51+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-28 22:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff, jrnieder, Junio C Hamano


On Wed, Nov 28 2018, Derrick Stolee via GitGitGadget wrote:

> One of the biggest remaining pain points for users of very large
> repositories is the time it takes to run 'git push'. We inspected some slow
> pushes by our developers and found that the "Enumerating Objects" phase of a
> push was very slow. This is unsurprising, because this is why reachability
> bitmaps exist. However, reachability bitmaps are not available to us because
> of the single pack-file requirement. The bitmap approach is intended for
> servers anyway, and clients have a much different behavior pattern.
>
> Specifically, clients are normally pushing a very small number of objects
> compared to the entire working directory. A typical user changes only a
> small cone of the working directory, so let's use that to our benefit.
>
> Create a new "sparse" mode for 'git pack-objects' that uses the paths that
> introduce new objects to direct our search into the reachable trees. By
> collecting trees at each path, we can then recurse into a path only when
> there are uninteresting and interesting trees at that path. This gains a
> significant performance boost for small topics while presenting a
> possibility of packing extra objects.
>
> The main algorithm change is in patch 4, but is set up a little bit in
> patches 1 and 2.
>
> As demonstrated in the included test script, we see that the existing
> algorithm can send extra objects due to the way we specify the "frontier".
> But we can send even more objects if a user copies objects from one folder
> to another. I say "copy" because a rename would (usually) change the
> original folder and trigger a walk into that path, discovering the objects.
>
> In order to benefit from this approach, the user can opt-in using the
> pack.useSparse config setting. This setting can be overridden using the
> '--no-sparse' option.

This is really interesting. I tested this with:

    diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
    index 124b1bafc4..5c7615f06c 100644
    --- a/builtin/pack-objects.c
    +++ b/builtin/pack-objects.c
    @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av)
    -       mark_edges_uninteresting(&revs, show_edge, sparse);
    +       mark_edges_uninteresting(&revs, show_edge, 1);

To emulate having a GIT_TEST_* mode for this, which seems like a good
idea since it turned up a lot of segfaults in pack-objects. I wasn't
able to get a backtrace for that since it always happens indirectly, and
I didn't dig enough to see how to manually invoke it the right way.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm
  2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason
@ 2018-11-29  4:05   ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2018-11-29  4:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Junio C Hamano

On 11/28/2018 5:18 PM, Ævar Arnfjörð Bjarmason wrote:
> This is really interesting. I tested this with:
>
>      diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>      index 124b1bafc4..5c7615f06c 100644
>      --- a/builtin/pack-objects.c
>      +++ b/builtin/pack-objects.c
>      @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av)
>      -       mark_edges_uninteresting(&revs, show_edge, sparse);
>      +       mark_edges_uninteresting(&revs, show_edge, 1);
>
> To emulate having a GIT_TEST_* mode for this, which seems like a good
> idea since it turned up a lot of segfaults in pack-objects. I wasn't
> able to get a backtrace for that since it always happens indirectly, and
> I didn't dig enough to see how to manually invoke it the right way.

Thanks for double-checking this. I had run a similar test in my 
prototype implementation, but over-simplified some code when rewriting 
it for submission (and then forgot to re-run that test).

Specifically, these null checks are important:

diff --git a/list-objects.c b/list-objects.c
index 9bb93d1640..7e864b4db8 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -232,6 +232,10 @@ static void add_edge_parents(struct commit *commit,
         for (parents = commit->parents; parents; parents = parents->next) {
                 struct commit *parent = parents->item;
                 struct tree *tree = get_commit_tree(parent);
+
+               if (!tree)
+                       continue;
+
                 oidset_insert(set, &tree->object.oid);

                 if (!(parent->object.flags & UNINTERESTING))
@@ -261,6 +265,8 @@ void mark_edges_uninteresting(struct rev_info *revs,

                 if (sparse) {
                         struct tree *tree = get_commit_tree(commit);
+                       if (!tree)
+                               continue;

                         if (commit->object.flags & UNINTERESTING)
                                 tree->object.flags |= UNINTERESTING;

I will definitely include a GIT_TEST_* variable in v2.

Thanks,

-Stolee


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] pack-objects: add --sparse option
  2018-11-28 22:11   ` Stefan Beller
@ 2018-11-29 14:20     ` Derrick Stolee
  2018-11-30  2:39       ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2018-11-29 14:20 UTC (permalink / raw)
  To: Stefan Beller, gitgitgadget
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Jonathan Nieder, Junio C Hamano, Derrick Stolee

On 11/28/2018 5:11 PM, Stefan Beller wrote:
>> +--sparse::
>> +       Use the "sparse" algorithm to determine which objects to include in
>> +       the pack. This can have significant performance benefits when computing
>> +       a pack to send a small change. However, it is possible that extra
>> +       objects are added to the pack-file if the included commits contain
>> +       certain types of direct renames.
> As a user, where do I find a discussion of these walking algorithms?
> (i.e. how can I tell if I can really expect to gain performance benefits,
> what are the tradeoffs? "If it were strictly better than the original,
> it would be on by default" might be a thought of a user)

You're right that having this hidden as an opt-in config variable makes 
it hard to discover as a typical user.

I would argue that we should actually make the config setting true by 
default, and recommend that servers opt-out. Here are my reasons:

1. The vast majority of users are clients.

2. Client users are not likely to know about and tweak these settings.

3. Server users are more likely to keep an eye on the different knobs 
they can tweak.

4. Servers should use the reachability bitmaps, which don't use this 
logic anyway.

While _eventually_ we should make this opt-out, we shouldn't do that 
until it has cooked a while.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 0/6] Add a new "sparse" tree walk algorithm
  2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason
@ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
                     ` (6 more replies)
  6 siblings, 7 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/git-pack-objects.txt |   9 +-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |  10 ++-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  55 +++++++++++-
 list-objects.h                     |   4 +-
 revision.c                         | 121 +++++++++++++++++++++++++
 revision.h                         |   2 +
 t/README                           |   4 +
 t/t5322-pack-objects-sparse.sh     | 139 +++++++++++++++++++++++++++++
 11 files changed, 340 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v1:

 1:  b73b8de98c = 1:  60617681f7 revision: add mark_tree_uninteresting_sparse
 2:  9bf04c748b ! 2:  4527addacb list-objects: consume sparse tree walk
     @@ -116,6 +116,10 @@
      +	for (parents = commit->parents; parents; parents = parents->next) {
      +		struct commit *parent = parents->item;
      +		struct tree *tree = get_commit_tree(parent);
     ++
     ++		if (!tree)
     ++			continue;
     ++
      +		oidset_insert(set, &tree->object.oid);
      +
      +		if (!(parent->object.flags & UNINTERESTING))
     @@ -142,14 +146,14 @@
      +
       	for (list = revs->commits; list; list = list->next) {
       		struct commit *commit = list->item;
     -+		
     + 
     +-		if (commit->object.flags & UNINTERESTING) {
      +		if (sparse) {
      +			struct tree *tree = get_commit_tree(commit);
     -+			
     ++
      +			if (commit->object.flags & UNINTERESTING)
      +				tree->object.flags |= UNINTERESTING;
     - 
     --		if (commit->object.flags & UNINTERESTING) {
     ++
      +			oidset_insert(&set, &tree->object.oid);
      +			add_edge_parents(commit, revs, show_edge, &set);
      +		} else if (commit->object.flags & UNINTERESTING) {
     @@ -189,3 +193,17 @@
       
       struct oidset;
       struct list_objects_filter_options;
     +
     +diff --git a/revision.c b/revision.c
     +--- a/revision.c
     ++++ b/revision.c
     +@@
     + 	while ((oid = oidset_iter_next(&iter))) {
     + 		struct tree *tree = lookup_tree(r, oid);
     + 
     ++		if (!tree)
     ++			continue;
     ++
     + 		if (tree->object.flags & UNINTERESTING) {
     + 			/*
     + 			 * Remove the flag so the next call
 3:  9d6b8f6d06 = 3:  9644f6ff04 pack-objects: add --sparse option
 4:  0725aac4bb ! 4:  c99957d06f revision: implement sparse algorithm
     @@ -157,6 +157,9 @@
      +	struct tree_desc desc;
      +	struct name_entry entry;
      +
     ++	if (!tree)
     ++		return;
     ++
      +	if (parse_tree_gently(tree, 1) < 0)
      +		return;
      +
     @@ -168,13 +171,15 @@
      +
      +			if (tree->object.flags & UNINTERESTING) {
      +				struct tree *child = lookup_tree(r, entry.oid);
     -+				child->object.flags |= UNINTERESTING;
     ++				if (child)
     ++					child->object.flags |= UNINTERESTING;
      +			}
      +			break;
      +		case OBJ_BLOB:
      +			if (tree->object.flags & UNINTERESTING) {
      +				struct blob *child = lookup_blob(r, entry.oid);
     -+				child->object.flags |= UNINTERESTING;
     ++				if (child)
     ++					child->object.flags |= UNINTERESTING;
      +			}
      +			break;
      +		default:
     @@ -201,6 +206,9 @@
      +	       (oid = oidset_iter_next(&iter))) {
       		struct tree *tree = lookup_tree(r, oid);
       
     + 		if (!tree)
     + 			continue;
     + 
      -		if (tree->object.flags & UNINTERESTING) {
      -			/*
      -			 * Remove the flag so the next call
 5:  bbc3f78182 = 5:  d6912188be pack-objects: create pack.useSparse setting
 -:  ---------- > 6:  3d394a9136 pack-objects: create GIT_TEST_PACK_SPARSE

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 22 ++++++++++++++++++++++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *set)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 2/6] list-objects: consume sparse tree walk
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits.

The mark_edges_unintersting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bisect.c               |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c     |  2 +-
 http-push.c            |  2 +-
 list-objects.c         | 55 +++++++++++++++++++++++++++++++++++++++---
 list-objects.h         |  4 ++-
 revision.c             |  3 +++
 7 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..4fbdeca0a4 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+		struct tree *tree = get_commit_tree(parent);
+
+		if (!tree)
+			continue;
+
+		oidset_insert(set, &tree->object.oid);
+
+		if (!(parent->object.flags & UNINTERESTING))
+			continue;
+		tree->object.flags |= UNINTERESTING;
+
+		if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
+			parent->object.flags |= SHOWN;
+			show_edge(parent);
+		}
+	}
+}
+
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse)
 {
 	struct commit_list *list;
+	struct oidset set;
 	int i;
 
+	if (sparse)
+		oidset_init(&set, 16);
+
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *commit = list->item;
 
-		if (commit->object.flags & UNINTERESTING) {
+		if (sparse) {
+			struct tree *tree = get_commit_tree(commit);
+
+			if (commit->object.flags & UNINTERESTING)
+				tree->object.flags |= UNINTERESTING;
+
+			oidset_insert(&set, &tree->object.oid);
+			add_edge_parents(commit, revs, show_edge, &set);
+		} else if (commit->object.flags & UNINTERESTING) {
 			mark_tree_uninteresting(revs->repo,
 						get_commit_tree(commit));
 			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
 				commit->object.flags |= SHOWN;
 				show_edge(commit);
 			}
-			continue;
+		} else {
+			mark_edge_parents_uninteresting(commit, revs, show_edge);
 		}
-		mark_edge_parents_uninteresting(commit, revs, show_edge);
 	}
+
+	if (sparse) {
+		mark_trees_uninteresting_sparse(revs->repo, &set);
+		oidset_clear(&set);
+	}
+
 	if (revs->edge_hint_aggressive) {
 		for (i = 0; i < revs->cmdline.nr; i++) {
 			struct object *obj = revs->cmdline.rev[i].item;
diff --git a/list-objects.h b/list-objects.h
index ad40762926..a952680e46 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
 
 typedef void (*show_edge_fn)(struct commit *);
-void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse);
 
 struct oidset;
 struct list_objects_filter_options;
diff --git a/revision.c b/revision.c
index 3a62c7c187..f9eb6400f1 100644
--- a/revision.c
+++ b/revision.c
@@ -109,6 +109,9 @@ void mark_trees_uninteresting_sparse(struct repository *r,
 	while ((oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);
 
+		if (!tree)
+			continue;
+
 		if (tree->object.flags & UNINTERESTING) {
 			/*
 			 * Remove the flag so the next call
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 3/6] pack-objects: add --sparse option
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c             |   5 +-
 t/t5322-pack-objects-sparse.sh     | 115 +++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack. This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, 0);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 0000000000..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+# Demonstrate that both algorithms send "extra" objects because
+# they are not in the frontier.
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+	git checkout topic1 &&
+	echo change-3 >f3/f3/data.txt &&
+	git commit -a -m "Changed f3/f3/data.txt" &&
+	git rev-parse			\
+		topic1~1		\
+		topic1~1^{tree}		\
+		topic1^{tree}		\
+		topic1			\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt	\
+		topic1:f3		\
+		topic1:f3/f3		\
+		topic1:f3/f3/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f1 into f3' '
+	mkdir f3/f4 &&
+	cp -r f1/f1/* f3/f4 &&
+	git add f3/f4 &&
+	git commit -m "Copied f1/f1 to f3/f4" &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic1~1
+	EOF
+	git rev-parse		\
+		topic1		\
+		topic1^{tree}	\
+		topic1:f3 | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 4/6] revision: implement sparse algorithm
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2018-11-29 14:24   ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion unless the oidset contains some intersting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c                     | 116 ++++++++++++++++++++++++++++++---
 t/t5322-pack-objects-sparse.sh |  21 ++++--
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..971f1bb095 100644
--- a/revision.c
+++ b/revision.c
@@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }

+struct paths_and_oids {
+	struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+	string_list_init(&po->list, 1);
+}
+
+static void paths_and_oids_clear(struct paths_and_oids *po)
+{
+	int i;
+	for (i = 0; i < po->list.nr; i++) {
+		oidset_clear(po->list.items[i].util);
+		free(po->list.items[i].util);
+	}
+
+	string_list_clear(&po->list, 0);
+}
+
+static void paths_and_oids_insert(struct paths_and_oids *po,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	struct string_list_item *item = string_list_insert(&po->list, path);
+	struct oidset *set;
+
+	if (!item->util) {
+		set = xcalloc(1, sizeof(struct oidset));
+		oidset_init(set, 16);
+		item->util = set;
+	} else {
+		set = item->util;
+	}
+
+	oidset_insert(set, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct paths_and_oids *po)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(po, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *set)
 {
+	int i;
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct paths_and_oids po;
 	struct object_id *oid;
 	struct oidset_iter iter;

 	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);

 		if (!tree)
 			continue;

-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
 	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&po);
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &po);
+	}
+
+	for (i = 0; i < po.list.nr; i++)
+		mark_trees_uninteresting_sparse(
+			r, (struct oidset *)po.list.items[i].util);
+
+	paths_and_oids_clear(&po);
 }

 struct commit_stack {
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 81f6805bc3..45dba6e014 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_objects.txt sparse_objects.txt
 '

+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
 	git add f3/f4 &&
 	git commit -m "Copied f1/f1 to f3/f4" &&
-	cat >packinput.txt <<-EOF &&
+	cat >packinput.txt <<-EOF
 	topic1
 	^topic1~1
 	EOF
-	git rev-parse		\
-		topic1		\
-		topic1^{tree}	\
-		topic1:f3 | sort >expect_objects.txt
 '

 test_expect_success 'non-sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3 | sort >expect_objects.txt &&
 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
@@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
 '

 test_expect_success 'sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3		\
+		topic1:f3/f4		\
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	test_cmp expect_objects.txt sparse_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 5/6] pack-objects: create pack.useSparse setting
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2018-11-29 14:24   ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-11-29 14:24   ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2018-11-29 14:24   ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
@ 2018-11-29 14:24   ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         | 1 +
 t/README                       | 4 ++++
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
 		topic1			\
 		topic1^{tree}		\
 		topic1:f3 | sort >expect_objects.txt &&
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] pack-objects: add --sparse option
  2018-11-29 14:20     ` Derrick Stolee
@ 2018-11-30  2:39       ` Junio C Hamano
  2018-11-30 15:53         ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2018-11-30  2:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Stefan Beller, gitgitgadget, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> You're right that having this hidden as an opt-in config variable
> makes it hard to discover as a typical user.
>
> I would argue that we should actually make the config setting true by
> default, and recommend that servers opt-out. Here are my reasons:
>
> 1. The vast majority of users are clients.
>
> 2. Client users are not likely to know about and tweak these settings.
>
> 3. Server users are more likely to keep an eye on the different knobs
> they can tweak.
>
> 4. Servers should use the reachability bitmaps, which don't use this
> logic anyway.
>
> While _eventually_ we should make this opt-out, we shouldn't do that
> until it has cooked a while.

I actually do not agree.  If the knob gives enough benefit, the
users will learn about it viva voce, and in a few more releases
we'll hear "enough users complain that they have to turn it on,
let's make it the default".  If that does not happen, the knob
does not deserve to be turned on in the first place.

The same applies to many shiny new toys people are discussing
recently on this list (e.g. precious vs expendable and non-overlay
checkout are the ones that immediately come to my mind).

Having said that, I won't be commenting on this shiny new toy before
the final.  I want to see more people help tying the loose ends and
give it final varnish to the upcoming release, as it seems to have
become rockier and larger release than we originally anticipated.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] pack-objects: add --sparse option
  2018-11-30  2:39       ` Junio C Hamano
@ 2018-11-30 15:53         ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2018-11-30 15:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Stefan Beller, gitgitgadget, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Jonathan Nieder,
	Derrick Stolee

On 11/29/2018 9:39 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> While _eventually_ we should make this opt-out, we shouldn't do that
>> until it has cooked a while.
> I actually do not agree.  If the knob gives enough benefit, the
> users will learn about it viva voce, and in a few more releases
> we'll hear "enough users complain that they have to turn it on,
> let's make it the default".  If that does not happen, the knob
> does not deserve to be turned on in the first place.

That's a good philosophy. I'll keep it in mind.
> Having said that, I won't be commenting on this shiny new toy before
> the final.  I want to see more people help tying the loose ends and
> give it final varnish to the upcoming release, as it seems to have
> become rockier and larger release than we originally anticipated.

Yeah, no rush on this one. I just wanted to get some initial feedback 
about the idea.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3 0/6] Add a new "sparse" tree walk algorithm
  2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2018-11-29 14:24   ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42   ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
                       ` (6 more replies)
  6 siblings, 7 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt      |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |  10 ++-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  55 +++++++++++-
 list-objects.h                     |   4 +-
 revision.c                         | 121 +++++++++++++++++++++++++
 revision.h                         |   2 +
 t/README                           |   4 +
 t/t5322-pack-objects-sparse.sh     | 139 +++++++++++++++++++++++++++++
 12 files changed, 351 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v2:

 1:  60617681f7 = 1:  60617681f7 revision: add mark_tree_uninteresting_sparse
 2:  4527addacb = 2:  4527addacb list-objects: consume sparse tree walk
 3:  9644f6ff04 ! 3:  4ef318bdb2 pack-objects: add --sparse option
     @@ -32,7 +32,9 @@

      +--sparse::
      +	Use the "sparse" algorithm to determine which objects to include in
     -+	the pack. This can have significant performance benefits when computing
     ++	the pack, when combined with the "--revs" option. This algorithm
     ++	only walks trees that appear in paths that introduce new objects.
     ++	This can have significant performance benefits when computing
      +	a pack to send a small change. However, it is possible that extra
      +	objects are added to the pack-file if the included commits contain
      +	certain types of direct renames.
 4:  c99957d06f = 4:  571b2e2784 revision: implement sparse algorithm
 5:  d6912188be ! 5:  33d2c04dd6 pack-objects: create pack.useSparse setting
     @@ -19,6 +19,26 @@

          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

     +diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
     +--- a/Documentation/config/pack.txt
     ++++ b/Documentation/config/pack.txt
     +@@
     + 	true. You should not generally need to turn this off unless
     + 	you are debugging pack bitmaps.
     + 
     ++pack.useSparse::
     ++	When true, git will default to using the '--sparse' option in
     ++	'git pack-objects' when the '--revs' option is present. This
     ++	algorithm only walks trees that appear in paths that introduce new
     ++	objects. This can have significant performance benefits when
     ++	computing a pack to send a small change. However, it is possible
     ++	that extra objects are added to the pack-file if the included
     ++	commits contain certain types of direct renames.
     ++
     + pack.writeBitmaps (deprecated)::
     + 	This is a deprecated synonym for `repack.writeBitmaps`.
     + 
     +
      diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
      --- a/builtin/pack-objects.c
      +++ b/builtin/pack-objects.c
 6:  3d394a9136 = 6:  e4f29543ee pack-objects: create GIT_TEST_PACK_SPARSE

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 22 ++++++++++++++++++++++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *set)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 2/6] list-objects: consume sparse tree walk
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits.

The mark_edges_unintersting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bisect.c               |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c     |  2 +-
 http-push.c            |  2 +-
 list-objects.c         | 55 +++++++++++++++++++++++++++++++++++++++---
 list-objects.h         |  4 ++-
 revision.c             |  3 +++
 7 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..4fbdeca0a4 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+		struct tree *tree = get_commit_tree(parent);
+
+		if (!tree)
+			continue;
+
+		oidset_insert(set, &tree->object.oid);
+
+		if (!(parent->object.flags & UNINTERESTING))
+			continue;
+		tree->object.flags |= UNINTERESTING;
+
+		if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
+			parent->object.flags |= SHOWN;
+			show_edge(parent);
+		}
+	}
+}
+
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse)
 {
 	struct commit_list *list;
+	struct oidset set;
 	int i;
 
+	if (sparse)
+		oidset_init(&set, 16);
+
 	for (list = revs->commits; list; list = list->next) {
 		struct commit *commit = list->item;
 
-		if (commit->object.flags & UNINTERESTING) {
+		if (sparse) {
+			struct tree *tree = get_commit_tree(commit);
+
+			if (commit->object.flags & UNINTERESTING)
+				tree->object.flags |= UNINTERESTING;
+
+			oidset_insert(&set, &tree->object.oid);
+			add_edge_parents(commit, revs, show_edge, &set);
+		} else if (commit->object.flags & UNINTERESTING) {
 			mark_tree_uninteresting(revs->repo,
 						get_commit_tree(commit));
 			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
 				commit->object.flags |= SHOWN;
 				show_edge(commit);
 			}
-			continue;
+		} else {
+			mark_edge_parents_uninteresting(commit, revs, show_edge);
 		}
-		mark_edge_parents_uninteresting(commit, revs, show_edge);
 	}
+
+	if (sparse) {
+		mark_trees_uninteresting_sparse(revs->repo, &set);
+		oidset_clear(&set);
+	}
+
 	if (revs->edge_hint_aggressive) {
 		for (i = 0; i < revs->cmdline.nr; i++) {
 			struct object *obj = revs->cmdline.rev[i].item;
diff --git a/list-objects.h b/list-objects.h
index ad40762926..a952680e46 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
 
 typedef void (*show_edge_fn)(struct commit *);
-void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse);
 
 struct oidset;
 struct list_objects_filter_options;
diff --git a/revision.c b/revision.c
index 3a62c7c187..f9eb6400f1 100644
--- a/revision.c
+++ b/revision.c
@@ -109,6 +109,9 @@ void mark_trees_uninteresting_sparse(struct repository *r,
 	while ((oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);
 
+		if (!tree)
+			continue;
+
 		if (tree->object.flags & UNINTERESTING) {
 			/*
 			 * Remove the flag so the next call
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 3/6] pack-objects: add --sparse option
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-pack-objects.txt |  11 ++-
 builtin/pack-objects.c             |   5 +-
 t/t5322-pack-objects-sparse.sh     | 115 +++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack, when combined with the "--revs" option. This algorithm
+	only walks trees that appear in paths that introduce new objects.
+	This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, 0);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 0000000000..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+# Demonstrate that both algorithms send "extra" objects because
+# they are not in the frontier.
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+	git checkout topic1 &&
+	echo change-3 >f3/f3/data.txt &&
+	git commit -a -m "Changed f3/f3/data.txt" &&
+	git rev-parse			\
+		topic1~1		\
+		topic1~1^{tree}		\
+		topic1^{tree}		\
+		topic1			\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt	\
+		topic1:f3		\
+		topic1:f3/f3		\
+		topic1:f3/f3/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f1 into f3' '
+	mkdir f3/f4 &&
+	cp -r f1/f1/* f3/f4 &&
+	git add f3/f4 &&
+	git commit -m "Copied f1/f1 to f3/f4" &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic1~1
+	EOF
+	git rev-parse		\
+		topic1		\
+		topic1^{tree}	\
+		topic1:f3 | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 4/6] revision: implement sparse algorithm
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2018-12-10 16:42     ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion unless the oidset contains some intersting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c                     | 116 ++++++++++++++++++++++++++++++---
 t/t5322-pack-objects-sparse.sh |  21 ++++--
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..971f1bb095 100644
--- a/revision.c
+++ b/revision.c
@@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }

+struct paths_and_oids {
+	struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+	string_list_init(&po->list, 1);
+}
+
+static void paths_and_oids_clear(struct paths_and_oids *po)
+{
+	int i;
+	for (i = 0; i < po->list.nr; i++) {
+		oidset_clear(po->list.items[i].util);
+		free(po->list.items[i].util);
+	}
+
+	string_list_clear(&po->list, 0);
+}
+
+static void paths_and_oids_insert(struct paths_and_oids *po,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	struct string_list_item *item = string_list_insert(&po->list, path);
+	struct oidset *set;
+
+	if (!item->util) {
+		set = xcalloc(1, sizeof(struct oidset));
+		oidset_init(set, 16);
+		item->util = set;
+	} else {
+		set = item->util;
+	}
+
+	oidset_insert(set, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct paths_and_oids *po)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(po, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *set)
 {
+	int i;
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct paths_and_oids po;
 	struct object_id *oid;
 	struct oidset_iter iter;

 	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);

 		if (!tree)
 			continue;

-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
 	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&po);
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &po);
+	}
+
+	for (i = 0; i < po.list.nr; i++)
+		mark_trees_uninteresting_sparse(
+			r, (struct oidset *)po.list.items[i].util);
+
+	paths_and_oids_clear(&po);
 }

 struct commit_stack {
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 81f6805bc3..45dba6e014 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_objects.txt sparse_objects.txt
 '

+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
 	git add f3/f4 &&
 	git commit -m "Copied f1/f1 to f3/f4" &&
-	cat >packinput.txt <<-EOF &&
+	cat >packinput.txt <<-EOF
 	topic1
 	^topic1~1
 	EOF
-	git rev-parse		\
-		topic1		\
-		topic1^{tree}	\
-		topic1:f3 | sort >expect_objects.txt
 '

 test_expect_success 'non-sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3 | sort >expect_objects.txt &&
 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
@@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
 '

 test_expect_success 'sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3		\
+		topic1:f3/f4		\
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	test_cmp expect_objects.txt sparse_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 5/6] pack-objects: create pack.useSparse setting
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2018-12-10 16:42     ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-10 16:42     ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/pack.txt  |  9 +++++++++
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useSparse::
+	When true, git will default to using the '--sparse' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm only walks trees that appear in paths that introduce new
+	objects. This can have significant performance benefits when
+	computing a pack to send a small change. However, it is possible
+	that extra objects are added to the pack-file if the included
+	commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2018-12-10 16:42     ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
@ 2018-12-10 16:42     ` Derrick Stolee via GitGitGadget
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         | 1 +
 t/README                       | 4 ++++
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
 		topic1			\
 		topic1^{tree}		\
 		topic1:f3 | sort >expect_objects.txt &&
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 0/6] Add a new "sparse" tree walk algorithm
  2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2018-12-10 16:42     ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22     ` Derrick Stolee via GitGitGadget
  2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
                         ` (6 more replies)
  6 siblings, 7 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Updates in V4:

 * Switched to using hashmap instead of string_list for the path/oidset
   dictionary. (This is due to some fear that the string_list performance
   would degrade for a very wide tree, but I am unable to measure a
   performance difference.)
 * Some cleanup of code snippets across commits.
 * Some grammar cleanup in the commit messages.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt      |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |  10 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  70 +++++++++++---
 list-objects.h                     |   4 +-
 revision.c                         | 144 +++++++++++++++++++++++++++++
 revision.h                         |   2 +
 t/README                           |   4 +
 t/t5322-pack-objects-sparse.sh     | 139 ++++++++++++++++++++++++++++
 12 files changed, 382 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v3:

 1:  60617681f7 ! 1:  817e30a287 revision: add mark_tree_uninteresting_sparse
     @@ -35,6 +35,9 @@
      +	while ((oid = oidset_iter_next(&iter))) {
      +		struct tree *tree = lookup_tree(r, oid);
      +
     ++		if (!tree)
     ++			continue;
     ++
      +		if (tree->object.flags & UNINTERESTING) {
      +			/*
      +			 * Remove the flag so the next call
 2:  4527addacb ! 2:  39dc89beb9 list-objects: consume sparse tree walk
     @@ -11,9 +11,10 @@
          We walk these commits until we discover a frontier of commits such
          that every commit walk starting at interesting commits ends in a root
          commit or unintersting commit. We then need to discover which
     -    non-commit objects are reachable from  uninteresting commits.
     +    non-commit objects are reachable from  uninteresting commits. This
     +    commit walk is not changing during this series.
      
     -    The mark_edges_unintersting() method in list-objects.c iterates on
     +    The mark_edges_uninteresting() method in list-objects.c iterates on
          the commit list and does the following:
      
          * If the commit is UNINTERSTING, then mark its root tree and every
     @@ -138,17 +139,22 @@
      +			      int sparse)
       {
       	struct commit_list *list;
     -+	struct oidset set;
       	int i;
       
     -+	if (sparse)
     +-	for (list = revs->commits; list; list = list->next) {
     +-		struct commit *commit = list->item;
     ++	if (sparse) {
     ++		struct oidset set;
      +		oidset_init(&set, 16);
     -+
     - 	for (list = revs->commits; list; list = list->next) {
     - 		struct commit *commit = list->item;
       
      -		if (commit->object.flags & UNINTERESTING) {
     -+		if (sparse) {
     +-			mark_tree_uninteresting(revs->repo,
     +-						get_commit_tree(commit));
     +-			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
     +-				commit->object.flags |= SHOWN;
     +-				show_edge(commit);
     ++		for (list = revs->commits; list; list = list->next) {
     ++			struct commit *commit = list->item;
      +			struct tree *tree = get_commit_tree(commit);
      +
      +			if (commit->object.flags & UNINTERESTING)
     @@ -156,24 +162,27 @@
      +
      +			oidset_insert(&set, &tree->object.oid);
      +			add_edge_parents(commit, revs, show_edge, &set);
     -+		} else if (commit->object.flags & UNINTERESTING) {
     - 			mark_tree_uninteresting(revs->repo,
     - 						get_commit_tree(commit));
     - 			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
     - 				commit->object.flags |= SHOWN;
     - 				show_edge(commit);
     ++		}
     ++
     ++		mark_trees_uninteresting_sparse(revs->repo, &set);
     ++		oidset_clear(&set);
     ++	} else {
     ++		for (list = revs->commits; list; list = list->next) {
     ++			struct commit *commit = list->item;
     ++			if (commit->object.flags & UNINTERESTING) {
     ++				mark_tree_uninteresting(revs->repo,
     ++							get_commit_tree(commit));
     ++				if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
     ++					commit->object.flags |= SHOWN;
     ++					show_edge(commit);
     ++				}
     ++				continue;
       			}
      -			continue;
     -+		} else {
      +			mark_edge_parents_uninteresting(commit, revs, show_edge);
       		}
      -		mark_edge_parents_uninteresting(commit, revs, show_edge);
       	}
     -+
     -+	if (sparse) {
     -+		mark_trees_uninteresting_sparse(revs->repo, &set);
     -+		oidset_clear(&set);
     -+	}
      +
       	if (revs->edge_hint_aggressive) {
       		for (i = 0; i < revs->cmdline.nr; i++) {
     @@ -193,17 +202,3 @@
       
       struct oidset;
       struct list_objects_filter_options;
     -
     -diff --git a/revision.c b/revision.c
     ---- a/revision.c
     -+++ b/revision.c
     -@@
     - 	while ((oid = oidset_iter_next(&iter))) {
     - 		struct tree *tree = lookup_tree(r, oid);
     - 
     -+		if (!tree)
     -+			continue;
     -+
     - 		if (tree->object.flags & UNINTERESTING) {
     - 			/*
     - 			 * Remove the flag so the next call
 3:  4ef318bdb2 = 3:  ab733daff5 pack-objects: add --sparse option
 4:  571b2e2784 ! 4:  c44172c35e revision: implement sparse algorithm
     @@ -6,8 +6,9 @@
          pack-objects --revs', we discover the "frontier" of commits
          that we care about and the boundary with commit we find
          uninteresting. From that point, we walk trees to discover which
     -    trees and blobs are uninteresting. Finally, we walk trees to find
     -    the interesting trees.
     +    trees and blobs are uninteresting. Finally, we walk trees from the
     +    interesting commits to find the interesting objects that are
     +    placed in the pack.
      
          This commit introduces a new, "sparse" way to discover the
          uninteresting trees. We use the perspective of a single user trying
     @@ -31,11 +32,13 @@
          parsing is UNINTERESTING.
      
          To actually improve the peformance, we need to terminate our
     -    recursion unless the oidset contains some intersting trees and
     -    some uninteresting trees. Technically, we only need one interesting
     -    tree for this to speed up in most cases, but we also will not mark
     -    anything UNINTERESTING if there are no uninteresting trees, so
     -    that would be wasted effort.
     +    recursion. If the oidset contains only UNINTERESTING trees, then
     +    we do not continue the recursion. This avoids walking trees that
     +    are likely to not be reachable from interesting trees. If the
     +    oidset contains only interesting trees, then we will walk these
     +    trees in the final stage that collects the intersting objects to
     +    place in the pack. Thus, we only recurse if the oidset contains
     +    both interesting and UNINITERESTING trees.
      
          There are a few ways that this is not a universally better option.
      
     @@ -108,51 +111,80 @@
      diff --git a/revision.c b/revision.c
      --- a/revision.c
      +++ b/revision.c
     +@@
     + #include "commit-reach.h"
     + #include "commit-graph.h"
     + #include "prio-queue.h"
     ++#include "hashmap.h"
     + 
     + volatile show_early_output_fn_t show_early_output;
     + 
      @@
       	mark_tree_contents_uninteresting(r, tree);
       }
       
     -+struct paths_and_oids {
     -+	struct string_list list;
     ++struct path_and_oids_entry {
     ++	struct hashmap_entry ent;
     ++	char *path;
     ++	struct oidset set;
      +};
      +
     -+static void paths_and_oids_init(struct paths_and_oids *po)
     ++static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
     ++			     const struct path_and_oids_entry *e1,
     ++			     const struct path_and_oids_entry *e2,
     ++			     const void *keydata)
     ++{
     ++	return strcmp(e1->path, e2->path);
     ++}
     ++
     ++int map_flags = 0;
     ++static void paths_and_oids_init(struct hashmap *map)
      +{
     -+	string_list_init(&po->list, 1);
     ++	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
      +}
      +
     -+static void paths_and_oids_clear(struct paths_and_oids *po)
     ++static void paths_and_oids_clear(struct hashmap *map)
      +{
     -+	int i;
     -+	for (i = 0; i < po->list.nr; i++) {
     -+		oidset_clear(po->list.items[i].util);
     -+		free(po->list.items[i].util);
     ++	struct hashmap_iter iter;
     ++	struct path_and_oids_entry *entry;
     ++	hashmap_iter_init(map, &iter);
     ++
     ++	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
     ++		oidset_clear(&entry->set);
     ++		free(entry->path);
      +	}
      +
     -+	string_list_clear(&po->list, 0);
     ++	hashmap_free(map, 1);
      +}
      +
     -+static void paths_and_oids_insert(struct paths_and_oids *po,
     ++static void paths_and_oids_insert(struct hashmap *map,
      +				  const char *path,
      +				  const struct object_id *oid)
      +{
     -+	struct string_list_item *item = string_list_insert(&po->list, path);
     -+	struct oidset *set;
     ++	int hash = strhash(path);
     ++	struct path_and_oids_entry key;
     ++	struct path_and_oids_entry *entry;
     ++
     ++	hashmap_entry_init(&key, hash);
     ++	key.path = xstrdup(path);
     ++	oidset_init(&key.set, 0);
      +
     -+	if (!item->util) {
     -+		set = xcalloc(1, sizeof(struct oidset));
     -+		oidset_init(set, 16);
     -+		item->util = set;
     ++	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
     ++		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
     ++		hashmap_entry_init(entry, hash);
     ++		entry->path = key.path;
     ++		oidset_init(&entry->set, 16);
     ++		hashmap_put(map, entry);
      +	} else {
     -+		set = item->util;
     ++		free(key.path);
      +	}
      +
     -+	oidset_insert(set, oid);
     ++	oidset_insert(&entry->set, oid);
      +}
      +
      +static void add_children_by_path(struct repository *r,
      +				 struct tree *tree,
     -+				 struct paths_and_oids *po)
     ++				 struct hashmap *map)
      +{
      +	struct tree_desc desc;
      +	struct name_entry entry;
     @@ -167,7 +199,7 @@
      +	while (tree_entry(&desc, &entry)) {
      +		switch (object_type(entry.mode)) {
      +		case OBJ_TREE:
     -+			paths_and_oids_insert(po, entry.path, entry.oid);
     ++			paths_and_oids_insert(map, entry.path, entry.oid);
      +
      +			if (tree->object.flags & UNINTERESTING) {
      +				struct tree *child = lookup_tree(r, entry.oid);
     @@ -194,9 +226,10 @@
       void mark_trees_uninteresting_sparse(struct repository *r,
       				     struct oidset *set)
       {
     -+	int i;
      +	unsigned has_interesting = 0, has_uninteresting = 0;
     -+	struct paths_and_oids po;
     ++	struct hashmap map;
     ++	struct hashmap_iter map_iter;
     ++	struct path_and_oids_entry *entry;
       	struct object_id *oid;
       	struct oidset_iter iter;
       
     @@ -222,25 +255,25 @@
      +			has_uninteresting = 1;
      +		else
      +			has_interesting = 1;
     - 	}
     ++	}
      +
      +	/* Do not walk unless we have both types of trees. */
      +	if (!has_uninteresting || !has_interesting)
      +		return;
      +
     -+	paths_and_oids_init(&po);
     ++	paths_and_oids_init(&map);
      +
      +	oidset_iter_init(set, &iter);
      +	while ((oid = oidset_iter_next(&iter))) {
      +		struct tree *tree = lookup_tree(r, oid);
     -+		add_children_by_path(r, tree, &po);
     -+	}
     ++		add_children_by_path(r, tree, &map);
     + 	}
      +
     -+	for (i = 0; i < po.list.nr; i++)
     -+		mark_trees_uninteresting_sparse(
     -+			r, (struct oidset *)po.list.items[i].util);
     ++	hashmap_iter_init(&map, &map_iter);
     ++	while ((entry = hashmap_iter_next(&map_iter)))
     ++		mark_trees_uninteresting_sparse(r, &entry->set);
      +
     -+	paths_and_oids_clear(&po);
     ++	paths_and_oids_clear(&map);
       }
       
       struct commit_stack {
 5:  33d2c04dd6 = 5:  f386f6c3c9 pack-objects: create pack.useSparse setting
 6:  e4f29543ee = 6:  d011a9c1b1 pack-objects: create GIT_TEST_PACK_SPARSE

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2019-01-11 19:43         ` Junio C Hamano
  2018-12-14 21:22       ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  6 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 25 +++++++++++++++++++++++++
 revision.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..f9eb6400f1 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *set)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (!tree)
+			continue;
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 2/6] list-objects: consume sparse tree walk
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2019-01-11 23:20         ` Junio C Hamano
  2018-12-14 21:22       ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  6 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits. This
commit walk is not changing during this series.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bisect.c               |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c     |  2 +-
 http-push.c            |  2 +-
 list-objects.c         | 70 +++++++++++++++++++++++++++++++++++-------
 list-objects.h         |  4 ++-
 6 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..fb728f7842 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+		struct tree *tree = get_commit_tree(parent);
+
+		if (!tree)
+			continue;
+
+		oidset_insert(set, &tree->object.oid);
+
+		if (!(parent->object.flags & UNINTERESTING))
+			continue;
+		tree->object.flags |= UNINTERESTING;
+
+		if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
+			parent->object.flags |= SHOWN;
+			show_edge(parent);
+		}
+	}
+}
+
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse)
 {
 	struct commit_list *list;
 	int i;
 
-	for (list = revs->commits; list; list = list->next) {
-		struct commit *commit = list->item;
+	if (sparse) {
+		struct oidset set;
+		oidset_init(&set, 16);
 
-		if (commit->object.flags & UNINTERESTING) {
-			mark_tree_uninteresting(revs->repo,
-						get_commit_tree(commit));
-			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
-				commit->object.flags |= SHOWN;
-				show_edge(commit);
+		for (list = revs->commits; list; list = list->next) {
+			struct commit *commit = list->item;
+			struct tree *tree = get_commit_tree(commit);
+
+			if (commit->object.flags & UNINTERESTING)
+				tree->object.flags |= UNINTERESTING;
+
+			oidset_insert(&set, &tree->object.oid);
+			add_edge_parents(commit, revs, show_edge, &set);
+		}
+
+		mark_trees_uninteresting_sparse(revs->repo, &set);
+		oidset_clear(&set);
+	} else {
+		for (list = revs->commits; list; list = list->next) {
+			struct commit *commit = list->item;
+			if (commit->object.flags & UNINTERESTING) {
+				mark_tree_uninteresting(revs->repo,
+							get_commit_tree(commit));
+				if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
+					commit->object.flags |= SHOWN;
+					show_edge(commit);
+				}
+				continue;
 			}
-			continue;
+			mark_edge_parents_uninteresting(commit, revs, show_edge);
 		}
-		mark_edge_parents_uninteresting(commit, revs, show_edge);
 	}
+
 	if (revs->edge_hint_aggressive) {
 		for (i = 0; i < revs->cmdline.nr; i++) {
 			struct object *obj = revs->cmdline.rev[i].item;
diff --git a/list-objects.h b/list-objects.h
index ad40762926..a952680e46 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
 
 typedef void (*show_edge_fn)(struct commit *);
-void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse);
 
 struct oidset;
 struct list_objects_filter_options;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 3/6] pack-objects: add --sparse option
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
  2018-12-14 21:22       ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2019-01-11 22:30         ` Junio C Hamano
  2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  6 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-pack-objects.txt |  11 ++-
 builtin/pack-objects.c             |   5 +-
 t/t5322-pack-objects-sparse.sh     | 115 +++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack, when combined with the "--revs" option. This algorithm
+	only walks trees that appear in paths that introduce new objects.
+	This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, 0);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 0000000000..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+# Demonstrate that both algorithms send "extra" objects because
+# they are not in the frontier.
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+	git checkout topic1 &&
+	echo change-3 >f3/f3/data.txt &&
+	git commit -a -m "Changed f3/f3/data.txt" &&
+	git rev-parse			\
+		topic1~1		\
+		topic1~1^{tree}		\
+		topic1^{tree}		\
+		topic1			\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt	\
+		topic1:f3		\
+		topic1:f3/f3		\
+		topic1:f3/f3/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f1 into f3' '
+	mkdir f3/f4 &&
+	cp -r f1/f1/* f3/f4 &&
+	git add f3/f4 &&
+	git commit -m "Copied f1/f1 to f3/f4" &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic1~1
+	EOF
+	git rev-parse		\
+		topic1		\
+		topic1^{tree}	\
+		topic1:f3 | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2018-12-14 21:22       ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
  2019-01-11 23:20         ` Junio C Hamano
  2018-12-14 21:22       ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  6 siblings, 2 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c                     | 139 ++++++++++++++++++++++++++++++---
 t/t5322-pack-objects-sparse.sh |  21 +++--
 2 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..63bf6230dc 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "hashmap.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -99,29 +100,147 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+struct path_and_oids_entry {
+	struct hashmap_entry ent;
+	char *path;
+	struct oidset set;
+};
+
+static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
+			     const struct path_and_oids_entry *e1,
+			     const struct path_and_oids_entry *e2,
+			     const void *keydata)
+{
+	return strcmp(e1->path, e2->path);
+}
+
+int map_flags = 0;
+static void paths_and_oids_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
+}
+
+static void paths_and_oids_clear(struct hashmap *map)
+{
+	struct hashmap_iter iter;
+	struct path_and_oids_entry *entry;
+	hashmap_iter_init(map, &iter);
+
+	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
+		oidset_clear(&entry->set);
+		free(entry->path);
+	}
+
+	hashmap_free(map, 1);
+}
+
+static void paths_and_oids_insert(struct hashmap *map,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	int hash = strhash(path);
+	struct path_and_oids_entry key;
+	struct path_and_oids_entry *entry;
+
+	hashmap_entry_init(&key, hash);
+	key.path = xstrdup(path);
+	oidset_init(&key.set, 0);
+
+	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
+		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
+		hashmap_entry_init(entry, hash);
+		entry->path = key.path;
+		oidset_init(&entry->set, 16);
+		hashmap_put(map, entry);
+	} else {
+		free(key.path);
+	}
+
+	oidset_insert(&entry->set, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct hashmap *map)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(map, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *set)
 {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
 	struct object_id *oid;
 	struct oidset_iter iter;
 
 	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);
 
 		if (!tree)
 			continue;
 
-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&map);
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &map);
 	}
+
+	hashmap_iter_init(&map, &map_iter);
+	while ((entry = hashmap_iter_next(&map_iter)))
+		mark_trees_uninteresting_sparse(r, &entry->set);
+
+	paths_and_oids_clear(&map);
 }
 
 struct commit_stack {
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 81f6805bc3..45dba6e014 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_objects.txt sparse_objects.txt
 '
 
+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
 	git add f3/f4 &&
 	git commit -m "Copied f1/f1 to f3/f4" &&
-	cat >packinput.txt <<-EOF &&
+	cat >packinput.txt <<-EOF
 	topic1
 	^topic1~1
 	EOF
-	git rev-parse		\
-		topic1		\
-		topic1^{tree}	\
-		topic1:f3 | sort >expect_objects.txt
 '
 
 test_expect_success 'non-sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3 | sort >expect_objects.txt &&
 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
@@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
 '
 
 test_expect_success 'sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3		\
+		topic1:f3/f4		\
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	test_cmp expect_objects.txt sparse_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 5/6] pack-objects: create pack.useSparse setting
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2018-12-14 21:22       ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/pack.txt  |  9 +++++++++
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useSparse::
+	When true, git will default to using the '--sparse' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm only walks trees that appear in paths that introduce new
+	objects. This can have significant performance benefits when
+	computing a pack to send a small change. However, it is possible
+	that extra objects are added to the pack-file if the included
+	commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2018-12-14 21:22       ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
@ 2018-12-14 21:22       ` Derrick Stolee via GitGitGadget
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         | 1 +
 t/README                       | 4 ++++
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
 		topic1			\
 		topic1^{tree}		\
 		topic1:f3 | sort >expect_objects.txt &&
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
  2018-12-17 14:20           ` Derrick Stolee
  2019-01-11 23:20         ` Junio C Hamano
  1 sibling, 1 reply; 51+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-14 23:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Junio C Hamano, Derrick Stolee


On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote:

> Despite these potential drawbacks, the benefits of the algorithm
> are clear. By adding a counter to 'add_children_by_path' and
> 'mark_tree_contents_uninteresting', I measured the number of
> parsed trees for the two algorithms in a variety of repos.

Ah, so we take one or the other. I tested this with:

    diff --git a/revision.c b/revision.c
    index 63bf6230dc..e9c67aa550 100644
    --- a/revision.c
    +++ b/revision.c
    @@ -61,6 +61,8 @@ static void mark_tree_contents_uninteresting(struct repository *r,
            struct tree_desc desc;
            struct name_entry entry;

    +       fprintf(stderr, "MTCU\n");
    +
            if (parse_tree_gently(tree, 1) < 0)
                    return;

    @@ -166,6 +168,8 @@ static void add_children_by_path(struct repository *r,
            struct tree_desc desc;
            struct name_entry entry;

    +       fprintf(stderr, "ACBP\n");
    +
            if (!tree)
                    return;

And tried testing my pushing this branch to my git.git:

    $ for v in true false; do git push --delete avar push/sparse; ./git -c pack.usesparse=$v --exec-path=$PWD push avar HEAD 2>&1 | grep -e MTCU -e ACBP | uniq -c; done
    To github.com:avar/git.git
     - [deleted]               push/sparse
         22 ACBP
    To github.com:avar/git.git
     - [deleted]               push/sparse
        184 MTCU

But snipping around a bit from the commit message:

>When enumerating objects to place in a pack-file during 'git
>pack-objects --revs', we discover the "frontier" of commits
>that we care about and the boundary with commit we find
>uninteresting. From that point, we walk trees to discover which
>trees and blobs are uninteresting. Finally, we walk trees from the
>interesting commits to find the interesting objects that are
>placed in the pack.
>[...]
>These improvements will have even larger benefits in the super-
>large Windows repository. In our experiments, we see the
>"Enumerate objects" phase of pack-objects taking 60-80% of the
>end-to-end time of non-trivial pushes, taking longer than the
>network time to send the pack and the server time to verify the
>pack.

If I change my monkeypatch to:

    diff --git a/revision.c b/revision.c
    index 63bf6230dc..9171ca27c5 100644
    --- a/revision.c
    +++ b/revision.c
    @@ -63,0 +64,3 @@ static void mark_tree_contents_uninteresting(struct repository *r,
    +       fprintf(stderr, "MTCU\n");
    +       sleep(1);
    +
    @@ -168,0 +172,3 @@ static void add_children_by_path(struct repository *r,
    +       fprintf(stderr, "ACBP\n");
    +       sleep(1);
    +

We spend a long time printing those out before we ever get to
"Enumerating objects".

Which was where I was trying to test this, i.e. is this a lot of work we
perform before we print out the progress bar, and regardless of this
optimization should have other progress output there, so we can see this
time we're spending on this?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
@ 2018-12-17 14:20           ` Derrick Stolee
  2018-12-17 14:26             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2018-12-17 14:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Junio C Hamano, Derrick Stolee

On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote:
>
>> Despite these potential drawbacks, the benefits of the algorithm
>> are clear. By adding a counter to 'add_children_by_path' and
>> 'mark_tree_contents_uninteresting', I measured the number of
>> parsed trees for the two algorithms in a variety of repos.
> We spend a long time printing those out before we ever get to
> "Enumerating objects".
>
> Which was where I was trying to test this, i.e. is this a lot of work we
> perform before we print out the progress bar, and regardless of this
> optimization should have other progress output there, so we can see this
> time we're spending on this?

It is true that part of the problem is that a 'git push' will sit for a 
while without presenting any feedback until this part of the algorithm 
is complete. The current series intends to significantly reduce this time.

As for adding progress to this step, I'm open to it. It can be done as a 
sequel series.

What would we use to describe this section? "Enumerating remote objects"?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-17 14:20           ` Derrick Stolee
@ 2018-12-17 14:26             ` Ævar Arnfjörð Bjarmason
  2018-12-17 14:50               ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-17 14:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Junio C Hamano, Derrick Stolee


On Mon, Dec 17 2018, Derrick Stolee wrote:

> On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote:
>>
>>> Despite these potential drawbacks, the benefits of the algorithm
>>> are clear. By adding a counter to 'add_children_by_path' and
>>> 'mark_tree_contents_uninteresting', I measured the number of
>>> parsed trees for the two algorithms in a variety of repos.
>> We spend a long time printing those out before we ever get to
>> "Enumerating objects".
>>
>> Which was where I was trying to test this, i.e. is this a lot of work we
>> perform before we print out the progress bar, and regardless of this
>> optimization should have other progress output there, so we can see this
>> time we're spending on this?
>
> It is true that part of the problem is that a 'git push' will sit for
> a while without presenting any feedback until this part of the
> algorithm is complete. The current series intends to significantly
> reduce this time.


> As for adding progress to this step, I'm open to it. It can be done as
> a sequel series.

Okey. To clarify I wasn't complaining about the lack of progress output,
we didn't have it before, just clarifying (as I've found out now) that
when you're talking about "enumerating objects" in your commit message
it's *not* what we're doing when we show the "Enumerating objects"
progress bar, but an unrelated step prior to that.

I thought I might have been holding this wrong.

> What would we use to describe this section? "Enumerating remote
> objects"?

Isn't this "Discovering objects to push to remote" i.e. "Selecting
objects", but then what's "Enumerating objects" really doing? "Looping
over objects we selected before and creating a pack"?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-17 14:26             ` Ævar Arnfjörð Bjarmason
@ 2018-12-17 14:50               ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2018-12-17 14:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Junio C Hamano, Derrick Stolee

On 12/17/2018 9:26 AM, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Dec 17 2018, Derrick Stolee wrote:
>
>
>> As for adding progress to this step, I'm open to it. It can be done as
>> a sequel series.
> Okey. To clarify I wasn't complaining about the lack of progress output,
> we didn't have it before, just clarifying (as I've found out now) that
> when you're talking about "enumerating objects" in your commit message
> it's *not* what we're doing when we show the "Enumerating objects"
> progress bar, but an unrelated step prior to that.
Part of the problem is that in builtin/pack-objects.c, we have the 
following:

         if (progress)
                 progress_state = start_progress(_("Enumerating 
objects"), 0);
         if (!use_internal_rev_list)
                 read_object_list_from_stdin();
         else {
                 get_object_list(rp.argc, rp.argv);
                 argv_array_clear(&rp);
         }
         cleanup_preferred_base();
         if (include_tag && nr_result)
                 for_each_ref(add_ref_tag, NULL);
         stop_progress(&progress_state);

and the logic for walking uninteresting objects is the 
mark_edges_uninteresting() inside get_object_list() (both entirely 
contained in this progress state). Perhaps the best thing to do is to 
untangle the progress for the two modes based on 
'use_internal_rev_list'. Could we then have the progress for 
get_object_list() be "Walking objects" instead?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
  2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2019-01-11 19:43         ` Junio C Hamano
  2019-01-11 20:25           ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2019-01-11 19:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> In preparation for a new algorithm that walks fewer trees when
> creating a pack from a set of revisions, create a method that
> takes an oidset of tree oids and marks reachable objects as
> UNINTERESTING.
>
> The current implementation uses the existing
> mark_tree_uninteresting to recursively walk the trees and blobs.
> This will walk the same number of trees as the old mechanism.
>
> There is one new assumption in this approach: we are also given
> the oids of the interesting trees. This implementation does not
> use those trees at the moment, but we will use them in a later
> rewrite of this method.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 25 +++++++++++++++++++++++++
>  revision.h |  2 ++
>  2 files changed, 27 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index 13e0519c02..f9eb6400f1 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
>  	mark_tree_contents_uninteresting(r, tree);
>  }
>  
> +void mark_trees_uninteresting_sparse(struct repository *r,
> +				     struct oidset *set)
> +{
> +	struct object_id *oid;
> +	struct oidset_iter iter;
> +
> +	oidset_iter_init(set, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		struct tree *tree = lookup_tree(r, oid);
> +
> +		if (!tree)
> +			continue;
> +
> +		if (tree->object.flags & UNINTERESTING) {
> +			/*
> +			 * Remove the flag so the next call
> +			 * is not a no-op. The flag is added
> +			 * in mark_tree_unintersting().
> +			 */
> +			tree->object.flags ^= UNINTERESTING;
> +			mark_tree_uninteresting(r, tree);
> +		}

The proposed log message claims that the method takes an oidset and
marks reachable objects as uninteresting, but the implementation
only marks those that are reachable from already uninteresting
trees.  Either one of them must be wrong.

Did you mean to have this instead?

		if (!tree)
			continue;
		/*
		 * Force traversal of the tree, even if it has been
		 * already marked as UNINTERESTING.
		 */
		tree->object.flags &= ~UNINTERESTING;
		mark_tree_uninteresting(r, tree);

By the way, one of the bigger reasons why I have to ask, instead of
making an educated guess, is because "struct oidset *set" parameter
does not give any useful information with the variable name to the
readers.  We know it is a set because its type is oidset; readers
need to know what meaning the 'set' has, what it is used for, why
the caller wants to (or decides not to) place a tree object in the
set when it calls it.  None of that can be read from its name.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
  2019-01-11 19:43         ` Junio C Hamano
@ 2019-01-11 20:25           ` Junio C Hamano
  2019-01-11 22:05             ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2019-01-11 20:25 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> In preparation for a new algorithm that walks fewer trees when
>> creating a pack from a set of revisions, create a method that
>> takes an oidset of tree oids and marks reachable objects as
>> UNINTERESTING.
>> ...
>> There is one new assumption in this approach: we are also given
>> the oids of the interesting trees. This implementation does not
>> use those trees at the moment, but we will use them in a later
>> rewrite of this method.

Ahh....

> The proposed log message claims that the method takes an oidset and
> marks reachable objects as uninteresting, but the implementation
> only marks those that are reachable from already uninteresting
> trees.  Either one of them must be wrong.
>
> Did you mean to have this instead?
>
> 		if (!tree)
> 			continue;
> 		/*
> 		 * Force traversal of the tree, even if it has been
> 		 * already marked as UNINTERESTING.
> 		 */
> 		tree->object.flags &= ~UNINTERESTING;
> 		mark_tree_uninteresting(r, tree);

So, I assumed that the implementation was wrong, but it is the other
way around.  You do mean to pick only already uninteresting trees
out of "set" and mark its reachables.

One thing that would make me worried is what help the callers of
this function will get (or they may have to devise the way
themselves) to avoid having to traverse the same tree number of
times.  A tree can be made uninteresting after a traversal of
another tree that contains it, but the logic in this function

> +		if (tree->object.flags & UNINTERESTING) {
> +			/*
> +			 * Remove the flag so the next call
> +			 * is not a no-op. The flag is added
> +			 * in mark_tree_unintersting().
> +			 */
> +			tree->object.flags ^= UNINTERESTING;
> +			mark_tree_uninteresting(r, tree);
> +		}

ignores the fact that it is already UNINTERESTING (in fact, in a
sense it is even worse---it cannot be used to make a not-yet
UNINTERESTING one into UNINTERESTING), drops the UNINTERESING bit
and forces the traversal of that tree.  The only thing I see that
would work as a saving grace is that mark_tree_uninteresting()
itself would honor existing UNINTERESING bit and refuses to recurse
into its subtrees, but that means blobs at the top-level of such a
tree would be marked UNINTERESING while those in its subtrees can be
left not-UNINTERESING, which sounds like inviting a mess.

It does *not* immediately mean this function is misdesigned.  It
just means that the caller needs to carefully follow whatever
calling convention this series will establish in the later patches
(which we haven't seen yet at this point).

> By the way, one of the bigger reasons why I have to ask, instead of
> making an educated guess, is because "struct oidset *set" parameter
> does not give any useful information with the variable name to the
> readers.  We know it is a set because its type is oidset; readers
> need to know what meaning the 'set' has, what it is used for, why
> the caller wants to (or decides not to) place a tree object in the
> set when it calls it.  None of that can be read from its name.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse
  2019-01-11 20:25           ` Junio C Hamano
@ 2019-01-11 22:05             ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2019-01-11 22:05 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

On 1/11/2019 3:25 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> So, I assumed that the implementation was wrong, but it is the other
> way around.  You do mean to pick only already uninteresting trees
> out of "set" and mark its reachables.
>
> One thing that would make me worried is what help the callers of
> this function will get (or they may have to devise the way
> themselves) to avoid having to traverse the same tree number of
> times.  A tree can be made uninteresting after a traversal of
> another tree that contains it, but the logic in this function
>
>> +		if (tree->object.flags & UNINTERESTING) {
>> +			/*
>> +			 * Remove the flag so the next call
>> +			 * is not a no-op. The flag is added
>> +			 * in mark_tree_unintersting().
>> +			 */
>> +			tree->object.flags ^= UNINTERESTING;
>> +			mark_tree_uninteresting(r, tree);
>> +		}
> ignores the fact that it is already UNINTERESTING (in fact, in a
> sense it is even worse---it cannot be used to make a not-yet
> UNINTERESTING one into UNINTERESTING), drops the UNINTERESING bit
> and forces the traversal of that tree.  The only thing I see that
> would work as a saving grace is that mark_tree_uninteresting()
> itself would honor existing UNINTERESING bit and refuses to recurse
> into its subtrees, but that means blobs at the top-level of such a
> tree would be marked UNINTERESING while those in its subtrees can be
> left not-UNINTERESING, which sounds like inviting a mess.
>
> It does *not* immediately mean this function is misdesigned.  It
> just means that the caller needs to carefully follow whatever
> calling convention this series will establish in the later patches
> (which we haven't seen yet at this point).
>
I'm sorry that this implementation is particularly confusing. It is 
created only as a placeholder so we can wire up the --sparse option to 
use this method without being "wrong" but is then removed entirely in 
PATCH 4/6:

...

void mark_trees_uninteresting_sparse(struct repository *r,
  				     struct oidset *set)
  {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
  	struct object_id *oid;
  	struct oidset_iter iter;
  
  	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
  		struct tree *tree = lookup_tree(r, oid);
  
  		if (!tree)
  			continue;
  
-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
...

You are definitely correct that "set" is not a valuable variable name. It could instead be "tree_oids" to be slightly more informative.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 3/6] pack-objects: add --sparse option
  2018-12-14 21:22       ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
@ 2019-01-11 22:30         ` Junio C Hamano
  2019-01-15 15:06           ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2019-01-11 22:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a '--sparse' option flag to the pack-objects builtin. This
> allows the user to specify that they want to use the new logic
> for walking trees. This logic currently does not differ from the
> existing output, but will in a later change.
>
> Create a new test script, t5322-pack-objects-sparse.sh, to ensure
> the object list that is selected matches what we expect. When we
> update the logic to walk in a sparse fashion, the final test will
> be updated to show the extra objects that are added.

Somehow checking the "these are exactly what we expect" feels
brittle.  In a history with three relevant commits A---B---C,
packing B..C could omit trees and blobs in C that appear in A but
not in B, but traditionally, because we stop traversal at B and do
not even look at A, we do not notice that such objects that need to
complete C's tree are already available in a repository that has B.
The approach of test in this patch feels similar to saying "we must
send these duplicates because we must stay dumb", even though with a
perfect knowledge about the history, e.g. bitmap, we would be able
to omit these objects in A that appear in C but not in B.

I think we want to test test both "are we sending enough, even
though we might be wasting some bandwidth by not noticing the other
side already have some?" and "are we still avoiding from becoming
overly stupid, even though we may be cheating to save traversal cost
and risking to redundantly send some objects?"

IOW, a set of tests to make sure that truly new objects are all sent
by seeing what is sent is a strict superset of what we expect.  And
another set of tests to make sure that objects that are so obviously
(this criterion may be highly subjective) be present in the
receiving repository (e.g. the tree object of commit B and what it
contains when seinding B..C) are never sent, even when using an
algorithm that are tuned for traversal cost over bandwidth
consumption.

The code change in this step looks all trivially good, and it may
want to be squashed into the previous step to become a single patch.
Otherwise, [2/6] would be adding a dead code that nobody exercises
until this step.

Thanks.


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-pack-objects.txt |  11 ++-
>  builtin/pack-objects.c             |   5 +-
>  t/t5322-pack-objects-sparse.sh     | 115 +++++++++++++++++++++++++++++
>  3 files changed, 129 insertions(+), 2 deletions(-)
>  create mode 100755 t/t5322-pack-objects-sparse.sh
>
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index 40c825c381..e45f3e680d 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -14,7 +14,7 @@ SYNOPSIS
>  	[--local] [--incremental] [--window=<n>] [--depth=<n>]
>  	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
>  	[--stdout [--filter=<filter-spec>] | base-name]
> -	[--shallow] [--keep-true-parents] < object-list
> +	[--shallow] [--keep-true-parents] [--sparse] < object-list
>  
>  
>  DESCRIPTION
> @@ -196,6 +196,15 @@ depth is 4095.
>  	Add --no-reuse-object if you want to force a uniform compression
>  	level on all data no matter the source.
>  
> +--sparse::
> +	Use the "sparse" algorithm to determine which objects to include in
> +	the pack, when combined with the "--revs" option. This algorithm
> +	only walks trees that appear in paths that introduce new objects.
> +	This can have significant performance benefits when computing
> +	a pack to send a small change. However, it is possible that extra
> +	objects are added to the pack-file if the included commits contain
> +	certain types of direct renames.
> +
>  --thin::
>  	Create a "thin" pack by omitting the common objects between a
>  	sender and a receiver in order to reduce network transfer. This
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 5f70d840a7..7d5b0735e3 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
>  static int depth = 50;
>  static int delta_search_threads;
>  static int pack_to_stdout;
> +static int sparse;
>  static int thin;
>  static int num_preferred_base;
>  static struct progress *progress_state;
> @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
>  
>  	if (prepare_revision_walk(&revs))
>  		die(_("revision walk setup failed"));
> -	mark_edges_uninteresting(&revs, show_edge, 0);
> +	mark_edges_uninteresting(&revs, show_edge, sparse);
>  
>  	if (!fn_show_object)
>  		fn_show_object = show_object;
> @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
>  		  N_("unpack unreachable objects newer than <time>"),
>  		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
> +		OPT_BOOL(0, "sparse", &sparse,
> +			 N_("use the sparse reachability algorithm")),
>  		OPT_BOOL(0, "thin", &thin,
>  			 N_("create thin packs")),
>  		OPT_BOOL(0, "shallow", &shallow,
> diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
> new file mode 100755
> index 0000000000..81f6805bc3
> --- /dev/null
> +++ b/t/t5322-pack-objects-sparse.sh
> @@ -0,0 +1,115 @@
> +#!/bin/sh
> +
> +test_description='pack-objects object selection using sparse algorithm'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup repo' '
> +	test_commit initial &&
> +	for i in $(test_seq 1 3)
> +	do
> +		mkdir f$i &&
> +		for j in $(test_seq 1 3)
> +		do
> +			mkdir f$i/f$j &&
> +			echo $j >f$i/f$j/data.txt
> +		done
> +	done &&
> +	git add . &&
> +	git commit -m "Initialized trees" &&
> +	for i in $(test_seq 1 3)
> +	do
> +		git checkout -b topic$i master &&
> +		echo change-$i >f$i/f$i/data.txt &&
> +		git commit -a -m "Changed f$i/f$i/data.txt"
> +	done &&
> +	cat >packinput.txt <<-EOF &&
> +	topic1
> +	^topic2
> +	^topic3
> +	EOF
> +	git rev-parse			\
> +		topic1			\
> +		topic1^{tree}		\
> +		topic1:f1		\
> +		topic1:f1/f1		\
> +		topic1:f1/f1/data.txt | sort >expect_objects.txt
> +'
> +
> +test_expect_success 'non-sparse pack-objects' '
> +	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
> +	git index-pack -o nonsparse.idx nonsparse.pack &&
> +	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
> +	test_cmp expect_objects.txt nonsparse_objects.txt
> +'
> +
> +test_expect_success 'sparse pack-objects' '
> +	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
> +	git index-pack -o sparse.idx sparse.pack &&
> +	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
> +	test_cmp expect_objects.txt sparse_objects.txt
> +'
> +
> +# Demonstrate that both algorithms send "extra" objects because
> +# they are not in the frontier.
> +
> +test_expect_success 'duplicate a folder from f3 and commit to topic1' '
> +	git checkout topic1 &&
> +	echo change-3 >f3/f3/data.txt &&
> +	git commit -a -m "Changed f3/f3/data.txt" &&
> +	git rev-parse			\
> +		topic1~1		\
> +		topic1~1^{tree}		\
> +		topic1^{tree}		\
> +		topic1			\
> +		topic1:f1		\
> +		topic1:f1/f1		\
> +		topic1:f1/f1/data.txt	\
> +		topic1:f3		\
> +		topic1:f3/f3		\
> +		topic1:f3/f3/data.txt | sort >expect_objects.txt
> +'
> +
> +test_expect_success 'non-sparse pack-objects' '
> +	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
> +	git index-pack -o nonsparse.idx nonsparse.pack &&
> +	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
> +	test_cmp expect_objects.txt nonsparse_objects.txt
> +'
> +
> +test_expect_success 'sparse pack-objects' '
> +	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
> +	git index-pack -o sparse.idx sparse.pack &&
> +	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
> +	test_cmp expect_objects.txt sparse_objects.txt
> +'
> +
> +test_expect_success 'duplicate a folder from f1 into f3' '
> +	mkdir f3/f4 &&
> +	cp -r f1/f1/* f3/f4 &&
> +	git add f3/f4 &&
> +	git commit -m "Copied f1/f1 to f3/f4" &&
> +	cat >packinput.txt <<-EOF &&
> +	topic1
> +	^topic1~1
> +	EOF
> +	git rev-parse		\
> +		topic1		\
> +		topic1^{tree}	\
> +		topic1:f3 | sort >expect_objects.txt
> +'
> +
> +test_expect_success 'non-sparse pack-objects' '
> +	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
> +	git index-pack -o nonsparse.idx nonsparse.pack &&
> +	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
> +	test_cmp expect_objects.txt nonsparse_objects.txt
> +'
> +
> +test_expect_success 'sparse pack-objects' '
> +	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
> +	git index-pack -o sparse.idx sparse.pack &&
> +	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
> +	test_cmp expect_objects.txt sparse_objects.txt
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 4/6] revision: implement sparse algorithm
  2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
  2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
@ 2019-01-11 23:20         ` Junio C Hamano
  1 sibling, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2019-01-11 23:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When enumerating objects to place in a pack-file during 'git
> pack-objects --revs', we discover the "frontier" of commits
> that we care about and the boundary with commit we find
> uninteresting. From that point, we walk trees to discover which
> trees and blobs are uninteresting. Finally, we walk trees from the
> interesting commits to find the interesting objects that are
> placed in the pack.
>
> This commit introduces a new, "sparse" way to discover the
> uninteresting trees. We use the perspective of a single user trying
> to push their topic to a large repository. That user likely changed
> a very small fraction of the paths in their working directory, but
> we spend a lot of time walking all reachable trees.
>
> The way to switch the logic to work in this sparse way is to start
> caring about which paths introduce new trees. While it is not
> possible to generate a diff between the frontier boundary and all
> of the interesting commits, we can simulate that behavior by
> inspecting all of the root trees as a whole, then recursing down
> to the set of trees at each path.
>
> We already had taken the first step by passing an oidset to
> mark_trees_uninteresting_sparse(). We now create a dictionary
> whose keys are paths and values are oidsets. We consider the set
> of trees that appear at each path. While we inspect a tree, we
> add its subtrees to the oidsets corresponding to the tree entry's
> path. We also mark trees as UNINTERESTING if the tree we are
> parsing is UNINTERESTING.
>
> To actually improve the peformance, we need to terminate our

performance?

> recursion. If the oidset contains only UNINTERESTING trees, then

So, at this point, it is not quite clear what "the oidset" refers to
to me.  When you are looking at the root tree of one commit, you'd
add the object ID of the tree to the dictionary keyed by the path
(i.e. root of the tree).  You traverse to the parent and keep adding
these root trees, and after a while you'd hit a negative commit and
add its tree with UNINTERESTING bit.  Before that point, the oidset
for the root of the tree has only interesting trees, so you won't
recurse into its contents, but now at the final point you'd walk to
propagate UNINTERESTING bit down?

> we do not continue the recursion. This avoids walking trees that
> are likely to not be reachable from interesting trees. If the
> oidset contains only interesting trees, then we will walk these
> trees in the final stage that collects the intersting objects to
> place in the pack. Thus, we only recurse if the oidset contains
> both interesting and UNINITERESTING trees.
>
> There are a few ways that this is not a universally better option.
>
> First, we can pack extra objects. If someone copies a subtree
> from one tree to another, the first tree will appear UNINTERESTING
> and we will not recurse to see that the subtree should also be
> UNINTERESTING. We will walk the new tree and see the subtree as
> a "new" object and add it to the pack. We add a test case that
> demonstrates this as a way to prove that the --sparse option is
> actually working.

That's an interesting use of the word "working" ;-)

> Second, we can have extra memory pressure. If instead of being a
> single user pushing a small topic we are a server sending new
> objects from across the entire working directory, then we will
> gain very little (the recursion will rarely terminate early) but
> will spend extra time maintaining the path-oidset dictionaries.
> ...
> I used the number of walked trees the main metric above because
> it is consistent across multiple runs. When I ran my tests, the
> performance of the pack-objects command with the same options
> could change the end-to-end time by 10x depending on the file
> system being warm. However, by repeating the same test on repeat
> I could get more consistent timing results. The git.git and
> Linux tests were too fast overall (less than 0.5s) to measure
> an end-to-end difference. The Azure DevOps case was slow enough
> to see the time improve from 15s to 1s in the warm case. The
> cold case was 90s to 9s in my testing.

Understandable.  Projects that tend to be deep (e.g. Java?) would
likely to benefit if they are structured and partitioned very well.

> +struct path_and_oids_entry {
> +	struct hashmap_entry ent;
> +	char *path;
> +	struct oidset set;

Again "set"?  If I understand the description of the logic explained
in the proposed commit log message correctly, then this is a set of
"tree objects at this path", right?  At least calling this "trees"
may give us slightly better indication.

> +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
> +			     const struct path_and_oids_entry *e1,
> +			     const struct path_and_oids_entry *e2,
> +			     const void *keydata)
> +{
> +	return strcmp(e1->path, e2->path);
> +}
> +
> +int map_flags = 0;

That's too bland a name for a system-wide global variable.  If it is
a file-scope "static", that might be more acceptable.

In any case, let BSS take care of the initialization and do not
initialize an int explicitly to 0.

Is it even used?  At least at this step the comparison callback
function does not seem to make any use of the fn-data, so a better
organization may be to delay the introduction of this variable to a
later step where it actually gets used, at which time, the code may
have a better context to give the variable a more suitable name.

> +static void paths_and_oids_init(struct hashmap *map)
> +{
> +	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
> +}
> +
> +static void paths_and_oids_clear(struct hashmap *map)
> +{
> +	struct hashmap_iter iter;
> +	struct path_and_oids_entry *entry;
> +	hashmap_iter_init(map, &iter);
> +
> +	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
> +		oidset_clear(&entry->set);
> +		free(entry->path);
> +	}
> +
> +	hashmap_free(map, 1);
> +}
> +
> +static void paths_and_oids_insert(struct hashmap *map,
> +				  const char *path,
> +				  const struct object_id *oid)
> +{
> +	int hash = strhash(path);
> +	struct path_and_oids_entry key;
> +	struct path_and_oids_entry *entry;
> +
> +	hashmap_entry_init(&key, hash);
> +	key.path = xstrdup(path);
> +	oidset_init(&key.set, 0);
> +
> +	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
> +		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
> +		hashmap_entry_init(entry, hash);
> +		entry->path = key.path;
> +		oidset_init(&entry->set, 16);
> +		hashmap_put(map, entry);
> +	} else {
> +		free(key.path);
> +	}

Hmph, having to always allocate only to free sounds wasteful.  

Can't we instead do the look-up with "path" that are given read-only
to us, and allocate only when we are creating the entry in the
dictionary for the first time?  After all, we'd make many look-ups
of the same path to add trees for the same path and only one of them
(i.e. the first one) is used to create an entry in the dictionary.

Or does it not matter?

> +
> +	oidset_insert(&entry->set, oid);
> +}
> +
> +static void add_children_by_path(struct repository *r,
> +				 struct tree *tree,
> +				 struct hashmap *map)
> +{
> +	struct tree_desc desc;
> +	struct name_entry entry;
> +
> +	if (!tree)
> +		return;
> +
> +	if (parse_tree_gently(tree, 1) < 0)
> +		return;
> +
> +	init_tree_desc(&desc, tree->buffer, tree->size);
> +	while (tree_entry(&desc, &entry)) {
> +		switch (object_type(entry.mode)) {
> +		case OBJ_TREE:
> +			paths_and_oids_insert(map, entry.path, entry.oid);
> +
> +			if (tree->object.flags & UNINTERESTING) {
> +				struct tree *child = lookup_tree(r, entry.oid);
> +				if (child)
> +					child->object.flags |= UNINTERESTING;
> +			}
> +			break;
> +		case OBJ_BLOB:
> +			if (tree->object.flags & UNINTERESTING) {
> +				struct blob *child = lookup_blob(r, entry.oid);
> +				if (child)
> +					child->object.flags |= UNINTERESTING;
> +			}
> +			break;
> +		default:
> +			/* Subproject commit - not in this repository */
> +			break;
> +		}
> +	}
> +
> +	free_tree_buffer(tree);
> +}

OK.

>  void mark_trees_uninteresting_sparse(struct repository *r,
>  				     struct oidset *set)
>  {
> +	unsigned has_interesting = 0, has_uninteresting = 0;
> +	struct hashmap map;
> +	struct hashmap_iter map_iter;
> +	struct path_and_oids_entry *entry;
>  	struct object_id *oid;
>  	struct oidset_iter iter;
>  
>  	oidset_iter_init(set, &iter);
> -	while ((oid = oidset_iter_next(&iter))) {
> +	while ((!has_interesting || !has_uninteresting) &&
> +	       (oid = oidset_iter_next(&iter))) {
>  		struct tree *tree = lookup_tree(r, oid);
>  
>  		if (!tree)
>  			continue;
>  
> +		if (tree->object.flags & UNINTERESTING)
> +			has_uninteresting = 1;
> +		else
> +			has_interesting = 1;
> +	}
> +
> +	/* Do not walk unless we have both types of trees. */
> +	if (!has_uninteresting || !has_interesting)
> +		return;

OK.

> +	paths_and_oids_init(&map);
> +
> +	oidset_iter_init(set, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		struct tree *tree = lookup_tree(r, oid);
> +		add_children_by_path(r, tree, &map);
>  	}
> +
> +	hashmap_iter_init(&map, &map_iter);
> +	while ((entry = hashmap_iter_next(&map_iter)))
> +		mark_trees_uninteresting_sparse(r, &entry->set);
> +
> +	paths_and_oids_clear(&map);
>  }
>  
>  struct commit_stack {
> diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
> index 81f6805bc3..45dba6e014 100755
> --- a/t/t5322-pack-objects-sparse.sh
> +++ b/t/t5322-pack-objects-sparse.sh
> @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
>  	test_cmp expect_objects.txt sparse_objects.txt
>  '
>  
> +# Demonstrate that the algorithms differ when we copy a tree wholesale
> +# from one folder to another.
> +
>  test_expect_success 'duplicate a folder from f1 into f3' '
>  	mkdir f3/f4 &&
>  	cp -r f1/f1/* f3/f4 &&
>  	git add f3/f4 &&
>  	git commit -m "Copied f1/f1 to f3/f4" &&
> -	cat >packinput.txt <<-EOF &&
> +	cat >packinput.txt <<-EOF
>  	topic1
>  	^topic1~1
>  	EOF
> -	git rev-parse		\
> -		topic1		\
> -		topic1^{tree}	\
> -		topic1:f3 | sort >expect_objects.txt
>  '
>  
>  test_expect_success 'non-sparse pack-objects' '
> +	git rev-parse			\
> +		topic1			\
> +		topic1^{tree}		\
> +		topic1:f3 | sort >expect_objects.txt &&

OK, the change above sort of makes sense ;-)

>  	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
>  	git index-pack -o nonsparse.idx nonsparse.pack &&
>  	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
> @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
>  '
>  
>  test_expect_success 'sparse pack-objects' '
> +	git rev-parse			\
> +		topic1			\
> +		topic1^{tree}		\
> +		topic1:f3		\
> +		topic1:f3/f4		\
> +		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&

And the reason why f4 and its contents are wastefully sent is
because the whole tree is duplicated, which is understandable.

>  	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
>  	git index-pack -o sparse.idx sparse.pack &&
>  	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
> -	test_cmp expect_objects.txt sparse_objects.txt
> +	test_cmp expect_sparse_objects.txt sparse_objects.txt
>  '
>  
>  test_done

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 2/6] list-objects: consume sparse tree walk
  2018-12-14 21:22       ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2019-01-11 23:20         ` Junio C Hamano
  0 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2019-01-11 23:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When creating a pack-file using 'git pack-objects --revs' we provide
> a list of interesting and uninteresting commits. For example, a push
> operation would make the local topic branch be interesting and the
> known remote refs as uninteresting. We want to discover the set of
> new objects to send to the server as a thin pack.
>
> We walk these commits until we discover a frontier of commits such
> that every commit walk starting at interesting commits ends in a root
> commit or unintersting commit. We then need to discover which
> non-commit objects are reachable from  uninteresting commits. This
> commit walk is not changing during this series.
>
> The mark_edges_uninteresting() method in list-objects.c iterates on
> the commit list and does the following:
>
> * If the commit is UNINTERSTING, then mark its root tree and every
>   object it can reach as UNINTERESTING.
>
> * If the commit is interesting, then mark the root tree of every
>   UNINTERSTING parent (and all objects that tree can reach) as
>   UNINTERSTING.
>
> At the very end, we repeat the process on every commit directly
> given to the revision walk from stdin. This helps ensure we properly
> cover shallow commits that otherwise were not included in the
> frontier.
>
> The logic to recursively follow trees is in the
> mark_tree_uninteresting() method in revision.c. The algorithm avoids
> duplicate work by not recursing into trees that are already marked
> UNINTERSTING.
>
> Add a new 'sparse' option to the mark_edges_uninteresting() method
> that performs this logic in a slightly new way. As we iterate over
> the commits, we add all of the root trees to an oidset. Then, call
> mark_trees_uninteresting_sparse() on that oidset. Note that we
> include interesting trees in this process. The current implementation
> of mark_trees_unintersting_sparse() will walk the same trees as
> the old logic, but this will be replaced in a later change.

It is unclear what "a slightly new way" refers to.  The updated code
adds the UNINTERSTING edge commits and UNINTERSTING parents of
interesting edge commits (the latter is done using the new
add-edge-parents() helper function) to an oidset and calls the new
helper introduced in [1/6] to ensure all the objects reachable from
these UNINTERSTING trees become UNINTERSTING.

Which seems to be doing exactly the same thing as the original.

Puzzled.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 3/6] pack-objects: add --sparse option
  2019-01-11 22:30         ` Junio C Hamano
@ 2019-01-15 15:06           ` Derrick Stolee
  2019-01-15 18:23             ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2019-01-15 15:06 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, avarab, jrnieder, Derrick Stolee

On 1/11/2019 5:30 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add a '--sparse' option flag to the pack-objects builtin. This
>> allows the user to specify that they want to use the new logic
>> for walking trees. This logic currently does not differ from the
>> existing output, but will in a later change.
>>
>> Create a new test script, t5322-pack-objects-sparse.sh, to ensure
>> the object list that is selected matches what we expect. When we
>> update the logic to walk in a sparse fashion, the final test will
>> be updated to show the extra objects that are added.
> Somehow checking the "these are exactly what we expect" feels
> brittle.  In a history with three relevant commits A---B---C,
> packing B..C could omit trees and blobs in C that appear in A but
> not in B, but traditionally, because we stop traversal at B and do
> not even look at A, we do not notice that such objects that need to
> complete C's tree are already available in a repository that has B.
> The approach of test in this patch feels similar to saying "we must
> send these duplicates because we must stay dumb", even though with a
> perfect knowledge about the history, e.g. bitmap, we would be able
> to omit these objects in A that appear in C but not in B.
My intention with this test was to show that the existing algorithm also 
sends "extra" objects in certain situations, which is later contrasted 
by a test where the new algorithm sends objects the old algorithm did not.

Instead of "we must stay dumb" I wanted to say "we are not dumber (in 
this case)".

I understand your brittle feeling. If the logic changed in either to be 
smarter, then we would need to change the test. In some sense, that does 
mean we have extra coverage of "this is how it works now" so we would 
understand the behavior change of that hypothetical future change.

> I think we want to test test both "are we sending enough, even
> though we might be wasting some bandwidth by not noticing the other
> side already have some?" and "are we still avoiding from becoming
> overly stupid, even though we may be cheating to save traversal cost
> and risking to redundantly send some objects?"
>
> IOW, a set of tests to make sure that truly new objects are all sent
> by seeing what is sent is a strict superset of what we expect.  And
> another set of tests to make sure that objects that are so obviously
> (this criterion may be highly subjective) be present in the
> receiving repository (e.g. the tree object of commit B and what it
> contains when seinding B..C) are never sent, even when using an
> algorithm that are tuned for traversal cost over bandwidth
> consumption.

To properly test "these objects are included," do we have an established 
pattern for saying "this file is a subset of that file"? Or, should I 
use something like `comm expected actual >common && test_cmp expected 
common`?

> The code change in this step looks all trivially good, and it may
> want to be squashed into the previous step to become a single patch.
> Otherwise, [2/6] would be adding a dead code that nobody exercises
> until this step.

Will squash.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v4 3/6] pack-objects: add --sparse option
  2019-01-15 15:06           ` Derrick Stolee
@ 2019-01-15 18:23             ` Junio C Hamano
  0 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2019-01-15 18:23 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, avarab, jrnieder,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> Instead of "we must stay dumb" I wanted to say "we are not dumber (in
> this case)".

Thanks, then what we aim for are compatible ;-)  I do agree that we
should make sure we will not ever become overly stupid.

> To properly test "these objects are included," do we have an
> established pattern for saying "this file is a subset of that file"?
> Or, should I use something like `comm expected actual >common &&
> test_cmp expected common`?

Yeah, "comm -23 must-exist actual" would be a natural thing to use
when we have two sorted files and want to see what is missing from
the actual output among those that must exist.  t6500 and t9350
already seem to use the tool, so it should be OK from portability's
point of view to non-UNIX systems.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v5 0/5] Add a new "sparse" tree walk algorithm
  2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2018-12-14 21:22       ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
@ 2019-01-16 18:25       ` Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
                           ` (4 more replies)
  6 siblings, 5 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Update in V3:

 * Change documentation around 'pack.useSparse' config setting to better
   advertise to users.
 * Mostly a ping now that v2.20.0 is out.

Updates in V4:

 * Switched to using hashmap instead of string_list for the path/oidset
   dictionary. (This is due to some fear that the string_list performance
   would degrade for a very wide tree, but I am unable to measure a
   performance difference.)
 * Some cleanup of code snippets across commits.
 * Some grammar cleanup in the commit messages.

Updates in V5:

 * Renamed the generic "set" to "trees"
 * Used a less-restrictive test condition when testing the old algorithm,
   except for the case where we want to verify that the new algorithm is
   being run.
 * Removed a global variable that was not used. Thanks, Ramsay and Junio!

Derrick Stolee (5):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/config/pack.txt      |   9 ++
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |  10 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  70 +++++++++++---
 list-objects.h                     |   4 +-
 revision.c                         | 143 +++++++++++++++++++++++++++++
 revision.h                         |   2 +
 t/README                           |   4 +
 t/t5322-pack-objects-sparse.sh     | 136 +++++++++++++++++++++++++++
 12 files changed, 378 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v4:

 1:  817e30a287 ! 1:  02ef702884 revision: add mark_tree_uninteresting_sparse
     @@ -9,7 +9,10 @@
      
          The current implementation uses the existing
          mark_tree_uninteresting to recursively walk the trees and blobs.
     -    This will walk the same number of trees as the old mechanism.
     +    This will walk the same number of trees as the old mechanism. To
     +    ensure that mark_tree_uninteresting walks the tree, we need to
     +    remove the UNINTERESTING flag before calling the method. This
     +    implementation will be replaced entirely in a later commit.
      
          There is one new assumption in this approach: we are also given
          the oids of the interesting trees. This implementation does not
     @@ -26,12 +29,12 @@
       }
       
      +void mark_trees_uninteresting_sparse(struct repository *r,
     -+				     struct oidset *set)
     ++				     struct oidset *trees)
      +{
      +	struct object_id *oid;
      +	struct oidset_iter iter;
      +
     -+	oidset_iter_init(set, &iter);
     ++	oidset_iter_init(trees, &iter);
      +	while ((oid = oidset_iter_next(&iter))) {
      +		struct tree *tree = lookup_tree(r, oid);
      +
     @@ -69,7 +72,7 @@
       
       void mark_parents_uninteresting(struct commit *commit);
       void mark_tree_uninteresting(struct repository *r, struct tree *tree);
     -+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
     ++void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
       
       void show_object_with_name(FILE *, struct object *, const char *);
       
 2:  39dc89beb9 ! 2:  0747f7494e list-objects: consume sparse tree walk
     @@ -35,18 +35,53 @@
          UNINTERSTING.
      
          Add a new 'sparse' option to the mark_edges_uninteresting() method
     -    that performs this logic in a slightly new way. As we iterate over
     -    the commits, we add all of the root trees to an oidset. Then, call
     -    mark_trees_uninteresting_sparse() on that oidset. Note that we
     +    that performs this logic in a slightly different way. As we iterate
     +    over the commits, we add all of the root trees to an oidset. Then,
     +    call mark_trees_uninteresting_sparse() on that oidset. Note that we
          include interesting trees in this process. The current implementation
          of mark_trees_unintersting_sparse() will walk the same trees as
          the old logic, but this will be replaced in a later change.
      
     -    The sparse option is not used by any callers at the moment, but
     -    will be wired to 'git pack-objects' in the next change.
     +    Add a '--sparse' flag in 'git pack-objects' to call this new logic.
     +    Add a new test script t/t5322-pack-objects-sparse.sh that tests this
     +    option. The tests currently demonstrate that the resulting object
     +    list is the same as the old algorithm. This includes a case where
     +    both algorithms pack an object that is not needed by a remote due to
     +    limits on the explored set of trees. When the sparse algorithm is
     +    changed in a later commit, we will add a test that demonstrates a
     +    change of behavior in some cases.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     + diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
     + --- a/Documentation/git-pack-objects.txt
     + +++ b/Documentation/git-pack-objects.txt
     +@@
     + 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
     + 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
     + 	[--stdout [--filter=<filter-spec>] | base-name]
     +-	[--shallow] [--keep-true-parents] < object-list
     ++	[--shallow] [--keep-true-parents] [--sparse] < object-list
     + 
     + 
     + DESCRIPTION
     +@@
     + 	Add --no-reuse-object if you want to force a uniform compression
     + 	level on all data no matter the source.
     + 
     ++--sparse::
     ++	Use the "sparse" algorithm to determine which objects to include in
     ++	the pack, when combined with the "--revs" option. This algorithm
     ++	only walks trees that appear in paths that introduce new objects.
     ++	This can have significant performance benefits when computing
     ++	a pack to send a small change. However, it is possible that extra
     ++	objects are added to the pack-file if the included commits contain
     ++	certain types of direct renames.
     ++
     + --thin::
     + 	Create a "thin" pack by omitting the common objects between a
     + 	sender and a receiver in order to reduce network transfer. This
     +
       diff --git a/bisect.c b/bisect.c
       --- a/bisect.c
       +++ b/bisect.c
     @@ -64,14 +99,31 @@
       --- a/builtin/pack-objects.c
       +++ b/builtin/pack-objects.c
      @@
     + static int depth = 50;
     + static int delta_search_threads;
     + static int pack_to_stdout;
     ++static int sparse;
     + static int thin;
     + static int num_preferred_base;
     + static struct progress *progress_state;
     +@@
       
       	if (prepare_revision_walk(&revs))
       		die(_("revision walk setup failed"));
      -	mark_edges_uninteresting(&revs, show_edge);
     -+	mark_edges_uninteresting(&revs, show_edge, 0);
     ++	mark_edges_uninteresting(&revs, show_edge, sparse);
       
       	if (!fn_show_object)
       		fn_show_object = show_object;
     +@@
     + 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
     + 		  N_("unpack unreachable objects newer than <time>"),
     + 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
     ++		OPT_BOOL(0, "sparse", &sparse,
     ++			 N_("use the sparse reachability algorithm")),
     + 		OPT_BOOL(0, "thin", &thin,
     + 			 N_("create thin packs")),
     + 		OPT_BOOL(0, "shallow", &shallow,
      
       diff --git a/builtin/rev-list.c b/builtin/rev-list.c
       --- a/builtin/rev-list.c
     @@ -202,3 +254,122 @@
       
       struct oidset;
       struct list_objects_filter_options;
     +
     + diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
     + new file mode 100755
     + --- /dev/null
     + +++ b/t/t5322-pack-objects-sparse.sh
     +@@
     ++#!/bin/sh
     ++
     ++test_description='pack-objects object selection using sparse algorithm'
     ++. ./test-lib.sh
     ++
     ++test_expect_success 'setup repo' '
     ++	test_commit initial &&
     ++	for i in $(test_seq 1 3)
     ++	do
     ++		mkdir f$i &&
     ++		for j in $(test_seq 1 3)
     ++		do
     ++			mkdir f$i/f$j &&
     ++			echo $j >f$i/f$j/data.txt
     ++		done
     ++	done &&
     ++	git add . &&
     ++	git commit -m "Initialized trees" &&
     ++	for i in $(test_seq 1 3)
     ++	do
     ++		git checkout -b topic$i master &&
     ++		echo change-$i >f$i/f$i/data.txt &&
     ++		git commit -a -m "Changed f$i/f$i/data.txt"
     ++	done &&
     ++	cat >packinput.txt <<-EOF &&
     ++	topic1
     ++	^topic2
     ++	^topic3
     ++	EOF
     ++	git rev-parse			\
     ++		topic1			\
     ++		topic1^{tree}		\
     ++		topic1:f1		\
     ++		topic1:f1/f1		\
     ++		topic1:f1/f1/data.txt | sort >expect_objects.txt
     ++'
     ++
     ++test_expect_success 'non-sparse pack-objects' '
     ++	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     ++	git index-pack -o nonsparse.idx nonsparse.pack &&
     ++	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     ++	test_cmp expect_objects.txt nonsparse_objects.txt
     ++'
     ++
     ++test_expect_success 'sparse pack-objects' '
     ++	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
     ++	git index-pack -o sparse.idx sparse.pack &&
     ++	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
     ++	test_cmp expect_objects.txt sparse_objects.txt
     ++'
     ++
     ++test_expect_success 'duplicate a folder from f3 and commit to topic1' '
     ++	git checkout topic1 &&
     ++	echo change-3 >f3/f3/data.txt &&
     ++	git commit -a -m "Changed f3/f3/data.txt" &&
     ++	git rev-parse			\
     ++		topic1~1		\
     ++		topic1~1^{tree}		\
     ++		topic1^{tree}		\
     ++		topic1			\
     ++		topic1:f1		\
     ++		topic1:f1/f1		\
     ++		topic1:f1/f1/data.txt | sort >required_objects.txt
     ++'
     ++
     ++test_expect_success 'non-sparse pack-objects' '
     ++	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     ++	git index-pack -o nonsparse.idx nonsparse.pack &&
     ++	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     ++	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
     ++	test_cmp required_objects.txt nonsparse_required_objects.txt
     ++'
     ++
     ++test_expect_success 'sparse pack-objects' '
     ++	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
     ++	git index-pack -o sparse.idx sparse.pack &&
     ++	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
     ++	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
     ++	test_cmp required_objects.txt sparse_required_objects.txt
     ++'
     ++
     ++test_expect_success 'duplicate a folder from f1 into f3' '
     ++	mkdir f3/f4 &&
     ++	cp -r f1/f1/* f3/f4 &&
     ++	git add f3/f4 &&
     ++	git commit -m "Copied f1/f1 to f3/f4" &&
     ++	cat >packinput.txt <<-EOF &&
     ++	topic1
     ++	^topic1~1
     ++	EOF
     ++	git rev-parse		\
     ++		topic1		\
     ++		topic1^{tree}   \
     ++		topic1:f3 | sort >required_objects.txt
     ++'
     ++
     ++test_expect_success 'non-sparse pack-objects' '
     ++	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     ++	git index-pack -o nonsparse.idx nonsparse.pack &&
     ++	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     ++	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
     ++	test_cmp required_objects.txt nonsparse_required_objects.txt
     ++'
     ++
     ++test_expect_success 'sparse pack-objects' '
     ++	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
     ++	git index-pack -o sparse.idx sparse.pack &&
     ++	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
     ++	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
     ++	test_cmp required_objects.txt sparse_required_objects.txt
     ++'
     ++
     ++test_done
 3:  ab733daff5 < -:  ---------- pack-objects: add --sparse option
 4:  c44172c35e ! 3:  327c102477 revision: implement sparse algorithm
     @@ -31,7 +31,7 @@
          path. We also mark trees as UNINTERESTING if the tree we are
          parsing is UNINTERESTING.
      
     -    To actually improve the peformance, we need to terminate our
     +    To actually improve the performance, we need to terminate our
          recursion. If the oidset contains only UNINTERESTING trees, then
          we do not continue the recursion. This avoids walking trees that
          are likely to not be reachable from interesting trees. If the
     @@ -46,9 +46,9 @@
          from one tree to another, the first tree will appear UNINTERESTING
          and we will not recurse to see that the subtree should also be
          UNINTERESTING. We will walk the new tree and see the subtree as
     -    a "new" object and add it to the pack. We add a test case that
     -    demonstrates this as a way to prove that the --sparse option is
     -    actually working.
     +    a "new" object and add it to the pack. A test is modified to
     +    demonstrate this behavior and to verify that the new logic is
     +    being exercised.
      
          Second, we can have extra memory pressure. If instead of being a
          single user pushing a small topic we are a server sending new
     @@ -126,7 +126,7 @@
      +struct path_and_oids_entry {
      +	struct hashmap_entry ent;
      +	char *path;
     -+	struct oidset set;
     ++	struct oidset trees;
      +};
      +
      +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
     @@ -137,10 +137,9 @@
      +	return strcmp(e1->path, e2->path);
      +}
      +
     -+int map_flags = 0;
      +static void paths_and_oids_init(struct hashmap *map)
      +{
     -+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
     ++	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0);
      +}
      +
      +static void paths_and_oids_clear(struct hashmap *map)
     @@ -150,7 +149,7 @@
      +	hashmap_iter_init(map, &iter);
      +
      +	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
     -+		oidset_clear(&entry->set);
     ++		oidset_clear(&entry->trees);
      +		free(entry->path);
      +	}
      +
     @@ -166,20 +165,20 @@
      +	struct path_and_oids_entry *entry;
      +
      +	hashmap_entry_init(&key, hash);
     -+	key.path = xstrdup(path);
     -+	oidset_init(&key.set, 0);
     ++
     ++	/* use a shallow copy for the lookup */
     ++	key.path = (char *)path;
     ++	oidset_init(&key.trees, 0);
      +
      +	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
      +		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
      +		hashmap_entry_init(entry, hash);
     -+		entry->path = key.path;
     -+		oidset_init(&entry->set, 16);
     ++		entry->path = xstrdup(key.path);
     ++		oidset_init(&entry->trees, 16);
      +		hashmap_put(map, entry);
     -+	} else {
     -+		free(key.path);
      +	}
      +
     -+	oidset_insert(&entry->set, oid);
     ++	oidset_insert(&entry->trees, oid);
      +}
      +
      +static void add_children_by_path(struct repository *r,
     @@ -224,7 +223,7 @@
      +}
      +
       void mark_trees_uninteresting_sparse(struct repository *r,
     - 				     struct oidset *set)
     + 				     struct oidset *trees)
       {
      +	unsigned has_interesting = 0, has_uninteresting = 0;
      +	struct hashmap map;
     @@ -233,7 +232,7 @@
       	struct object_id *oid;
       	struct oidset_iter iter;
       
     - 	oidset_iter_init(set, &iter);
     + 	oidset_iter_init(trees, &iter);
      -	while ((oid = oidset_iter_next(&iter))) {
      +	while ((!has_interesting || !has_uninteresting) &&
      +	       (oid = oidset_iter_next(&iter))) {
     @@ -263,7 +262,7 @@
      +
      +	paths_and_oids_init(&map);
      +
     -+	oidset_iter_init(set, &iter);
     ++	oidset_iter_init(trees, &iter);
      +	while ((oid = oidset_iter_next(&iter))) {
      +		struct tree *tree = lookup_tree(r, oid);
      +		add_children_by_path(r, tree, &map);
     @@ -271,7 +270,7 @@
      +
      +	hashmap_iter_init(&map, &map_iter);
      +	while ((entry = hashmap_iter_next(&map_iter)))
     -+		mark_trees_uninteresting_sparse(r, &entry->set);
     ++		mark_trees_uninteresting_sparse(r, &entry->trees);
      +
      +	paths_and_oids_clear(&map);
       }
     @@ -282,7 +281,7 @@
       --- a/t/t5322-pack-objects-sparse.sh
       +++ b/t/t5322-pack-objects-sparse.sh
      @@
     - 	test_cmp expect_objects.txt sparse_objects.txt
     + 	test_cmp required_objects.txt sparse_required_objects.txt
       '
       
      +# Demonstrate that the algorithms differ when we copy a tree wholesale
     @@ -291,27 +290,15 @@
       test_expect_success 'duplicate a folder from f1 into f3' '
       	mkdir f3/f4 &&
       	cp -r f1/f1/* f3/f4 &&
     - 	git add f3/f4 &&
     - 	git commit -m "Copied f1/f1 to f3/f4" &&
     --	cat >packinput.txt <<-EOF &&
     -+	cat >packinput.txt <<-EOF
     - 	topic1
     - 	^topic1~1
     - 	EOF
     --	git rev-parse		\
     --		topic1		\
     --		topic1^{tree}	\
     --		topic1:f3 | sort >expect_objects.txt
     +@@
       '
       
       test_expect_success 'non-sparse pack-objects' '
     -+	git rev-parse			\
     -+		topic1			\
     -+		topic1^{tree}		\
     -+		topic1:f3 | sort >expect_objects.txt &&
     - 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     +-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     ++	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
       	git index-pack -o nonsparse.idx nonsparse.pack &&
       	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     + 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
      @@
       '
       
     @@ -325,7 +312,8 @@
       	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
       	git index-pack -o sparse.idx sparse.pack &&
       	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
     --	test_cmp expect_objects.txt sparse_objects.txt
     +-	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
     +-	test_cmp required_objects.txt sparse_required_objects.txt
      +	test_cmp expect_sparse_objects.txt sparse_objects.txt
       '
       
 5:  f386f6c3c9 ! 4:  28111d70d4 pack-objects: create pack.useSparse setting
     @@ -73,7 +73,7 @@
      +	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
      +	git index-pack -o sparse.idx sparse.pack &&
      +	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
     -+	test_cmp expect_objects.txt sparse_objects.txt
     ++	test_cmp required_objects.txt sparse_objects.txt
      +'
      +
       test_done
 6:  d011a9c1b1 ! 5:  ae046e8525 pack-objects: create GIT_TEST_PACK_SPARSE
     @@ -56,13 +56,4 @@
      +	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
       	git index-pack -o nonsparse.idx nonsparse.pack &&
       	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     - 	test_cmp expect_objects.txt nonsparse_objects.txt
     -@@
     - 		topic1			\
     - 		topic1^{tree}		\
     - 		topic1:f3 | sort >expect_objects.txt &&
     --	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
     -+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
     - 	git index-pack -o nonsparse.idx nonsparse.pack &&
     - 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
     - 	test_cmp expect_objects.txt nonsparse_objects.txt
     + 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v5 2/5] list-objects: consume sparse tree walk
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
@ 2019-01-16 18:25         ` Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits. This
commit walk is not changing during this series.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly different way. As we iterate
over the commits, we add all of the root trees to an oidset. Then,
call mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

Add a '--sparse' flag in 'git pack-objects' to call this new logic.
Add a new test script t/t5322-pack-objects-sparse.sh that tests this
option. The tests currently demonstrate that the resulting object
list is the same as the old algorithm. This includes a case where
both algorithms pack an object that is not needed by a remote due to
limits on the explored set of trees. When the sparse algorithm is
changed in a later commit, we will add a test that demonstrates a
change of behavior in some cases.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-pack-objects.txt |  11 ++-
 bisect.c                           |   2 +-
 builtin/pack-objects.c             |   5 +-
 builtin/rev-list.c                 |   2 +-
 http-push.c                        |   2 +-
 list-objects.c                     |  70 +++++++++++++++---
 list-objects.h                     |   4 +-
 t/t5322-pack-objects-sparse.sh     | 113 +++++++++++++++++++++++++++++
 8 files changed, 192 insertions(+), 17 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..e45f3e680d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,15 @@ depth is 4095.
 	Add --no-reuse-object if you want to force a uniform compression
 	level on all data no matter the source.
 
+--sparse::
+	Use the "sparse" algorithm to determine which objects to include in
+	the pack, when combined with the "--revs" option. This algorithm
+	only walks trees that appear in paths that introduce new objects.
+	This can have significant performance benefits when computing
+	a pack to send a small change. However, it is possible that extra
+	objects are added to the pack-file if the included commits contain
+	certain types of direct renames.
+
 --thin::
 	Create a "thin" pack by omitting the common objects between a
 	sender and a receiver in order to reduce network transfer. This
diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..fb728f7842 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+		struct tree *tree = get_commit_tree(parent);
+
+		if (!tree)
+			continue;
+
+		oidset_insert(set, &tree->object.oid);
+
+		if (!(parent->object.flags & UNINTERESTING))
+			continue;
+		tree->object.flags |= UNINTERESTING;
+
+		if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
+			parent->object.flags |= SHOWN;
+			show_edge(parent);
+		}
+	}
+}
+
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse)
 {
 	struct commit_list *list;
 	int i;
 
-	for (list = revs->commits; list; list = list->next) {
-		struct commit *commit = list->item;
+	if (sparse) {
+		struct oidset set;
+		oidset_init(&set, 16);
 
-		if (commit->object.flags & UNINTERESTING) {
-			mark_tree_uninteresting(revs->repo,
-						get_commit_tree(commit));
-			if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
-				commit->object.flags |= SHOWN;
-				show_edge(commit);
+		for (list = revs->commits; list; list = list->next) {
+			struct commit *commit = list->item;
+			struct tree *tree = get_commit_tree(commit);
+
+			if (commit->object.flags & UNINTERESTING)
+				tree->object.flags |= UNINTERESTING;
+
+			oidset_insert(&set, &tree->object.oid);
+			add_edge_parents(commit, revs, show_edge, &set);
+		}
+
+		mark_trees_uninteresting_sparse(revs->repo, &set);
+		oidset_clear(&set);
+	} else {
+		for (list = revs->commits; list; list = list->next) {
+			struct commit *commit = list->item;
+			if (commit->object.flags & UNINTERESTING) {
+				mark_tree_uninteresting(revs->repo,
+							get_commit_tree(commit));
+				if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
+					commit->object.flags |= SHOWN;
+					show_edge(commit);
+				}
+				continue;
 			}
-			continue;
+			mark_edge_parents_uninteresting(commit, revs, show_edge);
 		}
-		mark_edge_parents_uninteresting(commit, revs, show_edge);
 	}
+
 	if (revs->edge_hint_aggressive) {
 		for (i = 0; i < revs->cmdline.nr; i++) {
 			struct object *obj = revs->cmdline.rev[i].item;
diff --git a/list-objects.h b/list-objects.h
index ad40762926..a952680e46 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
 
 typedef void (*show_edge_fn)(struct commit *);
-void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
+void mark_edges_uninteresting(struct rev_info *revs,
+			      show_edge_fn show_edge,
+			      int sparse);
 
 struct oidset;
 struct list_objects_filter_options;
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 0000000000..30aef6498a
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,113 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+	git checkout topic1 &&
+	echo change-3 >f3/f3/data.txt &&
+	git commit -a -m "Changed f3/f3/data.txt" &&
+	git rev-parse			\
+		topic1~1		\
+		topic1~1^{tree}		\
+		topic1^{tree}		\
+		topic1			\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >required_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
+	test_cmp required_objects.txt nonsparse_required_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
+	test_cmp required_objects.txt sparse_required_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f1 into f3' '
+	mkdir f3/f4 &&
+	cp -r f1/f1/* f3/f4 &&
+	git add f3/f4 &&
+	git commit -m "Copied f1/f1 to f3/f4" &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic1~1
+	EOF
+	git rev-parse		\
+		topic1		\
+		topic1^{tree}   \
+		topic1:f3 | sort >required_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
+	test_cmp required_objects.txt nonsparse_required_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
+	test_cmp required_objects.txt sparse_required_objects.txt
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
@ 2019-01-16 18:25         ` Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism. To
ensure that mark_tree_uninteresting walks the tree, we need to
remove the UNINTERESTING flag before calling the method. This
implementation will be replaced entirely in a later commit.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 25 +++++++++++++++++++++++++
 revision.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..60421f3a10 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *trees)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(trees, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (!tree)
+			continue;
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_unintersting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..df684701b9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 3/5] revision: implement sparse algorithm
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
  2019-01-16 18:25         ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
@ 2019-01-16 18:25         ` Derrick Stolee via GitGitGadget
  2019-01-16 18:26         ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
  2019-01-16 18:26         ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c                     | 138 ++++++++++++++++++++++++++++++---
 t/t5322-pack-objects-sparse.sh |  14 +++-
 2 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/revision.c b/revision.c
index 60421f3a10..5de739384a 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "hashmap.h"

 volatile show_early_output_fn_t show_early_output;

@@ -99,29 +100,146 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }

+struct path_and_oids_entry {
+	struct hashmap_entry ent;
+	char *path;
+	struct oidset trees;
+};
+
+static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
+			     const struct path_and_oids_entry *e1,
+			     const struct path_and_oids_entry *e2,
+			     const void *keydata)
+{
+	return strcmp(e1->path, e2->path);
+}
+
+static void paths_and_oids_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0);
+}
+
+static void paths_and_oids_clear(struct hashmap *map)
+{
+	struct hashmap_iter iter;
+	struct path_and_oids_entry *entry;
+	hashmap_iter_init(map, &iter);
+
+	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
+		oidset_clear(&entry->trees);
+		free(entry->path);
+	}
+
+	hashmap_free(map, 1);
+}
+
+static void paths_and_oids_insert(struct hashmap *map,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	int hash = strhash(path);
+	struct path_and_oids_entry key;
+	struct path_and_oids_entry *entry;
+
+	hashmap_entry_init(&key, hash);
+
+	/* use a shallow copy for the lookup */
+	key.path = (char *)path;
+	oidset_init(&key.trees, 0);
+
+	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
+		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
+		hashmap_entry_init(entry, hash);
+		entry->path = xstrdup(key.path);
+		oidset_init(&entry->trees, 16);
+		hashmap_put(map, entry);
+	}
+
+	oidset_insert(&entry->trees, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct hashmap *map)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(map, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *trees)
 {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
 	struct object_id *oid;
 	struct oidset_iter iter;

 	oidset_iter_init(trees, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);

 		if (!tree)
 			continue;

-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&map);
+
+	oidset_iter_init(trees, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &map);
 	}
+
+	hashmap_iter_init(&map, &map_iter);
+	while ((entry = hashmap_iter_next(&map_iter)))
+		mark_trees_uninteresting_sparse(r, &entry->trees);
+
+	paths_and_oids_clear(&map);
 }

 struct commit_stack {
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 30aef6498a..9f2a6e5d31 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -79,6 +79,9 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp required_objects.txt sparse_required_objects.txt
 '

+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
@@ -95,7 +98,7 @@ test_expect_success 'duplicate a folder from f1 into f3' '
 '

 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
@@ -103,11 +106,16 @@ test_expect_success 'non-sparse pack-objects' '
 '

 test_expect_success 'sparse pack-objects' '
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f3		\
+		topic1:f3/f4		\
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
-	test_cmp required_objects.txt sparse_required_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 4/5] pack-objects: create pack.useSparse setting
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                           ` (2 preceding siblings ...)
  2019-01-16 18:25         ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
@ 2019-01-16 18:26         ` Derrick Stolee via GitGitGadget
  2019-01-16 18:26         ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/pack.txt  |  9 +++++++++
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index edac75c83f..425c73aa52 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -105,6 +105,15 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useSparse::
+	When true, git will default to using the '--sparse' option in
+	'git pack-objects' when the '--revs' option is present. This
+	algorithm only walks trees that appear in paths that introduce new
+	objects. This can have significant performance benefits when
+	computing a pack to send a small change. However, it is possible
+	that extra objects are added to the pack-file if the included
+	commits contain certain types of direct renames.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 9f2a6e5d31..3233fafc90 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -118,4 +118,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp required_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE
  2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
                           ` (3 preceding siblings ...)
  2019-01-16 18:26         ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
@ 2019-01-16 18:26         ` Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/pack-objects.c         | 1 +
 t/README                       | 4 ++++
 t/t5322-pack-objects-sparse.sh | 4 ++--
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 3233fafc90..7124b5581a 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -64,7 +64,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2019-01-16 18:26 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-11-28 22:11   ` Stefan Beller
2018-11-29 14:20     ` Derrick Stolee
2018-11-30  2:39       ` Junio C Hamano
2018-11-30 15:53         ` Derrick Stolee
2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason
2018-11-29  4:05   ` Derrick Stolee
2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2019-01-11 19:43         ` Junio C Hamano
2019-01-11 20:25           ` Junio C Hamano
2019-01-11 22:05             ` Derrick Stolee
2018-12-14 21:22       ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2019-01-11 23:20         ` Junio C Hamano
2018-12-14 21:22       ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2019-01-11 22:30         ` Junio C Hamano
2019-01-15 15:06           ` Derrick Stolee
2019-01-15 18:23             ` Junio C Hamano
2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
2018-12-17 14:20           ` Derrick Stolee
2018-12-17 14:26             ` Ævar Arnfjörð Bjarmason
2018-12-17 14:50               ` Derrick Stolee
2019-01-11 23:20         ` Junio C Hamano
2018-12-14 21:22       ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-12-14 21:22       ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2019-01-16 18:26         ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2019-01-16 18:26         ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).