* [PATCH 0/5] Add a new "sparse" tree walk algorithm @ 2018-11-28 21:52 Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget ` (6 more replies) 0 siblings, 7 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Derrick Stolee (5): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 9 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 51 ++++++++++- list-objects.h | 4 +- revision.c | 113 +++++++++++++++++++++++ revision.h | 2 + t/t5322-pack-objects-sparse.sh | 139 +++++++++++++++++++++++++++++ 10 files changed, 323 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/89 -- gitgitgadget ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 1/5] revision: add mark_tree_uninteresting_sparse 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget ` (5 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 22 ++++++++++++++++++++++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, + struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call + * is not a no-op. The flag is added + * in mark_tree_unintersting(). + */ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 2/5] list-objects: consume sparse tree walk 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 51 ++++++++++++++++++++++++++++++++++++++---- list-objects.h | 4 +++- 6 files changed, 54 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..9bb93d1640 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, + struct rev_info *revs, + show_edge_fn show_edge, + struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) + continue; + tree->object.flags |= UNINTERESTING; + + if (revs->edge_hint && !(parent->object.flags & SHOWN)) { + parent->object.flags |= SHOWN; + show_edge(parent); + } + } +} + +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse) { struct commit_list *list; + struct oidset set; int i; + if (sparse) + oidset_init(&set, 16); + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; + + if (sparse) { + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; - if (commit->object.flags & UNINTERESTING) { + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } else if (commit->object.flags & UNINTERESTING) { mark_tree_uninteresting(revs->repo, get_commit_tree(commit)); if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { commit->object.flags |= SHOWN; show_edge(commit); } - continue; + } else { + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } + + if (sparse) { + mark_trees_uninteresting_sparse(revs->repo, &set); + oidset_clear(&set); + } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { struct object *obj = revs->cmdline.rev[i].item; diff --git a/list-objects.h b/list-objects.h index ad40762926..a952680e46 100644 --- a/list-objects.h +++ b/list-objects.h @@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *); void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *); typedef void (*show_edge_fn)(struct commit *); -void mark_edges_uninteresting(struct rev_info *, show_edge_fn); +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse); struct oidset; struct list_objects_filter_options; -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 3/5] pack-objects: add --sparse option 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget 2018-11-28 22:11 ` Stefan Beller 2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 6 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 +++++++++++++++++++++++++++++ 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] [--stdout [--filter=<filter-spec>] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, + N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 0000000000..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +# Demonstrate that both algorithms send "extra" objects because +# they are not in the frontier. + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1 \ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt \ + topic1:f3 \ + topic1:f3/f3 \ + topic1:f3/f3/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_expect_success 'duplicate a folder from f1 into f3' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/5] pack-objects: add --sparse option 2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2018-11-28 22:11 ` Stefan Beller 2018-11-29 14:20 ` Derrick Stolee 0 siblings, 1 reply; 51+ messages in thread From: Stefan Beller @ 2018-11-28 22:11 UTC (permalink / raw) To: gitgitgadget Cc: git, Jeff King, Ævar Arnfjörð Bjarmason, Jonathan Nieder, Junio C Hamano, Derrick Stolee > +--sparse:: > + Use the "sparse" algorithm to determine which objects to include in > + the pack. This can have significant performance benefits when computing > + a pack to send a small change. However, it is possible that extra > + objects are added to the pack-file if the included commits contain > + certain types of direct renames. As a user, where do I find a discussion of these walking algorithms? (i.e. how can I tell if I can really expect to gain performance benefits, what are the tradeoffs? "If it were strictly better than the original, it would be on by default" might be a thought of a user) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/5] pack-objects: add --sparse option 2018-11-28 22:11 ` Stefan Beller @ 2018-11-29 14:20 ` Derrick Stolee 2018-11-30 2:39 ` Junio C Hamano 0 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee @ 2018-11-29 14:20 UTC (permalink / raw) To: Stefan Beller, gitgitgadget Cc: git, Jeff King, Ævar Arnfjörð Bjarmason, Jonathan Nieder, Junio C Hamano, Derrick Stolee On 11/28/2018 5:11 PM, Stefan Beller wrote: >> +--sparse:: >> + Use the "sparse" algorithm to determine which objects to include in >> + the pack. This can have significant performance benefits when computing >> + a pack to send a small change. However, it is possible that extra >> + objects are added to the pack-file if the included commits contain >> + certain types of direct renames. > As a user, where do I find a discussion of these walking algorithms? > (i.e. how can I tell if I can really expect to gain performance benefits, > what are the tradeoffs? "If it were strictly better than the original, > it would be on by default" might be a thought of a user) You're right that having this hidden as an opt-in config variable makes it hard to discover as a typical user. I would argue that we should actually make the config setting true by default, and recommend that servers opt-out. Here are my reasons: 1. The vast majority of users are clients. 2. Client users are not likely to know about and tweak these settings. 3. Server users are more likely to keep an eye on the different knobs they can tweak. 4. Servers should use the reachability bitmaps, which don't use this logic anyway. While _eventually_ we should make this opt-out, we shouldn't do that until it has cooked a while. Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/5] pack-objects: add --sparse option 2018-11-29 14:20 ` Derrick Stolee @ 2018-11-30 2:39 ` Junio C Hamano 2018-11-30 15:53 ` Derrick Stolee 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2018-11-30 2:39 UTC (permalink / raw) To: Derrick Stolee Cc: Stefan Beller, gitgitgadget, git, Jeff King, Ævar Arnfjörð Bjarmason, Jonathan Nieder, Derrick Stolee Derrick Stolee <stolee@gmail.com> writes: > You're right that having this hidden as an opt-in config variable > makes it hard to discover as a typical user. > > I would argue that we should actually make the config setting true by > default, and recommend that servers opt-out. Here are my reasons: > > 1. The vast majority of users are clients. > > 2. Client users are not likely to know about and tweak these settings. > > 3. Server users are more likely to keep an eye on the different knobs > they can tweak. > > 4. Servers should use the reachability bitmaps, which don't use this > logic anyway. > > While _eventually_ we should make this opt-out, we shouldn't do that > until it has cooked a while. I actually do not agree. If the knob gives enough benefit, the users will learn about it viva voce, and in a few more releases we'll hear "enough users complain that they have to turn it on, let's make it the default". If that does not happen, the knob does not deserve to be turned on in the first place. The same applies to many shiny new toys people are discussing recently on this list (e.g. precious vs expendable and non-overlay checkout are the ones that immediately come to my mind). Having said that, I won't be commenting on this shiny new toy before the final. I want to see more people help tying the loose ends and give it final varnish to the upcoming release, as it seems to have become rockier and larger release than we originally anticipated. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/5] pack-objects: add --sparse option 2018-11-30 2:39 ` Junio C Hamano @ 2018-11-30 15:53 ` Derrick Stolee 0 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee @ 2018-11-30 15:53 UTC (permalink / raw) To: Junio C Hamano Cc: Stefan Beller, gitgitgadget, git, Jeff King, Ævar Arnfjörð Bjarmason, Jonathan Nieder, Derrick Stolee On 11/29/2018 9:39 PM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> While _eventually_ we should make this opt-out, we shouldn't do that >> until it has cooked a while. > I actually do not agree. If the knob gives enough benefit, the > users will learn about it viva voce, and in a few more releases > we'll hear "enough users complain that they have to turn it on, > let's make it the default". If that does not happen, the knob > does not deserve to be turned on in the first place. That's a good philosophy. I'll keep it in mind. > Having said that, I won't be commenting on this shiny new toy before > the final. I want to see more people help tying the loose ends and > give it final varnish to the upcoming release, as it seems to have > become rockier and larger release than we originally anticipated. Yeah, no rush on this one. I just wanted to get some initial feedback about the idea. Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 4/5] revision: implement sparse algorithm 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 111 ++++++++++++++++++++++++++++++--- t/t5322-pack-objects-sparse.sh | 21 +++++-- 2 files changed, 116 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index 3a62c7c187..7e4bfe621a 100644 --- a/revision.c +++ b/revision.c @@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ + string_list_init(&po->list, 1); +} + +static void paths_and_oids_clear(struct paths_and_oids *po) +{ + int i; + for (i = 0; i < po->list.nr; i++) { + oidset_clear(po->list.items[i].util); + free(po->list.items[i].util); + } + + string_list_clear(&po->list, 0); +} + +static void paths_and_oids_insert(struct paths_and_oids *po, + const char *path, + const struct object_id *oid) +{ + struct string_list_item *item = string_list_insert(&po->list, path); + struct oidset *set; + + if (!item->util) { + set = xcalloc(1, sizeof(struct oidset)); + oidset_init(set, 16); + item->util = set; + } else { + set = item->util; + } + + oidset_insert(set, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, + struct paths_and_oids *po) +{ + struct tree_desc desc; + struct name_entry entry; + + if (parse_tree_gently(tree, 1) < 0) + return; + + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: + paths_and_oids_insert(po, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); + child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); + child->object.flags |= UNINTERESTING; + } + break; + default: + /* Subproject commit - not in this repository */ + break; + } + } + + free_tree_buffer(tree); +} + void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + int i; + unsigned has_interesting = 0, has_uninteresting = 0; + struct paths_and_oids po; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + + paths_and_oids_init(&po); + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &po); + } + + for (i = 0; i < po.list.nr; i++) + mark_trees_uninteresting_sparse( + r, (struct oidset *)po.list.items[i].util); + + paths_and_oids_clear(&po); } struct commit_stack { diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 81f6805bc3..45dba6e014 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_objects.txt sparse_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale +# from one folder to another. + test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && git add f3/f4 && git commit -m "Copied f1/f1 to f3/f4" && - cat >packinput.txt <<-EOF && + cat >packinput.txt <<-EOF topic1 ^topic1~1 EOF - git rev-parse \ - topic1 \ - topic1^{tree} \ - topic1:f3 | sort >expect_objects.txt ' test_expect_success 'non-sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt && git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' ' ' test_expect_success 'sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 \ + topic1:f3/f4 \ + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && - test_cmp expect_objects.txt sparse_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 5/5] pack-objects: create pack.useSparse setting 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 ` Derrick Stolee via GitGitGadget 2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-28 21:52 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 4 ++++ t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget @ 2018-11-28 22:18 ` Ævar Arnfjörð Bjarmason 2018-11-29 4:05 ` Derrick Stolee 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget 6 siblings, 1 reply; 51+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-11-28 22:18 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, peff, jrnieder, Junio C Hamano On Wed, Nov 28 2018, Derrick Stolee via GitGitGadget wrote: > One of the biggest remaining pain points for users of very large > repositories is the time it takes to run 'git push'. We inspected some slow > pushes by our developers and found that the "Enumerating Objects" phase of a > push was very slow. This is unsurprising, because this is why reachability > bitmaps exist. However, reachability bitmaps are not available to us because > of the single pack-file requirement. The bitmap approach is intended for > servers anyway, and clients have a much different behavior pattern. > > Specifically, clients are normally pushing a very small number of objects > compared to the entire working directory. A typical user changes only a > small cone of the working directory, so let's use that to our benefit. > > Create a new "sparse" mode for 'git pack-objects' that uses the paths that > introduce new objects to direct our search into the reachable trees. By > collecting trees at each path, we can then recurse into a path only when > there are uninteresting and interesting trees at that path. This gains a > significant performance boost for small topics while presenting a > possibility of packing extra objects. > > The main algorithm change is in patch 4, but is set up a little bit in > patches 1 and 2. > > As demonstrated in the included test script, we see that the existing > algorithm can send extra objects due to the way we specify the "frontier". > But we can send even more objects if a user copies objects from one folder > to another. I say "copy" because a rename would (usually) change the > original folder and trigger a walk into that path, discovering the objects. > > In order to benefit from this approach, the user can opt-in using the > pack.useSparse config setting. This setting can be overridden using the > '--no-sparse' option. This is really interesting. I tested this with: diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..5c7615f06c 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av) - mark_edges_uninteresting(&revs, show_edge, sparse); + mark_edges_uninteresting(&revs, show_edge, 1); To emulate having a GIT_TEST_* mode for this, which seems like a good idea since it turned up a lot of segfaults in pack-objects. I wasn't able to get a backtrace for that since it always happens indirectly, and I didn't dig enough to see how to manually invoke it the right way. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm 2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason @ 2018-11-29 4:05 ` Derrick Stolee 0 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee @ 2018-11-29 4:05 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, peff, jrnieder, Junio C Hamano On 11/28/2018 5:18 PM, Ævar Arnfjörð Bjarmason wrote: > This is really interesting. I tested this with: > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 124b1bafc4..5c7615f06c 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av) > - mark_edges_uninteresting(&revs, show_edge, sparse); > + mark_edges_uninteresting(&revs, show_edge, 1); > > To emulate having a GIT_TEST_* mode for this, which seems like a good > idea since it turned up a lot of segfaults in pack-objects. I wasn't > able to get a backtrace for that since it always happens indirectly, and > I didn't dig enough to see how to manually invoke it the right way. Thanks for double-checking this. I had run a similar test in my prototype implementation, but over-simplified some code when rewriting it for submission (and then forgot to re-run that test). Specifically, these null checks are important: diff --git a/list-objects.c b/list-objects.c index 9bb93d1640..7e864b4db8 100644 --- a/list-objects.c +++ b/list-objects.c @@ -232,6 +232,10 @@ static void add_edge_parents(struct commit *commit, for (parents = commit->parents; parents; parents = parents->next) { struct commit *parent = parents->item; struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + oidset_insert(set, &tree->object.oid); if (!(parent->object.flags & UNINTERESTING)) @@ -261,6 +265,8 @@ void mark_edges_uninteresting(struct rev_info *revs, if (sparse) { struct tree *tree = get_commit_tree(commit); + if (!tree) + continue; if (commit->object.flags & UNINTERESTING) tree->object.flags |= UNINTERESTING; I will definitely include a GIT_TEST_* variable in v2. Thanks, -Stolee ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 0/6] Add a new "sparse" tree walk algorithm 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (5 preceding siblings ...) 2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget ` (6 more replies) 6 siblings, 7 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 10 ++- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 55 +++++++++++- list-objects.h | 4 +- revision.c | 121 +++++++++++++++++++++++++ revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 +++++++++++++++++++++++++++++ 11 files changed, 340 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v1: 1: b73b8de98c = 1: 60617681f7 revision: add mark_tree_uninteresting_sparse 2: 9bf04c748b ! 2: 4527addacb list-objects: consume sparse tree walk @@ -116,6 +116,10 @@ + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); ++ ++ if (!tree) ++ continue; ++ + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) @@ -142,14 +146,14 @@ + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; -+ + +- if (commit->object.flags & UNINTERESTING) { + if (sparse) { + struct tree *tree = get_commit_tree(commit); -+ ++ + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; - -- if (commit->object.flags & UNINTERESTING) { ++ + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } else if (commit->object.flags & UNINTERESTING) { @@ -189,3 +193,17 @@ struct oidset; struct list_objects_filter_options; + +diff --git a/revision.c b/revision.c +--- a/revision.c ++++ b/revision.c +@@ + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + ++ if (!tree) ++ continue; ++ + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call 3: 9d6b8f6d06 = 3: 9644f6ff04 pack-objects: add --sparse option 4: 0725aac4bb ! 4: c99957d06f revision: implement sparse algorithm @@ -157,6 +157,9 @@ + struct tree_desc desc; + struct name_entry entry; + ++ if (!tree) ++ return; ++ + if (parse_tree_gently(tree, 1) < 0) + return; + @@ -168,13 +171,15 @@ + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); -+ child->object.flags |= UNINTERESTING; ++ if (child) ++ child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); -+ child->object.flags |= UNINTERESTING; ++ if (child) ++ child->object.flags |= UNINTERESTING; + } + break; + default: @@ -201,6 +206,9 @@ + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); + if (!tree) + continue; + - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call 5: bbc3f78182 = 5: d6912188be pack-objects: create pack.useSparse setting -: ---------- > 6: 3d394a9136 pack-objects: create GIT_TEST_PACK_SPARSE -- gitgitgadget ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget ` (5 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 22 ++++++++++++++++++++++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, + struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call + * is not a no-op. The flag is added + * in mark_tree_unintersting(). + */ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 2/6] list-objects: consume sparse tree walk 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 55 +++++++++++++++++++++++++++++++++++++++--- list-objects.h | 4 ++- revision.c | 3 +++ 7 files changed, 61 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..4fbdeca0a4 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, + struct rev_info *revs, + show_edge_fn show_edge, + struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) + continue; + tree->object.flags |= UNINTERESTING; + + if (revs->edge_hint && !(parent->object.flags & SHOWN)) { + parent->object.flags |= SHOWN; + show_edge(parent); + } + } +} + +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse) { struct commit_list *list; + struct oidset set; int i; + if (sparse) + oidset_init(&set, 16); + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; - if (commit->object.flags & UNINTERESTING) { + if (sparse) { + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; + + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } else if (commit->object.flags & UNINTERESTING) { mark_tree_uninteresting(revs->repo, get_commit_tree(commit)); if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { commit->object.flags |= SHOWN; show_edge(commit); } - continue; + } else { + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } + + if (sparse) { + mark_trees_uninteresting_sparse(revs->repo, &set); + oidset_clear(&set); + } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { struct object *obj = revs->cmdline.rev[i].item; diff --git a/list-objects.h b/list-objects.h index ad40762926..a952680e46 100644 --- a/list-objects.h +++ b/list-objects.h @@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *); void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *); typedef void (*show_edge_fn)(struct commit *); -void mark_edges_uninteresting(struct rev_info *, show_edge_fn); +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse); struct oidset; struct list_objects_filter_options; diff --git a/revision.c b/revision.c index 3a62c7c187..f9eb6400f1 100644 --- a/revision.c +++ b/revision.c @@ -109,6 +109,9 @@ void mark_trees_uninteresting_sparse(struct repository *r, while ((oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); + if (!tree) + continue; + if (tree->object.flags & UNINTERESTING) { /* * Remove the flag so the next call -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 3/6] pack-objects: add --sparse option 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 +++++++++++++++++++++++++++++ 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] [--stdout [--filter=<filter-spec>] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, + N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 0000000000..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +# Demonstrate that both algorithms send "extra" objects because +# they are not in the frontier. + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1 \ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt \ + topic1:f3 \ + topic1:f3/f3 \ + topic1:f3/f3/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_expect_success 'duplicate a folder from f1 into f3' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 4/6] revision: implement sparse algorithm 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2018-11-29 14:24 ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 116 ++++++++++++++++++++++++++++++--- t/t5322-pack-objects-sparse.sh | 21 ++++-- 2 files changed, 121 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..971f1bb095 100644 --- a/revision.c +++ b/revision.c @@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ + string_list_init(&po->list, 1); +} + +static void paths_and_oids_clear(struct paths_and_oids *po) +{ + int i; + for (i = 0; i < po->list.nr; i++) { + oidset_clear(po->list.items[i].util); + free(po->list.items[i].util); + } + + string_list_clear(&po->list, 0); +} + +static void paths_and_oids_insert(struct paths_and_oids *po, + const char *path, + const struct object_id *oid) +{ + struct string_list_item *item = string_list_insert(&po->list, path); + struct oidset *set; + + if (!item->util) { + set = xcalloc(1, sizeof(struct oidset)); + oidset_init(set, 16); + item->util = set; + } else { + set = item->util; + } + + oidset_insert(set, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, + struct paths_and_oids *po) +{ + struct tree_desc desc; + struct name_entry entry; + + if (!tree) + return; + + if (parse_tree_gently(tree, 1) < 0) + return; + + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: + paths_and_oids_insert(po, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + default: + /* Subproject commit - not in this repository */ + break; + } + } + + free_tree_buffer(tree); +} + void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + int i; + unsigned has_interesting = 0, has_uninteresting = 0; + struct paths_and_oids po; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + + paths_and_oids_init(&po); + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &po); + } + + for (i = 0; i < po.list.nr; i++) + mark_trees_uninteresting_sparse( + r, (struct oidset *)po.list.items[i].util); + + paths_and_oids_clear(&po); } struct commit_stack { diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 81f6805bc3..45dba6e014 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_objects.txt sparse_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale +# from one folder to another. + test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && git add f3/f4 && git commit -m "Copied f1/f1 to f3/f4" && - cat >packinput.txt <<-EOF && + cat >packinput.txt <<-EOF topic1 ^topic1~1 EOF - git rev-parse \ - topic1 \ - topic1^{tree} \ - topic1:f3 | sort >expect_objects.txt ' test_expect_success 'non-sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt && git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' ' ' test_expect_success 'sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 \ + topic1:f3/f4 \ + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && - test_cmp expect_objects.txt sparse_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 5/6] pack-objects: create pack.useSparse setting 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2018-11-29 14:24 ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 4 ++++ t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2018-11-29 14:24 ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-11-29 14:24 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 1 + t/README | 4 ++++ t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 0/6] Add a new "sparse" tree walk algorithm 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget ` (5 preceding siblings ...) 2018-11-29 14:24 ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget ` (6 more replies) 6 siblings, 7 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Update in V3: * Change documentation around 'pack.useSparse' config setting to better advertise to users. * Mostly a ping now that v2.20.0 is out. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/config/pack.txt | 9 ++ Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 10 ++- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 55 +++++++++++- list-objects.h | 4 +- revision.c | 121 +++++++++++++++++++++++++ revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 +++++++++++++++++++++++++++++ 12 files changed, 351 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v2: 1: 60617681f7 = 1: 60617681f7 revision: add mark_tree_uninteresting_sparse 2: 4527addacb = 2: 4527addacb list-objects: consume sparse tree walk 3: 9644f6ff04 ! 3: 4ef318bdb2 pack-objects: add --sparse option @@ -32,7 +32,9 @@ +--sparse:: + Use the "sparse" algorithm to determine which objects to include in -+ the pack. This can have significant performance benefits when computing ++ the pack, when combined with the "--revs" option. This algorithm ++ only walks trees that appear in paths that introduce new objects. ++ This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. 4: c99957d06f = 4: 571b2e2784 revision: implement sparse algorithm 5: d6912188be ! 5: 33d2c04dd6 pack-objects: create pack.useSparse setting @@ -19,6 +19,26 @@ Signed-off-by: Derrick Stolee <dstolee@microsoft.com> +diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt +--- a/Documentation/config/pack.txt ++++ b/Documentation/config/pack.txt +@@ + true. You should not generally need to turn this off unless + you are debugging pack bitmaps. + ++pack.useSparse:: ++ When true, git will default to using the '--sparse' option in ++ 'git pack-objects' when the '--revs' option is present. This ++ algorithm only walks trees that appear in paths that introduce new ++ objects. This can have significant performance benefits when ++ computing a pack to send a small change. However, it is possible ++ that extra objects are added to the pack-file if the included ++ commits contain certain types of direct renames. ++ + pack.writeBitmaps (deprecated):: + This is a deprecated synonym for `repack.writeBitmaps`. + + diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c 6: 3d394a9136 = 6: e4f29543ee pack-objects: create GIT_TEST_PACK_SPARSE -- gitgitgadget ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget ` (5 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 22 ++++++++++++++++++++++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, + struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call + * is not a no-op. The flag is added + * in mark_tree_unintersting(). + */ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 2/6] list-objects: consume sparse tree walk 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 55 +++++++++++++++++++++++++++++++++++++++--- list-objects.h | 4 ++- revision.c | 3 +++ 7 files changed, 61 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..4fbdeca0a4 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, + struct rev_info *revs, + show_edge_fn show_edge, + struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) + continue; + tree->object.flags |= UNINTERESTING; + + if (revs->edge_hint && !(parent->object.flags & SHOWN)) { + parent->object.flags |= SHOWN; + show_edge(parent); + } + } +} + +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse) { struct commit_list *list; + struct oidset set; int i; + if (sparse) + oidset_init(&set, 16); + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; - if (commit->object.flags & UNINTERESTING) { + if (sparse) { + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; + + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } else if (commit->object.flags & UNINTERESTING) { mark_tree_uninteresting(revs->repo, get_commit_tree(commit)); if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { commit->object.flags |= SHOWN; show_edge(commit); } - continue; + } else { + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } + + if (sparse) { + mark_trees_uninteresting_sparse(revs->repo, &set); + oidset_clear(&set); + } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { struct object *obj = revs->cmdline.rev[i].item; diff --git a/list-objects.h b/list-objects.h index ad40762926..a952680e46 100644 --- a/list-objects.h +++ b/list-objects.h @@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *); void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *); typedef void (*show_edge_fn)(struct commit *); -void mark_edges_uninteresting(struct rev_info *, show_edge_fn); +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse); struct oidset; struct list_objects_filter_options; diff --git a/revision.c b/revision.c index 3a62c7c187..f9eb6400f1 100644 --- a/revision.c +++ b/revision.c @@ -109,6 +109,9 @@ void mark_trees_uninteresting_sparse(struct repository *r, while ((oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); + if (!tree) + continue; + if (tree->object.flags & UNINTERESTING) { /* * Remove the flag so the next call -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 3/6] pack-objects: add --sparse option 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/git-pack-objects.txt | 11 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 +++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..e45f3e680d 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] [--stdout [--filter=<filter-spec>] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,15 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack, when combined with the "--revs" option. This algorithm + only walks trees that appear in paths that introduce new objects. + This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, + N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 0000000000..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +# Demonstrate that both algorithms send "extra" objects because +# they are not in the frontier. + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1 \ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt \ + topic1:f3 \ + topic1:f3/f3 \ + topic1:f3/f3/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_expect_success 'duplicate a folder from f1 into f3' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 4/6] revision: implement sparse algorithm 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2018-12-10 16:42 ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 116 ++++++++++++++++++++++++++++++--- t/t5322-pack-objects-sparse.sh | 21 ++++-- 2 files changed, 121 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..971f1bb095 100644 --- a/revision.c +++ b/revision.c @@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ + string_list_init(&po->list, 1); +} + +static void paths_and_oids_clear(struct paths_and_oids *po) +{ + int i; + for (i = 0; i < po->list.nr; i++) { + oidset_clear(po->list.items[i].util); + free(po->list.items[i].util); + } + + string_list_clear(&po->list, 0); +} + +static void paths_and_oids_insert(struct paths_and_oids *po, + const char *path, + const struct object_id *oid) +{ + struct string_list_item *item = string_list_insert(&po->list, path); + struct oidset *set; + + if (!item->util) { + set = xcalloc(1, sizeof(struct oidset)); + oidset_init(set, 16); + item->util = set; + } else { + set = item->util; + } + + oidset_insert(set, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, + struct paths_and_oids *po) +{ + struct tree_desc desc; + struct name_entry entry; + + if (!tree) + return; + + if (parse_tree_gently(tree, 1) < 0) + return; + + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: + paths_and_oids_insert(po, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + default: + /* Subproject commit - not in this repository */ + break; + } + } + + free_tree_buffer(tree); +} + void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + int i; + unsigned has_interesting = 0, has_uninteresting = 0; + struct paths_and_oids po; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + + paths_and_oids_init(&po); + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &po); + } + + for (i = 0; i < po.list.nr; i++) + mark_trees_uninteresting_sparse( + r, (struct oidset *)po.list.items[i].util); + + paths_and_oids_clear(&po); } struct commit_stack { diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 81f6805bc3..45dba6e014 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_objects.txt sparse_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale +# from one folder to another. + test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && git add f3/f4 && git commit -m "Copied f1/f1 to f3/f4" && - cat >packinput.txt <<-EOF && + cat >packinput.txt <<-EOF topic1 ^topic1~1 EOF - git rev-parse \ - topic1 \ - topic1^{tree} \ - topic1:f3 | sort >expect_objects.txt ' test_expect_success 'non-sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt && git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' ' ' test_expect_success 'sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 \ + topic1:f3/f4 \ + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && - test_cmp expect_objects.txt sparse_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 5/6] pack-objects: create pack.useSparse setting 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2018-12-10 16:42 ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/config/pack.txt | 9 +++++++++ builtin/pack-objects.c | 4 ++++ t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++ 3 files changed, 28 insertions(+) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index edac75c83f..425c73aa52 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -105,6 +105,15 @@ pack.useBitmaps:: true. You should not generally need to turn this off unless you are debugging pack bitmaps. +pack.useSparse:: + When true, git will default to using the '--sparse' option in + 'git pack-objects' when the '--revs' option is present. This + algorithm only walks trees that appear in paths that introduce new + objects. This can have significant performance benefits when + computing a pack to send a small change. However, it is possible + that extra objects are added to the pack-file if the included + commits contain certain types of direct renames. + pack.writeBitmaps (deprecated):: This is a deprecated synonym for `repack.writeBitmaps`. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2018-12-10 16:42 ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 ` Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-10 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 1 + t/README | 4 ++++ t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v4 0/6] Add a new "sparse" tree walk algorithm 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (5 preceding siblings ...) 2018-12-10 16:42 ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget ` (6 more replies) 6 siblings, 7 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Update in V3: * Change documentation around 'pack.useSparse' config setting to better advertise to users. * Mostly a ping now that v2.20.0 is out. Updates in V4: * Switched to using hashmap instead of string_list for the path/oidset dictionary. (This is due to some fear that the string_list performance would degrade for a very wide tree, but I am unable to measure a performance difference.) * Some cleanup of code snippets across commits. * Some grammar cleanup in the commit messages. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/config/pack.txt | 9 ++ Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 10 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 70 +++++++++++--- list-objects.h | 4 +- revision.c | 144 +++++++++++++++++++++++++++++ revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 ++++++++++++++++++++++++++++ 12 files changed, 382 insertions(+), 17 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v3: 1: 60617681f7 ! 1: 817e30a287 revision: add mark_tree_uninteresting_sparse @@ -35,6 +35,9 @@ + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + ++ if (!tree) ++ continue; ++ + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call 2: 4527addacb ! 2: 39dc89beb9 list-objects: consume sparse tree walk @@ -11,9 +11,10 @@ We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which - non-commit objects are reachable from uninteresting commits. + non-commit objects are reachable from uninteresting commits. This + commit walk is not changing during this series. - The mark_edges_unintersting() method in list-objects.c iterates on + The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every @@ -138,17 +139,22 @@ + int sparse) { struct commit_list *list; -+ struct oidset set; int i; -+ if (sparse) +- for (list = revs->commits; list; list = list->next) { +- struct commit *commit = list->item; ++ if (sparse) { ++ struct oidset set; + oidset_init(&set, 16); -+ - for (list = revs->commits; list; list = list->next) { - struct commit *commit = list->item; - if (commit->object.flags & UNINTERESTING) { -+ if (sparse) { +- mark_tree_uninteresting(revs->repo, +- get_commit_tree(commit)); +- if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { +- commit->object.flags |= SHOWN; +- show_edge(commit); ++ for (list = revs->commits; list; list = list->next) { ++ struct commit *commit = list->item; + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) @@ -156,24 +162,27 @@ + + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); -+ } else if (commit->object.flags & UNINTERESTING) { - mark_tree_uninteresting(revs->repo, - get_commit_tree(commit)); - if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { - commit->object.flags |= SHOWN; - show_edge(commit); ++ } ++ ++ mark_trees_uninteresting_sparse(revs->repo, &set); ++ oidset_clear(&set); ++ } else { ++ for (list = revs->commits; list; list = list->next) { ++ struct commit *commit = list->item; ++ if (commit->object.flags & UNINTERESTING) { ++ mark_tree_uninteresting(revs->repo, ++ get_commit_tree(commit)); ++ if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { ++ commit->object.flags |= SHOWN; ++ show_edge(commit); ++ } ++ continue; } - continue; -+ } else { + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } -+ -+ if (sparse) { -+ mark_trees_uninteresting_sparse(revs->repo, &set); -+ oidset_clear(&set); -+ } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { @@ -193,17 +202,3 @@ struct oidset; struct list_objects_filter_options; - -diff --git a/revision.c b/revision.c ---- a/revision.c -+++ b/revision.c -@@ - while ((oid = oidset_iter_next(&iter))) { - struct tree *tree = lookup_tree(r, oid); - -+ if (!tree) -+ continue; -+ - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call 3: 4ef318bdb2 = 3: ab733daff5 pack-objects: add --sparse option 4: 571b2e2784 ! 4: c44172c35e revision: implement sparse algorithm @@ -6,8 +6,9 @@ pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which - trees and blobs are uninteresting. Finally, we walk trees to find - the interesting trees. + trees and blobs are uninteresting. Finally, we walk trees from the + interesting commits to find the interesting objects that are + placed in the pack. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying @@ -31,11 +32,13 @@ parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our - recursion unless the oidset contains some intersting trees and - some uninteresting trees. Technically, we only need one interesting - tree for this to speed up in most cases, but we also will not mark - anything UNINTERESTING if there are no uninteresting trees, so - that would be wasted effort. + recursion. If the oidset contains only UNINTERESTING trees, then + we do not continue the recursion. This avoids walking trees that + are likely to not be reachable from interesting trees. If the + oidset contains only interesting trees, then we will walk these + trees in the final stage that collects the intersting objects to + place in the pack. Thus, we only recurse if the oidset contains + both interesting and UNINITERESTING trees. There are a few ways that this is not a universally better option. @@ -108,51 +111,80 @@ diff --git a/revision.c b/revision.c --- a/revision.c +++ b/revision.c +@@ + #include "commit-reach.h" + #include "commit-graph.h" + #include "prio-queue.h" ++#include "hashmap.h" + + volatile show_early_output_fn_t show_early_output; + @@ mark_tree_contents_uninteresting(r, tree); } -+struct paths_and_oids { -+ struct string_list list; ++struct path_and_oids_entry { ++ struct hashmap_entry ent; ++ char *path; ++ struct oidset set; +}; + -+static void paths_and_oids_init(struct paths_and_oids *po) ++static int path_and_oids_cmp(const void *hashmap_cmp_fn_data, ++ const struct path_and_oids_entry *e1, ++ const struct path_and_oids_entry *e2, ++ const void *keydata) ++{ ++ return strcmp(e1->path, e2->path); ++} ++ ++int map_flags = 0; ++static void paths_and_oids_init(struct hashmap *map) +{ -+ string_list_init(&po->list, 1); ++ hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0); +} + -+static void paths_and_oids_clear(struct paths_and_oids *po) ++static void paths_and_oids_clear(struct hashmap *map) +{ -+ int i; -+ for (i = 0; i < po->list.nr; i++) { -+ oidset_clear(po->list.items[i].util); -+ free(po->list.items[i].util); ++ struct hashmap_iter iter; ++ struct path_and_oids_entry *entry; ++ hashmap_iter_init(map, &iter); ++ ++ while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) { ++ oidset_clear(&entry->set); ++ free(entry->path); + } + -+ string_list_clear(&po->list, 0); ++ hashmap_free(map, 1); +} + -+static void paths_and_oids_insert(struct paths_and_oids *po, ++static void paths_and_oids_insert(struct hashmap *map, + const char *path, + const struct object_id *oid) +{ -+ struct string_list_item *item = string_list_insert(&po->list, path); -+ struct oidset *set; ++ int hash = strhash(path); ++ struct path_and_oids_entry key; ++ struct path_and_oids_entry *entry; ++ ++ hashmap_entry_init(&key, hash); ++ key.path = xstrdup(path); ++ oidset_init(&key.set, 0); + -+ if (!item->util) { -+ set = xcalloc(1, sizeof(struct oidset)); -+ oidset_init(set, 16); -+ item->util = set; ++ if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) { ++ entry = xcalloc(1, sizeof(struct path_and_oids_entry)); ++ hashmap_entry_init(entry, hash); ++ entry->path = key.path; ++ oidset_init(&entry->set, 16); ++ hashmap_put(map, entry); + } else { -+ set = item->util; ++ free(key.path); + } + -+ oidset_insert(set, oid); ++ oidset_insert(&entry->set, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, -+ struct paths_and_oids *po) ++ struct hashmap *map) +{ + struct tree_desc desc; + struct name_entry entry; @@ -167,7 +199,7 @@ + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: -+ paths_and_oids_insert(po, entry.path, entry.oid); ++ paths_and_oids_insert(map, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); @@ -194,9 +226,10 @@ void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { -+ int i; + unsigned has_interesting = 0, has_uninteresting = 0; -+ struct paths_and_oids po; ++ struct hashmap map; ++ struct hashmap_iter map_iter; ++ struct path_and_oids_entry *entry; struct object_id *oid; struct oidset_iter iter; @@ -222,25 +255,25 @@ + has_uninteresting = 1; + else + has_interesting = 1; - } ++ } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + -+ paths_and_oids_init(&po); ++ paths_and_oids_init(&map); + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); -+ add_children_by_path(r, tree, &po); -+ } ++ add_children_by_path(r, tree, &map); + } + -+ for (i = 0; i < po.list.nr; i++) -+ mark_trees_uninteresting_sparse( -+ r, (struct oidset *)po.list.items[i].util); ++ hashmap_iter_init(&map, &map_iter); ++ while ((entry = hashmap_iter_next(&map_iter))) ++ mark_trees_uninteresting_sparse(r, &entry->set); + -+ paths_and_oids_clear(&po); ++ paths_and_oids_clear(&map); } struct commit_stack { 5: 33d2c04dd6 = 5: f386f6c3c9 pack-objects: create pack.useSparse setting 6: e4f29543ee = 6: d011a9c1b1 pack-objects: create GIT_TEST_PACK_SPARSE -- gitgitgadget ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2019-01-11 19:43 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget ` (5 subsequent siblings) 6 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 25 +++++++++++++++++++++++++ revision.h | 2 ++ 2 files changed, 27 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..f9eb6400f1 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, + struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (!tree) + continue; + + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call + * is not a no-op. The flag is added + * in mark_tree_unintersting(). + */ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse 2018-12-14 21:22 ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2019-01-11 19:43 ` Junio C Hamano 2019-01-11 20:25 ` Junio C Hamano 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2019-01-11 19:43 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > In preparation for a new algorithm that walks fewer trees when > creating a pack from a set of revisions, create a method that > takes an oidset of tree oids and marks reachable objects as > UNINTERESTING. > > The current implementation uses the existing > mark_tree_uninteresting to recursively walk the trees and blobs. > This will walk the same number of trees as the old mechanism. > > There is one new assumption in this approach: we are also given > the oids of the interesting trees. This implementation does not > use those trees at the moment, but we will use them in a later > rewrite of this method. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > revision.c | 25 +++++++++++++++++++++++++ > revision.h | 2 ++ > 2 files changed, 27 insertions(+) > > diff --git a/revision.c b/revision.c > index 13e0519c02..f9eb6400f1 100644 > --- a/revision.c > +++ b/revision.c > @@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) > mark_tree_contents_uninteresting(r, tree); > } > > +void mark_trees_uninteresting_sparse(struct repository *r, > + struct oidset *set) > +{ > + struct object_id *oid; > + struct oidset_iter iter; > + > + oidset_iter_init(set, &iter); > + while ((oid = oidset_iter_next(&iter))) { > + struct tree *tree = lookup_tree(r, oid); > + > + if (!tree) > + continue; > + > + if (tree->object.flags & UNINTERESTING) { > + /* > + * Remove the flag so the next call > + * is not a no-op. The flag is added > + * in mark_tree_unintersting(). > + */ > + tree->object.flags ^= UNINTERESTING; > + mark_tree_uninteresting(r, tree); > + } The proposed log message claims that the method takes an oidset and marks reachable objects as uninteresting, but the implementation only marks those that are reachable from already uninteresting trees. Either one of them must be wrong. Did you mean to have this instead? if (!tree) continue; /* * Force traversal of the tree, even if it has been * already marked as UNINTERESTING. */ tree->object.flags &= ~UNINTERESTING; mark_tree_uninteresting(r, tree); By the way, one of the bigger reasons why I have to ask, instead of making an educated guess, is because "struct oidset *set" parameter does not give any useful information with the variable name to the readers. We know it is a set because its type is oidset; readers need to know what meaning the 'set' has, what it is used for, why the caller wants to (or decides not to) place a tree object in the set when it calls it. None of that can be read from its name. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse 2019-01-11 19:43 ` Junio C Hamano @ 2019-01-11 20:25 ` Junio C Hamano 2019-01-11 22:05 ` Derrick Stolee 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2019-01-11 20:25 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee Junio C Hamano <gitster@pobox.com> writes: >> In preparation for a new algorithm that walks fewer trees when >> creating a pack from a set of revisions, create a method that >> takes an oidset of tree oids and marks reachable objects as >> UNINTERESTING. >> ... >> There is one new assumption in this approach: we are also given >> the oids of the interesting trees. This implementation does not >> use those trees at the moment, but we will use them in a later >> rewrite of this method. Ahh.... > The proposed log message claims that the method takes an oidset and > marks reachable objects as uninteresting, but the implementation > only marks those that are reachable from already uninteresting > trees. Either one of them must be wrong. > > Did you mean to have this instead? > > if (!tree) > continue; > /* > * Force traversal of the tree, even if it has been > * already marked as UNINTERESTING. > */ > tree->object.flags &= ~UNINTERESTING; > mark_tree_uninteresting(r, tree); So, I assumed that the implementation was wrong, but it is the other way around. You do mean to pick only already uninteresting trees out of "set" and mark its reachables. One thing that would make me worried is what help the callers of this function will get (or they may have to devise the way themselves) to avoid having to traverse the same tree number of times. A tree can be made uninteresting after a traversal of another tree that contains it, but the logic in this function > + if (tree->object.flags & UNINTERESTING) { > + /* > + * Remove the flag so the next call > + * is not a no-op. The flag is added > + * in mark_tree_unintersting(). > + */ > + tree->object.flags ^= UNINTERESTING; > + mark_tree_uninteresting(r, tree); > + } ignores the fact that it is already UNINTERESTING (in fact, in a sense it is even worse---it cannot be used to make a not-yet UNINTERESTING one into UNINTERESTING), drops the UNINTERESING bit and forces the traversal of that tree. The only thing I see that would work as a saving grace is that mark_tree_uninteresting() itself would honor existing UNINTERESING bit and refuses to recurse into its subtrees, but that means blobs at the top-level of such a tree would be marked UNINTERESING while those in its subtrees can be left not-UNINTERESING, which sounds like inviting a mess. It does *not* immediately mean this function is misdesigned. It just means that the caller needs to carefully follow whatever calling convention this series will establish in the later patches (which we haven't seen yet at this point). > By the way, one of the bigger reasons why I have to ask, instead of > making an educated guess, is because "struct oidset *set" parameter > does not give any useful information with the variable name to the > readers. We know it is a set because its type is oidset; readers > need to know what meaning the 'set' has, what it is used for, why > the caller wants to (or decides not to) place a tree object in the > set when it calls it. None of that can be read from its name. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse 2019-01-11 20:25 ` Junio C Hamano @ 2019-01-11 22:05 ` Derrick Stolee 0 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee @ 2019-01-11 22:05 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee On 1/11/2019 3:25 PM, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > So, I assumed that the implementation was wrong, but it is the other > way around. You do mean to pick only already uninteresting trees > out of "set" and mark its reachables. > > One thing that would make me worried is what help the callers of > this function will get (or they may have to devise the way > themselves) to avoid having to traverse the same tree number of > times. A tree can be made uninteresting after a traversal of > another tree that contains it, but the logic in this function > >> + if (tree->object.flags & UNINTERESTING) { >> + /* >> + * Remove the flag so the next call >> + * is not a no-op. The flag is added >> + * in mark_tree_unintersting(). >> + */ >> + tree->object.flags ^= UNINTERESTING; >> + mark_tree_uninteresting(r, tree); >> + } > ignores the fact that it is already UNINTERESTING (in fact, in a > sense it is even worse---it cannot be used to make a not-yet > UNINTERESTING one into UNINTERESTING), drops the UNINTERESING bit > and forces the traversal of that tree. The only thing I see that > would work as a saving grace is that mark_tree_uninteresting() > itself would honor existing UNINTERESING bit and refuses to recurse > into its subtrees, but that means blobs at the top-level of such a > tree would be marked UNINTERESING while those in its subtrees can be > left not-UNINTERESING, which sounds like inviting a mess. > > It does *not* immediately mean this function is misdesigned. It > just means that the caller needs to carefully follow whatever > calling convention this series will establish in the later patches > (which we haven't seen yet at this point). > I'm sorry that this implementation is particularly confusing. It is created only as a placeholder so we can wire up the --sparse option to use this method without being "wrong" but is then removed entirely in PATCH 4/6: ... void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + unsigned has_interesting = 0, has_uninteresting = 0; + struct hashmap map; + struct hashmap_iter map_iter; + struct path_and_oids_entry *entry; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; + } ... You are definitely correct that "set" is not a valuable variable name. It could instead be "tree_oids" to be slightly more informative. Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4 2/6] list-objects: consume sparse tree walk 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2019-01-11 23:20 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 6 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 70 +++++++++++++++++++++++++++++++++++------- list-objects.h | 4 ++- 6 files changed, 66 insertions(+), 16 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..fb728f7842 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, + struct rev_info *revs, + show_edge_fn show_edge, + struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) + continue; + tree->object.flags |= UNINTERESTING; + + if (revs->edge_hint && !(parent->object.flags & SHOWN)) { + parent->object.flags |= SHOWN; + show_edge(parent); + } + } +} + +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse) { struct commit_list *list; int i; - for (list = revs->commits; list; list = list->next) { - struct commit *commit = list->item; + if (sparse) { + struct oidset set; + oidset_init(&set, 16); - if (commit->object.flags & UNINTERESTING) { - mark_tree_uninteresting(revs->repo, - get_commit_tree(commit)); - if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { - commit->object.flags |= SHOWN; - show_edge(commit); + for (list = revs->commits; list; list = list->next) { + struct commit *commit = list->item; + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; + + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } + + mark_trees_uninteresting_sparse(revs->repo, &set); + oidset_clear(&set); + } else { + for (list = revs->commits; list; list = list->next) { + struct commit *commit = list->item; + if (commit->object.flags & UNINTERESTING) { + mark_tree_uninteresting(revs->repo, + get_commit_tree(commit)); + if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { + commit->object.flags |= SHOWN; + show_edge(commit); + } + continue; } - continue; + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { struct object *obj = revs->cmdline.rev[i].item; diff --git a/list-objects.h b/list-objects.h index ad40762926..a952680e46 100644 --- a/list-objects.h +++ b/list-objects.h @@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *); void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *); typedef void (*show_edge_fn)(struct commit *); -void mark_edges_uninteresting(struct rev_info *, show_edge_fn); +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse); struct oidset; struct list_objects_filter_options; -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v4 2/6] list-objects: consume sparse tree walk 2018-12-14 21:22 ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2019-01-11 23:20 ` Junio C Hamano 0 siblings, 0 replies; 51+ messages in thread From: Junio C Hamano @ 2019-01-11 23:20 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > When creating a pack-file using 'git pack-objects --revs' we provide > a list of interesting and uninteresting commits. For example, a push > operation would make the local topic branch be interesting and the > known remote refs as uninteresting. We want to discover the set of > new objects to send to the server as a thin pack. > > We walk these commits until we discover a frontier of commits such > that every commit walk starting at interesting commits ends in a root > commit or unintersting commit. We then need to discover which > non-commit objects are reachable from uninteresting commits. This > commit walk is not changing during this series. > > The mark_edges_uninteresting() method in list-objects.c iterates on > the commit list and does the following: > > * If the commit is UNINTERSTING, then mark its root tree and every > object it can reach as UNINTERESTING. > > * If the commit is interesting, then mark the root tree of every > UNINTERSTING parent (and all objects that tree can reach) as > UNINTERSTING. > > At the very end, we repeat the process on every commit directly > given to the revision walk from stdin. This helps ensure we properly > cover shallow commits that otherwise were not included in the > frontier. > > The logic to recursively follow trees is in the > mark_tree_uninteresting() method in revision.c. The algorithm avoids > duplicate work by not recursing into trees that are already marked > UNINTERSTING. > > Add a new 'sparse' option to the mark_edges_uninteresting() method > that performs this logic in a slightly new way. As we iterate over > the commits, we add all of the root trees to an oidset. Then, call > mark_trees_uninteresting_sparse() on that oidset. Note that we > include interesting trees in this process. The current implementation > of mark_trees_unintersting_sparse() will walk the same trees as > the old logic, but this will be replaced in a later change. It is unclear what "a slightly new way" refers to. The updated code adds the UNINTERSTING edge commits and UNINTERSTING parents of interesting edge commits (the latter is done using the new add-edge-parents() helper function) to an oidset and calls the new helper introduced in [1/6] to ensure all the objects reachable from these UNINTERSTING trees become UNINTERSTING. Which seems to be doing exactly the same thing as the original. Puzzled. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4 3/6] pack-objects: add --sparse option 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2019-01-11 22:30 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 6 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/git-pack-objects.txt | 11 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 +++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..e45f3e680d 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] [--stdout [--filter=<filter-spec>] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,15 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack, when combined with the "--revs" option. This algorithm + only walks trees that appear in paths that introduce new objects. + This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge, 0); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, + N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 0000000000..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +# Demonstrate that both algorithms send "extra" objects because +# they are not in the frontier. + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1 \ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt \ + topic1:f3 \ + topic1:f3/f3 \ + topic1:f3/f3/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_expect_success 'duplicate a folder from f1 into f3' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v4 3/6] pack-objects: add --sparse option 2018-12-14 21:22 ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2019-01-11 22:30 ` Junio C Hamano 2019-01-15 15:06 ` Derrick Stolee 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2019-01-11 22:30 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > Add a '--sparse' option flag to the pack-objects builtin. This > allows the user to specify that they want to use the new logic > for walking trees. This logic currently does not differ from the > existing output, but will in a later change. > > Create a new test script, t5322-pack-objects-sparse.sh, to ensure > the object list that is selected matches what we expect. When we > update the logic to walk in a sparse fashion, the final test will > be updated to show the extra objects that are added. Somehow checking the "these are exactly what we expect" feels brittle. In a history with three relevant commits A---B---C, packing B..C could omit trees and blobs in C that appear in A but not in B, but traditionally, because we stop traversal at B and do not even look at A, we do not notice that such objects that need to complete C's tree are already available in a repository that has B. The approach of test in this patch feels similar to saying "we must send these duplicates because we must stay dumb", even though with a perfect knowledge about the history, e.g. bitmap, we would be able to omit these objects in A that appear in C but not in B. I think we want to test test both "are we sending enough, even though we might be wasting some bandwidth by not noticing the other side already have some?" and "are we still avoiding from becoming overly stupid, even though we may be cheating to save traversal cost and risking to redundantly send some objects?" IOW, a set of tests to make sure that truly new objects are all sent by seeing what is sent is a strict superset of what we expect. And another set of tests to make sure that objects that are so obviously (this criterion may be highly subjective) be present in the receiving repository (e.g. the tree object of commit B and what it contains when seinding B..C) are never sent, even when using an algorithm that are tuned for traversal cost over bandwidth consumption. The code change in this step looks all trivially good, and it may want to be squashed into the previous step to become a single patch. Otherwise, [2/6] would be adding a dead code that nobody exercises until this step. Thanks. > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > Documentation/git-pack-objects.txt | 11 ++- > builtin/pack-objects.c | 5 +- > t/t5322-pack-objects-sparse.sh | 115 +++++++++++++++++++++++++++++ > 3 files changed, 129 insertions(+), 2 deletions(-) > create mode 100755 t/t5322-pack-objects-sparse.sh > > diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt > index 40c825c381..e45f3e680d 100644 > --- a/Documentation/git-pack-objects.txt > +++ b/Documentation/git-pack-objects.txt > @@ -14,7 +14,7 @@ SYNOPSIS > [--local] [--incremental] [--window=<n>] [--depth=<n>] > [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] > [--stdout [--filter=<filter-spec>] | base-name] > - [--shallow] [--keep-true-parents] < object-list > + [--shallow] [--keep-true-parents] [--sparse] < object-list > > > DESCRIPTION > @@ -196,6 +196,15 @@ depth is 4095. > Add --no-reuse-object if you want to force a uniform compression > level on all data no matter the source. > > +--sparse:: > + Use the "sparse" algorithm to determine which objects to include in > + the pack, when combined with the "--revs" option. This algorithm > + only walks trees that appear in paths that introduce new objects. > + This can have significant performance benefits when computing > + a pack to send a small change. However, it is possible that extra > + objects are added to the pack-file if the included commits contain > + certain types of direct renames. > + > --thin:: > Create a "thin" pack by omitting the common objects between a > sender and a receiver in order to reduce network transfer. This > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 5f70d840a7..7d5b0735e3 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; > static int depth = 50; > static int delta_search_threads; > static int pack_to_stdout; > +static int sparse; > static int thin; > static int num_preferred_base; > static struct progress *progress_state; > @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) > > if (prepare_revision_walk(&revs)) > die(_("revision walk setup failed")); > - mark_edges_uninteresting(&revs, show_edge, 0); > + mark_edges_uninteresting(&revs, show_edge, sparse); > > if (!fn_show_object) > fn_show_object = show_object; > @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), > N_("unpack unreachable objects newer than <time>"), > PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, > + OPT_BOOL(0, "sparse", &sparse, > + N_("use the sparse reachability algorithm")), > OPT_BOOL(0, "thin", &thin, > N_("create thin packs")), > OPT_BOOL(0, "shallow", &shallow, > diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh > new file mode 100755 > index 0000000000..81f6805bc3 > --- /dev/null > +++ b/t/t5322-pack-objects-sparse.sh > @@ -0,0 +1,115 @@ > +#!/bin/sh > + > +test_description='pack-objects object selection using sparse algorithm' > +. ./test-lib.sh > + > +test_expect_success 'setup repo' ' > + test_commit initial && > + for i in $(test_seq 1 3) > + do > + mkdir f$i && > + for j in $(test_seq 1 3) > + do > + mkdir f$i/f$j && > + echo $j >f$i/f$j/data.txt > + done > + done && > + git add . && > + git commit -m "Initialized trees" && > + for i in $(test_seq 1 3) > + do > + git checkout -b topic$i master && > + echo change-$i >f$i/f$i/data.txt && > + git commit -a -m "Changed f$i/f$i/data.txt" > + done && > + cat >packinput.txt <<-EOF && > + topic1 > + ^topic2 > + ^topic3 > + EOF > + git rev-parse \ > + topic1 \ > + topic1^{tree} \ > + topic1:f1 \ > + topic1:f1/f1 \ > + topic1:f1/f1/data.txt | sort >expect_objects.txt > +' > + > +test_expect_success 'non-sparse pack-objects' ' > + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && > + git index-pack -o nonsparse.idx nonsparse.pack && > + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && > + test_cmp expect_objects.txt nonsparse_objects.txt > +' > + > +test_expect_success 'sparse pack-objects' ' > + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && > + git index-pack -o sparse.idx sparse.pack && > + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && > + test_cmp expect_objects.txt sparse_objects.txt > +' > + > +# Demonstrate that both algorithms send "extra" objects because > +# they are not in the frontier. > + > +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' > + git checkout topic1 && > + echo change-3 >f3/f3/data.txt && > + git commit -a -m "Changed f3/f3/data.txt" && > + git rev-parse \ > + topic1~1 \ > + topic1~1^{tree} \ > + topic1^{tree} \ > + topic1 \ > + topic1:f1 \ > + topic1:f1/f1 \ > + topic1:f1/f1/data.txt \ > + topic1:f3 \ > + topic1:f3/f3 \ > + topic1:f3/f3/data.txt | sort >expect_objects.txt > +' > + > +test_expect_success 'non-sparse pack-objects' ' > + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && > + git index-pack -o nonsparse.idx nonsparse.pack && > + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && > + test_cmp expect_objects.txt nonsparse_objects.txt > +' > + > +test_expect_success 'sparse pack-objects' ' > + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && > + git index-pack -o sparse.idx sparse.pack && > + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && > + test_cmp expect_objects.txt sparse_objects.txt > +' > + > +test_expect_success 'duplicate a folder from f1 into f3' ' > + mkdir f3/f4 && > + cp -r f1/f1/* f3/f4 && > + git add f3/f4 && > + git commit -m "Copied f1/f1 to f3/f4" && > + cat >packinput.txt <<-EOF && > + topic1 > + ^topic1~1 > + EOF > + git rev-parse \ > + topic1 \ > + topic1^{tree} \ > + topic1:f3 | sort >expect_objects.txt > +' > + > +test_expect_success 'non-sparse pack-objects' ' > + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && > + git index-pack -o nonsparse.idx nonsparse.pack && > + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && > + test_cmp expect_objects.txt nonsparse_objects.txt > +' > + > +test_expect_success 'sparse pack-objects' ' > + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && > + git index-pack -o sparse.idx sparse.pack && > + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && > + test_cmp expect_objects.txt sparse_objects.txt > +' > + > +test_done ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 3/6] pack-objects: add --sparse option 2019-01-11 22:30 ` Junio C Hamano @ 2019-01-15 15:06 ` Derrick Stolee 2019-01-15 18:23 ` Junio C Hamano 0 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee @ 2019-01-15 15:06 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee On 1/11/2019 5:30 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> From: Derrick Stolee <dstolee@microsoft.com> >> >> Add a '--sparse' option flag to the pack-objects builtin. This >> allows the user to specify that they want to use the new logic >> for walking trees. This logic currently does not differ from the >> existing output, but will in a later change. >> >> Create a new test script, t5322-pack-objects-sparse.sh, to ensure >> the object list that is selected matches what we expect. When we >> update the logic to walk in a sparse fashion, the final test will >> be updated to show the extra objects that are added. > Somehow checking the "these are exactly what we expect" feels > brittle. In a history with three relevant commits A---B---C, > packing B..C could omit trees and blobs in C that appear in A but > not in B, but traditionally, because we stop traversal at B and do > not even look at A, we do not notice that such objects that need to > complete C's tree are already available in a repository that has B. > The approach of test in this patch feels similar to saying "we must > send these duplicates because we must stay dumb", even though with a > perfect knowledge about the history, e.g. bitmap, we would be able > to omit these objects in A that appear in C but not in B. My intention with this test was to show that the existing algorithm also sends "extra" objects in certain situations, which is later contrasted by a test where the new algorithm sends objects the old algorithm did not. Instead of "we must stay dumb" I wanted to say "we are not dumber (in this case)". I understand your brittle feeling. If the logic changed in either to be smarter, then we would need to change the test. In some sense, that does mean we have extra coverage of "this is how it works now" so we would understand the behavior change of that hypothetical future change. > I think we want to test test both "are we sending enough, even > though we might be wasting some bandwidth by not noticing the other > side already have some?" and "are we still avoiding from becoming > overly stupid, even though we may be cheating to save traversal cost > and risking to redundantly send some objects?" > > IOW, a set of tests to make sure that truly new objects are all sent > by seeing what is sent is a strict superset of what we expect. And > another set of tests to make sure that objects that are so obviously > (this criterion may be highly subjective) be present in the > receiving repository (e.g. the tree object of commit B and what it > contains when seinding B..C) are never sent, even when using an > algorithm that are tuned for traversal cost over bandwidth > consumption. To properly test "these objects are included," do we have an established pattern for saying "this file is a subset of that file"? Or, should I use something like `comm expected actual >common && test_cmp expected common`? > The code change in this step looks all trivially good, and it may > want to be squashed into the previous step to become a single patch. > Otherwise, [2/6] would be adding a dead code that nobody exercises > until this step. Will squash. Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 3/6] pack-objects: add --sparse option 2019-01-15 15:06 ` Derrick Stolee @ 2019-01-15 18:23 ` Junio C Hamano 0 siblings, 0 replies; 51+ messages in thread From: Junio C Hamano @ 2019-01-15 18:23 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee via GitGitGadget, git, peff, avarab, jrnieder, Derrick Stolee Derrick Stolee <stolee@gmail.com> writes: > Instead of "we must stay dumb" I wanted to say "we are not dumber (in > this case)". Thanks, then what we aim for are compatible ;-) I do agree that we should make sure we will not ever become overly stupid. > To properly test "these objects are included," do we have an > established pattern for saying "this file is a subset of that file"? > Or, should I use something like `comm expected actual >common && > test_cmp expected common`? Yeah, "comm -23 must-exist actual" would be a natural thing to use when we have two sorted files and want to see what is missing from the actual output among those that must exist. t6500 and t9350 already seem to use the tool, so it should be OK from portability's point of view to non-UNIX systems. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2018-12-14 21:22 ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2018-12-14 23:32 ` Ævar Arnfjörð Bjarmason 2019-01-11 23:20 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 6 siblings, 2 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely to not be reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the intersting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINITERESTING trees. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 139 ++++++++++++++++++++++++++++++--- t/t5322-pack-objects-sparse.sh | 21 +++-- 2 files changed, 144 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..63bf6230dc 100644 --- a/revision.c +++ b/revision.c @@ -27,6 +27,7 @@ #include "commit-reach.h" #include "commit-graph.h" #include "prio-queue.h" +#include "hashmap.h" volatile show_early_output_fn_t show_early_output; @@ -99,29 +100,147 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct path_and_oids_entry { + struct hashmap_entry ent; + char *path; + struct oidset set; +}; + +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data, + const struct path_and_oids_entry *e1, + const struct path_and_oids_entry *e2, + const void *keydata) +{ + return strcmp(e1->path, e2->path); +} + +int map_flags = 0; +static void paths_and_oids_init(struct hashmap *map) +{ + hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0); +} + +static void paths_and_oids_clear(struct hashmap *map) +{ + struct hashmap_iter iter; + struct path_and_oids_entry *entry; + hashmap_iter_init(map, &iter); + + while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) { + oidset_clear(&entry->set); + free(entry->path); + } + + hashmap_free(map, 1); +} + +static void paths_and_oids_insert(struct hashmap *map, + const char *path, + const struct object_id *oid) +{ + int hash = strhash(path); + struct path_and_oids_entry key; + struct path_and_oids_entry *entry; + + hashmap_entry_init(&key, hash); + key.path = xstrdup(path); + oidset_init(&key.set, 0); + + if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) { + entry = xcalloc(1, sizeof(struct path_and_oids_entry)); + hashmap_entry_init(entry, hash); + entry->path = key.path; + oidset_init(&entry->set, 16); + hashmap_put(map, entry); + } else { + free(key.path); + } + + oidset_insert(&entry->set, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, + struct hashmap *map) +{ + struct tree_desc desc; + struct name_entry entry; + + if (!tree) + return; + + if (parse_tree_gently(tree, 1) < 0) + return; + + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: + paths_and_oids_insert(map, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + default: + /* Subproject commit - not in this repository */ + break; + } + } + + free_tree_buffer(tree); +} + void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set) { + unsigned has_interesting = 0, has_uninteresting = 0; + struct hashmap map; + struct hashmap_iter map_iter; + struct path_and_oids_entry *entry; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(set, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; + } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + + paths_and_oids_init(&map); + + oidset_iter_init(set, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &map); } + + hashmap_iter_init(&map, &map_iter); + while ((entry = hashmap_iter_next(&map_iter))) + mark_trees_uninteresting_sparse(r, &entry->set); + + paths_and_oids_clear(&map); } struct commit_stack { diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 81f6805bc3..45dba6e014 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_objects.txt sparse_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale +# from one folder to another. + test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && git add f3/f4 && git commit -m "Copied f1/f1 to f3/f4" && - cat >packinput.txt <<-EOF && + cat >packinput.txt <<-EOF topic1 ^topic1~1 EOF - git rev-parse \ - topic1 \ - topic1^{tree} \ - topic1:f3 | sort >expect_objects.txt ' test_expect_success 'non-sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >expect_objects.txt && git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' ' ' test_expect_success 'sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 \ + topic1:f3/f4 \ + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && - test_cmp expect_objects.txt sparse_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-14 21:22 ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2018-12-14 23:32 ` Ævar Arnfjörð Bjarmason 2018-12-17 14:20 ` Derrick Stolee 2019-01-11 23:20 ` Junio C Hamano 1 sibling, 1 reply; 51+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-12-14 23:32 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, jrnieder, Junio C Hamano, Derrick Stolee On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote: > Despite these potential drawbacks, the benefits of the algorithm > are clear. By adding a counter to 'add_children_by_path' and > 'mark_tree_contents_uninteresting', I measured the number of > parsed trees for the two algorithms in a variety of repos. Ah, so we take one or the other. I tested this with: diff --git a/revision.c b/revision.c index 63bf6230dc..e9c67aa550 100644 --- a/revision.c +++ b/revision.c @@ -61,6 +61,8 @@ static void mark_tree_contents_uninteresting(struct repository *r, struct tree_desc desc; struct name_entry entry; + fprintf(stderr, "MTCU\n"); + if (parse_tree_gently(tree, 1) < 0) return; @@ -166,6 +168,8 @@ static void add_children_by_path(struct repository *r, struct tree_desc desc; struct name_entry entry; + fprintf(stderr, "ACBP\n"); + if (!tree) return; And tried testing my pushing this branch to my git.git: $ for v in true false; do git push --delete avar push/sparse; ./git -c pack.usesparse=$v --exec-path=$PWD push avar HEAD 2>&1 | grep -e MTCU -e ACBP | uniq -c; done To github.com:avar/git.git - [deleted] push/sparse 22 ACBP To github.com:avar/git.git - [deleted] push/sparse 184 MTCU But snipping around a bit from the commit message: >When enumerating objects to place in a pack-file during 'git >pack-objects --revs', we discover the "frontier" of commits >that we care about and the boundary with commit we find >uninteresting. From that point, we walk trees to discover which >trees and blobs are uninteresting. Finally, we walk trees from the >interesting commits to find the interesting objects that are >placed in the pack. >[...] >These improvements will have even larger benefits in the super- >large Windows repository. In our experiments, we see the >"Enumerate objects" phase of pack-objects taking 60-80% of the >end-to-end time of non-trivial pushes, taking longer than the >network time to send the pack and the server time to verify the >pack. If I change my monkeypatch to: diff --git a/revision.c b/revision.c index 63bf6230dc..9171ca27c5 100644 --- a/revision.c +++ b/revision.c @@ -63,0 +64,3 @@ static void mark_tree_contents_uninteresting(struct repository *r, + fprintf(stderr, "MTCU\n"); + sleep(1); + @@ -168,0 +172,3 @@ static void add_children_by_path(struct repository *r, + fprintf(stderr, "ACBP\n"); + sleep(1); + We spend a long time printing those out before we ever get to "Enumerating objects". Which was where I was trying to test this, i.e. is this a lot of work we perform before we print out the progress bar, and regardless of this optimization should have other progress output there, so we can see this time we're spending on this? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-14 23:32 ` Ævar Arnfjörð Bjarmason @ 2018-12-17 14:20 ` Derrick Stolee 2018-12-17 14:26 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 51+ messages in thread From: Derrick Stolee @ 2018-12-17 14:20 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, peff, jrnieder, Junio C Hamano, Derrick Stolee On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote: > On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote: > >> Despite these potential drawbacks, the benefits of the algorithm >> are clear. By adding a counter to 'add_children_by_path' and >> 'mark_tree_contents_uninteresting', I measured the number of >> parsed trees for the two algorithms in a variety of repos. > We spend a long time printing those out before we ever get to > "Enumerating objects". > > Which was where I was trying to test this, i.e. is this a lot of work we > perform before we print out the progress bar, and regardless of this > optimization should have other progress output there, so we can see this > time we're spending on this? It is true that part of the problem is that a 'git push' will sit for a while without presenting any feedback until this part of the algorithm is complete. The current series intends to significantly reduce this time. As for adding progress to this step, I'm open to it. It can be done as a sequel series. What would we use to describe this section? "Enumerating remote objects"? Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-17 14:20 ` Derrick Stolee @ 2018-12-17 14:26 ` Ævar Arnfjörð Bjarmason 2018-12-17 14:50 ` Derrick Stolee 0 siblings, 1 reply; 51+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-12-17 14:26 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder, Junio C Hamano, Derrick Stolee On Mon, Dec 17 2018, Derrick Stolee wrote: > On 12/14/2018 6:32 PM, Ævar Arnfjörð Bjarmason wrote: >> On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote: >> >>> Despite these potential drawbacks, the benefits of the algorithm >>> are clear. By adding a counter to 'add_children_by_path' and >>> 'mark_tree_contents_uninteresting', I measured the number of >>> parsed trees for the two algorithms in a variety of repos. >> We spend a long time printing those out before we ever get to >> "Enumerating objects". >> >> Which was where I was trying to test this, i.e. is this a lot of work we >> perform before we print out the progress bar, and regardless of this >> optimization should have other progress output there, so we can see this >> time we're spending on this? > > It is true that part of the problem is that a 'git push' will sit for > a while without presenting any feedback until this part of the > algorithm is complete. The current series intends to significantly > reduce this time. > As for adding progress to this step, I'm open to it. It can be done as > a sequel series. Okey. To clarify I wasn't complaining about the lack of progress output, we didn't have it before, just clarifying (as I've found out now) that when you're talking about "enumerating objects" in your commit message it's *not* what we're doing when we show the "Enumerating objects" progress bar, but an unrelated step prior to that. I thought I might have been holding this wrong. > What would we use to describe this section? "Enumerating remote > objects"? Isn't this "Discovering objects to push to remote" i.e. "Selecting objects", but then what's "Enumerating objects" really doing? "Looping over objects we selected before and creating a pack"? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-17 14:26 ` Ævar Arnfjörð Bjarmason @ 2018-12-17 14:50 ` Derrick Stolee 0 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee @ 2018-12-17 14:50 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder, Junio C Hamano, Derrick Stolee On 12/17/2018 9:26 AM, Ævar Arnfjörð Bjarmason wrote: > On Mon, Dec 17 2018, Derrick Stolee wrote: > > >> As for adding progress to this step, I'm open to it. It can be done as >> a sequel series. > Okey. To clarify I wasn't complaining about the lack of progress output, > we didn't have it before, just clarifying (as I've found out now) that > when you're talking about "enumerating objects" in your commit message > it's *not* what we're doing when we show the "Enumerating objects" > progress bar, but an unrelated step prior to that. Part of the problem is that in builtin/pack-objects.c, we have the following: if (progress) progress_state = start_progress(_("Enumerating objects"), 0); if (!use_internal_rev_list) read_object_list_from_stdin(); else { get_object_list(rp.argc, rp.argv); argv_array_clear(&rp); } cleanup_preferred_base(); if (include_tag && nr_result) for_each_ref(add_ref_tag, NULL); stop_progress(&progress_state); and the logic for walking uninteresting objects is the mark_edges_uninteresting() inside get_object_list() (both entirely contained in this progress state). Perhaps the best thing to do is to untangle the progress for the two modes based on 'use_internal_rev_list'. Could we then have the progress for get_object_list() be "Walking objects" instead? Thanks, -Stolee ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v4 4/6] revision: implement sparse algorithm 2018-12-14 21:22 ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2018-12-14 23:32 ` Ævar Arnfjörð Bjarmason @ 2019-01-11 23:20 ` Junio C Hamano 1 sibling, 0 replies; 51+ messages in thread From: Junio C Hamano @ 2019-01-11 23:20 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, avarab, jrnieder, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > When enumerating objects to place in a pack-file during 'git > pack-objects --revs', we discover the "frontier" of commits > that we care about and the boundary with commit we find > uninteresting. From that point, we walk trees to discover which > trees and blobs are uninteresting. Finally, we walk trees from the > interesting commits to find the interesting objects that are > placed in the pack. > > This commit introduces a new, "sparse" way to discover the > uninteresting trees. We use the perspective of a single user trying > to push their topic to a large repository. That user likely changed > a very small fraction of the paths in their working directory, but > we spend a lot of time walking all reachable trees. > > The way to switch the logic to work in this sparse way is to start > caring about which paths introduce new trees. While it is not > possible to generate a diff between the frontier boundary and all > of the interesting commits, we can simulate that behavior by > inspecting all of the root trees as a whole, then recursing down > to the set of trees at each path. > > We already had taken the first step by passing an oidset to > mark_trees_uninteresting_sparse(). We now create a dictionary > whose keys are paths and values are oidsets. We consider the set > of trees that appear at each path. While we inspect a tree, we > add its subtrees to the oidsets corresponding to the tree entry's > path. We also mark trees as UNINTERESTING if the tree we are > parsing is UNINTERESTING. > > To actually improve the peformance, we need to terminate our performance? > recursion. If the oidset contains only UNINTERESTING trees, then So, at this point, it is not quite clear what "the oidset" refers to to me. When you are looking at the root tree of one commit, you'd add the object ID of the tree to the dictionary keyed by the path (i.e. root of the tree). You traverse to the parent and keep adding these root trees, and after a while you'd hit a negative commit and add its tree with UNINTERESTING bit. Before that point, the oidset for the root of the tree has only interesting trees, so you won't recurse into its contents, but now at the final point you'd walk to propagate UNINTERESTING bit down? > we do not continue the recursion. This avoids walking trees that > are likely to not be reachable from interesting trees. If the > oidset contains only interesting trees, then we will walk these > trees in the final stage that collects the intersting objects to > place in the pack. Thus, we only recurse if the oidset contains > both interesting and UNINITERESTING trees. > > There are a few ways that this is not a universally better option. > > First, we can pack extra objects. If someone copies a subtree > from one tree to another, the first tree will appear UNINTERESTING > and we will not recurse to see that the subtree should also be > UNINTERESTING. We will walk the new tree and see the subtree as > a "new" object and add it to the pack. We add a test case that > demonstrates this as a way to prove that the --sparse option is > actually working. That's an interesting use of the word "working" ;-) > Second, we can have extra memory pressure. If instead of being a > single user pushing a small topic we are a server sending new > objects from across the entire working directory, then we will > gain very little (the recursion will rarely terminate early) but > will spend extra time maintaining the path-oidset dictionaries. > ... > I used the number of walked trees the main metric above because > it is consistent across multiple runs. When I ran my tests, the > performance of the pack-objects command with the same options > could change the end-to-end time by 10x depending on the file > system being warm. However, by repeating the same test on repeat > I could get more consistent timing results. The git.git and > Linux tests were too fast overall (less than 0.5s) to measure > an end-to-end difference. The Azure DevOps case was slow enough > to see the time improve from 15s to 1s in the warm case. The > cold case was 90s to 9s in my testing. Understandable. Projects that tend to be deep (e.g. Java?) would likely to benefit if they are structured and partitioned very well. > +struct path_and_oids_entry { > + struct hashmap_entry ent; > + char *path; > + struct oidset set; Again "set"? If I understand the description of the logic explained in the proposed commit log message correctly, then this is a set of "tree objects at this path", right? At least calling this "trees" may give us slightly better indication. > +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data, > + const struct path_and_oids_entry *e1, > + const struct path_and_oids_entry *e2, > + const void *keydata) > +{ > + return strcmp(e1->path, e2->path); > +} > + > +int map_flags = 0; That's too bland a name for a system-wide global variable. If it is a file-scope "static", that might be more acceptable. In any case, let BSS take care of the initialization and do not initialize an int explicitly to 0. Is it even used? At least at this step the comparison callback function does not seem to make any use of the fn-data, so a better organization may be to delay the introduction of this variable to a later step where it actually gets used, at which time, the code may have a better context to give the variable a more suitable name. > +static void paths_and_oids_init(struct hashmap *map) > +{ > + hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0); > +} > + > +static void paths_and_oids_clear(struct hashmap *map) > +{ > + struct hashmap_iter iter; > + struct path_and_oids_entry *entry; > + hashmap_iter_init(map, &iter); > + > + while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) { > + oidset_clear(&entry->set); > + free(entry->path); > + } > + > + hashmap_free(map, 1); > +} > + > +static void paths_and_oids_insert(struct hashmap *map, > + const char *path, > + const struct object_id *oid) > +{ > + int hash = strhash(path); > + struct path_and_oids_entry key; > + struct path_and_oids_entry *entry; > + > + hashmap_entry_init(&key, hash); > + key.path = xstrdup(path); > + oidset_init(&key.set, 0); > + > + if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) { > + entry = xcalloc(1, sizeof(struct path_and_oids_entry)); > + hashmap_entry_init(entry, hash); > + entry->path = key.path; > + oidset_init(&entry->set, 16); > + hashmap_put(map, entry); > + } else { > + free(key.path); > + } Hmph, having to always allocate only to free sounds wasteful. Can't we instead do the look-up with "path" that are given read-only to us, and allocate only when we are creating the entry in the dictionary for the first time? After all, we'd make many look-ups of the same path to add trees for the same path and only one of them (i.e. the first one) is used to create an entry in the dictionary. Or does it not matter? > + > + oidset_insert(&entry->set, oid); > +} > + > +static void add_children_by_path(struct repository *r, > + struct tree *tree, > + struct hashmap *map) > +{ > + struct tree_desc desc; > + struct name_entry entry; > + > + if (!tree) > + return; > + > + if (parse_tree_gently(tree, 1) < 0) > + return; > + > + init_tree_desc(&desc, tree->buffer, tree->size); > + while (tree_entry(&desc, &entry)) { > + switch (object_type(entry.mode)) { > + case OBJ_TREE: > + paths_and_oids_insert(map, entry.path, entry.oid); > + > + if (tree->object.flags & UNINTERESTING) { > + struct tree *child = lookup_tree(r, entry.oid); > + if (child) > + child->object.flags |= UNINTERESTING; > + } > + break; > + case OBJ_BLOB: > + if (tree->object.flags & UNINTERESTING) { > + struct blob *child = lookup_blob(r, entry.oid); > + if (child) > + child->object.flags |= UNINTERESTING; > + } > + break; > + default: > + /* Subproject commit - not in this repository */ > + break; > + } > + } > + > + free_tree_buffer(tree); > +} OK. > void mark_trees_uninteresting_sparse(struct repository *r, > struct oidset *set) > { > + unsigned has_interesting = 0, has_uninteresting = 0; > + struct hashmap map; > + struct hashmap_iter map_iter; > + struct path_and_oids_entry *entry; > struct object_id *oid; > struct oidset_iter iter; > > oidset_iter_init(set, &iter); > - while ((oid = oidset_iter_next(&iter))) { > + while ((!has_interesting || !has_uninteresting) && > + (oid = oidset_iter_next(&iter))) { > struct tree *tree = lookup_tree(r, oid); > > if (!tree) > continue; > > + if (tree->object.flags & UNINTERESTING) > + has_uninteresting = 1; > + else > + has_interesting = 1; > + } > + > + /* Do not walk unless we have both types of trees. */ > + if (!has_uninteresting || !has_interesting) > + return; OK. > + paths_and_oids_init(&map); > + > + oidset_iter_init(set, &iter); > + while ((oid = oidset_iter_next(&iter))) { > + struct tree *tree = lookup_tree(r, oid); > + add_children_by_path(r, tree, &map); > } > + > + hashmap_iter_init(&map, &map_iter); > + while ((entry = hashmap_iter_next(&map_iter))) > + mark_trees_uninteresting_sparse(r, &entry->set); > + > + paths_and_oids_clear(&map); > } > > struct commit_stack { > diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh > index 81f6805bc3..45dba6e014 100755 > --- a/t/t5322-pack-objects-sparse.sh > +++ b/t/t5322-pack-objects-sparse.sh > @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' ' > test_cmp expect_objects.txt sparse_objects.txt > ' > > +# Demonstrate that the algorithms differ when we copy a tree wholesale > +# from one folder to another. > + > test_expect_success 'duplicate a folder from f1 into f3' ' > mkdir f3/f4 && > cp -r f1/f1/* f3/f4 && > git add f3/f4 && > git commit -m "Copied f1/f1 to f3/f4" && > - cat >packinput.txt <<-EOF && > + cat >packinput.txt <<-EOF > topic1 > ^topic1~1 > EOF > - git rev-parse \ > - topic1 \ > - topic1^{tree} \ > - topic1:f3 | sort >expect_objects.txt > ' > > test_expect_success 'non-sparse pack-objects' ' > + git rev-parse \ > + topic1 \ > + topic1^{tree} \ > + topic1:f3 | sort >expect_objects.txt && OK, the change above sort of makes sense ;-) > git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && > git index-pack -o nonsparse.idx nonsparse.pack && > git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && > @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' ' > ' > > test_expect_success 'sparse pack-objects' ' > + git rev-parse \ > + topic1 \ > + topic1^{tree} \ > + topic1:f3 \ > + topic1:f3/f4 \ > + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && And the reason why f4 and its contents are wastefully sent is because the whole tree is duplicated, which is understandable. > git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && > git index-pack -o sparse.idx sparse.pack && > git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && > - test_cmp expect_objects.txt sparse_objects.txt > + test_cmp expect_sparse_objects.txt sparse_objects.txt > ' > > test_done ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v4 5/6] pack-objects: create pack.useSparse setting 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2018-12-14 21:22 ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/config/pack.txt | 9 +++++++++ builtin/pack-objects.c | 4 ++++ t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++ 3 files changed, 28 insertions(+) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index edac75c83f..425c73aa52 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -105,6 +105,15 @@ pack.useBitmaps:: true. You should not generally need to turn this off unless you are debugging pack bitmaps. +pack.useSparse:: + When true, git will default to using the '--sparse' option in + 'git pack-objects' when the '--revs' option is present. This + algorithm only walks trees that appear in paths that introduce new + objects. This can have significant performance benefits when + computing a pack to send a small change. However, it is possible + that extra objects are added to the pack-file if the included + commits contain certain types of direct renames. + pack.writeBitmaps (deprecated):: This is a deprecated synonym for `repack.writeBitmaps`. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2018-12-14 21:22 ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 ` Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 6 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-12-14 21:22 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 1 + t/README | 4 ++++ t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5 0/5] Add a new "sparse" tree walk algorithm 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (5 preceding siblings ...) 2018-12-14 21:22 ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 ` Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget ` (4 more replies) 6 siblings, 5 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Update in V3: * Change documentation around 'pack.useSparse' config setting to better advertise to users. * Mostly a ping now that v2.20.0 is out. Updates in V4: * Switched to using hashmap instead of string_list for the path/oidset dictionary. (This is due to some fear that the string_list performance would degrade for a very wide tree, but I am unable to measure a performance difference.) * Some cleanup of code snippets across commits. * Some grammar cleanup in the commit messages. Updates in V5: * Renamed the generic "set" to "trees" * Used a less-restrictive test condition when testing the old algorithm, except for the case where we want to verify that the new algorithm is being run. * Removed a global variable that was not used. Thanks, Ramsay and Junio! Derrick Stolee (5): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/config/pack.txt | 9 ++ Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 10 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 70 +++++++++++--- list-objects.h | 4 +- revision.c | 143 +++++++++++++++++++++++++++++ revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 136 +++++++++++++++++++++++++++ 12 files changed, 378 insertions(+), 17 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v5 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v5 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v4: 1: 817e30a287 ! 1: 02ef702884 revision: add mark_tree_uninteresting_sparse @@ -9,7 +9,10 @@ The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. - This will walk the same number of trees as the old mechanism. + This will walk the same number of trees as the old mechanism. To + ensure that mark_tree_uninteresting walks the tree, we need to + remove the UNINTERESTING flag before calling the method. This + implementation will be replaced entirely in a later commit. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not @@ -26,12 +29,12 @@ } +void mark_trees_uninteresting_sparse(struct repository *r, -+ struct oidset *set) ++ struct oidset *trees) +{ + struct object_id *oid; + struct oidset_iter iter; + -+ oidset_iter_init(set, &iter); ++ oidset_iter_init(trees, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + @@ -69,7 +72,7 @@ void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); -+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); ++void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees); void show_object_with_name(FILE *, struct object *, const char *); 2: 39dc89beb9 ! 2: 0747f7494e list-objects: consume sparse tree walk @@ -35,18 +35,53 @@ UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method - that performs this logic in a slightly new way. As we iterate over - the commits, we add all of the root trees to an oidset. Then, call - mark_trees_uninteresting_sparse() on that oidset. Note that we + that performs this logic in a slightly different way. As we iterate + over the commits, we add all of the root trees to an oidset. Then, + call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. - The sparse option is not used by any callers at the moment, but - will be wired to 'git pack-objects' in the next change. + Add a '--sparse' flag in 'git pack-objects' to call this new logic. + Add a new test script t/t5322-pack-objects-sparse.sh that tests this + option. The tests currently demonstrate that the resulting object + list is the same as the old algorithm. This includes a case where + both algorithms pack an object that is not needed by a remote due to + limits on the explored set of trees. When the sparse algorithm is + changed in a later commit, we will add a test that demonstrates a + change of behavior in some cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> + diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt + --- a/Documentation/git-pack-objects.txt + +++ b/Documentation/git-pack-objects.txt +@@ + [--local] [--incremental] [--window=<n>] [--depth=<n>] + [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] + [--stdout [--filter=<filter-spec>] | base-name] +- [--shallow] [--keep-true-parents] < object-list ++ [--shallow] [--keep-true-parents] [--sparse] < object-list + + + DESCRIPTION +@@ + Add --no-reuse-object if you want to force a uniform compression + level on all data no matter the source. + ++--sparse:: ++ Use the "sparse" algorithm to determine which objects to include in ++ the pack, when combined with the "--revs" option. This algorithm ++ only walks trees that appear in paths that introduce new objects. ++ This can have significant performance benefits when computing ++ a pack to send a small change. However, it is possible that extra ++ objects are added to the pack-file if the included commits contain ++ certain types of direct renames. ++ + --thin:: + Create a "thin" pack by omitting the common objects between a + sender and a receiver in order to reduce network transfer. This + diff --git a/bisect.c b/bisect.c --- a/bisect.c +++ b/bisect.c @@ -64,14 +99,31 @@ --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ + static int depth = 50; + static int delta_search_threads; + static int pack_to_stdout; ++static int sparse; + static int thin; + static int num_preferred_base; + static struct progress *progress_state; +@@ if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); -+ mark_edges_uninteresting(&revs, show_edge, 0); ++ mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; +@@ + { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), + N_("unpack unreachable objects newer than <time>"), + PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, ++ OPT_BOOL(0, "sparse", &sparse, ++ N_("use the sparse reachability algorithm")), + OPT_BOOL(0, "thin", &thin, + N_("create thin packs")), + OPT_BOOL(0, "shallow", &shallow, diff --git a/builtin/rev-list.c b/builtin/rev-list.c --- a/builtin/rev-list.c @@ -202,3 +254,122 @@ struct oidset; struct list_objects_filter_options; + + diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh + new file mode 100755 + --- /dev/null + +++ b/t/t5322-pack-objects-sparse.sh +@@ ++#!/bin/sh ++ ++test_description='pack-objects object selection using sparse algorithm' ++. ./test-lib.sh ++ ++test_expect_success 'setup repo' ' ++ test_commit initial && ++ for i in $(test_seq 1 3) ++ do ++ mkdir f$i && ++ for j in $(test_seq 1 3) ++ do ++ mkdir f$i/f$j && ++ echo $j >f$i/f$j/data.txt ++ done ++ done && ++ git add . && ++ git commit -m "Initialized trees" && ++ for i in $(test_seq 1 3) ++ do ++ git checkout -b topic$i master && ++ echo change-$i >f$i/f$i/data.txt && ++ git commit -a -m "Changed f$i/f$i/data.txt" ++ done && ++ cat >packinput.txt <<-EOF && ++ topic1 ++ ^topic2 ++ ^topic3 ++ EOF ++ git rev-parse \ ++ topic1 \ ++ topic1^{tree} \ ++ topic1:f1 \ ++ topic1:f1/f1 \ ++ topic1:f1/f1/data.txt | sort >expect_objects.txt ++' ++ ++test_expect_success 'non-sparse pack-objects' ' ++ git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && ++ git index-pack -o nonsparse.idx nonsparse.pack && ++ git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && ++ test_cmp expect_objects.txt nonsparse_objects.txt ++' ++ ++test_expect_success 'sparse pack-objects' ' ++ git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && ++ git index-pack -o sparse.idx sparse.pack && ++ git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && ++ test_cmp expect_objects.txt sparse_objects.txt ++' ++ ++test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ++ git checkout topic1 && ++ echo change-3 >f3/f3/data.txt && ++ git commit -a -m "Changed f3/f3/data.txt" && ++ git rev-parse \ ++ topic1~1 \ ++ topic1~1^{tree} \ ++ topic1^{tree} \ ++ topic1 \ ++ topic1:f1 \ ++ topic1:f1/f1 \ ++ topic1:f1/f1/data.txt | sort >required_objects.txt ++' ++ ++test_expect_success 'non-sparse pack-objects' ' ++ git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && ++ git index-pack -o nonsparse.idx nonsparse.pack && ++ git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && ++ comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && ++ test_cmp required_objects.txt nonsparse_required_objects.txt ++' ++ ++test_expect_success 'sparse pack-objects' ' ++ git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && ++ git index-pack -o sparse.idx sparse.pack && ++ git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && ++ comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && ++ test_cmp required_objects.txt sparse_required_objects.txt ++' ++ ++test_expect_success 'duplicate a folder from f1 into f3' ' ++ mkdir f3/f4 && ++ cp -r f1/f1/* f3/f4 && ++ git add f3/f4 && ++ git commit -m "Copied f1/f1 to f3/f4" && ++ cat >packinput.txt <<-EOF && ++ topic1 ++ ^topic1~1 ++ EOF ++ git rev-parse \ ++ topic1 \ ++ topic1^{tree} \ ++ topic1:f3 | sort >required_objects.txt ++' ++ ++test_expect_success 'non-sparse pack-objects' ' ++ git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && ++ git index-pack -o nonsparse.idx nonsparse.pack && ++ git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && ++ comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && ++ test_cmp required_objects.txt nonsparse_required_objects.txt ++' ++ ++test_expect_success 'sparse pack-objects' ' ++ git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && ++ git index-pack -o sparse.idx sparse.pack && ++ git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && ++ comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && ++ test_cmp required_objects.txt sparse_required_objects.txt ++' ++ ++test_done 3: ab733daff5 < -: ---------- pack-objects: add --sparse option 4: c44172c35e ! 3: 327c102477 revision: implement sparse algorithm @@ -31,7 +31,7 @@ path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. - To actually improve the peformance, we need to terminate our + To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely to not be reachable from interesting trees. If the @@ -46,9 +46,9 @@ from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as - a "new" object and add it to the pack. We add a test case that - demonstrates this as a way to prove that the --sparse option is - actually working. + a "new" object and add it to the pack. A test is modified to + demonstrate this behavior and to verify that the new logic is + being exercised. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new @@ -126,7 +126,7 @@ +struct path_and_oids_entry { + struct hashmap_entry ent; + char *path; -+ struct oidset set; ++ struct oidset trees; +}; + +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data, @@ -137,10 +137,9 @@ + return strcmp(e1->path, e2->path); +} + -+int map_flags = 0; +static void paths_and_oids_init(struct hashmap *map) +{ -+ hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0); ++ hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0); +} + +static void paths_and_oids_clear(struct hashmap *map) @@ -150,7 +149,7 @@ + hashmap_iter_init(map, &iter); + + while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) { -+ oidset_clear(&entry->set); ++ oidset_clear(&entry->trees); + free(entry->path); + } + @@ -166,20 +165,20 @@ + struct path_and_oids_entry *entry; + + hashmap_entry_init(&key, hash); -+ key.path = xstrdup(path); -+ oidset_init(&key.set, 0); ++ ++ /* use a shallow copy for the lookup */ ++ key.path = (char *)path; ++ oidset_init(&key.trees, 0); + + if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) { + entry = xcalloc(1, sizeof(struct path_and_oids_entry)); + hashmap_entry_init(entry, hash); -+ entry->path = key.path; -+ oidset_init(&entry->set, 16); ++ entry->path = xstrdup(key.path); ++ oidset_init(&entry->trees, 16); + hashmap_put(map, entry); -+ } else { -+ free(key.path); + } + -+ oidset_insert(&entry->set, oid); ++ oidset_insert(&entry->trees, oid); +} + +static void add_children_by_path(struct repository *r, @@ -224,7 +223,7 @@ +} + void mark_trees_uninteresting_sparse(struct repository *r, - struct oidset *set) + struct oidset *trees) { + unsigned has_interesting = 0, has_uninteresting = 0; + struct hashmap map; @@ -233,7 +232,7 @@ struct object_id *oid; struct oidset_iter iter; - oidset_iter_init(set, &iter); + oidset_iter_init(trees, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { @@ -263,7 +262,7 @@ + + paths_and_oids_init(&map); + -+ oidset_iter_init(set, &iter); ++ oidset_iter_init(trees, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &map); @@ -271,7 +270,7 @@ + + hashmap_iter_init(&map, &map_iter); + while ((entry = hashmap_iter_next(&map_iter))) -+ mark_trees_uninteresting_sparse(r, &entry->set); ++ mark_trees_uninteresting_sparse(r, &entry->trees); + + paths_and_oids_clear(&map); } @@ -282,7 +281,7 @@ --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ - test_cmp expect_objects.txt sparse_objects.txt + test_cmp required_objects.txt sparse_required_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale @@ -291,27 +290,15 @@ test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && - git add f3/f4 && - git commit -m "Copied f1/f1 to f3/f4" && -- cat >packinput.txt <<-EOF && -+ cat >packinput.txt <<-EOF - topic1 - ^topic1~1 - EOF -- git rev-parse \ -- topic1 \ -- topic1^{tree} \ -- topic1:f3 | sort >expect_objects.txt +@@ ' test_expect_success 'non-sparse pack-objects' ' -+ git rev-parse \ -+ topic1 \ -+ topic1^{tree} \ -+ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && +- git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && ++ git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && @@ ' @@ -325,7 +312,8 @@ git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && -- test_cmp expect_objects.txt sparse_objects.txt +- comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && +- test_cmp required_objects.txt sparse_required_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' 5: f386f6c3c9 ! 4: 28111d70d4 pack-objects: create pack.useSparse setting @@ -73,7 +73,7 @@ + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && -+ test_cmp expect_objects.txt sparse_objects.txt ++ test_cmp required_objects.txt sparse_objects.txt +' + test_done 6: d011a9c1b1 ! 5: ae046e8525 pack-objects: create GIT_TEST_PACK_SPARSE @@ -56,13 +56,4 @@ + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && - test_cmp expect_objects.txt nonsparse_objects.txt -@@ - topic1 \ - topic1^{tree} \ - topic1:f3 | sort >expect_objects.txt && -- git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && -+ git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && - git index-pack -o nonsparse.idx nonsparse.pack && - git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && - test_cmp expect_objects.txt nonsparse_objects.txt + comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && -- gitgitgadget ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v5 2/5] list-objects: consume sparse tree walk 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 ` Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 4 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly different way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. Add a '--sparse' flag in 'git pack-objects' to call this new logic. Add a new test script t/t5322-pack-objects-sparse.sh that tests this option. The tests currently demonstrate that the resulting object list is the same as the old algorithm. This includes a case where both algorithms pack an object that is not needed by a remote due to limits on the explored set of trees. When the sparse algorithm is changed in a later commit, we will add a test that demonstrates a change of behavior in some cases. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/git-pack-objects.txt | 11 ++- bisect.c | 2 +- builtin/pack-objects.c | 5 +- builtin/rev-list.c | 2 +- http-push.c | 2 +- list-objects.c | 70 +++++++++++++++--- list-objects.h | 4 +- t/t5322-pack-objects-sparse.sh | 113 +++++++++++++++++++++++++++++ 8 files changed, 192 insertions(+), 17 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..e45f3e680d 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] [--stdout [--filter=<filter-spec>] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,15 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack, when combined with the "--revs" option. This algorithm + only walks trees that appear in paths that introduce new objects. + This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk(&revs)) die(_("revision walk setup failed")); - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", &sparse, + N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, N_("create thin packs")), OPT_BOOL(0, "shallow", &shallow, diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk(&revs)) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(&revs, show_edge); + mark_edges_uninteresting(&revs, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk(&revs)) die("revision walk setup failed"); - mark_edges_uninteresting(&revs, NULL); + mark_edges_uninteresting(&revs, NULL, 0); objects_to_send = get_delta(&revs, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..fb728f7842 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,73 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, + struct rev_info *revs, + show_edge_fn show_edge, + struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + + oidset_insert(set, &tree->object.oid); + + if (!(parent->object.flags & UNINTERESTING)) + continue; + tree->object.flags |= UNINTERESTING; + + if (revs->edge_hint && !(parent->object.flags & SHOWN)) { + parent->object.flags |= SHOWN; + show_edge(parent); + } + } +} + +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse) { struct commit_list *list; int i; - for (list = revs->commits; list; list = list->next) { - struct commit *commit = list->item; + if (sparse) { + struct oidset set; + oidset_init(&set, 16); - if (commit->object.flags & UNINTERESTING) { - mark_tree_uninteresting(revs->repo, - get_commit_tree(commit)); - if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { - commit->object.flags |= SHOWN; - show_edge(commit); + for (list = revs->commits; list; list = list->next) { + struct commit *commit = list->item; + struct tree *tree = get_commit_tree(commit); + + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; + + oidset_insert(&set, &tree->object.oid); + add_edge_parents(commit, revs, show_edge, &set); + } + + mark_trees_uninteresting_sparse(revs->repo, &set); + oidset_clear(&set); + } else { + for (list = revs->commits; list; list = list->next) { + struct commit *commit = list->item; + if (commit->object.flags & UNINTERESTING) { + mark_tree_uninteresting(revs->repo, + get_commit_tree(commit)); + if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) { + commit->object.flags |= SHOWN; + show_edge(commit); + } + continue; } - continue; + mark_edge_parents_uninteresting(commit, revs, show_edge); } - mark_edge_parents_uninteresting(commit, revs, show_edge); } + if (revs->edge_hint_aggressive) { for (i = 0; i < revs->cmdline.nr; i++) { struct object *obj = revs->cmdline.rev[i].item; diff --git a/list-objects.h b/list-objects.h index ad40762926..a952680e46 100644 --- a/list-objects.h +++ b/list-objects.h @@ -10,7 +10,9 @@ typedef void (*show_object_fn)(struct object *, const char *, void *); void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *); typedef void (*show_edge_fn)(struct commit *); -void mark_edges_uninteresting(struct rev_info *, show_edge_fn); +void mark_edges_uninteresting(struct rev_info *revs, + show_edge_fn show_edge, + int sparse); struct oidset; struct list_objects_filter_options; diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 0000000000..30aef6498a --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,113 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + test_cmp expect_objects.txt nonsparse_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1 \ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1 \ + topic1:f1/f1/data.txt | sort >required_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && + test_cmp required_objects.txt nonsparse_required_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && + test_cmp required_objects.txt sparse_required_objects.txt +' + +test_expect_success 'duplicate a folder from f1 into f3' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >required_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && + comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && + test_cmp required_objects.txt nonsparse_required_objects.txt +' + +test_expect_success 'sparse pack-objects' ' + git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && + test_cmp required_objects.txt sparse_required_objects.txt +' + +test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 ` Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 4 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. To ensure that mark_tree_uninteresting walks the tree, we need to remove the UNINTERESTING flag before calling the method. This implementation will be replaced entirely in a later commit. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 25 +++++++++++++++++++++++++ revision.h | 2 ++ 2 files changed, 27 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..60421f3a10 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,31 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, + struct oidset *trees) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(trees, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + + if (!tree) + continue; + + if (tree->object.flags & UNINTERESTING) { + /* + * Remove the flag so the next call + * is not a no-op. The flag is added + * in mark_tree_unintersting(). + */ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..df684701b9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5 3/5] revision: implement sparse algorithm 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 ` Derrick Stolee via GitGitGadget 2019-01-16 18:26 ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2019-01-16 18:26 ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 4 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:25 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely to not be reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the intersting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINITERESTING trees. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. A test is modified to demonstrate this behavior and to verify that the new logic is being exercised. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack: 220 Walked (old alg): 22,804 Walked (new alg): 129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- revision.c | 138 ++++++++++++++++++++++++++++++--- t/t5322-pack-objects-sparse.sh | 14 +++- 2 files changed, 139 insertions(+), 13 deletions(-) diff --git a/revision.c b/revision.c index 60421f3a10..5de739384a 100644 --- a/revision.c +++ b/revision.c @@ -27,6 +27,7 @@ #include "commit-reach.h" #include "commit-graph.h" #include "prio-queue.h" +#include "hashmap.h" volatile show_early_output_fn_t show_early_output; @@ -99,29 +100,146 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct path_and_oids_entry { + struct hashmap_entry ent; + char *path; + struct oidset trees; +}; + +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data, + const struct path_and_oids_entry *e1, + const struct path_and_oids_entry *e2, + const void *keydata) +{ + return strcmp(e1->path, e2->path); +} + +static void paths_and_oids_init(struct hashmap *map) +{ + hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0); +} + +static void paths_and_oids_clear(struct hashmap *map) +{ + struct hashmap_iter iter; + struct path_and_oids_entry *entry; + hashmap_iter_init(map, &iter); + + while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) { + oidset_clear(&entry->trees); + free(entry->path); + } + + hashmap_free(map, 1); +} + +static void paths_and_oids_insert(struct hashmap *map, + const char *path, + const struct object_id *oid) +{ + int hash = strhash(path); + struct path_and_oids_entry key; + struct path_and_oids_entry *entry; + + hashmap_entry_init(&key, hash); + + /* use a shallow copy for the lookup */ + key.path = (char *)path; + oidset_init(&key.trees, 0); + + if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) { + entry = xcalloc(1, sizeof(struct path_and_oids_entry)); + hashmap_entry_init(entry, hash); + entry->path = xstrdup(key.path); + oidset_init(&entry->trees, 16); + hashmap_put(map, entry); + } + + oidset_insert(&entry->trees, oid); +} + +static void add_children_by_path(struct repository *r, + struct tree *tree, + struct hashmap *map) +{ + struct tree_desc desc; + struct name_entry entry; + + if (!tree) + return; + + if (parse_tree_gently(tree, 1) < 0) + return; + + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + switch (object_type(entry.mode)) { + case OBJ_TREE: + paths_and_oids_insert(map, entry.path, entry.oid); + + if (tree->object.flags & UNINTERESTING) { + struct tree *child = lookup_tree(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + case OBJ_BLOB: + if (tree->object.flags & UNINTERESTING) { + struct blob *child = lookup_blob(r, entry.oid); + if (child) + child->object.flags |= UNINTERESTING; + } + break; + default: + /* Subproject commit - not in this repository */ + break; + } + } + + free_tree_buffer(tree); +} + void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees) { + unsigned has_interesting = 0, has_uninteresting = 0; + struct hashmap map; + struct hashmap_iter map_iter; + struct path_and_oids_entry *entry; struct object_id *oid; struct oidset_iter iter; oidset_iter_init(trees, &iter); - while ((oid = oidset_iter_next(&iter))) { + while ((!has_interesting || !has_uninteresting) && + (oid = oidset_iter_next(&iter))) { struct tree *tree = lookup_tree(r, oid); if (!tree) continue; - if (tree->object.flags & UNINTERESTING) { - /* - * Remove the flag so the next call - * is not a no-op. The flag is added - * in mark_tree_unintersting(). - */ - tree->object.flags ^= UNINTERESTING; - mark_tree_uninteresting(r, tree); - } + if (tree->object.flags & UNINTERESTING) + has_uninteresting = 1; + else + has_interesting = 1; + } + + /* Do not walk unless we have both types of trees. */ + if (!has_uninteresting || !has_interesting) + return; + + paths_and_oids_init(&map); + + oidset_iter_init(trees, &iter); + while ((oid = oidset_iter_next(&iter))) { + struct tree *tree = lookup_tree(r, oid); + add_children_by_path(r, tree, &map); } + + hashmap_iter_init(&map, &map_iter); + while ((entry = hashmap_iter_next(&map_iter))) + mark_trees_uninteresting_sparse(r, &entry->trees); + + paths_and_oids_clear(&map); } struct commit_stack { diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 30aef6498a..9f2a6e5d31 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -79,6 +79,9 @@ test_expect_success 'sparse pack-objects' ' test_cmp required_objects.txt sparse_required_objects.txt ' +# Demonstrate that the algorithms differ when we copy a tree wholesale +# from one folder to another. + test_expect_success 'duplicate a folder from f1 into f3' ' mkdir f3/f4 && cp -r f1/f1/* f3/f4 && @@ -95,7 +98,7 @@ test_expect_success 'duplicate a folder from f1 into f3' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && @@ -103,11 +106,16 @@ test_expect_success 'non-sparse pack-objects' ' ' test_expect_success 'sparse pack-objects' ' + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 \ + topic1:f3/f4 \ + topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt && git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack && git index-pack -o sparse.idx sparse.pack && git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && - comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt && - test_cmp required_objects.txt sparse_required_objects.txt + test_cmp expect_sparse_objects.txt sparse_objects.txt ' test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5 4/5] pack-objects: create pack.useSparse setting 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2019-01-16 18:25 ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 ` Derrick Stolee via GitGitGadget 2019-01-16 18:26 ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 4 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/config/pack.txt | 9 +++++++++ builtin/pack-objects.c | 4 ++++ t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++ 3 files changed, 28 insertions(+) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index edac75c83f..425c73aa52 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -105,6 +105,15 @@ pack.useBitmaps:: true. You should not generally need to turn this off unless you are debugging pack bitmaps. +pack.useSparse:: + When true, git will default to using the '--sparse' option in + 'git pack-objects' when the '--revs' option is present. This + algorithm only walks trees that appear in paths that introduce new + objects. This can have significant performance benefits when + computing a pack to send a small change. However, it is possible + that extra objects are added to the pack-file if the included + commits contain certain types of direct renames. + pack.writeBitmaps (deprecated):: This is a deprecated synonym for `repack.writeBitmaps`. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 9f2a6e5d31..3233fafc90 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -118,4 +118,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt && + test_cmp required_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2019-01-16 18:26 ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 ` Derrick Stolee via GitGitGadget 4 siblings, 0 replies; 51+ messages in thread From: Derrick Stolee via GitGitGadget @ 2019-01-16 18:26 UTC (permalink / raw) To: git; +Cc: peff, avarab, jrnieder, ramsay, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/pack-objects.c | 1 + t/README | 4 ++++ t/t5322-pack-objects-sparse.sh | 4 ++-- 3 files changed, 7 insertions(+), 2 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(&pack_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 3233fafc90..7124b5581a 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -64,7 +64,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs <packinput.txt >nonsparse.pack && + git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt && comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt && -- gitgitgadget ^ permalink raw reply related [flat|nested] 51+ messages in thread
end of thread, other threads:[~2019-01-16 18:26 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget 2018-11-28 22:11 ` Stefan Beller 2018-11-29 14:20 ` Derrick Stolee 2018-11-30 2:39 ` Junio C Hamano 2018-11-30 15:53 ` Derrick Stolee 2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason 2018-11-29 4:05 ` Derrick Stolee 2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2018-11-29 14:24 ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2018-12-10 16:42 ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2019-01-11 19:43 ` Junio C Hamano 2019-01-11 20:25 ` Junio C Hamano 2019-01-11 22:05 ` Derrick Stolee 2018-12-14 21:22 ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2019-01-11 23:20 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget 2019-01-11 22:30 ` Junio C Hamano 2019-01-15 15:06 ` Derrick Stolee 2019-01-15 18:23 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2018-12-14 23:32 ` Ævar Arnfjörð Bjarmason 2018-12-17 14:20 ` Derrick Stolee 2018-12-17 14:26 ` Ævar Arnfjörð Bjarmason 2018-12-17 14:50 ` Derrick Stolee 2019-01-11 23:20 ` Junio C Hamano 2018-12-14 21:22 ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2018-12-14 21:22 ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget 2019-01-16 18:25 ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget 2019-01-16 18:26 ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget 2019-01-16 18:26 ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).