* [PATCH 0/3] some prune optimizations @ 2019-02-14 4:31 Jeff King 2019-02-14 4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King ` (2 more replies) 0 siblings, 3 replies; 57+ messages in thread From: Jeff King @ 2019-02-14 4:31 UTC (permalink / raw) To: git Here are two optimizations for git-prune that we've been running at GitHub for ages. No particular reason to share them now; they just finally floated to the top of my todo pile. Do note that I rebased and polished them (and the third one is brand new), so the concepts are proven, but it's possible I introduced a new bug. ;) [1/3]: prune: lazily perform reachability traversal [2/3]: prune: use bitmaps for reachability traversal [3/3]: prune: check SEEN flag for reachability builtin/prune.c | 44 +++++++++++++++++++++++++++++++------------ reachable.c | 42 +++++++++++++++++++++++++++++++++++++++++ t/perf/p5304-prune.sh | 35 ++++++++++++++++++++++++++++++++++ t/t5304-prune.sh | 12 ++++++++++++ 4 files changed, 121 insertions(+), 12 deletions(-) create mode 100755 t/perf/p5304-prune.sh -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 1/3] prune: lazily perform reachability traversal 2019-02-14 4:31 [PATCH 0/3] some prune optimizations Jeff King @ 2019-02-14 4:35 ` Jeff King 2019-02-14 10:54 ` Eric Sunshine 2019-02-14 4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King 2019-02-14 4:38 ` [PATCH 3/3] prune: check SEEN flag for reachability Jeff King 2 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-02-14 4:35 UTC (permalink / raw) To: git The general strategy of "git prune" is to do a full reachability walk, then for each loose object see if we found it in our walk. But if we don't have any loose objects, we don't need to do the expensive walk in the first place. This patch postpones that walk until the first time we need to see its results. Note that this is really a specific case of a more general optimization, which is that we could traverse only far enough to find the object under consideration (i.e., stop the traversal when we find it, then pick up again when asked about the next object, etc). That could save us in some instances from having to do a full walk. But it's actually a bit tricky to do with our traversal code, and you'd need to do a full walk anyway if you have even a single unreachable object (which you generally do, if any objects are actually left after running git-repack). So in practice this lazy-load of the full walk catches one easy but common case (i.e., you've just repacked via git-gc, and there's nothing unreachable). The perf script is fairly contrived, but it does show off the improvement: Test HEAD^ HEAD ------------------------------------------------------------------------- 5304.4: prune with no objects 3.66(3.60+0.05) 0.00(0.00+0.00) -100.0% and would let us know if we accidentally regress this optimization. Note also that we need to take special care with prune_shallow(), which relies on us having performed the traversal. So this optimization can only kick in for a non-shallow repository. Since this is easy to get wrong and is not covered by existing tests, let's add an extra test to t5304 that covers this case explicitly. Signed-off-by: Jeff King <peff@peff.net> --- The diff looks nice with --color-moved. I wish there was a way to communicate that information in a plaintext email. builtin/prune.c | 43 ++++++++++++++++++++++++++++++++----------- t/perf/p5304-prune.sh | 24 ++++++++++++++++++++++++ t/t5304-prune.sh | 12 ++++++++++++ 3 files changed, 68 insertions(+), 11 deletions(-) create mode 100755 t/perf/p5304-prune.sh diff --git a/builtin/prune.c b/builtin/prune.c index 1ec9ddd751..04b6573945 100644 --- a/builtin/prune.c +++ b/builtin/prune.c @@ -31,16 +31,40 @@ static int prune_tmp_file(const char *fullpath) return 0; } -static int prune_object(const struct object_id *oid, const char *fullpath, - void *data) +static void perform_reachability_traversal(struct rev_info *revs) { - struct stat st; + static int initialized; + struct progress *progress = NULL; + + if (initialized) + return; + + if (show_progress) + progress = start_delayed_progress(_("Checking connectivity"), 0); + mark_reachable_objects(revs, 1, expire, progress); + stop_progress(&progress); + initialized = 1; +} + +static int is_object_reachable(const struct object_id *oid, + struct rev_info *revs) +{ + perform_reachability_traversal(revs); /* * Do we know about this object? * It must have been reachable */ - if (lookup_object(the_repository, oid->hash)) + return !!lookup_object(the_repository, oid->hash); +} + +static int prune_object(const struct object_id *oid, const char *fullpath, + void *data) +{ + struct rev_info *revs = data; + struct stat st; + + if (is_object_reachable(oid, revs)) return 0; if (lstat(fullpath, &st)) { @@ -102,7 +126,6 @@ static void remove_temporary_files(const char *path) int cmd_prune(int argc, const char **argv, const char *prefix) { struct rev_info revs; - struct progress *progress = NULL; int exclude_promisor_objects = 0; const struct option options[] = { OPT__DRY_RUN(&show_only, N_("do not remove, show only")), @@ -142,17 +165,13 @@ int cmd_prune(int argc, const char **argv, const char *prefix) if (show_progress == -1) show_progress = isatty(2); - if (show_progress) - progress = start_delayed_progress(_("Checking connectivity"), 0); if (exclude_promisor_objects) { fetch_if_missing = 0; revs.exclude_promisor_objects = 1; } - mark_reachable_objects(&revs, 1, expire, progress); - stop_progress(&progress); for_each_loose_file_in_objdir(get_object_directory(), prune_object, - prune_cruft, prune_subdir, NULL); + prune_cruft, prune_subdir, &revs); prune_packed_objects(show_only ? PRUNE_PACKED_DRY_RUN : 0); remove_temporary_files(get_object_directory()); @@ -160,8 +179,10 @@ int cmd_prune(int argc, const char **argv, const char *prefix) remove_temporary_files(s); free(s); - if (is_repository_shallow(the_repository)) + if (is_repository_shallow(the_repository)) { + perform_reachability_traversal(&revs); prune_shallow(show_only ? PRUNE_SHOW_ONLY : 0); + } return 0; } diff --git a/t/perf/p5304-prune.sh b/t/perf/p5304-prune.sh new file mode 100755 index 0000000000..3c852084eb --- /dev/null +++ b/t/perf/p5304-prune.sh @@ -0,0 +1,24 @@ +#!/bin/sh + +test_description='performance tests of prune' +. ./perf-lib.sh + +test_perf_default_repo + +test_expect_success 'remove reachable loose objects' ' + git repack -ad +' + +test_expect_success 'remove unreachable loose objects' ' + git prune +' + +test_expect_success 'confirm there are no loose objects' ' + git count-objects | grep ^0 +' + +test_perf 'prune with no objects' ' + git prune +' + +test_done diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh index 270da21ac3..2c19a790c1 100755 --- a/t/t5304-prune.sh +++ b/t/t5304-prune.sh @@ -274,6 +274,18 @@ test_expect_success 'prune .git/shallow' ' test_path_is_missing .git/shallow ' +test_expect_success 'prune .git/shallow when there are no loose objects' ' + SHA1=$(echo hi|git commit-tree HEAD^{tree}) && + echo $SHA1 >.git/shallow && + git update-ref refs/heads/shallow-tip $SHA1 && + git repack -ad && + # verify assumption that all loose objects are gone + git count-objects | grep ^0 && + git prune && + echo $SHA1 >expect && + test_cmp expect .git/shallow +' + test_expect_success 'prune: handle alternate object database' ' test_create_repo A && git -C A commit --allow-empty -m "initial commit" && -- 2.21.0.rc0.586.gffba1126a0 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH 1/3] prune: lazily perform reachability traversal 2019-02-14 4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King @ 2019-02-14 10:54 ` Eric Sunshine 2019-02-14 11:07 ` Jeff King 0 siblings, 1 reply; 57+ messages in thread From: Eric Sunshine @ 2019-02-14 10:54 UTC (permalink / raw) To: Jeff King; +Cc: Git List On Wed, Feb 13, 2019 at 11:35 PM Jeff King <peff@peff.net> wrote: > diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh > @@ -274,6 +274,18 @@ test_expect_success 'prune .git/shallow' ' > +test_expect_success 'prune .git/shallow when there are no loose objects' ' > + SHA1=$(echo hi|git commit-tree HEAD^{tree}) && Perhaps name this variable 'oid' rather than 'SHA1' to make brian happy. > + echo $SHA1 >.git/shallow && > + git update-ref refs/heads/shallow-tip $SHA1 && > + git repack -ad && > + # verify assumption that all loose objects are gone > + git count-objects | grep ^0 && > + git prune && > + echo $SHA1 >expect && > + test_cmp expect .git/shallow > +' ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 1/3] prune: lazily perform reachability traversal 2019-02-14 10:54 ` Eric Sunshine @ 2019-02-14 11:07 ` Jeff King 0 siblings, 0 replies; 57+ messages in thread From: Jeff King @ 2019-02-14 11:07 UTC (permalink / raw) To: Eric Sunshine; +Cc: Git List On Thu, Feb 14, 2019 at 05:54:25AM -0500, Eric Sunshine wrote: > On Wed, Feb 13, 2019 at 11:35 PM Jeff King <peff@peff.net> wrote: > > diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh > > @@ -274,6 +274,18 @@ test_expect_success 'prune .git/shallow' ' > > +test_expect_success 'prune .git/shallow when there are no loose objects' ' > > + SHA1=$(echo hi|git commit-tree HEAD^{tree}) && > > Perhaps name this variable 'oid' rather than 'SHA1' to make brian happy. Heh, that line (and the one below it) are cut-and-paste from the setup of the test directly above. Perhaps we should do this on top: -- >8 -- Subject: [PATCH] t5304: rename "sha1" variables to "oid" Let's make the script less jarring to read in a post-sha1 world by using more hash-agnostic variable names. Signed-off-by: Jeff King <peff@peff.net> --- t/t5304-prune.sh | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh index 2c19a790c1..1eeb828a15 100755 --- a/t/t5304-prune.sh +++ b/t/t5304-prune.sh @@ -118,10 +118,10 @@ test_expect_success 'prune: do not prune detached HEAD with no reflog' ' test_expect_success 'prune: prune former HEAD after checking out branch' ' - head_sha1=$(git rev-parse HEAD) && + head_oid=$(git rev-parse HEAD) && git checkout --quiet master && git prune -v >prune_actual && - grep "$head_sha1" prune_actual + grep "$head_oid" prune_actual ' @@ -265,24 +265,24 @@ EOF ' test_expect_success 'prune .git/shallow' ' - SHA1=$(echo hi|git commit-tree HEAD^{tree}) && - echo $SHA1 >.git/shallow && + oid=$(echo hi|git commit-tree HEAD^{tree}) && + echo $oid >.git/shallow && git prune --dry-run >out && - grep $SHA1 .git/shallow && - grep $SHA1 out && + grep $oid .git/shallow && + grep $oid out && git prune && test_path_is_missing .git/shallow ' test_expect_success 'prune .git/shallow when there are no loose objects' ' - SHA1=$(echo hi|git commit-tree HEAD^{tree}) && - echo $SHA1 >.git/shallow && - git update-ref refs/heads/shallow-tip $SHA1 && + oid=$(echo hi|git commit-tree HEAD^{tree}) && + echo $oid >.git/shallow && + git update-ref refs/heads/shallow-tip $oid && git repack -ad && # verify assumption that all loose objects are gone git count-objects | grep ^0 && git prune && - echo $SHA1 >expect && + echo $oid >expect && test_cmp expect .git/shallow ' @@ -326,8 +326,8 @@ test_expect_success 'prune: handle HEAD reflog in multiple worktrees' ' git reset --hard HEAD^ ) && git prune --expire=now && - SHA1=`git hash-object expected` && - git -C third-worktree show "$SHA1" >actual && + oid=`git hash-object expected` && + git -C third-worktree show "$oid" >actual && test_cmp expected actual ' -- 2.21.0.rc1.567.g7f045cdc92 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 2/3] prune: use bitmaps for reachability traversal 2019-02-14 4:31 [PATCH 0/3] some prune optimizations Jeff King 2019-02-14 4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King @ 2019-02-14 4:37 ` Jeff King 2019-03-09 2:49 ` bitmaps by default? [was: prune: use bitmaps for reachability traversal] Eric Wong 2019-04-15 15:00 ` [PATCH 2/3] prune: use bitmaps for reachability traversal Derrick Stolee 2019-02-14 4:38 ` [PATCH 3/3] prune: check SEEN flag for reachability Jeff King 2 siblings, 2 replies; 57+ messages in thread From: Jeff King @ 2019-02-14 4:37 UTC (permalink / raw) To: git Pruning generally has to traverse the whole commit graph in order to see which objects are reachable. This is the exact problem that reachability bitmaps were meant to solve, so let's use them (if they're available, of course). Here are timings on git.git: Test HEAD^ HEAD ------------------------------------------------------------------------ 5304.6: prune with bitmaps 3.65(3.56+0.09) 1.01(0.92+0.08) -72.3% And on linux.git: Test HEAD^ HEAD -------------------------------------------------------------------------- 5304.6: prune with bitmaps 35.05(34.79+0.23) 3.00(2.78+0.21) -91.4% The tests show a pretty optimal case, as we'll have just repacked and should have pretty good coverage of all refs with our bitmaps. But that's actually pretty realistic: normally prune is run via "gc" right after repacking. A few notes on the implementation: - the change is actually in reachable.c, so it would improve reachability traversals by "reflog expire --stale-fix", as well. Those aren't performed regularly, though (a normal "git gc" doesn't use --stale-fix), so they're not really worth measuring. There's a low chance of regressing that caller, since the use of bitmaps is totally transparent from the caller's perspective. - The bitmap case could actually get away without creating a "struct object", and instead the caller could just look up each object id in the bitmap result. However, this would be a marginal improvement in runtime, and it would make the callers much more complicated. They'd have to handle both the bitmap and non-bitmap cases separately, and in the case of git-prune, we'd also have to tweak prune_shallow(), which relies on our SEEN flags. - Because we do create real object structs, we go through a few contortions to create ones of the right type. This isn't strictly necessary (lookup_unknown_object() would suffice), but it's more memory efficient to use the correct types, since we already know them. Signed-off-by: Jeff King <peff@peff.net> --- reachable.c | 42 ++++++++++++++++++++++++++++++++++++++++++ t/perf/p5304-prune.sh | 11 +++++++++++ 2 files changed, 53 insertions(+) diff --git a/reachable.c b/reachable.c index 6e9b810d2a..0d00a91de4 100644 --- a/reachable.c +++ b/reachable.c @@ -12,6 +12,7 @@ #include "packfile.h" #include "worktree.h" #include "object-store.h" +#include "pack-bitmap.h" struct connectivity_progress { struct progress *progress; @@ -158,10 +159,44 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs, FOR_EACH_OBJECT_LOCAL_ONLY); } +static void *lookup_object_by_type(struct repository *r, + const struct object_id *oid, + enum object_type type) +{ + switch (type) { + case OBJ_COMMIT: + return lookup_commit(r, oid); + case OBJ_TREE: + return lookup_tree(r, oid); + case OBJ_TAG: + return lookup_tag(r, oid); + case OBJ_BLOB: + return lookup_blob(r, oid); + default: + die("BUG: unknown object type %d", type); + } +} + +static int mark_object_seen(const struct object_id *oid, + enum object_type type, + int exclude, + uint32_t name_hash, + struct packed_git *found_pack, + off_t found_offset) +{ + struct object *obj = lookup_object_by_type(the_repository, oid, type); + if (!obj) + die("unable to create object '%s'", oid_to_hex(oid)); + + obj->flags |= SEEN; + return 0; +} + void mark_reachable_objects(struct rev_info *revs, int mark_reflog, timestamp_t mark_recent, struct progress *progress) { struct connectivity_progress cp; + struct bitmap_index *bitmap_git; /* * Set up revision parsing, and mark us as being interested @@ -188,6 +223,13 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog, cp.progress = progress; cp.count = 0; + bitmap_git = prepare_bitmap_walk(revs); + if (bitmap_git) { + traverse_bitmap_commit_list(bitmap_git, mark_object_seen); + free_bitmap_index(bitmap_git); + return; + } + /* * Set up the revision walk - this will move all commits * from the pending list to the commit walking list. diff --git a/t/perf/p5304-prune.sh b/t/perf/p5304-prune.sh index 3c852084eb..83baedb8a4 100755 --- a/t/perf/p5304-prune.sh +++ b/t/perf/p5304-prune.sh @@ -21,4 +21,15 @@ test_perf 'prune with no objects' ' git prune ' +test_expect_success 'repack with bitmaps' ' + git repack -adb +' + +# We have to create the object in each trial run, since otherwise +# runs after the first see no object and just skip the traversal entirely! +test_perf 'prune with bitmaps' ' + echo "probably not present in repo" | git hash-object -w --stdin && + git prune +' + test_done -- 2.21.0.rc0.586.gffba1126a0 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* bitmaps by default? [was: prune: use bitmaps for reachability traversal] 2019-02-14 4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King @ 2019-03-09 2:49 ` Eric Wong 2019-03-10 23:39 ` Jeff King 2019-04-15 15:00 ` [PATCH 2/3] prune: use bitmaps for reachability traversal Derrick Stolee 1 sibling, 1 reply; 57+ messages in thread From: Eric Wong @ 2019-03-09 2:49 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> wrote: > Pruning generally has to traverse the whole commit graph in order to > see which objects are reachable. This is the exact problem that > reachability bitmaps were meant to solve, so let's use them (if they're > available, of course). Perhaps this is good impetus for doing bitmaps by default? It would make life easier for people new to hosting git servers (and hopefully reduce centralization :) I started working on it, but t0410-partial-clone.sh fails with "Failed to write bitmap index. Packfile doesn't have full closure"; so more work needs to be done w.r.t. default behavior on partial clones... Here's my WIP: diff --git a/builtin/repack.c b/builtin/repack.c index 67f8978043..ca98d32715 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -14,7 +14,7 @@ static int delta_base_offset = 1; static int pack_kept_objects = -1; -static int write_bitmaps; +static int write_bitmaps = -1; static int use_delta_islands; static char *packdir, *packtmp; @@ -344,10 +344,14 @@ int cmd_repack(int argc, const char **argv, const char *prefix) die(_("--keep-unreachable and -A are incompatible")); if (pack_kept_objects < 0) - pack_kept_objects = write_bitmaps; + pack_kept_objects = write_bitmaps > 0; - if (write_bitmaps && !(pack_everything & ALL_INTO_ONE)) - die(_(incremental_bitmap_conflict_error)); + if (!(pack_everything & ALL_INTO_ONE)) { + if (write_bitmaps > 0) + die(_(incremental_bitmap_conflict_error)); + } else if (write_bitmaps < 0) { + write_bitmaps = 1; + } packdir = mkpathdup("%s/pack", get_object_directory()); packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid()); @@ -368,7 +372,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) argv_array_push(&cmd.args, "--indexed-objects"); if (repository_format_partial_clone) argv_array_push(&cmd.args, "--exclude-promisor-objects"); - if (write_bitmaps) + if (write_bitmaps > 0) argv_array_push(&cmd.args, "--write-bitmap-index"); if (use_delta_islands) argv_array_push(&cmd.args, "--delta-islands"); ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: bitmaps by default? [was: prune: use bitmaps for reachability traversal] 2019-03-09 2:49 ` bitmaps by default? [was: prune: use bitmaps for reachability traversal] Eric Wong @ 2019-03-10 23:39 ` Jeff King 2019-03-12 3:13 ` [PATCH] repack: enable bitmaps by default on bare repos Eric Wong 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-03-10 23:39 UTC (permalink / raw) To: Eric Wong; +Cc: git On Sat, Mar 09, 2019 at 02:49:44AM +0000, Eric Wong wrote: > Jeff King <peff@peff.net> wrote: > > Pruning generally has to traverse the whole commit graph in order to > > see which objects are reachable. This is the exact problem that > > reachability bitmaps were meant to solve, so let's use them (if they're > > available, of course). > > Perhaps this is good impetus for doing bitmaps by default? I'm actually not sure it is, because the prune costs less than making the bitmaps. Here are some timings on linux.git. This full-graph traversal is roughly the same cost as the reachability walk that prune would do internally: $ time git rev-list --objects --all >/dev/null real 0m47.714s user 0m47.113s sys 0m0.600s Here's a normal noop repack as a baseline. $ time git repack -ad real 1m26.922s user 1m20.029s sys 0m7.878s And here's another immediately after with bitmap generation. Generating the bitmaps takes about 100s, compared to the 47s it would save us on the prune. $ time git repack -adb real 3m5.915s user 2m59.377s sys 0m7.718s Things are a little rosier if you generate the bitmaps a second time: $ time git repack -adb real 1m43.571s user 1m37.403s sys 0m8.179s We can reuse some of the old bitmaps and it only takes 20 extra seconds, making it a net win. But I'm not sure how realistic that is. There were literally no new objects introduced between those two command. If this were a "real" repack occurring after we'd accumulated a week or two worth of objects, how long would it take? A few other random observations: - I do suspect there are some real inefficiencies in the way we generate bitmaps. It _should_ be about as expensive as the graph traversal, but clearly it's not. I think this is because of the way the current bitmap code picks commits to bitmap, and then walks somewhat haphazardly over the history, trying to accumulate the set of objects for each commit. IOW, I believe it may sometimes traverse over some sequences of history more than once. So if we could make that faster, then the balance would shift in its favor. - This is comparing the cost of generating the bitmaps to the time saved for _one_ operation. On a repo serving many fetches, the cost to generate it is amortized over many requests. But for a normal end-user, that's not true (they'd presumably push out their work, but that usually only needs to walk a small bit of history anyway). The balance would change if we had more operations that used bitmaps (e.g., --contains can use them, as can ahead/behind checks). We don't do those things yet, but we could. However, those algorithms are also using other commit-graph optimizations, and we've discussed revamping the bitmap format as part of that work (one problem in particular is that to use the current bitmaps you have to parse the whole .bitmap file, making it sometimes a net negative to use the bitmaps). So I'd consider holding off any decision like "let's make this the default" until we see where that work goes. > It would make life easier for people new to hosting git servers > (and hopefully reduce centralization :) I do think they're a net win for people hosting git servers. But if that's the goal, I think at most you'd want to make bitmaps the default for bare repos. They're really not much help for normal end-user repos at this point. > I started working on it, but t0410-partial-clone.sh fails with > "Failed to write bitmap index. Packfile doesn't have full > closure"; so more work needs to be done w.r.t. default behavior > on partial clones... Yeah, you can't use bitmaps at all in an incomplete clone. Shallow clones would probably have the same issue (though in general we just disable bitmaps entirely in shallow situations, so that might kick in). -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH] repack: enable bitmaps by default on bare repos 2019-03-10 23:39 ` Jeff King @ 2019-03-12 3:13 ` Eric Wong 2019-03-12 9:07 ` Ævar Arnfjörð Bjarmason 2019-03-12 10:49 ` Jeff King 0 siblings, 2 replies; 57+ messages in thread From: Eric Wong @ 2019-03-12 3:13 UTC (permalink / raw) To: Jeff King; +Cc: git Jeff King <peff@peff.net> wrote: > On Sat, Mar 09, 2019 at 02:49:44AM +0000, Eric Wong wrote: > > It would make life easier for people new to hosting git servers > > (and hopefully reduce centralization :) > > I do think they're a net win for people hosting git servers. But if > that's the goal, I think at most you'd want to make bitmaps the default > for bare repos. They're really not much help for normal end-user repos > at this point. Fair enough, hopefully this can make life easier for admins new to hosting git: ----------8<--------- Subject: [PATCH] repack: enable bitmaps by default on bare repos A typical use case for bare repos is for serving clones and fetches to clients. Enable bitmaps by default on bare repos to make it easier for admins to host git repos in a performant way. Signed-off-by: Eric Wong <e@80x24.org> --- builtin/repack.c | 16 ++++++++++------ t/t7700-repack.sh | 14 +++++++++++++- 2 files changed, 23 insertions(+), 7 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 67f8978043..5d4758b515 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -14,7 +14,7 @@ static int delta_base_offset = 1; static int pack_kept_objects = -1; -static int write_bitmaps; +static int write_bitmaps = -1; static int use_delta_islands; static char *packdir, *packtmp; @@ -343,11 +343,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix) (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) die(_("--keep-unreachable and -A are incompatible")); + if (!(pack_everything & ALL_INTO_ONE)) { + if (write_bitmaps > 0) + die(_(incremental_bitmap_conflict_error)); + } else if (write_bitmaps < 0) { + write_bitmaps = is_bare_repository(); + } + if (pack_kept_objects < 0) - pack_kept_objects = write_bitmaps; - - if (write_bitmaps && !(pack_everything & ALL_INTO_ONE)) - die(_(incremental_bitmap_conflict_error)); + pack_kept_objects = write_bitmaps > 0; packdir = mkpathdup("%s/pack", get_object_directory()); packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid()); @@ -368,7 +372,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) argv_array_push(&cmd.args, "--indexed-objects"); if (repository_format_partial_clone) argv_array_push(&cmd.args, "--exclude-promisor-objects"); - if (write_bitmaps) + if (write_bitmaps > 0) argv_array_push(&cmd.args, "--write-bitmap-index"); if (use_delta_islands) argv_array_push(&cmd.args, "--delta-islands"); diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index 6162e2a8e6..3e0b5c40e4 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -221,5 +221,17 @@ test_expect_success 'repack --keep-pack' ' ) ' +test_expect_success 'bitmaps are created by default in bare repos' ' + git clone --bare .git bare.git && + cd bare.git && + mkdir old && + mv objects/pack/* old && + pack=$(ls old/*.pack) && + test_path_is_file "$pack" && + git unpack-objects -q <"$pack" && + git repack -ad && + bitmap=$(ls objects/pack/*.bitmap) && + test_path_is_file "$bitmap" +' + test_done - ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH] repack: enable bitmaps by default on bare repos 2019-03-12 3:13 ` [PATCH] repack: enable bitmaps by default on bare repos Eric Wong @ 2019-03-12 9:07 ` Ævar Arnfjörð Bjarmason 2019-03-12 10:49 ` Jeff King 1 sibling, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-03-12 9:07 UTC (permalink / raw) To: Eric Wong; +Cc: Jeff King, git On Tue, Mar 12 2019, Eric Wong wrote: > Jeff King <peff@peff.net> wrote: >> On Sat, Mar 09, 2019 at 02:49:44AM +0000, Eric Wong wrote: >> > It would make life easier for people new to hosting git servers >> > (and hopefully reduce centralization :) >> >> I do think they're a net win for people hosting git servers. But if >> that's the goal, I think at most you'd want to make bitmaps the default >> for bare repos. They're really not much help for normal end-user repos >> at this point. > > Fair enough, hopefully this can make life easier for admins > new to hosting git: > > ----------8<--------- > Subject: [PATCH] repack: enable bitmaps by default on bare repos > > A typical use case for bare repos is for serving clones and > fetches to clients. Enable bitmaps by default on bare repos to > make it easier for admins to host git repos in a performant way. > > Signed-off-by: Eric Wong <e@80x24.org> > --- > builtin/repack.c | 16 ++++++++++------ > t/t7700-repack.sh | 14 +++++++++++++- > 2 files changed, 23 insertions(+), 7 deletions(-) > > diff --git a/builtin/repack.c b/builtin/repack.c > index 67f8978043..5d4758b515 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -14,7 +14,7 @@ > > static int delta_base_offset = 1; > static int pack_kept_objects = -1; > -static int write_bitmaps; > +static int write_bitmaps = -1; > static int use_delta_islands; > static char *packdir, *packtmp; > > @@ -343,11 +343,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) > die(_("--keep-unreachable and -A are incompatible")); > > + if (!(pack_everything & ALL_INTO_ONE)) { > + if (write_bitmaps > 0) > + die(_(incremental_bitmap_conflict_error)); > + } else if (write_bitmaps < 0) { > + write_bitmaps = is_bare_repository(); > + } > + > if (pack_kept_objects < 0) > - pack_kept_objects = write_bitmaps; > - > - if (write_bitmaps && !(pack_everything & ALL_INTO_ONE)) > - die(_(incremental_bitmap_conflict_error)); > + pack_kept_objects = write_bitmaps > 0; > > packdir = mkpathdup("%s/pack", get_object_directory()); > packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid()); > @@ -368,7 +372,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > argv_array_push(&cmd.args, "--indexed-objects"); > if (repository_format_partial_clone) > argv_array_push(&cmd.args, "--exclude-promisor-objects"); > - if (write_bitmaps) > + if (write_bitmaps > 0) > argv_array_push(&cmd.args, "--write-bitmap-index"); > if (use_delta_islands) > argv_array_push(&cmd.args, "--delta-islands"); > diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh > index 6162e2a8e6..3e0b5c40e4 100755 > --- a/t/t7700-repack.sh > +++ b/t/t7700-repack.sh > @@ -221,5 +221,17 @@ test_expect_success 'repack --keep-pack' ' > ) > ' > > +test_expect_success 'bitmaps are created by default in bare repos' ' > + git clone --bare .git bare.git && > + cd bare.git && > + mkdir old && > + mv objects/pack/* old && > + pack=$(ls old/*.pack) && > + test_path_is_file "$pack" && > + git unpack-objects -q <"$pack" && > + git repack -ad && > + bitmap=$(ls objects/pack/*.bitmap) && > + test_path_is_file "$bitmap" > +' > + > test_done > - Looks sensible in principle, but now the git-repack and git-config docs talking about repack.writeBitmaps are out-of-date. A v2 should update those. Also worth testing that -c repack.writeBitmaps=false on a bare repository still overrides it, even though glancing at the code it looks like that case is handled, but without being tested for. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH] repack: enable bitmaps by default on bare repos 2019-03-12 3:13 ` [PATCH] repack: enable bitmaps by default on bare repos Eric Wong 2019-03-12 9:07 ` Ævar Arnfjörð Bjarmason @ 2019-03-12 10:49 ` Jeff King 2019-03-12 12:05 ` Jeff King 2019-03-13 1:51 ` Eric Wong 1 sibling, 2 replies; 57+ messages in thread From: Jeff King @ 2019-03-12 10:49 UTC (permalink / raw) To: Eric Wong; +Cc: git On Tue, Mar 12, 2019 at 03:13:03AM +0000, Eric Wong wrote: > > I do think they're a net win for people hosting git servers. But if > > that's the goal, I think at most you'd want to make bitmaps the default > > for bare repos. They're really not much help for normal end-user repos > > at this point. > > Fair enough, hopefully this can make life easier for admins > new to hosting git: > > ----------8<--------- > Subject: [PATCH] repack: enable bitmaps by default on bare repos > > A typical use case for bare repos is for serving clones and > fetches to clients. Enable bitmaps by default on bare repos to > make it easier for admins to host git repos in a performant way. OK. I still think of bitmaps as something that might need manual care and feeding, but I think that may be leftover superstition. I can't offhand think of any real downsides to this. > static int delta_base_offset = 1; > static int pack_kept_objects = -1; > -static int write_bitmaps; > +static int write_bitmaps = -1; So we'll have "-1" be "not decided yet". Makes sense. > @@ -343,11 +343,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) > die(_("--keep-unreachable and -A are incompatible")); > > + if (!(pack_everything & ALL_INTO_ONE)) { > + if (write_bitmaps > 0) > + die(_(incremental_bitmap_conflict_error)); > + } else if (write_bitmaps < 0) { > + write_bitmaps = is_bare_repository(); > + } Might it be easier here to always resolve "-1" into a 0/1? I.e., like: if (write_bitmaps < 0) write_bitmaps = (pack_everything & ALL_INTO_ONE) && is_bare_repository(); and then the rest of the logic can stay the same, and does not need to be modified to handle "write_bitmaps < 0"? > +test_expect_success 'bitmaps are created by default in bare repos' ' > + git clone --bare .git bare.git && > + cd bare.git && Please don't "cd" outside of a subshell, since it impacts further tests that are added. > + mkdir old && > + mv objects/pack/* old && > + pack=$(ls old/*.pack) && Are we sure we have just done $pack here? Our repo came from a local-disk clone, which would have just hard-linked whatever was in the source repo. So we're subtly relying on the state that other tests have left. I'm not sure what we're trying to accomplish with this unpacking, though. Running "git repack -ad" should generate bitmaps whether the objects were already in a single pack or not. So I think this test can just be: git clone --bare . bare.git && git -C bare.git repack -ad && bitmap=$(ls objects/pack/*.bitmap) test_path_is_file "$bitmap" I do agree with Ævar it might also be worth testing that disabling bitmaps explicitly still works. And also that repacking _without_ "-a" (i.e., an incremental) does not complain about being unable to generate bitmaps. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH] repack: enable bitmaps by default on bare repos 2019-03-12 10:49 ` Jeff King @ 2019-03-12 12:05 ` Jeff King 2019-03-13 1:51 ` Eric Wong 1 sibling, 0 replies; 57+ messages in thread From: Jeff King @ 2019-03-12 12:05 UTC (permalink / raw) To: Eric Wong; +Cc: git On Tue, Mar 12, 2019 at 06:49:54AM -0400, Jeff King wrote: > I'm not sure what we're trying to accomplish with this unpacking, > though. Running "git repack -ad" should generate bitmaps whether the > objects were already in a single pack or not. So I think this test can > just be: > > git clone --bare . bare.git && > git -C bare.git repack -ad && > bitmap=$(ls objects/pack/*.bitmap) > test_path_is_file "$bitmap" Of course that should be "bare.git/objects/pack/*.bitmap" in the third line, since we skipped the "cd" entirely. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH] repack: enable bitmaps by default on bare repos 2019-03-12 10:49 ` Jeff King 2019-03-12 12:05 ` Jeff King @ 2019-03-13 1:51 ` Eric Wong 2019-03-13 14:54 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Eric Wong @ 2019-03-13 1:51 UTC (permalink / raw) To: Jeff King; +Cc: git, Ævar Arnfjörð Bjarmason Jeff King <peff@peff.net> wrote: > OK. I still think of bitmaps as something that might need manual care > and feeding, but I think that may be leftover superstition. I can't > offhand think of any real downsides to this. It's a _relatively_ new feature to long-time users like us, so maybe the "new == immature" mindset sets in[*]. I skimmed some mail archives but couldn't find any reason not to... But I did find Ævar's forgotten gitperformance doc and thread where the topic was brought up: https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/ > On Tue, Mar 12, 2019 at 03:13:03AM +0000, Eric Wong wrote: > > @@ -343,11 +343,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > > (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) > > die(_("--keep-unreachable and -A are incompatible")); > > > > + if (!(pack_everything & ALL_INTO_ONE)) { > > + if (write_bitmaps > 0) > > + die(_(incremental_bitmap_conflict_error)); > > + } else if (write_bitmaps < 0) { > > + write_bitmaps = is_bare_repository(); > > + } > > Might it be easier here to always resolve "-1" into a 0/1? I.e., like: > > if (write_bitmaps < 0) > write_bitmaps = (pack_everything & ALL_INTO_ONE) && is_bare_repository(); > > and then the rest of the logic can stay the same, and does not need to > be modified to handle "write_bitmaps < 0"? Good point, changed in v2. > > +test_expect_success 'bitmaps are created by default in bare repos' ' > > + git clone --bare .git bare.git && > > + cd bare.git && > > Please don't "cd" outside of a subshell, since it impacts further tests > that are added. Oops, I got it into my head that each test was already run in a separate subshell :x Fixed. > > + mkdir old && > > + mv objects/pack/* old && > > + pack=$(ls old/*.pack) && > > Are we sure we have just done $pack here? Our repo came from a > local-disk clone, which would have just hard-linked whatever was in the > source repo. So we're subtly relying on the state that other tests have > left. > > I'm not sure what we're trying to accomplish with this unpacking, > though. Running "git repack -ad" should generate bitmaps whether the > objects were already in a single pack or not. So I think this test can > just be: Right, I forgot "repack -a" didn't need loose objects to operate :x > I do agree with Ævar it might also be worth testing that disabling > bitmaps explicitly still works. And also that repacking _without_ "-a" > (i.e., an incremental) does not complain about being unable to generate > bitmaps. Yup, now with additional tests for repack.writeBitmaps=false, repack (w/o "-a") and a config/repack.txt update ------------8<--------- Subject: [PATCH] repack: enable bitmaps by default on bare repos A typical use case for bare repos is for serving clones and fetches to clients. Enable bitmaps by default on bare repos to make it easier for admins to host git repos in a performant way. Signed-off-by: Eric Wong <e@80x24.org> Helped-by: Jeff King <peff@peff.net> --- Documentation/config/repack.txt | 2 +- builtin/repack.c | 5 ++++- t/t7700-repack.sh | 19 ++++++++++++++++++- 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt index a5c37813fd..9c413e177e 100644 --- a/Documentation/config/repack.txt +++ b/Documentation/config/repack.txt @@ -24,4 +24,4 @@ repack.writeBitmaps:: packs created for clones and fetches, at the cost of some disk space and extra time spent on the initial repack. This has no effect if multiple packfiles are created. - Defaults to false. + Defaults to true on bare repos, false otherwise. diff --git a/builtin/repack.c b/builtin/repack.c index 67f8978043..caca113927 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -14,7 +14,7 @@ static int delta_base_offset = 1; static int pack_kept_objects = -1; -static int write_bitmaps; +static int write_bitmaps = -1; static int use_delta_islands; static char *packdir, *packtmp; @@ -343,6 +343,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix) (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) die(_("--keep-unreachable and -A are incompatible")); + if (write_bitmaps < 0) + write_bitmaps = (pack_everything & ALL_INTO_ONE) && + is_bare_repository(); if (pack_kept_objects < 0) pack_kept_objects = write_bitmaps; diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index 6162e2a8e6..6458286dcf 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -221,5 +221,22 @@ test_expect_success 'repack --keep-pack' ' ) ' -test_done +test_expect_success 'bitmaps are created by default in bare repos' ' + git clone --bare .git bare.git && + git -C bare.git repack -ad && + bitmap=$(ls bare.git/objects/pack/*.bitmap) && + test_path_is_file "$bitmap" +' + +test_expect_success 'incremental repack does not complain' ' + git -C bare.git repack -q 2>repack.err && + ! test -s repack.err +' +test_expect_success 'bitmaps can be disabled on bare repos' ' + git -c repack.writeBitmaps=false -C bare.git repack -ad && + bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) && + test -z "$bitmap" +' + +test_done -- EW ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH] repack: enable bitmaps by default on bare repos 2019-03-13 1:51 ` Eric Wong @ 2019-03-13 14:54 ` Jeff King 2019-03-14 9:12 ` [PATCH v3] " Eric Wong 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-03-13 14:54 UTC (permalink / raw) To: Eric Wong; +Cc: git, Ævar Arnfjörð Bjarmason On Wed, Mar 13, 2019 at 01:51:33AM +0000, Eric Wong wrote: > But I did find Ævar's forgotten gitperformance doc and thread > where the topic was brought up: > > https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/ One thing that thread reminded me of: we probably also want to default pack.writebitmaphashcache on. Otherwise the time saved during the object enumeration can backfire when we spend much more time trying delta compression (because we don't know the pathnames of any objects). The reason it defaults to off is for on-disk compatibility with JGit. But I have very little experience running without the hash-cache on. We added it very early on because we found performance was not great without it (I don't know if people running JGit have run into the same problem and if not, why not). > ------------8<--------- > Subject: [PATCH] repack: enable bitmaps by default on bare repos > > A typical use case for bare repos is for serving clones and > fetches to clients. Enable bitmaps by default on bare repos to > make it easier for admins to host git repos in a performant way. This looks good to me, with one minor nit: > +test_expect_success 'incremental repack does not complain' ' > + git -C bare.git repack -q 2>repack.err && > + ! test -s repack.err > +' This last line could use "test_must_be_empty". -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH v3] repack: enable bitmaps by default on bare repos 2019-03-13 14:54 ` Jeff King @ 2019-03-14 9:12 ` Eric Wong 2019-03-14 16:02 ` Jeff King 2019-04-09 15:10 ` [PATCH v3] repack: enable bitmaps by default on bare repos Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 57+ messages in thread From: Eric Wong @ 2019-03-14 9:12 UTC (permalink / raw) To: Jeff King; +Cc: git, Ævar Arnfjörð Bjarmason Jeff King <peff@peff.net> wrote: > On Wed, Mar 13, 2019 at 01:51:33AM +0000, Eric Wong wrote: > > > But I did find Ævar's forgotten gitperformance doc and thread > > where the topic was brought up: > > > > https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/ > > One thing that thread reminded me of: we probably also want to default > pack.writebitmaphashcache on. Otherwise the time saved during the object > enumeration can backfire when we spend much more time trying delta > compression (because we don't know the pathnames of any objects). Interesting... I found a big improvement with public-inbox just using bitmaps; but have never tried the hash cache. > The reason it defaults to off is for on-disk compatibility with JGit. Right. Our documentation seems to indicate JGit just warns (but doesn't fall over), so maybe that can be considered separately. I've never used JGit myself; and was satisfied enough with bitmaps alone that I never tried the hash-cache. > But I have very little experience running without the hash-cache on. We > added it very early on because we found performance was not great > without it (I don't know if people running JGit have run into the same > problem and if not, why not). As far as serving clones and fetches, public-inbox-init has always created bare repos with bitmaps enabled, but without the hash-cache for compatibility concerns. That's a lot of fetches and clones over the years. > > +test_expect_success 'incremental repack does not complain' ' > > + git -C bare.git repack -q 2>repack.err && > > + ! test -s repack.err > > +' > > This last line could use "test_must_be_empty". Thanks for the review! ---------8<----------- Subject: [PATCH] repack: enable bitmaps by default on bare repos A typical use case for bare repos is for serving clones and fetches to clients. Enable bitmaps by default on bare repos to make it easier for admins to host git repos in a performant way. Signed-off-by: Eric Wong <e@80x24.org> Helped-by: Jeff King <peff@peff.net> --- Documentation/config/repack.txt | 2 +- builtin/repack.c | 5 ++++- t/t7700-repack.sh | 19 ++++++++++++++++++- 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt index a5c37813fd..9c413e177e 100644 --- a/Documentation/config/repack.txt +++ b/Documentation/config/repack.txt @@ -24,4 +24,4 @@ repack.writeBitmaps:: packs created for clones and fetches, at the cost of some disk space and extra time spent on the initial repack. This has no effect if multiple packfiles are created. - Defaults to false. + Defaults to true on bare repos, false otherwise. diff --git a/builtin/repack.c b/builtin/repack.c index 67f8978043..caca113927 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -14,7 +14,7 @@ static int delta_base_offset = 1; static int pack_kept_objects = -1; -static int write_bitmaps; +static int write_bitmaps = -1; static int use_delta_islands; static char *packdir, *packtmp; @@ -343,6 +343,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix) (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) die(_("--keep-unreachable and -A are incompatible")); + if (write_bitmaps < 0) + write_bitmaps = (pack_everything & ALL_INTO_ONE) && + is_bare_repository(); if (pack_kept_objects < 0) pack_kept_objects = write_bitmaps; diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index 6162e2a8e6..86d05160a3 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -221,5 +221,22 @@ test_expect_success 'repack --keep-pack' ' ) ' -test_done +test_expect_success 'bitmaps are created by default in bare repos' ' + git clone --bare .git bare.git && + git -C bare.git repack -ad && + bitmap=$(ls bare.git/objects/pack/*.bitmap) && + test_path_is_file "$bitmap" +' + +test_expect_success 'incremental repack does not complain' ' + git -C bare.git repack -q 2>repack.err && + test_must_be_empty repack.err +' +test_expect_success 'bitmaps can be disabled on bare repos' ' + git -c repack.writeBitmaps=false -C bare.git repack -ad && + bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) && + test -z "$bitmap" +' + +test_done -- EW ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-03-14 9:12 ` [PATCH v3] " Eric Wong @ 2019-03-14 16:02 ` Jeff King 2019-03-15 6:21 ` [PATCH 0/2] enable bitmap hash-cache by default Jeff King 2019-04-09 15:10 ` [PATCH v3] repack: enable bitmaps by default on bare repos Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 57+ messages in thread From: Jeff King @ 2019-03-14 16:02 UTC (permalink / raw) To: Eric Wong; +Cc: git, Ævar Arnfjörð Bjarmason On Thu, Mar 14, 2019 at 09:12:54AM +0000, Eric Wong wrote: > > The reason it defaults to off is for on-disk compatibility with JGit. > > Right. Our documentation seems to indicate JGit just warns (but > doesn't fall over), so maybe that can be considered separately. I think it was a hard error in the beginning, but they changed it pretty soon after we added more flags. So it might be reasonable to just enable it by default (but it wouldn't hurt to double check the behavior). I tried running t5310 (which does a back-and-forth with jgit) using this patch: diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a154fc29f6..5264bf055a 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -97,7 +97,7 @@ static off_t reuse_packfile_offset; static int use_bitmap_index_default = 1; static int use_bitmap_index = -1; static int write_bitmap_index; -static uint16_t write_bitmap_options; +static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE; static int exclude_promisor_objects; and it seemed happy. > As far as serving clones and fetches, public-inbox-init has > always created bare repos with bitmaps enabled, but without > the hash-cache for compatibility concerns. > > That's a lot of fetches and clones over the years. The symptom you'd see is that "Compressing objects" during a fetch takes a long time, and/or produces lousy deltas. But it matters less if: - you keep things packed pretty promptly, because we don't bother looking for new deltas between objects in the same pack. Just trying to clone public-inbox.org/git, it does look like it's mostly packed (based on the object counts) but the compression phase still takes 10+ seconds. - how much the names actually help. In your case, I'd think not at all, because being based on hashes, they're effectively random. So the pack-objects heuristics to try to find deltas between files of similar filenames will not help you. Regarding the second thing, I wondered if the overall packing of your public-inbox git repo might not be good, so I did a "git repack -adf --window=1000" on a clone. Hundreds of CPU minutes later, I was only able to shave off about 80MB. I'm not sure if that means you occasionally do very aggressive repacks, or if there simply isn't all that much delta opportunity (after all, you're not storing many versions of one file, but rather tons of different emails; I would expect to find deltas between various versions of a patch, though). Anyway... > ---------8<----------- > Subject: [PATCH] repack: enable bitmaps by default on bare repos > > A typical use case for bare repos is for serving clones and > fetches to clients. Enable bitmaps by default on bare repos to > make it easier for admins to host git repos in a performant way. > > Signed-off-by: Eric Wong <e@80x24.org> > Helped-by: Jeff King <peff@peff.net> This version looks good to me. If we're going to flip the hash-cache default, I think that should be a separate patch anyway. -Peff ^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 0/2] enable bitmap hash-cache by default 2019-03-14 16:02 ` Jeff King @ 2019-03-15 6:21 ` Jeff King 2019-03-15 6:22 ` [PATCH 1/2] t5310: correctly remove bitmaps for jgit test Jeff King 2019-03-15 6:25 ` [PATCH 2/2] pack-objects: default to writing bitmap hash-cache Jeff King 0 siblings, 2 replies; 57+ messages in thread From: Jeff King @ 2019-03-15 6:21 UTC (permalink / raw) To: Eric Wong; +Cc: git, Ævar Arnfjörð Bjarmason On Thu, Mar 14, 2019 at 12:02:56PM -0400, Jeff King wrote: > On Thu, Mar 14, 2019 at 09:12:54AM +0000, Eric Wong wrote: > > > > The reason it defaults to off is for on-disk compatibility with JGit. > > > > Right. Our documentation seems to indicate JGit just warns (but > > doesn't fall over), so maybe that can be considered separately. > > I think it was a hard error in the beginning, but they changed it pretty > soon after we added more flags. So it might be reasonable to just enable > it by default (but it wouldn't hurt to double check the behavior). > > I tried running t5310 (which does a back-and-forth with jgit) using this > patch: I dug up the actual JGit change, and it was indeed from 2014. So here's a more complete series to handle that. There's a minor performance mystery in the second patch, but I think it might be OK to proceed even without solving it. Conceptually these go on top of your patch, but they could be applied separately. [1/2]: t5310: correctly remove bitmaps for jgit test [2/2]: pack-objects: default to writing bitmap hash-cache Documentation/config/pack.txt | 4 +--- builtin/pack-objects.c | 2 +- t/perf/p5310-pack-bitmaps.sh | 3 +-- t/perf/p5311-pack-bitmaps-fetch.sh | 1 - t/t5310-pack-bitmaps.sh | 5 ++--- 5 files changed, 5 insertions(+), 10 deletions(-) -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 1/2] t5310: correctly remove bitmaps for jgit test 2019-03-15 6:21 ` [PATCH 0/2] enable bitmap hash-cache by default Jeff King @ 2019-03-15 6:22 ` Jeff King 2019-03-15 13:25 ` SZEDER Gábor 2019-03-15 6:25 ` [PATCH 2/2] pack-objects: default to writing bitmap hash-cache Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Jeff King @ 2019-03-15 6:22 UTC (permalink / raw) To: Eric Wong; +Cc: git, Ævar Arnfjörð Bjarmason We use "jgit gc" to generate a pack bitmap file, and then make sure our implementation can read it. To prepare the repo before running jgit, we try to "rm -f" any existing bitmap files. But we got the path wrong; we're in a bare repo, so looking in ".git/" finds nothing. Our "rm" doesn't complain because of the "-f", and when we run "rev-list" there are two bitmap files (ours and jgit's). Our reader implementation will ignore one of the bitmap files, but it's likely non-deterministic which one we will use. We'd prefer the one with the more recent timestamp (just because of the way the packed_git list is sorted), but in most test runs they'd have identical timestamps. So this was probably actually testing something useful about 50% of the time, and other half just testing that we could read our own bitmaps (which is covered elsewhere). Signed-off-by: Jeff King <peff@peff.net> --- Just a cleanup I noticed in the area; can be applied independently from patch 2. t/t5310-pack-bitmaps.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 82d7f7f6a5..44a038881a 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -269,7 +269,7 @@ test_expect_success JGIT 'we can read jgit bitmaps' ' git clone --bare . compat-jgit.git && ( cd compat-jgit.git && - rm -f .git/objects/pack/*.bitmap && + rm -f objects/pack/*.bitmap && jgit gc && git rev-list --test-bitmap HEAD ) -- 2.21.0.543.g90eed137f3 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH 1/2] t5310: correctly remove bitmaps for jgit test 2019-03-15 6:22 ` [PATCH 1/2] t5310: correctly remove bitmaps for jgit test Jeff King @ 2019-03-15 13:25 ` SZEDER Gábor 2019-03-15 18:36 ` Jeff King 0 siblings, 1 reply; 57+ messages in thread From: SZEDER Gábor @ 2019-03-15 13:25 UTC (permalink / raw) To: Jeff King; +Cc: Eric Wong, git, Ævar Arnfjörð Bjarmason On Fri, Mar 15, 2019 at 02:22:44AM -0400, Jeff King wrote: > We use "jgit gc" to generate a pack bitmap file, and then make sure our > implementation can read it. To prepare the repo before running jgit, we > try to "rm -f" any existing bitmap files. But we got the path wrong; > we're in a bare repo, so looking in ".git/" finds nothing. Our "rm" > doesn't complain because of the "-f", and when we run "rev-list" there > are two bitmap files (ours and jgit's). Oh, indeed. > Our reader implementation will ignore one of the bitmap files, but it's > likely non-deterministic which one we will use. We'd prefer the one with > the more recent timestamp (just because of the way the packed_git list > is sorted), but in most test runs they'd have identical timestamps. > > So this was probably actually testing something useful about 50% of the > time, and other half just testing that we could read our own bitmaps > (which is covered elsewhere). At least it was testing the right thing up until 87a6bb701a (t5310-pack-bitmaps: make JGit tests work with GIT_TEST_SPLIT_INDEX, 2018-05-10) came along. > t/t5310-pack-bitmaps.sh | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh > index 82d7f7f6a5..44a038881a 100755 > --- a/t/t5310-pack-bitmaps.sh > +++ b/t/t5310-pack-bitmaps.sh > @@ -269,7 +269,7 @@ test_expect_success JGIT 'we can read jgit bitmaps' ' > git clone --bare . compat-jgit.git && > ( > cd compat-jgit.git && > - rm -f .git/objects/pack/*.bitmap && > + rm -f objects/pack/*.bitmap && > jgit gc && > git rev-list --test-bitmap HEAD > ) > -- > 2.21.0.543.g90eed137f3 > ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 1/2] t5310: correctly remove bitmaps for jgit test 2019-03-15 13:25 ` SZEDER Gábor @ 2019-03-15 18:36 ` Jeff King 0 siblings, 0 replies; 57+ messages in thread From: Jeff King @ 2019-03-15 18:36 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Eric Wong, git, Ævar Arnfjörð Bjarmason On Fri, Mar 15, 2019 at 02:25:45PM +0100, SZEDER Gábor wrote: > > Our reader implementation will ignore one of the bitmap files, but it's > > likely non-deterministic which one we will use. We'd prefer the one with > > the more recent timestamp (just because of the way the packed_git list > > is sorted), but in most test runs they'd have identical timestamps. > > > > So this was probably actually testing something useful about 50% of the > > time, and other half just testing that we could read our own bitmaps > > (which is covered elsewhere). > > At least it was testing the right thing up until 87a6bb701a > (t5310-pack-bitmaps: make JGit tests work with GIT_TEST_SPLIT_INDEX, > 2018-05-10) came along. Heh, indeed. I didn't even dig into the history, and just assumed I had botched it in 2013. :) -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 2/2] pack-objects: default to writing bitmap hash-cache 2019-03-15 6:21 ` [PATCH 0/2] enable bitmap hash-cache by default Jeff King 2019-03-15 6:22 ` [PATCH 1/2] t5310: correctly remove bitmaps for jgit test Jeff King @ 2019-03-15 6:25 ` Jeff King 1 sibling, 0 replies; 57+ messages in thread From: Jeff King @ 2019-03-15 6:25 UTC (permalink / raw) To: Eric Wong; +Cc: git, Ævar Arnfjörð Bjarmason Enabling pack.writebitmaphashcache should always be a performance win. It costs only 4 bytes per object on disk, and the timings in ae4f07fbcc (pack-bitmap: implement optional name_hash cache, 2013-12-21) show it improving fetch and partial-bitmap clone times by 40-50%. The only reason we didn't enable it by default at the time is that early versions of JGit's bitmap reader complained about the presence of optional header bits it didn't understand. But that was changed in JGit's d2fa3987a (Use bitcheck to check for presence of OPT_FULL option, 2013-10-30), which made it into JGit v3.5.0 in late 2014. So let's turn this option on by default. It's backwards-compatible with all versions of Git, and if you are also using JGit on the same repository, you'd only run into problems using a version that's almost 5 years old. We'll drop the manual setting from all of our test scripts, including perf tests. This isn't strictly necessary, but it has two advantages: 1. If the hash-cache ever stops being enabled by default, our perf regression tests will notice. 2. We can use the modified perf tests to show off the behavior of an otherwise unconfigured repo, as shown below. These are the results of a few of a perf tests against linux.git that showed interesting results. You can see the expected speedup in 5310.4, which was noted in ae4f07fbcc. Curiously, 5310.8 did not improve (and actually got slower), despite seeing the opposite in ae4f07fbcc. I don't have an explanation for that. The tests from p5311 did not exist back then, but do show improvements (a smaller pack due to better deltas, which we found in less time). Test HEAD^ HEAD ------------------------------------------------------------------------------------- 5310.4: simulated fetch 7.39(22.70+0.25) 5.64(11.43+0.22) -23.7% 5310.8: clone (partial bitmap) 18.45(24.83+1.19) 19.94(28.40+1.36) +8.1% 5311.31: server (128 days) 0.41(1.13+0.05) 0.34(0.72+0.02) -17.1% 5311.32: size (128 days) 7.4M 7.0M -4.8% 5311.33: client (128 days) 1.33(1.49+0.06) 1.29(1.37+0.12) -3.0% Signed-off-by: Jeff King <peff@peff.net> --- Documentation/config/pack.txt | 4 +--- builtin/pack-objects.c | 2 +- t/perf/p5310-pack-bitmaps.sh | 3 +-- t/perf/p5311-pack-bitmaps-fetch.sh | 1 - t/t5310-pack-bitmaps.sh | 3 +-- 5 files changed, 4 insertions(+), 9 deletions(-) diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt index 425c73aa52..9cdcfa7324 100644 --- a/Documentation/config/pack.txt +++ b/Documentation/config/pack.txt @@ -124,6 +124,4 @@ pack.writeBitmapHashCache:: bitmapped and non-bitmapped objects (e.g., when serving a fetch between an older, bitmapped pack and objects that have been pushed since the last gc). The downside is that it consumes 4 - bytes per object of disk space, and that JGit's bitmap - implementation does not understand it, causing it to complain if - Git and JGit are used on the same repository. Defaults to false. + bytes per object of disk space. Defaults to true. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a154fc29f6..5264bf055a 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -97,7 +97,7 @@ static off_t reuse_packfile_offset; static int use_bitmap_index_default = 1; static int use_bitmap_index = -1; static int write_bitmap_index; -static uint16_t write_bitmap_options; +static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE; static int exclude_promisor_objects; diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh index bb91dbb173..6a3a42531b 100755 --- a/t/perf/p5310-pack-bitmaps.sh +++ b/t/perf/p5310-pack-bitmaps.sh @@ -12,8 +12,7 @@ test_perf_large_repo # We intentionally use the deprecated pack.writebitmaps # config so that we can test against older versions of git. test_expect_success 'setup bitmap config' ' - git config pack.writebitmaps true && - git config pack.writebitmaphashcache true + git config pack.writebitmaps true ' test_perf 'repack to disk' ' diff --git a/t/perf/p5311-pack-bitmaps-fetch.sh b/t/perf/p5311-pack-bitmaps-fetch.sh index b04575951f..47c3fd7581 100755 --- a/t/perf/p5311-pack-bitmaps-fetch.sh +++ b/t/perf/p5311-pack-bitmaps-fetch.sh @@ -7,7 +7,6 @@ test_perf_default_repo test_expect_success 'create bitmapped server repo' ' git config pack.writebitmaps true && - git config pack.writebitmaphashcache true && git repack -ad ' diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 44a038881a..a26c8ba9a2 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -34,8 +34,7 @@ test_expect_success 'setup repo with moderate-sized history' ' bitmaptip=$(git rev-parse master) && blob=$(echo tagged-blob | git hash-object -w --stdin) && git tag tagged-blob $blob && - git config repack.writebitmaps true && - git config pack.writebitmaphashcache true + git config repack.writebitmaps true ' test_expect_success 'full repack creates bitmaps' ' -- 2.21.0.543.g90eed137f3 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-03-14 9:12 ` [PATCH v3] " Eric Wong 2019-03-14 16:02 ` Jeff King @ 2019-04-09 15:10 ` Ævar Arnfjörð Bjarmason 2019-04-10 22:57 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-04-09 15:10 UTC (permalink / raw) To: Eric Wong; +Cc: Jeff King, git On Thu, Mar 14 2019, Eric Wong wrote: > Jeff King <peff@peff.net> wrote: >> On Wed, Mar 13, 2019 at 01:51:33AM +0000, Eric Wong wrote: >> >> > But I did find Ævar's forgotten gitperformance doc and thread >> > where the topic was brought up: >> > >> > https://public-inbox.org/git/20170403211644.26814-1-avarab@gmail.com/ >> >> One thing that thread reminded me of: we probably also want to default >> pack.writebitmaphashcache on. Otherwise the time saved during the object >> enumeration can backfire when we spend much more time trying delta >> compression (because we don't know the pathnames of any objects). > > Interesting... I found a big improvement with public-inbox > just using bitmaps; but have never tried the hash cache. > >> The reason it defaults to off is for on-disk compatibility with JGit. > > Right. Our documentation seems to indicate JGit just warns (but > doesn't fall over), so maybe that can be considered separately. > > I've never used JGit myself; and was satisfied enough with > bitmaps alone that I never tried the hash-cache. > >> But I have very little experience running without the hash-cache on. We >> added it very early on because we found performance was not great >> without it (I don't know if people running JGit have run into the same >> problem and if not, why not). > > As far as serving clones and fetches, public-inbox-init has > always created bare repos with bitmaps enabled, but without > the hash-cache for compatibility concerns. > > That's a lot of fetches and clones over the years. > >> > +test_expect_success 'incremental repack does not complain' ' >> > + git -C bare.git repack -q 2>repack.err && >> > + ! test -s repack.err >> > +' >> >> This last line could use "test_must_be_empty". > > Thanks for the review! > > ---------8<----------- > Subject: [PATCH] repack: enable bitmaps by default on bare repos > > A typical use case for bare repos is for serving clones and > fetches to clients. Enable bitmaps by default on bare repos to > make it easier for admins to host git repos in a performant way. > > Signed-off-by: Eric Wong <e@80x24.org> > Helped-by: Jeff King <peff@peff.net> > --- > Documentation/config/repack.txt | 2 +- > builtin/repack.c | 5 ++++- > t/t7700-repack.sh | 19 ++++++++++++++++++- > 3 files changed, 23 insertions(+), 3 deletions(-) > > diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt > index a5c37813fd..9c413e177e 100644 > --- a/Documentation/config/repack.txt > +++ b/Documentation/config/repack.txt > @@ -24,4 +24,4 @@ repack.writeBitmaps:: > packs created for clones and fetches, at the cost of some disk > space and extra time spent on the initial repack. This has > no effect if multiple packfiles are created. > - Defaults to false. > + Defaults to true on bare repos, false otherwise. > diff --git a/builtin/repack.c b/builtin/repack.c > index 67f8978043..caca113927 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -14,7 +14,7 @@ > > static int delta_base_offset = 1; > static int pack_kept_objects = -1; > -static int write_bitmaps; > +static int write_bitmaps = -1; > static int use_delta_islands; > static char *packdir, *packtmp; > > @@ -343,6 +343,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))) > die(_("--keep-unreachable and -A are incompatible")); > > + if (write_bitmaps < 0) > + write_bitmaps = (pack_everything & ALL_INTO_ONE) && > + is_bare_repository(); > if (pack_kept_objects < 0) > pack_kept_objects = write_bitmaps; > > diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh > index 6162e2a8e6..86d05160a3 100755 > --- a/t/t7700-repack.sh > +++ b/t/t7700-repack.sh > @@ -221,5 +221,22 @@ test_expect_success 'repack --keep-pack' ' > ) > ' > > -test_done > +test_expect_success 'bitmaps are created by default in bare repos' ' > + git clone --bare .git bare.git && > + git -C bare.git repack -ad && > + bitmap=$(ls bare.git/objects/pack/*.bitmap) && > + test_path_is_file "$bitmap" > +' > + > +test_expect_success 'incremental repack does not complain' ' > + git -C bare.git repack -q 2>repack.err && > + test_must_be_empty repack.err > +' > > +test_expect_success 'bitmaps can be disabled on bare repos' ' > + git -c repack.writeBitmaps=false -C bare.git repack -ad && > + bitmap=$(ls bare.git/objects/pack/*.bitmap 2>/dev/null || :) && > + test -z "$bitmap" > +' > + > +test_done I've found a case where turning bitmaps on does horrible things for bitmap "push" performance. As it turns out it's not related to this patch per-se, because I had a *.bitmap for other reasons, but replying to this because we'd presumably get the same thing in the bare repo case once this merges down. I can't share the repo, but I had a report where just a "git push" of a topic branch that was 2/58 ahead/behind took ~2 minutes just in "Enumerating objects", but ~500ms without bitmaps. Using a horrible "print to stderr"[1] monkeypatch I'd get without bitmaps and reported by trace2 / ts: Apr 09 16:45:15 16:45:15.365817 git.c:433 | d1 | main | cmd_name | | | | | pack-objects (push/pack-objects) Apr 09 16:45:15 16:45:15.366220 builtin/pack-objects.c:3493 | d1 | main | region_enter | r1 | 0.000928 | | pack-objects | label:enumerate-objects Apr 09 16:45:15 16:45:15.366241 builtin/pack-objects.c:3495 | d1 | main | region_enter | r1 | 0.000950 | | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 16:45:15 16:45:15.366384 builtin/pack-objects.c:3498 | d1 | main | region_leave | r1 | 0.001091 | 0.000141 | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 16:45:15 16:45:15.366394 builtin/pack-objects.c:3510 | d1 | main | region_enter | r1 | 0.001102 | | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 16:45:15 get obj list 1 Apr 09 16:45:15 get obj list 2, did 29391 lines Apr 09 16:45:15 get obj list 3 Apr 09 16:45:15 get obj list 4 Apr 09 16:45:15 get obj list 5 Apr 09 16:45:15 get obj list 6 Apr 09 16:45:15 get obj list 7 Apr 09 16:45:15 get obj list 8 Apr 09 16:45:15 get obj list 9 Apr 09 16:45:15 16:45:15.776559 builtin/pack-objects.c:3514 | d1 | main | region_leave | r1 | 0.411263 | 0.410161 | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 16:45:15 16:45:15.776577 builtin/pack-objects.c:3517 | d1 | main | region_enter | r1 | 0.411285 | | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 16:45:15 16:45:15.776584 builtin/pack-objects.c:3520 | d1 | main | region_leave | r1 | 0.411292 | 0.000007 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 16:45:15 16:45:15.776605 builtin/pack-objects.c:3530 | d1 | main | region_leave | r1 | 0.411313 | 0.410385 | pack-objects | label:enumerate-objects Apr 09 16:45:15 16:45:15.776609 builtin/pack-objects.c:3542 | d1 | main | region_enter | r1 | 0.411318 | | pack-objects | label:write-pack-file Apr 09 16:45:15 16:45:15.794235 builtin/pack-objects.c:3544 | d1 | main | region_leave | r1 | 0.428942 | 0.017624 | pack-objects | label:write-pack-file But with pack.useBitmaps=true: Apr 09 16:39:59 16:39:59.139022 git.c:433 | d1 | main | cmd_name | | | | | pack-objects (push/pack-objects) Apr 09 16:39:59 16:39:59.139398 builtin/pack-objects.c:3493 | d1 | main | region_enter | r1 | 0.000869 | | pack-objects | label:enumerate-objects Apr 09 16:39:59 16:39:59.139419 builtin/pack-objects.c:3495 | d1 | main | region_enter | r1 | 0.000892 | | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 16:39:59 16:39:59.139551 builtin/pack-objects.c:3498 | d1 | main | region_leave | r1 | 0.001023 | 0.000131 | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 16:39:59 16:39:59.139559 builtin/pack-objects.c:3510 | d1 | main | region_enter | r1 | 0.001032 | | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 16:39:59 get obj list 1 Apr 09 16:39:59 get obj list 2, did 29392 lines Apr 09 16:39:59 get obj list 3 Apr 09 16:39:59 prepping walk Apr 09 16:39:59 opening packed bitmap... Apr 09 16:39:59 opening packed bitmap done Apr 09 16:39:59 walking 29392 pending Apr 09 16:39:59 done walking 29392 pending Apr 09 16:39:59 prepare_bitmap_walk 3 Apr 09 16:39:59 prepare_bitmap_walk 4 Apr 09 16:39:59 prepare_bitmap_walk 5 Apr 09 16:40:00 prepare_bitmap_walk 6 Apr 09 16:40:00 prepare_bitmap_walk 6.1 Apr 09 16:41:35 prepare_bitmap_walk 6.2 Apr 09 16:41:35 prepare_bitmap_walk 7 Apr 09 16:41:52 prepare_bitmap_walk 8 Apr 09 16:41:52 walking? Apr 09 16:41:52 traversing Apr 09 16:41:52 traversing done Apr 09 16:41:52 16:41:52.091634 builtin/pack-objects.c:3514 | d1 | main | region_leave | r1 | 112.953099 | 112.952067 | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 16:41:52 16:41:52.091655 builtin/pack-objects.c:3517 | d1 | main | region_enter | r1 | 112.953128 | | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 16:41:52 16:41:52.091668 builtin/pack-objects.c:3520 | d1 | main | region_leave | r1 | 112.953141 | 0.000013 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 16:41:52 16:41:52.091700 builtin/pack-objects.c:3530 | d1 | main | region_leave | r1 | 112.953172 | 112.952303 | pack-objects | label:enumerate-objects Apr 09 16:41:52 16:41:52.091706 builtin/pack-objects.c:3542 | d1 | main | region_enter | r1 | 112.953179 | | pack-objects | label:write-pack-file Apr 09 16:41:52 16:41:52.111966 builtin/pack-objects.c:3544 | d1 | main | region_leave | r1 | 112.973438 | 0.020259 | pack-objects | label:write-pack-file I.e. almost all the time is in get_object_list_from_bitmap() and around 1m30s in just this in pack-bitmap.c: haves_bitmap = find_objects(bitmap_git, revs, haves, NULL); And then another ~20s in: wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap); This is with 10 packs and where only the largest (initial clone pack) had a *.bitmap, but I can also reproduce with a 'git repack -A -d -b', i.e. with only one pack with a *.bitmap, although that makes it a bit better for the first bit, and almost completely cuts down on the time spent in the second phase: Apr 09 17:08:37 17:08:37.261507 builtin/pack-objects.c:3493 | d1 | main | region_enter | r1 | 0.000922 | | pack-objects | label:enumerate-objects Apr 09 17:08:37 17:08:37.261527 builtin/pack-objects.c:3495 | d1 | main | region_enter | r1 | 0.000943 | | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 17:08:37 17:08:37.261600 builtin/pack-objects.c:3498 | d1 | main | region_leave | r1 | 0.001015 | 0.000072 | pack-objects | ..label:enumerate-objects-prepare-packing-data Apr 09 17:08:37 17:08:37.261608 builtin/pack-objects.c:3510 | d1 | main | region_enter | r1 | 0.001024 | | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 17:08:37 get obj list 1 Apr 09 17:08:37 get obj list 2, did 29380 lines Apr 09 17:08:37 get obj list 3 Apr 09 17:08:37 prepping walk Apr 09 17:08:37 opening packed bitmap... Apr 09 17:08:37 opening packed bitmap done Apr 09 17:08:37 walking 29380 pending Apr 09 17:08:37 done walking 29380 pending Apr 09 17:08:37 prepare_bitmap_walk 3 Apr 09 17:08:37 prepare_bitmap_walk 4 Apr 09 17:08:37 prepare_bitmap_walk 5 Apr 09 17:08:38 prepare_bitmap_walk 6 Apr 09 17:08:38 prepare_bitmap_walk 6.1 Apr 09 17:09:07 prepare_bitmap_walk 6.2 Apr 09 17:09:07 prepare_bitmap_walk 7 Apr 09 17:09:09 prepare_bitmap_walk 8 Apr 09 17:09:09 walking? Apr 09 17:09:09 traversing Apr 09 17:09:09 traversing done Apr 09 17:09:09 17:09:09.229185 builtin/pack-objects.c:3514 | d1 | main | region_leave | r1 | 31.968595 | 31.967571 | pack-objects | ..label:enumerate-objects-get-obj-list Apr 09 17:09:09 17:09:09.229203 builtin/pack-objects.c:3517 | d1 | main | region_enter | r1 | 31.968619 | | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 17:09:09 17:09:09.229214 builtin/pack-objects.c:3520 | d1 | main | region_leave | r1 | 31.968630 | 0.000011 | pack-objects | ..label:enumerate-objects-cleanup-preferred-base Apr 09 17:09:09 17:09:09.229242 builtin/pack-objects.c:3530 | d1 | main | region_leave | r1 | 31.968658 | 31.967736 | pack-objects | label:enumerate-objects Apr 09 17:09:09 17:09:09.229265 builtin/pack-objects.c:3542 | d1 | main | region_enter | r1 | 31.968681 | | pack-objects | label:write-pack-file Apr 09 17:09:09 17:09:09.251998 builtin/pack-objects.c:3544 | d1 | main | region_leave | r1 | 31.991412 | 0.022731 | pack-objects | label:write-pack-file I don't have time to dig more into this now, just wanted to send these initial results... 1. diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a154fc29f6..8b2af1740e 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3052,2 +3052,3 @@ static int get_object_list_from_bitmap(struct rev_info *revs) { + fprintf(stderr, "prepping walk\n"); if (!(bitmap_git = prepare_bitmap_walk(revs))) @@ -3055,2 +3056,3 @@ static int get_object_list_from_bitmap(struct rev_info *revs) + fprintf(stderr, "walking?\n"); if (pack_options_allow_reuse() && @@ -3066,3 +3068,5 @@ static int get_object_list_from_bitmap(struct rev_info *revs) + fprintf(stderr, "traversing\n"); traverse_bitmap_commit_list(bitmap_git, &add_object_entry_from_bitmap); + fprintf(stderr, "traversing done\n"); return 0; @@ -3091,2 +3095,3 @@ static void get_object_list(int ac, const char **av) int save_warning; + int lns = 0; @@ -3102,3 +3107,5 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 1\n"); while (fgets(line, sizeof(line), stdin) != NULL) { + lns++; int len = strlen(line); @@ -3128,4 +3135,6 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 2, did %d lines\n", lns); warn_on_object_refname_ambiguity = save_warning; + fprintf(stderr, "get obj list 3\n"); if (use_bitmap_index && !get_object_list_from_bitmap(&revs)) @@ -3133,2 +3142,3 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 4\n"); if (use_delta_islands) @@ -3136,2 +3146,3 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 5\n"); if (prepare_revision_walk(&revs)) @@ -3140,2 +3151,3 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 6\n"); if (!fn_show_object) @@ -3146,2 +3158,3 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 7\n"); if (unpack_unreachable_expiration) { @@ -3157,2 +3170,3 @@ static void get_object_list(int ac, const char **av) + fprintf(stderr, "get obj list 8\n"); if (keep_unreachable) @@ -3163,2 +3177,3 @@ static void get_object_list(int ac, const char **av) loosen_unused_packed_objects(&revs); + fprintf(stderr, "get obj list 9\n"); @@ -3478,3 +3493,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) the_repository); + trace2_region_enter("pack-objects", "enumerate-objects-prepare-packing-data", + the_repository); prepare_packing_data(the_repository, &to_pack); + trace2_region_leave("pack-objects", "enumerate-objects-prepare-packing-data", + the_repository); @@ -3482,11 +3501,28 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) progress_state = start_progress(_("Enumerating objects"), 0); - if (!use_internal_rev_list) + if (!use_internal_rev_list) { + trace2_region_enter("pack-objects", "enumerate-objects-read-stdin", + the_repository); read_object_list_from_stdin(); - else { + trace2_region_leave("pack-objects", "enumerate-objects-read-stdin", + the_repository); + } else { + trace2_region_enter("pack-objects", "enumerate-objects-get-obj-list", + the_repository); get_object_list(rp.argc, rp.argv); argv_array_clear(&rp); + trace2_region_leave("pack-objects", "enumerate-objects-get-obj-list", + the_repository); } + trace2_region_enter("pack-objects", "enumerate-objects-cleanup-preferred-base", + the_repository); cleanup_preferred_base(); - if (include_tag && nr_result) + trace2_region_leave("pack-objects", "enumerate-objects-cleanup-preferred-base", + the_repository); + if (include_tag && nr_result) { + trace2_region_enter("pack-objects", "enumerate-objects-add-tags", + the_repository); for_each_ref(add_ref_tag, NULL); + trace2_region_leave("pack-objects", "enumerate-objects-add-tags", + the_repository); + } stop_progress(&progress_state); diff --git a/pack-bitmap.c b/pack-bitmap.c index 4695aaf6b4..0ab71597fd 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -693,5 +693,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) * because we may not need to use it */ + fprintf(stderr, "opening packed bitmap...\n"); if (open_pack_bitmap(revs->repo, bitmap_git) < 0) goto cleanup; + fprintf(stderr, "opening packed bitmap done\n"); + fprintf(stderr, "walking %d pending\n", revs->pending.nr); for (i = 0; i < revs->pending.nr; ++i) { @@ -720,2 +723,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) } + fprintf(stderr, "done walking %d pending\n", revs->pending.nr); @@ -726,2 +730,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) */ + fprintf(stderr, "prepare_bitmap_walk 3\n"); if (haves && !in_bitmapped_pack(bitmap_git, haves)) @@ -729,2 +734,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) + fprintf(stderr, "prepare_bitmap_walk 4\n"); /* if we don't want anything, we're done here */ @@ -738,2 +744,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) */ + fprintf(stderr, "prepare_bitmap_walk 5\n"); if (load_pack_bitmap(bitmap_git) < 0) @@ -741,2 +748,3 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) + fprintf(stderr, "prepare_bitmap_walk 6\n"); object_array_clear(&revs->pending); @@ -745,3 +753,5 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) revs->ignore_missing_links = 1; + fprintf(stderr, "prepare_bitmap_walk 6.1\n"); haves_bitmap = find_objects(bitmap_git, revs, haves, NULL); + fprintf(stderr, "prepare_bitmap_walk 6.2\n"); reset_revision_walk(); @@ -752,4 +762,6 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) } + fprintf(stderr, "prepare_bitmap_walk 7\n"); wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap); + fprintf(stderr, "prepare_bitmap_walk 8\n"); diff --git a/revision.c b/revision.c index eb8e51bc63..4592d01ee7 100644 --- a/revision.c +++ b/revision.c @@ -63,2 +63,4 @@ static void mark_tree_contents_uninteresting(struct repository *r, + fprintf(stderr, "MTCU\n"); + if (parse_tree_gently(tree, 1) < 0) @@ -167,2 +169,4 @@ static void add_children_by_path(struct repository *r, + fprintf(stderr, "ACBP\n"); + if (!tree) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-04-09 15:10 ` [PATCH v3] repack: enable bitmaps by default on bare repos Ævar Arnfjörð Bjarmason @ 2019-04-10 22:57 ` Jeff King 2019-04-25 7:16 ` Junio C Hamano 2019-05-23 11:30 ` Jeff King 0 siblings, 2 replies; 57+ messages in thread From: Jeff King @ 2019-04-10 22:57 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, git On Tue, Apr 09, 2019 at 05:10:43PM +0200, Ævar Arnfjörð Bjarmason wrote: > I've found a case where turning bitmaps on does horrible things for > bitmap "push" performance. > [...] > I can't share the repo, but I had a report where just a "git push" of a > topic branch that was 2/58 ahead/behind took ~2 minutes just in > "Enumerating objects", but ~500ms without bitmaps. That's pretty bad, though I'm not terribly surprised. The worst cases are ones where we have to traverse a lot to fill in the bitmap. So either there are a lot of commits newer than the bitmapped pack, or we've done a bad job of picking old commits to bitmap, requiring us to walk back to find all of the reachable objects (until we find something that does have a bitmap). And "bad" here is somewhat subjective. The other side told us some "have" lines, and those are what we have to walk back from. _Usually_ those would correspond to ref tips, and the bitmap code tries to put a bitmap at each ref tip. But if you have tons of refs, it can't always do so. > I.e. almost all the time is in get_object_list_from_bitmap() and around > 1m30s in just this in pack-bitmap.c: > > haves_bitmap = find_objects(bitmap_git, revs, haves, NULL); > > And then another ~20s in: > > wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap); Yeah, that's where I'd expect the time to go. Inside find_objects() we'll call traverse_commit_list() to do the actual walk. The first walk for "haves" is looking mostly at old commits. Stuff that is mentioned in the pack, but for which we have to walk to find the bitmap. The second walk for wants is probably mostly looking at new commits. Things that don't have a bitmap yet, and for which we have to walk (though it's interesting that it's so much more expensive than the non-bitmap walk, which has to fully enumerate those trees itself; that implies that some of the "haves" are recent, too, with respect to the last pack). > This is with 10 packs and where only the largest (initial clone pack) > had a *.bitmap, but I can also reproduce with a 'git repack -A -d -b', > i.e. with only one pack with a *.bitmap, although that makes it a bit > better for the first bit, and almost completely cuts down on the time > spent in the second phase: Yeah, that makes sense. By repacking you've taken all those new commits and included them in on-disk bitmaps. So I'd expect the "wants" to get much shorter, but the "haves" phase staying long means we could do a better job of picking commits to have on-disk bitmaps. So two avenues for exploration I think: 1. I've long suspected that the bitmap selection code isn't ideal. Both in terms of what it picks, but also in its runtime (I think it ends up walking the same slices of history multiple times in some cases). 2. The answer we get from a bitmap versus a regular traversal are not apples-to-apples equivalent. The regular traversal walks down to the UNINTERESTING commits, marks the boundaries trees and blobs as UNINTERESTING, and then adds in all the interesting trees and blobs minus the UNINTERESTING parts. So it can sometimes give the wrong answer, claiming something is interesting when it is not. Whereas with bitmaps we fill in the trees and blobs as we walk, and you get the true answer. But it means we may open up a lot more trees than the equivalent traversal would. So one thing I've never really experimented with (and indeed, never really thought about until writing this email) is that the bitmaps could try to do that looser style of traversal, knowing that we might err on the side of calling things interesting in a few cases. But hopefully spending a lot less time opening trees. I'm not even 100% sure what that would look like in code, but just thinking about it from a high level, I don't there's a particular mathematical reason it couldn't work. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-04-10 22:57 ` Jeff King @ 2019-04-25 7:16 ` Junio C Hamano 2019-05-04 1:37 ` Jeff King 2019-05-23 11:30 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Junio C Hamano @ 2019-04-25 7:16 UTC (permalink / raw) To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Eric Wong, git Jeff King <peff@peff.net> writes: > On Tue, Apr 09, 2019 at 05:10:43PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> I've found a case where turning bitmaps on does horrible things for >> bitmap "push" performance. >> [...] >> I can't share the repo, but I had a report where just a "git push" of a >> topic branch that was 2/58 ahead/behind took ~2 minutes just in >> "Enumerating objects", but ~500ms without bitmaps. > > That's pretty bad, though I'm not terribly surprised. I was revisiting the recent "What's cooking" report, and I am not sure what the current status of the topic is. I do not get a feel that the current bitmap implementation has been widely used in repositories that have vastly different access patterns---it probably has been tried only by those who can afford the engineering cost to see if the implementation happens to work well for their workload and some may have chosen to adopt it while others didn't. So it may be very well tuned for the former people but once we merge this topic down, we'll hear from others with quite different workload, which may lead to us tuning the code to bit better to their workload while not hurting other existing users, hopefully. Or not. I am somewhat tempted to make things more exciting by merging it to 'next' soonish, but I guess Ævar and you are not quite ready for that excitement yet, judging from the following (which looks quite sensible suggestions)? Thanks. > Yeah, that makes sense. By repacking you've taken all those new commits > and included them in on-disk bitmaps. So I'd expect the "wants" to get > much shorter, but the "haves" phase staying long means we could do a > better job of picking commits to have on-disk bitmaps. > > So two avenues for exploration I think: > > 1. I've long suspected that the bitmap selection code isn't ideal. > Both in terms of what it picks, but also in its runtime (I think it > ends up walking the same slices of history multiple times in some > cases). > > 2. The answer we get from a bitmap versus a regular traversal are not > apples-to-apples equivalent. The regular traversal walks down to > the UNINTERESTING commits, marks the boundaries trees and blobs as > UNINTERESTING, and then adds in all the interesting trees and blobs > minus the UNINTERESTING parts. So it can sometimes give the wrong > answer, claiming something is interesting when it is not. > > Whereas with bitmaps we fill in the trees and blobs as we walk, and > you get the true answer. But it means we may open up a lot more > trees than the equivalent traversal would. > > So one thing I've never really experimented with (and indeed, never > really thought about until writing this email) is that the bitmaps > could try to do that looser style of traversal, knowing that we > might err on the side of calling things interesting in a few cases. > But hopefully spending a lot less time opening trees. > > I'm not even 100% sure what that would look like in code, but just > thinking about it from a high level, I don't there's a particular > mathematical reason it couldn't work. > > -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-04-25 7:16 ` Junio C Hamano @ 2019-05-04 1:37 ` Jeff King 2019-05-04 6:52 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-05-04 1:37 UTC (permalink / raw) To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Eric Wong, git On Thu, Apr 25, 2019 at 04:16:46PM +0900, Junio C Hamano wrote: > I was revisiting the recent "What's cooking" report, and I am not > sure what the current status of the topic is. > > I do not get a feel that the current bitmap implementation has been > widely used in repositories that have vastly different access > patterns---it probably has been tried only by those who can afford > the engineering cost to see if the implementation happens to work > well for their workload and some may have chosen to adopt it while > others didn't. So it may be very well tuned for the former people > but once we merge this topic down, we'll hear from others with quite > different workload, which may lead to us tuning the code to bit > better to their workload while not hurting other existing users, > hopefully. > > Or not. Note that Ævar's case was somebody running bitmaps locally and trying to push, which I think is generally not a good match for bitmaps (even when they work, they cost more to generate than what you save if you're only pushing once). The goal of Eric's patch was that by kicking in for bare repos, we'd mostly be hitting servers that are serving up fetches. So if by "workload" you mean that we some people might use bare repos for other cases, yeah, there's a potential for confusion or regression there. If you mean that bitmaps might not work for some workloads even when we're serving a lot of fetches, I won't say that's _not_ true, but my experience is that they are generally a net win. Both for the smaller repositories we see on github.com, but also for big, busy ones that our on-premises customers use. Actually, there is one curiosity with Eric's patch that I haven't tested. As I've mentioned before, we store "forks" as single repositories pointing to a single shared alternates repository. Since the bitmap code only handles one .bitmap per invocation, you really want just one big one in the shared repo. If "git repack" in the forks started generating one, that would be surprising and annoying. In practice this is a pretty extreme corner case. And a lot would depend on how you're using "repack" in the fork (e.g., a partial repack would know that it can't generate bitmaps anyway). I'm pretty sure it would not even impact our setup at all, but I can probably come up with a devils advocate one where it would. > I am somewhat tempted to make things more exciting by merging it to > 'next' soonish, but I guess Ævar and you are not quite ready for > that excitement yet, judging from the following (which looks quite > sensible suggestions)? It's OK with me for this to go to 'next'. Note that the other two patches from me could actually graduate separately. One is a straight-out test fix, and the other should always be a win (and does nothing if you're not already generating bitmaps). By the way, there were some timing puzzles mentioned in that second commit. I re-ran them today and everything was what I'd expect. So I wonder if I just screwed up the timings before. I can re-write that commit message if it hasn't made it to 'next' yet. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-04 1:37 ` Jeff King @ 2019-05-04 6:52 ` Ævar Arnfjörð Bjarmason 2019-05-04 13:23 ` SZEDER Gábor 2019-05-07 7:45 ` Jeff King 0 siblings, 2 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-04 6:52 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor On Sat, May 04 2019, Jeff King wrote: > On Thu, Apr 25, 2019 at 04:16:46PM +0900, Junio C Hamano wrote: > >> I was revisiting the recent "What's cooking" report, and I am not >> sure what the current status of the topic is. >> >> I do not get a feel that the current bitmap implementation has been >> widely used in repositories that have vastly different access >> patterns---it probably has been tried only by those who can afford >> the engineering cost to see if the implementation happens to work >> well for their workload and some may have chosen to adopt it while >> others didn't. So it may be very well tuned for the former people >> but once we merge this topic down, we'll hear from others with quite >> different workload, which may lead to us tuning the code to bit >> better to their workload while not hurting other existing users, >> hopefully. >> >> Or not. > > Note that Ævar's case was somebody running bitmaps locally and trying to > push, which I think is generally not a good match for bitmaps (even when > they work, they cost more to generate than what you save if you're only > pushing once). Right. It was *not* caused by this "enable bitmaps by default on bare repos" patch (which I wasn't even running with), but *is* indicative of a pretty big edge case with enabling bitmaps that *will* happen for some on such bare repos if we ship the patch. > The goal of Eric's patch was that by kicking in for bare repos, we'd > mostly be hitting servers that are serving up fetches. So if by > "workload" you mean that we some people might use bare repos for other > cases, yeah, there's a potential for confusion or regression there. As noted I suspect that's a really rare use-case in practice, and in reply to Junio's <xmqq1s1qy2ox.fsf@gitster-ct.c.googlers.com> upthread I think it's fine to just try this. Maybe we'll finally get such use-cases out of the woodworks by turning it on by default. As an aside this is the Nth time I notice how crappy that "Enumerating objects" progress bar is. We do a *lot* of things there, including this bitmap calculation. But just splitting it up might result in either no progress (all individually below 2 seconds), or a lot of noise as you have 20 things that each take 2 seconds. I wonder if someone's looked at supporting: Enumerating Objects (X%) => Calculating bitmaps (Y%) Where X% is the total progres, and %Y is the sub-progress. I eyeballed doing this once by "chaining" the progress structs, but there's probably a less crappy way... > If you mean that bitmaps might not work for some workloads even when > we're serving a lot of fetches, I won't say that's _not_ true, but my > experience is that they are generally a net win. Both for the smaller > repositories we see on github.com, but also for big, busy ones that our > on-premises customers use. > > Actually, there is one curiosity with Eric's patch that I haven't > tested. As I've mentioned before, we store "forks" as single > repositories pointing to a single shared alternates repository. Since > the bitmap code only handles one .bitmap per invocation, you really > want just one big one in the shared repo. If "git repack" in the forks > started generating one, that would be surprising and annoying. > > In practice this is a pretty extreme corner case. And a lot would > depend on how you're using "repack" in the fork (e.g., a partial > repack would know that it can't generate bitmaps anyway). I'm pretty > sure it would not even impact our setup at all, but I can probably > come up with a devils advocate one where it would. > >> I am somewhat tempted to make things more exciting by merging it to >> 'next' soonish, but I guess Ævar and you are not quite ready for >> that excitement yet, judging from the following (which looks quite >> sensible suggestions)? > > It's OK with me for this to go to 'next'. Note that the other two > patches from me could actually graduate separately. One is a > straight-out test fix, and the other should always be a win (and does > nothing if you're not already generating bitmaps). > > By the way, there were some timing puzzles mentioned in that second > commit. I re-ran them today and everything was what I'd expect. So I > wonder if I just screwed up the timings before. I can re-write that > commit message if it hasn't made it to 'next' yet. > > -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-04 6:52 ` Ævar Arnfjörð Bjarmason @ 2019-05-04 13:23 ` SZEDER Gábor 2019-05-08 20:17 ` Ævar Arnfjörð Bjarmason 2019-05-07 7:45 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: SZEDER Gábor @ 2019-05-04 13:23 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Junio C Hamano, Eric Wong, git On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote: > As an aside this is the Nth time I notice how crappy that "Enumerating > objects" progress bar is. And don't forget that the commit-graph progress bar still prints nonsense :) https://public-inbox.org/git/87ef6ydds8.fsf@evledraar.gmail.com/ ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-04 13:23 ` SZEDER Gábor @ 2019-05-08 20:17 ` Ævar Arnfjörð Bjarmason 2019-05-09 4:24 ` Junio C Hamano 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-08 20:17 UTC (permalink / raw) To: SZEDER Gábor Cc: Jeff King, Junio C Hamano, Eric Wong, git, Derrick Stolee On Sat, May 04 2019, SZEDER Gábor wrote: > On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote: >> As an aside this is the Nth time I notice how crappy that "Enumerating >> objects" progress bar is. > > And don't forget that the commit-graph progress bar still prints > nonsense :) > > https://public-inbox.org/git/87ef6ydds8.fsf@evledraar.gmail.com/ I forgot for a bit, but then figured I'd get out of Derrick's way with his more significant commit-graph work (which would have inevitably had merge conflicts with this patch), and look at the time. Here we are coming up on rc0. I'm inclined to just wait until Derrick's refactorings land post-2.22.0 unless a) we need it enough before 2.22.0 to cause him trouble b) Junio agrees with "a" and would take such a patch to fix this part of the progress output before 2.22.0. If people (just you would do) tell me "yes I/we want it" and Junio says "yeah I'll do that" I'll write this patch in the next couple of days, otherwise I'll do other stuff. Junio? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-08 20:17 ` Ævar Arnfjörð Bjarmason @ 2019-05-09 4:24 ` Junio C Hamano 0 siblings, 0 replies; 57+ messages in thread From: Junio C Hamano @ 2019-05-09 4:24 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: SZEDER Gábor, Jeff King, Eric Wong, git, Derrick Stolee Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > On Sat, May 04 2019, SZEDER Gábor wrote: > ... >> And don't forget that the commit-graph progress bar still prints >> nonsense :) > > I'm inclined to just wait until Derrick's refactorings land post-2.22.0 Fine by me. Thanks. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-04 6:52 ` Ævar Arnfjörð Bjarmason 2019-05-04 13:23 ` SZEDER Gábor @ 2019-05-07 7:45 ` Jeff King 2019-05-07 8:12 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 57+ messages in thread From: Jeff King @ 2019-05-07 7:45 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote: > > Note that Ævar's case was somebody running bitmaps locally and trying to > > push, which I think is generally not a good match for bitmaps (even when > > they work, they cost more to generate than what you save if you're only > > pushing once). > > Right. It was *not* caused by this "enable bitmaps by default on bare > repos" patch (which I wasn't even running with), but *is* indicative of > a pretty big edge case with enabling bitmaps that *will* happen for some > on such bare repos if we ship the patch. Yeah. To clarify my comments a bit: I do think it would be possible to hit a weird case like this while serving fetches (i.e., as far as I know there is nothing in what you saw that is inherent to pushes). But I do think for serving fetches, bitmaps are overall a big net win (based on my experiences). So I think it may come down to a tradeoff: enabling this by default would probably be a net win across the population, but that's little comfort to the unlucky somebody who may see it as a regression. I'm not sure which is more important to maintain. > As an aside this is the Nth time I notice how crappy that "Enumerating > objects" progress bar is. We do a *lot* of things there, including this > bitmap calculation. > > But just splitting it up might result in either no progress (all > individually below 2 seconds), or a lot of noise as you have 20 things > that each take 2 seconds. I wonder if someone's looked at supporting: > > Enumerating Objects (X%) => Calculating bitmaps (Y%) > > Where X% is the total progres, and %Y is the sub-progress. I eyeballed > doing this once by "chaining" the progress structs, but there's probably > a less crappy way... I don't think it needs to be split; I think we just need to update the object count while we're traversing the bitmaps. The problem is that the progress object is known to pack-objects.c. Without bitmaps, as the revision machinery walks the graph, our callbacks bump the progress meter every time we see an object. With bitmaps, all that walking happens behind the scenes, and the bitmap code delivers us the final answer. So we pause for a long time, and then suddenly it shoots forward. I think we'd want a way to tell the bitmap code to update our progress meter as it traverses (both single objects, but also taking into account when it finds a bitmap and then suddenly bumps the value by a large amount). -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-07 7:45 ` Jeff King @ 2019-05-07 8:12 ` Ævar Arnfjörð Bjarmason 2019-05-08 7:11 ` Jeff King 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-07 8:12 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor, Derrick Stolee On Tue, May 07 2019, Jeff King wrote: > On Sat, May 04, 2019 at 08:52:01AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > Note that Ævar's case was somebody running bitmaps locally and trying to >> > push, which I think is generally not a good match for bitmaps (even when >> > they work, they cost more to generate than what you save if you're only >> > pushing once). >> >> Right. It was *not* caused by this "enable bitmaps by default on bare >> repos" patch (which I wasn't even running with), but *is* indicative of >> a pretty big edge case with enabling bitmaps that *will* happen for some >> on such bare repos if we ship the patch. > > Yeah. To clarify my comments a bit: I do think it would be possible to > hit a weird case like this while serving fetches (i.e., as far as I know > there is nothing in what you saw that is inherent to pushes). But I do > think for serving fetches, bitmaps are overall a big net win (based on > my experiences). > > So I think it may come down to a tradeoff: enabling this by default > would probably be a net win across the population, but that's little > comfort to the unlucky somebody who may see it as a regression. I'm not > sure which is more important to maintain. > >> As an aside this is the Nth time I notice how crappy that "Enumerating >> objects" progress bar is. We do a *lot* of things there, including this >> bitmap calculation. >> >> But just splitting it up might result in either no progress (all >> individually below 2 seconds), or a lot of noise as you have 20 things >> that each take 2 seconds. I wonder if someone's looked at supporting: >> >> Enumerating Objects (X%) => Calculating bitmaps (Y%) >> >> Where X% is the total progres, and %Y is the sub-progress. I eyeballed >> doing this once by "chaining" the progress structs, but there's probably >> a less crappy way... > > I don't think it needs to be split; I think we just need to update the > object count while we're traversing the bitmaps. The problem is that the > progress object is known to pack-objects.c. Without bitmaps, as the > revision machinery walks the graph, our callbacks bump the progress > meter every time we see an object. > > With bitmaps, all that walking happens behind the scenes, and the bitmap > code delivers us the final answer. So we pause for a long time, and then > suddenly it shoots forward. > > I think we'd want a way to tell the bitmap code to update our progress > meter as it traverses (both single objects, but also taking into account > when it finds a bitmap and then suddenly bumps the value by a large > amount). Not splitting it will fix the progress bar stalling, so it fixes the problem that the user is wondering if the command is entirely hanging. But I was hoping to give the user an idea of roughly where we're spending our time, e.g. so you can see how much the pack.useSparse setting is helping (or not). So something where we report sub-progress as we go along, and perhaps print some brief summary at the end if it took long enough, e.g.: Enumerating Objects (X^1%) => Marking trees (Y^1%) Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%) And at the end: Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other) I.e. bringing the whole "nested" trace2 regions full circle with the progress bar where we could elect to trace/show some of that info, and then you could turn on some trace2 mode/verbose progress to see more. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-07 8:12 ` Ævar Arnfjörð Bjarmason @ 2019-05-08 7:11 ` Jeff King 2019-05-08 14:20 ` Derrick Stolee 2019-05-08 16:13 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 57+ messages in thread From: Jeff King @ 2019-05-08 7:11 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor, Derrick Stolee On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote: > > I think we'd want a way to tell the bitmap code to update our progress > > meter as it traverses (both single objects, but also taking into account > > when it finds a bitmap and then suddenly bumps the value by a large > > amount). > > Not splitting it will fix the progress bar stalling, so it fixes the > problem that the user is wondering if the command is entirely hanging. > > But I was hoping to give the user an idea of roughly where we're > spending our time, e.g. so you can see how much the pack.useSparse > setting is helping (or not). Yeah, I think that's a bigger and more complicated problem. I admit that my main annoyance is just the stall while we fill in the bitmaps (and it's easy because the bitmap traversal is the same unit of work as a regular traversal). > So something where we report sub-progress as we go along, and perhaps > print some brief summary at the end if it took long enough, e.g.: > > Enumerating Objects (X^1%) => Marking trees (Y^1%) > Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%) > > And at the end: > > Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other) > > I.e. bringing the whole "nested" trace2 regions full circle with the > progress bar where we could elect to trace/show some of that info, and > then you could turn on some trace2 mode/verbose progress to see more. I do wonder if this really needs to be part of the progress bar. The goal of the progress bar is to give the user a sense that work is happening, and (if possible, but not for "enumerating") an idea of when it might finish. If the trace code can already do detailed timings, then shouldn't we just be encouraging people to use that? -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-08 7:11 ` Jeff King @ 2019-05-08 14:20 ` Derrick Stolee 2019-05-08 16:13 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 57+ messages in thread From: Derrick Stolee @ 2019-05-08 14:20 UTC (permalink / raw) To: Jeff King, Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor On 5/8/2019 3:11 AM, Jeff King wrote: > On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote: > >>> I think we'd want a way to tell the bitmap code to update our progress >>> meter as it traverses (both single objects, but also taking into account >>> when it finds a bitmap and then suddenly bumps the value by a large >>> amount). >> >> Not splitting it will fix the progress bar stalling, so it fixes the >> problem that the user is wondering if the command is entirely hanging. >> >> But I was hoping to give the user an idea of roughly where we're >> spending our time, e.g. so you can see how much the pack.useSparse >> setting is helping (or not). > > Yeah, I think that's a bigger and more complicated problem. I admit that > my main annoyance is just the stall while we fill in the bitmaps (and > it's easy because the bitmap traversal is the same unit of work as a > regular traversal). The pack.useSparse setting also speeds up a section that is not marked by progress: that portion usually is walking all UNINTERESTING trees and the"Enumerating Objects" progress is just for walking the INTERESTING objects. >> So something where we report sub-progress as we go along, and perhaps >> print some brief summary at the end if it took long enough, e.g.: >> >> Enumerating Objects (X^1%) => Marking trees (Y^1%) >> Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%) I like this idea for splitting the "normal" mechanism, too: Enumerating Objects (X^1%) => Marking trees (Y^1%) Enumerating Objects (X^2%) => Enumerating objects to pack (Y^2%) >> I.e. bringing the whole "nested" trace2 regions full circle with the >> progress bar where we could elect to trace/show some of that info, and >> then you could turn on some trace2 mode/verbose progress to see more. > > I do wonder if this really needs to be part of the progress bar. The > goal of the progress bar is to give the user a sense that work is > happening, and (if possible, but not for "enumerating") an idea of when > it might finish. If the trace code can already do detailed timings, then > shouldn't we just be encouraging people to use that? The problem I've seen (without bitmaps) is that running `git push` can take a while before _any_ progress is listed. Good news is: `pack.useSparse` fixed our push problem in the Windows OS repo. The end-to-end time for `git push` sped up by 7.7x with the change, and this "blank" time is too fast for users to notice. Updating the progress could help in cases without pack.useSparse. Thanks, -Stolee ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-08 7:11 ` Jeff King 2019-05-08 14:20 ` Derrick Stolee @ 2019-05-08 16:13 ` Ævar Arnfjörð Bjarmason 2019-05-08 22:25 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-08 16:13 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor, Derrick Stolee On Wed, May 08 2019, Jeff King wrote: > On Tue, May 07, 2019 at 10:12:06AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > I think we'd want a way to tell the bitmap code to update our progress >> > meter as it traverses (both single objects, but also taking into account >> > when it finds a bitmap and then suddenly bumps the value by a large >> > amount). >> >> Not splitting it will fix the progress bar stalling, so it fixes the >> problem that the user is wondering if the command is entirely hanging. >> >> But I was hoping to give the user an idea of roughly where we're >> spending our time, e.g. so you can see how much the pack.useSparse >> setting is helping (or not). > > Yeah, I think that's a bigger and more complicated problem. I admit that > my main annoyance is just the stall while we fill in the bitmaps (and > it's easy because the bitmap traversal is the same unit of work as a > regular traversal). > >> So something where we report sub-progress as we go along, and perhaps >> print some brief summary at the end if it took long enough, e.g.: >> >> Enumerating Objects (X^1%) => Marking trees (Y^1%) >> Enumerating Objects (X^2%) => Calculating bitmaps (Y^2%) >> >> And at the end: >> >> Enumerating Objects (100%) in ~2m30s -- (~10s marking trees, ~2m10s bitmaps, ~10s other) >> >> I.e. bringing the whole "nested" trace2 regions full circle with the >> progress bar where we could elect to trace/show some of that info, and >> then you could turn on some trace2 mode/verbose progress to see more. > > I do wonder if this really needs to be part of the progress bar. The > goal of the progress bar is to give the user a sense that work is > happening, and (if possible, but not for "enumerating") an idea of when > it might finish. If the trace code can already do detailed timings, then > shouldn't we just be encouraging people to use that? To just show work happening we could save ourselves some horizontal space and the debates over counting v.s. enumerating with: diff --git a/progress.c b/progress.c index 0318bdd41b..83336ca391 100644 --- a/progress.c +++ b/progress.c @@ -226,3 +226,3 @@ static struct progress *start_progress_delay(const char *title, uint64_t total, struct progress *progress = xmalloc(sizeof(*progress)); - progress->title = title; + progress->title = "Reticulating splines"; progress->total = total; :) Obviously that's silly, but the point is we do show some user messaging with these now, and e.g. the other day here on-list (can't be bothered to find the msgid) someone was lamenting that the N progressbars we show on "push" were too verbose. So by coalescing some of the existing bars that do one logical operation (push) in N steps we could be less verbose without losing the "stuff's happening" part of it, and would see if something odd was going on, e.g. the "I/O write" part being proportionally slower on this box than the other, or when they upgrade bitmaps suddenly showing up as >95% of the time. The bit I find interesting about tying it into trace2 is that once you do that the trace logs can contain e.g. min/max/avg/median/percentile time for doing some operation we can break into N steps same/similar steps, which might be interesting for performance analysis. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-08 16:13 ` Ævar Arnfjörð Bjarmason @ 2019-05-08 22:25 ` Jeff King 0 siblings, 0 replies; 57+ messages in thread From: Jeff King @ 2019-05-08 22:25 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Eric Wong, git, SZEDER Gábor, Derrick Stolee On Wed, May 08, 2019 at 06:13:58PM +0200, Ævar Arnfjörð Bjarmason wrote: > > I do wonder if this really needs to be part of the progress bar. The > > goal of the progress bar is to give the user a sense that work is > > happening, and (if possible, but not for "enumerating") an idea of when > > it might finish. If the trace code can already do detailed timings, then > > shouldn't we just be encouraging people to use that? > [...] > - progress->title = title; > + progress->title = "Reticulating splines"; Heh, OK, that's a fair reductio ad absurdum of my point. I guess what I was trying to say is that we don't need to go overboard with accuracy as long as we're giving a vague sense of the work. Precision measurements can be handled through a different system. But I don't mind if you can find a way to do these kind of cascaded progress meters in a way that is pleasing to the eye and not hard to handle in the callers of the code. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-04-10 22:57 ` Jeff King 2019-04-25 7:16 ` Junio C Hamano @ 2019-05-23 11:30 ` Jeff King 2019-05-23 12:53 ` Derrick Stolee ` (2 more replies) 1 sibling, 3 replies; 57+ messages in thread From: Jeff King @ 2019-05-23 11:30 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Junio C Hamano, Eric Wong, git On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote: > 2. The answer we get from a bitmap versus a regular traversal are not > apples-to-apples equivalent. The regular traversal walks down to > the UNINTERESTING commits, marks the boundaries trees and blobs as > UNINTERESTING, and then adds in all the interesting trees and blobs > minus the UNINTERESTING parts. So it can sometimes give the wrong > answer, claiming something is interesting when it is not. > > Whereas with bitmaps we fill in the trees and blobs as we walk, and > you get the true answer. But it means we may open up a lot more > trees than the equivalent traversal would. > > So one thing I've never really experimented with (and indeed, never > really thought about until writing this email) is that the bitmaps > could try to do that looser style of traversal, knowing that we > might err on the side of calling things interesting in a few cases. > But hopefully spending a lot less time opening trees. > > I'm not even 100% sure what that would look like in code, but just > thinking about it from a high level, I don't there's a particular > mathematical reason it couldn't work. I spent a while thinking and experimenting with this tonight. The result is the patch below. Ævar, do you still have a copy of the repo that misbehaved? I'd be curious to see how it fares. Finding the right trees to explore is a little tricky with bitmaps. In a normal traversal, we consider the "edges" to be worth exploring. I.e., the places where an UNINTERESTING commit is the parent of an interesting one. But with bitmaps, we don't have that information in the same way, because we're trying to avoid walking the commits in the first place. So e.g., in a history like this: A--B--C \ D Let's imagine we're computing "C..D", and that D has a bitmap on disk but C does not. When we visit D, we'll see that it has a bitmap, fill in the results in our cumulative "want" bitmap, and then stop walking its parents (because we know everything they could tell us already). Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal traversal, we can't stop walking even though there are only UNINTERESTING commits left, because we have to fill the complete bitmap, including A and B (and you can imagine there might be thousands of ancestors of A, too). The reason is that we skipped adding ancestors of D when we saw its bitmap, so no still_interesting() cutoff will work there. But how do we know that it's worth looking into the tree of "B"? If we assume we visit commits before their ancestors (because of the commit timestamp ordering), then we'd see that "B" is in the "want" bitmap already, which makes it worth visiting its tree. Unfortunately we'd then continue on to "A", and it would _also_ look interesting, because it's also in the "want" bitmap. We don't have an easy way of knowing that "A" is basically already covered by "B", and is therefore not worth pursuing. In this example, we could pass down a bit from B to A as we traverse. But you can imagine another interesting commit branched from A, which _would_ make A's tree worth considering. Fundamentally, as soon as we load a bitmap and stop traversing, we lose all information about the graph structure. So the patch below just looks at every tree that might be worth exploring (so both "A" and "B" here, but not "C"). That shouldn't be any _worse_ than what the current code does (because it looks at all the trees). It appears to behave correctly, at least so far as passing the test suite. Running p5310 and p5311 against git.git and linux.git, it seems to perform worse. I'm not quite sure why that is (i.e., if I screwed up something in the implementation, or if the algorithm is fundamentally worse). There are a lot of rough edges in the patch; I was just trying to see if the idea was even viable. So don't bother reviewing it as a real proposal for inclusion. If you do read it, I'd recommend just looking at the post-image, since it's essentially a rewrite and the diff is a bit messy. -Peff --- pack-bitmap.c | 398 ++++++++++++++++++++++++-------------------------- 1 file changed, 193 insertions(+), 205 deletions(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index 6069b2fe55..4a40f62a38 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -12,6 +12,8 @@ #include "packfile.h" #include "repository.h" #include "object-store.h" +#include "blob.h" +#include "prio-queue.h" /* * An entry on the bitmap index, representing the bitmap for a given @@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git, return bitmap_pos + bitmap_git->pack->num_objects; } -struct bitmap_show_data { - struct bitmap_index *bitmap_git; - struct bitmap *base; -}; - -static void show_object(struct object *object, const char *name, void *data_) -{ - struct bitmap_show_data *data = data_; - int bitmap_pos; - - bitmap_pos = bitmap_position(data->bitmap_git, &object->oid); - - if (bitmap_pos < 0) - bitmap_pos = ext_index_add_object(data->bitmap_git, object, - name); - - bitmap_set(data->base, bitmap_pos); -} - -static void show_commit(struct commit *commit, void *data) -{ -} - -static int add_to_include_set(struct bitmap_index *bitmap_git, - struct include_data *data, - const struct object_id *oid, - int bitmap_pos) -{ - khiter_t hash_pos; - - if (data->seen && bitmap_get(data->seen, bitmap_pos)) - return 0; - - if (bitmap_get(data->base, bitmap_pos)) - return 0; - - hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid); - if (hash_pos < kh_end(bitmap_git->bitmaps)) { - struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, hash_pos); - bitmap_or_ewah(data->base, lookup_stored_bitmap(st)); - return 0; - } - - bitmap_set(data->base, bitmap_pos); - return 1; -} - -static int should_include(struct commit *commit, void *_data) -{ - struct include_data *data = _data; - int bitmap_pos; - - bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid); - if (bitmap_pos < 0) - bitmap_pos = ext_index_add_object(data->bitmap_git, - (struct object *)commit, - NULL); - - if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid, - bitmap_pos)) { - struct commit_list *parent = commit->parents; - - while (parent) { - parent->item->object.flags |= SEEN; - parent = parent->next; - } - - return 0; - } - - return 1; -} - -static struct bitmap *find_objects(struct bitmap_index *bitmap_git, - struct rev_info *revs, - struct object_list *roots, - struct bitmap *seen) +static int add_commit_to_bitmap(struct bitmap_index *bitmap_git, + struct bitmap **base, struct commit *commit) { - struct bitmap *base = NULL; - int needs_walk = 0; - - struct object_list *not_mapped = NULL; - - /* - * Go through all the roots for the walk. The ones that have bitmaps - * on the bitmap index will be `or`ed together to form an initial - * global reachability analysis. - * - * The ones without bitmaps in the index will be stored in the - * `not_mapped_list` for further processing. - */ - while (roots) { - struct object *object = roots->item; - roots = roots->next; - - if (object->type == OBJ_COMMIT) { - khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, object->oid); - - if (pos < kh_end(bitmap_git->bitmaps)) { - struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos); - struct ewah_bitmap *or_with = lookup_stored_bitmap(st); - - if (base == NULL) - base = ewah_to_bitmap(or_with); - else - bitmap_or_ewah(base, or_with); - - object->flags |= SEEN; - continue; - } - } - - object_list_insert(object, ¬_mapped); - } - - /* - * Best case scenario: We found bitmaps for all the roots, - * so the resulting `or` bitmap has the full reachability analysis - */ - if (not_mapped == NULL) - return base; - - roots = not_mapped; - - /* - * Let's iterate through all the roots that don't have bitmaps to - * check if we can determine them to be reachable from the existing - * global bitmap. - * - * If we cannot find them in the existing global bitmap, we'll need - * to push them to an actual walk and run it until we can confirm - * they are reachable - */ - while (roots) { - struct object *object = roots->item; - int pos; - - roots = roots->next; - pos = bitmap_position(bitmap_git, &object->oid); - - if (pos < 0 || base == NULL || !bitmap_get(base, pos)) { - object->flags &= ~UNINTERESTING; - add_pending_object(revs, object, ""); - needs_walk = 1; - } else { - object->flags |= SEEN; - } - } - - if (needs_walk) { - struct include_data incdata; - struct bitmap_show_data show_data; - - if (base == NULL) - base = bitmap_new(); - - incdata.bitmap_git = bitmap_git; - incdata.base = base; - incdata.seen = seen; - - revs->include_check = should_include; - revs->include_check_data = &incdata; - - if (prepare_revision_walk(revs)) - die("revision walk setup failed"); + khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid); + if (pos < kh_end(bitmap_git->bitmaps)) { + struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos); + struct ewah_bitmap *or_with = lookup_stored_bitmap(st); - show_data.bitmap_git = bitmap_git; - show_data.base = base; + if (*base == NULL) + *base = ewah_to_bitmap(or_with); + else + bitmap_or_ewah(*base, or_with); - traverse_commit_list(revs, show_commit, show_object, - &show_data); + return 1; } - - return base; + return 0; } static void show_extended_objects(struct bitmap_index *bitmap_git, @@ -665,24 +509,122 @@ static void show_objects_for_type( } } -static int in_bitmapped_pack(struct bitmap_index *bitmap_git, - struct object_list *roots) +static void add_object_to_bitmap(struct bitmap_index *bitmap_git, + struct object *obj, + struct bitmap *bitmap, + struct bitmap *seen, + int ignore_missing); + +static void add_tag_to_bitmap(struct bitmap_index *bitmap_git, + struct tag *tag, + struct bitmap *bitmap, + struct bitmap *seen, + int ignore_missing) +{ + if (parse_tag(tag) < 0 || !tag->tagged) { + if (ignore_missing) + return; + die("unable to parse tag %s", oid_to_hex(&tag->object.oid)); + } + add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen, + ignore_missing); +} + +static void add_tree_to_bitmap(struct bitmap_index *bitmap_git, + struct tree *tree, + struct bitmap *bitmap, + struct bitmap *seen, + int ignore_missing) { - while (roots) { - struct object *object = roots->item; - roots = roots->next; + struct tree_desc desc; + struct name_entry entry; - if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0) - return 1; + if (parse_tree(tree) < 0) { + if (ignore_missing) + return; + die("unable to parse tree %s", oid_to_hex(&tree->object.oid)); } - return 0; + init_tree_desc(&desc, tree->buffer, tree->size); + while (tree_entry(&desc, &entry)) { + if (S_ISDIR(entry.mode)) { + struct tree *t = lookup_tree(the_repository, &entry.oid); + if (!t) { + die(_("entry '%s' in tree %s has tree mode, " + "but is not a tree"), + entry.path, oid_to_hex(&tree->object.oid)); + } + add_object_to_bitmap(bitmap_git, &t->object, + bitmap, seen, ignore_missing); + } else if (!S_ISGITLINK(entry.mode)) { + struct blob *b = lookup_blob(the_repository, &entry.oid); + if (!b) { + die(_("entry '%s' in tree %s has blob mode, " + "but is not a blob"), + entry.path, oid_to_hex(&tree->object.oid)); + } + add_object_to_bitmap(bitmap_git, &b->object, + bitmap, seen, ignore_missing); + } + } + + free_tree_buffer(tree); +} + +static void add_object_to_bitmap(struct bitmap_index *bitmap_git, + struct object *obj, + struct bitmap *bitmap, + struct bitmap *seen, + int ignore_missing) +{ + int pos = bitmap_position(bitmap_git, &obj->oid); + + if (pos < 0) + pos = ext_index_add_object(bitmap_git, obj, NULL); + + if (bitmap_get(bitmap, pos)) + return; /* already there */ + + if (seen && bitmap_get(seen, pos)) + return; /* seen in other bitmap; not worth digging further */ + + bitmap_set(bitmap, pos); + + if (obj->type == OBJ_BLOB) + return; + else if (obj->type == OBJ_TAG) + add_tag_to_bitmap(bitmap_git, (struct tag *)obj, + bitmap, seen, ignore_missing); + else if (obj->type == OBJ_TREE) + add_tree_to_bitmap(bitmap_git, (struct tree *)obj, + bitmap, seen, ignore_missing); + else + BUG("unexpected object type: %d", obj->type); +} + +static void add_objects_to_bitmap(struct bitmap_index *bitmap_git, + struct object_list **list, + struct bitmap *bitmap, + struct bitmap *seen, + int ignore_missing) +{ + while (*list) { + struct object_list *cur = *list; + + add_object_to_bitmap(bitmap_git, cur->item, + bitmap, seen, ignore_missing); + + *list = cur->next; + free(cur); + } } struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) { unsigned int i; + struct prio_queue commits = { compare_commits_by_commit_date }; + struct object_list *wants = NULL; struct object_list *haves = NULL; @@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) object = parse_object_or_die(&tag->tagged->oid, NULL); } - if (object->flags & UNINTERESTING) - object_list_insert(object, &haves); - else - object_list_insert(object, &wants); + if (object->type == OBJ_COMMIT) + prio_queue_put(&commits, object); + else { + if (object->flags & UNINTERESTING) + object_list_insert(object, &haves); + else + object_list_insert(object, &wants); + } } - /* - * if we have a HAVES list, but none of those haves is contained - * in the packfile that has a bitmap, we don't have anything to - * optimize here - */ - if (haves && !in_bitmapped_pack(bitmap_git, haves)) - goto cleanup; - - /* if we don't want anything, we're done here */ - if (!wants) - goto cleanup; - /* * now we're going to use bitmaps, so load the actual bitmap entries * from disk. this is the point of no return; after this the rev_list @@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) object_array_clear(&revs->pending); - if (haves) { - revs->ignore_missing_links = 1; - haves_bitmap = find_objects(bitmap_git, revs, haves, NULL); - reset_revision_walk(); - revs->ignore_missing_links = 0; + haves_bitmap = bitmap_new(); + wants_bitmap = bitmap_new(); - if (haves_bitmap == NULL) - BUG("failed to perform bitmap walk"); - } + /* + * First traverse the relevant commits as we would for a normal + * traversal. + */ + while (commits.nr) { + struct commit *commit = prio_queue_get(&commits); + struct bitmap **dst_bitmap; + int pos; + struct commit_list *parent; + + if (commit->object.flags & UNINTERESTING) + dst_bitmap = &haves_bitmap; + else + dst_bitmap = &wants_bitmap; - wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap); + pos = bitmap_position(bitmap_git, &commit->object.oid); + if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos)) + continue; /* already covered */ - if (!wants_bitmap) - BUG("failed to perform bitmap walk"); + if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit)) + continue; /* skip adding parents, they're covered */ + + + /* otherwise mark ourselves and queue dependent objects */ + if (pos < 0) + pos = ext_index_add_object(bitmap_git, &commit->object, NULL); + bitmap_set(*dst_bitmap, pos); + if (parse_commit(commit)) { + if (commit->object.flags & UNINTERESTING) + continue; + die("unable to parse commit %s", + oid_to_hex(&commit->object.oid)); + } + if (commit->object.flags & UNINTERESTING) { + /* + * If an uninteresting commit is in the "wants" bitmap, + * then our tree is worth exploring. This means we may + * miss some trees in the "haves" that are not + * ancestors of "wants" (or that are, but we missed + * because of out-of-order timestamps). + */ + if (wants_bitmap && bitmap_get(wants_bitmap, pos)) + object_list_insert(&get_commit_tree(commit)->object, + &haves); + } else { + /* + * "wants" must always be explored + */ + object_list_insert(&get_commit_tree(commit)->object, + &wants); + } + + for (parent = commit->parents; parent; parent = parent->next) { + if (commit->object.flags & UNINTERESTING) + parent->item->object.flags |= UNINTERESTING; + prio_queue_put(&commits, parent->item); + } + } + + /* + * At this point we've processed all of the commits, and queued items + * in "haves" and "wants" that need to be marked. + */ + add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1); + add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 0); if (haves_bitmap) bitmap_and_not(wants_bitmap, haves_bitmap); -- 2.22.0.rc1.549.gadb183c4cb ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-23 11:30 ` Jeff King @ 2019-05-23 12:53 ` Derrick Stolee 2019-05-24 7:24 ` Jeff King 2019-05-23 19:26 ` Ævar Arnfjörð Bjarmason 2019-05-24 11:31 ` [PATCH] pack-bitmap: look for an uninteresting bitmap Derrick Stolee 2 siblings, 1 reply; 57+ messages in thread From: Derrick Stolee @ 2019-05-23 12:53 UTC (permalink / raw) To: Jeff King, Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Eric Wong, git On 5/23/2019 7:30 AM, Jeff King wrote: > On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote: > >> 2. The answer we get from a bitmap versus a regular traversal are not >> apples-to-apples equivalent. The regular traversal walks down to >> the UNINTERESTING commits, marks the boundaries trees and blobs as >> UNINTERESTING, and then adds in all the interesting trees and blobs >> minus the UNINTERESTING parts. So it can sometimes give the wrong >> answer, claiming something is interesting when it is not. >> >> Whereas with bitmaps we fill in the trees and blobs as we walk, and >> you get the true answer. But it means we may open up a lot more >> trees than the equivalent traversal would. >> >> So one thing I've never really experimented with (and indeed, never >> really thought about until writing this email) is that the bitmaps >> could try to do that looser style of traversal, knowing that we >> might err on the side of calling things interesting in a few cases. >> But hopefully spending a lot less time opening trees. >> >> I'm not even 100% sure what that would look like in code, but just >> thinking about it from a high level, I don't there's a particular >> mathematical reason it couldn't work. > > I spent a while thinking and experimenting with this tonight. The result > is the patch below. Ævar, do you still have a copy of the repo that > misbehaved? I'd be curious to see how it fares. This patch caught my attention, and I think I understand some of the issues at hand. I'm not well-versed specifically in Git's bitmap implementation. The fundamental ideas are there, but my perspective is biased by my own independent bitmap implementation for Azure Repos. What worked there may not work at all here. > Finding the right trees to explore is a little tricky with bitmaps. In > a normal traversal, we consider the "edges" to be worth exploring. > I.e., the places where an UNINTERESTING commit is the parent of an > interesting one. This is the "commit frontier" which I bring up again below. > But with bitmaps, we don't have that information in the same way, > because we're trying to avoid walking the commits in the first place. So > e.g., in a history like this: > > A--B--C > \ > D > > Let's imagine we're computing "C..D", and that D has a bitmap on disk > but C does not. (As I read your discussion below, I'm confused. For "C..D", C is a have and D is a want. We should explore all the haves _first_, then walk the wants, right?) > When we visit D, we'll see that it has a bitmap, fill in > the results in our cumulative "want" bitmap, and then stop walking its > parents (because we know everything they could tell us already). I may be misreading something, but we would construct _a_ bitmap for C by walking its reachable objects until hitting a bitmap we know about. Perhaps A or B have a bitmap to start the construction. If we never construct a bitmap for C, then we don't know what to remove from the "D" bitmap to show the difference. If "C" is not even in the bitmap pack, then we don't use bitmaps for this calculation because of this condition: /* * if we have a HAVES list, but none of those haves is contained * in the packfile that has a bitmap, we don't have anything to * optimize here */ if (haves && !in_bitmapped_pack(bitmap_git, haves)) goto cleanup; > Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal > traversal, we can't stop walking even though there are only > UNINTERESTING commits left, because we have to fill the complete bitmap, > including A and B (and you can imagine there might be thousands of > ancestors of A, too). The reason is that we skipped adding ancestors of > D when we saw its bitmap, so no still_interesting() cutoff will work > there. > > But how do we know that it's worth looking into the tree of "B"? If we > assume we visit commits before their ancestors (because of the commit > timestamp ordering), then we'd see that "B" is in the "want" bitmap > already, which makes it worth visiting its tree. > > Unfortunately we'd then continue on to "A", and it would _also_ look > interesting, because it's also in the "want" bitmap. We don't have an > easy way of knowing that "A" is basically already covered by "B", and is > therefore not worth pursuing. In this example, we could pass down a bit > from B to A as we traverse. But you can imagine another interesting > commit branched from A, which _would_ make A's tree worth considering. > > Fundamentally, as soon as we load a bitmap and stop traversing, we lose > all information about the graph structure. > > So the patch below just looks at every tree that might be worth > exploring (so both "A" and "B" here, but not "C"). That shouldn't be any > _worse_ than what the current code does (because it looks at all the > trees). It appears to behave correctly, at least so far as passing the > test suite. Running p5310 and p5311 against git.git and linux.git, it > seems to perform worse. I'm not quite sure why that is (i.e., if I > screwed up something in the implementation, or if the algorithm is > fundamentally worse). I'm of the opinion that the old method was better. It followed a very clear three-step process: 1. Compute the "haves" bitmap. 2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap. 3. Subtract the "haves" bitmap from the "wants" (in case we added bits to the wants before they were in the haves). But there is hope in your idea to improve some cases. As noted, we give up if all of the haves are not in the bitmapped pack. By starting with a commit walk, we can find the "commit frontier," which is the set of commits that are explored by paint_down_to_common(). In this case, the set {B, C, D}. After walking commits and finding a set of UNINTERESTING commits, we could look for any UNINTERESTING commits that have bitmaps (or simply are in the bitmapped pack) and use those to see our "haves" bitmap. So, if B has a bitmap, but C is too new for the bitmap, then we can still create the haves based on B and then take a bitmap diff "D minus B". In fact, using the "merge base set" for the diff reflects what the non-bitmap algorithm does in this case. It ignores C and only explores B. I learned a lot looking at both versions of this method side-by-side. Thanks! -Stolee ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-23 12:53 ` Derrick Stolee @ 2019-05-24 7:24 ` Jeff King 2019-05-24 10:33 ` Derrick Stolee 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-05-24 7:24 UTC (permalink / raw) To: Derrick Stolee Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Eric Wong, git On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote: > > I spent a while thinking and experimenting with this tonight. The result > > is the patch below. Ævar, do you still have a copy of the repo that > > misbehaved? I'd be curious to see how it fares. > > This patch caught my attention, and I think I understand some of the issues > at hand. I'm not well-versed specifically in Git's bitmap implementation. > The fundamental ideas are there, but my perspective is biased by my own > independent bitmap implementation for Azure Repos. What worked there may not > work at all here. Thanks for looking at this. There are a lot of short-comings in the current bitmap implementation, so if there's a better way to do things, I'm not opposed to moving to a new format. :) > > Finding the right trees to explore is a little tricky with bitmaps. In > > a normal traversal, we consider the "edges" to be worth exploring. > > I.e., the places where an UNINTERESTING commit is the parent of an > > interesting one. > > This is the "commit frontier" which I bring up again below. Right. I actually had trouble coming up with a succinct way of describing this, and stole the definition from your recent blog post. ;) > > But with bitmaps, we don't have that information in the same way, > > because we're trying to avoid walking the commits in the first place. So > > e.g., in a history like this: > > > > A--B--C > > \ > > D > > > > Let's imagine we're computing "C..D", and that D has a bitmap on disk > > but C does not. > > (As I read your discussion below, I'm confused. For "C..D", C is a have > and D is a want. We should explore all the haves _first_, then walk the > wants, right?) I think I may have confused things by starting my description with a hypothetical combined want/have walk. To take a step back: the problem we were discussing is that we spend a lot of time reading trees to fill in the "have" bitmap, even though most of those objects are unlikely to be in the "want" in the first place (only the frontier trees are really useful). The current code does indeed fill the "have" bitmap first (so that while walking "want", it can avoid walking down paths whose bits we know we're going to clear eventually anyway). But we can't know the frontier if we completely fill the "have" bitmap first. We have to walk both sides together, looking for the frontier. > > When we visit D, we'll see that it has a bitmap, fill in > > the results in our cumulative "want" bitmap, and then stop walking its > > parents (because we know everything they could tell us already). > > I may be misreading something, but we would construct _a_ bitmap for C > by walking its reachable objects until hitting a bitmap we know about. > Perhaps A or B have a bitmap to start the construction. If we never > construct a bitmap for C, then we don't know what to remove from the "D" > bitmap to show the difference. Yes, and that's how the current code works. If we walk back to B and it has a bitmap, we can stop walking there and just OR in its bitmap, and look at the trees for any intermediate commits (in this case just C) to fill in the rest. The problem is that we don't have a bitmap for every commit. So you can imagine a history like this: A_1 -- A_2 -- ... -- A_1000 -- B -- C \ D where we have a bitmap _only_ for D. Filling in the accurate and true bitmap for C requires walking a thousand commits (and their trees!), when the non-bitmap algorithm would find the frontier at B and call that good enough. > If "C" is not even in the bitmap pack, then we don't use bitmaps for > this calculation because of this condition: > > /* > * if we have a HAVES list, but none of those haves is contained > * in the packfile that has a bitmap, we don't have anything to > * optimize here > */ > if (haves && !in_bitmapped_pack(bitmap_git, haves)) > goto cleanup; Right, but it may be in the pack but without a bitmap. We don't store a bitmap for every commit. The idea was to save space, but select some key commits that let us generally find a bitmap with a small amount of walking. And most of the time it works fine. But in some cases, we seem to find pathological cases where we do quite a lot of walking. As I said earlier in the thread, I suspect our commit selection is not great. It's mostly some heuristics we threw together in 2013, and I don't think it was tested very thoroughly. So fixing that may be a simpler approach. What I was wondering here was whether we could get an easy fix based on the same frontier ideas that the non-bitmap walk uses. > I'm of the opinion that the old method was better. It followed a very clear > three-step process: > > 1. Compute the "haves" bitmap. > > 2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap. > > 3. Subtract the "haves" bitmap from the "wants" (in case we added bits to > the wants before they were in the haves). Right, this is _way_ simpler. But it necessarily means spending effort to find the complete "haves", because we don't know which parts are useful. > But there is hope in your idea to improve some cases. As noted, we give up > if all of the haves are not in the bitmapped pack. By starting with a > commit walk, we can find the "commit frontier," which is the set of commits > that are explored by paint_down_to_common(). In this case, the set {B, C, D}. > > After walking commits and finding a set of UNINTERESTING commits, we could > look for any UNINTERESTING commits that have bitmaps (or simply are in the > bitmapped pack) and use those to see our "haves" bitmap. So, if B has a > bitmap, but C is too new for the bitmap, then we can still create the haves > based on B and then take a bitmap diff "D minus B". But doing that commit walk to find the frontier negates part of the purpose of using the bitmaps, which is avoiding even walking commits. Going back to a variant of our example: A -- B -- C_1 -- .. -- C_1000 \ D_1 -- .. -- D_1000 If we have a bitmap at C_1000 and D_1000, we don't have to walk any commits at all. But finding the frontier requires walking 2000 commits. Is there a way to find it with just bitmaps? You can certainly find the intersection, but you don't have any idea which of the many shared commits is the merge base. Of course in this example you don't actually care about the frontier (you have the full answer immediately). But how do you decide which way to optimize: try to avoid walking commits to get a quick answer from bitmaps, or prefer to walk some commits to find the frontier and avoid opening too many trees? -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 7:24 ` Jeff King @ 2019-05-24 10:33 ` Derrick Stolee 0 siblings, 0 replies; 57+ messages in thread From: Derrick Stolee @ 2019-05-24 10:33 UTC (permalink / raw) To: Jeff King Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Eric Wong, git On 5/24/2019 3:24 AM, Jeff King wrote: > On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote: > >>> I spent a while thinking and experimenting with this tonight. The result >>> is the patch below. Ævar, do you still have a copy of the repo that >>> misbehaved? I'd be curious to see how it fares. >> >> This patch caught my attention, and I think I understand some of the issues >> at hand. I'm not well-versed specifically in Git's bitmap implementation. >> The fundamental ideas are there, but my perspective is biased by my own >> independent bitmap implementation for Azure Repos. What worked there may not >> work at all here. > > Thanks for looking at this. There are a lot of short-comings in the > current bitmap implementation, so if there's a better way to do things, > I'm not opposed to moving to a new format. :) > >>> Finding the right trees to explore is a little tricky with bitmaps. In >>> a normal traversal, we consider the "edges" to be worth exploring. >>> I.e., the places where an UNINTERESTING commit is the parent of an >>> interesting one. >> >> This is the "commit frontier" which I bring up again below. > > Right. I actually had trouble coming up with a succinct way of > describing this, and stole the definition from your recent blog post. ;) > >>> But with bitmaps, we don't have that information in the same way, >>> because we're trying to avoid walking the commits in the first place. So >>> e.g., in a history like this: >>> >>> A--B--C >>> \ >>> D >>> >>> Let's imagine we're computing "C..D", and that D has a bitmap on disk >>> but C does not. >> >> (As I read your discussion below, I'm confused. For "C..D", C is a have >> and D is a want. We should explore all the haves _first_, then walk the >> wants, right?) > > I think I may have confused things by starting my description with a > hypothetical combined want/have walk. To take a step back: the problem > we were discussing is that we spend a lot of time reading trees to fill > in the "have" bitmap, even though most of those objects are unlikely to > be in the "want" in the first place (only the frontier trees are really > useful). Thank you for resolving my confusion. [snip] > As I said earlier in the thread, I suspect our commit selection is not > great. It's mostly some heuristics we threw together in 2013, and I > don't think it was tested very thoroughly. So fixing that may be a > simpler approach. It's a hard problem! There are no _sure_ answers here, and what works in some cases will probably not work in other cases. > What I was wondering here was whether we could get an easy fix based on > the same frontier ideas that the non-bitmap walk uses. [snip] > But doing that commit walk to find the frontier negates part of the > purpose of using the bitmaps, which is avoiding even walking commits. > Going back to a variant of our example: > > A -- B -- C_1 -- .. -- C_1000 > \ > D_1 -- .. -- D_1000 > > If we have a bitmap at C_1000 and D_1000, we don't have to walk any > commits at all. But finding the frontier requires walking 2000 commits. In my opinion, walking commits is easy (easier with the commit-graph) and walking trees is hard. We probably _should_ do more commit walking if it helps avoid walking some trees. > Is there a way to find it with just bitmaps? You can certainly find the > intersection, but you don't have any idea which of the many shared > commits is the merge base. Of course in this example you don't actually > care about the frontier (you have the full answer immediately). But how > do you decide which way to optimize: try to avoid walking commits to > get a quick answer from bitmaps, or prefer to walk some commits to find > the frontier and avoid opening too many trees? With a new perspective on the problem, I think perhaps trying to solve this problem with bitmaps is incorrect. If you want to use bitmaps for C..D and you don't have any bitmaps in the range D..C, then maybe we should just use the old-fashioned method of walking trees? In your examples above, the new method would walk trees for the commits in {B, C_i, D_j} while the bitmap method would walk trees for the commits in {B, C_i, A_k}. I would expect the list of {A_k} commits to be the largest in most cases. The approach here would be to do the commit frontier walk, and check for commits with bitmaps along the way. If we don't find an UNINTERESTING bitmap, then use the non-bitmap way instead. Perhaps worth a shot. I'll bring up this code snippet again: /* * if we have a HAVES list, but none of those haves is contained * in the packfile that has a bitmap, we don't have anything to * optimize here */ if (haves && !in_bitmapped_pack(bitmap_git, haves)) goto cleanup; In addition to this, we can fill the "haves" set with the commits in D..C (with boundary) and then check if any of those commits have a precomputed bitmap. If not, goto cleanup. Thanks, -Stolee ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-23 11:30 ` Jeff King 2019-05-23 12:53 ` Derrick Stolee @ 2019-05-23 19:26 ` Ævar Arnfjörð Bjarmason 2019-05-24 7:27 ` Jeff King 2019-05-24 11:31 ` [PATCH] pack-bitmap: look for an uninteresting bitmap Derrick Stolee 2 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-23 19:26 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Eric Wong, git On Thu, May 23 2019, Jeff King wrote: > On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote: > >> 2. The answer we get from a bitmap versus a regular traversal are not >> apples-to-apples equivalent. The regular traversal walks down to >> the UNINTERESTING commits, marks the boundaries trees and blobs as >> UNINTERESTING, and then adds in all the interesting trees and blobs >> minus the UNINTERESTING parts. So it can sometimes give the wrong >> answer, claiming something is interesting when it is not. >> >> Whereas with bitmaps we fill in the trees and blobs as we walk, and >> you get the true answer. But it means we may open up a lot more >> trees than the equivalent traversal would. >> >> So one thing I've never really experimented with (and indeed, never >> really thought about until writing this email) is that the bitmaps >> could try to do that looser style of traversal, knowing that we >> might err on the side of calling things interesting in a few cases. >> But hopefully spending a lot less time opening trees. >> >> I'm not even 100% sure what that would look like in code, but just >> thinking about it from a high level, I don't there's a particular >> mathematical reason it couldn't work. > > I spent a while thinking and experimenting with this tonight. The result > is the patch below. Ævar, do you still have a copy of the repo that > misbehaved? I'd be curious to see how it fares. No, sorry. I think we could make an artificial test to emulate it, which would be something like: * ~1 million commits * local clone setup to fetch all branches/tags (default 'git clone') * local a bit ahead/behind * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that * push try to push the local change upstream (or to a topic branch) I tried briefly to emulate this with git.git with: ( rm -rf /tmp/git /tmp/git.old && git init /tmp/git.old && git clone --bare https://github.com/git/git.git /tmp/git && cd /tmp/git && old_commit=$(git rev-parse v2.20.0^{}) && git rev-list v2.12.0..v2.21.0|parallel 'git branch topic-{} {}' && cd /tmp/git.old && echo /tmp/git/objects >.git/objects/info/alternates && git branch master $old_commit && git reset --hard master && git repack -Adb && rm .git/objects/info/alternates && for c in {1..10} do >$c && git add $c && git commit -m"bump $c" done ) But didn't get anywhere, probably because here my topics are all stuff I have already, and I just have 2x packs. It would be really cool to have some test tool that could produce various "shape of repo" states like that. I.e. given an end-state of a public repo emulate various plausible local client states like that given some assumptions about how often the local client fetches, what the GC settings are etc. > Finding the right trees to explore is a little tricky with bitmaps. In > a normal traversal, we consider the "edges" to be worth exploring. > I.e., the places where an UNINTERESTING commit is the parent of an > interesting one. > > But with bitmaps, we don't have that information in the same way, > because we're trying to avoid walking the commits in the first place. So > e.g., in a history like this: > > A--B--C > \ > D > > Let's imagine we're computing "C..D", and that D has a bitmap on disk > but C does not. When we visit D, we'll see that it has a bitmap, fill in > the results in our cumulative "want" bitmap, and then stop walking its > parents (because we know everything they could tell us already). > > Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal > traversal, we can't stop walking even though there are only > UNINTERESTING commits left, because we have to fill the complete bitmap, > including A and B (and you can imagine there might be thousands of > ancestors of A, too). The reason is that we skipped adding ancestors of > D when we saw its bitmap, so no still_interesting() cutoff will work > there. > > But how do we know that it's worth looking into the tree of "B"? If we > assume we visit commits before their ancestors (because of the commit > timestamp ordering), then we'd see that "B" is in the "want" bitmap > already, which makes it worth visiting its tree. > > Unfortunately we'd then continue on to "A", and it would _also_ look > interesting, because it's also in the "want" bitmap. We don't have an > easy way of knowing that "A" is basically already covered by "B", and is > therefore not worth pursuing. In this example, we could pass down a bit > from B to A as we traverse. But you can imagine another interesting > commit branched from A, which _would_ make A's tree worth considering. > > Fundamentally, as soon as we load a bitmap and stop traversing, we lose > all information about the graph structure. > > So the patch below just looks at every tree that might be worth > exploring (so both "A" and "B" here, but not "C"). That shouldn't be any > _worse_ than what the current code does (because it looks at all the > trees). It appears to behave correctly, at least so far as passing the > test suite. Running p5310 and p5311 against git.git and linux.git, it > seems to perform worse. I'm not quite sure why that is (i.e., if I > screwed up something in the implementation, or if the algorithm is > fundamentally worse). > > There are a lot of rough edges in the patch; I was just trying to see if > the idea was even viable. So don't bother reviewing it as a real > proposal for inclusion. If you do read it, I'd recommend just looking at > the post-image, since it's essentially a rewrite and the diff is a bit > messy. > > -Peff > > --- > pack-bitmap.c | 398 ++++++++++++++++++++++++-------------------------- > 1 file changed, 193 insertions(+), 205 deletions(-) > > diff --git a/pack-bitmap.c b/pack-bitmap.c > index 6069b2fe55..4a40f62a38 100644 > --- a/pack-bitmap.c > +++ b/pack-bitmap.c > @@ -12,6 +12,8 @@ > #include "packfile.h" > #include "repository.h" > #include "object-store.h" > +#include "blob.h" > +#include "prio-queue.h" > > /* > * An entry on the bitmap index, representing the bitmap for a given > @@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git, > return bitmap_pos + bitmap_git->pack->num_objects; > } > > -struct bitmap_show_data { > - struct bitmap_index *bitmap_git; > - struct bitmap *base; > -}; > - > -static void show_object(struct object *object, const char *name, void *data_) > -{ > - struct bitmap_show_data *data = data_; > - int bitmap_pos; > - > - bitmap_pos = bitmap_position(data->bitmap_git, &object->oid); > - > - if (bitmap_pos < 0) > - bitmap_pos = ext_index_add_object(data->bitmap_git, object, > - name); > - > - bitmap_set(data->base, bitmap_pos); > -} > - > -static void show_commit(struct commit *commit, void *data) > -{ > -} > - > -static int add_to_include_set(struct bitmap_index *bitmap_git, > - struct include_data *data, > - const struct object_id *oid, > - int bitmap_pos) > -{ > - khiter_t hash_pos; > - > - if (data->seen && bitmap_get(data->seen, bitmap_pos)) > - return 0; > - > - if (bitmap_get(data->base, bitmap_pos)) > - return 0; > - > - hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid); > - if (hash_pos < kh_end(bitmap_git->bitmaps)) { > - struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, hash_pos); > - bitmap_or_ewah(data->base, lookup_stored_bitmap(st)); > - return 0; > - } > - > - bitmap_set(data->base, bitmap_pos); > - return 1; > -} > - > -static int should_include(struct commit *commit, void *_data) > -{ > - struct include_data *data = _data; > - int bitmap_pos; > - > - bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid); > - if (bitmap_pos < 0) > - bitmap_pos = ext_index_add_object(data->bitmap_git, > - (struct object *)commit, > - NULL); > - > - if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid, > - bitmap_pos)) { > - struct commit_list *parent = commit->parents; > - > - while (parent) { > - parent->item->object.flags |= SEEN; > - parent = parent->next; > - } > - > - return 0; > - } > - > - return 1; > -} > - > -static struct bitmap *find_objects(struct bitmap_index *bitmap_git, > - struct rev_info *revs, > - struct object_list *roots, > - struct bitmap *seen) > +static int add_commit_to_bitmap(struct bitmap_index *bitmap_git, > + struct bitmap **base, struct commit *commit) > { > - struct bitmap *base = NULL; > - int needs_walk = 0; > - > - struct object_list *not_mapped = NULL; > - > - /* > - * Go through all the roots for the walk. The ones that have bitmaps > - * on the bitmap index will be `or`ed together to form an initial > - * global reachability analysis. > - * > - * The ones without bitmaps in the index will be stored in the > - * `not_mapped_list` for further processing. > - */ > - while (roots) { > - struct object *object = roots->item; > - roots = roots->next; > - > - if (object->type == OBJ_COMMIT) { > - khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, object->oid); > - > - if (pos < kh_end(bitmap_git->bitmaps)) { > - struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos); > - struct ewah_bitmap *or_with = lookup_stored_bitmap(st); > - > - if (base == NULL) > - base = ewah_to_bitmap(or_with); > - else > - bitmap_or_ewah(base, or_with); > - > - object->flags |= SEEN; > - continue; > - } > - } > - > - object_list_insert(object, ¬_mapped); > - } > - > - /* > - * Best case scenario: We found bitmaps for all the roots, > - * so the resulting `or` bitmap has the full reachability analysis > - */ > - if (not_mapped == NULL) > - return base; > - > - roots = not_mapped; > - > - /* > - * Let's iterate through all the roots that don't have bitmaps to > - * check if we can determine them to be reachable from the existing > - * global bitmap. > - * > - * If we cannot find them in the existing global bitmap, we'll need > - * to push them to an actual walk and run it until we can confirm > - * they are reachable > - */ > - while (roots) { > - struct object *object = roots->item; > - int pos; > - > - roots = roots->next; > - pos = bitmap_position(bitmap_git, &object->oid); > - > - if (pos < 0 || base == NULL || !bitmap_get(base, pos)) { > - object->flags &= ~UNINTERESTING; > - add_pending_object(revs, object, ""); > - needs_walk = 1; > - } else { > - object->flags |= SEEN; > - } > - } > - > - if (needs_walk) { > - struct include_data incdata; > - struct bitmap_show_data show_data; > - > - if (base == NULL) > - base = bitmap_new(); > - > - incdata.bitmap_git = bitmap_git; > - incdata.base = base; > - incdata.seen = seen; > - > - revs->include_check = should_include; > - revs->include_check_data = &incdata; > - > - if (prepare_revision_walk(revs)) > - die("revision walk setup failed"); > + khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid); > + if (pos < kh_end(bitmap_git->bitmaps)) { > + struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos); > + struct ewah_bitmap *or_with = lookup_stored_bitmap(st); > > - show_data.bitmap_git = bitmap_git; > - show_data.base = base; > + if (*base == NULL) > + *base = ewah_to_bitmap(or_with); > + else > + bitmap_or_ewah(*base, or_with); > > - traverse_commit_list(revs, show_commit, show_object, > - &show_data); > + return 1; > } > - > - return base; > + return 0; > } > > static void show_extended_objects(struct bitmap_index *bitmap_git, > @@ -665,24 +509,122 @@ static void show_objects_for_type( > } > } > > -static int in_bitmapped_pack(struct bitmap_index *bitmap_git, > - struct object_list *roots) > +static void add_object_to_bitmap(struct bitmap_index *bitmap_git, > + struct object *obj, > + struct bitmap *bitmap, > + struct bitmap *seen, > + int ignore_missing); > + > +static void add_tag_to_bitmap(struct bitmap_index *bitmap_git, > + struct tag *tag, > + struct bitmap *bitmap, > + struct bitmap *seen, > + int ignore_missing) > +{ > + if (parse_tag(tag) < 0 || !tag->tagged) { > + if (ignore_missing) > + return; > + die("unable to parse tag %s", oid_to_hex(&tag->object.oid)); > + } > + add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen, > + ignore_missing); > +} > + > +static void add_tree_to_bitmap(struct bitmap_index *bitmap_git, > + struct tree *tree, > + struct bitmap *bitmap, > + struct bitmap *seen, > + int ignore_missing) > { > - while (roots) { > - struct object *object = roots->item; > - roots = roots->next; > + struct tree_desc desc; > + struct name_entry entry; > > - if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0) > - return 1; > + if (parse_tree(tree) < 0) { > + if (ignore_missing) > + return; > + die("unable to parse tree %s", oid_to_hex(&tree->object.oid)); > } > > - return 0; > + init_tree_desc(&desc, tree->buffer, tree->size); > + while (tree_entry(&desc, &entry)) { > + if (S_ISDIR(entry.mode)) { > + struct tree *t = lookup_tree(the_repository, &entry.oid); > + if (!t) { > + die(_("entry '%s' in tree %s has tree mode, " > + "but is not a tree"), > + entry.path, oid_to_hex(&tree->object.oid)); > + } > + add_object_to_bitmap(bitmap_git, &t->object, > + bitmap, seen, ignore_missing); > + } else if (!S_ISGITLINK(entry.mode)) { > + struct blob *b = lookup_blob(the_repository, &entry.oid); > + if (!b) { > + die(_("entry '%s' in tree %s has blob mode, " > + "but is not a blob"), > + entry.path, oid_to_hex(&tree->object.oid)); > + } > + add_object_to_bitmap(bitmap_git, &b->object, > + bitmap, seen, ignore_missing); > + } > + } > + > + free_tree_buffer(tree); > +} > + > +static void add_object_to_bitmap(struct bitmap_index *bitmap_git, > + struct object *obj, > + struct bitmap *bitmap, > + struct bitmap *seen, > + int ignore_missing) > +{ > + int pos = bitmap_position(bitmap_git, &obj->oid); > + > + if (pos < 0) > + pos = ext_index_add_object(bitmap_git, obj, NULL); > + > + if (bitmap_get(bitmap, pos)) > + return; /* already there */ > + > + if (seen && bitmap_get(seen, pos)) > + return; /* seen in other bitmap; not worth digging further */ > + > + bitmap_set(bitmap, pos); > + > + if (obj->type == OBJ_BLOB) > + return; > + else if (obj->type == OBJ_TAG) > + add_tag_to_bitmap(bitmap_git, (struct tag *)obj, > + bitmap, seen, ignore_missing); > + else if (obj->type == OBJ_TREE) > + add_tree_to_bitmap(bitmap_git, (struct tree *)obj, > + bitmap, seen, ignore_missing); > + else > + BUG("unexpected object type: %d", obj->type); > +} > + > +static void add_objects_to_bitmap(struct bitmap_index *bitmap_git, > + struct object_list **list, > + struct bitmap *bitmap, > + struct bitmap *seen, > + int ignore_missing) > +{ > + while (*list) { > + struct object_list *cur = *list; > + > + add_object_to_bitmap(bitmap_git, cur->item, > + bitmap, seen, ignore_missing); > + > + *list = cur->next; > + free(cur); > + } > } > > struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) > { > unsigned int i; > > + struct prio_queue commits = { compare_commits_by_commit_date }; > + > struct object_list *wants = NULL; > struct object_list *haves = NULL; > > @@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) > object = parse_object_or_die(&tag->tagged->oid, NULL); > } > > - if (object->flags & UNINTERESTING) > - object_list_insert(object, &haves); > - else > - object_list_insert(object, &wants); > + if (object->type == OBJ_COMMIT) > + prio_queue_put(&commits, object); > + else { > + if (object->flags & UNINTERESTING) > + object_list_insert(object, &haves); > + else > + object_list_insert(object, &wants); > + } > } > > - /* > - * if we have a HAVES list, but none of those haves is contained > - * in the packfile that has a bitmap, we don't have anything to > - * optimize here > - */ > - if (haves && !in_bitmapped_pack(bitmap_git, haves)) > - goto cleanup; > - > - /* if we don't want anything, we're done here */ > - if (!wants) > - goto cleanup; > - > /* > * now we're going to use bitmaps, so load the actual bitmap entries > * from disk. this is the point of no return; after this the rev_list > @@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) > > object_array_clear(&revs->pending); > > - if (haves) { > - revs->ignore_missing_links = 1; > - haves_bitmap = find_objects(bitmap_git, revs, haves, NULL); > - reset_revision_walk(); > - revs->ignore_missing_links = 0; > + haves_bitmap = bitmap_new(); > + wants_bitmap = bitmap_new(); > > - if (haves_bitmap == NULL) > - BUG("failed to perform bitmap walk"); > - } > + /* > + * First traverse the relevant commits as we would for a normal > + * traversal. > + */ > + while (commits.nr) { > + struct commit *commit = prio_queue_get(&commits); > + struct bitmap **dst_bitmap; > + int pos; > + struct commit_list *parent; > + > + if (commit->object.flags & UNINTERESTING) > + dst_bitmap = &haves_bitmap; > + else > + dst_bitmap = &wants_bitmap; > > - wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap); > + pos = bitmap_position(bitmap_git, &commit->object.oid); > + if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos)) > + continue; /* already covered */ > > - if (!wants_bitmap) > - BUG("failed to perform bitmap walk"); > + if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit)) > + continue; /* skip adding parents, they're covered */ > + > + > + /* otherwise mark ourselves and queue dependent objects */ > + if (pos < 0) > + pos = ext_index_add_object(bitmap_git, &commit->object, NULL); > + bitmap_set(*dst_bitmap, pos); > + if (parse_commit(commit)) { > + if (commit->object.flags & UNINTERESTING) > + continue; > + die("unable to parse commit %s", > + oid_to_hex(&commit->object.oid)); > + } > + if (commit->object.flags & UNINTERESTING) { > + /* > + * If an uninteresting commit is in the "wants" bitmap, > + * then our tree is worth exploring. This means we may > + * miss some trees in the "haves" that are not > + * ancestors of "wants" (or that are, but we missed > + * because of out-of-order timestamps). > + */ > + if (wants_bitmap && bitmap_get(wants_bitmap, pos)) > + object_list_insert(&get_commit_tree(commit)->object, > + &haves); > + } else { > + /* > + * "wants" must always be explored > + */ > + object_list_insert(&get_commit_tree(commit)->object, > + &wants); > + } > + > + for (parent = commit->parents; parent; parent = parent->next) { > + if (commit->object.flags & UNINTERESTING) > + parent->item->object.flags |= UNINTERESTING; > + prio_queue_put(&commits, parent->item); > + } > + } > + > + /* > + * At this point we've processed all of the commits, and queued items > + * in "haves" and "wants" that need to be marked. > + */ > + add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1); > + add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 0); > > if (haves_bitmap) > bitmap_and_not(wants_bitmap, haves_bitmap); ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-23 19:26 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 7:27 ` Jeff King 2019-05-24 7:55 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-05-24 7:27 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Junio C Hamano, Eric Wong, git On Thu, May 23, 2019 at 09:26:34PM +0200, Ævar Arnfjörð Bjarmason wrote: > > I spent a while thinking and experimenting with this tonight. The result > > is the patch below. Ævar, do you still have a copy of the repo that > > misbehaved? I'd be curious to see how it fares. > > No, sorry. I think we could make an artificial test to emulate it, which > would be something like: > > * ~1 million commits > * local clone setup to fetch all branches/tags (default 'git clone') > * local a bit ahead/behind > * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that > * push try to push the local change upstream (or to a topic branch) > > I tried briefly to emulate this with git.git with: > [...] > But didn't get anywhere, probably because here my topics are all stuff I > have already, and I just have 2x packs. Yeah, I haven't been able to find a reproduction for this problem at will. The bitmaps are _supposed_ to be sprinkled around through the commit graph so that we don't have to walk far. But presumably in your case that was not so. I'm not sure what tickles the bitmap-writer to fail so hard. Is it having too many refs? Weird patterns in the graph? Just a ton of commits? -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 7:27 ` Jeff King @ 2019-05-24 7:55 ` Ævar Arnfjörð Bjarmason 2019-05-24 8:26 ` Jeff King 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-24 7:55 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Eric Wong, git On Fri, May 24 2019, Jeff King wrote: > On Thu, May 23, 2019 at 09:26:34PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > I spent a while thinking and experimenting with this tonight. The result >> > is the patch below. Ævar, do you still have a copy of the repo that >> > misbehaved? I'd be curious to see how it fares. >> >> No, sorry. I think we could make an artificial test to emulate it, which >> would be something like: >> >> * ~1 million commits >> * local clone setup to fetch all branches/tags (default 'git clone') >> * local a bit ahead/behind >> * Have old "main" pack with *.bitmap, bunch of other packs/loose objects without that >> * push try to push the local change upstream (or to a topic branch) >> >> I tried briefly to emulate this with git.git with: >> [...] >> But didn't get anywhere, probably because here my topics are all stuff I >> have already, and I just have 2x packs. > > Yeah, I haven't been able to find a reproduction for this problem at > will. The bitmaps are _supposed_ to be sprinkled around through the > commit graph so that we don't have to walk far. But presumably in your > case that was not so. > > I'm not sure what tickles the bitmap-writer to fail so hard. Is it > having too many refs? Weird patterns in the graph? Just a ton of > commits? Ah, why did only this ancient (big) pack have a bitmap? The bitmap writer had never failed, this was just a repository where some automation (on a dev/staging box) cloned a repo, and someone had once run a manual "repack" to make make a pack with a bitmap. Then as time passed that pack stayed around, and re-looking at this that could have only happened because I had gc.bigPackThreshold turned on. I.e. without that we'd have eventually done a full repack, so the bitmap would have gone away. So getting the repo into that state was a series of unlikely events. I think to the extent that this is an issue we can reproduce in the future the proper fix for it in lieu of some easy fix in the bitmap code would be to just teach "gc" to unlink old *.bitmap files if we detect they're too stale. I.e. we don't need to deal gracefully with some case where the bitmaps just cover some tiny part of the graph, we can just teach "gc" to either update them, or (if we're not currently writing them) unlink them. That seems to me to be a good idea in general, not just with bitmaps but also the commit graph. If we're doing a GC and our current settings aren't such that we'd update those files, shouldn't we just unlink them? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 7:55 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 8:26 ` Jeff King 2019-05-24 9:01 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-05-24 8:26 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24, 2019 at 09:55:05AM +0200, Ævar Arnfjörð Bjarmason wrote: > > I'm not sure what tickles the bitmap-writer to fail so hard. Is it > > having too many refs? Weird patterns in the graph? Just a ton of > > commits? > > Ah, why did only this ancient (big) pack have a bitmap? > > The bitmap writer had never failed, this was just a repository where > some automation (on a dev/staging box) cloned a repo, and someone had > once run a manual "repack" to make make a pack with a bitmap. Just to be clear, by "fail" I didn't mean that the writer failed to produce. I just meant it had poor commit selection for its bitmap coverage. But... > Then as time passed that pack stayed around, and re-looking at this that > could have only happened because I had gc.bigPackThreshold turned on. > > I.e. without that we'd have eventually done a full repack, so the bitmap > would have gone away. > > So getting the repo into that state was a series of unlikely events. Ah, now that's interesting. The issue may have just been that we had a ton of un-bitmapped commits because they weren't in the bitmapped pack at all. The logic that Stolee pointed out earlier: /* * if we have a HAVES list, but none of those haves is contained * in the packfile that has a bitmap, we don't have anything to * optimize here */ if (haves && !in_bitmapped_pack(bitmap_git, haves)) goto cleanup; is pretty feeble. If you have even _one_ UNINTERESTING tip that's in the pack, we'll try to use bitmaps. And in your case, you almost certainly had old tags on both the client and the server there were in the old bitmapped pack, but then a huge swath of history that had happened since then. And it was that newer part of the graph that we had to walk to fill in the bitmap. So all of this makes pretty good sense to me now. Bitmaps work incrementally as you add new, un-bitmapped history. But the cost gets higher and higher the longer you go before repacking and generating new bitmaps. So your case was very similar to a repo that uses bitmaps but just hadn't packed in a really long time. > I think to the extent that this is an issue we can reproduce in the > future the proper fix for it in lieu of some easy fix in the bitmap code > would be to just teach "gc" to unlink old *.bitmap files if we detect > they're too stale. Yeah. This could happen if you simply accumulated history without ever running "gc". But in general we can probably assume that "gc" will run periodically (though there is a real blind spot if you push up a very huge chunk of history in one pack, since gc counts packs, not objects). I agree that if gc is deciding to leave a big pack, it should probably ditch the bitmaps for it. > I.e. we don't need to deal gracefully with some case where the bitmaps > just cover some tiny part of the graph, we can just teach "gc" to either > update them, or (if we're not currently writing them) unlink them. Unfortunately I don't think we can update them, because all of the reachable objects need to be in the same pack. So any scheme that isn't doing a periodic all-into-one repack probably shouldn't be using bitmaps. I wonder if we need to tweak Eric's bitmaps-by-default logic to better account for this. > That seems to me to be a good idea in general, not just with bitmaps but > also the commit graph. If we're doing a GC and our current settings > aren't such that we'd update those files, shouldn't we just unlink them? I don't think the commit graph would suffer from this. It's not tied to a specific pack, so it can be freely updated on any gc. You still have the problem when gc does not run. It's possible that auto-gc should have separate thresholds for different activities (e.g., number of packs should tell us when to repack objects, number of loose refs should tell us when to pack refs, number of un-annotated commits should tell us when to update the commit graph). The trouble is that some of those checks are non-trivial. In the long run, I think the plan is for the commit graph to allow cheap incremental updating, which may make it reasonable to just update it preemptively after every fetch/push. -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 8:26 ` Jeff King @ 2019-05-24 9:01 ` Ævar Arnfjörð Bjarmason 2019-05-24 9:29 ` SZEDER Gábor 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-24 9:01 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24 2019, Jeff King wrote: > On Fri, May 24, 2019 at 09:55:05AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > I'm not sure what tickles the bitmap-writer to fail so hard. Is it >> > having too many refs? Weird patterns in the graph? Just a ton of >> > commits? >> >> Ah, why did only this ancient (big) pack have a bitmap? >> >> The bitmap writer had never failed, this was just a repository where >> some automation (on a dev/staging box) cloned a repo, and someone had >> once run a manual "repack" to make make a pack with a bitmap. > > Just to be clear, by "fail" I didn't mean that the writer failed to > produce. I just meant it had poor commit selection for its bitmap > coverage. But... > >> Then as time passed that pack stayed around, and re-looking at this that >> could have only happened because I had gc.bigPackThreshold turned on. >> >> I.e. without that we'd have eventually done a full repack, so the bitmap >> would have gone away. >> >> So getting the repo into that state was a series of unlikely events. > > Ah, now that's interesting. The issue may have just been that we had a > ton of un-bitmapped commits because they weren't in the bitmapped pack > at all. The logic that Stolee pointed out earlier: > > /* > * if we have a HAVES list, but none of those haves is contained > * in the packfile that has a bitmap, we don't have anything to > * optimize here > */ > if (haves && !in_bitmapped_pack(bitmap_git, haves)) > goto cleanup; > > is pretty feeble. If you have even _one_ UNINTERESTING tip that's in the > pack, we'll try to use bitmaps. And in your case, you almost certainly > had old tags on both the client and the server there were in the old > bitmapped pack, but then a huge swath of history that had happened since > then. And it was that newer part of the graph that we had to walk to > fill in the bitmap. > > So all of this makes pretty good sense to me now. Bitmaps work > incrementally as you add new, un-bitmapped history. But the cost gets > higher and higher the longer you go before repacking and generating new > bitmaps. So your case was very similar to a repo that uses bitmaps but > just hadn't packed in a really long time. > >> I think to the extent that this is an issue we can reproduce in the >> future the proper fix for it in lieu of some easy fix in the bitmap code >> would be to just teach "gc" to unlink old *.bitmap files if we detect >> they're too stale. > > Yeah. This could happen if you simply accumulated history without ever > running "gc". But in general we can probably assume that "gc" will run > periodically (though there is a real blind spot if you push up a very > huge chunk of history in one pack, since gc counts packs, not objects). > > I agree that if gc is deciding to leave a big pack, it should probably > ditch the bitmaps for it. > >> I.e. we don't need to deal gracefully with some case where the bitmaps >> just cover some tiny part of the graph, we can just teach "gc" to either >> update them, or (if we're not currently writing them) unlink them. > > Unfortunately I don't think we can update them, because all of the > reachable objects need to be in the same pack. So any scheme that isn't > doing a periodic all-into-one repack probably shouldn't be using > bitmaps. I wonder if we need to tweak Eric's bitmaps-by-default logic to > better account for this. I mean either we'd update them via repacking, or have some heuristic to do away with them. >> That seems to me to be a good idea in general, not just with bitmaps but >> also the commit graph. If we're doing a GC and our current settings >> aren't such that we'd update those files, shouldn't we just unlink them? > > I don't think the commit graph would suffer from this. It's not tied to > a specific pack, so it can be freely updated on any gc. You still have > the problem when gc does not run. It's possible that auto-gc should have > separate thresholds for different activities (e.g., number of packs > should tell us when to repack objects, number of loose refs should tell > us when to pack refs, number of un-annotated commits should tell us when > to update the commit graph). The trouble is that some of those checks > are non-trivial. > > In the long run, I think the plan is for the commit graph to allow cheap > incremental updating, which may make it reasonable to just update it > preemptively after every fetch/push. I don't think it's a performance problem to have an old commit-graph lying around. But if you turn on the commit-graph, run gc a bunch, then turn it off in config we'll have it lying around forever, even if you do subsequent gc's. So I think we should delete such things on the general principle that the end-state of a gc's shouldn't be the accumulation of the values of past configuration options if we can help it. Maybe that screws over other users who did a "commit-graph write" without setting gc.writeCommitGraph, but I think the only sane thing to do is to make "gc" fully 'own' such things if its turned on at all. We don't worry e.g. that we can't repack some recent pack because a user might have just manually produced it with handcrafted love seconds earlier. Some of what you bring up is "let's do incremental GC". I agree, but think that can be considered separately from what GC should do *when* it runs, whether that's a "full" or "partial" run. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 9:01 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 9:29 ` SZEDER Gábor 2019-05-24 11:17 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: SZEDER Gábor @ 2019-05-24 9:29 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: > I don't think it's a performance problem to have an old commit-graph > lying around. But if you turn on the commit-graph, run gc a bunch, then > turn it off in config we'll have it lying around forever, even if you do > subsequent gc's. > > So I think we should delete such things on the general principle that > the end-state of a gc's shouldn't be the accumulation of the values of > past configuration options if we can help it. > > Maybe that screws over other users who did a "commit-graph write" > without setting gc.writeCommitGraph, but I think the only sane thing to > do is to make "gc" fully 'own' such things if its turned on at all. Note that there is 'core.commitGraph' as well; as long as it's enabled, no commit-graph files should be deleted. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 9:29 ` SZEDER Gábor @ 2019-05-24 11:17 ` Ævar Arnfjörð Bjarmason 2019-05-24 11:41 ` SZEDER Gábor 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-24 11:17 UTC (permalink / raw) To: SZEDER Gábor Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24 2019, SZEDER Gábor wrote: > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: >> I don't think it's a performance problem to have an old commit-graph >> lying around. But if you turn on the commit-graph, run gc a bunch, then >> turn it off in config we'll have it lying around forever, even if you do >> subsequent gc's. >> >> So I think we should delete such things on the general principle that >> the end-state of a gc's shouldn't be the accumulation of the values of >> past configuration options if we can help it. >> >> Maybe that screws over other users who did a "commit-graph write" >> without setting gc.writeCommitGraph, but I think the only sane thing to >> do is to make "gc" fully 'own' such things if its turned on at all. > > Note that there is 'core.commitGraph' as well; as long as it's > enabled, no commit-graph files should be deleted. Why? If we won't update it or write it if it's not there, why keep it around? It means the commit-graph code and anything else (like bitmaps) needs to deal with stale data for the common and default gc --auto case. You also can't have e.g. a global core.commitGraph=true config along with a per-repo gc.writeCommitGraph=true config do what you expect. Now just because you wanted to write it for some you'll end up keeping it around forever because you'd also want to optimistically always use it if it's there. Note that I'm talking about the *default* gc semantics, they don't have to cover all advanced use-cases, just be good enough for most, and it's also important that they're as simple as possible, and don't result in stuff like "my performance sucks because I turned this config option on once a year ago for 2 days". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 11:17 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 11:41 ` SZEDER Gábor 2019-05-24 11:58 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: SZEDER Gábor @ 2019-05-24 11:41 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, May 24 2019, SZEDER Gábor wrote: > > > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> I don't think it's a performance problem to have an old commit-graph > >> lying around. But if you turn on the commit-graph, run gc a bunch, then > >> turn it off in config we'll have it lying around forever, even if you do > >> subsequent gc's. > >> > >> So I think we should delete such things on the general principle that > >> the end-state of a gc's shouldn't be the accumulation of the values of > >> past configuration options if we can help it. > >> > >> Maybe that screws over other users who did a "commit-graph write" > >> without setting gc.writeCommitGraph, but I think the only sane thing to > >> do is to make "gc" fully 'own' such things if its turned on at all. > > > > Note that there is 'core.commitGraph' as well; as long as it's > > enabled, no commit-graph files should be deleted. > > Why? If we won't update it or write it if it's not there, why keep it > around? To read it, if 'core.commitGraph' says that is should be read. > It means the commit-graph code and anything else (like bitmaps) needs to > deal with stale data for the common and default gc --auto case. > > You also can't have e.g. a global core.commitGraph=true config along > with a per-repo gc.writeCommitGraph=true config do what you expect. > > Now just because you wanted to write it for some you'll end up keeping > it around forever because you'd also want to optimistically always use > it if it's there. This is exactly what I expect it to do. > Note that I'm talking about the *default* gc semantics, they don't have > to cover all advanced use-cases, just be good enough for most, and it's > also important that they're as simple as possible, and don't result in > stuff like "my performance sucks because I turned this config option on > once a year ago for 2 days". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 11:41 ` SZEDER Gábor @ 2019-05-24 11:58 ` Ævar Arnfjörð Bjarmason 2019-05-24 12:34 ` SZEDER Gábor 0 siblings, 1 reply; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-24 11:58 UTC (permalink / raw) To: SZEDER Gábor Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24 2019, SZEDER Gábor wrote: > On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> On Fri, May 24 2019, SZEDER Gábor wrote: >> >> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> I don't think it's a performance problem to have an old commit-graph >> >> lying around. But if you turn on the commit-graph, run gc a bunch, then >> >> turn it off in config we'll have it lying around forever, even if you do >> >> subsequent gc's. >> >> >> >> So I think we should delete such things on the general principle that >> >> the end-state of a gc's shouldn't be the accumulation of the values of >> >> past configuration options if we can help it. >> >> >> >> Maybe that screws over other users who did a "commit-graph write" >> >> without setting gc.writeCommitGraph, but I think the only sane thing to >> >> do is to make "gc" fully 'own' such things if its turned on at all. >> > >> > Note that there is 'core.commitGraph' as well; as long as it's >> > enabled, no commit-graph files should be deleted. >> >> Why? If we won't update it or write it if it's not there, why keep it >> around? > > To read it, if 'core.commitGraph' says that is should be read. > >> It means the commit-graph code and anything else (like bitmaps) needs to >> deal with stale data for the common and default gc --auto case. >> >> You also can't have e.g. a global core.commitGraph=true config along >> with a per-repo gc.writeCommitGraph=true config do what you expect. >> >> Now just because you wanted to write it for some you'll end up keeping >> it around forever because you'd also want to optimistically always use >> it if it's there. > > This is exactly what I expect it to do. Do you also expect base packs with an associated bitmap to have an implicit *.keep flag under gc with pack.writeBitmaps=false and pack.useBitmaps=true? >> Note that I'm talking about the *default* gc semantics, they don't have >> to cover all advanced use-cases, just be good enough for most, and it's >> also important that they're as simple as possible, and don't result in >> stuff like "my performance sucks because I turned this config option on >> once a year ago for 2 days". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 11:58 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 12:34 ` SZEDER Gábor 2019-05-24 13:41 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 57+ messages in thread From: SZEDER Gábor @ 2019-05-24 12:34 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24, 2019 at 01:58:03PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, May 24 2019, SZEDER Gábor wrote: > > > On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > >> On Fri, May 24 2019, SZEDER Gábor wrote: > >> > >> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> >> I don't think it's a performance problem to have an old commit-graph > >> >> lying around. But if you turn on the commit-graph, run gc a bunch, then > >> >> turn it off in config we'll have it lying around forever, even if you do > >> >> subsequent gc's. > >> >> > >> >> So I think we should delete such things on the general principle that > >> >> the end-state of a gc's shouldn't be the accumulation of the values of > >> >> past configuration options if we can help it. > >> >> > >> >> Maybe that screws over other users who did a "commit-graph write" > >> >> without setting gc.writeCommitGraph, but I think the only sane thing to > >> >> do is to make "gc" fully 'own' such things if its turned on at all. > >> > > >> > Note that there is 'core.commitGraph' as well; as long as it's > >> > enabled, no commit-graph files should be deleted. > >> > >> Why? If we won't update it or write it if it's not there, why keep it > >> around? > > > > To read it, if 'core.commitGraph' says that is should be read. > > > >> It means the commit-graph code and anything else (like bitmaps) needs to > >> deal with stale data for the common and default gc --auto case. > >> > >> You also can't have e.g. a global core.commitGraph=true config along > >> with a per-repo gc.writeCommitGraph=true config do what you expect. > >> > >> Now just because you wanted to write it for some you'll end up keeping > >> it around forever because you'd also want to optimistically always use > >> it if it's there. > > > > This is exactly what I expect it to do. > > Do you also expect base packs with an associated bitmap to have an > implicit *.keep flag under gc with pack.writeBitmaps=false and > pack.useBitmaps=true? I don't understand what an "implicit *.keep flag" is. However, since a reachability bitmap is always associated with a pack, but the commit-graph is not, I don't think this is a valid comparison. > >> Note that I'm talking about the *default* gc semantics, they don't have > >> to cover all advanced use-cases, just be good enough for most, and it's > >> also important that they're as simple as possible, and don't result in > >> stuff like "my performance sucks because I turned this config option on > >> once a year ago for 2 days". ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH v3] repack: enable bitmaps by default on bare repos 2019-05-24 12:34 ` SZEDER Gábor @ 2019-05-24 13:41 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 57+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-24 13:41 UTC (permalink / raw) To: SZEDER Gábor Cc: Jeff King, Derrick Stolee, Junio C Hamano, Eric Wong, git On Fri, May 24 2019, SZEDER Gábor wrote: > On Fri, May 24, 2019 at 01:58:03PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> On Fri, May 24 2019, SZEDER Gábor wrote: >> >> > On Fri, May 24, 2019 at 01:17:06PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> >> >> On Fri, May 24 2019, SZEDER Gábor wrote: >> >> >> >> > On Fri, May 24, 2019 at 11:01:39AM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> >> I don't think it's a performance problem to have an old commit-graph >> >> >> lying around. But if you turn on the commit-graph, run gc a bunch, then >> >> >> turn it off in config we'll have it lying around forever, even if you do >> >> >> subsequent gc's. >> >> >> >> >> >> So I think we should delete such things on the general principle that >> >> >> the end-state of a gc's shouldn't be the accumulation of the values of >> >> >> past configuration options if we can help it. >> >> >> >> >> >> Maybe that screws over other users who did a "commit-graph write" >> >> >> without setting gc.writeCommitGraph, but I think the only sane thing to >> >> >> do is to make "gc" fully 'own' such things if its turned on at all. >> >> > >> >> > Note that there is 'core.commitGraph' as well; as long as it's >> >> > enabled, no commit-graph files should be deleted. >> >> >> >> Why? If we won't update it or write it if it's not there, why keep it >> >> around? >> > >> > To read it, if 'core.commitGraph' says that is should be read. >> > >> >> It means the commit-graph code and anything else (like bitmaps) needs to >> >> deal with stale data for the common and default gc --auto case. >> >> >> >> You also can't have e.g. a global core.commitGraph=true config along >> >> with a per-repo gc.writeCommitGraph=true config do what you expect. >> >> >> >> Now just because you wanted to write it for some you'll end up keeping >> >> it around forever because you'd also want to optimistically always use >> >> it if it's there. >> > >> > This is exactly what I expect it to do. >> >> Do you also expect base packs with an associated bitmap to have an >> implicit *.keep flag under gc with pack.writeBitmaps=false and >> pack.useBitmaps=true? > > I don't understand what an "implicit *.keep flag" is[...] A .keep means we keep the pack, and e.g. gc.bigPackThreshold is effectively an implicit *.keep flag on a pack matching some critera, which in this case caused this issue of a stale *.bitmap file (since the pack wasn't touched, neither was the bitmap). > [...]However, since a reachability bitmap is always associated with a > pack, but the commit-graph is not, I don't think this is a valid > comparison. I don't either, I'm just trying to understand where you're coming from, and I still don't. That bitmaps are associated with specific packs and the commit graph isn't is an internal implementation detail. Users who care about that distinction either don't use "git gc" or would be willing to tweak its settings. For the rest I think removing existing side-indexes is a good default for the practical reasons mentioned upthread. >> >> Note that I'm talking about the *default* gc semantics, they don't have >> >> to cover all advanced use-cases, just be good enough for most, and it's >> >> also important that they're as simple as possible, and don't result in >> >> stuff like "my performance sucks because I turned this config option on >> >> once a year ago for 2 days". ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH] pack-bitmap: look for an uninteresting bitmap 2019-05-23 11:30 ` Jeff King 2019-05-23 12:53 ` Derrick Stolee 2019-05-23 19:26 ` Ævar Arnfjörð Bjarmason @ 2019-05-24 11:31 ` Derrick Stolee 2 siblings, 0 replies; 57+ messages in thread From: Derrick Stolee @ 2019-05-24 11:31 UTC (permalink / raw) To: git; +Cc: peff, avarab, e, Derrick Stolee If we try to do a range query such as C..D with topology as A_0 - ... - A_10000 - B - C_1 - ... - C_1000 - C \ D_1 - ... - D_1000 - D and none of the commits in {A_i, B, C_j, C} have a computed bitmap, then we will very likely walk many many trees before computing one for the "have" bitmap. Instead, perform a commit walk to the boundary of C...D and look for computed bitmaps in { B, C_j, C }. If any are found, then it is worth starting from there and building a bitmap. If not, revert to the old method of walking trees. NOTE: this is only a proof-of-concept, as it fails a test in t5310-pack-bitmaps.sh (clearly marked as a failure now). Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- On 5/23/2019 7:30 AM, Jeff King wrote:> + /* > + * First traverse the relevant commits as we would for a normal > + * traversal. > + */ > + while (commits.nr) { > + struct commit *commit = prio_queue_get(&commits); > + struct bitmap **dst_bitmap; I was looking at this code again, and noticed this while() condition. Shouldn't this use queue_has_nonstale() like in paint_down_to_common()? Looking at the body of the loop, I don't see a way for the loop to stop without it walking the entire history of C _and_ D. Based on that, I wrote the patch below as an experiment. The has_uninteresting_bitmap_in_frontier() shamelessly steals code from paint_down_to_common(). Note the failing test, but perhaps there is something salvageable from this. Thanks, -Stolee pack-bitmap.c | 92 ++++++++++++++++++++++++++++++++++++++++- t/t5310-pack-bitmaps.sh | 2 +- 2 files changed, 91 insertions(+), 3 deletions(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index 6069b2fe55..1f4683663e 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -12,6 +12,7 @@ #include "packfile.h" #include "repository.h" #include "object-store.h" +#include "prio-queue.h" /* * An entry on the bitmap index, representing the bitmap for a given @@ -679,6 +680,81 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git, return 0; } +#define PARENT1 (1u<<16) +#define PARENT2 (1u<<17) +#define STALE (1u<<18) + +static const int all_flags = { PARENT1 | PARENT2 | STALE }; + +static int queue_has_nonstale(struct prio_queue *queue) +{ + int i; + for (i = 0; i < queue->nr; i++) { + struct commit *commit = queue->array[i].data; + if (!(commit->object.flags & STALE)) + return 1; + } + return 0; +} + +static int has_uninteresting_bitmap_in_frontier(struct repository *r, + struct commit_list *list, + struct bitmap_index *bitmap_git) +{ + int res = 0; + struct commit_list *iter = list; + struct prio_queue queue = { compare_commits_by_commit_date }; + + while (iter) { + prio_queue_put(&queue, iter->item); + iter = iter->next; + } + + while (queue_has_nonstale(&queue)) { + struct commit *commit = prio_queue_get(&queue); + struct commit_list *parents; + int flags; + + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); + if (flags == (PARENT1 | PARENT2)) { + /* Mark parents of a found merge stale */ + flags |= STALE; + } + + if (flags & PARENT1) { + khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid); + + if (pos < kh_end(bitmap_git->bitmaps)) { + res = 1; + goto cleanup; + } + } + + parents = commit->parents; + while (parents) { + struct commit *p = parents->item; + parents = parents->next; + if ((p->object.flags & flags) == flags) + continue; + if (repo_parse_commit(r, p)) + goto cleanup; + p->object.flags |= flags; + prio_queue_put(&queue, p); + } + } + +cleanup: + clear_prio_queue(&queue); + + iter = list; + while (iter) { + clear_commit_marks(iter->item, all_flags); + iter = iter->next; + } + + return res; +} + struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) { unsigned int i; @@ -689,6 +765,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) struct bitmap *wants_bitmap = NULL; struct bitmap *haves_bitmap = NULL; + struct commit_list *commits = NULL; + struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git)); /* try to open a bitmapped pack, but don't parse it yet * because we may not need to use it */ @@ -704,16 +782,22 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) while (object->type == OBJ_TAG) { struct tag *tag = (struct tag *) object; - if (object->flags & UNINTERESTING) + if (object->flags & UNINTERESTING) { + object->flags |= PARENT1; object_list_insert(object, &haves); - else + } else { + object->flags |= PARENT2; object_list_insert(object, &wants); + } if (!tag->tagged) die("bad tag"); object = parse_object_or_die(&tag->tagged->oid, NULL); } + if (object->type == OBJ_COMMIT) + commit_list_insert((struct commit *)object, &commits); + if (object->flags & UNINTERESTING) object_list_insert(object, &haves); else @@ -740,6 +824,10 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) if (load_pack_bitmap(bitmap_git) < 0) goto cleanup; + if (!has_uninteresting_bitmap_in_frontier(the_repository, commits, bitmap_git)) + goto cleanup; + + /* this is the real no-turning-back point! */ object_array_clear(&revs->pending); if (haves) { diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index a26c8ba9a2..615608fbbf 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -422,7 +422,7 @@ test_expect_success 'fetch without bitmaps ignores delta against old base' ' ' # And do the same for the bitmap case, where we do expect to find the delta. -test_expect_success 'fetch with bitmaps can reuse old base' ' +test_expect_failure 'fetch with bitmaps can reuse old base' ' test_config pack.usebitmaps true && test_when_finished "rm -rf client.git" && git init --bare client.git && -- 2.22.0.rc0 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH 2/3] prune: use bitmaps for reachability traversal 2019-02-14 4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King 2019-03-09 2:49 ` bitmaps by default? [was: prune: use bitmaps for reachability traversal] Eric Wong @ 2019-04-15 15:00 ` Derrick Stolee 2019-04-18 19:49 ` Jeff King 1 sibling, 1 reply; 57+ messages in thread From: Derrick Stolee @ 2019-04-15 15:00 UTC (permalink / raw) To: Jeff King, git On 2/13/2019 11:37 PM, Jeff King wrote: > +static void *lookup_object_by_type(struct repository *r, > + const struct object_id *oid, > + enum object_type type) > +{ > + switch (type) { > + case OBJ_COMMIT: > + return lookup_commit(r, oid); > + case OBJ_TREE: > + return lookup_tree(r, oid); > + case OBJ_TAG: > + return lookup_tag(r, oid); > + case OBJ_BLOB: > + return lookup_blob(r, oid); > + default: > + die("BUG: unknown object type %d", type); > + } > +} > + > +static int mark_object_seen(const struct object_id *oid, > + enum object_type type, > + int exclude, > + uint32_t name_hash, > + struct packed_git *found_pack, > + off_t found_offset) > +{ > + struct object *obj = lookup_object_by_type(the_repository, oid, type); > + if (!obj) > + die("unable to create object '%s'", oid_to_hex(oid)); > + > + obj->flags |= SEEN; > + return 0; > +} > + > void mark_reachable_objects(struct rev_info *revs, int mark_reflog, > timestamp_t mark_recent, struct progress *progress) > { > struct connectivity_progress cp; > + struct bitmap_index *bitmap_git; > > /* > * Set up revision parsing, and mark us as being interested > @@ -188,6 +223,13 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog, > cp.progress = progress; > cp.count = 0; > > + bitmap_git = prepare_bitmap_walk(revs); > + if (bitmap_git) { > + traverse_bitmap_commit_list(bitmap_git, mark_object_seen); > + free_bitmap_index(bitmap_git); > + return; > + } > + Peff, This block after "if (bitmap_git)" is not exercised by the (non-performance) test suite, so the rest of the code above is not tested, either. Could a test of this "prune" capability be added to the regression tests around the bitmaps? Thanks, -Stolee ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 2/3] prune: use bitmaps for reachability traversal 2019-04-15 15:00 ` [PATCH 2/3] prune: use bitmaps for reachability traversal Derrick Stolee @ 2019-04-18 19:49 ` Jeff King 2019-04-18 20:08 ` [PATCH] t5304: add a test for pruning with bitmaps Jeff King 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-04-18 19:49 UTC (permalink / raw) To: Derrick Stolee; +Cc: git On Mon, Apr 15, 2019 at 11:00:45AM -0400, Derrick Stolee wrote: > > void mark_reachable_objects(struct rev_info *revs, int mark_reflog, > > timestamp_t mark_recent, struct progress *progress) > [...] > > cp.progress = progress; > > cp.count = 0; > > > > + bitmap_git = prepare_bitmap_walk(revs); > > + if (bitmap_git) { > > + traverse_bitmap_commit_list(bitmap_git, mark_object_seen); > > + free_bitmap_index(bitmap_git); > > + return; > > + } > > + > > This block after "if (bitmap_git)" is not exercised by the (non-performance) > test suite, so the rest of the code above is not tested, either. Could a test > of this "prune" capability be added to the regression tests around the bitmaps? I have somewhat mixed feelings here. We can add a test with a trivial bitmap to get code coverage here. But as with many optimizations (bitmaps, but also your commit graph work), we get a much more thorough correctness test by re-running the whole test suite (though we do not yet have one for running with bitmaps on), and a better test that the optimization is kicking in and working via the t/perf tests. I dunno. I guess it does not hurt to at least to at least make sure this code is running in the normal suite. I don't think that will find the more interesting regressions, but at least saves us the from the most egregious ones ("oops, the whole thing segfaults because somebody changed the API" kinds of things). -Peff ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH] t5304: add a test for pruning with bitmaps 2019-04-18 19:49 ` Jeff King @ 2019-04-18 20:08 ` Jeff King 2019-04-20 1:01 ` Derrick Stolee 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-04-18 20:08 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, Junio C Hamano On Thu, Apr 18, 2019 at 03:49:53PM -0400, Jeff King wrote: > > This block after "if (bitmap_git)" is not exercised by the (non-performance) > > test suite, so the rest of the code above is not tested, either. Could a test > > of this "prune" capability be added to the regression tests around the bitmaps? > > I have somewhat mixed feelings here. We can add a test with a trivial > bitmap to get code coverage here. But as with many optimizations > (bitmaps, but also your commit graph work), we get a much more thorough > correctness test by re-running the whole test suite (though we do not > yet have one for running with bitmaps on), and a better test that the > optimization is kicking in and working via the t/perf tests. > > I dunno. I guess it does not hurt to at least to at least make sure this > code is running in the normal suite. I don't think that will find the > more interesting regressions, but at least saves us the from the most > egregious ones ("oops, the whole thing segfaults because somebody > changed the API" kinds of things). So here's a test. This goes on top of jk/prune-optim (which is also already in master). -- >8 -- Subject: [PATCH] t5304: add a test for pruning with bitmaps Commit fde67d6896 (prune: use bitmaps for reachability traversal, 2019-02-13) uses bitmaps for pruning when they're available, but only covers this functionality in the t/perf tests. This makes a kind of sense, since the point is that the behaviour is indistinguishable before and after the patch, just faster. But since the bitmap code path is not exercised at all in the regular test suite, it leaves us open to a regression where the behavior does in fact change. The most thorough way to test that would be running the whole suite with bitmaps enabled. But we don't yet have a way to do that, and anyway it's expensive to do so. Let's at least add a basic test that exercises this path and make sure we prune an object we should (and not one that we shouldn't). That would hopefully catch the most obvious breakages early. Signed-off-by: Jeff King <peff@peff.net> --- t/t5304-prune.sh | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh index 1eeb828a15..df60f18fb8 100755 --- a/t/t5304-prune.sh +++ b/t/t5304-prune.sh @@ -341,4 +341,12 @@ test_expect_success 'prune: handle expire option correctly' ' git prune --no-expire ' +test_expect_success 'trivial prune with bitmaps enabled' ' + git repack -adb && + blob=$(echo bitmap-unreachable-blob | git hash-object -w --stdin) && + git prune --expire=now && + git cat-file -e HEAD && + test_must_fail git cat-file -e $blob +' + test_done -- 2.21.0.1090.g9d17dac192 ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH] t5304: add a test for pruning with bitmaps 2019-04-18 20:08 ` [PATCH] t5304: add a test for pruning with bitmaps Jeff King @ 2019-04-20 1:01 ` Derrick Stolee 2019-04-20 3:24 ` Jeff King 0 siblings, 1 reply; 57+ messages in thread From: Derrick Stolee @ 2019-04-20 1:01 UTC (permalink / raw) To: Jeff King; +Cc: git, Junio C Hamano On 4/18/2019 4:08 PM, Jeff King wrote: > On Thu, Apr 18, 2019 at 03:49:53PM -0400, Jeff King wrote: >> I dunno. I guess it does not hurt to at least to at least make sure this >> code is running in the normal suite. I don't think that will find the >> more interesting regressions, but at least saves us the from the most >> egregious ones ("oops, the whole thing segfaults because somebody >> changed the API" kinds of things). That's the level of coverage I was hoping to see. I won't be picky if the OBJ_TAG case isn't hit or something. > So here's a test. This goes on top of jk/prune-optim (which is also > already in master). [snip] > +test_expect_success 'trivial prune with bitmaps enabled' ' > + git repack -adb && > + blob=$(echo bitmap-unreachable-blob | git hash-object -w --stdin) && > + git prune --expire=now && > + git cat-file -e HEAD && > + test_must_fail git cat-file -e $blob > +' > + > test_done This test does cover the 'mark_object_seen()' method, which I tested by removing the "!" in the "if (!obj) die();" condition. However, my first inclination was to remove the line obj->flags |= SEEN; from the method, and I found that it still worked! This was confusing, so I looked for places where the SEEN flag was added during bitmap walks, and it turns out that the objects are marked as SEEN by prepare_bitmap_walk(). This means that we can remove a lot of the implementation from reachable.c and have the same effect (and drop an iteration through the objects). See the diff below. Thoughts? -Stolee -->8-- diff --git a/reachable.c b/reachable.c index 0d00a91de4..7d2762d47f 100644 --- a/reachable.c +++ b/reachable.c @@ -159,39 +159,6 @@ int add_unseen_recent_objects_to_traversal(struct rev_info *revs, FOR_EACH_OBJECT_LOCAL_ONLY); } -static void *lookup_object_by_type(struct repository *r, - const struct object_id *oid, - enum object_type type) -{ - switch (type) { - case OBJ_COMMIT: - return lookup_commit(r, oid); - case OBJ_TREE: - return lookup_tree(r, oid); - case OBJ_TAG: - return lookup_tag(r, oid); - case OBJ_BLOB: - return lookup_blob(r, oid); - default: - die("BUG: unknown object type %d", type); - } -} - -static int mark_object_seen(const struct object_id *oid, - enum object_type type, - int exclude, - uint32_t name_hash, - struct packed_git *found_pack, - off_t found_offset) -{ - struct object *obj = lookup_object_by_type(the_repository, oid, type); - if (!obj) - die("unable to create object '%s'", oid_to_hex(oid)); - - obj->flags |= SEEN; - return 0; -} - void mark_reachable_objects(struct rev_info *revs, int mark_reflog, timestamp_t mark_recent, struct progress *progress) { @@ -225,7 +192,10 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog, bitmap_git = prepare_bitmap_walk(revs); if (bitmap_git) { - traverse_bitmap_commit_list(bitmap_git, mark_object_seen); + /* + * reachable objects are marked as SEEN + * by prepare_bitmap_walk(revs). + */ free_bitmap_index(bitmap_git); return; } ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH] t5304: add a test for pruning with bitmaps 2019-04-20 1:01 ` Derrick Stolee @ 2019-04-20 3:24 ` Jeff King 2019-04-20 21:01 ` Derrick Stolee 0 siblings, 1 reply; 57+ messages in thread From: Jeff King @ 2019-04-20 3:24 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, Junio C Hamano On Fri, Apr 19, 2019 at 09:01:50PM -0400, Derrick Stolee wrote: > > +test_expect_success 'trivial prune with bitmaps enabled' ' > > + git repack -adb && > > + blob=$(echo bitmap-unreachable-blob | git hash-object -w --stdin) && > > + git prune --expire=now && > > + git cat-file -e HEAD && > > + test_must_fail git cat-file -e $blob > > +' > > + > > test_done > > This test does cover the 'mark_object_seen()' method, which I tested by > removing the "!" in the "if (!obj) die();" condition. > > However, my first inclination was to remove the line > > obj->flags |= SEEN; > > from the method, and I found that it still worked! This was confusing, > so I looked for places where the SEEN flag was added during bitmap walks, > and it turns out that the objects are marked as SEEN by prepare_bitmap_walk(). I think this is only _some_ objects. The full bitmap is created by reading the on-disk bitmaps, and then filling in any necessary gaps by doing a traditional traversal. We also set it on tip objects to de-dup the initial revs->pending list. Try running t5304 with this: diff --git a/reachable.c b/reachable.c index eba6f64e01..7ec73ef43f 100644 --- a/reachable.c +++ b/reachable.c @@ -191,6 +191,8 @@ static int mark_object_seen(const struct object_id *oid, if (!obj) die("unable to create object '%s'", oid_to_hex(oid)); + if (!(obj->flags & SEEN)) + die("seen flag not already there"); obj->flags |= SEEN; return 0; } which shows that we are indeed freshly setting SEEN on some objects. Interestingly, I don't _think_ you can cause an object to get pruned accidentally here, because: 1. Any object that will miss its SEEN has to be in the bitmapped pack, and not found through normal traversal. 2. git-prune only removes loose objects, not packed ones. 3. Loose objects cannot be in the bitmapped pack. Therefore, no object can hit both cases (1) and (2). Even if that were the end of the story, I don't think I'd want to rely on it here. The post-condition of the function should be that SEEN is set on all reachable objects, whether bitmaps are used or not. And we do indeed use this elsewhere: We'll later call prune_shallow(), which uses SEEN to discard unreachable entries. I'm not sure it even makes sense to have a shallow repo with bitmaps, though. Obviously we're lacking the full graph, but I wouldn't be surprised if the shallow code quietly pretends that all is well and we generate bogus bitmaps or something. Or maybe it even mostly works as long as you don't walk over the shallow cut-points. mark_reachable_objects() is also used for reflog expiration with --stale-fix. I tried generating a test that would fail with your patch, but I think it's not quite possible, because --stale-fix will do a follow-up walk of the graph to see which objects mentioned in the reflog we have and do not have. So it doesn't actually break things, it just makes them slower (because we erroneously fail to mark SEEN things that it's then forced to walk). So I don't _think_ your patch actually breaks anything user-visible, but it's a bit like leaving a live grenade in your living room for somebody to find. And I think we are indeed covering lookup_object_by_type(), etc in the t5304 test I added. So AFAICT all of the new code is covered after that? -Peff ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH] t5304: add a test for pruning with bitmaps 2019-04-20 3:24 ` Jeff King @ 2019-04-20 21:01 ` Derrick Stolee 0 siblings, 0 replies; 57+ messages in thread From: Derrick Stolee @ 2019-04-20 21:01 UTC (permalink / raw) To: Jeff King; +Cc: git, Junio C Hamano On 4/19/2019 11:24 PM, Jeff King wrote: > Try running t5304 with this: > > diff --git a/reachable.c b/reachable.c > index eba6f64e01..7ec73ef43f 100644 > --- a/reachable.c > +++ b/reachable.c > @@ -191,6 +191,8 @@ static int mark_object_seen(const struct object_id *oid, > if (!obj) > die("unable to create object '%s'", oid_to_hex(oid)); > > + if (!(obj->flags & SEEN)) > + die("seen flag not already there"); > obj->flags |= SEEN; > return 0; > } > > which shows that we are indeed freshly setting SEEN on some objects. Good point! Thanks for clearing that up for me. > Interestingly, I don't _think_ you can cause an object to get pruned > accidentally here, because: [snip] I appreciate the additional context that you gave (and I snipped). This makes me comfortable accepting this test as an exercise of the code (to guard against future changes that create failures like null-refs) and we will need to rely on the performance suite to catch issues like removing the SEEN markers that I had tested. Thanks, -Stolee ^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 3/3] prune: check SEEN flag for reachability 2019-02-14 4:31 [PATCH 0/3] some prune optimizations Jeff King 2019-02-14 4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King 2019-02-14 4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King @ 2019-02-14 4:38 ` Jeff King 2 siblings, 0 replies; 57+ messages in thread From: Jeff King @ 2019-02-14 4:38 UTC (permalink / raw) To: git The git-prune command checks reachability by doing a traversal, and then checking whether a given object exists in the global object hash. This can yield false positives if any other part of the code had to create an object struct for some reason. It's not clear whether this is even possible, but it's more robust to rely on something a little more concrete: the SEEN flag set by our traversal. Note that there is a slight possibility of regression here, as we're relying on mark_reachable_objects() to consistently set the flag. However, it has always done so, and we're already relying on that fact in prune_shallow(), which is called as part of git-prune. So this is making these two parts of the prune operation more consistent. Signed-off-by: Jeff King <peff@peff.net> --- builtin/prune.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/builtin/prune.c b/builtin/prune.c index 04b6573945..97613eccb5 100644 --- a/builtin/prune.c +++ b/builtin/prune.c @@ -49,13 +49,12 @@ static void perform_reachability_traversal(struct rev_info *revs) static int is_object_reachable(const struct object_id *oid, struct rev_info *revs) { + struct object *obj; + perform_reachability_traversal(revs); - /* - * Do we know about this object? - * It must have been reachable - */ - return !!lookup_object(the_repository, oid->hash); + obj = lookup_object(the_repository, oid->hash); + return obj && (obj->flags & SEEN); } static int prune_object(const struct object_id *oid, const char *fullpath, -- 2.21.0.rc0.586.gffba1126a0 ^ permalink raw reply related [flat|nested] 57+ messages in thread
end of thread, other threads:[~2019-05-24 13:41 UTC | newest] Thread overview: 57+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-02-14 4:31 [PATCH 0/3] some prune optimizations Jeff King 2019-02-14 4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King 2019-02-14 10:54 ` Eric Sunshine 2019-02-14 11:07 ` Jeff King 2019-02-14 4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King 2019-03-09 2:49 ` bitmaps by default? [was: prune: use bitmaps for reachability traversal] Eric Wong 2019-03-10 23:39 ` Jeff King 2019-03-12 3:13 ` [PATCH] repack: enable bitmaps by default on bare repos Eric Wong 2019-03-12 9:07 ` Ævar Arnfjörð Bjarmason 2019-03-12 10:49 ` Jeff King 2019-03-12 12:05 ` Jeff King 2019-03-13 1:51 ` Eric Wong 2019-03-13 14:54 ` Jeff King 2019-03-14 9:12 ` [PATCH v3] " Eric Wong 2019-03-14 16:02 ` Jeff King 2019-03-15 6:21 ` [PATCH 0/2] enable bitmap hash-cache by default Jeff King 2019-03-15 6:22 ` [PATCH 1/2] t5310: correctly remove bitmaps for jgit test Jeff King 2019-03-15 13:25 ` SZEDER Gábor 2019-03-15 18:36 ` Jeff King 2019-03-15 6:25 ` [PATCH 2/2] pack-objects: default to writing bitmap hash-cache Jeff King 2019-04-09 15:10 ` [PATCH v3] repack: enable bitmaps by default on bare repos Ævar Arnfjörð Bjarmason 2019-04-10 22:57 ` Jeff King 2019-04-25 7:16 ` Junio C Hamano 2019-05-04 1:37 ` Jeff King 2019-05-04 6:52 ` Ævar Arnfjörð Bjarmason 2019-05-04 13:23 ` SZEDER Gábor 2019-05-08 20:17 ` Ævar Arnfjörð Bjarmason 2019-05-09 4:24 ` Junio C Hamano 2019-05-07 7:45 ` Jeff King 2019-05-07 8:12 ` Ævar Arnfjörð Bjarmason 2019-05-08 7:11 ` Jeff King 2019-05-08 14:20 ` Derrick Stolee 2019-05-08 16:13 ` Ævar Arnfjörð Bjarmason 2019-05-08 22:25 ` Jeff King 2019-05-23 11:30 ` Jeff King 2019-05-23 12:53 ` Derrick Stolee 2019-05-24 7:24 ` Jeff King 2019-05-24 10:33 ` Derrick Stolee 2019-05-23 19:26 ` Ævar Arnfjörð Bjarmason 2019-05-24 7:27 ` Jeff King 2019-05-24 7:55 ` Ævar Arnfjörð Bjarmason 2019-05-24 8:26 ` Jeff King 2019-05-24 9:01 ` Ævar Arnfjörð Bjarmason 2019-05-24 9:29 ` SZEDER Gábor 2019-05-24 11:17 ` Ævar Arnfjörð Bjarmason 2019-05-24 11:41 ` SZEDER Gábor 2019-05-24 11:58 ` Ævar Arnfjörð Bjarmason 2019-05-24 12:34 ` SZEDER Gábor 2019-05-24 13:41 ` Ævar Arnfjörð Bjarmason 2019-05-24 11:31 ` [PATCH] pack-bitmap: look for an uninteresting bitmap Derrick Stolee 2019-04-15 15:00 ` [PATCH 2/3] prune: use bitmaps for reachability traversal Derrick Stolee 2019-04-18 19:49 ` Jeff King 2019-04-18 20:08 ` [PATCH] t5304: add a test for pruning with bitmaps Jeff King 2019-04-20 1:01 ` Derrick Stolee 2019-04-20 3:24 ` Jeff King 2019-04-20 21:01 ` Derrick Stolee 2019-02-14 4:38 ` [PATCH 3/3] prune: check SEEN flag for reachability Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).