git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects
@ 2019-06-24 12:07 Nathaniel Filardo
  2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

This patch series improves handling of very large repositories, as generated
by, for example, bup (https://github.com/bup/bup).  Prolonged operation
thereof creates quite a lot of small pack files; repacking improves
filesystem performance of the objects/pack directory, but is quite
expensive, in terms of time and memory.  We have adopted a strategy that
marks "large" (tens of GB) of pack files as "kept" and defers repacking
until there are enough un-kept packs or enough bytes of un-kept objects.
(The first patch in the series will make our accounting easier, replacing
some terrible shell scripting with grep.)

While this strategy has generally improved our lives relative to either
extreme (not repacking, or repacking after every bup save operation), it
still leaves a good bit to be desired.  Because our packs are marked as
kept, repacking will leave the objects therein alone, but it still must
instantiate in memory and walk the entire object graph.  However, because
our kept packs are transitively closed, such that an object in one
necessarily references only objects in other kept packs, we should like to
avoid reasoning about them more or less altogether.

This series attempts to do just that.  The fourth patch ("repack: optionally
assume transitive kept packs") adds an option to builtin/repack to enumerate
commits (and their trees) within kept packs as UNINTERESTING to its spawned
builtin/pack-objects command.  Together with inducing the use of sparse
reachability, this speeds enumerating candidate objects for repacking and
thereby substantially reduces the runtime of our repack operations, while
producing identical results.  Some test results are reported in that patch's
commit log.

This reroll incorporates feedback from Derrick Stolee, is rebased, and
is updated to pass "make test".

Nathaniel Filardo (5):
  count-objects: report statistics about kept packs
  revision walk: optionally use sparse reachability
  repack: add --sparse and pass to pack-objects
  repack: optionally assume transitive kept packs
  builtin/gc: add --assume-pack-keep-transitive

 Documentation/git-count-objects.txt |  8 ++++
 Documentation/git-gc.txt            |  4 ++
 Documentation/git-repack.txt        | 25 ++++++++++++
 bisect.c                            |  2 +-
 builtin/count-objects.c             | 17 +++++++-
 builtin/gc.c                        |  5 +++
 builtin/pack-objects.c              |  3 +-
 builtin/repack.c                    | 60 ++++++++++++++++++++++++++++-
 builtin/rev-list.c                  |  2 +-
 http-push.c                         |  2 +-
 list-objects.c                      |  5 +--
 list-objects.h                      |  3 +-
 revision.c                          |  3 +-
 revision.h                          |  1 +
 t/t5322-pack-objects-sparse.sh      |  6 +++
 t/t5500-fetch-pack.sh               |  8 ++--
 16 files changed, 137 insertions(+), 17 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/5] count-objects: report statistics about kept packs
  2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
@ 2019-06-24 12:07 ` Nathaniel Filardo
  2019-06-24 12:52   ` Derrick Stolee
  2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

Specifically: number of kept packs, size of kept packs (and indexes),
and number of objects in kept packs.

Add documentation of resulting fields.  While here, correct the omission
from ae72f68541 ("count-objects -v: show number of packs as well.",
2006-12-27) of the "packs" field.

Update tests which were searching for the "in-pack" field without
the terminating colon and so were now also picking up the "in-pack-keep"
field.

Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>

update count-objects
---
 Documentation/git-count-objects.txt |  8 ++++++++
 builtin/count-objects.c             | 17 +++++++++++++++--
 t/t5500-fetch-pack.sh               |  8 ++++----
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-count-objects.txt b/Documentation/git-count-objects.txt
index cb9b4d2e46..8d88f87856 100644
--- a/Documentation/git-count-objects.txt
+++ b/Documentation/git-count-objects.txt
@@ -26,10 +26,18 @@ count: the number of loose objects
 +
 size: disk space consumed by loose objects, in KiB (unless -H is specified)
 +
+packs: the number of pack files
++
+packs-keep: the number of pack files marked kept
++
 in-pack: the number of in-pack objects
 +
+in-pack-keep: the number of objects in packs marked kept
++
 size-pack: disk space consumed by the packs, in KiB (unless -H is specified)
 +
+size-pack-keep: disk space consumed by kept packs, in KiB (unless -H is specified)
++
 prune-packable: the number of loose objects that are also present in
 the packs. These objects could be pruned using `git prune-packed`.
 +
diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index 3fae474f6f..0309c7907f 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -117,17 +117,24 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix)
 
 	if (verbose) {
 		struct packed_git *p;
-		unsigned long num_pack = 0;
-		off_t size_pack = 0;
+		unsigned long num_pack = 0, num_pack_keep = 0;
+		unsigned long packed_keep = 0;
+		off_t size_pack = 0, size_pack_keep = 0;
 		struct strbuf loose_buf = STRBUF_INIT;
 		struct strbuf pack_buf = STRBUF_INIT;
 		struct strbuf garbage_buf = STRBUF_INIT;
+		struct strbuf pack_keep_buf = STRBUF_INIT;
 
 		for (p = get_all_packs(the_repository); p; p = p->next) {
 			if (!p->pack_local)
 				continue;
 			if (open_pack_index(p))
 				continue;
+			if (p->pack_keep || p->pack_keep_in_core) {
+				packed_keep += p->num_objects;
+				size_pack_keep += p->pack_size + p->index_size;
+				num_pack_keep++;
+			}
 			packed += p->num_objects;
 			size_pack += p->pack_size + p->index_size;
 			num_pack++;
@@ -136,12 +143,15 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix)
 		if (human_readable) {
 			strbuf_humanise_bytes(&loose_buf, loose_size);
 			strbuf_humanise_bytes(&pack_buf, size_pack);
+			strbuf_humanise_bytes(&pack_keep_buf, size_pack_keep);
 			strbuf_humanise_bytes(&garbage_buf, size_garbage);
 		} else {
 			strbuf_addf(&loose_buf, "%lu",
 				    (unsigned long)(loose_size / 1024));
 			strbuf_addf(&pack_buf, "%lu",
 				    (unsigned long)(size_pack / 1024));
+			strbuf_addf(&pack_keep_buf, "%lu",
+				    (unsigned long)(size_pack_keep / 1024));
 			strbuf_addf(&garbage_buf, "%lu",
 				    (unsigned long)(size_garbage / 1024));
 		}
@@ -149,8 +159,11 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix)
 		printf("count: %lu\n", loose);
 		printf("size: %s\n", loose_buf.buf);
 		printf("in-pack: %lu\n", packed);
+		printf("in-pack-keep: %lu\n", packed_keep);
 		printf("packs: %lu\n", num_pack);
+		printf("packs-keep: %lu\n", num_pack_keep);
 		printf("size-pack: %s\n", pack_buf.buf);
+		printf("size-pack-keep: %s\n", pack_keep_buf.buf);
 		printf("prune-packable: %lu\n", packed_loose);
 		printf("garbage: %lu\n", garbage);
 		printf("size-garbage: %s\n", garbage_buf.buf);
diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 1c71c0ec77..a1d154a4ee 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -328,7 +328,7 @@ test_expect_success 'clone shallow with --branch' '
 test_expect_success 'clone shallow object count' '
 	echo "in-pack: 3" > count3.expected &&
 	GIT_DIR=shallow3/.git git count-objects -v |
-		grep "^in-pack" > count3.actual &&
+		grep "^in-pack:" > count3.actual &&
 	test_cmp count3.expected count3.actual
 '
 
@@ -357,7 +357,7 @@ EOF
 
 	echo "in-pack: 4" > count6.expected &&
 	GIT_DIR=shallow6/.git git count-objects -v |
-		grep "^in-pack" > count6.actual &&
+		grep "^in-pack:" > count6.actual &&
 	test_cmp count6.expected count6.actual
 '
 
@@ -372,7 +372,7 @@ EOF
 
 	echo "in-pack: 4" > count7.expected &&
 	GIT_DIR=shallow7/.git git count-objects -v |
-		grep "^in-pack" > count7.actual &&
+		grep "^in-pack:" > count7.actual &&
 	test_cmp count7.expected count7.actual
 '
 
@@ -381,7 +381,7 @@ test_expect_success 'clone shallow with packed refs' '
 	git clone --depth 1 --branch A "file://$(pwd)/." shallow8 &&
 	echo "in-pack: 4" > count8.expected &&
 	GIT_DIR=shallow8/.git git count-objects -v |
-		grep "^in-pack" > count8.actual &&
+		grep "^in-pack:" > count8.actual &&
 	test_cmp count8.expected count8.actual
 '
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 2/5] revision walk: optionally use sparse reachability
  2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
  2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
@ 2019-06-24 12:07 ` Nathaniel Filardo
  2019-06-24 12:54   ` Derrick Stolee
  2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

Add another bit flag to the struct rev_info.

The only caller that uses this after this patch is builtin/pack-objects.
Without this, sparsity seems to do little good therein, as
prepare_revision_walk will densely propagate UNINTERESTING flags from
trees to tree contents, before mark_edges_uninteresting has a chance to
be faster by being sparse.

While here, drop the "sparse" parameter to mark_edges_uninteresting,
introduced in 4f6d26b167 ("list-objects: consume sparse tree walk",
2019-01-16) which can now use the flag in struct rev_info.  No
functional change intended.

Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
---
 bisect.c               | 2 +-
 builtin/pack-objects.c | 3 ++-
 builtin/rev-list.c     | 2 +-
 http-push.c            | 2 +-
 list-objects.c         | 5 ++---
 list-objects.h         | 3 +--
 revision.c             | 3 ++-
 revision.h             | 1 +
 8 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/bisect.c b/bisect.c
index e87ac29a51..b6ed2b5c92 100644
--- a/bisect.c
+++ b/bisect.c
@@ -658,7 +658,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL, 0);
+		mark_edges_uninteresting(revs, NULL);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b2be8869c2..afda6064ca 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3136,9 +3136,10 @@ static void get_object_list(int ac, const char **av)
 	if (use_delta_islands)
 		load_delta_islands(the_repository);
 
+	revs.sparse_tree_walk = !!sparse;
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, sparse);
+	mark_edges_uninteresting(&revs, show_edge);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 660172b014..74658b7344 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -548,7 +548,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge, 0);
+		mark_edges_uninteresting(&revs, show_edge);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index e36561a6db..fad6382069 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1934,7 +1934,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL, 0);
+		mark_edges_uninteresting(&revs, NULL);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index b5651ddd5b..175673cf78 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -264,13 +264,12 @@ static void add_edge_parents(struct commit *commit,
 }
 
 void mark_edges_uninteresting(struct rev_info *revs,
-			      show_edge_fn show_edge,
-			      int sparse)
+			      show_edge_fn show_edge)
 {
 	struct commit_list *list;
 	int i;
 
-	if (sparse) {
+	if (revs->sparse_tree_walk) {
 		struct oidset set;
 		oidset_init(&set, 16);
 
diff --git a/list-objects.h b/list-objects.h
index a952680e46..9388d96785 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -11,8 +11,7 @@ void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, voi
 
 typedef void (*show_edge_fn)(struct commit *);
 void mark_edges_uninteresting(struct rev_info *revs,
-			      show_edge_fn show_edge,
-			      int sparse);
+			      show_edge_fn show_edge);
 
 struct oidset;
 struct list_objects_filter_options;
diff --git a/revision.c b/revision.c
index 621feb9df7..ef523c4a1e 100644
--- a/revision.c
+++ b/revision.c
@@ -458,7 +458,8 @@ static struct commit *handle_commit(struct rev_info *revs,
 		if (!revs->tree_objects)
 			return NULL;
 		if (flags & UNINTERESTING) {
-			mark_tree_contents_uninteresting(revs->repo, tree);
+			if (!revs->sparse_tree_walk)
+				mark_tree_contents_uninteresting(revs->repo, tree);
 			return NULL;
 		}
 		add_pending_object_with_path(revs, object, name, mode, path);
diff --git a/revision.h b/revision.h
index 4134dc6029..a7154566b3 100644
--- a/revision.h
+++ b/revision.h
@@ -145,6 +145,7 @@ struct rev_info {
 			first_parent_only:1,
 			line_level_traverse:1,
 			tree_blobs_in_commit_order:1,
+			sparse_tree_walk:1,
 
 			/*
 			 * Blobs are shown without regard for their existence.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 3/5] repack: add --sparse and pass to pack-objects
  2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
  2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
  2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
@ 2019-06-24 12:07 ` Nathaniel Filardo
  2019-06-24 13:03   ` Derrick Stolee
  2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
  2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo
  4 siblings, 1 reply; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

The sparse connectivity algorithm saves a whole lot of time when there
are UNINTERESTING trees around.
---
 Documentation/git-repack.txt   | 4 ++++
 builtin/repack.c               | 5 +++++
 t/t5322-pack-objects-sparse.sh | 6 ++++++
 3 files changed, 15 insertions(+)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index aa0cc8bd44..836d81457a 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -165,6 +165,10 @@ depth is 4095.
 	Pass the `--delta-islands` option to `git-pack-objects`, see
 	linkgit:git-pack-objects[1].
 
+--sparse::
+	Pass the `--sparse` option to `git-pack-objects`; see
+	linkgit:git-pack-objects[1].
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index caca113927..4cfdd62bb8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -288,6 +288,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int midx_cleared = 0;
 	struct pack_objects_args po_args = {NULL};
+	int sparse = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -326,6 +327,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("maximum size of each packfile")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
 		OPT_END()
@@ -369,6 +372,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd.args, "--all");
 	argv_array_push(&cmd.args, "--reflog");
 	argv_array_push(&cmd.args, "--indexed-objects");
+	if (sparse)
+		argv_array_push(&cmd.args, "--sparse");
 	if (repository_format_partial_clone)
 		argv_array_push(&cmd.args, "--exclude-promisor-objects");
 	if (write_bitmaps)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 7124b5581a..66e133dcfe 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -133,4 +133,10 @@ test_expect_success 'pack.useSparse overridden' '
 	test_cmp required_objects.txt sparse_objects.txt
 '
 
+# repack --sparse invokes pack-objects --sparse
+test_expect_success 'repack --sparse and fsck' '
+	git repack -a --sparse &&
+	git fsck
+'
+
 test_done
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 4/5] repack: optionally assume transitive kept packs
  2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
                   ` (2 preceding siblings ...)
  2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
@ 2019-06-24 12:07 ` Nathaniel Filardo
  2019-06-24 13:21   ` Derrick Stolee
  2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo
  4 siblings, 1 reply; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

If the user is careful to mark .pack files as kept only when they refer
to (other) kept packs, then we can rely on this when walking the object
graph in subsequent repack operations and reduce the time and memory
spent building the object graph.

Towards that end, then, teach git repack to enumerate the COMMITs and
TREEs in kept packs and mark them as UNINTERESTING for pack-object's
operation.  Because this creates UNINTERESTING marks, we make repack's
--assume-pack-keep-transitive imply --sparse for pack-object to avoid
hauling the entire object graph into memory.

In testing with a 203GB repository with 80M objects, this amounts to
several gigabytes of memory and over ten minutes saved.  This machine
has 32G of RAM and the repository is on a fast SSD to speed testing.

All told, this test repository takes about 33 minutes elapsed (28
minutes in user code) and 3 GB of RAM to repack 12M objects occupying
33GB scattered across 477 pack files not marked as kept (the remainder
of the objects are spread across three kept pack files).  The time and
RAM usage with --assume-pack-keep-transitive should be dominated by
factors proportional to the size and number of un-kept objects.

Without these optimizations, it takes about 45 minutes (34 minutes in
user code) and 7 GB of RAM to compute the same pack file.  The extra
time and total memory use are expected to be proportional to the kept
packs' size, not the working set of the repack.  The time discrepancy
can be expected to be more dramatic when the repository is on spinning
rust rather than SSD.

Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
---
 Documentation/git-repack.txt | 21 +++++++++++++
 builtin/repack.c             | 57 ++++++++++++++++++++++++++++++++++--
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 836d81457a..014812c4bb 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -169,6 +169,27 @@ depth is 4095.
 	Pass the `--sparse` option to `git-pack-objects`; see
 	linkgit:git-pack-objects[1].
 
+--assume-pack-keep-transitive::
+	This flag accelerates the search for objects to pack by assuming
+	that commit objects found in kept packfiles make reference only
+	to that or other kept packfiles.  This is very useful if, for
+	example, this repository periodically repacks all reachable objects
+	and marks the resulting pack file as kept.
+
+	Because this option may prevent git from descending into kept packs,
+	no objects inside kept packs will be available for delta processing.
+	One should therefore restrict the use of this option to situations
+	where delta processing is disabled or when relatively large amounts
+	of data are considered and the relative gain of a wider set of
+	possible delta base objects is reduced.
+
+	The simplest way to ensure transitivity of kept packs is to run `git
+	repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
+	kept, never removing the kept marker.  In practice, one may wish to
+	wait to mark packs as kept until they are sufficiently large
+	(reducing the number of pack files necessary for a given set of
+	objects) but not so large as to be unwieldy.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 4cfdd62bb8..a2cd479686 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -11,6 +11,8 @@
 #include "midx.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "revision.h"
+#include "list-objects.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
+static void apkt_show_commit(struct commit *commit, void *data)
+{
+	struct tree *tree;
+	struct pack_entry e;
+
+	if (!find_pack_entry(the_repository, &commit->object.oid, &e))
+		return;
+
+	if (!(e.p->pack_keep || e.p->pack_keep_in_core))
+		return;
+
+	tree = get_commit_tree(commit);
+	if (tree)
+		tree->object.flags |= UNINTERESTING;
+
+	write_oid(&commit->object.oid, e.p, 0, data);
+	write_oid(&tree->object.oid, NULL, 0, data);
+}
+
+static void apkt_show_object(struct object *obj, const char *name, void *data)
+{
+	return;
+}
+
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
@@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	int midx_cleared = 0;
+	int assume_pack_keep_transitive = 0;
 	struct pack_objects_args po_args = {NULL};
 	int sparse = 0;
 
@@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("repack objects in packs marked with .keep")),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
+		OPT_BOOL(0, "assume-pack-keep-transitive", &assume_pack_keep_transitive,
+				N_("assume kept packs reference only kept packs")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
 		OPT_END()
@@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("--keep-unreachable and -A are incompatible"));
 
+	if (assume_pack_keep_transitive) {
+		/* imply --honor-pack-keep */
+		pack_kept_objects = 0;
+	}
+
 	if (write_bitmaps < 0)
 		write_bitmaps = (pack_everything & ALL_INTO_ONE) &&
 				 is_bare_repository();
@@ -372,7 +406,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd.args, "--all");
 	argv_array_push(&cmd.args, "--reflog");
 	argv_array_push(&cmd.args, "--indexed-objects");
-	if (sparse)
+	if (sparse || assume_pack_keep_transitive)
 		argv_array_push(&cmd.args, "--sparse");
 	if (repository_format_partial_clone)
 		argv_array_push(&cmd.args, "--exclude-promisor-objects");
@@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		argv_array_push(&cmd.args, "--incremental");
 	}
 
-	cmd.no_stdin = 1;
+	cmd.in = -1;
+	cmd.no_stdin = !assume_pack_keep_transitive;
 
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
 
+	if (assume_pack_keep_transitive) {
+		struct rev_info revs;
+		const char *revargv[] = { "skip", "--all", "--reflog", "--indexed-objects", NULL };
+
+		repo_init_revisions(the_repository, &revs, NULL);
+		revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);
+		setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv, &revs, NULL);
+		if (prepare_revision_walk(&revs))
+			die("revision walk setup failed");
+
+		xwrite(cmd.in, "--not\n", 6);
+		traverse_commit_list(&revs, apkt_show_commit, apkt_show_object,
+				     &cmd);
+		xwrite(cmd.in, "--not\n", 6);
+
+		close(cmd.in);
+	}
+
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		if (line.len != the_hash_algo->hexsz)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive
  2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
                   ` (3 preceding siblings ...)
  2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
@ 2019-06-24 12:07 ` Nathaniel Filardo
  4 siblings, 0 replies; 11+ messages in thread
From: Nathaniel Filardo @ 2019-06-24 12:07 UTC (permalink / raw)
  To: git; +Cc: stolee, Nathaniel Filardo

Just pass it on down to builtin/repack.

Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
---
 Documentation/git-gc.txt | 4 ++++
 builtin/gc.c             | 5 +++++
 2 files changed, 9 insertions(+)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 247f765604..6f9f15ef19 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -77,6 +77,10 @@ be performed as well.
 	`.keep` files are consolidated into a single pack. When this
 	option is used, `gc.bigPackThreshold` is ignored.
 
+--assume-pack-keep-transitive::
+	Pass the `--assume-pack-keep-transitive` option to `git-repack`;
+	see linkgit:git-repack[1].
+
 AGGRESSIVE
 ----------
 
diff --git a/builtin/gc.c b/builtin/gc.c
index 8943bcc300..1b304631c5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -525,6 +525,7 @@ static void gc_before_repack(void)
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int aggressive = 0;
+	int assume_pack_keep_transitive = 0;
 	int auto_gc = 0;
 	int quiet = 0;
 	int force = 0;
@@ -547,6 +548,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			   PARSE_OPT_NOCOMPLETE),
 		OPT_BOOL(0, "keep-largest-pack", &keep_base_pack,
 			 N_("repack all other packs except the largest pack")),
+		OPT_BOOL(0, "assume-pack-keep-transitive", &assume_pack_keep_transitive,
+			 N_("assume kept packs reference only kept packs")),
 		OPT_END()
 	};
 
@@ -585,6 +588,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	}
 	if (quiet)
 		argv_array_push(&repack, "-q");
+	if (assume_pack_keep_transitive)
+		argv_array_push(&repack, "--assume-pack-keep-transitive");
 
 	if (auto_gc) {
 		/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/5] count-objects: report statistics about kept packs
  2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
@ 2019-06-24 12:52   ` Derrick Stolee
  0 siblings, 0 replies; 11+ messages in thread
From: Derrick Stolee @ 2019-06-24 12:52 UTC (permalink / raw)
  To: Nathaniel Filardo, git

On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
> 
> update count-objects
> ---

This post-signoff text looks like cruft from a rebase.

-Stolee

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/5] revision walk: optionally use sparse reachability
  2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
@ 2019-06-24 12:54   ` Derrick Stolee
  0 siblings, 0 replies; 11+ messages in thread
From: Derrick Stolee @ 2019-06-24 12:54 UTC (permalink / raw)
  To: Nathaniel Filardo, git

On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
> Add another bit flag to the struct rev_info.
> 
> The only caller that uses this after this patch is builtin/pack-objects.
> Without this, sparsity seems to do little good therein, as
> prepare_revision_walk will densely propagate UNINTERESTING flags from
> trees to tree contents, before mark_edges_uninteresting has a chance to
> be faster by being sparse.
> 
> While here, drop the "sparse" parameter to mark_edges_uninteresting,
> introduced in 4f6d26b167 ("list-objects: consume sparse tree walk",
> 2019-01-16) which can now use the flag in struct rev_info.  No
> functional change intended.

Looks like a straight-forward refactor to me. I'll keep reading
for the use of this new bitflag.

-Stolee

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/5] repack: add --sparse and pass to pack-objects
  2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
@ 2019-06-24 13:03   ` Derrick Stolee
  0 siblings, 0 replies; 11+ messages in thread
From: Derrick Stolee @ 2019-06-24 13:03 UTC (permalink / raw)
  To: Nathaniel Filardo, git

On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
> The sparse connectivity algorithm saves a whole lot of time when there
> are UNINTERESTING trees around.

Neat! Any chance for some example performance stats?

> ---

Looks like you forgot your Signed-off-by here.

> +# repack --sparse invokes pack-objects --sparse
> +test_expect_success 'repack --sparse and fsck' '
> +	git repack -a --sparse &&
> +	git fsck
> +'

This test may not be enough to properly test the sparse
algorithm, as it is only really enabled when the --revs
argument is given to the pack-objects process AND there
is a "NOT" ref (i.e. "!{oid}" over stdin). Using "-a" here
will not have any "NOT" references. OR maybe this already
is enough when you have the .keep packs. Is there a way
you could set up the packs in this test to explicitly be
in the situation you describe where the sparse option
speeds things up?

It's hard to check that the '--sparse' option is doing
anything in a test, but it is important that we run the
logic. One way to see if this test is doing anything is
to insert a die() somewhere. For example, this die()
statement will check if we ever needed to mark things
uninteresting in a workdir path with both UNINTERESTING
and INTERESTING trees:

diff --git a/revision.c b/revision.c
index eb8e51bc63..8835f8e7b1 100644
--- a/revision.c
+++ b/revision.c
@@ -227,6 +227,7 @@ void mark_trees_uninteresting_sparse(struct repository *r,
        if (!has_uninteresting || !has_interesting)
                return;

+       die("we exercised the logic!");
        paths_and_oids_init(&map);

        oidset_iter_init(trees, &iter);

The implementation of the argument looks straight-forward.
It appears we are passing the argument to the sub-process,
so the bitflag isn't used yet.

-Stolee


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 4/5] repack: optionally assume transitive kept packs
  2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
@ 2019-06-24 13:21   ` Derrick Stolee
  2019-06-25 10:32     ` Dr N.W. Filardo
  0 siblings, 1 reply; 11+ messages in thread
From: Derrick Stolee @ 2019-06-24 13:21 UTC (permalink / raw)
  To: Nathaniel Filardo, git

On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
> If the user is careful to mark .pack files as kept only when they refer
> to (other) kept packs, then we can rely on this when walking the object
> graph in subsequent repack operations and reduce the time and memory
> spent building the object graph.
> 
> Towards that end, then, teach git repack to enumerate the COMMITs and
> TREEs in kept packs and mark them as UNINTERESTING for pack-object's
> operation.  Because this creates UNINTERESTING marks, we make repack's
> --assume-pack-keep-transitive imply --sparse for pack-object to avoid
> hauling the entire object graph into memory.
> 
> In testing with a 203GB repository with 80M objects, this amounts to
> several gigabytes of memory and over ten minutes saved.  This machine
> has 32G of RAM and the repository is on a fast SSD to speed testing.

Thanks for the performance details!

> All told, this test repository takes about 33 minutes elapsed (28
> minutes in user code) and 3 GB of RAM to repack 12M objects occupying
> 33GB scattered across 477 pack files not marked as kept (the remainder
> of the objects are spread across three kept pack files).  The time and
> RAM usage with --assume-pack-keep-transitive should be dominated by
> factors proportional to the size and number of un-kept objects.
> 
> Without these optimizations, it takes about 45 minutes (34 minutes in
> user code) and 7 GB of RAM to compute the same pack file.  The extra
> time and total memory use are expected to be proportional to the kept
> packs' size, not the working set of the repack.  The time discrepancy
> can be expected to be more dramatic when the repository is on spinning
> rust rather than SSD.
> 
> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
> ---
>  Documentation/git-repack.txt | 21 +++++++++++++
>  builtin/repack.c             | 57 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 836d81457a..014812c4bb 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -169,6 +169,27 @@ depth is 4095.
>  	Pass the `--sparse` option to `git-pack-objects`; see
>  	linkgit:git-pack-objects[1].
>  
> +--assume-pack-keep-transitive::
> +	This flag accelerates the search for objects to pack by assuming
> +	that commit objects found in kept packfiles make reference only
> +	to that or other kept packfiles.  This is very useful if, for
> +	example, this repository periodically repacks all reachable objects
> +	and marks the resulting pack file as kept.
> +
> +	Because this option may prevent git from descending into kept packs,
> +	no objects inside kept packs will be available for delta processing.
> +	One should therefore restrict the use of this option to situations
> +	where delta processing is disabled or when relatively large amounts
> +	of data are considered and the relative gain of a wider set of
> +	possible delta base objects is reduced.
> +
> +	The simplest way to ensure transitivity of kept packs is to run `git
> +	repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
> +	kept, never removing the kept marker.  In practice, one may wish to
> +	wait to mark packs as kept until they are sufficiently large
> +	(reducing the number of pack files necessary for a given set of
> +	objects) but not so large as to be unwieldy.
> +
>  Configuration
>  -------------
>  
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 4cfdd62bb8..a2cd479686 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -11,6 +11,8 @@
>  #include "midx.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "revision.h"
> +#include "list-objects.h"
>  
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> @@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
>  		die(_("could not finish pack-objects to repack promisor objects"));
>  }
>  
> +static void apkt_show_commit(struct commit *commit, void *data)
> +{
> +	struct tree *tree;
> +	struct pack_entry e;
> +
> +	if (!find_pack_entry(the_repository, &commit->object.oid, &e))
> +		return;
> +
> +	if (!(e.p->pack_keep || e.p->pack_keep_in_core))
> +		return;
> +
> +	tree = get_commit_tree(commit);
> +	if (tree)
> +		tree->object.flags |= UNINTERESTING;
> +
> +	write_oid(&commit->object.oid, e.p, 0, data);
> +	write_oid(&tree->object.oid, NULL, 0, data);
> +}
> +
> +static void apkt_show_object(struct object *obj, const char *name, void *data)
> +{
> +	return;
> +}
> +
>  #define ALL_INTO_ONE 1
>  #define LOOSEN_UNREACHABLE 2
>  
> @@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>  	int no_update_server_info = 0;
>  	int midx_cleared = 0;
> +	int assume_pack_keep_transitive = 0;
>  	struct pack_objects_args po_args = {NULL};
>  	int sparse = 0;
>  
> @@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				N_("repack objects in packs marked with .keep")),
>  		OPT_BOOL(0, "sparse", &sparse,
>  			 N_("use the sparse reachability algorithm")),
> +		OPT_BOOL(0, "assume-pack-keep-transitive", &assume_pack_keep_transitive,
> +				N_("assume kept packs reference only kept packs")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>  				N_("do not repack this pack")),
>  		OPT_END()
> @@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>  		die(_("--keep-unreachable and -A are incompatible"));
>  
> +	if (assume_pack_keep_transitive) {
> +		/* imply --honor-pack-keep */
> +		pack_kept_objects = 0;
> +	}
> +

If you also set `sparse = 1;` here, then...

> -	if (sparse)
> +	if (sparse || assume_pack_keep_transitive)
>  		argv_array_push(&cmd.args, "--sparse");

...this logic can stay the same. And be simpler.

>  	if (repository_format_partial_clone)
>  		argv_array_push(&cmd.args, "--exclude-promisor-objects");
> @@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		argv_array_push(&cmd.args, "--incremental");
>  	}
>  
> -	cmd.no_stdin = 1;
> +	cmd.in = -1;
> +	cmd.no_stdin = !assume_pack_keep_transitive;

I wonder if this `cmd.in = -1;` needs to be here, or should be in
the `if (assume_pack_keep_transitive)` block below.

>  
>  	ret = start_command(&cmd);
>  	if (ret)
>  		return ret;
>  
> +	if (assume_pack_keep_transitive) {
> +		struct rev_info revs;
> +		const char *revargv[] = { "skip", "--all", "--reflog", "--indexed-objects", NULL };
> +
> +		repo_init_revisions(the_repository, &revs, NULL);

> +		revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);

Here is the bitflag! Excellent. Again, this can be simplified if we set `sparse = 1`
in the first assume_pack_keep_transitive block.

> +		setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv, &revs, NULL);

Some whitespace strangeness on this line.

> +		if (prepare_revision_walk(&revs))
> +			die("revision walk setup failed");

This string should be localizable. It's not entirely your fault, since running
'git grep -A 1 prepare_revision_walk' shows a variety of different uses, most
of which don't use the localizable string, but enough do that you could add it
here.

> +
> +		xwrite(cmd.in, "--not\n", 6);
> +		traverse_commit_list(&revs, apkt_show_commit, apkt_show_object,
> +				     &cmd);
> +		xwrite(cmd.in, "--not\n", 6);
> +
> +		close(cmd.in);
> +	}
> +
>  	out = xfdopen(cmd.out, "r");
>  	while (strbuf_getline_lf(&line, out) != EOF) {
>  		if (line.len != the_hash_algo->hexsz)
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 4/5] repack: optionally assume transitive kept packs
  2019-06-24 13:21   ` Derrick Stolee
@ 2019-06-25 10:32     ` Dr N.W. Filardo
  0 siblings, 0 replies; 11+ messages in thread
From: Dr N.W. Filardo @ 2019-06-25 10:32 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On 2019-06-24 14:21, Derrick Stolee wrote:
> On 6/24/2019 8:07 AM, Nathaniel Filardo wrote:
>> If the user is careful to mark .pack files as kept only when they
>> refer
>> to (other) kept packs, then we can rely on this when walking the
>> object
>> graph in subsequent repack operations and reduce the time and memory
>> spent building the object graph.
>> 
>> Towards that end, then, teach git repack to enumerate the COMMITs and
>> TREEs in kept packs and mark them as UNINTERESTING for pack-object's
>> operation.  Because this creates UNINTERESTING marks, we make repack's
>> --assume-pack-keep-transitive imply --sparse for pack-object to avoid
>> hauling the entire object graph into memory.
>> 
>> In testing with a 203GB repository with 80M objects, this amounts to
>> several gigabytes of memory and over ten minutes saved.  This machine
>> has 32G of RAM and the repository is on a fast SSD to speed testing.
> 
> Thanks for the performance details!

Happy to provide.  Hopefully they justify my desire to see these patches
land in mainline. :)

>> All told, this test repository takes about 33 minutes elapsed (28
>> minutes in user code) and 3 GB of RAM to repack 12M objects occupying
>> 33GB scattered across 477 pack files not marked as kept (the remainder
>> of the objects are spread across three kept pack files).  The time and
>> RAM usage with --assume-pack-keep-transitive should be dominated by
>> factors proportional to the size and number of un-kept objects.
>> 
>> Without these optimizations, it takes about 45 minutes (34 minutes in
>> user code) and 7 GB of RAM to compute the same pack file.  The extra
>> time and total memory use are expected to be proportional to the kept
>> packs' size, not the working set of the repack.  The time discrepancy
>> can be expected to be more dramatic when the repository is on spinning
>> rust rather than SSD.
>> 
>> Signed-off-by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
>> ---
>>  Documentation/git-repack.txt | 21 +++++++++++++
>>  builtin/repack.c             | 57
>> ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 76 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/git-repack.txt
>> b/Documentation/git-repack.txt
>> index 836d81457a..014812c4bb 100644
>> --- a/Documentation/git-repack.txt
>> +++ b/Documentation/git-repack.txt
>> @@ -169,6 +169,27 @@ depth is 4095.
>>  	Pass the `--sparse` option to `git-pack-objects`; see
>>  	linkgit:git-pack-objects[1].
>> 
>> +--assume-pack-keep-transitive::
>> +	This flag accelerates the search for objects to pack by assuming
>> +	that commit objects found in kept packfiles make reference only
>> +	to that or other kept packfiles.  This is very useful if, for
>> +	example, this repository periodically repacks all reachable objects
>> +	and marks the resulting pack file as kept.
>> +
>> +	Because this option may prevent git from descending into kept packs,
>> +	no objects inside kept packs will be available for delta processing.
>> +	One should therefore restrict the use of this option to situations
>> +	where delta processing is disabled or when relatively large amounts
>> +	of data are considered and the relative gain of a wider set of
>> +	possible delta base objects is reduced.
>> +
>> +	The simplest way to ensure transitivity of kept packs is to run `git
>> +	repack` with `-a` (or `-A`) and `-d` and mark all resulting packs as
>> +	kept, never removing the kept marker.  In practice, one may wish to
>> +	wait to mark packs as kept until they are sufficiently large
>> +	(reducing the number of pack files necessary for a given set of
>> +	objects) but not so large as to be unwieldy.
>> +
>>  Configuration
>>  -------------
>> 
>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 4cfdd62bb8..a2cd479686 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -11,6 +11,8 @@
>>  #include "midx.h"
>>  #include "packfile.h"
>>  #include "object-store.h"
>> +#include "revision.h"
>> +#include "list-objects.h"
>> 
>>  static int delta_base_offset = 1;
>>  static int pack_kept_objects = -1;
>> @@ -256,6 +258,30 @@ static void repack_promisor_objects(const struct
>> pack_objects_args *args,
>>  		die(_("could not finish pack-objects to repack promisor objects"));
>>  }
>> 
>> +static void apkt_show_commit(struct commit *commit, void *data)
>> +{
>> +	struct tree *tree;
>> +	struct pack_entry e;
>> +
>> +	if (!find_pack_entry(the_repository, &commit->object.oid, &e))
>> +		return;
>> +
>> +	if (!(e.p->pack_keep || e.p->pack_keep_in_core))
>> +		return;
>> +
>> +	tree = get_commit_tree(commit);
>> +	if (tree)
>> +		tree->object.flags |= UNINTERESTING;
>> +
>> +	write_oid(&commit->object.oid, e.p, 0, data);
>> +	write_oid(&tree->object.oid, NULL, 0, data);
>> +}
>> +
>> +static void apkt_show_object(struct object *obj, const char *name,
>> void *data)
>> +{
>> +	return;
>> +}
>> +
>>  #define ALL_INTO_ONE 1
>>  #define LOOSEN_UNREACHABLE 2
>> 
>> @@ -287,6 +313,7 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>>  	int no_update_server_info = 0;
>>  	int midx_cleared = 0;
>> +	int assume_pack_keep_transitive = 0;
>>  	struct pack_objects_args po_args = {NULL};
>>  	int sparse = 0;
>> 
>> @@ -329,6 +356,8 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  				N_("repack objects in packs marked with .keep")),
>>  		OPT_BOOL(0, "sparse", &sparse,
>>  			 N_("use the sparse reachability algorithm")),
>> +		OPT_BOOL(0, "assume-pack-keep-transitive",
>> &assume_pack_keep_transitive,
>> +				N_("assume kept packs reference only kept packs")),
>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>  				N_("do not repack this pack")),
>>  		OPT_END()
>> @@ -346,6 +375,11 @@ int cmd_repack(int argc, const char **argv, const
>> char *prefix)
>>  	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
>>  		die(_("--keep-unreachable and -A are incompatible"));
>> 
>> +	if (assume_pack_keep_transitive) {
>> +		/* imply --honor-pack-keep */
>> +		pack_kept_objects = 0;
>> +	}
>> +
> 
> If you also set `sparse = 1;` here, then...

That seems reasonable.

>> -	if (sparse)
>> +	if (sparse || assume_pack_keep_transitive)
>>  		argv_array_push(&cmd.args, "--sparse");
> 
> ...this logic can stay the same. And be simpler.

Done.

>>  	if (repository_format_partial_clone)
>>  		argv_array_push(&cmd.args, "--exclude-promisor-objects");
>> @@ -407,12 +441,31 @@ int cmd_repack(int argc, const char **argv,
>> const char *prefix)
>>  		argv_array_push(&cmd.args, "--incremental");
>>  	}
>> 
>> -	cmd.no_stdin = 1;
>> +	cmd.in = -1;
>> +	cmd.no_stdin = !assume_pack_keep_transitive;
> 
> I wonder if this `cmd.in = -1;` needs to be here, or should be in
> the `if (assume_pack_keep_transitive)` block below.

It doesn't need to be, I suppose, but it seemed a little less subtle to
just assign it either way, as it will either be ignored or do the right
thing below.

>> 
>>  	ret = start_command(&cmd);
>>  	if (ret)
>>  		return ret;
>> 
>> +	if (assume_pack_keep_transitive) {
>> +		struct rev_info revs;
>> +		const char *revargv[] = { "skip", "--all", "--reflog",
>> "--indexed-objects", NULL };
>> +
>> +		repo_init_revisions(the_repository, &revs, NULL);
> 
>> +		revs.sparse_tree_walk = !!(sparse || assume_pack_keep_transitive);
> 
> Here is the bitflag! Excellent. Again, this can be simplified if we
> set `sparse = 1`
> in the first assume_pack_keep_transitive block.

In fact, maybe it should just be set to 1, because we're only here if...
.
I wonder if this is some cruft left over from how I developed the patch.
I'm OK with either "= 1" or "= sparse", as you prefer.

>> +		setup_revisions(sizeof(revargv)/sizeof(revargv[0]) - 1 , revargv,
>> &revs, NULL);
> 
> Some whitespace strangeness on this line.

Spurious space before , removed.

>> +		if (prepare_revision_walk(&revs))
>> +			die("revision walk setup failed");
> 
> This string should be localizable. It's not entirely your fault, since
> running
> 'git grep -A 1 prepare_revision_walk' shows a variety of different
> uses, most
> of which don't use the localizable string, but enough do that you could
> add it
> here.

Well, I'll try not to be a further bad example!

Thanks for the review.
--nwf;


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-06-25 10:52 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-24 12:07 [PATCH v3 0/5] Speed up repacking when lots of pack-kept objects Nathaniel Filardo
2019-06-24 12:07 ` [PATCH v3 1/5] count-objects: report statistics about kept packs Nathaniel Filardo
2019-06-24 12:52   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 2/5] revision walk: optionally use sparse reachability Nathaniel Filardo
2019-06-24 12:54   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 3/5] repack: add --sparse and pass to pack-objects Nathaniel Filardo
2019-06-24 13:03   ` Derrick Stolee
2019-06-24 12:07 ` [PATCH v3 4/5] repack: optionally assume transitive kept packs Nathaniel Filardo
2019-06-24 13:21   ` Derrick Stolee
2019-06-25 10:32     ` Dr N.W. Filardo
2019-06-24 12:07 ` [PATCH v3 5/5] builtin/gc: add --assume-pack-keep-transitive Nathaniel Filardo

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).