git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/10] repack: support repacking into a geometric sequence
@ 2021-01-19 23:23 Taylor Blau
  2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
                   ` (13 more replies)
  0 siblings, 14 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:23 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

This series introduces a new mode of 'git repack' where (instead of packing just
loose objects or packing everything together into one pack), the set of packs
left forms a geometric progression by object count.

It does not depend on either series of the revindex patches I sent recently.

Roughly speaking, for a given factor, say "d", each pack has at least "d" times
the number of objects as the next largest pack. So, if there are "N" packs,
"P1", "P2", ..., "PN" ordered by object count (where "PN" has the most objects,
and "P1" the fewest), then:

  objects(Pi) > d * objects(P(i-1))

for all 1 < i <= N.
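
As a minimal sketch of the condition above (the helper name 'is_geometric' and
the plain array of object counts are assumptions made for illustration; they
are not part of this series), the check over packs sorted by ascending object
count could look like:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the series: returns 1 if the object
 * counts (sorted in ascending order) satisfy
 *
 *   counts[i] > factor * counts[i - 1]
 *
 * for every i >= 1, and 0 otherwise. Overflow of the multiplication is
 * ignored for brevity.
 */
int is_geometric(const unsigned long *counts, size_t nr,
		 unsigned long factor)
{
	size_t i;

	for (i = 1; i < nr; i++) {
		if (counts[i] <= factor * counts[i - 1])
			return 0;
	}
	return 1;
}
```

With a factor of 2, the counts { 10, 25, 60, 200 } pass this check, while
{ 10, 15, 60, 200 } do not, since 15 is not more than twice 10.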

This is done by first ordering packs by object count, and then determining the
longest sequence of large packs which already form a geometric progression. All
packs on the small side of that cut must be repacked together, and so we check
that the existing progression can be maintained with the new pack, and adjust as
necessary.
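
The two steps above can be sketched roughly as follows. This is a simplified
illustration under stated assumptions, not the code from the final patch: the
function name 'geometric_split' is hypothetical, packs are represented only by
their object counts, and the combined pack is assumed to hold the sum of the
counts of the packs rolled into it (i.e., duplicate objects are ignored).

```c
#include <stddef.h>

/*
 * Hypothetical sketch: given object counts sorted in ascending order,
 * return the number of small packs (a prefix of the array) that would be
 * combined into one new pack. A return value of 0 means nothing needs to
 * be repacked. Overflow is ignored for brevity.
 */
size_t geometric_split(const unsigned long *counts, size_t nr,
		       unsigned long factor)
{
	size_t split = nr ? nr - 1 : 0;
	unsigned long total = 0;
	size_t i;

	/* Step 1: grow the suffix of large packs while the progression holds. */
	while (split > 0 && counts[split] > factor * counts[split - 1])
		split--;

	/* Everything below the cut is combined into one new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/*
	 * Step 2: the new pack must itself fit below the remaining packs;
	 * otherwise roll the next pack into it and check again.
	 */
	while (split < nr && counts[split] <= factor * total) {
		total += counts[split];
		split++;
	}
	return split;
}
```

For example, with a factor of 2 and counts { 5, 10, 12, 100, 300 }, the cut
first lands between 10 and 12, and the pack of 12 is then rolled in as well
because 12 is not more than twice 15. The function returns 3, leaving packs of
27, 100, and 300 objects.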

In actuality, this is approximated in order for 'git repack' to have to create
at most one new pack. The details of this approximation are discussed at length
in the final patch.

'git repack' implements this new option by marking the packs that don't need to
be touched as "frozen" and it does this by marking them as pack_keep_in_core,
and then using a new pack-objects option, '--assume-kept-packs-closed', to
stop the reachability traversal once it encounters any objects in the kept
packs.

When repacking in this mode, the caller implicitly trusts that the unchanged
packs are closed under reachability, and thus they can halt the traversal as
soon as an object in any one of those packs is found.

The first three patches introduce the new revision and pack-objects options
necessary for this to work. The next four patches introduce an MRU cache for
kept packs only. Then a new pack-objects mode is introduced to allow callers to
specify the list of kept packs over stdin in case they are too long to be listed
as arguments. Finally, geometric repacking is introduced.

Thanks in advance for your review.

Jeff King (4):
  p5303: add missing &&-chains
  p5303: measure time to repack with keep
  pack-objects: rewrite honor-pack-keep logic
  packfile: add kept-pack cache for find_kept_pack_entry()

Taylor Blau (6):
  packfile: introduce 'find_kept_pack_entry()'
  revision: learn '--no-kept-objects'
  builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  builtin/pack-objects.c: teach '--keep-pack-stdin'
  builtin/repack.c: extract loose object handling
  builtin/repack.c: add '--geometric' option

 Documentation/git-pack-objects.txt |  19 +++
 Documentation/git-repack.txt       |  11 ++
 Documentation/rev-list-options.txt |   7 +
 builtin/pack-objects.c             | 161 ++++++++++++++--------
 builtin/repack.c                   | 206 ++++++++++++++++++++++++++---
 list-objects.c                     |   7 +
 object-store.h                     |  10 ++
 packfile.c                         |  69 ++++++++++
 packfile.h                         |   2 +
 revision.c                         |  15 +++
 revision.h                         |   4 +
 t/perf/p5303-many-packs.sh         |  18 ++-
 t/t6114-keep-packs.sh              | 128 ++++++++++++++++++
 t/t7703-repack-geometric.sh        |  81 ++++++++++++
 14 files changed, 663 insertions(+), 75 deletions(-)
 create mode 100755 t/t6114-keep-packs.sh
 create mode 100755 t/t7703-repack-geometric.sh

-- 
2.30.0.138.g6d7191ea01


* [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-20 13:40   ` Derrick Stolee
  2021-01-29  2:33   ` Junio C Hamano
  2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s). They
could accomplish this by calling 'find_pack_entry()' and checking
whether the found pack is kept or not, but this is insufficient, since
there may be duplicate objects (and the mru cache makes it unpredictable
which variant we'll get).

Teach this new function to treat the two different kinds of kept packs
(on disk ones with .keep files, as well as in-core ones which are set by
manually poking the 'pack_keep_in_core' bit) separately. This will
become important for callers that only want to respect a certain kind of
kept pack.

Introduce 'find_kept_pack_entry()' which behaves like
'find_pack_entry()', except that it skips over packs which are not
marked kept. Callers will be added in subsequent patches.
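
To illustrate how the flags select between the two kinds of kept pack, here is
a self-contained sketch. The flag values mirror the ones this patch adds to
packfile.h, but 'struct toy_pack' and 'pack_matches()' are hypothetical
stand-ins for illustration only; the real code tests the bits on
'struct packed_git' inline.

```c
#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2
#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)

/* Hypothetical stand-in for the two "kept" bits of 'struct packed_git'. */
struct toy_pack {
	unsigned pack_keep:1;         /* has an on-disk .keep file */
	unsigned pack_keep_in_core:1; /* marked kept in this process only */
};

/*
 * Returns 1 if the pack should be searched for the given flags. A flags
 * value of 0 means "no restriction", so every pack matches.
 */
int pack_matches(const struct toy_pack *p, unsigned flags)
{
	if (!flags)
		return 1;
	return ((flags & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
	       ((flags & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core);
}
```

A pack with only 'pack_keep' set matches ON_DISK_KEEP_PACKS and
ALL_KEEP_PACKS, but not IN_CORE_KEEP_PACKS.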

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.h |  6 +++++
 2 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 62d92e0c7c..30f43a1a35 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2015,7 +2015,10 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static int find_one_pack_entry(struct repository *r,
+			       const struct object_id *oid,
+			       struct pack_entry *e,
+			       int kept_only)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2025,26 +2028,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (fill_midx_entry(r, oid, e, m))
+		if (!(fill_midx_entry(r, oid, e, m)))
+			continue;
+
+		if (!kept_only)
+			return 1;
+
+		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
+		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
-			list_move(&p->mru, &r->objects->packed_git_mru);
-			return 1;
+		if (p->multi_pack_index && !kept_only) {
+			/*
+			 * If this pack is covered by the MIDX, we'd have found
+			 * the object already in the loop above if it was here,
+			 * so don't bother looking.
+			 *
+			 * The exception is if we are looking only at kept
+			 * packs. An object can be present in two packs covered
+			 * by the MIDX, one kept and one not-kept. And as the
+			 * MIDX points to only one copy of each object, it might
+			 * have returned only the non-kept version above. We
+			 * have to check again to be thorough.
+			 */
+			continue;
+		}
+		if (!kept_only ||
+		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
+		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
+			if (fill_pack_entry(oid, e, p)) {
+				list_move(&p->mru, &r->objects->packed_git_mru);
+				return 1;
+			}
 		}
 	}
 	return 0;
 }
 
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+{
+	return find_one_pack_entry(r, oid, e, 0);
+}
+
+int find_kept_pack_entry(struct repository *r,
+			 const struct object_id *oid,
+			 unsigned flags,
+			 struct pack_entry *e)
+{
+	/*
+	 * Load all packs, including midx packs, since our "kept" strategy
+	 * relies on that. We're relying on the side effect of it setting up
+	 * r->objects->packed_git, which is a little ugly.
+	 */
+	get_all_packs(r);
+	return find_one_pack_entry(r, oid, e, flags);
+}
+
 int has_object_pack(const struct object_id *oid)
 {
 	struct pack_entry e;
 	return find_pack_entry(the_repository, oid, &e);
 }
 
+int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+{
+	struct pack_entry e;
+	return find_kept_pack_entry(the_repository, oid, flags, &e);
+}
+
 int has_pack_index(const unsigned char *sha1)
 {
 	struct stat st;
diff --git a/packfile.h b/packfile.h
index a58fc738e0..624327f64d 100644
--- a/packfile.h
+++ b/packfile.h
@@ -161,13 +161,19 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
+#define ON_DISK_KEEP_PACKS 1
+#define IN_CORE_KEEP_PACKS 2
+#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
+
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
  */
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e);
+int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e);
 
 int has_object_pack(const struct object_id *oid);
+int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
 int has_pack_index(const unsigned char *sha1);
 
-- 
2.30.0.138.g6d7191ea01



* [PATCH 02/10] revision: learn '--no-kept-objects'
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
  2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-29  3:10   ` Junio C Hamano
  2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Some callers want to perform a reachability traversal that terminates
when an object is found in a kept pack. The closest existing option is
'--honor-pack-keep', but this isn't quite what we want. Instead of
halting the traversal midway through, a full traversal is always
performed, and the results are only trimmed afterwards.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
between in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (e.g., when doing a geometric repack that has
some "large" packs it wants to leave alone).
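
To see why trimming results after the fact can differ from halting the
traversal, consider this toy sketch. It is not the revision machinery: the
three-object chain, the 'kept' table, and both function names are assumptions
made purely for illustration.

```c
#include <stddef.h>

#define NR_OBJECTS 3

/* Toy history: object 0 -> object 1 -> object 2 (-1 means "no parent"). */
static const int parent[NR_OBJECTS] = { 1, 2, -1 };
/* Only object 1 lives in a "kept" pack. */
static const int kept[NR_OBJECTS] = { 0, 1, 0 };

/* Walk from 'tip' and stop as soon as a kept object is encountered. */
size_t walk_halting(int tip, int *out)
{
	size_t nr = 0;
	int obj;

	for (obj = tip; obj >= 0 && !kept[obj]; obj = parent[obj])
		out[nr++] = obj;
	return nr;
}

/* Walk the whole chain from 'tip', then drop kept objects from the output. */
size_t walk_then_filter(int tip, int *out)
{
	size_t nr = 0;
	int obj;

	for (obj = tip; obj >= 0; obj = parent[obj]) {
		if (!kept[obj])
			out[nr++] = obj;
	}
	return nr;
}
```

Starting from object 0, the halting walk yields only object 0, while the
filtering walk yields objects 0 and 2. The two agree only when everything
reachable from a kept object is itself kept, which is the "closed under
reachability" assumption.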

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/rev-list-options.txt |  7 +++
 list-objects.c                     |  7 +++
 revision.c                         | 15 +++++++
 revision.h                         |  4 ++
 t/t6114-keep-packs.sh              | 69 ++++++++++++++++++++++++++++++
 5 files changed, 102 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 002379056a..817419d552 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -856,6 +856,13 @@ ifdef::git-rev-list[]
 	Only useful with `--objects`; print the object IDs that are not
 	in packs.
 
+--no-kept-objects[=<kind>]::
+	Halts the traversal as soon as an object in a kept pack is
+	found. If `<kind>` is `on-disk`, only packs with a corresponding
+	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
+	with their in-core kept state set are ignored. Otherwise, both
+	kinds of kept packs are ignored.
+
 --object-names::
 	Only useful with `--objects`; print the names of the object IDs
 	that are found. This is the default behavior.
diff --git a/list-objects.c b/list-objects.c
index e19589baa0..b06c3bfeba 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 			ctx->show_object(obj, name, ctx->show_data);
 			continue;
 		}
+		if (ctx->revs->no_kept_objects) {
+			struct pack_entry e;
+			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
+						 ctx->revs->keep_pack_cache_flags,
+						 &e))
+				continue;
+		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
diff --git a/revision.c b/revision.c
index 1bb590ece7..ff1ea77224 100644
--- a/revision.c
+++ b/revision.c
@@ -2334,6 +2334,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3822,6 +3832,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index 086ff10280..15d0e6aee5 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@ struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -312,6 +313,9 @@ struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
  2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
  2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-29  3:21   ` Junio C Hamano
  2021-01-19 23:24 ` [PATCH 04/10] p5303: add missing &&-chains Taylor Blau
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Teach pack-objects an option to imply the revision machinery's new
'--no-kept-objects' option when doing a reachability traversal.

When '--assume-kept-packs-closed' is given as an argument to
pack-objects, it behaves differently (i.e., passes different options to
the ensuing revision walk) depending on whether or not other arguments
are passed:

  - If the caller also specifies a '--keep-pack' argument (to mark a
    pack as kept in-core), then assume that this combination means to
    stop traversal only at in-core packs.

  - If instead the caller passes '--honor-pack-keep', then assume that
    the caller wants to stop traversal only at packs with a
    corresponding .keep file (consistent with the original meaning which
    only refers to packs with a .keep file).

  - If both '--keep-pack' and '--honor-pack-keep' are passed, then
    assume the caller wants to stop traversal at either kind of kept
    pack.
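
The three cases above amount to a small decision table. As a standalone
sketch (the function name 'no_kept_objects_arg' and its two boolean
parameters are hypothetical; the patch itself pushes the option directly onto
the rev-list arguments):

```c
#include <stddef.h>

/*
 * Hypothetical helper: choose the rev-list option implied by
 * '--assume-kept-packs-closed', given whether '--honor-pack-keep' was
 * passed and whether any '--keep-pack' was given. Returns NULL when
 * neither applies, in which case no option is added.
 */
const char *no_kept_objects_arg(int honor_pack_keep, int have_keep_pack)
{
	if (honor_pack_keep && have_keep_pack)
		return "--no-kept-objects";
	if (honor_pack_keep)
		return "--no-kept-objects=on-disk";
	if (have_keep_pack)
		return "--no-kept-objects=in-core";
	return NULL;
}
```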

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt | 11 ++++++
 builtin/pack-objects.c             | 13 +++++++
 t/t6114-keep-packs.sh              | 59 ++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..cbe08e7415 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -135,6 +135,17 @@ depth is 4095.
 	leading directory (e.g. `pack-123.pack`). The option could be
 	specified multiple times to keep multiple packs.
 
+--assume-kept-packs-closed::
+	This flag causes `git rev-list` to halt the object traversal
+	when it encounters an object found in a kept pack. This is
+	dissimilar to `--honor-pack-keep`, which only prunes unwanted
+	results after the full traversal is completed.
++
+Without any `--keep-pack=<pack-name>` arguments, only packs with an
+on-disk `*.keep` file are used when considering when to halt the
+traversal. If other packs are artificially marked as "kept" with
+`--keep-pack`, then those are considered as well.
+
 --incremental::
 	This flag causes an object already in a pack to be ignored
 	even if it would have otherwise been packed.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..a5dcd66f52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -78,6 +78,7 @@ static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep_on_disk;
 static int ignore_packed_keep_in_core;
+static int assume_kept_packs_closed;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -3542,6 +3543,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
 			 N_("ignore packs that have companion .keep file")),
+		OPT_BOOL(0, "assume-kept-packs-closed", &assume_kept_packs_closed,
+			 N_("assume the union of kept packs is closed under reachability")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -3631,6 +3634,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
+	if (assume_kept_packs_closed)
+		use_internal_rev_list = 1;
 
 	if (exclude_promisor_objects) {
 		use_internal_rev_list = 1;
@@ -3711,6 +3716,14 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!p) /* no keep-able packs found */
 			ignore_packed_keep_on_disk = 0;
 	}
+	if (assume_kept_packs_closed) {
+		if (ignore_packed_keep_on_disk && ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects");
+		else if (ignore_packed_keep_on_disk)
+			strvec_push(&rp, "--no-kept-objects=on-disk");
+		else if (ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects=in-core");
+	}
 	if (local) {
 		/*
 		 * unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
index 9239d8aa46..0861305a04 100755
--- a/t/t6114-keep-packs.sh
+++ b/t/t6114-keep-packs.sh
@@ -66,4 +66,63 @@ test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
 	test_cmp expect actual
 '
 
+test_expect_success '--no-kept-objects can respect only in-core keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names packed..kept &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --keep-pack=pack-$MISC_PACK.pack \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success 'setup additional --no-kept-objects tests' '
+	test_commit additional &&
+
+	ADDITIONAL_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/additional
+	^refs/tags/kept
+	EOF
+	)
+'
+
+test_expect_success '--no-kept-objects can respect only on-disk keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names packed
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --honor-pack-keep \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects can respect mixed kept packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --honor-pack-keep \
+	  --keep-pack=pack-$MISC_PACK.pack \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH 04/10] p5303: add missing &&-chains
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (2 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

From: Jeff King <peff@peff.net>

These are in a helper function, so the usual chain-lint doesn't notice
them. This function is still not perfect, as it has some git invocations
on the left-hand-side of the pipe, but its primary purpose is timing,
not finding bugs or correctness issues.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index f4c2ab0584..277d22ec4b 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -24,11 +24,11 @@ repack_into_n () {
 	sed -n '1~5p' |
 	head -n "$1" |
 	perl -e 'print reverse <>' \
-	>pushes
+	>pushes &&
 
 	# create base packfile
 	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack
+	git pack-objects --delta-base-offset --revs staging/pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
-- 
2.30.0.138.g6d7191ea01



* [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (3 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 04/10] p5303: add missing &&-chains Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-29  3:40   ` Junio C Hamano
  2021-01-19 23:24 ` [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic Taylor Blau
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

From: Jeff King <peff@peff.net>

This is the same as the regular repack test, except that we mark the
single base pack as "kept" and use --assume-kept-packs-closed. The
theory is that this should be faster than the normal repack, because
we'll have fewer objects to traverse and process.

And indeed, it is much faster in the single-pack case (all timings
measured on the kernel):

  5303.5: repack (1)                 57.29(54.88+10.39)
  5303.6: repack with keep (1)       1.25(1.19+0.05)

and in the 50-pack case:

  5303.10: repack (50)               89.71(132.78+6.14)
  5303.11: repack with keep (50)     6.92(26.93+0.58)

but our improvements vanish as we approach 1000 packs.

  5303.15: repack (1000)             217.14(493.76+15.29)
  5303.16: repack with keep (1000)   209.46(387.83+8.42)

That's because the code paths around handling .keep files are known to
scale badly; they look in every single pack file to find each object.
Our solution to that was to notice that most repos don't have keep
files, and to make that case a fast path. But as soon as you add a
single .keep, that part of pack-objects slows down again (even if we
have fewer objects total to look at).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 277d22ec4b..85b077b72b 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -27,8 +27,11 @@ repack_into_n () {
 	>pushes &&
 
 	# create base packfile
-	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack &&
+	base_pack=$(
+		head -n 1 pushes |
+		git pack-objects --delta-base-offset --revs staging/pack
+	) &&
+	test_export base_pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
@@ -87,6 +90,15 @@ do
 		  --reflog --indexed-objects --delta-base-offset \
 		  --stdout </dev/null >/dev/null
 	'
+
+	test_perf "repack with keep ($nr_packs)" '
+		git pack-objects --keep-true-parents \
+		  --honor-pack-keep --assume-kept-packs-closed \
+		  --keep-pack=pack-$base_pack.pack \
+		  --non-empty --all \
+		  --reflog --indexed-objects --delta-base-offset \
+		  --stdout </dev/null >/dev/null
+	'
 done
 
 # Measure pack loading with 10,000 packs.
-- 
2.30.0.138.g6d7191ea01



* [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (4 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-19 23:24 ` [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

From: Jeff King <peff@peff.net>

Now that we have find_kept_pack_entry(), we don't have to manually keep
hunting through every pack to find a possible "kept" duplicate of the
object. This should be faster, assuming only a portion of your total
packs are actually kept.

Note that we have to re-order the logic a bit here; we can deal with the
"kept" situation completely, and then just fall back to the "--local"
question. It might be worth having a similar optimized function to look
at only local packs.

Here are the results from p5303 (measurements again taken on the kernel):

  Test                               HEAD^                  HEAD
  ------------------------------------------------------------------------------------
  5303.5: repack (1)                 57.29(54.88+10.39)     56.87(54.63+10.48) -0.7%
  5303.6: repack with keep (1)       1.25(1.19+0.05)        1.26(1.19+0.06) +0.8%
  5303.10: repack (50)               89.71(132.78+6.14)     89.35(132.42+6.25) -0.4%
  5303.11: repack with keep (50)     6.92(26.93+0.58)       6.73(26.61+0.59) -2.7%
  5303.15: repack (1000)             217.14(493.76+15.29)   217.25(494.38+15.24) +0.1%
  5303.16: repack with keep (1000)   209.46(387.83+8.42)    133.12(311.80+8.44) -36.4%

So our case with many packs and a .keep is finally now faster than the
non-keep case (because it gets the speed benefit of looking at fewer
objects, but not as big a penalty for looking at many packs).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 125 ++++++++++++++++++++++++-----------------
 1 file changed, 73 insertions(+), 52 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a5dcd66f52..c84642df98 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1178,7 +1178,8 @@ static int have_duplicate_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int want_found_object(const struct object_id *oid, int exclude,
+			     struct packed_git *p)
 {
 	if (exclude)
 		return 1;
@@ -1199,22 +1200,73 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 * Otherwise, we signal "-1" at the end to tell the caller that we do
 	 * not know either way, and it needs to check more packs.
 	 */
-	if (!ignore_packed_keep_on_disk &&
-	    !ignore_packed_keep_in_core &&
-	    (!local || !have_non_local_packs))
+
+	/*
+	 * Handle .keep first, as we have a fast(er) path there.
+	 */
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+		/*
+		 * Set the flags for the kept-pack cache to be the ones we want
+		 * to ignore.
+		 *
+		 * That is, if we are ignoring objects in on-disk keep packs,
+		 * then we want to search through the on-disk keep and ignore
+		 * the in-core ones.
+		 */
+		unsigned flags = 0;
+		if (ignore_packed_keep_on_disk)
+			flags |= ON_DISK_KEEP_PACKS;
+		if (ignore_packed_keep_in_core)
+			flags |= IN_CORE_KEEP_PACKS;
+
+		if (ignore_packed_keep_on_disk && p->pack_keep)
+			return 0;
+		if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+			return 0;
+		if (has_object_kept_pack(oid, flags))
+			return 0;
+	}
+
+	/*
+	 * At this point we know definitively that either we don't care about
+	 * keep-packs, or the object is not in one. Keep checking other
+	 * conditions...
+	 */
+
+	if (!local || !have_non_local_packs)
 		return 1;
-
 	if (local && !p->pack_local)
 		return 0;
-	if (p->pack_local &&
-	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
-	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
-		return 0;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
+static int want_object_in_pack_one(struct packed_git *p,
+				   const struct object_id *oid,
+				   int exclude,
+				   struct packed_git **found_pack,
+				   off_t *found_offset)
+{
+	off_t offset;
+
+	if (p == *found_pack)
+		offset = *found_offset;
+	else
+		offset = find_pack_entry_one(oid->hash, p);
+
+	if (offset) {
+		if (!*found_pack) {
+			if (!is_pack_valid(p))
+				return -1;
+			*found_offset = offset;
+			*found_pack = p;
+		}
+		return want_found_object(oid, exclude, p);
+	}
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
@@ -1242,7 +1294,7 @@ static int want_object_in_pack(const struct object_id *oid,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
+		want = want_found_object(oid, exclude, *found_pack);
 		if (want != -1)
 			return want;
 	}
@@ -1250,53 +1302,22 @@ static int want_object_in_pack(const struct object_id *oid,
 	for (m = get_multi_pack_index(the_repository); m; m = m->next) {
 		struct pack_entry e;
 		if (fill_midx_entry(the_repository, oid, &e, m)) {
-			struct packed_git *p = e.p;
-			off_t offset;
-
-			if (p == *found_pack)
-				offset = *found_offset;
-			else
-				offset = find_pack_entry_one(oid->hash, p);
-
-			if (offset) {
-				if (!*found_pack) {
-					if (!is_pack_valid(p))
-						continue;
-					*found_offset = offset;
-					*found_pack = p;
-				}
-				want = want_found_object(exclude, p);
-				if (want != -1)
-					return want;
-			}
-		}
-	}
-
-	list_for_each(pos, get_packed_git_mru(the_repository)) {
-		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		off_t offset;
-
-		if (p == *found_pack)
-			offset = *found_offset;
-		else
-			offset = find_pack_entry_one(oid->hash, p);
-
-		if (offset) {
-			if (!*found_pack) {
-				if (!is_pack_valid(p))
-					continue;
-				*found_offset = offset;
-				*found_pack = p;
-			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
-				list_move(&p->mru,
-					  get_packed_git_mru(the_repository));
+			want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
 			if (want != -1)
 				return want;
 		}
 	}
 
+	list_for_each(pos, get_packed_git_mru(the_repository)) {
+		struct packed_git *p = list_entry(pos, struct packed_git, mru);
+		want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+		if (!exclude && want > 0)
+			list_move(&p->mru,
+				  get_packed_git_mru(the_repository));
+		if (want != -1)
+			return want;
+	}
+
 	if (uri_protocols.nr) {
 		struct configured_exclusion *ex =
 			oidmap_get(&configured_exclusions, oid);
-- 
2.30.0.138.g6d7191ea01



* [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (5 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-19 23:24 ` [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin' Taylor Blau
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

From: Jeff King <peff@peff.net>

In a recent patch we added a function 'find_kept_pack_entry()' to look
for an object only among kept packs.

While this function avoids doing any lookup work in non-kept packs, it
is still linear in the number of packs, since we have to traverse the
linked list of packs once per object. Let's cache a reduced version of
that list to save us time.

Note that this cache will last the lifetime of the program. We could
invalidate it on reprepare_packed_git(), but there's not much point in
being rigorous here:

  - we might already fail to notice new .keep packs showing up after the
    program starts. We only reprepare_packed_git() when we fail to find
    an object. But adding a new pack won't cause that to happen.
    Somebody repacking could add a new pack and delete an old one, but
    most of the time we'd have a descriptor or mmap open to the old
    pack anyway, so we might not even notice.

  - in pack-objects we already cache the .keep state at startup, since
    56dfeb6263 (pack-objects: compute local/ignore_pack_keep early,
    2016-07-29). So this is just extending that concept further.

  - we don't have to worry about any packed_git being removed; we always
    keep the old structs around, even after reprepare_packed_git().
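
To illustrate the idea only, here is a minimal standalone sketch of such
a cache. It is not the code in this patch: 'struct pack', 'struct
kept_cache', 'kept_packs()' and the KEEP_* flags are hypothetical
stand-ins for 'struct packed_git' and friends. The reduced list is built
once as a NULL-terminated array and is only rebuilt when a caller asks
for a different set of flags.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Git's real structures and flags. */
#define KEEP_ON_DISK 1u
#define KEEP_IN_CORE 2u

struct pack {
	const char *name;
	int keep_on_disk;
	int keep_in_core;
	struct pack *next;
};

struct kept_cache {
	struct pack **packs;	/* NULL-terminated array of kept packs */
	unsigned flags;		/* flags the array was built for */
	int valid;
};

/*
 * Return the kept packs matching 'flags', rebuilding the cached array
 * only when it is missing or was built for different flags.
 */
struct pack **kept_packs(struct kept_cache *c, struct pack *all, unsigned flags)
{
	size_t nr = 0, total = 0;
	struct pack *p;

	assert(c);
	if (c->valid && c->flags == flags)
		return c->packs;

	free(c->packs);
	c->packs = NULL;
	c->valid = 0;

	/* Size the array for the worst case: every pack is kept. */
	for (p = all; p; p = p->next)
		total++;
	c->packs = malloc((total + 1) * sizeof(*c->packs));
	if (!c->packs)
		return NULL;

	for (p = all; p; p = p->next) {
		if ((p->keep_on_disk && (flags & KEEP_ON_DISK)) ||
		    (p->keep_in_core && (flags & KEEP_IN_CORE)))
			c->packs[nr++] = p;
	}
	c->packs[nr] = NULL;
	c->flags = flags;
	c->valid = 1;
	return c->packs;
}
```

A per-object lookup then walks only the returned array instead of the
full pack list, which is where the saving comes from when most packs are
not kept.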

Here are p5303 results (as always, measured against the kernel):

  Test                               HEAD^                  HEAD
  ------------------------------------------------------------------------------------
  5303.5: repack (1)                 56.87(54.63+10.48)     56.63(54.41+10.36) -0.4%
  5303.6: repack with keep (1)       1.26(1.19+0.06)        1.25(1.19+0.05) -0.8%
  5303.10: repack (50)               89.35(132.42+6.25)     89.49(132.31+6.31) +0.2%
  5303.11: repack with keep (50)     6.73(26.61+0.59)       6.72(26.70+0.53) -0.1%
  5303.15: repack (1000)             217.25(494.38+15.24)   218.69(495.62+14.99) +0.7%
  5303.16: repack with keep (1000)   133.12(311.80+8.44)    128.79(306.96+8.55) -3.3%

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |   4 +-
 object-store.h         |  10 ++++
 packfile.c             | 103 +++++++++++++++++++++++------------------
 packfile.h             |   4 --
 revision.c             |   8 ++--
 5 files changed, 75 insertions(+), 54 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c84642df98..f2c7a1e35b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1215,9 +1215,9 @@ static int want_found_object(const struct object_id *oid, int exclude,
 		 */
 		unsigned flags = 0;
 		if (ignore_packed_keep_on_disk)
-			flags |= ON_DISK_KEEP_PACKS;
+			flags |= CACHE_ON_DISK_KEEP_PACKS;
 		if (ignore_packed_keep_in_core)
-			flags |= IN_CORE_KEEP_PACKS;
+			flags |= CACHE_IN_CORE_KEEP_PACKS;
 
 		if (ignore_packed_keep_on_disk && p->pack_keep)
 			return 0;
diff --git a/object-store.h b/object-store.h
index c4fc9dd74e..4cbe8eae3c 100644
--- a/object-store.h
+++ b/object-store.h
@@ -105,6 +105,14 @@ static inline int pack_map_entry_cmp(const void *unused_cmp_data,
 	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
 }
 
+#define CACHE_ON_DISK_KEEP_PACKS 1
+#define CACHE_IN_CORE_KEEP_PACKS 2
+
+struct kept_pack_cache {
+	struct packed_git **packs;
+	unsigned flags;
+};
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -150,6 +158,8 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	struct kept_pack_cache *kept_pack_cache;
+
 	/*
 	 * A map of packfiles to packed_git structs for tracking which
 	 * packs have been loaded already.
diff --git a/packfile.c b/packfile.c
index 30f43a1a35..25f5407ed0 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2015,10 +2015,7 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int find_one_pack_entry(struct repository *r,
-			       const struct object_id *oid,
-			       struct pack_entry *e,
-			       int kept_only)
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2028,49 +2025,64 @@ static int find_one_pack_entry(struct repository *r,
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (!(fill_midx_entry(r, oid, e, m)))
-			continue;
-
-		if (!kept_only)
-			return 1;
-
-		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
-		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
+		if (fill_midx_entry(r, oid, e, m))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (p->multi_pack_index && !kept_only) {
-			/*
-			 * If this pack is covered by the MIDX, we'd have found
-			 * the object already in the loop above if it was here,
-			 * so don't bother looking.
-			 *
-			 * The exception is if we are looking only at kept
-			 * packs. An object can be present in two packs covered
-			 * by the MIDX, one kept and one not-kept. And as the
-			 * MIDX points to only one copy of each object, it might
-			 * have returned only the non-kept version above. We
-			 * have to check again to be thorough.
-			 */
-			continue;
-		}
-		if (!kept_only ||
-		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
-		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
-			if (fill_pack_entry(oid, e, p)) {
-				list_move(&p->mru, &r->objects->packed_git_mru);
-				return 1;
-			}
+		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
+			list_move(&p->mru, &r->objects->packed_git_mru);
+			return 1;
 		}
 	}
 	return 0;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static void maybe_invalidate_kept_pack_cache(struct repository *r,
+					     unsigned flags)
 {
-	return find_one_pack_entry(r, oid, e, 0);
+	if (!r->objects->kept_pack_cache)
+		return;
+	if (r->objects->kept_pack_cache->flags == flags)
+		return;
+	free(r->objects->kept_pack_cache->packs);
+	FREE_AND_NULL(r->objects->kept_pack_cache);
+}
+
+static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+{
+	maybe_invalidate_kept_pack_cache(r, flags);
+
+	if (!r->objects->kept_pack_cache) {
+		struct packed_git **packs = NULL;
+		size_t nr = 0, alloc = 0;
+		struct packed_git *p;
+
+		/*
+		 * We want "all" packs here, because we need to cover ones that
+		 * are used by a midx, as well. We need to look in every one of
+		 * them (instead of the midx itself) to cover duplicates. It's
+		 * possible that an object is found in two packs that the midx
+		 * covers, one kept and one not kept, but the midx returns only
+		 * the non-kept version.
+		 */
+		for (p = get_all_packs(r); p; p = p->next) {
+			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
+			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
+				ALLOC_GROW(packs, nr + 1, alloc);
+				packs[nr++] = p;
+			}
+		}
+		ALLOC_GROW(packs, nr + 1, alloc);
+		packs[nr] = NULL;
+
+		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
+		r->objects->kept_pack_cache->packs = packs;
+		r->objects->kept_pack_cache->flags = flags;
+	}
+
+	return r->objects->kept_pack_cache->packs;
 }
 
 int find_kept_pack_entry(struct repository *r,
@@ -2078,13 +2090,15 @@ int find_kept_pack_entry(struct repository *r,
 			 unsigned flags,
 			 struct pack_entry *e)
 {
-	/*
-	 * Load all packs, including midx packs, since our "kept" strategy
-	 * relies on that. We're relying on the side effect of it setting up
-	 * r->objects->packed_git, which is a little ugly.
-	 */
-	get_all_packs(r);
-	return find_one_pack_entry(r, oid, e, flags);
+	struct packed_git **cache;
+
+	for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+		struct packed_git *p = *cache;
+		if (fill_pack_entry(oid, e, p))
+			return 1;
+	}
+
+	return 0;
 }
 
 int has_object_pack(const struct object_id *oid)
@@ -2093,7 +2107,8 @@ int has_object_pack(const struct object_id *oid)
 	return find_pack_entry(the_repository, oid, &e);
 }
 
-int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+int has_object_kept_pack(const struct object_id *oid,
+			 unsigned flags)
 {
 	struct pack_entry e;
 	return find_kept_pack_entry(the_repository, oid, flags, &e);
diff --git a/packfile.h b/packfile.h
index 624327f64d..eb56db2a7b 100644
--- a/packfile.h
+++ b/packfile.h
@@ -161,10 +161,6 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
-#define ON_DISK_KEEP_PACKS 1
-#define IN_CORE_KEEP_PACKS 2
-#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
-
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
diff --git a/revision.c b/revision.c
index ff1ea77224..ce87081b8e 100644
--- a/revision.c
+++ b/revision.c
@@ -2336,14 +2336,14 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		die(_("--unpacked=<packfile> no longer supported"));
 	} else if (!strcmp(arg, "--no-kept-objects")) {
 		revs->no_kept_objects = 1;
-		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
-		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
 	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
 		revs->no_kept_objects = 1;
 		if (!strcmp(optarg, "in-core"))
-			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+			revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
 		if (!strcmp(optarg, "on-disk"))
-			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+			revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
-- 
2.30.0.138.g6d7191ea01



* [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin'
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (6 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Add a shortcut to specify '--keep-pack=<pack-name>' arguments over
stdin, in case a caller wishes to indicate more kept packs than the
argument limit will allow.

Passing this option overrides any other option to 'git pack-objects'
that takes input over stdin. For example, '--revs' still forces a
reachability traversal, but will not accept any revision arguments over
stdin. Use of '--keep-pack-stdin' within Git is limited to one caller
(added in a subsequent patch) which does not pass any other input over
stdin.
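
As a rough illustration of the input format (one pack name per line), a
standalone reader might look like the sketch below. It is not the code
in this patch, which uses strbuf and string_list; 'read_pack_names()'
and the fixed 4096-byte line buffer are assumptions made to keep the
example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read newline-delimited pack names from 'in'. On return, *out points
 * to a heap-allocated array of heap-allocated strings and the number
 * of names is returned. Reading stops early if allocation fails.
 * Assumption: no line is longer than the 4096-byte buffer.
 */
size_t read_pack_names(FILE *in, char ***out)
{
	char line[4096];
	char **names = NULL;
	size_t nr = 0, alloc = 0;

	assert(in && out);
	while (fgets(line, sizeof(line), in)) {
		size_t len = strlen(line);

		/* Drop the trailing newline, if any. */
		if (len && line[len - 1] == '\n')
			line[--len] = '\0';

		if (nr == alloc) {
			char **tmp;
			alloc = alloc ? 2 * alloc : 8;
			tmp = realloc(names, alloc * sizeof(*names));
			if (!tmp)
				break;
			names = tmp;
		}
		names[nr] = malloc(len + 1);
		if (!names[nr])
			break;
		memcpy(names[nr], line, len + 1);
		nr++;
	}
	*out = names;
	return nr;
}
```

Each name read this way would then be treated as if it had been given as
a separate '--keep-pack=<pack-name>' argument.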

No new tests are added here, since a caller from 'git repack' will
exercise these options in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  8 ++++++++
 builtin/pack-objects.c             | 23 ++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index cbe08e7415..45ecc4e9e5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -135,6 +135,14 @@ depth is 4095.
 	leading directory (e.g. `pack-123.pack`). The option could be
 	specified multiple times to keep multiple packs.
 
+--keep-pack-stdin::
+	Take a list of line-delimited `<pack-name>` arguments, treating
+	them as if they were each passed as `--keep-pack=<pack-name>`.
+	Useful for when many packs are being kept to avoid argument
+	length limitations. Requires that `--revs` be passed or implied,
+	but does not allow the caller to pass additional traversal
+	arguments over standard input.
+
 --assume-kept-packs-closed::
 	This flag causes `git rev-list` to halt the object traversal
 	when it encounters an object found in a kept pack. This is
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f2c7a1e35b..f528a07d78 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3343,7 +3343,7 @@ static void record_recent_commit(struct commit *commit, void *data)
 	oid_array_append(&recent_objects, &commit->object.oid);
 }
 
-static void get_object_list(int ac, const char **av)
+static void get_object_list(int ac, const char **av, int read_from_stdin)
 {
 	struct rev_info revs;
 	struct setup_revision_opt s_r_opt = {
@@ -3363,7 +3363,7 @@ static void get_object_list(int ac, const char **av)
 	save_warning = warn_on_object_refname_ambiguity;
 	warn_on_object_refname_ambiguity = 0;
 
-	while (fgets(line, sizeof(line), stdin) != NULL) {
+	while (read_from_stdin && fgets(line, sizeof(line), stdin) != NULL) {
 		int len = strlen(line);
 		if (len && line[len - 1] == '\n')
 			line[--len] = 0;
@@ -3487,6 +3487,15 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static void collect_kept_packs(struct string_list *keep_pack_list)
+{
+	struct strbuf buf = STRBUF_INIT;
+	while (strbuf_getline(&buf, stdin) != EOF)
+		string_list_append(keep_pack_list,
+				   strbuf_detach(&buf, NULL));
+	strbuf_release(&buf);
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3496,6 +3505,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
+	int keep_pack_stdin = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
 			    N_("do not show progress meter"), 0),
@@ -3568,6 +3578,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("assume the union of kept packs is closed under reachability")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
+		OPT_BOOL(0, "keep-pack-stdin", &keep_pack_stdin,
+			 N_("read the list of kept packs from stdin")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
 			    N_("pack compression level")),
 		OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents,
@@ -3728,6 +3740,11 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (progress && all_progress_implied)
 		progress = 2;
 
+	if (keep_pack_stdin) {
+		if (!use_internal_rev_list)
+			die(_("--keep-pack-stdin requires --revs"));
+		collect_kept_packs(&keep_pack_list);
+	}
 	add_extra_kept_packs(&keep_pack_list);
 	if (ignore_packed_keep_on_disk) {
 		struct packed_git *p;
@@ -3769,7 +3786,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
-		get_object_list(rp.nr, rp.v);
+		get_object_list(rp.nr, rp.v, !keep_pack_stdin);
 		strvec_clear(&rp);
 	}
 	cleanup_preferred_base();
-- 
2.30.0.138.g6d7191ea01



* [PATCH 09/10] builtin/repack.c: extract loose object handling
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (7 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin' Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-20 13:59   ` Derrick Stolee
  2021-01-19 23:24 ` [PATCH 10/10] builtin/repack.c: add '--geometric' option Taylor Blau
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

'git repack -g' will have to learn about unreachable loose objects that
need to be removed in a separate path from the existing checks.

Extract that check into a function so it can be called from multiple
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 41 +++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 279be11a16..664863111b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -298,6 +298,27 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
+static void handle_loose_and_reachable(struct child_process *cmd,
+				       const char *unpack_unreachable,
+				       int pack_everything,
+				       int keep_unreachable)
+{
+	if (unpack_unreachable) {
+		strvec_pushf(&cmd->args,
+			     "--unpack-unreachable=%s",
+			     unpack_unreachable);
+		strvec_push(&cmd->env_array, "GIT_REF_PARANOIA=1");
+	} else if (pack_everything & LOOSEN_UNREACHABLE) {
+		strvec_push(&cmd->args,
+			    "--unpack-unreachable");
+	} else if (keep_unreachable) {
+		strvec_push(&cmd->args, "--keep-unreachable");
+		strvec_push(&cmd->args, "--pack-loose-unreachable");
+	} else {
+		strvec_push(&cmd->env_array, "GIT_REF_PARANOIA=1");
+	}
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -414,22 +435,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_packs.nr && delete_redundant) {
-			if (unpack_unreachable) {
-				strvec_pushf(&cmd.args,
-					     "--unpack-unreachable=%s",
-					     unpack_unreachable);
-				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
-			} else if (pack_everything & LOOSEN_UNREACHABLE) {
-				strvec_push(&cmd.args,
-					    "--unpack-unreachable");
-			} else if (keep_unreachable) {
-				strvec_push(&cmd.args, "--keep-unreachable");
-				strvec_push(&cmd.args, "--pack-loose-unreachable");
-			} else {
-				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
-			}
-		}
+		if (existing_packs.nr && delete_redundant)
+			handle_loose_and_reachable(&cmd, unpack_unreachable,
+						   pack_everything,
+						   keep_unreachable);
 	} else {
 		strvec_push(&cmd.args, "--unpacked");
 		strvec_push(&cmd.args, "--incremental");
-- 
2.30.0.138.g6d7191ea01



* [PATCH 10/10] builtin/repack.c: add '--geometric' option
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (8 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
@ 2021-01-19 23:24 ` Taylor Blau
  2021-01-20 14:05 ` [PATCH 00/10] repack: support repacking into a geometric sequence Derrick Stolee
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-19 23:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Often it is useful to both:

  - have relatively few packfiles in a repository, and

  - avoid having so few packfiles in a repository that we repack its
    entire contents regularly

This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn, and that packfile Pi contains 'objects(Pi)' objects.
With a geometric factor of 'r', it should be that:

  objects(Pi) > r*objects(P(i-1))

for all 1 < i <= n, where the packs are sorted by

  objects(P1) <= objects(P2) <= ... <= objects(Pn).
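
As a small standalone sketch (not part of this patch), the condition can
be checked over an ascending array of object counts as follows. The
function name 'is_geometric()' is hypothetical, and the comparison is
assumed to be non-strict ("at least <factor> times"), which is the
reading under which counts such as 1, 2, 4, 32 form a progression for a
factor of 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if counts[] (sorted ascending) forms a geometric
 * progression with the given factor, i.e. every pack has at least
 * 'factor' times as many objects as the pack before it.
 * Assumption: non-strict comparison (>=).
 */
int is_geometric(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i;

	assert(counts || !nr);
	for (i = 1; i < nr; i++) {
		/* Widen to 64 bits so factor * count cannot overflow. */
		if ((uint64_t)counts[i] < (uint64_t)factor * counts[i - 1])
			return 0;
	}
	return 1;
}
```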

Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:

  1. We assume that there is a cutoff of packs _before starting the
     repack_ where everything to the right of that cut-off already forms
     a geometric progression (or no cutoff exists and everything must be
     repacked).

  2. We assume that everything smaller than the cutoff count must be
     repacked. This forms our base assumption, but it can also cause
     even the "heavy" packs to get repacked. For example, with a factor
     of 2, if we have 6 packs containing the following number of objects:

       1, 1, 1, 2, 4, 32

     then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
     rolling up the first two packs into a pack with 2 objects. That
     breaks our progression and leaves us:

       2, 1, 2, 4, 32
         ^

     (where the '^' indicates the position of our split). To restore a
     progression, we move the split forward (towards larger packs)
     joining each pack into our new pack until a geometric progression
     is restored. Here, that looks like:

       2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
         ^                   ^                ^                   ^
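
The two steps above can be sketched on a plain array of object counts.
This is a simplified illustration that follows the description in this
message, not the 'split_pack_geometry()' function in the diff below;
'geometric_split()' is a hypothetical name, and the non-strict
comparison (>=) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds per-pack object counts, sorted ascending. Returns
 * how many of the lightest packs get rolled up into one new pack;
 * packs at index >= the return value are left untouched.
 */
size_t geometric_split(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t split = nr ? nr - 1 : 0;
	uint64_t total = 0;
	size_t i;

	assert(counts || !nr);

	/*
	 * Step 1: starting from the heaviest pack, extend the cut to
	 * the left for as long as the packs form a progression.
	 */
	while (split > 0 &&
	       (uint64_t)counts[split] >= (uint64_t)factor * counts[split - 1])
		split--;

	/* Everything left of the cut goes into the new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/*
	 * Step 2: absorb heavier packs into the new pack until the
	 * next untouched pack is at least 'factor' times its size.
	 */
	while (split < nr &&
	       (uint64_t)counts[split] < (uint64_t)factor * total) {
		total += counts[split];
		split++;
	}

	return split;
}
```

For the counts 1, 1, 1, 2, 4, 32 and a factor of 2, this returns 5: the
five lightest packs are combined into one pack of 9 objects and the pack
with 32 objects is left alone, matching the walk-through above.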

This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  11 +++
 builtin/repack.c             | 165 ++++++++++++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  |  81 +++++++++++++++++
 3 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100755 t/t7703-repack-geometric.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 92f146d27d..b1ffcfd974 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -165,6 +165,17 @@ depth is 4095.
 	Pass the `--delta-islands` option to `git-pack-objects`, see
 	linkgit:git-pack-objects[1].
 
+-g=<factor>::
+--geometric=<factor>::
+	Arrange resulting pack structure so that each successive pack
+	contains at least `<factor>` times the number of objects as the
+	next-largest pack.
++
+`git repack` ensures this by determining a "cut" of packfiles that need to be
+repacked into one in order to ensure a geometric progression. It picks the
+smallest set of packfiles such that as many of the larger packfiles (by count of
+objects contained in that pack) may be left intact.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 664863111b..083088ae1f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -298,6 +298,116 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
+struct pack_geometry {
+	struct packed_git **pack;
+	uint32_t pack_nr, pack_alloc;
+	uint32_t split;
+};
+
+static uint32_t geometry_pack_weight(struct packed_git *p)
+{
+	if (open_pack_index(p))
+		die(_("cannot open index for %s"), p->pack_name);
+	return p->num_objects;
+}
+
+static int geometry_cmp(const void *va, const void *vb)
+{
+	uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
+		 bw = geometry_pack_weight(*(struct packed_git **)vb);
+
+	if (aw < bw)
+		return -1;
+	if (aw > bw)
+		return 1;
+	return 0;
+}
+
+static void init_pack_geometry(struct pack_geometry **geometry_p)
+{
+	struct packed_git *p;
+	struct pack_geometry *geometry;
+
+	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
+	geometry = *geometry_p;
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		ALLOC_GROW(geometry->pack,
+			   geometry->pack_nr + 1,
+			   geometry->pack_alloc);
+
+		geometry->pack[geometry->pack_nr] = p;
+		geometry->pack_nr++;
+	}
+
+	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
+}
+
+static void split_pack_geometry(struct pack_geometry *geometry, int factor)
+{
+	uint32_t i;
+	uint32_t split;
+	off_t total_size = 0;
+
+	split = geometry->pack_nr - 1;
+
+	/*
+	 * First, count the number of packs (in descending order of size) which
+	 * already form a geometric progression.
+	 */
+	for (i = geometry->pack_nr - 1; i > 0; i--) {
+		struct packed_git *ours = geometry->pack[i];
+		struct packed_git *prev = geometry->pack[i - 1];
+		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
+			split--;
+		else
+			break;
+	}
+
+	if (split) {
+		/*
+		 * Move the split one to the right, since the top element in the
+		 * last-compared pair can't be in the progression. Only do this
+		 * when we split in the middle of the array (otherwise if we got
+		 * to the end, then the split is in the right place).
+		 */
+		split++;
+	}
+
+	/*
+	 * Then, anything to the left of 'split' must be in a new pack. But,
+	 * creating that new pack may cause packs in the heavy half to no longer
+	 * form a geometric progression.
+	 *
+	 * Compute an expected size of the new pack, and then determine how many
+	 * packs in the heavy half need to be joined into it (if any) to restore
+	 * the geometric progression.
+	 */
+	for (i = 0; i < split; i++)
+		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = split; i < geometry->pack_nr; i++) {
+		struct packed_git *ours = geometry->pack[i];
+		if (geometry_pack_weight(ours) < factor * total_size) {
+			split++;
+			total_size += geometry_pack_weight(ours);
+		} else
+			break;
+	}
+
+	geometry->split = split;
+}
+
+static void clear_pack_geometry(struct pack_geometry *geometry)
+{
+	if (!geometry)
+		return;
+
+	free(geometry->pack);
+	geometry->pack_nr = 0;
+	geometry->pack_alloc = 0;
+	geometry->split = 0;
+}
+
 static void handle_loose_and_reachable(struct child_process *cmd,
 				       const char *unpack_unreachable,
 				       int pack_everything,
@@ -326,6 +436,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
@@ -338,6 +449,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	int geometric_factor = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -378,6 +490,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
+		OPT_INTEGER('g', "geometric", &geometric_factor,
+			    N_("find a geometric progression with factor <N>")),
 		OPT_END()
 	};
 
@@ -404,6 +518,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
 		die(_(incremental_bitmap_conflict_error));
 
+	if (geometric_factor) {
+		init_pack_geometry(&geometry);
+		split_pack_geometry(geometry, geometric_factor);
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
@@ -439,17 +558,41 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			handle_loose_and_reachable(&cmd, unpack_unreachable,
 						   pack_everything,
 						   keep_unreachable);
+	} else if (geometry) {
+		strvec_push(&cmd.args, "--keep-pack-stdin");
+		strvec_push(&cmd.args, "--honor-pack-keep");
+		strvec_push(&cmd.args, "--assume-kept-packs-closed");
+		if (delete_redundant)
+			handle_loose_and_reachable(&cmd, unpack_unreachable,
+						   pack_everything,
+						   keep_unreachable);
 	} else {
 		strvec_push(&cmd.args, "--unpacked");
 		strvec_push(&cmd.args, "--incremental");
 	}
 
-	cmd.no_stdin = 1;
+	if (geometry)
+		cmd.in = -1;
+	else
+		cmd.no_stdin = 1;
 
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
 
+	if (geometry) {
+		FILE *in = xfdopen(cmd.in, "w");
+		/*
+		 * Tell 'git pack-objects' to leave alone the packs that
+		 * already form a geometric progression.
+		 *
+		 * Everything else will get picked up by the reachability walk.
+		 */
+		for (i = geometry->split; i < geometry->pack_nr; i++)
+			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
+		fclose(in);
+	}
+
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		if (line.len != the_hash_algo->hexsz)
@@ -517,6 +660,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			if (!string_list_has_string(&names, sha1))
 				remove_redundant_pack(packdir, item->string);
 		}
+
+		if (geometry) {
+			struct strbuf buf = STRBUF_INIT;
+
+			uint32_t i;
+			for (i = 0; i < geometry->split; i++) {
+				struct packed_git *p = geometry->pack[i];
+				if (string_list_has_string(&names,
+							   hash_to_hex(p->hash)))
+					continue;
+
+				strbuf_reset(&buf);
+				strbuf_addstr(&buf, pack_basename(p));
+				strbuf_strip_suffix(&buf, ".pack");
+
+				remove_redundant_pack(packdir, buf.buf);
+			}
+			strbuf_release(&buf);
+		}
 		if (!po_args.quiet && isatty(2))
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
@@ -538,6 +700,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
 	return 0;
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
new file mode 100755
index 0000000000..39cef892f8
--- /dev/null
+++ b/t/t7703-repack-geometric.sh
@@ -0,0 +1,81 @@
+#!/bin/sh
+
+test_description='git repack --geometric works correctly'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success '--geometric with an intact progression' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		test_commit_bulk --start=4 4 && # 12 objects
+
+		find $objdir/pack -name "*.pack" | sort >expect &&
+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >actual &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--geometric with small-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		find $objdir/pack -name "*.pack" | sort >small &&
+		test_commit_bulk --start=3 4 && # 12 objects
+		test_commit_bulk --start=7 8 && # 24 objects
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
+
+		# Three packs in total; two of the existing large ones, and one
+		# new one.
+		find $objdir/pack -name "*.pack" | sort >after &&
+		test_line_count = 3 after &&
+		comm -3 small before | tr -d "\t" >large &&
+		grep -qFf large after
+	)
+'
+
+test_expect_success '--geometric with small- and large-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# size(small1) + size(small2) > size(medium) / 2
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		test_commit_bulk --start=2 3 && # 7 objects
+		test_commit_bulk --start=6 9 && # 27 objects
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
+
+		find $objdir/pack -name "*.pack" | sort >after &&
+		comm -12 before after >untouched &&
+
+		# Two packs in total; the largest pack from before running "git
+		# repack", and one new one.
+		test_line_count = 1 untouched &&
+		test_line_count = 2 after
+	)
+'
+
+test_done
-- 
2.30.0.138.g6d7191ea01

* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-01-20 13:40   ` Derrick Stolee
  2021-01-20 14:38     ` Taylor Blau
  2021-01-29  2:33   ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-01-20 13:40 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 1/19/2021 6:24 PM, Taylor Blau wrote:
>  	for (m = r->objects->multi_pack_index; m; m = m->next) {
> -		if (fill_midx_entry(r, oid, e, m))
> +		if (!(fill_midx_entry(r, oid, e, m)))

nit: we don't need extra parens around fill_midx_entry().

> -		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
> -			list_move(&p->mru, &r->objects->packed_git_mru);
> -			return 1;
> +		if (p->multi_pack_index && !kept_only) {
> +			/*
> +			 * If this pack is covered by the MIDX, we'd have found
> +			 * the object already in the loop above if it was here,
> +			 * so don't bother looking.
> +			 *
> +			 * The exception is if we are looking only at kept
> +			 * packs. An object can be present in two packs covered
> +			 * by the MIDX, one kept and one not-kept. And as the
> +			 * MIDX points to only one copy of each object, it might
> +			 * have returned only the non-kept version above. We
> +			 * have to check again to be thorough.
> +			 */
> +			continue;
> +		}
> +		if (!kept_only ||
> +		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
> +		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
> +			if (fill_pack_entry(oid, e, p)) {
> +				list_move(&p->mru, &r->objects->packed_git_mru);
> +				return 1;
> +			}

Here is the meat of your patch. The comment helps a lot.

This might have been easier if the MIDX had preferred kept packs
over non-kept packs (before sorting by modified time). Perhaps
the MIDX could get an extra field to say "I preferred kept packs"
which would let us trust the MIDX return here without the pack
loop.

(Note: we can't just change the MIDX selection and then start
trusting all MIDXs to have the right tie-breakers because of
existing files in the wild.)

Thanks,
-Stolee

* Re: [PATCH 09/10] builtin/repack.c: extract loose object handling
  2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
@ 2021-01-20 13:59   ` Derrick Stolee
  2021-01-20 14:34     ` Taylor Blau
  2021-01-21  3:45     ` Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-01-20 13:59 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 1/19/2021 6:24 PM, Taylor Blau wrote:
> 'git repack -g' will have to learn about unreachable loose objects that

This reference to the '-g' option is one patch too early. Perhaps
say

  An upcoming patch will introduce geometric repacking. This will
  require removing unreachable loose objects in a separate path
  from the existing checks.

or similar?

Thanks,
-Stolee


* Re: [PATCH 00/10] repack: support repacking into a geometric sequence
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (9 preceding siblings ...)
  2021-01-19 23:24 ` [PATCH 10/10] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-01-20 14:05 ` Derrick Stolee
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-01-20 14:05 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 1/19/2021 6:23 PM, Taylor Blau wrote:
> This series introduces a new mode of 'git repack' where (instead of packing just
> loose objects or packing everything together into one pack), the set of packs
> left forms a geometric progression by object count.
...
> Thanks in advance for your review.

I had the pleasure of reading an early version of this series, but it's
been a while. Upon a fresh reading, I only had nitpicks. Otherwise, this
LGTM.

I encourage other reviewers to read patch 10 carefully, as that is the
most math-heavy of all of them.

Thanks,
-Stolee
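
A minimal sketch of the split computation described in the cover
letter, written against a plain array of object counts instead of
'struct packed_git'. The name 'geometric_split' and its signature are
hypothetical. The first loop is an assumption reconstructed from the
cover letter's description; the second loop mirrors the comparison
visible at the end of split_pack_geometry() in patch 10.

```c
#include <stddef.h>

/*
 * Sketch only. 'counts' holds per-pack object counts in ascending
 * order. Returns the split index: packs [0, split) are rolled up into
 * one new pack, and packs [split, nr) are left untouched.
 */
size_t geometric_split(const unsigned long *counts, size_t nr,
		       unsigned long factor)
{
	size_t i, split;
	unsigned long total = 0;

	if (!nr)
		return 0;

	/* Walk down from the largest pack until the progression breaks. */
	for (i = nr - 1; i > 0; i--) {
		if (counts[i] < factor * counts[i - 1])
			break;
	}

	split = i;
	/*
	 * If the walk stopped early, the larger pack of the failing pair
	 * cannot stay in the progression either, so include it in the
	 * rollup.
	 */
	if (split)
		split++;

	/* Estimate the new pack as the sum of everything being rolled up. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb larger packs that the new pack would now violate. */
	for (i = split; i < nr; i++) {
		if (counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else
			break;
	}

	return split;
}
```

With a factor of 2, the object counts noted in t7703's comments give
split 0 for {3, 6, 12}, split 2 for {3, 3, 12, 24}, and split 3 for
{3, 3, 7, 27}, which lines up with the 3, 3, and 2 packs those tests
expect to find afterwards.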

* Re: [PATCH 09/10] builtin/repack.c: extract loose object handling
  2021-01-20 13:59   ` Derrick Stolee
@ 2021-01-20 14:34     ` Taylor Blau
  2021-01-20 15:51       ` Derrick Stolee
  2021-01-21  3:45     ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-20 14:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

On Wed, Jan 20, 2021 at 08:59:48AM -0500, Derrick Stolee wrote:
> On 1/19/2021 6:24 PM, Taylor Blau wrote:
> > 'git repack -g' will have to learn about unreachable loose objects that
>
> This reference to the '-g' option is one patch too early. Perhaps
> say
>
>   An upcoming patch will introduce geometric repacking. This will
>   require removing unreachable loose objects in a separate path
>   from the existing checks.
>
> or similar?

Mmm. I had imagined that this would be read either in the context of
this series, or by someone in the future long after 'git repack -g' had
been introduced.

I could see that it's confusing, though, and I do agree your wording
makes clearer that the option doesn't exist yet.

I'm happy to send a replacement or reroll if you feel strongly, but in
either case I'll wait for a little more review first.

Thanks,
Taylor

* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-20 13:40   ` Derrick Stolee
@ 2021-01-20 14:38     ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-20 14:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

On Wed, Jan 20, 2021 at 08:40:22AM -0500, Derrick Stolee wrote:
> On 1/19/2021 6:24 PM, Taylor Blau wrote:
> >  	for (m = r->objects->multi_pack_index; m; m = m->next) {
> > -		if (fill_midx_entry(r, oid, e, m))
> > +		if (!(fill_midx_entry(r, oid, e, m)))
>
> nit: we don't need extra parens around fill_midx_entry().

Yep. I checked whether we should have written this as "if
(fill_midx_entry(...) < 0)", but fill_midx_entry returns a positive
number on error, so checking "!fill_midx_entry" is certainly what we
should be doing.

> > -		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
> > -			list_move(&p->mru, &r->objects->packed_git_mru);
> > -			return 1;
> > +		if (p->multi_pack_index && !kept_only) {
> > +			/*
> > +			 * If this pack is covered by the MIDX, we'd have found
> > +			 * the object already in the loop above if it was here,
> > +			 * so don't bother looking.
> > +			 *
> > +			 * The exception is if we are looking only at kept
> > +			 * packs. An object can be present in two packs covered
> > +			 * by the MIDX, one kept and one not-kept. And as the
> > +			 * MIDX points to only one copy of each object, it might
> > +			 * have returned only the non-kept version above. We
> > +			 * have to check again to be thorough.
> > +			 */
> > +			continue;
> > +		}
> > +		if (!kept_only ||
> > +		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
> > +		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
> > +			if (fill_pack_entry(oid, e, p)) {
> > +				list_move(&p->mru, &r->objects->packed_git_mru);
> > +				return 1;
> > +			}
>
> Here is the meat of your patch. The comment helps a lot.
>
> This might have been easier if the MIDX had preferred kept packs
> over non-kept packs (before sorting by modified time). Perhaps
> the MIDX could get an extra field to say "I preferred kept packs"
> which would let us trust the MIDX return here without the pack
> loop.
>
> (Note: we can't just change the MIDX selection and then start
> trusting all MIDXs to have the right tie-breakers because of
> existing files in the wild.)

Yeah, that is what makes it tricky. Changing the code isn't so hard: a
new field that we check and do one of two things when we're breaking
ties.

But I think the cognitive load is high, and I'm not sure that the
benefit (skipping another linear pass through non-MIDX'd packs when
looking up an object in kept packs only _and_ that object is duplicated)
is worth the extra hassle with the MIDX code.

All of that said, I do think that it's worth revisiting this and giving
it some more thought after multi-pack bitmaps to see whether we feel the
same or not.

Thanks,
Taylor

* Re: [PATCH 09/10] builtin/repack.c: extract loose object handling
  2021-01-20 14:34     ` Taylor Blau
@ 2021-01-20 15:51       ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-01-20 15:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

On 1/20/2021 9:34 AM, Taylor Blau wrote:
> On Wed, Jan 20, 2021 at 08:59:48AM -0500, Derrick Stolee wrote:
>> On 1/19/2021 6:24 PM, Taylor Blau wrote:
>>> 'git repack -g' will have to learn about unreachable loose objects that
>>
>> This reference to the '-g' option is one patch too early. Perhaps
>> say
>>
>>   An upcoming patch will introduce geometric repacking. This will
>>   require removing unreachable loose objects in a separate path
>>   from the existing checks.
>>
>> or similar?
> 
> Mmm. I had imagined that this would be read either in the context of
> this series, or by someone in the future long after 'git repack -g' had
> been introduced.
> 
> I could see that it's confusing, though, and I do agree your wording
> makes clearer that the option doesn't exist yet.
> 
> I'm happy to send a replacement or reroll if you feel strongly, but in
> either case I'll wait for a little more review first.

Definitely don't rush a re-roll for my nit-picks.

Thanks,
-Stolee

* Re: [PATCH 09/10] builtin/repack.c: extract loose object handling
  2021-01-20 13:59   ` Derrick Stolee
  2021-01-20 14:34     ` Taylor Blau
@ 2021-01-21  3:45     ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-21  3:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

Derrick Stolee <stolee@gmail.com> writes:

> On 1/19/2021 6:24 PM, Taylor Blau wrote:
>> 'git repack -g' will have to learn about unreachable loose objects that
>
> This reference to the '-g' option is one patch too early. Perhaps
> say
>
>   An upcoming patch will introduce geometric repacking. This will
>   require removing unreachable loose objects in a separate path
>   from the existing checks.
>
> or similar?

Yeah, sounds like a trivially obvious improvement to me.  

It does not matter to reviewers who are very well aware that the
series is about adding "repack -g", but it may end up being
confusing when somebody later tries to find the commit that added the
feature and the help from the cover letter is not available.

* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
  2021-01-20 13:40   ` Derrick Stolee
@ 2021-01-29  2:33   ` Junio C Hamano
  2021-01-29 18:38     ` Taylor Blau
  2021-01-29 19:31     ` Jeff King
  1 sibling, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29  2:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> Future callers will want a function to fill a 'struct pack_entry' for a
> given object id but _only_ from its position in any kept pack(s). They
> could accomplish this by calling 'find_pack_entry()' and checking
> whether the found pack is kept or not, but this is insufficient, since
> there may be duplicate objects (and the mru cache makes it unpredictable
> which variant we'll get).

I wonder if we eventually need a callback interface to walk _all_
pack entries for a given object, so that "I am only interested in
instances in kept packs" will be under total control of the callers.
As it stands, it is "just grab any one that is in a kept pack, any
one of them is fine", which is of almost as narrow utility as
the original's "just grab the first one---any one of them is fine",
the latter of which is "insufficient" as the log message says.

But this (in the context of the remainder of the series) might be
sufficient, at least for now.

> Teach this new function to treat the two different kinds of kept packs
> (on disk ones with .keep files, as well as in-core ones which are set by
> manually poking the 'pack_keep_in_core' bit) separately. This will
> become important for callers that only want to respect a certain kind of
> kept pack.

Or maybe not ;-)

If there is a notable relationship between on-disk and in-core kept
packs (e.g. "the set of on-disk kept packs is a subset of in-core
kept packs", "usually on-disk kept packs get in-core kept bit when
their packed_git instances are populated, but we can drop the bit at
runtime, so on-disk and in-core are pretty much independent and
there is no notable relationship"), it must be explained upfront to
help the reader form a sensible world view.

> Introduce 'find_kept_pack_entry()' which behaves like
> 'find_pack_entry()', except that it skips over packs which are not
> marked kept. Callers will be added in subsequent patches.
>
> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Fun.

Thanks.

* Re: [PATCH 02/10] revision: learn '--no-kept-objects'
  2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-01-29  3:10   ` Junio C Hamano
  2021-01-29 19:13     ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29  3:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> Some callers want to perform a reachability traversal that terminates
> when an object is found in a kept pack. The closest existing option is
> '--honor-pack-keep', but this isn't quite what we want. Instead of
> halting the traversal midway through, a full traversal is always
> performed, and the results are only trimmed afterwards.

True.  

Is there a reason to keep both kinds?  It is obvious that stopping
traversal once we hit a kept pack would be more time and space
efficient (I presume that the reason why .kept pack matters is
because we are repacking everything else) to enumerate the objects
that need to be repacked than traversing all the way and filtering
out objects that appear in .kept packs, but would there be some
correctness implications to replace the existing use of
"--honor-pack-keep" with "--no-kept-objects=on-disk"?  

What it means to be excluded by the former is quite clear: any
object that appears in a kept pack, whether or not another copy of it
appears elsewhere, is excluded from getting enumerated for
repacking.  It is quite unclear what it means to enumerate objects
with "--no-kept-objects".  It is clear from the implementation side
of the thing (stop traversal at objects that appear in any kept
pack), but it is totally unclear how a meaning defined operationally
like that affects the resulting enumeration.  We know that the
enumerated objects do not appear in any of the kept packs, but it
does not mean all objects that are reachable/in-use that are not in
any kept packs are enumerated.
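
The difference between the two behaviours can be sketched on a toy
object graph. Everything below (the 'toy_object' struct, 'enumerate',
the fixed-size arrays) is hypothetical and only illustrates the
concern; it is not how the revision machinery is implemented.

```c
#define MAX_OBJECTS 8

struct toy_object {
	int in_kept_pack;	/* nonzero if a copy lives in a kept pack */
	int nr_children;
	int children[2];	/* indices of the objects this one refers to */
};

static void walk(const struct toy_object *objs, int idx, int halt_at_kept,
		 int *seen, int *out)
{
	int i;

	if (seen[idx])
		return;
	seen[idx] = 1;

	if (objs[idx].in_kept_pack) {
		/* "--no-kept-objects" style: do not look past this object. */
		if (halt_at_kept)
			return;
		/* "--honor-pack-keep" style: skip it, but keep walking. */
	} else {
		out[idx] = 1;
	}

	for (i = 0; i < objs[idx].nr_children; i++)
		walk(objs, objs[idx].children[i], halt_at_kept, seen, out);
}

/*
 * Sets out[i] to 1 for every object that would be enumerated for
 * repacking, and returns how many there are (or -1 on bad input).
 */
int enumerate(const struct toy_object *objs, int nr, int start,
	      int halt_at_kept, int *out)
{
	int seen[MAX_OBJECTS] = {0};
	int i, count = 0;

	if (nr > MAX_OBJECTS || start < 0 || start >= nr)
		return -1;
	for (i = 0; i < nr; i++)
		out[i] = 0;

	walk(objs, start, halt_at_kept, seen, out);

	for (i = 0; i < nr; i++)
		count += out[i];
	return count;
}
```

Take a chain where object 0 (not kept) refers to object 1 (kept),
which refers to object 2 (not kept). Walking everything and filtering
enumerates objects 0 and 2, while halting at the kept object
enumerates only object 0. The two agree only if nothing in a kept pack
refers to an object outside the kept packs.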

> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index 002379056a..817419d552 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -856,6 +856,13 @@ ifdef::git-rev-list[]
>  	Only useful with `--objects`; print the object IDs that are not
>  	in packs.
>  
> +--no-kept-objects[=<kind>]::
> +	Halts the traversal as soon as an object in a kept pack is
> +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> +	with their in-core kept state set are ignored. Otherwise, both
> +	kinds of kept packs are ignored.

Is it explained anywhere how "in-core kept state" is bootstrapped,
modified and maintained?

The patch to C-part itself is a trivially correct implementation of
"stop at an object that can be found in a kept pack", and there is
no comment, but it is not clear to me what we want to achieve by
this.  Is the underlying assumption that no objects in .kept pack
would refer to outside world, either loose or packs that are not
kept?  How are we guaranteeing it?


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
@ 2021-01-29  3:21   ` Junio C Hamano
  2021-01-29 19:19     ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29  3:21 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> Teach pack-objects an option to imply the revision machinery's new
> '--no-kept-objects' option when doing a reachability traversal.
>
> When '--assume-kept-packs-closed' is given as an argument to
> pack-objects, it behaves differently (i.e., passes different options to
> the ensuing revision walk) depending on whether or not other arguments
> are passed:
>
>   - If the caller also specifies a '--keep-pack' argument (to mark a
>     pack as kept in-core), then assume that this combination means to
>     stop traversal only at in-core packs.
>
>   - If instead the caller passes '--honor-pack-keep', then assume that
>     the caller wants to stop traversal only at packs with a
>     corresponding .keep file (consistent with the original meaning which
>     only refers to packs with a .keep file).
>
>   - If both '--keep-pack' and '--honor-pack-keep' are passed, then
>     assume the caller wants to stop traversal at either kind of kept
>     pack.
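
The three cases quoted above amount to building a bitmask from the two
options. A rough sketch follows; the helper name
'kept_kinds_for_traversal' is hypothetical, the flag names are
borrowed from patch 01 with assumed values, and returning 0 when
neither option is given is an assumption, since the commit message
does not cover that case.

```c
/* Flag names follow patch 01; the numeric values are assumptions. */
enum {
	ON_DISK_KEEP_PACKS = 1 << 0,
	IN_CORE_KEEP_PACKS = 1 << 1
};

/*
 * Hypothetical helper: decide at which kinds of kept packs the
 * traversal should stop when '--assume-kept-packs-closed' is given.
 */
unsigned kept_kinds_for_traversal(int have_keep_pack, int honor_pack_keep)
{
	unsigned kinds = 0;

	if (have_keep_pack)
		kinds |= IN_CORE_KEEP_PACKS;	/* packs named by --keep-pack */
	if (honor_pack_keep)
		kinds |= ON_DISK_KEEP_PACKS;	/* packs with a .keep file */

	return kinds;	/* 0 if neither option was given (assumed) */
}
```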

If there is an out-of-band guarantee that .kept packs won't refer to
outside world, then we can obtain identical results to what existing
--honor-pack-keep (which traverses everything and then filters out
what is in .keep pack) does by just stopping traversal when we see
an object that is found in a .keep pack.  OK, I guess that it
answers the correctness question I asked about [02/10].

It still is curious how we can safely "assume", but presumably we
will see how in a patch that appears later in the series.

How "closed" are these kept packs supposed to be?  When there are
two .keep packs, should objects in each of the packs never refer to
outside their own pack, or is it OK for objects in one kept pack to
refer to another object in the other kept pack?  Readers and those
who want to understand and extend this code in the future would need
to know what definition of "closed" you are using here.

Thanks.

* Re: [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
@ 2021-01-29  3:40   ` Junio C Hamano
  2021-01-29 19:32     ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29  3:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> From: Jeff King <peff@peff.net>

Not a fault of this series at all, but before the precontext of the
first hunk, there is  


> diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
> index 277d22ec4b..85b077b72b 100755
> --- a/t/perf/p5303-many-packs.sh
> +++ b/t/perf/p5303-many-packs.sh
> @@ -27,8 +27,11 @@ repack_into_n () {

this construct:

	... |
	sed -n '1~5p' |
	head -n "$1" |
        ...

which is a GNUism.  Peff often says that only a very small population
actually runs our perf suite, and this seems to corroborate the
conjecture.

>  	>pushes &&
>  
>  	# create base packfile
> -	head -n 1 pushes |
> -	git pack-objects --delta-base-offset --revs staging/pack &&
> +	base_pack=$(
> +		head -n 1 pushes |
> +		git pack-objects --delta-base-offset --revs staging/pack
> +	) &&
> +	test_export base_pack &&
>  
>  	# and then incrementals between each pair of commits
>  	last= &&
> @@ -87,6 +90,15 @@ do
>  		  --reflog --indexed-objects --delta-base-offset \
>  		  --stdout </dev/null >/dev/null
>  	'
> +
> +	test_perf "repack with keep ($nr_packs)" '
> +		git pack-objects --keep-true-parents \
> +		  --honor-pack-keep --assume-kept-packs-closed \
> +		  --keep-pack=pack-$base_pack.pack \
> +		  --non-empty --all \
> +		  --reflog --indexed-objects --delta-base-offset \
> +		  --stdout </dev/null >/dev/null
> +	'
>  done
>  
>  # Measure pack loading with 10,000 packs.

* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-29  2:33   ` Junio C Hamano
@ 2021-01-29 18:38     ` Taylor Blau
  2021-01-29 19:31     ` Jeff King
  1 sibling, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 18:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, dstolee

On Thu, Jan 28, 2021 at 06:33:10PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Future callers will want a function to fill a 'struct pack_entry' for a
> > given object id but _only_ from its position in any kept pack(s). They
> > could accomplish this by calling 'find_pack_entry()' and checking
> > whether the found pack is kept or not, but this is insufficient, since
> > there may be duplicate objects (and the mru cache makes it unpredictable
> > which variant we'll get).
>
> I wonder if we eventually need a callback interface to walk _all_
> pack entries for a given object, so that "I am only interested in
> instances in kept packs" will be under total control of the callers.
> As it stands, it is "just grab any one that is in a kept pack, any
> one of them is fine", which is of almost as narrow utility as
> the original's "just grab the first one---any one of them is fine",
> the latter of which is "insufficient" as the log message says.
>
> But this (in the context of the remainder of the series) might be
> sufficient, at least for now.

As you note, it's more about "can I find this object in any kept pack
(of a certain kind)" versus, "show me this object in a pack" (and hope
that if it appears in a kept pack, that that's the copy that is picked).

> > Teach this new function to treat the two different kinds of kept packs
> > (on disk ones with .keep files, as well as in-core ones which are set by
> > manually poking the 'pack_keep_in_core' bit) separately. This will
> > become important for callers that only want to respect a certain kind of
> > kept pack.
>
> Or maybe not ;-)

:-). The difference here is that we will only want to stop the traversal
at packs which are considered to be stable from the perspective of a
geometric repack.

We mark those packs as "stable" by setting their in-core kept bit, but
we don't write ".keep" files (which would make them on-disk kept). The
latter is up to the user, not us.
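
As a rough illustration of that distinction, the per-pack check from
patch 01 can be reduced to a small predicate. The 'toy_pack' struct
and the function name are hypothetical stand-ins for 'struct
packed_git' and the inline condition in the patch, and the flag values
are assumptions.

```c
/* Flag names follow patch 01; the numeric values are assumptions. */
enum {
	ON_DISK_KEEP_PACKS = 1 << 0,
	IN_CORE_KEEP_PACKS = 1 << 1
};

/* Hypothetical stand-in for the two bits in 'struct packed_git'. */
struct toy_pack {
	unsigned pack_keep : 1;		/* a .keep file exists on disk */
	unsigned pack_keep_in_core : 1;	/* marked kept by this process */
};

/*
 * Returns 1 if 'p' should be searched given 'kept_only'. A zero
 * 'kept_only' means "any pack", as in the condition from the patch.
 */
int pack_matches_kept_kind(const struct toy_pack *p, unsigned kept_only)
{
	if (!kept_only)
		return 1;
	return ((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
	       ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core);
}
```

A pack that the geometric repack marked as stable matches
IN_CORE_KEEP_PACKS only, while a pack the user protected with a .keep
file matches ON_DISK_KEEP_PACKS only, so a caller asking for in-core
kept packs would not stop at the user's .keep packs.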

> If there is a notable relationship between on-disk and in-core kept
> packs (e.g. "the set of on-disk kept packs is a subset of in-core
> kept packs", "usually on-disk kept packs get in-core kept bit when
> their packed_git instances are populated, but we can drop the bit at
> runtime, so on-disk and in-core are pretty much independent and
> there is no notable relationship"), it must be explained upfront to
> help the reader form a sensible world view.

Unfortunately, I don't think that there is a sensible world-view here
to be formed. Honestly, the distinction between .keep packs and in-core
kept packs is incredibly narrow, and I find our separate handling of
them awkward and error-prone.

But, it is sort of what you'd want here (i.e., a way to mark all objects
in a pack as ignored without actually writing the physical file that
says "ignore all objects in this pack").

Thanks,
Taylor

* Re: [PATCH 02/10] revision: learn '--no-kept-objects'
  2021-01-29  3:10   ` Junio C Hamano
@ 2021-01-29 19:13     ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 19:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, dstolee

On Thu, Jan 28, 2021 at 07:10:04PM -0800, Junio C Hamano wrote:
> We know that the enumerated objects do not appear in any of the kept
> packs, but it does not mean all objects that are reachable/in-use that
> are not in any kept packs are enumerated.

You raise a very valid point. FWIW, I originally wrote these patches as
just "enumerate the objects in these small packs, make a new pack out of
those, and then (optionally) delete the small ones". I abandoned that
idea because it needs special handling for loose objects, and it has no
idea which objects are unreachable, etc.

But maybe it is time to go back to the drawing board there. Perhaps a
`--geometric` repack implies that we keep unreachable objects in effect,
and that a full repack (i.e., one that does reachability analysis) is
required to drop them.

Other ideas are welcome.

Thanks,
Taylor

* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29  3:21   ` Junio C Hamano
@ 2021-01-29 19:19     ` Jeff King
  2021-01-29 20:01       ` Taylor Blau
  2021-01-29 20:30       ` Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 19:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Thu, Jan 28, 2021 at 07:21:09PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > Teach pack-objects an option to imply the revision machinery's new
> > '--no-kept-objects' option when doing a reachability traversal.
> >
> > When '--assume-kept-packs-closed' is given as an argument to
> > pack-objects, it behaves differently (i.e., passes different options to
> > the ensuing revision walk) depending on whether or not other arguments
> > are passed:
> >
> >   - If the caller also specifies a '--keep-pack' argument (to mark a
> >     pack as kept in-core), then assume that this combination means to
> >     stop traversal only at in-core packs.
> >
> >   - If instead the caller passes '--honor-pack-keep', then assume that
> >     the caller wants to stop traversal only at packs with a
> >     corresponding .keep file (consistent with the original meaning which
> >     only refers to packs with a .keep file).
> >
> >   - If both '--keep-pack' and '--honor-pack-keep' are passed, then
> >     assume the caller wants to stop traversal at either kind of kept
> >     pack.
> 
> If there is an out-of-band guarantee that .kept packs won't refer to
> outside world, then we can obtain identical results to what existing
> --honor-pack-keep (which traverses everything and then filters out
> what is in .keep pack) does by just stopping traversal when we see
> an object that is found in a .keep pack.  OK, I guess that it
> answers the correctness question I asked about [02/10].
> 
> It still is curious how we can safely "assume", but presumably we
> will see how in a patch that appears later in the series.

I think this would generally happen if the .keep packs are generated
using something like "git repack -a", which packs everything reachable
together. So if you do:

  git repack -ad
  touch .git/objects/pack/pack-whatever.keep
  ... some more packs come in, perhaps via pushes ...
  # imagine repack knew how to pass this along...
  git repack -a --assume-kept-packs-closed

then you'd repack just the objects that aren't in the big pack.

(And other kept packs; this probably would not want to be used with
.keep packs, since those can be racily created by receive-pack. Rather
you'd want to say "consider this specific list of packs to be closed and
kept").

It is dangerous, though. If your assumption is somehow wrong, then you'd
potentially corrupt the repository (because you'd stop traversing, but
perhaps delete a pack that contained a useful object you would have
reached that you don't have elsewhere).

The overall goal here is being able to roll up loose objects and smaller
packs without having to pay the cost of a full reachability traversal
(which can take several minutes on large repositories). Another
very-different direction there is to just enumerate those objects
without respect to reachability, stick them in a pack, and then delete
the originals. That does imply something like "repack -k", though, and
interacts weirdly with letting unreachable objects age out via their
mtimes (we'd constantly suck them back into fresh packs).

That would work better if our unreachable "aging out" storage was
marked as such (say, in a pack marked with a ".cruft" file, rather than
just a regular loose object that might be new or might be cruft). Then a
roll-up repack would leave cruft packs alone (neither rolling them up,
nor deleting them). A "real" repack would eventually delete them, but
only after having done an actual reachability traversal, which makes sure
there are no objects within them that need to be rescued.

> How "closed" are these kept packs supposed to be?  When there are
> two .keep packs, should objects in each of the packs never refer to
> outside their own pack, or is it OK for objects in one kept pack to
> refer to another object in the other kept pack?  Readers and those
> who want to understand and extend this code in the future would need
> to know what definition of "closed" you are using here.

I think it would want to be "the set of all .keep packs is closed". In a
"roll all into one" scenario like above, there is only one .keep pack.
But in a geometric progression, that single pack which constitutes your
base set could be multiple packs (the last whole "git repack -ad", but
then a sequence of roll-ups that came on top of it).
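
To make "closed" concrete, here is a minimal shell sketch (a model, not
Git code). It assumes two hypothetical input files that Git does not
produce in this form: a list of the object IDs found in any kept pack,
and a list of "from to" reference edges. The hypothetical helper
closure_violations prints every edge that starts inside the kept set and
ends outside it.

```shell
# closure_violations KEPT EDGES
#   KEPT:  file of object IDs present in any kept pack, one per line
#   EDGES: file of "from to" pairs, one reference per line
# Both file formats are assumptions made for this sketch.
closure_violations () {
	awk '
		# First file: remember every object that lives in a kept pack.
		NR == FNR { kept[$1] = 1; next }
		# Second file: report references that leave the kept set.
		($1 in kept) && !($2 in kept) { print }
	' "$1" "$2"
}
```

Empty output means the kept packs, taken together, are closed. An edge
from an object in one kept pack to an object in another kept pack is not
reported, since both ends are in the union.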

-Peff


* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-29  2:33   ` Junio C Hamano
  2021-01-29 18:38     ` Taylor Blau
@ 2021-01-29 19:31     ` Jeff King
  2021-01-29 20:20       ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-01-29 19:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Thu, Jan 28, 2021 at 06:33:10PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > Future callers will want a function to fill a 'struct pack_entry' for a
> > given object id but _only_ from its position in any kept pack(s). They
> > could accomplish this by calling 'find_pack_entry()' and checking
> > whether the found pack is kept or not, but this is insufficient, since
> > there may be duplicate objects (and the mru cache makes it unpredictable
> > which variant we'll get).
> 
> I wonder if we eventually need a callback interface to walk _all_
> pack entries for a given object, so that "I am only interested in
> instances in kept packs" will be under total control of the callers.
> As it stands, it is "just grab any one that is in a kept pack, any
> one of them is fine", which is of almost as narrow utility as
> the original's "just grab the first one---any one of them is fine",
> the latter of which is "insufficient" as the log message says.

We do that already in pack-objects, and that's the problem: it's really
slow. So if you have few kept packs, but a lot of other ones, you'd like
to pre-split the packs into two lists, and not bother walking the one
you know won't turn up interesting results.

I think the commit message here doesn't emphasize that reasoning enough.
It talks about using "find_pack_entry()", and that is definitely not
sufficient for our purposes. But the interesting part is replacing the
existing "walk all packs and see if any were kept" logic, which happens
in patch 6.

So the more compelling argument, I think, is something like:

  - you sometimes want to know if object X is in any kept pack

  - you can't use find_pack_entry(), because it only gives you the first
    pack it finds

  - you can walk over all packs and look for the object in each.
    pack-objects does this. But it's slow, because you are looking in
    packs you don't care about.

  - so it's helpful for the lookup to know up front which packs are
    interesting to find objects in and which are not, to avoid looking
    in the uninteresting ones
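
The last point can be sketched in shell. This is only a model, not how
packfile.c does it: each "pack" is assumed to be a text file NAME.objs
listing object IDs one per line, and a pack counts as kept when NAME.keep
sits next to it. Both helper names are hypothetical.

```shell
# kept_packs DIR: print the model packs in DIR that have a ".keep" sibling.
kept_packs () {
	for pack in "$1"/*.objs
	do
		# An unmatched glob stays literal, so make sure the file exists.
		test -f "$pack" || continue
		if test -f "${pack%.objs}.keep"
		then
			printf '%s\n' "$pack"
		fi
	done
}

# find_in_packs OID PACK...: print the first listed pack containing OID.
# Callers pass only the pre-computed kept list, so non-kept packs are
# never opened.
find_in_packs () {
	oid=$1
	shift
	for pack in "$@"
	do
		if grep -qxF -e "$oid" "$pack"
		then
			printf '%s\n' "$pack"
			return 0
		fi
	done
	return 1
}
```

The split is computed once, e.g. kept=$(kept_packs "$dir"), and each
lookup is then find_in_packs "$oid" $kept (unquoted on purpose, so this
assumes pack paths without whitespace).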

-Peff


* Re: [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-29  3:40   ` Junio C Hamano
@ 2021-01-29 19:32     ` Jeff King
  2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
  2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 19:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Thu, Jan 28, 2021 at 07:40:40PM -0800, Junio C Hamano wrote:

> > diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
> > index 277d22ec4b..85b077b72b 100755
> > --- a/t/perf/p5303-many-packs.sh
> > +++ b/t/perf/p5303-many-packs.sh
> > @@ -27,8 +27,11 @@ repack_into_n () {
> 
> this construct:
> 
> 	... |
> 	sed -n '1~5p' |
> 	head -n "$1" |
>         ...
> 
> which is a GNUism.  Peff often says that a very small population
> actually runs our perf suite, and this seems to corroborate the
> conjecture.

Oops. Looks like I was the one who introduced that. Nobody seems to have
complained, so I'm somewhat tempted to leave it. But it would not be too
hard to replace with perl, I think.

-Peff


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 19:19     ` Jeff King
@ 2021-01-29 20:01       ` Taylor Blau
  2021-01-29 20:25         ` Jeff King
  2021-01-29 20:30       ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 20:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:
> The overall goal here is being able to roll up loose objects and smaller
> packs without having to pay the cost of a full reachability traversal
> (which can take several minutes on large repositories). Another
> very-different direction there is to just enumerate those objects
> without respect to reachability, stick them in a pack, and then delete
> the originals. That does imply something like "repack -k", though, and
> interacts weirdly with letting unreachable objects age out via their
> mtimes (we'd constantly suck them back into fresh packs).

As I mentioned in an earlier response to Junio, this was the original
approach that I took when implementing this, but I ultimately decided
against it because it means that we'll never let unreachable objects age
out (as you note).

I wonder if we need our assumption that the union of kept packs is
closed under reachability to be specified as an option. If the option is
passed, then we stop the traversal as soon as we hit an object in the
frozen packs. If not passed, then we do a full traversal but pass
--honor-pack-keep to drop out objects in the frozen packs after the
fact.

Thoughts?

> I think it would want to be "the set of all .keep packs is closed". In a
> "roll all into one" scenario like above, there is only one .keep pack.
> But in a geometric progression, that single pack which constitutes your
> base set could be multiple packs (the last whole "git repack -ad", but
> then a sequence of roll-ups that came on top of it).

I don't think having a roll-up strategy of "all-except-one" simplifies
things. Or, if it does, then I don't understand it. Isn't this the exact
same thing as a geometric repack which decides to keep only one pack?

ISTM that you would be susceptible to the same problems in this case,
too.


Thanks,
Taylor


* [PATCH] p5303: avoid sed GNU-ism
  2021-01-29 19:32     ` Jeff King
@ 2021-01-29 20:04       ` Jeff King
  2021-01-29 20:19         ` Eric Sunshine
  2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-01-29 20:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 02:32:50PM -0500, Jeff King wrote:

> > this construct:
> > 
> > 	... |
> > 	sed -n '1~5p' |
> > 	head -n "$1" |
> >         ...
> > 
> > which is a GNUism.  Peff often says that a very small population
> > actually runs our perf suite, and this seems to corroborate the
> > conjecture.
> 
> Oops. Looks like I was the one who introduced that. Nobody seems to have
> complained, so I'm somewhat tempted to leave it. But it would not be too
> hard to replace with perl, I think.

Maybe worth doing this?

-- >8 --
Subject: [PATCH] p5303: avoid sed GNU-ism

Using "1~5" isn't portable. Nobody seems to have noticed, since perhaps
people don't tend to run the perf suite on more exotic platforms. Still,
it's better to set a good example.

We can use:

  perl -ne 'print if $. % 5 == 1'

instead. But we can further observe that perl does a good job of the
other parts of this pipeline, and fold the whole thing together.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/p5303-many-packs.sh | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index f4c2ab0584..ce0c42cc9f 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -21,10 +21,14 @@ repack_into_n () {
 	mkdir staging &&
 
 	git rev-list --first-parent HEAD |
-	sed -n '1~5p' |
-	head -n "$1" |
-	perl -e 'print reverse <>' \
-	>pushes
+	perl -e '
+		my $n = shift;
+		while (<>) {
+			last unless @commits < $n;
+			push @commits, $_ if $. % 5 == 1;
+		}
+		print reverse @commits;
+	' "$1" >pushes
 
 	# create base packfile
 	head -n 1 pushes |
-- 
2.30.0.759.g69d54d14a7



* Re: [PATCH] p5303: avoid sed GNU-ism
  2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
@ 2021-01-29 20:19         ` Eric Sunshine
  2021-01-29 20:27           ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Eric Sunshine @ 2021-01-29 20:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, Git List, Derrick Stolee

On Fri, Jan 29, 2021 at 3:07 PM Jeff King <peff@peff.net> wrote:
> Subject: [PATCH] p5303: avoid sed GNU-ism
>
> Using "1~5" isn't portable. Nobody seems to have noticed, since perhaps
> people don't tend to run the perf suite on more exotic platforms. Still,
> it's better to set a good example.

It's not just exotic platforms on which this can be a problem. BSD
lineage `sed`, such as stock `sed` on macOS, doesn't understand this
notation.
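
For reference, the GNU-only "first~step" address in '1~5p' selects line 1
and then every 5th line after it. A minimal portable sketch of the same
filter, using POSIX awk here rather than the perl from the patch
(every_fifth is a hypothetical name used only for illustration):

```shell
# Portable stand-in for "sed -n '1~5p'": print lines 1, 6, 11, ...
every_fifth () {
	awk 'NR % 5 == 1'
}
```

Feeding it the eleven lines a through k, one per line, prints a, f and k.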

Thanks for eliminating this particular GNU-ism.


* Re: [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()'
  2021-01-29 19:31     ` Jeff King
@ 2021-01-29 20:20       ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 20:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

Jeff King <peff@peff.net> writes:

> So the more compelling argument, I think, is something like:
>
>   - you sometimes want to know if object X is in any kept pack
>
>   - you can't use find_pack_entry(), because it only gives you the first
>     pack it finds
>
>   - you can walk over all packs and look for the object in each.
>     pack-objects does this. But it's slow, because you are looking in
>     packs you don't care about.
>
>   - so it's helpful for the lookup to know up front which packs are
>     interesting to find objects in and which are not, to avoid looking
>     in the uninteresting ones

That does make sense.


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 20:01       ` Taylor Blau
@ 2021-01-29 20:25         ` Jeff King
  2021-01-29 22:10           ` Taylor Blau
  2021-01-29 22:13           ` Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 20:25 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee

On Fri, Jan 29, 2021 at 03:01:48PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:
> > The overall goal here is being able to roll up loose objects and smaller
> > packs without having to pay the cost of a full reachability traversal
> > (which can take several minutes on large repositories). Another
> > very-different direction there is to just enumerate those objects
> > without respect to reachability, stick them in a pack, and then delete
> > the originals. That does imply something like "repack -k", though, and
> > interacts weirdly with letting unreachable objects age out via their
> > mtimes (we'd constantly suck them back into fresh packs).
> 
> As I mentioned in an earlier response to Junio, this was the original
> approach that I took when implementing this, but I ultimately decided
> against it because it means that we'll never let unreachable objects age
> out (as you note).

Right. But that's no different than using "-k" most of the time, and
then occasionally doing a more careful repack with short expiration
times and a full reachability check. As you know, this is basically what
we do at GitHub.

So it may be reasonable to go that direction, which is really defining a
totally separate strategy from git-gc's "repack, and occasionally
objects age out". Especially if we find that the
assume-kept-packs-closed route is too risky (i.e., has too many cases
where it's possible to cause corruption if our assumption isn't met).

I'm not convinced either way at this point, but just thinking out loud
on the options (and trying to give some context to the list).

> I wonder if we need our assumption that the union of kept packs is
> closed under reachability to be specified as an option. If the option is
> passed, then we stop the traversal as soon as we hit an object in the
> frozen packs. If not passed, then we do a full traversal but pass
> --honor-pack-keep to drop out objects in the frozen packs after the
> fact.
> 
> Thoughts?

I'm confused. I thought the whole idea was to pass it as an option (the
user telling Git "I know these packs are supposed to be closed; trust
me")?

> > I think it would want to be "the set of all .keep packs is closed". In a
> > "roll all into one" scenario like above, there is only one .keep pack.
> > But in a geometric progression, that single pack which constitutes your
> > base set could be multiple packs (the last whole "git repack -ad", but
> > then a sequence of roll-ups that came on top of it).
> 
> I don't think having a roll-up strategy of "all-except-one" simplifies
> things. Or, if it does, then I don't understand it. Isn't this the exact
> same thing as a geometric repack which decides to keep only one pack?
> 
> ISTM that you would be susceptible to the same problems in this case,
> too.

I wasn't trying to argue that all-except-one avoids any problems. I was
saying that the example I gave above was an all-into-one, but if you
want to extend the concept to multiple packs, it has to cover the whole
set. I.e., answering Junio's:

  > is it OK for objects in one kept pack to refer to another object in
  > the other kept pack?

with "yes".

-Peff


* Re: [PATCH] p5303: avoid sed GNU-ism
  2021-01-29 20:19         ` Eric Sunshine
@ 2021-01-29 20:27           ` Jeff King
  2021-01-29 20:36             ` Eric Sunshine
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-01-29 20:27 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, Taylor Blau, Git List, Derrick Stolee

On Fri, Jan 29, 2021 at 03:19:31PM -0500, Eric Sunshine wrote:

> On Fri, Jan 29, 2021 at 3:07 PM Jeff King <peff@peff.net> wrote:
> > Subject: [PATCH] p5303: avoid sed GNU-ism
> >
> > Using "1~5" isn't portable. Nobody seems to have noticed, since perhaps
> > people don't tend to run the perf suite on more exotic platforms. Still,
> > it's better to set a good example.
> 
> It's not just exotic platforms on which this can be a problem. BSD
> lineage `sed`, such as stock `sed` on macOS, doesn't understand this
> notation.
> 
> Thanks for eliminating this particular GNU-ism.

OK, then I'm doubly surprised nobody has noticed and complained about
this. :)

-Peff


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 19:19     ` Jeff King
  2021-01-29 20:01       ` Taylor Blau
@ 2021-01-29 20:30       ` Junio C Hamano
  2021-01-29 22:43         ` Jeff King
  1 sibling, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 20:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

Jeff King <peff@peff.net> writes:

>> If there is an out-of-band guarantee that .kept packs won't refer to
>> outside world, then we can obtain identical results to what existing
>> --honor-pack-keep (which traverses everything and then filters out
>> what is in .keep pack) does by just stopping traversal when we see
>> an object that is found in a .keep pack.  OK, I guess that it
>> answers the correctness question I asked about [02/10].
>> 
>> It still is curious how we can safely "assume", but presumably we
>> will see how in a patch that appears later in the series.
>
> I think this would generally happen if the .keep packs are generated
> using something like "git repack -a", which packs everything reachable
> together. So if you do:
>
>   git repack -ad
>   touch .git/objects/pack/pack-whatever.keep
>   ... some more packs come in, perhaps via pushes ...
>   # imagine repack knew how to pass this along...
>   git repack -a --assume-kept-packs-closed
>
> then you'd repack just the objects that aren't in the big pack.

Yeah.  As a tool to help the above workflow, where you are only
creating another .keep out of youngest objects (i.e. those that are
either loose or in non-kept packs), because by definition anything
in .keep cannot be pointing back at these younger objects, it does
make sense to take advantage of "the set of packs with .keep as a
whole is closed".

It may become tricky once we start talking about creating a new
.keep out of youngest objects PLUS a few young keep packs, though.

Starting from all on-disk .keep packs, you'd mark them with the in-core
keep bit, then drop the in-core keep bit from the few young keep packs
that you intend to coalesce with the youngest objects---that is how
I would imagine your repacking strategy would go.  The set of all
the on-disk .keep packs may give us "closed" guarantee, but if we 
exclude a few latest packs from that set, would the remainder still
give us the "closed" guarantee we can take advantage of, in order to
pack these youngest objects (including the ones in the kept packs
that we are coalescing)?


* Re: [PATCH] p5303: avoid sed GNU-ism
  2021-01-29 20:27           ` Jeff King
@ 2021-01-29 20:36             ` Eric Sunshine
  2021-01-29 22:11               ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Eric Sunshine @ 2021-01-29 20:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, Git List, Derrick Stolee

On Fri, Jan 29, 2021 at 3:28 PM Jeff King <peff@peff.net> wrote:
> On Fri, Jan 29, 2021 at 03:19:31PM -0500, Eric Sunshine wrote:
> > It's not just exotic platforms on which this can be a problem. BSD
> > lineage `sed`, such as stock `sed` on macOS, doesn't understand this
> > notation.
>
> OK, then I'm doubly surprised nobody has noticed and complained about
> this. :)

Aside from there possibly being relatively few regular Git developers
using macOS, it could also be because it's difficult to run the perf
tests on macOS in the first place due to the GNU prerequisites. For
instance, the perf tests have an unconditional dependency on GNU
`time` which is not installed on macOS by default, and it's not always
easy to figure out how to obtain it.


* Re: [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-29 19:32     ` Jeff King
  2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
@ 2021-01-29 20:38       ` Junio C Hamano
  2021-01-29 22:10         ` Jeff King
  1 sibling, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 20:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

Jeff King <peff@peff.net> writes:

> On Thu, Jan 28, 2021 at 07:40:40PM -0800, Junio C Hamano wrote:
>
>> > diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
>> > index 277d22ec4b..85b077b72b 100755
>> > --- a/t/perf/p5303-many-packs.sh
>> > +++ b/t/perf/p5303-many-packs.sh
>> > @@ -27,8 +27,11 @@ repack_into_n () {
>> 
>> this construct:
>> 
>> 	... |
>> 	sed -n '1~5p' |
>> 	head -n "$1" |
>>         ...
>> 
>> which is a GNUism.  Peff often says that a very small population
>> actually runs our perf suite, and this seems to corroborate the
>> conjecture.
>
> Oops. Looks like I was the one who introduced that. Nobody seems to have
> complained, so I'm somewhat tempted to leave it. But it would not be too
> hard to replace with perl, I think.

Yeah, but would it be worth it?  I am actually OK to say that you
need GNU sed if you want to run perf.  We already rely on GNU time
to run perf tests, no?



* Re: [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
@ 2021-01-29 22:10         ` Jeff King
  2021-01-29 23:12           ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-01-29 22:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 12:38:08PM -0800, Junio C Hamano wrote:

> > Oops. Looks like I was the one who introduced that. Nobody seems to have
> > complained, so I'm somewhat tempted to leave it. But it would not be too
> > hard to replace with perl, I think.
> 
> Yeah, but would it be worth it?  I am actually OK to say that you
> need GNU sed if you want to run perf.  We already rely on GNU time
> to run perf tests, no?

True. This one is a little worse because it's subtle, and somebody might
copy it unknowingly into the regular test suite.

I am happy to leave it, or for you to pick up the patch I sent earlier
(which I did verify produces identical output).

-Peff


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 20:25         ` Jeff King
@ 2021-01-29 22:10           ` Taylor Blau
  2021-01-29 22:57             ` Jeff King
  2021-01-29 23:03             ` Junio C Hamano
  2021-01-29 22:13           ` Junio C Hamano
  1 sibling, 2 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 22:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, git, dstolee

On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
> So it may be reasonable to go that direction, which is really defining a
> totally separate strategy from git-gc's "repack, and occasionally
> objects age out". Especially if we find that the
> assume-kept-packs-closed route is too risky (i.e., has too many cases
> where it's possible to cause corruption if our assumption isn't met).

Yeah, this whole conversation has made me very nervous about using
reachability. Fundamentally, this isn't about reachability at all. The
operation is as simple as telling pack-objects a list of packs that you
do and don't want objects from, making a new pack out of that, and then
optionally dropping the packs that you rolled up.

So, I think that teaching pack-objects a way to understand a caller that
says "include objects from packs X, Y, and Z, but not if they appear in
packs A, B, or C, and also pull in any loose objects" is the best way
forward here.
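
A rough model of that selection, assuming the object IDs have already
been dumped into two plain files (hypothetical inputs; one way to obtain
them might be running "git show-index" on each pack's .idx and adding the
loose object names). The objects to write are those on the include list
that are absent from the exclude list, and reachability never enters
into it. rollup_objects is a hypothetical helper, not a pack-objects
interface.

```shell
# rollup_objects INCLUDE EXCLUDE
#   INCLUDE: object IDs from packs X, Y, Z plus loose objects, one per line
#   EXCLUDE: object IDs from packs A, B, C, one per line
# Prints the IDs that would go into the new pack.
rollup_objects () {
	# mktemp is not in POSIX, but it is widely available.
	rollup_tmp=$(mktemp -d) || return 1
	# comm needs sorted input; use the same collation for sort and comm.
	LC_ALL=C sort -u "$1" >"$rollup_tmp/include" &&
	LC_ALL=C sort -u "$2" >"$rollup_tmp/exclude" &&
	LC_ALL=C comm -23 "$rollup_tmp/include" "$rollup_tmp/exclude"
	status=$?
	rm -rf "$rollup_tmp"
	return $status
}
```

Duplicates across the included packs collapse because of "sort -u", so an
object present in both X and Y is listed once.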

Of course, you're going to be dragging along unreachable objects until
you decide to do a full repack, but I'm OK with that since we wouldn't
expect anybody to be solely relying on geometric repacks without
occasionally running 'git repack -ad'.

Junio: I don't think that you have picked this up yet, but please avoid
doing so for now, and I'll send a new series that goes in the direction
I outlined above.

Thanks,
Taylor


* Re: [PATCH] p5303: avoid sed GNU-ism
  2021-01-29 20:36             ` Eric Sunshine
@ 2021-01-29 22:11               ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 22:11 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Jeff King, Junio C Hamano, Taylor Blau, Git List, Derrick Stolee

On Fri, Jan 29, 2021 at 03:36:01PM -0500, Eric Sunshine wrote:
> On Fri, Jan 29, 2021 at 3:28 PM Jeff King <peff@peff.net> wrote:
> > On Fri, Jan 29, 2021 at 03:19:31PM -0500, Eric Sunshine wrote:
> > > It's not just exotic platforms on which this can be a problem. BSD
> > > lineage `sed`, such as stock `sed` on macOS, doesn't understand this
> > > notation.
> >
> > OK, then I'm doubly surprised nobody has noticed and complained about
> > this. :)
>
> Aside from there possibly being relatively few regular Git developers
> using macOS, it could also be because it's difficult to run the perf
> tests on macOS in the first place due to the GNU prerequisites. For
> instance, the perf tests have an unconditional dependency on GNU
> `time` which is not installed on macOS by default, and it's not always
> easy to figure out how to obtain it.

Yep, I agree completely. I was going to say that this would produce a
conflict (albeit, a trivial one) with the series that this came out of.

But I think that we're better off abandoning that series for now until I
send a different version, so I think we should just go ahead and apply
this.


Thanks,
Taylor


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 20:25         ` Jeff King
  2021-01-29 22:10           ` Taylor Blau
@ 2021-01-29 22:13           ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 22:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

Jeff King <peff@peff.net> writes:

>> I wonder if we need our assumption that the union of kept packs is
>> closed under reachability to be specified as an option. If the option is
>> passed, then we stop the traversal as soon as we hit an object in the
>> frozen packs. If not passed, then we do a full traversal but pass
>> --honor-pack-keep to drop out objects in the frozen packs after the
>> fact.
>> 
>> Thoughts?
>
> I'm confused. I thought the whole idea was to pass it as an option (the
> user telling Git "I know these packs are supposed to be closed; trust
> me")?

Yes, that is how I read these patches, and it sounds like an assumption
that we can make under many scenarios/repacking strategies.


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 20:30       ` Junio C Hamano
@ 2021-01-29 22:43         ` Jeff King
  2021-01-29 22:53           ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-01-29 22:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 12:30:38PM -0800, Junio C Hamano wrote:

> > I think this would generally happen if the .keep packs are generated
> > using something like "git repack -a", which packs everything reachable
> > together. So if you do:
> >
> >   git repack -ad
> >   touch .git/objects/pack/pack-whatever.keep
> >   ... some more packs come in, perhaps via pushes ...
> >   # imagine repack knew how to pass this along...
> >   git repack -a --assume-kept-packs-closed
> >
> > then you'd repack just the objects that aren't in the big pack.
> 
> Yeah.  As a tool to help the above workflow, where you are only
> creating another .keep out of youngest objects (i.e. those that are
> either loose or in non-kept packs), because by definition anything
> in .keep cannot be pointing back at these younger objects, it does
> make sense to take advantage of "the set of packs with .keep as a
> whole is closed".
> 
> It may become tricky once we start talking about creating a new
> .keep out of youngest objects PLUS a few young keep packs, though.

Right. You'd have to make sure the younger packs were also created with
that reachability in mind. I.e., if you are in a situation where you've
got:

  - a big "old" pack
  - N new packs from pushes

then you can't assume anything about the reachability for those
individual N packs. It would be wrong to split any of them into the
"keep" side. They need to have all their objects traversed until we hit
something in the old pack.

But if you have a situation with:

  - a big "old" pack
  - M packs made from previous rollups on top of the old pack
  - N new packs from pushes

Then I think you can still take the old pack + the M packs as a
cohesive unit closed under reachability.

The tricky part is knowing which packs are which (size is a heuristic,
but it can be wrong; people may make a big push to a previously-small
repository).

> Starting from all on-disk .keep packs, you'd mark them with the in-core
> keep bit, then drop the in-core keep bit from the few young keep packs
> that you intend to coalesce with the youngest objects---that is how
> I would imagine your repacking strategy would go.  The set of all
> the on-disk .keep packs may give us "closed" guarantee, but if we 
> exclude a few latest packs from that set, would the remainder still
> give us the "closed" guarantee we can take advantage of, in order to
> pack these youngest objects (including the ones in the kept packs
> that we are coalescing)?

Yeah, I think we are going along the same lines. Except I think it is
dangerous to use on-disk ".keep" as your marker, because we will racily
see incoming push packs with a ".keep" (which receive-pack/index-pack
use as a lockfile until the refs are updated).

So repack has to "somehow" get the list of which is which.

None of which is disputing your "it may become tricky", of course. ;) It
is exactly this trickiness that I am worried about. And I am not being
coy with "somehow", as if we have some custom not-yet-shared layer on
top of repack that tracks this. We are still figuring out whether this
is a good direction in the first place. :)

One of the things that led us to this reachability traversal, away from
"just suck up all of the objects from these packs", is that this is how
--unpacked works. I had always assumed it was implemented as "if it's
loose, then put it in the pack". But it's not. It's attached to the
revision traversal. And it actually gets some cases wrong!

It will walk every commit, so you don't have to worry about a packed
commit referring to an unpacked one. But it doesn't look at the trees of
the packed commits (for the quite obvious reason that doing so is orders
of magnitude more expensive). That means that if there is a packed
commit that refers to an unpacked blob (which is not referenced by an
unpacked commit), then "rev-list --unpacked" will not report it (and
likewise "git repack -d" would not pack it).

It's easy to create such a situation manually, but I've included a
more plausible sequence involving "--amend" and push/fetch unpackLimit
at the end of this email.

At the core, --unpacked is assuming certain things about reachability of
loose/packed objects that aren't necessarily true. And this
--assume-kept-packs-closed stuff is basically doing the same thing for a
particular set of packs (albeit more so; I believe the patches here cut
off traversal of parent pointers, not just commit-to-tree pointers).

One of the reasons I think nobody noticed with --unpacked is that the
stakes are pretty low. If our assumption is wrong, the worst case is
that a loose object remains unpacked in "repack -d". But we'd never
delete it based on that information (instead, git prune would do its own
traversal to find the correct reachability). And it would eventually get
picked up by "git repack -ad".

But for repacking, the general strategy is to put things you want to
keep into the new pack, and then delete the old ones (not marked as
keep, of course). So if our assumption is ever wrong, it means we'd
potentially drop packs that have reachable objects not found elsewhere,
and we'd end up corrupting the repository.

So I think the paths forward are either:

  - come up with an air-tight system of making sure that we know packs
    we claim are closed under reachability really are (perhaps some
    marker that says "I was generated by repack -a")

  - have a "roll-up" mode that does not care about reachability at all,
    and just takes any objects from a particular set of packs (plus
    probably loose objects)

I'm still thinking aloud here, and not really sure which is a better
path. I do feel like the failure modes for the second one are less
risky.

Anyway, here's the --unpacked example, if you're curious. It's based on
fetching, but you could invert it to do pushes (in which case it is
repacking in "parent" that gets the wrong result).

-- >8 --
# two repos, one a clone of the other
git init parent
git -C parent commit --allow-empty -m base
git clone parent child

# now there's a small fetch, which will get
# exploded into loose objects.
(
	cd parent
	echo small >small
	git add small
	git commit -m small
)
git -C child fetch

# We can verify that "rev-list --unpacked" reports these
# objects.
git -C child rev-list --objects --unpacked origin

# and now a bigger one that will remain a pack (we'll
# tweak unpackLimit instead of making a really big commit,
# but the concept is the same)
#
# There are two key things here:
#
#   - the "small" commit is no longer reachable, but the big one
#     still contains the "small" blob object. Using --amend is
#     a plausible mechanism for this happening.
#
#   - we are using bitmaps, which give us an exact answer for the set of
#     objects to send. Otherwise, pack-objects on the server actually
#     fails to notice the other side has told us it has the small blob, and
#     sends another copy of it.
(
	cd parent
	git repack -adb
	echo big >big
	git add big
	git commit --amend -m big
)
git -C child -c fetch.unpackLimit=1 fetch

# So now in the child we have a packed object
# whose ancestor is an unpacked one. rev-list
# now won't report the "small" blob (ac790413).
git -C child rev-list --objects --unpacked origin

# Even though we can see that it's present only as an
# unpacked object.
show_objects() {
	for i in child/.git/objects/pack/*.idx; do
		git show-index <"$i"
	done | cut -d' ' -f2 | sed 's/^/packed: /'
	find child/.git/objects/??/* |
		perl -F/ -alne 'print " loose: $F[-2]$F[-1]"'
}

# If we were to do an incremental repack now, it wouldn't be packed.
# (Note we have to kill off the reflog, which still references the
# rewound commit).
rm -rf child/.git/logs
git -C child repack -d
show_objects


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 22:43         ` Jeff King
@ 2021-01-29 22:53           ` Taylor Blau
  2021-01-29 23:00             ` Jeff King
  2021-01-29 23:10             ` Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 22:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:
> So I think the paths forward are either:
>
>   - come up with an air-tight system of making sure that we know packs
>     we claim are closed under reachability really are (perhaps some
>     marker that says "I was generated by repack -a")
>
>   - have a "roll-up" mode that does not care about reachability at all,
>     and just takes any objects from a particular set of packs (plus
>     probably loose objects)
>
> I'm still thinking aloud here, and not really sure which is a better
> path. I do feel like the failure modes for the second one are less
> risky.

The more I think about it, the more I feel that the second option is the
right approach. It seems like if you were naïvely implementing this from
scratch, that you'd pick the second one (i.e., have pack-objects
understand a new input mode, and then make a pack based on that).

I am leery that we'd be able to get the first option "right" without
attaching some sort of marker to each pack, especially given how
difficult I think that this is to reason about precisely. I suppose you
could have a .closed file corresponding to each pack, or alternatively a
$objdir/pack/pack-geometry file which specifies the same thing, but both
of these feel overly restrictive.

Besides having to special case the loose objects, is there any downside
to doing the simpler thing here?

> Anyway, here's the --unpacked example, if you're curious. It's based on
> fetching, but you could invert it to do pushes (in which case it is
> repacking in "parent" that gets the wrong result).

Fascinating indeed :-).

Thanks,
Taylor


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 22:10           ` Taylor Blau
@ 2021-01-29 22:57             ` Jeff King
  2021-01-29 23:03             ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 22:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee

On Fri, Jan 29, 2021 at 05:10:20PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
> > So it may be reasonable to go that direction, which is really defining a
> > totally separate strategy from git-gc's "repack, and occasionally
> > objects age out". Especially if we find that the
> > assume-kept-packs-closed route is too risky (i.e., has too many cases
> > where it's possible to cause corruption if our assumptions aren't met).
> 
> Yeah, this whole conversation has made me very nervous about using
> reachability. Fundamentally, this isn't about reachability at all. The
> operation is as simple as telling pack-objects a list of packs that you
> do and don't want objects from, making a new pack out of that, and then
> optionally dropping the packs that you rolled up.
> 
> So, I think that teaching pack-objects a way to understand a caller that
> says "include objects from packs X, Y, and Z, but not if they appear in
> packs A, B, or C, and also pull in any loose objects" is the best way
> forward here.
> 
> Of course, you're going to be dragging along unreachable objects until
> you decide to do a full repack, but I'm OK with that since we wouldn't
> expect anybody to be solely relying on geometric repacks without
> occasionally running 'git repack -ad'.

While writing my other response, I had some thoughts that this "dragging
along" might not be so bad.

Just to lay out the problem as I see it, if you do:

  - frequently roll up all small packs and loose objects into a new
    pack, without regard to reachability

  - occasionally run "git repack -ad" to do a real traversal

then the problem is that unreachable objects never age out:

  - a loose unreachable object starts with a recent-ish mtime

  - the frequent roll-up rolls it into a pack, freshening its mtime

  - the full "repack -ad" doesn't delete it, because its pack mtime is
    too recent. It explodes it loose again.

  - repeat forever

We know that "repack -d" is not 100% accurate because of similar "closed
under reachability" assumptions (see my other email). But it's OK,
because the worst case is an object that doesn't quite get packed yet,
not that it gets deleted.

So you could do something like:

  - roll up loose objects into a pack with "repack -d"; mostly accurate,
    but doesn't suck up unreachable objects

  - roll up small packs into a bigger pack without regard for
    reachability. This includes the pack created in the first step, but
    we know everything in it is actually reachable.

  - eventually run "repack -ad" to do a real traversal

That would extend the lifetime of unreachable objects which were found
in a pack (they get dragged forward during the rollups). But they'd
eventually get exploded loose during a "repack -ad", and then _not_
sucked back into a roll-up pack. And then eventually "repack -ad"
removes them.

The downsides are:

  - doing a separate "repack -d" plus a roll-up repack is wasted work.
    But I think they could be combined into a single step (at the cost
    of some extra complexity in the implementation).

  - using "--unpacked" still means traversing every commit. That's much
    faster than traversing the whole object graph, but still scales with
    the size of the repo, not the size of the new objects. That might be
    acceptable, though.

I do think the original problem goes away entirely if we can keep better
track of the mtimes. I.e., if we had packs marked with ".cruft" instead
of exploding loose, then the logic is:

  - roll up all loose objects and any objects in a pack that isn't
    marked as cruft (or keep); never delete a cruft pack at this stage

  - occasionally "repack -ad"; this does delete old cruft packs (because
    we'd have rescued any reachable objects they might have contained)

I'm not sure I want to block this topic on having cruft packs, though.
Of course there are tons of _other_ reasons to want them (like not
causing operational headaches when a repo's disk and inode usage grows
by 10x due to exploding loose objects). So maybe it's not a bad idea to
work on them together. I dunno.

-Peff


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 22:53           ` Taylor Blau
@ 2021-01-29 23:00             ` Jeff King
  2021-01-29 23:10             ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 23:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee

On Fri, Jan 29, 2021 at 05:53:58PM -0500, Taylor Blau wrote:

> > I'm still thinking aloud here, and not really sure which is a better
> > path. I do feel like the failure modes for the second one are less
> > risky.
> 
> The more I think about it, the more I feel that the second option is the
> right approach. It seems like if you were naïvely implementing this from
> scratch, that you'd pick the second one (i.e., have pack-objects
> understand a new input mode, and then make a pack based on that).
> 
> I am leery that we'd be able to get the first option "right" without
> attaching some sort of marker to each pack, especially given how
> difficult I think that this is to reason about precisely. I suppose you
> could have a .closed file corresponding to each pack, or alternatively a
> $objdir/pack/pack-geometry file which specifies the same thing, but both
> of these feel overly restrictive.

Yeah, I think my gut feeling matches yours.

> Besides having to special case the loose objects, is there any downside
> to doing the simpler thing here?

The other downside I can think of is that you can't just run "git repack
--geometric" every time, and eventually get a good result (or one that
asymptotically approaches good ;) ). I.e., you now have two types of
repacks: quick and dirty rollups, and "real" ones that do reachability.
So you need some heuristics about how often you do one versus the other.

I'm definitely OK with that outcome. And I think we could even bake
those heuristics into a script or mode of repack (e.g., maybe "gc
--auto" would trigger a bigger repack every N times or something). But
that's what I came up with by brainstorming. :)

-Peff


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 22:10           ` Taylor Blau
  2021-01-29 22:57             ` Jeff King
@ 2021-01-29 23:03             ` Junio C Hamano
  2021-01-29 23:28               ` Taylor Blau
  2021-01-29 23:31               ` Jeff King
  1 sibling, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 23:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> So, I think that teaching pack-objects a way to understand a caller that
> says "include objects from packs X, Y, and Z, but not if they appear in
> packs A, B, or C, and also pull in any loose objects" is the best way
> forward here.

Do our goals still include that the resulting packfile has good
delta compression and object locality?  Reachability traversal
discovers which commit comes close to which other commits to help
pack-objects to arrange the resulting pack so that objects that
appear close together in history appear close together.  It also
gives each object a pathname hint to help group objects of the same
type (either blobs or trees) with like-paths together for better
deltification.

Without reachability traversal, I would imagine that it would become
quite important to keep the order in which objects appear in the
original pack, and existing delta chain, as much as possible, or
we'd be seeing a horribly inefficient pack like fast-import would
produce.

Thanks.


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 22:53           ` Taylor Blau
  2021-01-29 23:00             ` Jeff King
@ 2021-01-29 23:10             ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 23:10 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:
>> So I think the paths forward are either:
>>
>>   - come up with an air-tight system of making sure that we know packs
>>     we claim are closed under reachability really are (perhaps some
>>     marker that says "I was generated by repack -a")
>>
>>   - have a "roll-up" mode that does not care about reachability at all,
>>     and just takes any objects from a particular set of packs (plus
>>     probably loose objects)
>>
>> I'm still thinking aloud here, and not really sure which is a better
>> path. I do feel like the failure modes for the second one are less
>> risky.
>
> The more I think about it, the more I feel that the second option is the
> right approach. It seems like if you were naïvely implementing this from
> scratch, that you'd pick the second one (i.e., have pack-objects
> understand a new input mode, and then make a pack based on that).

Yes, "roll-up" mode would be a sensible thing to have, as long as we
can keep pruning out of the picture for now.  But in the end, I do
think "stop at any object in this frozen pack---these objects go all
the way down to root and we know they are reachable" optimization
that would give 'prune' a performance boost with small margin of
false positive about reachability (i.e. we may never be able to
prune away an object in such a pack, even when it becomes
unreachable) would be a valuable thing to have in a practical
system, so from that point of view, the work done in these patches
are not lost ;-)

The efficiency issue of the resulting pack I mentioned earlier in a
separate message is there in the "roll-up" mode, though.


* Re: [PATCH 05/10] p5303: measure time to repack with keep
  2021-01-29 22:10         ` Jeff King
@ 2021-01-29 23:12           ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-29 23:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

Jeff King <peff@peff.net> writes:

> On Fri, Jan 29, 2021 at 12:38:08PM -0800, Junio C Hamano wrote:
>
>> > Oops. Looks like I was the one who introduced that. Nobody seems to have
>> > complained, so I'm somewhat tempted to leave it. But it would not be too
>> > hard to replace with perl, I think.
>> 
>> Yeah, but would it be worth it?  I am actually OK to say that you
>> need GNU sed if you want to run perf.  We already rely on GNU time
>> to run perf tests, no?
>
> True. This one is a little worse because it's subtle, and somebody might
> copy it unknowingly into the regular test suite.
>
> I am happy to leave it, or for you to pick up the patch I sent earlier
> (which I did verify produces identical output).

Yeah, I would be very unhappy if somebody copied-and-pasted it, but
somehow I didn't think too many people moved code in that direction
;-)

Will apply the portability fix, then.

Thanks.


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 23:03             ` Junio C Hamano
@ 2021-01-29 23:28               ` Taylor Blau
  2021-02-02  3:04                 ` Taylor Blau
  2021-01-29 23:31               ` Jeff King
  1 sibling, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-29 23:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Jeff King, git, dstolee

On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
>
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  Reachability traversal
> discovers which commit comes close to which other commits to help
> pack-objects to arrange the resulting pack so that objects that
> appear close together in history appear close together.  It also
> gives each object a pathname hint to help group objects of the same
> type (either blobs or trees) with like-paths together for better
> deltification.

I think our goals here are somewhere between having fewer packfiles
while also ensuring that the packfiles we had to create don't have
horrible delta compression and locality.

But now that you do mention it, I remember the reachability traversal's
bringing in object names was a reason that we decided to implement this
series using a reachability traversal in the first place.

> Without reachability traversal, I would imagine that it would become
> quite important to keep the order in which objects appear in the
> original pack, and existing delta chain, as much as possible, or
> we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Yeah; we'd definitely want to feed the objects to pack-objects in the
order that they appear in the original pack. Maybe that's not that bad a
tradeoff to make, though...

> Thanks.

Thanks,
Taylor


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 23:03             ` Junio C Hamano
  2021-01-29 23:28               ` Taylor Blau
@ 2021-01-29 23:31               ` Jeff King
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-01-29 23:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
> 
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  Reachability traversal
> discovers which commit comes close to which other commits to help
> pack-objects to arrange the resulting pack so that objects that
> appear close together in history appear close together.  It also
> gives each object a pathname hint to help group objects of the same
> type (either blobs or trees) with like-paths together for better
> deltification.
> 
> Without reachability traversal, I would imagine that it would become
> quite important to keep the order in which objects appear in the
> original pack, and existing delta chain, as much as possible, or
> we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Thanks, that's another good point we discussed a while ago (off-list),
but hasn't come up in this discussion yet.

Another option here is not to roll up packs at all, but instead to use a
midx to cover them all[1]. That solves the issue where object lookup is
O(nr_packs), and you retain the same locality and delta characteristics.

But I think part of the goal is to actually improve the deltas, in two
ways:

  - we'd hopefully find new delta opportunities between objects in the
    various packs

  - we'll drop some objects that are duplicated in other packs.
    Definitely we have to avoid duplicates in the roll-up pack, but I
    think we'd want to even for objects that are in the "big" kept pack.
    These are likely bases of deltas in our roll-up pack, since the
    common cause there is --fix-thin adding them to complete the pack.
    But we really prefer to serve fetches using the ones out of the main
    pack, since they may already themselves be deltas (which makes them
    way cheaper; we can send the delta straight off the disk, rather
    than looking for a new possible base).

So I would anticipate the delta-compression phase actually trying to do
some new work. I do worry that the lack of pathname hints may make the
deltas we find much worse (or cause us to spend excessive CPU
searching for them). It's possible we could do a "best effort" traversal
where we walk new commits to find newly added pathnames, but don't
bother crossing into trees/commits that aren't in the set of objects to
be packed. It's OK to optimize for speed there, because it's just
feeding the delta heuristic, not the set of objects we'd plan to pack.

-Peff

[1] Our end-game plan is actually to _also_ use a midx to cover the
    roll-ups and the "big" pack, since we'd want to generate bitmaps for
    the new objects, too.


* Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
  2021-01-29 23:28               ` Taylor Blau
@ 2021-02-02  3:04                 ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-02  3:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, dstolee

On Fri, Jan 29, 2021 at 06:28:28PM -0500, Taylor Blau wrote:
> On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> > Do our goals still include that the resulting packfile has good
> > delta compression and object locality?  Reachability traversal
> > discovers which commit comes close to which other commits to help
> > pack-objects to arrange the resulting pack so that objects that
> > appear close together in history appear close together.  It also
> > gives each object a pathname hint to help group objects of the same
> > type (either blobs or trees) with like-paths together for better
> > deltification.
>
> I think our goals here are somewhere between having fewer packfiles
> while also ensuring that the packfiles we had to create don't have
> horrible delta compression and locality.
>
> But now that you do mention it, I remember the reachability traversal's
> bringing in object names was a reason that we decided to implement this
> series using a reachability traversal in the first place.

Peff shared a very clever idea with me today. Like in the naive
approach, we fill the list of "objects to pack" with everything in the
packs that are about to get rolled up, excluding anything that appears
in the large packs.

But we do a reachability traversal whose starting points are all of the
commits in the packs that are about to be rolled up, filling in the
namehash of the objects we encounter along the way.

Like in the original version of this series, we'll stop early once we
encounter an object in any of the frozen packs (which are marked as kept
in core), and so we might not traverse through everything. But that's
completely OK, since we know we have the right list of objects to pack
(at worst, we would have some zero'd namehashes and come up with
slightly worse deltas).

But, I think that this is a nice middle-ground (and it allows us to
reuse lots of work from the original version), so I'm quite happy.
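
To illustrate the idea, here is a toy model of that traversal. It is only a sketch under stated assumptions, not the code in my fork: the 'name_hints' helper and its line-based input format ('pack', 'kept', 'tip', and 'edge' records) are made up for this example, and the path itself stands in for the real namehash. The list of objects to pack is fixed up front; the walk only fills in hints and stops at kept objects.

```shell
# name_hints: reads records on stdin, one per line:
#   pack OBJ             OBJ is in the list of objects to pack
#   kept OBJ             OBJ lives in a frozen (kept) pack
#   tip OBJ              OBJ is a starting point for the traversal
#   edge FROM TO [PATH]  FROM refers to TO, optionally under PATH
# Prints every object to pack followed by its path hint, or "-" if the
# traversal never assigned one.
name_hints () {
	awk '
	$1 == "pack" { order[++n] = $2 }
	$1 == "kept" { kept[$2] = 1 }
	$1 == "tip"  { stack[++top] = $2 }
	$1 == "edge" { kids[$2] = kids[$2] " " $3; hint[$2, $3] = $4 }
	END {
		while (top > 0) {
			obj = stack[top--]
			# Stop early at anything already seen or in a kept pack.
			if ((obj in seen) || (obj in kept))
				continue
			seen[obj] = 1
			m = split(kids[obj], child, " ")
			for (i = 1; i <= m; i++) {
				if (!(child[i] in name))
					name[child[i]] = hint[obj, child[i]]
				stack[++top] = child[i]
			}
		}
		# The output set never depends on what the walk reached.
		for (i = 1; i <= n; i++) {
			o = order[i]
			print o, (((o in name) && name[o] != "") ? name[o] : "-")
		}
	}'
}
```

If an object in the "pack" list is only reachable through a kept commit, it is still printed, just with "-" instead of a path. That is the "slightly worse deltas" case, not a correctness problem.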

It's in my fork [1] in the tb/geometric-repack.wip branch, but I'll try
and clean those patches up tomorrow and send a v2 to the list.

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git


* [PATCH v2 0/8] repack: support repacking into a geometric sequence
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (10 preceding siblings ...)
  2021-01-20 14:05 ` [PATCH 00/10] repack: support repacking into a geometric sequence Derrick Stolee
@ 2021-02-04  3:58 ` Taylor Blau
  2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
                     ` (8 more replies)
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
  13 siblings, 9 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

Here is an updated version of Peff's and my series to add a new 'git repack
--geometric' mode which supports repacking a repository into a geometric
progression of packs by object count.

This version depends on jk/p5303-sed-portability-fix, but it could be applied
onto 'master' after resolving a trivial conflict.

As a reminder, here is a description from the original cover letter [1] which
outlines what the geometric mode entails:

  Roughly speaking, for a given factor, say "d", each pack has at least "d"
  times the number of objects as the next largest pack. So, if there are "N"
  packs, "P1", "P2", ..., "PN" ordered by object count (where "PN" has the most
  objects, and "P1" the fewest), then:

    objects(Pi) > d * objects(P(i-1))

  for all 1 < i <= N.

  This is done by first ordering packs by object count, and then determining the
  longest sequence of large packs which already form a geometric progression.
  All packs on the small side of that cut must be repacked together, and so we
  check that the existing progression can be maintained with the new pack, and
  adjust as necessary.
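
To make the rule above concrete, here is a minimal sketch of the split computation on bare object counts. It only illustrates the description in this cover letter and is not the code from the final patch: the 'geometric_split' helper is made up for this example, it takes a factor followed by per-pack object counts sorted smallest first, and it uses the strict '>' comparison from the formula above.

```shell
# geometric_split FACTOR COUNT...
# Prints how many of the smallest packs would be rolled up into one new
# pack; 0 means the packs already form a geometric progression.
geometric_split () {
	factor=$1
	shift
	[ $# -gt 0 ] || { echo 0; return; }
	printf '%s\n' "$@" | awk -v d="$factor" '
	{ count[NR] = $1 }
	END {
		# Walk down from the largest pack for as long as each pack has
		# more than d times the objects of the next smaller one.
		for (i = NR; i > 1; i--)
			if (count[i] <= d * count[i - 1])
				break
		cut = i - 1

		# Packs 1..cut sit on the small side and get combined.
		for (i = 1; i <= cut; i++)
			total += count[i]

		# Absorb larger packs until the combined pack fits the
		# progression again.
		while (cut < NR && count[cut + 1] <= d * total) {
			cut++
			total += count[cut]
		}
		print cut
	}'
}

geometric_split 2 1 2 8 32
```

With a factor of 2 and packs of 1, 2, 8, and 32 objects, this prints 2: the two smallest packs are combined into one 3-object pack, which leaves 3, 8, 32.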

Since last time, the series has been reworked substantially. In the previous
version, a single reachability traversal was performed to determine the set of
objects to pack. That traversal halted upon encountering any objects found in a
kept pack, but this led to serious correctness problems (if, e.g., an object
we would like to pack is an ancestor of some other object in a kept pack, and
thus isn't picked up).

The details of the new approach can be found in the third patch, but the gist is
as follows:

  - 'git repack --geometric' calls 'git pack-objects --stdin-packs', which
    expects input like:

        pack-xyz.pack
        pack-abc.pack
        ^pack-exclude.pack

    'git pack-objects' determines the set of objects to pack by iterating all of
    the objects in the listed packs, and then removing any objects found in the
    packs which are prefixed with '^'.

  - To improve the delta selection process, the same reachability traversal from
    the original version of this series is performed. But, the set of objects to
    pack is already known, so we don't run the risk of the correctness bugs from
    before.

  - In this reachability traversal, visited objects get their namehash field
    set, which helps drive the heuristics that power delta selection. It's
    possible that we may not visit all of the objects to pack, but that's OK
    since this process is only additive (again, the set of objects to pack is
    known up-front independent of the reachability traversal).
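
The selection rule in the first bullet boils down to a set difference. As a rough sketch, assuming each pack is stood in for by a plain text file that lists its object IDs one per line (the kind of list 'git show-index' piped through 'cut' would give), it could be modeled as below. The 'stdin_packs_objects' helper and the directory-of-lists layout are hypothetical and only serve this example; the real work happens inside pack-objects.

```shell
# stdin_packs_objects DIR
# Reads pack names on stdin, one per line, with '^' marking excluded
# packs. DIR/<name> holds the object IDs of that pack, one per line.
# Prints the IDs found in the listed packs but in none of the '^' packs.
stdin_packs_objects () {
	dir=$1
	tmp=$(mktemp -d) || return 1
	: >"$tmp/include"
	: >"$tmp/exclude"
	while IFS= read -r name; do
		case "$name" in
		^*) cat "$dir/${name#?}" >>"$tmp/exclude" ;;
		*)  cat "$dir/$name" >>"$tmp/include" ;;
		esac
	done
	# Deduplicate both sides, then keep what only the include side has.
	sort -u "$tmp/include" >"$tmp/include.sorted"
	sort -u "$tmp/exclude" >"$tmp/exclude.sorted"
	comm -23 "$tmp/include.sorted" "$tmp/exclude.sorted"
	rm -rf "$tmp"
}
```

Feeding it the three lines from the example input above would list every object in pack-xyz.pack or pack-abc.pack that does not also appear in pack-exclude.pack, with duplicates between the two included packs collapsed.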

So, this strikes a happy medium between not relying on reachability so much that
we run the risk of corrupting the repository, but relying on it enough that we
can aid in the delta selection process.

Because we reuse the same "halt the traversal when encountering objects in kept
packs" mechanism, a lot of the patches are able to be reused. The structure of
the series is as follows:

  - The first three patches introduce new infrastructure, and implement 'git
    pack-objects --stdin-packs'.

  - The next four patches introduce and use a kept-pack cache, which improves
    the performance of 'git pack-objects --stdin-packs' substantially.

  - The final patch implements 'git repack --geometric'.

Let me know what you think of this new approach, and thanks in advance for your
review.

[1]: https://lore.kernel.org/git/cover.1611098616.git.me@ttaylorr.com/

Jeff King (4):
  p5303: add missing &&-chains
  p5303: measure time to repack with keep
  builtin/pack-objects.c: rewrite honor-pack-keep logic
  packfile: add kept-pack cache for find_kept_pack_entry()

Taylor Blau (4):
  packfile: introduce 'find_kept_pack_entry()'
  revision: learn '--no-kept-objects'
  builtin/pack-objects.c: add '--stdin-packs' option
  builtin/repack.c: add '--geometric' option

 Documentation/git-pack-objects.txt |  10 +
 Documentation/git-repack.txt       |  11 ++
 Documentation/rev-list-options.txt |   7 +
 builtin/pack-objects.c             | 301 +++++++++++++++++++++++------
 builtin/repack.c                   | 187 +++++++++++++++++-
 list-objects.c                     |   7 +
 object-store.h                     |  10 +
 packfile.c                         |  69 +++++++
 packfile.h                         |   2 +
 revision.c                         |  15 ++
 revision.h                         |   4 +
 t/perf/p5303-many-packs.sh         |  24 ++-
 t/t5300-pack-object.sh             |  97 ++++++++++
 t/t6114-keep-packs.sh              |  69 +++++++
 t/t7703-repack-geometric.sh        | 137 +++++++++++++
 15 files changed, 889 insertions(+), 61 deletions(-)
 create mode 100755 t/t6114-keep-packs.sh
 create mode 100755 t/t7703-repack-geometric.sh

Range-diff against v1:
 1:  dc7fa4c7a6 !  1:  f7186147eb packfile: introduce 'find_kept_pack_entry()'
    @@ Commit message
         packfile: introduce 'find_kept_pack_entry()'
     
         Future callers will want a function to fill a 'struct pack_entry' for a
    -    given object id but _only_ from its position in any kept pack(s). They
    -    could accomplish this by calling 'find_pack_entry()' and checking
    -    whether the found pack is kept or not, but this is insufficient, since
    -    there may be duplicate objects (and the mru cache makes it unpredictable
    -    which variant we'll get).
    -
    -    Teach this new function to treat the two different kinds of kept packs
    -    (on disk ones with .keep files, as well as in-core ones which are set by
    -    manually poking the 'pack_keep_in_core' bit) separately. This will
    -    become important for callers that only want to respect a certain kind of
    -    kept pack.
    -
    -    Introduce 'find_kept_pack_entry()' which behaves like
    -    'find_pack_entry()', except that it skips over packs which are not
    -    marked kept. Callers will be added in subsequent patches.
    +    given object id but _only_ from its position in any kept pack(s).
    +
    +    In particular, a new 'git repack' mode which ensures the resulting
    +    packs form a geometric progression by object count will mark packs that it
    +    does not want to repack as "kept in-core", and it will want to halt a
    +    reachability traversal as soon as it visits an object in any of the kept
    +    packs. But, it does not want to halt the traversal at non-kept, or
    +    .keep packs.
    +
    +    The obvious alternative is 'find_pack_entry()', but this doesn't quite
    +    suffice since it only returns the first pack it finds, which may or may
    +    not be kept (and the mru cache makes it unpredictable which one you'll
    +    get if there are options).
    +
    +    Short of that, you could walk over all packs looking for the object in
    +    each one, but it scales with the number of packs, which may be
    +    prohibitive.
    +
    +    Introduce 'find_kept_pack_entry()', a function which is like
    +    'find_pack_entry()', but only fills in objects in the kept packs.
    +
    +    Handle packs which have .keep files, as well as in-core kept packs
    +    separately, since certain callers will want to distinguish one from the
    +    other. (Though on-disk and in-core kept packs share the adjective
    +    "kept", it is best to think of the two sets as independent.)
    +
    +    There is a gotcha when looking up objects that are duplicated in kept
    +    and non-kept packs, particularly when the MIDX stores the non-kept
    +    version and the caller asked for kept objects only. This could be
    +    resolved by teaching the MIDX to resolve duplicates by always favoring
    +    the kept pack (if one exists), but this breaks an assumption in existing
    +    MIDXs, and so it would require a format change.
    +
    +    The benefit to changing the MIDX in this way is marginal, so we instead
    +    have a more thorough check here which is explained with a comment.
    +
    +    Callers will be added in subsequent patches.
     
         Co-authored-by: Jeff King <peff@peff.net>
         Signed-off-by: Jeff King <peff@peff.net>
    @@ packfile.c: int find_pack_entry(struct repository *r, const struct object_id *oi
      
      	for (m = r->objects->multi_pack_index; m; m = m->next) {
     -		if (fill_midx_entry(r, oid, e, m))
    -+		if (!(fill_midx_entry(r, oid, e, m)))
    ++		if (!fill_midx_entry(r, oid, e, m))
     +			continue;
     +
     +		if (!kept_only)
 2:  4184529648 !  2:  ddc2896caa revision: learn '--no-kept-objects'
    @@ Metadata
      ## Commit message ##
         revision: learn '--no-kept-objects'
     
    -    Some callers want to perform a reachability traversal that terminates
    -    when an object is found in a kept pack. The closest existing option is
    -    '--honor-pack-keep', but this isn't quite what we want. Instead of
    -    halting the traversal midway through, a full traversal is always
    -    performed, and the results are only trimmed afterwards.
    +    A future caller will want to be able to perform a reachability traversal
    +    which terminates when visiting an object found in a kept pack. The
    +    closest existing option is '--honor-pack-keep', but this isn't quite
    +    what we want. Instead of halting the traversal midway through, a full
    +    traversal is always performed, and the results are only trimmed
    +    afterwards.
     
         Besides needing to introduce a new flag (since culling results
         post-facto can be different than halting the traversal as it's
    @@ Commit message
         of kept packs, if any, should stop a traversal. This can be useful for
         callers that want to perform a reachability analysis, but want to leave
         certain packs alone (e.g., when doing a geometric repack that has
    -    some "large" packs it wants to leave alone).
    +    some "large" packs which are kept in-core that it wants to leave alone).
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
 3:  2da42e9ca2 <  -:  ---------- builtin/pack-objects.c: learn '--assume-kept-packs-closed'
 -:  ---------- >  3:  c96b1bf995 builtin/pack-objects.c: add '--stdin-packs' option
 4:  26b46dff15 !  4:  a46b7002b4 p5303: add missing &&-chains
    @@ Commit message
     
      ## t/perf/p5303-many-packs.sh ##
     @@ t/perf/p5303-many-packs.sh: repack_into_n () {
    - 	sed -n '1~5p' |
    - 	head -n "$1" |
    - 	perl -e 'print reverse <>' \
    --	>pushes
    -+	>pushes &&
    + 			push @commits, $_ if $. % 5 == 1;
    + 		}
    + 		print reverse @commits;
    +-	' "$1" >pushes
    ++	' "$1" >pushes &&
      
      	# create base packfile
      	head -n 1 pushes |
 5:  b3b2574d4d <  -:  ---------- p5303: measure time to repack with keep
 -:  ---------- >  5:  b5081c01b5 p5303: measure time to repack with keep
 6:  4dd5076fcc !  6:  c3868c7df9 pack-objects: rewrite honor-pack-keep logic
    @@ Metadata
     Author: Jeff King <peff@peff.net>
     
      ## Commit message ##
    -    pack-objects: rewrite honor-pack-keep logic
    +    builtin/pack-objects.c: rewrite honor-pack-keep logic
     
         Now that we have find_kept_pack_entry(), we don't have to manually keep
         hunting through every pack to find a possible "kept" duplicate of the
    @@ Commit message
         question. It might be worth having a similar optimized function to look
         at only local packs.
     
    -    Here are the results from p5303 (measurements taken on git.git):
    +    Here are the results from p5303 (measurements again taken on the
    +    kernel):
     
    -      Test                               HEAD^                  HEAD
    -      ------------------------------------------------------------------------------------
    -      5303.5: repack (1)                 57.29(54.88+10.39)     56.87(54.63+10.48) -0.7%
    -      5303.6: repack with keep (1)       1.25(1.19+0.05)        1.26(1.19+0.06) +0.8%
    -      5303.10: repack (50)               89.71(132.78+6.14)     89.35(132.42+6.25) -0.4%
    -      5303.11: repack with keep (50)     6.92(26.93+0.58)       6.73(26.61+0.59) -2.7%
    -      5303.15: repack (1000)             217.14(493.76+15.29)   217.25(494.38+15.24) +0.1%
    -      5303.16: repack with keep (1000)   209.46(387.83+8.42)    133.12(311.80+8.44) -36.4%
    +      Test                                        HEAD^                    HEAD
    +      -----------------------------------------------------------------------------------------------
    +      5303.5: repack (1)                          57.42(54.88+10.64)       57.44(54.71+10.78) +0.0%
    +      5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)          0.01(0.00+0.01) +0.0%
    +      5303.10: repack (50)                        71.26(88.24+4.96)        71.32(88.38+4.90) +0.1%
    +      5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)         3.43(11.81+0.22) -1.7%
    +      5303.15: repack (1000)                      215.64(491.33+14.80)     215.59(493.75+14.62) -0.0%
    +      5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)      131.44(314.24+8.11) -33.9%
     
    -    So our case with many packs and a .keep is finally now faster than the
    +    So our --stdin-packs case with many packs is now finally faster than the
         non-keep case (because it gets the speed benefit of looking at fewer
         objects, but not as big a penalty for looking at many packs).
     
 7:  182664e1a9 !  7:  f1c07324f6 packfile: add kept-pack cache for find_kept_pack_entry()
    @@ Commit message
     
         Here are p5303 results (as always, measured against the kernel):
     
    -      Test                               HEAD^                  HEAD
    -      ------------------------------------------------------------------------------------
    -      5303.5: repack (1)                 56.87(54.63+10.48)     56.63(54.41+10.36) -0.4%
    -      5303.6: repack with keep (1)       1.26(1.19+0.06)        1.25(1.19+0.05) -0.8%
    -      5303.10: repack (50)               89.35(132.42+6.25)     89.49(132.31+6.31) +0.2%
    -      5303.11: repack with keep (50)     6.73(26.61+0.59)       6.72(26.70+0.53) -0.1%
    -      5303.15: repack (1000)             217.25(494.38+15.24)   218.69(495.62+14.99) +0.7%
    -      5303.16: repack with keep (1000)   133.12(311.80+8.44)    128.79(306.96+8.55) -3.3%
    +      Test                                        HEAD^                   HEAD
    +      ----------------------------------------------------------------------------------------------
    +      5303.5: repack (1)                          57.44(54.71+10.78)      57.06(54.29+10.96) -0.7%
    +      5303.6: repack with --stdin-packs (1)       0.01(0.00+0.01)         0.01(0.01+0.00) +0.0%
    +      5303.10: repack (50)                        71.32(88.38+4.90)       71.47(88.60+5.04) +0.2%
    +      5303.11: repack with --stdin-packs (50)     3.43(11.81+0.22)        3.49(12.21+0.26) +1.7%
    +      5303.15: repack (1000)                      215.59(493.75+14.62)    217.41(495.36+14.85) +0.8%
    +      5303.16: repack with --stdin-packs (1000)   131.44(314.24+8.11)     126.75(309.88+8.09) -3.6%
     
         Signed-off-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    @@ builtin/pack-objects.c: static int want_found_object(const struct object_id *oid
      
      		if (ignore_packed_keep_on_disk && p->pack_keep)
      			return 0;
    +@@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
    + 	 * an optimization during delta selection.
    + 	 */
    + 	revs.no_kept_objects = 1;
    +-	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
    ++	revs.keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
    + 	revs.blob_objects = 1;
    + 	revs.tree_objects = 1;
    + 	revs.tag_objects = 1;
     
      ## object-store.h ##
     @@ object-store.h: static inline int pack_map_entry_cmp(const void *unused_cmp_data,
    @@ packfile.c: static int find_one_pack_entry(struct repository *r,
      		return 0;
      
      	for (m = r->objects->multi_pack_index; m; m = m->next) {
    --		if (!(fill_midx_entry(r, oid, e, m)))
    +-		if (!fill_midx_entry(r, oid, e, m))
     -			continue;
     -
     -		if (!kept_only)
 8:  6547c082f8 <  -:  ---------- builtin/pack-objects.c: teach '--keep-pack-stdin'
 9:  a808fbdf31 <  -:  ---------- builtin/repack.c: extract loose object handling
10:  f853087216 !  8:  d5561585c2 builtin/repack.c: add '--geometric' option
    @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_
     +	geometry = *geometry_p;
     +
     +	for (p = get_all_packs(the_repository); p; p = p->next) {
    ++		if (!pack_kept_objects && p->pack_keep)
    ++			continue;
    ++
     +		ALLOC_GROW(geometry->pack,
     +			   geometry->pack_nr + 1,
     +			   geometry->pack_alloc);
    @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_
     +	uint32_t split;
     +	off_t total_size = 0;
     +
    ++	if (geometry->pack_nr <= 1) {
    ++		geometry->split = geometry->pack_nr;
    ++		return;
    ++	}
    ++
     +	split = geometry->pack_nr - 1;
     +
     +	/*
    @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_
     +	geometry->split = 0;
     +}
     +
    - static void handle_loose_and_reachable(struct child_process *cmd,
    - 				       const char *unpack_unreachable,
    - 				       int pack_everything,
    + int cmd_repack(int argc, const char **argv, const char *prefix)
    + {
    + 	struct child_process cmd = CHILD_PROCESS_INIT;
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      	struct string_list names = STRING_LIST_INIT_DUP;
      	struct string_list rollback = STRING_LIST_INIT_NODUP;
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		die(_(incremental_bitmap_conflict_error));
      
     +	if (geometric_factor) {
    ++		if (pack_everything)
    ++			die(_("--geometric is incompatible with -A, -a"));
     +		init_pack_geometry(&geometry);
     +		split_pack_geometry(geometry, geometric_factor);
     +	}
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
      
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 			handle_loose_and_reachable(&cmd, unpack_unreachable,
    - 						   pack_everything,
    - 						   keep_unreachable);
    + 		strvec_pushf(&cmd.args, "--keep-pack=%s",
    + 			     keep_pack_list.items[i].string);
    + 	strvec_push(&cmd.args, "--non-empty");
    +-	strvec_push(&cmd.args, "--all");
    +-	strvec_push(&cmd.args, "--reflog");
    +-	strvec_push(&cmd.args, "--indexed-objects");
    ++	if (!geometry) {
    ++		/*
    ++		 * 'git pack-objects' will up all objects loose or packed
    ++		 * (either rolling them up or leaving them alone), so don't pass
    ++		 * these options.
    ++		 *
    ++		 * The implementation of 'git pack-objects --stdin-packs'
    ++		 * makes them redundant (and the two are incompatible).
    ++		 */
    ++		strvec_push(&cmd.args, "--all");
    ++		strvec_push(&cmd.args, "--reflog");
    ++		strvec_push(&cmd.args, "--indexed-objects");
    ++	}
    + 	if (has_promisor_remote())
    + 		strvec_push(&cmd.args, "--exclude-promisor-objects");
    + 	if (write_bitmaps > 0)
    +@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    + 				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
    + 			}
    + 		}
     +	} else if (geometry) {
    -+		strvec_push(&cmd.args, "--keep-pack-stdin");
    -+		strvec_push(&cmd.args, "--honor-pack-keep");
    -+		strvec_push(&cmd.args, "--assume-kept-packs-closed");
    -+		if (delete_redundant)
    -+			handle_loose_and_reachable(&cmd, unpack_unreachable,
    -+						   pack_everything,
    -+						   keep_unreachable);
    ++		strvec_push(&cmd.args, "--stdin-packs");
    ++		strvec_push(&cmd.args, "--unpacked");
      	} else {
      		strvec_push(&cmd.args, "--unpacked");
      		strvec_push(&cmd.args, "--incremental");
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +	if (geometry) {
     +		FILE *in = xfdopen(cmd.in, "w");
     +		/*
    -+		 * Tell 'git pack-objects' to avoid tampering with the structure
    -+		 * with the packs that already form a geometric progression.
    -+		 *
    -+		 * Everything else will get picked up by the reachability walk.
    ++		 * The resulting pack should contain all objects in packs that
    ++		 * are going to be rolled up, but exclude objects in packs which
    ++		 * are being left alone.
     +		 */
    -+		for (i = geometry->split; i < geometry->pack_nr; i++)
    ++		for (i = 0; i < geometry->split; i++)
     +			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
    ++		for (i = geometry->split; i < geometry->pack_nr; i++)
    ++			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
     +		fclose(in);
     +	}
     +
    @@ t/t7703-repack-geometric.sh (new)
     +objdir=.git/objects
     +midx=$objdir/pack/multi-pack-index
     +
    ++test_expect_success '--geometric with no packs' '
    ++	git init geometric &&
    ++	test_when_finished "rm -fr geometric" &&
    ++	(
    ++		cd geometric &&
    ++
    ++		git repack --geometric 2 >out &&
    ++		test_i18ngrep "Nothing new to pack" out
    ++	)
    ++'
    ++
     +test_expect_success '--geometric with an intact progression' '
     +	git init geometric &&
     +	test_when_finished "rm -fr geometric" &&
    @@ t/t7703-repack-geometric.sh (new)
     +		test_commit_bulk --start=4 4 && # 12 objects
     +
     +		find $objdir/pack -name "*.pack" | sort >expect &&
    -+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
    ++		git repack --geometric 2 -d &&
     +		find $objdir/pack -name "*.pack" | sort >actual &&
     +
     +		test_cmp expect actual
    @@ t/t7703-repack-geometric.sh (new)
     +		test_commit_bulk --start=7 8 && # 24 objects
     +		find $objdir/pack -name "*.pack" | sort >before &&
     +
    -+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
    ++		git repack --geometric 2 -d &&
     +
     +		# Three packs in total; two of the existing large ones, and one
     +		# new one.
    @@ t/t7703-repack-geometric.sh (new)
     +
     +		find $objdir/pack -name "*.pack" | sort >before &&
     +
    -+		GIT_TEST_MULTI_PACK_BITMAP=0 git repack --geometric 2 -d &&
    ++		git repack --geometric 2 -d &&
     +
     +		find $objdir/pack -name "*.pack" | sort >after &&
     +		comm -12 before after >untouched &&
    @@ t/t7703-repack-geometric.sh (new)
     +	)
     +'
     +
    ++test_expect_success '--geometric ignores kept packs' '
    ++	git init geometric &&
    ++	test_when_finished "rm -fr geometric" &&
    ++	(
    ++		cd geometric &&
    ++
    ++		test_commit kept && # 3 objects
    ++		test_commit pack && # 3 objects
    ++
    ++		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
    ++		refs/tags/kept
    ++		EOF
    ++		) &&
    ++		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
    ++		refs/tags/pack
    ++		^refs/tags/kept
    ++		EOF
    ++		) &&
    ++
    ++		# neither pack contains more than twice the number of objects in
    ++		# the other, so they should be combined. but, marking one as
    ++		# .kept on disk will "freeze" it, so the pack structure should
    ++		# remain unchanged.
    ++		touch $objdir/pack/pack-$KEPT.keep &&
    ++
    ++		find $objdir/pack -name "*.pack" | sort >before &&
    ++		git repack --geometric 2 -d &&
    ++		find $objdir/pack -name "*.pack" | sort >after &&
    ++
    ++		# both packs should still exist
    ++		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
    ++		test_path_is_file $objdir/pack/pack-$PACK.pack &&
    ++
    ++		# and no new packs should be created
    ++		test_cmp before after &&
    ++
    ++		# Passing --pack-kept-objects causes packs with a .keep file to
    ++		# be repacked, too.
    ++		git repack --geometric 2 -d --pack-kept-objects &&
    ++
    ++		find $objdir/pack -name "*.pack" >after &&
    ++		test_line_count = 1 after
    ++	)
    ++'
    ++
     +test_done
-- 
2.30.0.533.g2f8b6b552f.dirty

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()'
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
@ 2021-02-04  3:58   ` Taylor Blau
  2021-02-16 21:42     ` Jeff King
  2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s).

In particular, a new 'git repack' mode which ensures the resulting
packs form a geometric progression by object count will mark packs that it
does not want to repack as "kept in-core", and it will want to halt a
reachability traversal as soon as it visits an object in any of the kept
packs. But, it does not want to halt the traversal at non-kept, or
.keep packs.

The obvious alternative is 'find_pack_entry()', but this doesn't quite
suffice since it only returns the first pack it finds, which may or may
not be kept (and the mru cache makes it unpredictable which one you'll
get if there are options).

Short of that, you could walk over all packs looking for the object in
each one, but it scales with the number of packs, which may be
prohibitive.

Introduce 'find_kept_pack_entry()', a function which is like
'find_pack_entry()', but only fills in objects in the kept packs.

Handle packs which have .keep files, as well as in-core kept packs
separately, since certain callers will want to distinguish one from the
other. (Though on-disk and in-core kept packs share the adjective
"kept", it is best to think of the two sets as independent.)

There is a gotcha when looking up objects that are duplicated in kept
and non-kept packs, particularly when the MIDX stores the non-kept
version and the caller asked for kept objects only. This could be
resolved by teaching the MIDX to resolve duplicates by always favoring
the kept pack (if one exists), but this breaks an assumption in existing
MIDXs, and so it would require a format change.

The benefit to changing the MIDX in this way is marginal, so we instead
have a more thorough check here which is explained with a comment.
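
To make the gotcha concrete, here is a minimal sketch under simplifying
assumptions. 'struct toy_pack', 'toy_is_kept()' and 'toy_find_kept()' are
hypothetical stand-ins rather than the functions in this patch, and the MIDX
is reduced to a single pointer to whichever copy of the object it recorded.
Only the two flag values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as the macros this patch adds to packfile.h. */
#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2

/* Hypothetical stand-in for the parts of 'struct packed_git' used here. */
struct toy_pack {
	int has_object;        /* does this pack contain the wanted object? */
	int pack_keep;         /* a .keep file exists on disk */
	int pack_keep_in_core; /* kept bit set by the running process */
};

/* Does 'p' count as kept for the kinds of kept pack named by 'flags'? */
static int toy_is_kept(const struct toy_pack *p, unsigned flags)
{
	return ((flags & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
	       ((flags & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core);
}

/*
 * 'midx_choice' is the single copy the toy MIDX recorded (or NULL). If
 * that copy is not in a kept pack, scan every pack so that a kept
 * duplicate is not missed.
 */
static const struct toy_pack *toy_find_kept(const struct toy_pack *midx_choice,
					    const struct toy_pack *packs,
					    size_t nr, unsigned flags)
{
	size_t i;

	assert(packs || !nr);
	if (midx_choice && toy_is_kept(midx_choice, flags))
		return midx_choice;
	for (i = 0; i < nr; i++)
		if (toy_is_kept(&packs[i], flags) && packs[i].has_object)
			return &packs[i];
	return NULL;
}
```

The fallback scan costs more than trusting the MIDX alone, which is the
trade-off accepted here in place of a MIDX format change.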

Callers will be added in subsequent patches.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.h |  6 +++++
 2 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 4b938b4372..5f35cfe788 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2031,7 +2031,10 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static int find_one_pack_entry(struct repository *r,
+			       const struct object_id *oid,
+			       struct pack_entry *e,
+			       int kept_only)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2041,26 +2044,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (fill_midx_entry(r, oid, e, m))
+		if (!fill_midx_entry(r, oid, e, m))
+			continue;
+
+		if (!kept_only)
+			return 1;
+
+		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
+		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
-			list_move(&p->mru, &r->objects->packed_git_mru);
-			return 1;
+		if (p->multi_pack_index && !kept_only) {
+			/*
+			 * If this pack is covered by the MIDX, we'd have found
+			 * the object already in the loop above if it was here,
+			 * so don't bother looking.
+			 *
+			 * The exception is if we are looking only at kept
+			 * packs. An object can be present in two packs covered
+			 * by the MIDX, one kept and one not-kept. And as the
+			 * MIDX points to only one copy of each object, it might
+			 * have returned only the non-kept version above. We
+			 * have to check again to be thorough.
+			 */
+			continue;
+		}
+		if (!kept_only ||
+		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
+		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
+			if (fill_pack_entry(oid, e, p)) {
+				list_move(&p->mru, &r->objects->packed_git_mru);
+				return 1;
+			}
 		}
 	}
 	return 0;
 }
 
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+{
+	return find_one_pack_entry(r, oid, e, 0);
+}
+
+int find_kept_pack_entry(struct repository *r,
+			 const struct object_id *oid,
+			 unsigned flags,
+			 struct pack_entry *e)
+{
+	/*
+	 * Load all packs, including midx packs, since our "kept" strategy
+	 * relies on that. We're relying on the side effect of it setting up
+	 * r->objects->packed_git, which is a little ugly.
+	 */
+	get_all_packs(r);
+	return find_one_pack_entry(r, oid, e, flags);
+}
+
 int has_object_pack(const struct object_id *oid)
 {
 	struct pack_entry e;
 	return find_pack_entry(the_repository, oid, &e);
 }
 
+int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+{
+	struct pack_entry e;
+	return find_kept_pack_entry(the_repository, oid, flags, &e);
+}
+
 int has_pack_index(const unsigned char *sha1)
 {
 	struct stat st;
diff --git a/packfile.h b/packfile.h
index a58fc738e0..624327f64d 100644
--- a/packfile.h
+++ b/packfile.h
@@ -161,13 +161,19 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
+#define ON_DISK_KEEP_PACKS 1
+#define IN_CORE_KEEP_PACKS 2
+#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
+
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
  */
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e);
+int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e);
 
 int has_object_pack(const struct object_id *oid);
+int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
 int has_pack_index(const unsigned char *sha1);
 
-- 
2.30.0.533.g2f8b6b552f.dirty


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v2 2/8] revision: learn '--no-kept-objects'
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
  2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-02-04  3:58   ` Taylor Blau
  2021-02-16 23:17     ` Jeff King
  2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:58 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwards.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle in handling the distinction
between in-core and on-disk kept packs. That is: what kinds of kept pack
should stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).
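
As a toy illustration of why trimming results after the fact can differ from
halting the traversal as it happens, consider the sketch below. A linked
chain stands in for the object graph, and all names are hypothetical. It
shows the distinction only and is not how revision.c or list-objects.c
implement it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy object; 'parent' is the next object reached from it. */
struct toy_object {
	const char *name;
	int in_kept_pack; /* assumed result of a kept-pack lookup */
	const struct toy_object *parent;
};

/* Walk everything, dropping kept objects from the output only. */
static size_t toy_walk_then_trim(const struct toy_object *tip,
				 const char **out)
{
	size_t nr = 0;

	assert(out);
	for (; tip; tip = tip->parent)
		if (!tip->in_kept_pack)
			out[nr++] = tip->name;
	return nr;
}

/* Stop walking as soon as an object in a kept pack is reached. */
static size_t toy_walk_halting(const struct toy_object *tip,
			       const char **out)
{
	size_t nr = 0;

	assert(out);
	for (; tip && !tip->in_kept_pack; tip = tip->parent)
		out[nr++] = tip->name;
	return nr;
}
```

For a chain C -> B -> A in which only B is in a kept pack, the trimming walk
reports C and A, while the halting walk reports only C, because it never
looks past B.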

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/rev-list-options.txt |  7 +++
 list-objects.c                     |  7 +++
 revision.c                         | 15 +++++++
 revision.h                         |  4 ++
 t/t6114-keep-packs.sh              | 69 ++++++++++++++++++++++++++++++
 5 files changed, 102 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 96cc89d157..f611832277 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -861,6 +861,13 @@ ifdef::git-rev-list[]
 	Only useful with `--objects`; print the object IDs that are not
 	in packs.
 
+--no-kept-objects[=<kind>]::
+	Halts the traversal as soon as an object in a kept pack is
+	found. If `<kind>` is `on-disk`, only packs with a corresponding
+	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
+	with their in-core kept state set are ignored. Otherwise, both
+	kinds of kept packs are ignored.
+
 --object-names::
 	Only useful with `--objects`; print the names of the object IDs
 	that are found. This is the default behavior.
diff --git a/list-objects.c b/list-objects.c
index e19589baa0..b06c3bfeba 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 			ctx->show_object(obj, name, ctx->show_data);
 			continue;
 		}
+		if (ctx->revs->no_kept_objects) {
+			struct pack_entry e;
+			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
+						 ctx->revs->keep_pack_cache_flags,
+						 &e))
+				continue;
+		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
diff --git a/revision.c b/revision.c
index fbc3e607fd..4c5adb90b1 100644
--- a/revision.c
+++ b/revision.c
@@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index e6be3c845e..a20a530d52 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@ struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -317,6 +318,9 @@ struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.30.0.533.g2f8b6b552f.dirty


* [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
  2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
  2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-16 23:46     ` Jeff King
  2021-02-04  3:59   ` [PATCH v2 4/8] p5303: add missing &&-chains Taylor Blau
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

In an upcoming commit, 'git repack' will want to create a pack comprised
of all of the objects in some packs (the included packs) excluding any
objects in some other packs (the excluded packs).

This caller could iterate those packs themselves and feed the objects it
finds to 'git pack-objects' directly over stdin, but this approach has a
few downsides:

  - It requires every caller that wants to drive 'git pack-objects' in
    this way to implement pack iteration themselves. This forces the
    caller to think about details like what order objects are fed to
    pack-objects, which callers would likely rather not do.

  - If the set of objects in included packs is large, it requires
    sending a lot of data over a pipe, which is inefficient.

  - The caller is forced to keep track of the excluded objects, too, and
    make sure that it doesn't send any objects that appear in both
    included and excluded packs.

But the biggest downside is the lack of a reachability traversal.
Because the caller passes in a list of objects directly, those objects
don't get a namehash assigned to them, which can have a negative impact
on the delta selection process, causing 'git pack-objects' to fail to
find good deltas even when they exist.

The caller could formulate a reachability traversal themselves, but the
only way to drive 'git pack-objects' in this way is to do a full
traversal, and then remove objects in the excluded packs after the
traversal is complete. This can be detrimental to callers who care
about performance, especially in repositories with many objects.

Introduce 'git pack-objects --stdin-packs' which remedies these four
concerns.

'git pack-objects --stdin-packs' expects a list of pack names on stdin,
where 'pack-xyz.pack' denotes that pack as included, and
'^pack-xyz.pack' denotes it as excluded. The resulting pack includes all
objects that are present in at least one included pack, and aren't
present in any excluded pack.
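
The input format and the inclusion rule described above can be sketched
in C as follows. This is an illustration only, not code from the patch:
the names pack_role, classify_pack_line() and object_is_wanted() are
hypothetical.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of the patch. */
enum pack_role { PACK_SKIP, PACK_INCLUDED, PACK_EXCLUDED };

/*
 * Classify one line of --stdin-packs input. On return, *name points at
 * the pack name inside `line`, past any leading '^'.
 */
static enum pack_role classify_pack_line(const char *line, const char **name)
{
	if (!line || !*line) {
		*name = NULL;
		return PACK_SKIP;	/* empty lines are ignored */
	}
	if (*line == '^') {
		*name = line + 1;
		return PACK_EXCLUDED;
	}
	*name = line;
	return PACK_INCLUDED;
}

/*
 * An object lands in the resulting pack if it is present in at least
 * one included pack and in no excluded pack.
 */
static int object_is_wanted(int in_included_pack, int in_excluded_pack)
{
	return in_included_pack && !in_excluded_pack;
}
```

An object that appears in both an included and an excluded pack is
therefore left out, which is what lets the caller list packs without
tracking individual objects.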

To address the delta selection problem, 'git pack-objects --stdin-packs'
works as follows. First, it assembles a list of objects that it is going
to pack, as above. Then, a reachability traversal is started, whose tips
are any commits mentioned in included packs. Upon visiting an object, we
find its corresponding object_entry in the to_pack list, and set its
namehash parameter appropriately.
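
To show why the namehash matters for delta selection, here is a minimal
sketch of a name hash of this kind. It assumes the hash behaves like
the pack_name_hash() helper that the patch calls: whitespace is
skipped, and later characters of the path carry the most weight, so
paths with the same ending get nearby values. The function name
name_hash_sketch() is hypothetical, and the body is an approximation
rather than a copy of Git's implementation.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Hypothetical sketch: fold a path into a 32-bit value in which the
 * trailing characters dominate. A NULL name (an object that was never
 * reached by a traversal) hashes to zero.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;	/* whitespace does not contribute */
		/* older characters decay, the newest lands in the top byte */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Objects added straight from a pack index have no path, so they start
with a hash of zero and get no useful grouping until the traversal
supplies a name.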

To avoid the traversal visiting more objects than it needs to, the
traversal is halted upon encountering an object which can be found in an
excluded pack (by marking the excluded packs as kept in-core, and
passing --no-kept-objects=in-core to the revision machinery).

This can cause the traversal to halt early, for example if an object in
an included pack is an ancestor of ones in excluded packs. But stopping
early is OK, since filling in the namehash fields of objects in the
to_pack list is only additive (i.e., having it helps the delta selection
process, but leaving it blank doesn't impact the correctness of the
resulting pack).

Even still, it is unlikely that this hurts us much in practice, since
the 'git repack --geometric' caller (which is introduced in a later
commit) marks small packs as included, and large ones as excluded.
During ordinary use, the small packs usually represent pushes after a
large repack, and so are unlikely to be ancestors of objects that
already exist in the repository.

(I found it convenient while developing this patch to have 'git
pack-objects' report the number of objects which were visited and got
their namehash fields filled in during traversal. This is also included
in the below patch via trace2 data lines).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  10 ++
 builtin/pack-objects.c             | 176 ++++++++++++++++++++++++++++-
 t/t5300-pack-object.sh             |  97 ++++++++++++++++
 3 files changed, 281 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..92733f6bf5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -85,6 +85,16 @@ base-name::
 	reference was included in the resulting packfile.  This
 	can be useful to send new tags to native Git clients.
 
+--stdin-packs::
+	Read the basenames of packfiles from the standard input, instead
+	of object names or revision arguments. The resulting pack
+	contains all objects listed in the included packs (those not
+	beginning with `^`), excluding any objects listed in the
+	excluded packs (beginning with `^`).
++
+Incompatible with `--revs`, or options that imply `--revs` (such as
+`--all`), with the exception of `--unpacked`, which is compatible.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 13cde5896a..6d19eb000a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2979,6 +2979,164 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 	return git_default_config(k, v, cb);
 }
 
+static int stdin_packs_found_nr;
+static int stdin_packs_hints_nr;
+
+static int add_object_entry_from_pack(const struct object_id *oid,
+				      struct packed_git *p,
+				      uint32_t pos,
+				      void *_data)
+{
+	struct rev_info *revs = _data;
+	struct object_info oi = OBJECT_INFO_INIT;
+	off_t ofs;
+	enum object_type type;
+
+	display_progress(progress_state, ++nr_seen);
+
+	ofs = nth_packed_object_offset(p, pos);
+
+	oi.typep = &type;
+	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
+		die(_("could not get type of object %s in pack %s"),
+		    oid_to_hex(oid), p->pack_name);
+	else if (type == OBJ_COMMIT) {
+		/*
+		 * commits in included packs are used as starting points for the
+		 * subsequent revision walk
+		 */
+		add_pending_oid(revs, NULL, oid, 0);
+	}
+
+	if (have_duplicate_entry(oid, 0))
+		return 0;
+
+	if (!want_object_in_pack(oid, 0, &p, &ofs))
+		return 0;
+
+	stdin_packs_found_nr++;
+
+	create_object_entry(oid, type, 0, 0, 0, p, ofs);
+
+	return 0;
+}
+
+static void show_commit_pack_hint(struct commit *commit, void *_data)
+{
+}
+
+static void show_object_pack_hint(struct object *object, const char *name,
+				  void *_data)
+{
+	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
+	if (!oe)
+		return;
+
+	/*
+	 * Our 'to_pack' list was constructed by iterating all objects packed in
+	 * included packs, and so doesn't have a non-zero hash field that you
+	 * would typically pick up during a reachability traversal.
+	 *
+	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
+	 * fields here in order to perhaps improve the delta selection
+	 * process.
+	 */
+	oe->hash = pack_name_hash(name);
+	oe->no_try_delta = name && no_try_delta(name);
+
+	stdin_packs_hints_nr++;
+}
+
+static void read_packs_list_from_stdin(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list include_packs = STRING_LIST_INIT_DUP;
+	struct string_list exclude_packs = STRING_LIST_INIT_DUP;
+	struct string_list_item *item = NULL;
+
+	struct packed_git *p;
+	struct rev_info revs;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	/*
+	 * Use a revision walk to fill in the namehash of objects in the include
+	 * packs. To save time, we'll avoid traversing through objects that are
+	 * in excluded packs.
+	 *
+	 * That may cause us to avoid populating all of the namehash fields of
+	 * all included objects, but our goal is best-effort, since this is only
+	 * an optimization during delta selection.
+	 */
+	revs.no_kept_objects = 1;
+	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+	revs.blob_objects = 1;
+	revs.tree_objects = 1;
+	revs.tag_objects = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '^')
+			string_list_append(&exclude_packs, buf.buf + 1);
+		else
+			string_list_append(&include_packs, buf.buf);
+
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&include_packs);
+	string_list_sort(&exclude_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+
+		item = string_list_lookup(&include_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&exclude_packs, pack_name);
+
+		if (item)
+			item->util = p;
+	}
+
+	/*
+	 * First handle all of the excluded packs, marking them as kept in-core
+	 * so that later calls to add_object_entry() discard any objects that
+	 * are also found in excluded packs.
+	 */
+	for_each_string_list_item(item, &exclude_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = 1;
+	}
+	for_each_string_list_item(item, &include_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		for_each_object_in_pack(p,
+					add_object_entry_from_pack,
+					&revs,
+					FOR_EACH_OBJECT_PACK_ORDER);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	traverse_commit_list(&revs,
+			     show_commit_pack_hint,
+			     show_object_pack_hint,
+			     NULL);
+
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
+			   stdin_packs_found_nr);
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_hints",
+			   stdin_packs_hints_nr);
+
+	strbuf_release(&buf);
+	string_list_clear(&include_packs, 0);
+	string_list_clear(&exclude_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3482,6 +3640,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
+	int stdin_packs = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -3532,6 +3691,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_SET_INT_F(0, "indexed-objects", &rev_list_index,
 			      N_("include objects referred to by the index"),
 			      1, PARSE_OPT_NONEG),
+		OPT_BOOL(0, "stdin-packs", &stdin_packs,
+			 N_("read packs from stdin")),
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
@@ -3636,7 +3797,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--indexed-objects");
 	}
-	if (rev_list_unpacked) {
+	if (rev_list_unpacked && !stdin_packs) {
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
@@ -3681,8 +3842,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (filter_options.choice) {
 		if (!pack_to_stdout)
 			die(_("cannot use --filter without --stdout"));
+		if (stdin_packs)
+			die(_("cannot use --filter with --stdin-packs"));
 	}
 
+	if (stdin_packs && use_internal_rev_list)
+		die(_("cannot use internal rev list with --stdin-packs"));
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -3741,7 +3907,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (progress)
 		progress_state = start_progress(_("Enumerating objects"), 0);
-	if (!use_internal_rev_list)
+	if (stdin_packs) {
+		/* avoids adding objects in excluded packs */
+		ignore_packed_keep_in_core = 1;
+		read_packs_list_from_stdin();
+		if (rev_list_unpacked)
+			add_unreachable_loose_objects();
+	} else if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
 		get_object_list(rp.nr, rp.v);
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 392201cabd..7138a54595 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -532,4 +532,101 @@ test_expect_success 'prefetch objects' '
 	test_line_count = 1 donelines
 '
 
+test_expect_success 'setup for --stdin-packs tests' '
+	git init stdin-packs &&
+	(
+		cd stdin-packs &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+
+		for id in A B C
+		do
+			git pack-objects .git/objects/pack/pack-$id \
+				--incremental --revs <<-EOF
+			refs/tags/$id
+			EOF
+		done &&
+
+		ls -la .git/objects/pack
+	)
+'
+
+test_expect_success '--stdin-packs with excluded packs' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		git pack-objects test --stdin-packs <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx)
+		) >expect.raw &&
+		git show-index <$(ls test-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --filter' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --stdout \
+			--filter=blob:none </dev/null 2>err &&
+		test_i18ngrep "cannot use --filter with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --revs' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --revs out \
+			</dev/null 2>err &&
+		test_i18ngrep "cannot use internal rev list with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs with loose objects' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		test_commit D && # loose
+
+		git pack-objects test2 --stdin-packs --unpacked <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx) &&
+			git rev-list --objects --no-object-names \
+				refs/tags/C..refs/tags/D
+
+		) >expect.raw &&
+		ls -la . &&
+		git show-index <$(ls test2-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.30.0.533.g2f8b6b552f.dirty


* [PATCH v2 4/8] p5303: add missing &&-chains
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (2 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

From: Jeff King <peff@peff.net>

These are in a helper function, so the usual chain-lint doesn't notice
them. This function is still not perfect, as it has some git invocations
on the left-hand-side of the pipe, but its primary purpose is timing,
not finding bugs or correctness issues.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index ce0c42cc9f..d90d714923 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -28,11 +28,11 @@ repack_into_n () {
 			push @commits, $_ if $. % 5 == 1;
 		}
 		print reverse @commits;
-	' "$1" >pushes
+	' "$1" >pushes &&
 
 	# create base packfile
 	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack
+	git pack-objects --delta-base-offset --revs staging/pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
-- 
2.30.0.533.g2f8b6b552f.dirty


* [PATCH v2 5/8] p5303: measure time to repack with keep
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (3 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 4/8] p5303: add missing &&-chains Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-16 23:58     ` Jeff King
  2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

From: Jeff King <peff@peff.net>

This is the same as the regular repack test, except that we mark the
single base pack as excluded (kept in-core) and use --stdin-packs. The
theory is that this should be faster than the normal repack, because
we'll have fewer objects to traverse and process.

Here are some timings on a recent clone of the kernel. In the
single-pack case, there is nothing to do since there are no non-excluded
packs:

  5303.5: repack (1)                          57.42(54.88+10.64)
  5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)

and in the 50-pack case, it is much faster to use `--stdin-packs`, since
we avoid having to consider any objects in the excluded pack:

  5303.10: repack (50)                        71.26(88.24+4.96)
  5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)

but our improvements vanish as we approach 1000 packs.

  5303.15: repack (1000)                      215.64(491.33+14.80)
  5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)

That's because the code paths around handling .keep files are known to
scale badly; they look in every single pack file to find each object.
Our solution to that was to notice that most repos don't have keep
files, and to make that case a fast path. But as soon as you add a
single .keep, that part of pack-objects slows down again (even if we
have fewer objects total to look at).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index d90d714923..b76a6efe00 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -31,8 +31,11 @@ repack_into_n () {
 	' "$1" >pushes &&
 
 	# create base packfile
-	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack &&
+	base_pack=$(
+		head -n 1 pushes |
+		git pack-objects --delta-base-offset --revs staging/pack
+	) &&
+	test_export base_pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
@@ -49,6 +52,12 @@ repack_into_n () {
 		last=$rev
 	done <pushes &&
 
+	(
+		find staging -type f -name 'pack-*.pack' |
+			xargs -n 1 basename | grep -v "$base_pack" &&
+		printf "^pack-%s.pack\n" $base_pack
+	) >stdin.packs
+
 	# and install the whole thing
 	rm -f .git/objects/pack/* &&
 	mv staging/* .git/objects/pack/
@@ -91,6 +100,15 @@ do
 		  --reflog --indexed-objects --delta-base-offset \
 		  --stdout </dev/null >/dev/null
 	'
+
+	test_perf "repack with --stdin-packs ($nr_packs)" '
+		git pack-objects \
+		  --keep-true-parents \
+		  --stdin-packs \
+		  --non-empty \
+		  --delta-base-offset \
+		  --stdout <stdin.packs >/dev/null
+	'
 done
 
 # Measure pack loading with 10,000 packs.
-- 
2.30.0.533.g2f8b6b552f.dirty


* [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (4 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-17 16:05     ` Jeff King
  2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

From: Jeff King <peff@peff.net>

Now that we have find_kept_pack_entry(), we don't have to manually keep
hunting through every pack to find a possible "kept" duplicate of the
object. This should be faster, assuming only a portion of your total
packs are actually kept.

Note that we have to re-order the logic a bit here; we can deal with the
"kept" situation completely, and then just fall back to the "--local"
question. It might be worth having a similar optimized function to look
at only local packs.

Here are the results from p5303 (measurements again taken on the
kernel):

  Test                                        HEAD^                    HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.42(54.88+10.64)       57.44(54.71+10.78) +0.0%
  5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)          0.01(0.00+0.01) +0.0%
  5303.10: repack (50)                        71.26(88.24+4.96)        71.32(88.38+4.90) +0.1%
  5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)         3.43(11.81+0.22) -1.7%
  5303.15: repack (1000)                      215.64(491.33+14.80)     215.59(493.75+14.62) -0.0%
  5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)      131.44(314.24+8.11) -33.9%

So our --stdin-packs case with many packs is now finally faster than the
non-keep case (because it gets the speed benefit of looking at fewer
objects, but not as big a penalty for looking at many packs).
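
For reference, the last column of the table is consistent with the
relative change in elapsed (wall-clock) time between HEAD^ and HEAD. A
small sketch of that arithmetic, with the hypothetical helper name
percent_change():

```c
/*
 * Relative change from `before` to `after`, as a percentage.
 * Negative values mean the run got faster.
 */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Applied to the 1000-pack --stdin-packs row, percent_change(198.79,
131.44) is about -33.9, matching the table.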

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 125 ++++++++++++++++++++++++-----------------
 1 file changed, 73 insertions(+), 52 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6d19eb000a..fbd7b54d70 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1188,7 +1188,8 @@ static int have_duplicate_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int want_found_object(const struct object_id *oid, int exclude,
+			     struct packed_git *p)
 {
 	if (exclude)
 		return 1;
@@ -1209,22 +1210,73 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 * Otherwise, we signal "-1" at the end to tell the caller that we do
 	 * not know either way, and it needs to check more packs.
 	 */
-	if (!ignore_packed_keep_on_disk &&
-	    !ignore_packed_keep_in_core &&
-	    (!local || !have_non_local_packs))
+
+	/*
+	 * Handle .keep first, as we have a fast(er) path there.
+	 */
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+		/*
+		 * Set the flags for the kept-pack cache to be the ones we want
+		 * to ignore.
+		 *
+		 * That is, if we are ignoring objects in on-disk keep packs,
+		 * then we want to search through the on-disk keep and ignore
+		 * the in-core ones.
+		 */
+		unsigned flags = 0;
+		if (ignore_packed_keep_on_disk)
+			flags |= ON_DISK_KEEP_PACKS;
+		if (ignore_packed_keep_in_core)
+			flags |= IN_CORE_KEEP_PACKS;
+
+		if (ignore_packed_keep_on_disk && p->pack_keep)
+			return 0;
+		if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+			return 0;
+		if (has_object_kept_pack(oid, flags))
+			return 0;
+	}
+
+	/*
+	 * At this point we know definitively that either we don't care about
+	 * keep-packs, or the object is not in one. Keep checking other
+	 * conditions...
+	 */
+
+	if (!local || !have_non_local_packs)
 		return 1;
-
 	if (local && !p->pack_local)
 		return 0;
-	if (p->pack_local &&
-	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
-	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
-		return 0;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
+static int want_object_in_pack_one(struct packed_git *p,
+				   const struct object_id *oid,
+				   int exclude,
+				   struct packed_git **found_pack,
+				   off_t *found_offset)
+{
+	off_t offset;
+
+	if (p == *found_pack)
+		offset = *found_offset;
+	else
+		offset = find_pack_entry_one(oid->hash, p);
+
+	if (offset) {
+		if (!*found_pack) {
+			if (!is_pack_valid(p))
+				return -1;
+			*found_offset = offset;
+			*found_pack = p;
+		}
+		return want_found_object(oid, exclude, p);
+	}
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
@@ -1252,7 +1304,7 @@ static int want_object_in_pack(const struct object_id *oid,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
+		want = want_found_object(oid, exclude, *found_pack);
 		if (want != -1)
 			return want;
 	}
@@ -1260,53 +1312,22 @@ static int want_object_in_pack(const struct object_id *oid,
 	for (m = get_multi_pack_index(the_repository); m; m = m->next) {
 		struct pack_entry e;
 		if (fill_midx_entry(the_repository, oid, &e, m)) {
-			struct packed_git *p = e.p;
-			off_t offset;
-
-			if (p == *found_pack)
-				offset = *found_offset;
-			else
-				offset = find_pack_entry_one(oid->hash, p);
-
-			if (offset) {
-				if (!*found_pack) {
-					if (!is_pack_valid(p))
-						continue;
-					*found_offset = offset;
-					*found_pack = p;
-				}
-				want = want_found_object(exclude, p);
-				if (want != -1)
-					return want;
-			}
-		}
-	}
-
-	list_for_each(pos, get_packed_git_mru(the_repository)) {
-		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		off_t offset;
-
-		if (p == *found_pack)
-			offset = *found_offset;
-		else
-			offset = find_pack_entry_one(oid->hash, p);
-
-		if (offset) {
-			if (!*found_pack) {
-				if (!is_pack_valid(p))
-					continue;
-				*found_offset = offset;
-				*found_pack = p;
-			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
-				list_move(&p->mru,
-					  get_packed_git_mru(the_repository));
+			want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
 			if (want != -1)
 				return want;
 		}
 	}
 
+	list_for_each(pos, get_packed_git_mru(the_repository)) {
+		struct packed_git *p = list_entry(pos, struct packed_git, mru);
+		want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+		if (!exclude && want > 0)
+			list_move(&p->mru,
+				  get_packed_git_mru(the_repository));
+		if (want != -1)
+			return want;
+	}
+
 	if (uri_protocols.nr) {
 		struct configured_exclusion *ex =
 			oidmap_get(&configured_exclusions, oid);
-- 
2.30.0.533.g2f8b6b552f.dirty


* [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (5 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-17 17:11     ` Jeff King
  2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
  2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

From: Jeff King <peff@peff.net>

In a recent patch we added a function 'find_kept_pack_entry()' to look
for an object only among kept packs.

While this function avoids doing any lookup work in non-kept packs, it
is still linear in the number of packs, since we have to traverse the
linked list of packs once per object. Let's cache a reduced version of
that list to save us time.
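
The idea can be sketched as follows. This is a simplified model, not
the code in this patch: every name here (the sketch_* structs, the
SKETCH_* flags, sketch_kept_packs()) is hypothetical, and the real
packed_git list carries far more state.

```c
#include <stdlib.h>

/* Hypothetical flag and type names, for illustration only. */
#define SKETCH_ON_DISK_KEEP (1u << 0)
#define SKETCH_IN_CORE_KEEP (1u << 1)

struct sketch_pack {
	struct sketch_pack *next;
	unsigned keep_on_disk : 1;
	unsigned keep_in_core : 1;
};

struct sketch_kept_cache {
	struct sketch_pack **packs;	/* NULL-terminated array */
	unsigned flags;			/* flags the array was built for */
};

static int pack_matches(const struct sketch_pack *p, unsigned flags)
{
	return ((flags & SKETCH_ON_DISK_KEEP) && p->keep_on_disk) ||
	       ((flags & SKETCH_IN_CORE_KEEP) && p->keep_in_core);
}

/*
 * Return the kept packs selected by `flags`. The full list is walked
 * only when the cache is empty or was built for different flags;
 * per-object lookups then iterate the short array instead.
 */
static struct sketch_pack **sketch_kept_packs(struct sketch_kept_cache *cache,
					       struct sketch_pack *all,
					       unsigned flags)
{
	struct sketch_pack *p;
	size_t nr = 0, i = 0;

	if (cache->packs && cache->flags == flags)
		return cache->packs;	/* cache hit: no list walk */

	for (p = all; p; p = p->next)
		nr += pack_matches(p, flags);

	free(cache->packs);
	cache->packs = calloc(nr + 1, sizeof(*cache->packs));
	if (!cache->packs)
		abort();
	for (p = all; p; p = p->next)
		if (pack_matches(p, flags))
			cache->packs[i++] = p;
	cache->flags = flags;
	return cache->packs;
}
```

With 1000 packs of which one is kept, each object lookup then touches
one array entry instead of walking 1000 list nodes.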

Note that this cache will last the lifetime of the program. We could
invalidate it on reprepare_packed_git(), but there's not much point in
being rigorous here:

  - we might already fail to notice new .keep packs showing up after the
    program starts. We only reprepare_packed_git() when we fail to find
    an object. But adding a new pack won't cause that to happen.
    Somebody repacking could add a new pack and delete an old one, but
    most of the time we'd have a descriptor or mmap open to the old
    pack anyway, so we might not even notice.

  - in pack-objects we already cache the .keep state at startup, since
    56dfeb6263 (pack-objects: compute local/ignore_pack_keep early,
    2016-07-29). So this is just extending that concept further.

  - we don't have to worry about any packed_git being removed; we always
    keep the old structs around, even after reprepare_packed_git().

Here are p5303 results (as always, measured against the kernel):

  Test                                        HEAD^                   HEAD
  ----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.44(54.71+10.78)      57.06(54.29+10.96) -0.7%
  5303.6: repack with --stdin-packs (1)       0.01(0.00+0.01)         0.01(0.01+0.00) +0.0%
  5303.10: repack (50)                        71.32(88.38+4.90)       71.47(88.60+5.04) +0.2%
  5303.11: repack with --stdin-packs (50)     3.43(11.81+0.22)        3.49(12.21+0.26) +1.7%
  5303.15: repack (1000)                      215.59(493.75+14.62)    217.41(495.36+14.85) +0.8%
  5303.16: repack with --stdin-packs (1000)   131.44(314.24+8.11)     126.75(309.88+8.09) -3.6%

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |   6 +--
 object-store.h         |  10 ++++
 packfile.c             | 103 +++++++++++++++++++++++------------------
 packfile.h             |   4 --
 revision.c             |   8 ++--
 5 files changed, 76 insertions(+), 55 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index fbd7b54d70..b2ba5aa14f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1225,9 +1225,9 @@ static int want_found_object(const struct object_id *oid, int exclude,
 		 */
 		unsigned flags = 0;
 		if (ignore_packed_keep_on_disk)
-			flags |= ON_DISK_KEEP_PACKS;
+			flags |= CACHE_ON_DISK_KEEP_PACKS;
 		if (ignore_packed_keep_in_core)
-			flags |= IN_CORE_KEEP_PACKS;
+			flags |= CACHE_IN_CORE_KEEP_PACKS;
 
 		if (ignore_packed_keep_on_disk && p->pack_keep)
 			return 0;
@@ -3089,7 +3089,7 @@ static void read_packs_list_from_stdin(void)
 	 * an optimization during delta selection.
 	 */
 	revs.no_kept_objects = 1;
-	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+	revs.keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
 	revs.blob_objects = 1;
 	revs.tree_objects = 1;
 	revs.tag_objects = 1;
diff --git a/object-store.h b/object-store.h
index c4fc9dd74e..4cbe8eae3c 100644
--- a/object-store.h
+++ b/object-store.h
@@ -105,6 +105,14 @@ static inline int pack_map_entry_cmp(const void *unused_cmp_data,
 	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
 }
 
+#define CACHE_ON_DISK_KEEP_PACKS 1
+#define CACHE_IN_CORE_KEEP_PACKS 2
+
+struct kept_pack_cache {
+	struct packed_git **packs;
+	unsigned flags;
+};
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -150,6 +158,8 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	struct kept_pack_cache *kept_pack_cache;
+
 	/*
 	 * A map of packfiles to packed_git structs for tracking which
 	 * packs have been loaded already.
diff --git a/packfile.c b/packfile.c
index 5f35cfe788..2a139c907b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2031,10 +2031,7 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int find_one_pack_entry(struct repository *r,
-			       const struct object_id *oid,
-			       struct pack_entry *e,
-			       int kept_only)
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2044,49 +2041,64 @@ static int find_one_pack_entry(struct repository *r,
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (!fill_midx_entry(r, oid, e, m))
-			continue;
-
-		if (!kept_only)
-			return 1;
-
-		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
-		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
+		if (fill_midx_entry(r, oid, e, m))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (p->multi_pack_index && !kept_only) {
-			/*
-			 * If this pack is covered by the MIDX, we'd have found
-			 * the object already in the loop above if it was here,
-			 * so don't bother looking.
-			 *
-			 * The exception is if we are looking only at kept
-			 * packs. An object can be present in two packs covered
-			 * by the MIDX, one kept and one not-kept. And as the
-			 * MIDX points to only one copy of each object, it might
-			 * have returned only the non-kept version above. We
-			 * have to check again to be thorough.
-			 */
-			continue;
-		}
-		if (!kept_only ||
-		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
-		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
-			if (fill_pack_entry(oid, e, p)) {
-				list_move(&p->mru, &r->objects->packed_git_mru);
-				return 1;
-			}
+		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
+			list_move(&p->mru, &r->objects->packed_git_mru);
+			return 1;
 		}
 	}
 	return 0;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static void maybe_invalidate_kept_pack_cache(struct repository *r,
+					     unsigned flags)
 {
-	return find_one_pack_entry(r, oid, e, 0);
+	if (!r->objects->kept_pack_cache)
+		return;
+	if (r->objects->kept_pack_cache->flags == flags)
+		return;
+	free(r->objects->kept_pack_cache->packs);
+	FREE_AND_NULL(r->objects->kept_pack_cache);
+}
+
+static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+{
+	maybe_invalidate_kept_pack_cache(r, flags);
+
+	if (!r->objects->kept_pack_cache) {
+		struct packed_git **packs = NULL;
+		size_t nr = 0, alloc = 0;
+		struct packed_git *p;
+
+		/*
+		 * We want "all" packs here, because we need to cover ones that
+		 * are used by a midx, as well. We need to look in every one of
+		 * them (instead of the midx itself) to cover duplicates. It's
+		 * possible that an object is found in two packs that the midx
+		 * covers, one kept and one not kept, but the midx returns only
+		 * the non-kept version.
+		 */
+		for (p = get_all_packs(r); p; p = p->next) {
+			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
+			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
+				ALLOC_GROW(packs, nr + 1, alloc);
+				packs[nr++] = p;
+			}
+		}
+		ALLOC_GROW(packs, nr + 1, alloc);
+		packs[nr] = NULL;
+
+		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
+		r->objects->kept_pack_cache->packs = packs;
+		r->objects->kept_pack_cache->flags = flags;
+	}
+
+	return r->objects->kept_pack_cache->packs;
 }
 
 int find_kept_pack_entry(struct repository *r,
@@ -2094,13 +2106,15 @@ int find_kept_pack_entry(struct repository *r,
 			 unsigned flags,
 			 struct pack_entry *e)
 {
-	/*
-	 * Load all packs, including midx packs, since our "kept" strategy
-	 * relies on that. We're relying on the side effect of it setting up
-	 * r->objects->packed_git, which is a little ugly.
-	 */
-	get_all_packs(r);
-	return find_one_pack_entry(r, oid, e, flags);
+	struct packed_git **cache;
+
+	for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+		struct packed_git *p = *cache;
+		if (fill_pack_entry(oid, e, p))
+			return 1;
+	}
+
+	return 0;
 }
 
 int has_object_pack(const struct object_id *oid)
@@ -2109,7 +2123,8 @@ int has_object_pack(const struct object_id *oid)
 	return find_pack_entry(the_repository, oid, &e);
 }
 
-int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+int has_object_kept_pack(const struct object_id *oid,
+			 unsigned flags)
 {
 	struct pack_entry e;
 	return find_kept_pack_entry(the_repository, oid, flags, &e);
diff --git a/packfile.h b/packfile.h
index 624327f64d..eb56db2a7b 100644
--- a/packfile.h
+++ b/packfile.h
@@ -161,10 +161,6 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
-#define ON_DISK_KEEP_PACKS 1
-#define IN_CORE_KEEP_PACKS 2
-#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
-
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
diff --git a/revision.c b/revision.c
index 4c5adb90b1..41c0478705 100644
--- a/revision.c
+++ b/revision.c
@@ -2338,14 +2338,14 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		die(_("--unpacked=<packfile> no longer supported"));
 	} else if (!strcmp(arg, "--no-kept-objects")) {
 		revs->no_kept_objects = 1;
-		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
-		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
 	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
 		revs->no_kept_objects = 1;
 		if (!strcmp(optarg, "in-core"))
-			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+			revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
 		if (!strcmp(optarg, "on-disk"))
-			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+			revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
-- 
2.30.0.533.g2f8b6b552f.dirty


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v2 8/8] builtin/repack.c: add '--geometric' option
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (6 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
@ 2021-02-04  3:59   ` Taylor Blau
  2021-02-17 18:17     ` Jeff King
  2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
  8 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-04  3:59 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff

Often it is useful to both:

  - have relatively few packfiles in a repository, and

  - avoid having so few packfiles in a repository that we repack its
    entire contents regularly

This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to
'objects(Pi)'. With a geometric factor of 'r', it should be that:

  objects(Pi) > r*objects(P(i-1))

for all i in (1, n], where the packs are sorted by

  objects(P1) <= objects(P2) <= ... <= objects(Pn).
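
As a rough illustration, the condition can be checked on a plain array
of object counts. toy_is_geometric() is a hypothetical helper, not part
of this patch. Note that the inequality above is strict, while
split_pack_geometry() in the diff compares with '>='; the sketch follows
the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 'counts' holds per-pack object counts sorted in ascending order.
 * Returns 1 if every pack has at least 'factor' times as many objects
 * as the pack before it, 0 otherwise.
 */
static int toy_is_geometric(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i;

	assert(factor > 0);
	for (i = 1; i < nr; i++) {
		/* Use a 64-bit product so factor * count cannot wrap. */
		if ((uint64_t)counts[i] < (uint64_t)factor * counts[i - 1])
			return 0;
	}
	return 1;
}
```

With a factor of 2, the counts 3, 6, 12 pass this check, while
3, 3, 12, 24 do not.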

Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:

  1. We assume that there is a cutoff of packs _before starting the
     repack_ where everything to the right of that cut-off already forms
     a geometric progression (or no cutoff exists and everything must be
     repacked).

  2. We assume that everything smaller than the cutoff count must be
     repacked. This forms our base assumption, but it can also cause
     even the "heavy" packs to get repacked. For example, if we have 6
     packs containing the following number of objects:

       1, 1, 1, 2, 4, 32

     then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
     rolling up the first two packs into a pack with 2 objects. That
     breaks our progression and leaves us:

       2, 1, 2, 4, 32
         ^

     (where the '^' indicates the position of our split). To restore a
     progression, we move the split forward (towards larger packs)
     joining each pack into our new pack until a geometric progression
     is restored. Here, that looks like:

       2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
         ^                   ^                ^                   ^

This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
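
The two steps can be sketched on a plain array of object counts. This is
a standalone illustration that mirrors the two loops of
split_pack_geometry() in the diff; toy_split() and its parameters are
hypothetical names, and 64-bit arithmetic is an assumption added here to
avoid overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 'counts' holds per-pack object counts sorted in ascending order.
 * Returns the split: packs [0, split) are rolled up into one new pack and
 * packs [split, nr) are left alone. '*rolled_up' receives the number of
 * objects that end up in the new pack.
 */
static size_t toy_split(const uint32_t *counts, size_t nr, uint32_t factor,
			uint64_t *rolled_up)
{
	size_t i, split;
	uint64_t total = 0;

	assert(factor > 0 && rolled_up);

	if (nr <= 1) {
		split = nr;
	} else {
		/* Step 1: find the heavy packs that already form a progression. */
		split = nr - 1;
		for (i = nr - 1; i > 0; i--) {
			if ((uint64_t)counts[i] >= (uint64_t)factor * counts[i - 1])
				split--;
			else
				break;
		}
		/* The larger pack of the failing pair is rolled up, too. */
		if (split)
			split++;
	}

	/* Step 2: absorb heavy packs until the progression is restored. */
	for (i = 0; i < split; i++)
		total += counts[i];
	for (i = split; i < nr; i++) {
		if ((uint64_t)counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else {
			break;
		}
	}

	*rolled_up = total;
	return split;
}
```

For the counts 1, 1, 1, 2, 4, 32 and a factor of 2, this returns a split
of 5 with 9 objects in the new pack, leaving 9, 32 as above. By my
reading, the code as posted places the initial cut after the third
one-object pack rather than the second, but the end result is the same.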

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  11 +++
 builtin/repack.c             | 187 ++++++++++++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 137 +++++++++++++++++++++++++
 3 files changed, 331 insertions(+), 4 deletions(-)
 create mode 100755 t/t7703-repack-geometric.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 92f146d27d..b1ffcfd974 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -165,6 +165,17 @@ depth is 4095.
 	Pass the `--delta-islands` option to `git-pack-objects`, see
 	linkgit:git-pack-objects[1].
 
+-g=<factor>::
+--geometric=<factor>::
+	Arrange resulting pack structure so that each successive pack
+	contains at least `<factor>` times the number of objects as the
+	next-largest pack.
++
+`git repack` ensures this by determining a "cut" of packfiles that need to be
+repacked into one in order to ensure a geometric progression. It picks the
+smallest set of packfiles such that as many of the larger packfiles (by count of
+objects contained in that pack) may be left intact.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 2158b48f4c..b4e0e69661 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -296,6 +296,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
+struct pack_geometry {
+	struct packed_git **pack;
+	uint32_t pack_nr, pack_alloc;
+	uint32_t split;
+};
+
+static uint32_t geometry_pack_weight(struct packed_git *p)
+{
+	if (open_pack_index(p))
+		die(_("cannot open index for %s"), p->pack_name);
+	return p->num_objects;
+}
+
+static int geometry_cmp(const void *va, const void *vb)
+{
+	uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
+		 bw = geometry_pack_weight(*(struct packed_git **)vb);
+
+	if (aw < bw)
+		return -1;
+	if (aw > bw)
+		return 1;
+	return 0;
+}
+
+static void init_pack_geometry(struct pack_geometry **geometry_p)
+{
+	struct packed_git *p;
+	struct pack_geometry *geometry;
+
+	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
+	geometry = *geometry_p;
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (!pack_kept_objects && p->pack_keep)
+			continue;
+
+		ALLOC_GROW(geometry->pack,
+			   geometry->pack_nr + 1,
+			   geometry->pack_alloc);
+
+		geometry->pack[geometry->pack_nr] = p;
+		geometry->pack_nr++;
+	}
+
+	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
+}
+
+static void split_pack_geometry(struct pack_geometry *geometry, int factor)
+{
+	uint32_t i;
+	uint32_t split;
+	off_t total_size = 0;
+
+	if (geometry->pack_nr <= 1) {
+		geometry->split = geometry->pack_nr;
+		return;
+	}
+
+	split = geometry->pack_nr - 1;
+
+	/*
+	 * First, count the number of packs (in descending order of size) which
+	 * already form a geometric progression.
+	 */
+	for (i = geometry->pack_nr - 1; i > 0; i--) {
+		struct packed_git *ours = geometry->pack[i];
+		struct packed_git *prev = geometry->pack[i - 1];
+		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
+			split--;
+		else
+			break;
+	}
+
+	if (split) {
+		/*
+		 * Move the split one to the right, since the top element in the
+		 * last-compared pair can't be in the progression. Only do this
+		 * when we split in the middle of the array (otherwise if we got
+		 * to the end, then the split is in the right place).
+		 */
+		split++;
+	}
+
+	/*
+	 * Then, anything to the left of 'split' must be in a new pack. But,
+	 * creating that new pack may cause packs in the heavy half to no longer
+	 * form a geometric progression.
+	 *
+	 * Compute an expected size of the new pack, and then determine how many
+	 * packs in the heavy half need to be joined into it (if any) to restore
+	 * the geometric progression.
+	 */
+	for (i = 0; i < split; i++)
+		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = split; i < geometry->pack_nr; i++) {
+		struct packed_git *ours = geometry->pack[i];
+		if (geometry_pack_weight(ours) < factor * total_size) {
+			split++;
+			total_size += geometry_pack_weight(ours);
+		} else
+			break;
+	}
+
+	geometry->split = split;
+}
+
+static void clear_pack_geometry(struct pack_geometry *geometry)
+{
+	if (!geometry)
+		return;
+
+	free(geometry->pack);
+	geometry->pack_nr = 0;
+	geometry->pack_alloc = 0;
+	geometry->split = 0;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -303,6 +421,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
@@ -315,6 +434,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	int geometric_factor = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -355,6 +475,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
+		OPT_INTEGER('g', "geometric", &geometric_factor,
+			    N_("find a geometric progression with factor <N>")),
 		OPT_END()
 	};
 
@@ -381,6 +503,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
 		die(_(incremental_bitmap_conflict_error));
 
+	if (geometric_factor) {
+		if (pack_everything)
+			die(_("--geometric is incompatible with -A, -a"));
+		init_pack_geometry(&geometry);
+		split_pack_geometry(geometry, geometric_factor);
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
@@ -395,9 +524,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_pushf(&cmd.args, "--keep-pack=%s",
 			     keep_pack_list.items[i].string);
 	strvec_push(&cmd.args, "--non-empty");
-	strvec_push(&cmd.args, "--all");
-	strvec_push(&cmd.args, "--reflog");
-	strvec_push(&cmd.args, "--indexed-objects");
+	if (!geometry) {
+		/*
+		 * With '--geometric', 'git pack-objects' picks up all objects,
+		 * loose or packed (either rolling them up or leaving them
+		 * alone), so only pass these options otherwise.
+		 *
+		 * The implementation of 'git pack-objects --stdin-packs'
+		 * makes them redundant (and the two are incompatible).
+		 */
+		strvec_push(&cmd.args, "--all");
+		strvec_push(&cmd.args, "--reflog");
+		strvec_push(&cmd.args, "--indexed-objects");
+	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
 	if (write_bitmaps > 0)
@@ -428,17 +567,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
 		}
+	} else if (geometry) {
+		strvec_push(&cmd.args, "--stdin-packs");
+		strvec_push(&cmd.args, "--unpacked");
 	} else {
 		strvec_push(&cmd.args, "--unpacked");
 		strvec_push(&cmd.args, "--incremental");
 	}
 
-	cmd.no_stdin = 1;
+	if (geometry)
+		cmd.in = -1;
+	else
+		cmd.no_stdin = 1;
 
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
 
+	if (geometry) {
+		FILE *in = xfdopen(cmd.in, "w");
+		/*
+		 * The resulting pack should contain all objects in packs that
+		 * are going to be rolled up, but exclude objects in packs which
+		 * are being left alone.
+		 */
+		for (i = 0; i < geometry->split; i++)
+			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
+		for (i = geometry->split; i < geometry->pack_nr; i++)
+			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+		fclose(in);
+	}
+
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		if (line.len != the_hash_algo->hexsz)
@@ -506,6 +665,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			if (!string_list_has_string(&names, sha1))
 				remove_redundant_pack(packdir, item->string);
 		}
+
+		if (geometry) {
+			struct strbuf buf = STRBUF_INIT;
+
+			uint32_t i;
+			for (i = 0; i < geometry->split; i++) {
+				struct packed_git *p = geometry->pack[i];
+				if (string_list_has_string(&names,
+							   hash_to_hex(p->hash)))
+					continue;
+
+				strbuf_reset(&buf);
+				strbuf_addstr(&buf, pack_basename(p));
+				strbuf_strip_suffix(&buf, ".pack");
+
+				remove_redundant_pack(packdir, buf.buf);
+			}
+			strbuf_release(&buf);
+		}
 		if (!po_args.quiet && isatty(2))
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
@@ -527,6 +705,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
 	return 0;
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
new file mode 100755
index 0000000000..96917fc163
--- /dev/null
+++ b/t/t7703-repack-geometric.sh
@@ -0,0 +1,137 @@
+#!/bin/sh
+
+test_description='git repack --geometric works correctly'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success '--geometric with no packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		git repack --geometric 2 >out &&
+		test_i18ngrep "Nothing new to pack" out
+	)
+'
+
+test_expect_success '--geometric with an intact progression' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		test_commit_bulk --start=4 4 && # 12 objects
+
+		find $objdir/pack -name "*.pack" | sort >expect &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >actual &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--geometric with small-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		find $objdir/pack -name "*.pack" | sort >small &&
+		test_commit_bulk --start=3 4 && # 12 objects
+		test_commit_bulk --start=7 8 && # 24 objects
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		# Three packs in total; two of the existing large ones, and one
+		# new one.
+		find $objdir/pack -name "*.pack" | sort >after &&
+		test_line_count = 3 after &&
+		comm -3 small before | tr -d "\t" >large &&
+		grep -qFf large after
+	)
+'
+
+test_expect_success '--geometric with small- and large-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# size(small1) + size(small2) > size(medium) / 2
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		test_commit_bulk --start=2 3 && # 7 objects
+		test_commit_bulk --start=6 9 && # 27 objects
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		find $objdir/pack -name "*.pack" | sort >after &&
+		comm -12 before after >untouched &&
+
+		# Two packs in total; the largest pack from before running "git
+		# repack", and one new one.
+		test_line_count = 1 untouched &&
+		test_line_count = 2 after
+	)
+'
+
+test_expect_success '--geometric ignores kept packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit kept && # 3 objects
+		test_commit pack && # 3 objects
+
+		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/kept
+		EOF
+		) &&
+		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/pack
+		^refs/tags/kept
+		EOF
+		) &&
+
+		# Neither pack contains more than twice the number of objects in
+		# the other, so they should be combined. But marking one with a
+		# .keep file on disk will "freeze" it, so the pack structure
+		# should remain unchanged.
+		touch $objdir/pack/pack-$KEPT.keep &&
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >after &&
+
+		# both packs should still exist
+		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
+		test_path_is_file $objdir/pack/pack-$PACK.pack &&
+
+		# and no new packs should be created
+		test_cmp before after &&
+
+		# Passing --pack-kept-objects causes packs with a .keep file to
+		# be repacked, too.
+		git repack --geometric 2 -d --pack-kept-objects &&
+
+		find $objdir/pack -name "*.pack" >after &&
+		test_line_count = 1 after
+	)
+'
+
+test_done
-- 
2.30.0.533.g2f8b6b552f.dirty

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()'
  2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-02-16 21:42     ` Jeff King
  2021-02-16 21:48       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-16 21:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:58:50PM -0500, Taylor Blau wrote:

> Future callers will want a function to fill a 'struct pack_entry' for a
> given object id but _only_ from its position in any kept pack(s).
> 
> In particular, an new 'git repack' mode which ensures the resulting

Nit (not worth re-rolling): s/an new/a new/

> There is a gotcha when looking up objects that are duplicated in kept
> and non-kept packs, particularly when the MIDX stores the non-kept
> version and the caller asked for kept objects only. This could be
> resolved by teaching the MIDX to resolve duplicates by always favoring
> the kept pack (if one exists), but this breaks an assumption in existing
> MIDXs, and so it would require a format change.

I don't think this would be possible without a major rethink of how
midxs work. The "keep" property of a pack is not set in stone when the
midx is created. You could add a ".keep" file to one of its packs later,
or even mark one as an in-core keep on the fly. But the duplicate
resolution happens at creation.

So maybe your "breaks an assumption" is the notion that we do not store
duplicate information at all in the midx. If so, then I agree. :) But
I'd also call fixing that more than just a format change.

(None of which changes your point, which is that it isn't worth
pursuing that direction).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()'
  2021-02-16 21:42     ` Jeff King
@ 2021-02-16 21:48       ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-16 21:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 04:42:38PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:58:50PM -0500, Taylor Blau wrote:
>
> > Future callers will want a function to fill a 'struct pack_entry' for a
> > given object id but _only_ from its position in any kept pack(s).
> >
> > In particular, an new 'git repack' mode which ensures the resulting
>
> Nit (not worth re-rolling): s/an new/a new/

Oops. Good eyes.

> > There is a gotcha when looking up objects that are duplicated in kept
> > and non-kept packs, particularly when the MIDX stores the non-kept
> > version and the caller asked for kept objects only. This could be
> > resolved by teaching the MIDX to resolve duplicates by always favoring
> > the kept pack (if one exists), but this breaks an assumption in existing
> > MIDXs, and so it would require a format change.
>
> I don't think this would be possible without a major rethink of how
> midxs work. The "keep" property of a pack is not set in stone when the
> midx is created. You could add a ".keep" file to one of its packs later,
> or even mark one as an in-core keep on the fly. But the duplicate
> resolution happens at creation.
>
> So maybe your "breaks an assumption" is the notion that we do not store
> duplicate information at all in the midx. If so, then I agree. :) But
> I'd also call fixing that more than just a format change.

That's part of it, indeed. The part that I was referring to is that
existing MIDX readers expect duplicates to be resolved in a certain way
(effectively in favor of the pack with the lowest mtime). So the easy
part is indicating a format change which tells new readers how to expect
ties to be broken.

But (as you note) that's only part of the problem: even if we say "ties
are resolved in favor of the lowest mtime pack, or a .keep one, if it
exists", then which ones are kept and which aren't? Even *if* we wrote
that down (which I'm not suggesting we do), kept-ness isn't an immutable
property of the pack, and so I think relying on it is a tricky direction
to take.

> (None of which changes your point, which is that it isn't worth
> pursuing that direction).

Yeah; my hope in writing some of this down in the above paragraph is
that it would make clear to future readers that such a MIDX change would
resolve some complexity here, but the complexity it adds in the MIDX
code isn't worth the tradeoff.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 2/8] revision: learn '--no-kept-objects'
  2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-02-16 23:17     ` Jeff King
  2021-02-17 18:35       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-16 23:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote:

> @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
>  		return commit_ignore;
>  	if (revs->unpacked && has_object_pack(&commit->object.oid))
>  		return commit_ignore;
> +	if (revs->no_kept_objects) {
> +		if (has_object_kept_pack(&commit->object.oid,
> +					 revs->keep_pack_cache_flags))
> +			return commit_ignore;
> +	}

OK, so this has the same "problems" as --unpacked, which is that we can
miss some objects (i.e., things that are reachable but not-kept may not
be reported). But it should be OK in this version of the series, because
we will not be relying on it for selection of objects, but only to fill
in ordering / namehash fields.

Should we warn people about that, either as a comment or in the commit
message?

> +--no-kept-objects[=<kind>]::
> +	Halts the traversal as soon as an object in a kept pack is
> +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> +	with their in-core kept state set are ignored. Otherwise, both
> +	kinds of kept packs are ignored.

Likewise, I wonder whether we need to expose this mode to users.
Normally I'm a fan of doing so, because it allows scripted callers
access to more of the internals, but:

  - the semantics are kind of weird about where we draw the line between
    performance and absolute correctness

  - the "in-core" thing is a bit weird for callers of rev-list; how do I
    as a caller mark a pack as kept-in-core? I think it's only an
    internal pack-objects thing.

Once we support this in rev-list, we'll have to do it forever (or deal
with deprecation, etc). If we just need it internally, maybe it's wise
to leave it as something you ask for by manipulating rev_info
directly. Or perhaps leave it as an undocumented interface we use for
testing, and not something we promise to keep working.

> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
>  			ctx->show_object(obj, name, ctx->show_data);
>  			continue;
>  		}
> +		if (ctx->revs->no_kept_objects) {
> +			struct pack_entry e;
> +			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
> +						 ctx->revs->keep_pack_cache_flags,
> +						 &e))
> +				continue;
> +		}

This hunk is interesting.

There is no similar check for revs->unpacked in list-objects.c to cut
off the traversal. And indeed, running "rev-list --unpacked" will
generally look at the _whole_ tree for a commit that is unpacked, even
if all of the tree entries are packed. That's something we might
consider changing in the name of performance (though it does increase
the number of cases where --unpacked will fail to find an unpacked but
reachable object).

But this is a funny place to put it. If I understand it correctly, it is
cutting off the traversal at the very top of the tree. I.e., if we had a
commit that is not-kept, we'd queue its root tree. And then we might
find that the root tree is kept, and avoid traversing it. But if we _do_
traverse it, we would look at every subtree it contains, even if they
are kept! That's because we recurse the tree via the recursive
process_tree(), not by queueing more objects in the pending array here.

So this check seems to exist in a funny middle ground. I think it's
unlikely to catch anything useful (usually commits have a unique root
tree; it's all of the untouched parts of the subtrees that will be in
the kept packs). IMHO we should either drop it (and act like
"--unpacked", accepting that we may traverse some extra tree objects),
or we should go all-in on performance and cut it off at the top of
process_tree().
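
To make the difference between the two placements concrete, here is a toy
model. It is only a sketch: struct toy_tree and the toy_* functions are made
up for illustration and are not Git's data structures or its process_tree().
Each function returns the number of trees it ended up visiting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree object; this is not Git's struct tree. */
struct toy_tree {
	int kept;                    /* non-zero if found in a kept pack */
	size_t nr_children;
	struct toy_tree **children;  /* sub-trees only, blobs omitted */
};

/* Plain recursion with no kept check, like the recursion in this patch. */
static size_t toy_recurse_unchecked(const struct toy_tree *t)
{
	size_t visited = 1, i;

	for (i = 0; i < t->nr_children; i++)
		visited += toy_recurse_unchecked(t->children[i]);
	return visited;
}

/* Middle ground: only the queued root tree is checked. */
size_t toy_walk_check_root_only(const struct toy_tree *root)
{
	assert(root);
	if (root->kept)
		return 0;
	return toy_recurse_unchecked(root);
}

/* All-in: the check sits at the top of the recursive function. */
size_t toy_walk_check_every_tree(const struct toy_tree *t)
{
	size_t visited = 1, i;

	assert(t);
	if (t->kept)
		return 0;
	for (i = 0; i < t->nr_children; i++)
		visited += toy_walk_check_every_tree(t->children[i]);
	return visited;
}
```

With a non-kept root that has one kept subtree (itself containing one kept
subtree) and one non-kept subtree, the root-only variant visits all four
trees, while the check-every-tree variant visits two.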

-Peff

* Re: [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
@ 2021-02-16 23:46     ` Jeff King
  2021-02-17 18:59       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-16 23:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:59:03PM -0500, Taylor Blau wrote:

> In an upcoming commit, 'git repack' will want to create a pack comprised
> of all of the objects in some packs (the included packs) excluding any
> objects in some other packs (the excluded packs).
> 
> This caller could iterate those packs themselves and feed the objects it
> finds to 'git pack-objects' directly over stdin, but this approach has a
> few downsides:
> 
>   - It requires every caller that wants to drive 'git pack-objects' in
>     this way to implement pack iteration themselves. This forces the
>     caller to think about details like what order objects are fed to
>     pack-objects, which callers would likely rather not do.
> 
>   - If the set of objects in included packs is large, it requires
>     sending a lot of data over a pipe, which is inefficient.
> 
>   - The caller is forced to keep track of the excluded objects, too, and
>     make sure that it doesn't send any objects that appear in both
>     included and excluded packs.
> 
> But the biggest downside is the lack of a reachability traversal.
> Because the caller passes in a list of objects directly, those objects
> don't get a namehash assigned to them, which can have a negative impact
> on the delta selection process, causing 'git pack-objects' to fail to
> find good deltas even when they exist.
>
> The caller could formulate a reachability traversal themselves, but the
> only way to drive 'git pack-objects' in this way is to do a full
> traversal, and then remove objects in the excluded packs after the
> traversal is complete. This can be detrimental to callers who care
> about performance, especially in repositories with many objects.

Yep, I think this is a good summary of the problem space, and why this
complexity should be pushed into pack-objects and not the caller.

> To address the delta selection problem, 'git pack-objects --stdin-packs'
> works as follows. First, it assembles a list of objects that it is going
> to pack, as above. Then, a reachability traversal is started, whose tips
> are any commits mentioned in included packs. Upon visiting an object, we
> find its corresponding object_entry in the to_pack list, and set its
> namehash parameter appropriately.
> 
> To avoid the traversal visiting more objects than it needs to, the
> traversal is halted upon encountering an object which can be found in an
> excluded pack (by marking the excluded packs as kept in-core, and
> passing --no-kept-objects=in-core to the revision machinery).
> 
> This can cause the traversal to halt early, for example if an object in
> an included pack is an ancestor of ones in excluded packs. But stopping
> early is OK, since filling in the namehash fields of objects in the
> to_pack list is only additive (i.e., having it helps the delta selection
> process, but leaving it blank doesn't impact the correctness of the
> resulting pack).

OK, good. Definitely worth calling out this subtle distinction of
correctness versus the heuristic.

Do we use this partial traversal to impact the write order at all? That
would be a nice-to-have, but I suspect that just concatenating the packs
(presumably by descending mtime) ends up with a similar result.

> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -85,6 +85,16 @@ base-name::
>  	reference was included in the resulting packfile.  This
>  	can be useful to send new tags to native Git clients.
>  
> +--stdin-packs::
> +	Read the basenames of packfiles from the standard input, instead
> +	of object names or revision arguments. The resulting pack
> +	contains all objects listed in the included packs (those not
> +	beginning with `^`), excluding any objects listed in the
> +	excluded packs (beginning with `^`).
> ++
> +Incompatible with `--revs`, or options that imply `--revs` (such as
> +`--all`), with the exception of `--unpacked`, which is compatible.

I know you say "basename" here, but I wonder if it is worth giving an
example (`pack-1234abcd.pack`) to make it clear in what form we expect
it. Or possibly something in the `EXAMPLES` section.

> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2979,6 +2979,164 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  	return git_default_config(k, v, cb);
>  }
>  
> +static int stdin_packs_found_nr;
> +static int stdin_packs_hints_nr;

I scratched my head at these until I looked further in the code. They're
the counters for the trace output. Might be worth a brief comment above
them. (I do approve of adding this kind of trace debugging info; I'm
pretty accustomed to using gdb or adding one-off debug statements, but
we really could do a better job in general of making these kinds of
internals visible to mere mortal admins).

> +static int add_object_entry_from_pack(const struct object_id *oid,
> +				      struct packed_git *p,
> +				      uint32_t pos,
> +				      void *_data)
> +{
> +	struct rev_info *revs = _data;
> +	struct object_info oi = OBJECT_INFO_INIT;
> +	off_t ofs;
> +	enum object_type type;
> +
> +	display_progress(progress_state, ++nr_seen);
> +
> +	ofs = nth_packed_object_offset(p, pos);
> +
> +	oi.typep = &type;
> +	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
> +		die(_("could not get type of object %s in pack %s"),
> +		    oid_to_hex(oid), p->pack_name);

Calling out for other reviewers: the oi.typep field will be filled in
with the _real_ type of the object, even if it's a delta. This is as
opposed to the return value of packed_object_info(), which may be
OFS_DELTA or REF_DELTA.

And that real type is what we want here:

> +	else if (type == OBJ_COMMIT) {
> +		/*
> +		 * commits in included packs are used as starting points for the
> +		 * subsequent revision walk
> +		 */
> +		add_pending_oid(revs, NULL, oid, 0);
> +	}

And later when we call create_object_entry().

I wondered whether it would be worth adding other objects we might find,
like trees, in order to increase our traversal. But that doesn't make
any sense. The whole point is to find the paths, which come from
traversing from the root trees. And we can only find the root trees by
starting at commits. Adding any random tree we found would defeat the
purpose (most of them are sub-trees and would give us a useless partial
path).

Should we avoid adding the commit as a tip for walking if it won't end
up in the resulting pack? I.e., should we check these:

> +	if (have_duplicate_entry(oid, 0))
> +		return 0;
> +
> +	if (!want_object_in_pack(oid, 0, &p, &ofs))
> +		return 0;

...first? I guess it probably doesn't matter too much since we'd
truncate the traversal as soon as we saw it was in a kept pack anyway.

> +static void show_commit_pack_hint(struct commit *commit, void *_data)
> +{
> +}

Nothing to do here, since commits don't have a name field. Makes sense.

> +static void show_object_pack_hint(struct object *object, const char *name,
> +				  void *_data)
> +{
> +	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
> +	if (!oe)
> +		return;
> +
> +	/*
> +	 * Our 'to_pack' list was constructed by iterating all objects packed in
> +	 * included packs, and so doesn't have a non-zero hash field that you
> +	 * would typically pick up during a reachability traversal.
> +	 *
> +	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
> +	 * here using a now in order to perhaps improve the delta selection
> +	 * process.
> +	 */
> +	oe->hash = pack_name_hash(name);
> +	oe->no_try_delta = name && no_try_delta(name);
> +
> +	stdin_packs_hints_nr++;
> +}

But for actual objects, we do fill in the hash. I wonder if it's
possible for oe->hash to have been already filled. I don't think it
really matters, though. Any value we get is equally valid, so
overwriting is OK in that case.
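
For anyone unfamiliar with what the hint actually is, here is a rough sketch
of a name hash in the spirit of pack_name_hash(), written from memory; treat
the exact arithmetic as an assumption and check the real function before
relying on it. The names toy_name_hash, toy_entry and toy_fill_hint are made
up. The point is that the value depends only on the path, with the trailing
characters weighted most, and that filling it in is a plain overwrite.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of an object_entry. */
struct toy_entry {
	uint32_t hash;
};

/*
 * Builds a sortable number from the path. Each new character lands in
 * the top byte and pushes older ones down by two bits, so the end of
 * the path counts the most. Whitespace is skipped.
 */
uint32_t toy_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

/* Best-effort hint: whatever was stored before is simply replaced. */
void toy_fill_hint(struct toy_entry *oe, const char *name)
{
	assert(oe);
	oe->hash = toy_name_hash(name);
}
```

Calling toy_fill_hint() twice on the same entry with different paths just
leaves the second value behind, which is harmless since the field is only a
sorting heuristic.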

> +	string_list_sort(&include_packs);
> +	string_list_sort(&exclude_packs);
> +
> +	for (p = get_all_packs(the_repository); p; p = p->next) {
> +		const char *pack_name = pack_basename(p);
> +
> +		item = string_list_lookup(&include_packs, pack_name);
> +		if (!item)
> +			item = string_list_lookup(&exclude_packs, pack_name);
> +
> +		if (item)
> +			item->util = p;
> +	}

OK, here we're just filling in the util field with each found pack. So
we wouldn't notice a pack that we didn't find, but we will in the
subsequent loops. Makes sense.

I think you could do without string lists at all by using the recent-ish
pack-hash to efficiently look up the names, but I'm perfectly content to
see it all handled within this function.

> +	/*
> +	 * First handle all of the excluded packs, marking them as kept in-core
> +	 * so that later calls to add_object_entry() discards any objects that
> +	 * are also found in excluded packs.
> +	 */
> +	for_each_string_list_item(item, &exclude_packs) {
> +		struct packed_git *p = item->util;
> +		if (!p)
> +			die(_("could not find pack '%s'"), item->string);
> +		p->pack_keep_in_core = 1;
> +	}
> +	for_each_string_list_item(item, &include_packs) {
> +		struct packed_git *p = item->util;
> +		if (!p)
> +			die(_("could not find pack '%s'"), item->string);
> +		for_each_object_in_pack(p,
> +					add_object_entry_from_pack,
> +					&revs,
> +					FOR_EACH_OBJECT_PACK_ORDER);
> +	}

Yeah, this ordering makes sense.

> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +	traverse_commit_list(&revs,
> +			     show_commit_pack_hint,
> +			     show_object_pack_hint,
> +			     NULL);

And this traversal is pretty straight-forward. Looks good.

> +	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
> +			   stdin_packs_found_nr);

I wonder if it makes sense to report the actual set of packs via trace
(obviously not as an int, but as a list). That's less helpful for
debugging pack-objects, if you just fed it the input anyway, but if you
were debugging "git repack --geometric" it might be useful to see which
packs it thought were which (though arguably that would be a useful
trace in builtin/repack.c instead).

> @@ -3636,7 +3797,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		use_internal_rev_list = 1;
>  		strvec_push(&rp, "--indexed-objects");
>  	}
> -	if (rev_list_unpacked) {
> +	if (rev_list_unpacked && !stdin_packs) {
>  		use_internal_rev_list = 1;
>  		strvec_push(&rp, "--unpacked");
>  	}

OK, this is necessary to avoid triggering the internal rev-list, because
we handle --unpacked ourselves specially later here...

> @@ -3741,7 +3907,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  
>  	if (progress)
>  		progress_state = start_progress(_("Enumerating objects"), 0);
> -	if (!use_internal_rev_list)
> +	if (stdin_packs) {
> +		/* avoids adding objects in excluded packs */
> +		ignore_packed_keep_in_core = 1;
> +		read_packs_list_from_stdin();
> +		if (rev_list_unpacked)
> +			add_unreachable_loose_objects();

Which isn't quite behaving like normal --unpacked (in that we are adding
all loose objects, not just reachable ones). I think we actually could
just add --unpacked as part of our heuristic traversal. It's not
perfect, but unlike the packed objects, it's OK for us to miss some
corner cases (they just end up not getting packed; they don't get
deleted).

I'm OK to consider that an implementation detail for now, though. We can
change it later without impacting the interface.

> +		if (rev_list_unpacked)
> +			add_unreachable_loose_objects();

Despite the name, that function is adding both reachable and unreachable
ones. So it is doing what you want. It might be worth renaming, but it's
not too big a deal since it's local to this file.

-Peff

* Re: [PATCH v2 5/8] p5303: measure time to repack with keep
  2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
@ 2021-02-16 23:58     ` Jeff King
  2021-02-17  0:02       ` Jeff King
  2021-02-17 19:13       ` Taylor Blau
  0 siblings, 2 replies; 120+ messages in thread
From: Jeff King @ 2021-02-16 23:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote:

> From: Jeff King <peff@peff.net>
> 
> This is the same as the regular repack test, except that we mark the
> single base pack as "kept" and use --assume-kept-packs-closed. The

I don't think that option exists anymore. I guess we are just using
--stdin-packs, which causes us to mark a pack as kept.

I think we could just mark it in the filesystem and use
--honor-pack-keep, which would make it independent of your new feature.
At first I was going to say "but it doesn't matter either way", but...

> theory is that this should be faster than the normal repack, because
> we'll have fewer objects to traverse and process.
> 
> Here are some timings on a recent clone of the kernel. In the
> single-pack case, there is nothing do since there are no non-excluded
> packs:
> 
>   5303.5: repack (1)                          57.42(54.88+10.64)
>   5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)
> 
> and in the 50-pack case, it is much faster to use `--stdin-packs`, since
> we avoid having to consider any objects in the excluded pack:
> 
>   5303.10: repack (50)                        71.26(88.24+4.96)
>   5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)
> 
> but our improvements vanish as we approach 1000 packs.
> 
>   5303.15: repack (1000)                      215.64(491.33+14.80)
>   5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)
> 
> That's because the code paths around handling .keep files are known to
> scale badly; they look in every single pack file to find each object.

Well, part of it is just that with 1000 packs we have 20 times as many
objects that are actually getting packed with --stdin-packs, compared to
the 50-pack case. IIRC, each pack is a fixed-size slice and then the
residual is put into the .keep pack. So the fact that the time gets
closer to a full repack as we add more packs is expected: we are asking
pack-objects to do more work!

For showing the impact of the optimizations in patches 7 and 8, I think
doing a full repack with --honor-pack-keep is a better test. Because
then we're always doing a full traversal, and most of the work continues
to scale with the repo size (though obviously not the actual shuffling
of packed bytes around). That would get rid of the weird "no work to do"
case in the single-pack tests, too.

-Peff

* Re: [PATCH v2 0/8] repack: support repacking into a geometric sequence
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
                     ` (7 preceding siblings ...)
  2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-02-17  0:01   ` Jeff King
  2021-02-17 18:18     ` Jeff King
  8 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-17  0:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:58:45PM -0500, Taylor Blau wrote:

> The details of the new approach can be found in the third patch, but the gist is
> as follows:
> [...]

I think this turned out very nice (and less complicated than I feared it
might). I've read up through patch 5. I think the overall approach is
good, but I had various small-to-medium comments.

I'll try to pick up reviewing the rest tomorrow, though it may make
sense to resolve the earlier comments first.

-Peff

* Re: [PATCH v2 5/8] p5303: measure time to repack with keep
  2021-02-16 23:58     ` Jeff King
@ 2021-02-17  0:02       ` Jeff King
  2021-02-17 19:13       ` Taylor Blau
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17  0:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote:

> For showing the impact of the optimizations in patches 7 and 8, I think
> doing a full repack with --honor-pack-keep is a better test. Because
> then we're always doing a full traversal, and most of the work continues
> to scale with the repo size (though obviously not the actual shuffling
> of packed bytes around). That would get rid of the weird "no work to do"
> case in the single-pack tests, too.

I meant to add: but I do like that we are timing --stdin-packs, too. We
may actually want to time both.

Another thing we _could_ do, if we have --honor-pack-keep perf tests, is
to shuffle patches 5, 6, and 7 towards the front of the series. They
should be able to show off the improvement even without the
--stdin-packs feature.

-Peff

* Re: [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
@ 2021-02-17 16:05     ` Jeff King
  2021-02-17 19:23       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-17 16:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:59:17PM -0500, Taylor Blau wrote:

> @@ -1209,22 +1210,73 @@ static int want_found_object(int exclude, struct packed_git *p)
>  	 * Otherwise, we signal "-1" at the end to tell the caller that we do
>  	 * not know either way, and it needs to check more packs.
>  	 */
> -	if (!ignore_packed_keep_on_disk &&
> -	    !ignore_packed_keep_in_core &&
> -	    (!local || !have_non_local_packs))
> +
> +	/*
> +	 * Handle .keep first, as we have a fast(er) path there.
> +	 */
> +	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
> +		/*
> +		 * Set the flags for the kept-pack cache to be the ones we want
> +		 * to ignore.
> +		 *
> +		 * That is, if we are ignoring objects in on-disk keep packs,
> +		 * then we want to search through the on-disk keep and ignore
> +		 * the in-core ones.
> +		 */
> +		unsigned flags = 0;
> +		if (ignore_packed_keep_on_disk)
> +			flags |= ON_DISK_KEEP_PACKS;
> +		if (ignore_packed_keep_in_core)
> +			flags |= IN_CORE_KEEP_PACKS;
> +
> +		if (ignore_packed_keep_on_disk && p->pack_keep)
> +			return 0;
> +		if (ignore_packed_keep_in_core && p->pack_keep_in_core)
> +			return 0;
> +		if (has_object_kept_pack(oid, flags))
> +			return 0;
> +	}
> +
> +	/*
> +	 * At this point we know definitively that either we don't care about
> +	 * keep-packs, or the object is not in one. Keep checking other
> +	 * conditions...
> +	 */
> +
> +	if (!local || !have_non_local_packs)
>  		return 1;
> -
>  	if (local && !p->pack_local)
>  		return 0;
> -	if (p->pack_local &&
> -	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
> -	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
> -		return 0;
>  
>  	/* we don't know yet; keep looking for more packs */
>  	return -1;

I know I wrote this patch, but just looking it over again with a
critical eye: it looks like more re-ordering could avoid work in some
cases.

In particular, has_object_kept_pack() is a potentially expensive call.
But if "(local && !p->pack_local)" is true, then we could cheaply exit
the function with "0", regardless of what the keep requirement says.

That's not a case that I think anybody cares that deeply about (and it
certainly is not covered by t/perf). But I think it does regress in this
patch. Prior to the patch, we'd check that condition before returning
-1, and it was the caller who would then continue to search through all
the kept packs. Now we do it preemptively.

I think just bumping that:

  if (local && !p->pack_local)
	return 0;

above the new code would fix it. Or to lay out the logic more fully, the
order of checks should be:

  - does _this_ pack we found the object in disqualify it. If so, we can
    cheaply return 0. And that applies to both keep and local rules.

  - otherwise, check all packs via has_object_kept_pack(), which is
    cheaper than continuing to iterate through all packs by returning
    -1.

  - once we know definitively about keep-packs, then check any shortcuts
    related to local packs (like !have_non_local_packs)

  - and then if no shortcuts, we return -1

I think that might be easier to express by rewriting the patch. :)
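
Something like the following, as a self-contained sketch rather than a real
patch. The toy_* structs are simplified stand-ins for the pack-objects
globals and struct packed_git, toy_has_object_kept_pack() stands in for the
expensive has_object_kept_pack() call (it counts how often it is hit), and
the early "exclude" and "incremental" returns of the real function are left
out.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for pack-objects state. */
struct toy_pack {
	int pack_local;
	int pack_keep;
	int pack_keep_in_core;
};

struct toy_opts {
	int local;
	int have_non_local_packs;
	int ignore_packed_keep_on_disk;
	int ignore_packed_keep_in_core;
};

struct toy_kept_index {
	int object_is_kept;  /* answer the expensive lookup would give */
	int lookups;         /* how many times the lookup was performed */
};

/* Stand-in for the potentially expensive search across kept packs. */
static int toy_has_object_kept_pack(struct toy_kept_index *idx)
{
	idx->lookups++;
	return idx->object_is_kept;
}

/* Returns 1 (want), 0 (do not want) or -1 (unknown, check more packs). */
int toy_want_found_object(const struct toy_opts *o, const struct toy_pack *p,
			  struct toy_kept_index *idx)
{
	assert(o && p && idx);

	/* 1. Does _this_ pack disqualify the object? Cheap checks only. */
	if (o->local && !p->pack_local)
		return 0;
	if (o->ignore_packed_keep_on_disk && p->pack_keep)
		return 0;
	if (o->ignore_packed_keep_in_core && p->pack_keep_in_core)
		return 0;

	/* 2. Otherwise, settle the keep question across all kept packs. */
	if (o->ignore_packed_keep_on_disk || o->ignore_packed_keep_in_core) {
		if (toy_has_object_kept_pack(idx))
			return 0;
	}

	/* 3. Keep-packs are settled; try the local-pack shortcut. */
	if (!o->local || !o->have_non_local_packs)
		return 1;

	/* 4. No shortcut applies; the caller keeps looking. */
	return -1;
}
```

The property I care about is that a non-local pack under "local" returns 0
without ever touching the kept-pack lookup.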

-Peff

* Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
@ 2021-02-17 17:11     ` Jeff King
  2021-02-17 19:54       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-17 17:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:59:21PM -0500, Taylor Blau wrote:

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index fbd7b54d70..b2ba5aa14f 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1225,9 +1225,9 @@ static int want_found_object(const struct object_id *oid, int exclude,
>  		 */
>  		unsigned flags = 0;
>  		if (ignore_packed_keep_on_disk)
> -			flags |= ON_DISK_KEEP_PACKS;
> +			flags |= CACHE_ON_DISK_KEEP_PACKS;
>  		if (ignore_packed_keep_in_core)
> -			flags |= IN_CORE_KEEP_PACKS;
> +			flags |= CACHE_IN_CORE_KEEP_PACKS;

Why are we renaming the constants in this patch?

I know I'm listed as the author, but I think this came out of some
off-list back and forth between us. It seems like the existing constants
would have been fine.

> +static void maybe_invalidate_kept_pack_cache(struct repository *r,
> +					     unsigned flags)
>  {
> -	return find_one_pack_entry(r, oid, e, 0);
> +	if (!r->objects->kept_pack_cache)
> +		return;
> +	if (r->objects->kept_pack_cache->flags == flags)
> +		return;
> +	free(r->objects->kept_pack_cache->packs);
> +	FREE_AND_NULL(r->objects->kept_pack_cache);
> +}

OK, so we keep a single cache based on the flags, and then if somebody
ever asks for different flags, we throw it away. That's probably OK for
our purposes, since we wouldn't expect multiple callers within a single
process.

I wondered if it would be simpler to just keep two lists, one for
in-core keeps and one for on-disk keeps. And then just walk over each
list separately based on the query flags. That makes things more robust
_and_ I think would be less code. It does mean that a pack could appear
in both lists, though, which means we might do a lookup in it twice.
That doesn't seem all that likely, but it is working against our goal
here.

Another option is to keep 3 caches (two separate and one combined),
rather than flipping between them. I'm not sure if that would be less
code or not (it gets rid of the "invalidate" function, but you do have
to pick the right cache depending on the query flags).

Yet another option is to keep a cache of any that are marked as _either_
in core or on-disk keeps, and then decide to look up the object based on
the query flags. Then you just pay the cost to iterate over the list and
check the flags (which really is all this cache is helping with in the
first place).
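
That last option might look roughly like this. It is a sketch with made-up
names: struct toy_pack, the TOY_* flags and toy_find_kept_pack() are not the
real packfile API, and the has_object field stands in for an actual index
lookup. The cache is a NULL-terminated list of every pack that is kept in
either sense, built once and never invalidated.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ON_DISK_KEEP_PACKS 1u
#define TOY_IN_CORE_KEEP_PACKS 2u

/* Hypothetical stand-in for struct packed_git. */
struct toy_pack {
	int pack_keep;          /* has a .keep file on disk */
	int pack_keep_in_core;  /* marked kept by the running process */
	int has_object;         /* stand-in for a real index lookup */
};

/*
 * Walk the single combined cache and skip packs whose kind of "kept"
 * was not asked for. Returns the first matching pack that contains
 * the object, or NULL if there is none.
 */
const struct toy_pack *toy_find_kept_pack(const struct toy_pack *const *cache,
					  unsigned flags)
{
	assert(cache);
	for (; *cache; cache++) {
		const struct toy_pack *p = *cache;
		int wanted =
			(p->pack_keep && (flags & TOY_ON_DISK_KEEP_PACKS)) ||
			(p->pack_keep_in_core && (flags & TOY_IN_CORE_KEEP_PACKS));

		if (wanted && p->has_object)
			return p;
	}
	return NULL;
}
```

The trade-off is one cheap flag test per cached pack on every query, in
exchange for dropping the invalidation logic entirely.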

I dunno. TBH, I kind of wonder if this whole patch is worth doing at
all, given the underwhelming performance benefit (3% on the
pathological 1000-pack case). When I had timed this strategy initially,
it was more like 15%. I'm not sure where the savings went in the
interim, or if it was a timing fluke.

> +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
> +{
> +	maybe_invalidate_kept_pack_cache(r, flags);
> +
> +	if (!r->objects->kept_pack_cache) {
> +		struct packed_git **packs = NULL;
> +		size_t nr = 0, alloc = 0;
> +		struct packed_git *p;
> +
> +		/*
> +		 * We want "all" packs here, because we need to cover ones that
> +		 * are used by a midx, as well. We need to look in every one of
> +		 * them (instead of the midx itself) to cover duplicates. It's
> +		 * possible that an object is found in two packs that the midx
> +		 * covers, one kept and one not kept, but the midx returns only
> +		 * the non-kept version.
> +		 */
> +		for (p = get_all_packs(r); p; p = p->next) {
> +			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
> +			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
> +				ALLOC_GROW(packs, nr + 1, alloc);
> +				packs[nr++] = p;
> +			}
> +		}
> +		ALLOC_GROW(packs, nr + 1, alloc);
> +		packs[nr] = NULL;
> +
> +		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
> +		r->objects->kept_pack_cache->packs = packs;
> +		r->objects->kept_pack_cache->flags = flags;
> +	}

Is there any reason not to just embed the kept_pack_cache struct inside
the object_store? It's one less pointer to deal with. I wonder if this
is a holdover from an attempt to have multiple caches.

(I also think it would be reasonable if we wanted to hide the definition
of the cache struct from callers, but we don't seem to do that).

> @@ -2109,7 +2123,8 @@ int has_object_pack(const struct object_id *oid)
>  	return find_pack_entry(the_repository, oid, &e);
>  }
>  
> -int has_object_kept_pack(const struct object_id *oid, unsigned flags)
> +int has_object_kept_pack(const struct object_id *oid,
> +			 unsigned flags)
>  {
>  	struct pack_entry e;
>  	return find_kept_pack_entry(the_repository, oid, flags, &e);

This seems like a stray change.

> diff --git a/packfile.h b/packfile.h
> index 624327f64d..eb56db2a7b 100644
> --- a/packfile.h
> +++ b/packfile.h
> @@ -161,10 +161,6 @@ int packed_object_info(struct repository *r,
>  void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
>  const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
>  
> -#define ON_DISK_KEEP_PACKS 1
> -#define IN_CORE_KEEP_PACKS 2
> -#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)

I notice that when the constants moved, we didn't keep an equivalent of
ALL_KEEP_PACKS. Maybe we didn't need it in the first place in patch 1?

  BTW, I absolutely hate the complication that all of this on-disk
  versus in-core keep distinction brings to this code. And I wondered
  what it was really doing for us and whether we could get rid of it.
  But I think we do need it: a common case may be to avoid using
  --honor-pack-keep (because you don't want to deal with racy .keep
  writes from incoming receive-pack processes), but use in-core ones for
  something like --stdin-packs. So we do need to respect one and not the
  other.

  I do wonder if things would be simpler if pack-objects simply kept its
  own list of "in core" packs in a separate array. But that is really
  just another form of the same problem, I guess.

-Peff

* Re: [PATCH v2 8/8] builtin/repack.c: add '--geometric' option
  2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-02-17 18:17     ` Jeff King
  2021-02-17 20:01       ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-17 18:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 03, 2021 at 10:59:25PM -0500, Taylor Blau wrote:

> Often it is useful to both:
> 
>   - have relatively few packfiles in a repository, and
> 
>   - avoid having so few packfiles in a repository that we repack its
>     entire contents regularly
> 
> This patch implements a '--geometric=<n>' option in 'git repack'. This
> allows the caller to specify that they would like each pack to be at
> least a factor times as large as the previous largest pack (by object
> count).
> 
> Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
> ..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
> With a geometric factor of 'r', it should be that:
> 
>   objects(Pi) > r*objects(P(i-1))
> 
> for all i in [1, n], where the packs are sorted by
> 
>   objects(P1) <= objects(P2) <= ... <= objects(Pn).

Just devil's advocating for a moment.

I think in this kind of geometric roll-up strategy, you want to imagine
that you are rolling up recent pushes but leaving untouched a good
"base" pack that you previously created.

And that will usually be true if you are doing the rollup based on
number of objects (or size, etc). But it won't always be (e.g., for some
reason somebody makes a very large push relative to the current
repository size). What happens when this assumption is violated?

In some ways, it is a good thing to drift away from this "base pack"
view of the world. If we're trying to amortize the per-object work done,
then we are better off rolling up the small things into the large,
regardless of where they came from.
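
To play with scenarios like that, here is a small sketch of how I read the
split computation from the cover letter: find the longest run of large packs
that already satisfies the invariant, then pull further packs into the
rollup while the combined pack would break it. This is not the code from the
patch. toy_geometric_split() is a made-up name, and the rolled-up object
count is approximated as a plain sum, ignoring duplicates across packs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds per-pack object counts sorted ascending. Returns how
 * many of the smallest packs (indices 0 .. result-1) get rolled up
 * into one; 0 means the packs already form a geometric progression.
 */
size_t toy_geometric_split(const uint32_t *counts, size_t n, uint32_t factor)
{
	size_t split, i;
	uint64_t rolled = 0;

	assert(counts || !n);
	if (n < 2)
		return 0;

	/* Step 1: extend the run of large packs while the invariant holds. */
	split = n - 1;
	while (split > 0 &&
	       (uint64_t)counts[split] > (uint64_t)factor * counts[split - 1])
		split--;
	if (!split)
		return 0;

	/* Everything below the cut goes into the new pack. */
	for (i = 0; i < split; i++)
		rolled += counts[i];

	/* Step 2: absorb larger packs that the new pack would violate. */
	while (split < n && (uint64_t)counts[split] <= factor * rolled) {
		rolled += counts[split];
		split++;
	}
	return split;
}
```

With a factor of 2, counts of {10, 20, 1000} roll up only the two small
packs and leave the 1000-object base alone. But {1000, 1500}, i.e. a push
that is large relative to the base, returns 2: the base pack gets rolled up
as well.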

But the base pack may also have other properties we want to retain. Two
I can think of:

  - it may have a .bitmap that we'll be throwing away, without
    generating a new one. I know that your end-game involves writing a
    midx with bitmaps that covers all of the packs, so this would become
    a non-issue in that strategy.

  - it may have been more carefully packed (e.g., with a larger window
    size, using "-f", etc) than the packs we got from pushes. We do
    _mostly_ retain the deltas when we roll up the packs, so it probably
    only has a small impact in practice (I'd expect in a few cases we'd
    throw away deltas because a pushed pack contains a duplicate of its
    base object that we added via --fix-thin).

So I suspect it's probably OK in practice. These cases would happen
rarely, and the impact would not be all that big. The bitmap thing I'd
worry the most about. As part of a larger strategy involving a midx it
is taken care of, but people using just this new feature may not realize
that. The bitmaps of course are "just" an optimization, but it's hard to
say how dire things are when they don't exist. For many situations,
probably not very dire. But I know that on our servers, when repos lack
bitmaps, people notice the performance degradation.

On the other hand, by definition this happens in a case where there are
more objects that have just been pushed (and are therefore not
bitmapped) than existed already. So you _already_ have a performance
problem either way until you get bitmap coverage of those new objects.

> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -165,6 +165,17 @@ depth is 4095.
>  	Pass the `--delta-islands` option to `git-pack-objects`, see
>  	linkgit:git-pack-objects[1].
>  
> +-g=<factor>::
> +--geometric=<factor>::
> +	Arrange resulting pack structure so that each successive pack
> +	contains at least `<factor>` times the number of objects as the
> +	next-largest pack.
> ++
> +`git repack` ensures this by determining a "cut" of packfiles that need to be
> +repacked into one in order to ensure a geometric progression. It picks the
> +smallest set of packfiles such that as many of the larger packfiles (by count of
> +objects contained in that pack) may be left intact.

I think we might need to make clear in the documentation how this
differs from other repacks, in that it is not considering reachability
at all. I like the term "roll up" to describe what is happening, but we
probably need to define that term clearly, as well.

Especially important, I think, is that we talk about what's happening
with loose objects, which are part of the rollup here. And IMHO we
should make clear that for now we include them all, without
consideration of their reachability, but that this may change in the
future.

Likewise, are there any options that are incompatible with "-g"? I have
to imagine that "--write-bitmap-index" would not work very well. I don't
know that we need to enumerate them all, but I'm wondering if a blanket
"this may not play well with other options" warning may be advisable.

> +static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> [...]

I'll admit I didn't carefully think about the math of the progression
here. IMHO the exact split is the least interesting part of this whole
series (compared to the general idea of "rolling up some packs" versus a
whole repack). Between the comments and the tests, I'll assume it's
generally behaving as advertised. (I of course did look for any obvious
coding errors, but didn't see any).
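
For readers who do want to see the shape of the math, here is a minimal sketch of the split as the cover letter describes it: find the longest run of large packs that already forms a progression, roll up everything below it, then absorb further packs until the rolled-up pack fits. This is illustrative only and is not the split_pack_geometry() from the patch. The function name is hypothetical, the comparison treats "at least factor times" as satisfying the progression, and overflow of factor * count is assumed not to happen.

```c
#include <stddef.h>

/*
 * 'counts' holds per-pack object counts sorted ascending. Returns 'split':
 * packs [0, split) would be rolled up, packs [split, n) are left alone.
 */
size_t sketch_geometric_split(const unsigned long *counts, size_t n,
			      unsigned long factor)
{
	size_t i, split;
	unsigned long total = 0;

	if (!n)
		return 0;

	/* Walk from the largest pack down until the progression breaks. */
	for (i = n - 1; i > 0; i--) {
		if (counts[i] < factor * counts[i - 1])
			break;
	}
	/*
	 * If we stopped early, pack i is too small relative to pack i-1,
	 * so it cannot stay out of the roll-up either.
	 */
	split = i ? i + 1 : 0;

	/* The rolled-up pack holds at most the sum of the small packs. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb larger packs until the new pack fits the progression. */
	while (split < n && counts[split] < factor * total) {
		total += counts[split];
		split++;
	}
	return split;
}
```

With counts {4, 4, 10, 100} and a factor of 2, the first three packs are rolled up and the 100-object pack is left untouched.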

-Peff

* Re: [PATCH v2 0/8] repack: support repacking into a geometric sequence
  2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
@ 2021-02-17 18:18     ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17 18:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 07:01:13PM -0500, Jeff King wrote:

> On Wed, Feb 03, 2021 at 10:58:45PM -0500, Taylor Blau wrote:
> 
> > The details of the new approach can be found in the third patch, but the gist is
> > as follows:
> > [...]
> 
> I think this turned out very nice (and less complicated than I feared it
> might). I've read up through patch 5. I think the overall approach is
> good, but I had various small-to-medium comments.
> 
> I'll try to pick up reviewing the rest tomorrow, though it may make
> sense to resolve the earlier comments first.

OK, I finished reading the rest and left a few more comments. The short
of it is that I really like the new direction, but I think there are
enough small comments to merit a re-roll, which I hope will be
the final one.

-Peff

* Re: [PATCH v2 2/8] revision: learn '--no-kept-objects'
  2021-02-16 23:17     ` Jeff King
@ 2021-02-17 18:35       ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 18:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 06:17:40PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote:
>
> > @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
> >  		return commit_ignore;
> >  	if (revs->unpacked && has_object_pack(&commit->object.oid))
> >  		return commit_ignore;
> > +	if (revs->no_kept_objects) {
> > +		if (has_object_kept_pack(&commit->object.oid,
> > +					 revs->keep_pack_cache_flags))
> > +			return commit_ignore;
> > +	}
>
> OK, so this has the same "problems" as --unpacked, which is that we can
> miss some objects (i.e., things that are reachable but not-kept may not
> be reported). But it should be OK in this version of the series, because
> we will not be relying on it for selection of objects, but only to fill
> in ordering / namehash fields.
>
> Should we warn people about that, either as a comment or in the commit
> message?

Yeah, let's warn about it in the commit message. We could put it in the
documentation, but...

> > +--no-kept-objects[=<kind>]::
> > +	Halts the traversal as soon as an object in a kept pack is
> > +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> > +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> > +	with their in-core kept state set are ignored. Otherwise, both
> > +	kinds of kept packs are ignored.
>
> Likewise, I wonder whether we need to expose this mode to users.
> Normally I'm a fan of doing so, because it allows scripted callers
> access to more of the internals, but:
>
>   - the semantics are kind of weird about where we draw the line between
>     performance and absolute correctness
>
>   - the "in-core" thing is a bit weird for callers of rev-list; how do I
>     as a caller mark a pack as kept-in-core? I think it's only an
>     internal pack-objects thing.
>
> Once we support this in rev-list, we'll have to do it forever (or deal
> with deprecation, etc). If we just need it internally, maybe it's wise
> to leave it as something you ask for by manipulating rev_info
> directly. Or perhaps leave it as an undocumented interface we use for
> testing, and not something we promise to keep working.

I think that you raise a good point about not advertising this option,
since doing so paints us into a corner that we have to keep it working
and behaving consistently forever.

I'm not opposed to the idea that we may eventually want to do so, but I
think that this is too early for that. As you note, we *could* just
expose it in rev_info flags, but that makes it much more difficult to
test some of the tricky cases that are added in t6114, so I think a
middle ground of having an undocumented option satisfies both of our
wants.

> > --- a/list-objects.c
> > +++ b/list-objects.c
> > @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
> >  			ctx->show_object(obj, name, ctx->show_data);
> >  			continue;
> >  		}
> > +		if (ctx->revs->no_kept_objects) {
> > +			struct pack_entry e;
> > +			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
> > +						 ctx->revs->keep_pack_cache_flags,
> > +						 &e))
> > +				continue;
> > +		}
>
> This hunk is interesting.
>
> There is no similar check for revs->unpacked in list-objects.c to cut
> off the traversal. And indeed, running "rev-list --unpacked" will
> generally look at the _whole_ tree for a commit that is unpacked, even
> if all of the tree entries are packed. That's something we might
> consider changing in the name of performance (though it does increase
> the number of cases where --unpacked will fail to find an unpacked but
> reachable object).
>
> But this is a funny place to put it. If I understand it correctly, it is
> cutting off the traversal at the very top of the tree. I.e., if we had a
> commit that is not-kept, we'd queue its root tree. And then we might
> find that the root tree is kept, and avoid traversing it. But if we _do_
> traverse it, we would look at every subtree it contains, even if they
> are kept! That's because we recurse the tree via the recursive
> process_tree(), not by queueing more objects in the pending array here.
>
> So this check seems to exist in a funny middle ground. I think it's
> unlikely to catch anything useful (usually commits have a unique root
> tree; it's all of the untouched parts of the subtrees that will be in
> the kept packs). IMHO we should either drop it (and act like
> "--unpacked", accepting that we may traverse some extra tree objects),
> or we should go all-in on performance and cut it off in the top of
> process_tree().

Agreed. Let's drop it.
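
To make the "funny middle ground" concrete, here is a toy model of the two placements for the kept check: once at the top of a queued root tree, versus inside the recursive walk. It is a sketch with hypothetical names and a made-up tree structure, not the list-objects.c code; it only counts how many trees each variant would visit.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tree object; 'kept' means "in a kept pack". */
struct sketch_tree {
	int kept;
	size_t nr;
	const struct sketch_tree *const *children;
};

/* Recursive walk; optionally stops at any kept subtree. */
static size_t sketch_walk(const struct sketch_tree *t, int check_in_recursion)
{
	size_t visited, i;

	if (check_in_recursion && t->kept)
		return 0;
	visited = 1;
	for (i = 0; i < t->nr; i++)
		visited += sketch_walk(t->children[i], check_in_recursion);
	return visited;
}

/* Returns the number of trees visited starting from a queued root tree. */
size_t sketch_traverse(const struct sketch_tree *root, int check_at_top,
		       int check_in_recursion)
{
	/* Checking only here never prunes kept subtrees below the root. */
	if (check_at_top && root->kept)
		return 0;
	return sketch_walk(root, check_in_recursion);
}
```

With a non-kept root that has one kept and one non-kept subtree, the top-only check visits all three trees, while the in-recursion check visits two.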

> -Peff

Thanks,
Taylor

* Re: [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-16 23:46     ` Jeff King
@ 2021-02-17 18:59       ` Taylor Blau
  2021-02-17 19:21         ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 18:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 06:46:59PM -0500, Jeff King wrote:
> Do we use this partial traversal to impact the write order at all? That
> would be a nice-to-have, but I suspect that just concatenating the packs
> (presumably by descending mtime) ends up with a similar result.

We don't; the objects are written in pack order. In the version of the
patch you reviewed, the order of packs was determined by their hash (due
to the string_list_sort()), but the version I just prepared re-sorts by
mtime.

It's kind of gross, since we need to use QSORT directly on the
string_list internals in order to have access to the ->util field of the
string_list_items (string_list_sort() only lets you compare strings
directly for obvious reasons).

I added a comment describing this hack.
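
Roughly, the shape of that hack is a comparator that receives list items and reaches through their util pointer for the pack's mtime. The following is a self-contained sketch using plain qsort() and hypothetical stand-in structs (sketch_item, sketch_pack) rather than git's string_list and packed_git; the newest-first direction is an assumption.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-ins for git's packed_git and string_list_item. */
struct sketch_pack {
	time_t mtime;
};

struct sketch_item {
	const char *string; /* pack basename */
	void *util;         /* points at a struct sketch_pack */
};

static int sketch_pack_mtime_cmp(const void *va, const void *vb)
{
	const struct sketch_pack *a = ((const struct sketch_item *)va)->util;
	const struct sketch_pack *b = ((const struct sketch_item *)vb)->util;

	/* Newest first (descending mtime); the direction is an assumption. */
	if (a->mtime < b->mtime)
		return 1;
	if (a->mtime > b->mtime)
		return -1;
	return 0;
}

/* Sort the items array in place, comparing via each item's util field. */
void sketch_sort_by_mtime(struct sketch_item *items, size_t nr)
{
	qsort(items, nr, sizeof(*items), sketch_pack_mtime_cmp);
}
```

The comparator never looks at the string field, which is exactly why a string-only sort helper cannot express it.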

> > +--stdin-packs::
> > +	Read the basenames of packfiles from the standard input, instead
> > +	of object names or revision arguments. The resulting pack
> > +	contains all objects listed in the included packs (those not
> > +	beginning with `^`), excluding any objects listed in the
> > +	excluded packs (beginning with `^`).
> > ++
> > +Incompatible with `--revs`, or options that imply `--revs` (such as
> > +`--all`), with the exception of `--unpacked`, which is compatible.
>
> I know you say "basename" here, but I wonder if it is worth giving an
> example (`pack-1234abcd.pack`) to make it clear in what form we expect
> it. Or possibly something in the `EXAMPLES` section.

Good idea, thanks.

> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -2979,6 +2979,164 @@ static int git_pack_config(const char *k, const char *v, void *cb)
> >  	return git_default_config(k, v, cb);
> >  }
> >
> > +static int stdin_packs_found_nr;
> > +static int stdin_packs_hints_nr;
>
> I scratched my head at these until I looked further in the code. They're
> the counters for the trace output. Might be worth a brief comment above
> them. (I do approve of adding this kind of trace debugging info; I'm
> pretty accustomed to using gdb or adding one-off debug statements, but
> we really could do a better job in general of making these kinds of
> internals visible to mere mortal admins).

Good call.

> > +static int add_object_entry_from_pack(const struct object_id *oid,
> > +				      struct packed_git *p,
> > +				      uint32_t pos,
> > +				      void *_data)
> > +{
> > +	struct rev_info *revs = _data;
> > +	struct object_info oi = OBJECT_INFO_INIT;
> > +	off_t ofs;
> > +	enum object_type type;
> > +
> > +	display_progress(progress_state, ++nr_seen);
> > +
> > +	ofs = nth_packed_object_offset(p, pos);
> > +
> > +	oi.typep = &type;
> > +	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
> > +		die(_("could not get type of object %s in pack %s"),
> > +		    oid_to_hex(oid), p->pack_name);
>
> Calling out for other reviewers: the oi.typep field will be filled in
> with the _real_ type of the object, even if it's a delta. This is as
> opposed to the return value of packed_object_info(), which may be
> OFS_DELTA or REF_DELTA.
>
> And that real type is what we want here:
>
> > +	else if (type == OBJ_COMMIT) {
> > +		/*
> > +		 * commits in included packs are used as starting points for the
> > +		 * subsequent revision walk
> > +		 */
> > +		add_pending_oid(revs, NULL, oid, 0);
> > +	}
>
> And later when we call create_object_entry().

:-). Yes indeed. As I'm sure that you will recall, the pack-objects
code _does not_ behave well when you give it the packed type of an
object (which is not entirely unexpected, since the pack-objects code
only operates on the true type, so passing the packed type--as I did
when originally writing this patch--is a bug).

> I wondered whether it would be worth adding other objects we might find,
> like trees, in order to increase our traversal. But that doesn't make
> any sense. The whole point is to find the paths, which come from
> traversing from the root trees. And we can only find the root trees by
> starting at commits. Adding any random tree we found would defeat the
> purpose (most of them are sub-trees and would give us a useless partial
> path).

Right.

> Should we avoid adding the commit as a tip for walking if it won't end
> up in the resulting pack? I.e., should we check these:
>
> > +	if (have_duplicate_entry(oid, 0))
> > +		return 0;
> > +
> > +	if (!want_object_in_pack(oid, 0, &p, &ofs))
> > +		return 0;
>
> ...first? I guess it probably doesn't matter too much since we'd
> truncate the traversal as soon as we saw it was in a kept pack anyway.

I agree it doesn't make a difference, but I think placing the extra
guards first makes it easier to read (since the reader doesn't have to
consider how the subsequent traversal would treat it).

> > +static void show_commit_pack_hint(struct commit *commit, void *_data)
> > +{
> > +}
>
> Nothing to do here, since commits don't have a name field. Makes sense.

Yeah. I added a comment to say the same thing, just for extra clarity.

>
> > +static void show_object_pack_hint(struct object *object, const char *name,
> > +				  void *_data)
> > +{
> > +	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
> > +	if (!oe)
> > +		return;
> > +
> > +	/*
> > +	 * Our 'to_pack' list was constructed by iterating all objects packed in
> > +	 * included packs, and so doesn't have a non-zero hash field that you
> > +	 * would typically pick up during a reachability traversal.
> > +	 *
> > +	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
> > +	 * here using a now in order to perhaps improve the delta selection
> > +	 * process.
> > +	 */
> > +	oe->hash = pack_name_hash(name);
> > +	oe->no_try_delta = name && no_try_delta(name);
> > +
> > +	stdin_packs_hints_nr++;
> > +}
>
> But for actual objects, we do fill in the hash. I wonder if it's
> possible for oe->hash to have been already filled. I don't think it
> really matters, though. Any value we get is equally valid, so
> overwriting is OK in that case.

Right.

> > +	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
> > +			   stdin_packs_found_nr);
>
> I wonder if it makes sense to report the actual set of packs via trace
> (obviously not as an int, but as a list). That's less helpful for
> debugging pack-objects, if you just fed it the input anyway, but if you
> were debugging "git repack --geometric" it might be useful to see which
> packs it thought were which (though arguably that would be a useful
> trace in builtin/repack.c instead).

I could see an argument both ways. I'd rather pass for now until we
have a clearer need for it.

> [passing --unpacked to the namehash traversal]
>
> I'm OK to consider that an implementation detail for now, though. We can
> change it later without impacting the interface.

Agreed.

> > +		if (rev_list_unpacked)
> > +			add_unreachable_loose_objects();
>
> Despite the name, that function is adding both reachable and unreachable
> ones. So it is doing what you want. It might be worth renaming, but it's
> not too big a deal since it's local to this file.

Yeah, I tend to err on the side of "it's fine as-is" since this isn't
exposed outside of pack-objects internals. If you feel strongly I'm
happy to change it, but I suspect you don't.

Thanks,
Taylor

* Re: [PATCH v2 5/8] p5303: measure time to repack with keep
  2021-02-16 23:58     ` Jeff King
  2021-02-17  0:02       ` Jeff King
@ 2021-02-17 19:13       ` Taylor Blau
  2021-02-17 19:25         ` Jeff King
  1 sibling, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 19:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Tue, Feb 16, 2021 at 06:58:16PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:59:13PM -0500, Taylor Blau wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > This is the same as the regular repack test, except that we mark the
> > single base pack as "kept" and use --assume-kept-packs-closed. The
>
> I don't think that option exists anymore. I guess we are just using
> --stdin-packs, which causes us to mark a pack as kept.
>
> I think we could just mark it in the filesystem and use
> --honor-pack-keep, which would make it independent of your new feature.
> At first I was going to say "but it doesn't matter either way", but...
>
> > theory is that this should be faster than the normal repack, because
> > we'll have fewer objects to traverse and process.
> >
> > Here are some timings on a recent clone of the kernel. In the
> > single-pack case, there is nothing to do since there are no non-excluded
> > packs:
> >
> >   5303.5: repack (1)                          57.42(54.88+10.64)
> >   5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)
> >
> > and in the 50-pack case, it is much faster to use `--stdin-packs`, since
> > we avoid having to consider any objects in the excluded pack:
> >
> >   5303.10: repack (50)                        71.26(88.24+4.96)
> >   5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)
> >
> > but our improvements vanish as we approach 1000 packs.
> >
> >   5303.15: repack (1000)                      215.64(491.33+14.80)
> >   5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)
> >
> > That's because the code paths around handling .keep files are known to
> > scale badly; they look in every single pack file to find each object.
>
> Well, part of it is just that with 1000 packs we have 20 times as many
> objects that are actually getting packed with --stdin-packs, compared to
> the 50-pack case. IIRC, each pack is a fixed-size slice and then the
> residual is put into the .keep pack. So the fact that the time gets
> closer to a full repack as we add more packs is expected: we are asking
> pack-objects to do more work!

No, the residual base pack isn't marked as kept on-disk. But the
--stdin-packs test treats it as such, by passing '^pack-$base_pack.pack'
as input to '--stdin-packs' (thus marking it as kept in-core).

> For showing the impact of the optimizations in patches 7 and 8, I think
> doing a full repack with --honor-pack-keep is a better test. Because
> then we're always doing a full traversal, and most of the work continues
> to scale with the repo size (though obviously not the actual shuffling
> of packed bytes around). That would get rid of the weird "no work to do"
> case in the single-pack tests, too.

I think you're suggesting that we change the "repack ($nr_packs)" test
to have the residual pack marked as kept (so we're measuring time it
takes to repack everything that _isn't_ in the base pack)?

That would allow a more direct comparison, but I think it's losing out
on an important aspect which is how long it takes to pack the entire
repository. Maybe we want three.

What do you think?

> -Peff

Thanks,
Taylor

* Re: [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-17 18:59       ` Taylor Blau
@ 2021-02-17 19:21         ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17 19:21 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 01:59:08PM -0500, Taylor Blau wrote:

> > > +		if (rev_list_unpacked)
> > > +			add_unreachable_loose_objects();
> >
> > Despite the name, that function is adding both reachable and unreachable
> > ones. So it is doing what you want. It might be worth renaming, but it's
> > not too big a deal since it's local to this file.
> 
> Yeah, I tend to err on the side of "it's fine as-is" since this isn't
> exposed outside of pack-objects internals. If you feel strongly I'm
> happy to change it, but I suspect you don't.

Yeah, I don't feel strongly (and if we did change it, it should be in a
separate patch anyway).

-Peff

* Re: [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-17 16:05     ` Jeff King
@ 2021-02-17 19:23       ` Taylor Blau
  2021-02-17 19:29         ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 19:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 11:05:22AM -0500, Jeff King wrote:
> I think just bumping that:
>
>   if (local && !p->pack_local)
> 	return 0;

> above the new code would fix it. Or to lay out the logic more fully, the
> order of checks should be:

>   - does _this_ pack we found the object in disqualify it. If so, we can
>     cheaply return 0. And that applies to both keep and local rules.
>
>   - otherwise, check all packs via has_object_kept_pack(), which is
>     cheaper than continuing to iterate through all packs by returning
>     -1.
>
>   - once we know definitively about keep-packs, then check any shortcuts
>     related to local packs (like !have_non_local_packs)
>
>   - and then if no shortcuts, we return -1

I don't understand what you're suggesting. Is the (local &&
!p->pack_local) a disqualifying condition? Reading the comment, I think
it is, and so we could do something like:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 36c2fa3aff..be3ba60bc2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1205,14 +1205,21 @@ static int want_found_object(const struct object_id *oid, int exclude,
         * make sure no copy of this object appears in _any_ pack that makes us
         * to omit the object, so we need to check all the packs.
         *
-        * We can however first check whether these options can possible matter;
+        * We can however first check whether these options can possibly matter;
         * if they do not matter we know we want the object in generated pack.
         * Otherwise, we signal "-1" at the end to tell the caller that we do
         * not know either way, and it needs to check more packs.
         */

        /*
-        * Handle .keep first, as we have a fast(er) path there.
+        * Objects in packs borrowed from elsewhere are discarded regardless of
+        * if they appear in other packs that weren't borrowed.
+        */
+       if (local && !p->pack_local)
+               return 0;
+
+       /*
+        * Then handle .keep first, as we have a fast(er) path there.
         */
        if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
                /*
@@ -1242,11 +1249,8 @@ static int want_found_object(const struct object_id *oid, int exclude,
         * keep-packs, or the object is not in one. Keep checking other
         * conditions...
         */
-
        if (!local || !have_non_local_packs)
                return 1;
-       if (local && !p->pack_local)
-               return 0;

        /* we don't know yet; keep looking for more packs */
        return -1;

But your "check any shortcuts related to local packs" makes me think
that we should leave the code as-is.

Which are you suggesting?

Thanks,
Taylor

* Re: [PATCH v2 5/8] p5303: measure time to repack with keep
  2021-02-17 19:13       ` Taylor Blau
@ 2021-02-17 19:25         ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17 19:25 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 02:13:47PM -0500, Taylor Blau wrote:

> > > That's because the code paths around handling .keep files are known to
> > > scale badly; they look in every single pack file to find each object.
> >
> > Well, part of it is just that with 1000 packs we have 20 times as many
> > objects that are actually getting packed with --stdin-packs, compared to
> > the 50-pack case. IIRC, each pack is a fixed-size slice and then the
> > residual is put into the .keep pack. So the fact that the time gets
> > closer to a full repack as we add more packs is expected: we are asking
> > pack-objects to do more work!
> 
> No, the residual base pack isn't marked as kept on-disk. But the
> --stdin-packs test treats it as such, by passing '^pack-$base_pack.pack'
> as input to '--stdin-packs' (thus marking it as kept in-core).

Sorry, I perhaps shouldn't have said ".keep" here. But it's the same
thing, isn't it? The 50 pack case is packing 50*pack_size objects
(because it's excluding everything else that is in the base pack we mark
as keep-in-core), and the 1000-pack case is packing 1000*pack_size
objects (for the same reason).

So any patterns we see between them have more to do with that, than how
the keep-handling code scales with the number of non-kept packs.

> > For showing the impact of the optimizations in patches 7 and 8, I think
> > doing a full repack with --honor-pack-keep is a better test. Because
> > then we're always doing a full traversal, and most of the work continues
> > to scale with the repo size (though obviously not the actual shuffling
> > of packed bytes around). That would get rid of the weird "no work to do"
> > case in the single-pack tests, too.
> 
> I think you're suggesting that we change the "repack ($nr_packs)" test
> to have the residual pack marked as kept (so we're measuring time it
> takes to repack everything that _isn't_ in the base pack)?
> 
> That would allow a more direct comparison, but I think it's losing out
> on an important aspect which is how long it takes to pack the entire
> repository. Maybe we want three.

That was what I was suggesting, but I think it's equivalent to what your
--stdin-packs is testing. I guess the most interesting thing would
actually be an _additional_ pack marked as .keep (and that pack does not
even have to contain anything interesting -- the point is how much
effort it costs to find that out. Of course the bigger it is the more
pronounced the effect of avoiding lookups in it).

-Peff

* Re: [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-17 19:23       ` Taylor Blau
@ 2021-02-17 19:29         ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17 19:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 02:23:27PM -0500, Taylor Blau wrote:

> On Wed, Feb 17, 2021 at 11:05:22AM -0500, Jeff King wrote:
> > I think just bumping that:
> >
> >   if (local && !p->pack_local)
> > 	return 0;
> 
> > above the new code would fix it. Or to lay out the logic more fully, the
> > order of checks should be:
> 
> >   - does _this_ pack we found the object in disqualify it. If so, we can
> >     cheaply return 0. And that applies to both keep and local rules.
> >
> >   - otherwise, check all packs via has_object_kept_pack(), which is
> >     cheaper than continuing to iterate through all packs by returning
> >     -1.
> >
> >   - once we know definitively about keep-packs, then check any shortcuts
> >     related to local packs (like !have_non_local_packs)
> >
> >   - and then if no shortcuts, we return -1
> 
> I don't understand what you're suggesting. Is the (local &&
> !p->pack_local) a disqualifying condition? Reading the comment, I think
> it is, and so we could do something like:

That's exactly what I'm suggesting. If we have a non-local pack and were
given --local, then we can shortcut immediately without caring about
kept packs: we know that we do not want the object.

> [...]
> But your "check any shortcuts related to local packs" makes me think
> that we should leave the code as-is.

No, the "shortcuts" there is the opposite:

  if (!local || !have_non_local_packs)
	return 1;

If either of those is true, we can say "definitely include" but only
with respect to the --local requirement. So we _can't_ bump that up, but
must check it only after we've definitively resolved the keep-pack
requirement.
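
Put together, the ordering described above can be sketched as a small decision function. This is a simplified model with hypothetical names, not the want_found_object() in builtin/pack-objects.c: in particular, the "is this object in any kept pack" answer is passed in as a precomputed flag here, where the real code would do a lookup across kept packs.

```c
/* Options that apply to the whole run (names are hypothetical). */
struct sketch_query {
	int local;                /* --local was given */
	int honor_keep;           /* objects in kept packs are excluded */
	int have_non_local_packs; /* any borrowed packs exist at all */
};

/* Returns 1 (want), 0 (don't want), or -1 (unknown; check more packs). */
int sketch_want_found_object(const struct sketch_query *q,
			     int pack_is_local, int pack_is_kept,
			     int object_in_any_kept_pack)
{
	/* 1. Does _this_ pack disqualify the object? Cheap, so check first. */
	if (q->local && !pack_is_local)
		return 0;
	if (q->honor_keep && pack_is_kept)
		return 0;

	/* 2. Settle the keep question across all packs in one lookup. */
	if (q->honor_keep && object_in_any_kept_pack)
		return 0;

	/* 3. Keep is resolved; only now may the --local shortcut say "include". */
	if (!q->local || !q->have_non_local_packs)
		return 1;

	/* 4. No shortcut applies; the caller must look at more packs. */
	return -1;
}
```

Moving step 3 above step 2 would return 1 for an object that also lives in a kept pack, which is why that shortcut cannot be bumped up.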

-Peff

* Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-17 17:11     ` Jeff King
@ 2021-02-17 19:54       ` Taylor Blau
  2021-02-17 20:25         ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 19:54 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 12:11:00PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:59:21PM -0500, Taylor Blau wrote:
>
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index fbd7b54d70..b2ba5aa14f 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1225,9 +1225,9 @@ static int want_found_object(const struct object_id *oid, int exclude,
> >  		 */
> >  		unsigned flags = 0;
> >  		if (ignore_packed_keep_on_disk)
> > -			flags |= ON_DISK_KEEP_PACKS;
> > +			flags |= CACHE_ON_DISK_KEEP_PACKS;
> >  		if (ignore_packed_keep_in_core)
> > -			flags |= IN_CORE_KEEP_PACKS;
> > +			flags |= CACHE_IN_CORE_KEEP_PACKS;
>
> Why are we renaming the constants in this patch?
>
> I know I'm listed as the author, but I think this came out of some
> off-list back and forth between us. It seems like the existing constants
> would have been fine.

Yeah, they would have been fine. They were renamed because this patch
makes them only used for the kept pack cache, but I agree the existing
names are fine, too.

In any case, they make an easier-to-read diff, so I'm perfectly happy to
un-rename them ;).

> > +static void maybe_invalidate_kept_pack_cache(struct repository *r,
> > +					     unsigned flags)
> >  {
> > -	return find_one_pack_entry(r, oid, e, 0);
> > +	if (!r->objects->kept_pack_cache)
> > +		return;
> > +	if (r->objects->kept_pack_cache->flags == flags)
> > +		return;
> > +	free(r->objects->kept_pack_cache->packs);
> > +	FREE_AND_NULL(r->objects->kept_pack_cache);
> > +}
>
> OK, so we keep a single cache based on the flags, and then if somebody
> ever asks for different flags, we throw it away. That's probably OK for
> our purposes, since we wouldn't expect multiple callers within a single
> process.
>
> I wondered if it would be simpler to just keep two lists, one for
> in-core keeps and one for on-disk keeps. And then just walk over each
> list separately based on the query flags. That makes things more robust
> _and_ I think would be less code. It does mean that a pack could appear
> in both lists, though, which means we might do a lookup in it twice.
> That doesn't seem all that likely, but it is working against our goal
> here.
>
> Another option is to keep 3 caches (two separate and one combined),
> rather than flipping between them. I'm not sure if that would be less
> code or not (it gets rid of the "invalidate" function, but you do have
> to pick the right cache depending on the query flags).
>
> Yet another option is to keep a cache of any that are marked as _either_
> in core or on-disk keeps, and then decide to look up the object based on
> the query flags. Then you just pay the cost to iterate over the list and
> check the flags (which really is all this cache is helping with in the
> first place).

All interesting ideas. In this patch (and by the end of the series)
callers that use the kept pack cache never ask for the cache with a
different set of flags. IOW, there isn't a situation where a caller
would populate the in-core kept pack cache, and then suddenly ask for
both in-core and on-disk packs to be kept.

So all of this code is defensive in case that were to change, and
suddenly we'd be returning subtly wrong results. I could imagine that
being kind of a nasty bug to track down, so detecting and invalidating
the cache would make it a non-issue.

I'll note it in the commit message, though, since it's good for future
readers to be aware, too.
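
For readers who want to see the defensive behavior in isolation, here is
a minimal, self-contained sketch of "one cache, thrown away when the
flags change". It is a toy model, not the patch: `toy_pack`,
`toy_kept_cache`, `toy_is_kept()`, `toy_kept_packs()` and the `rebuilds`
counter are all hypothetical names made up for illustration, and only
the two flag constants mirror the ones discussed above.

```c
#include <assert.h>
#include <stdlib.h>

#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2

/* Toy stand-in for "struct packed_git"; only the fields used here. */
struct toy_pack {
	struct toy_pack *next;
	unsigned pack_keep:1,
		 pack_keep_in_core:1;
};

/* Toy stand-in for the kept-pack cache (hypothetical layout). */
struct toy_kept_cache {
	struct toy_pack **packs;	/* NULL-terminated, or NULL if unset */
	unsigned flags;			/* flags the cache was built with */
	int rebuilds;			/* instrumentation for this sketch only */
};

static int toy_is_kept(const struct toy_pack *p, unsigned flags)
{
	return (p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) ||
	       (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS));
}

static struct toy_pack **toy_kept_packs(struct toy_kept_cache *c,
					struct toy_pack *all, unsigned flags)
{
	struct toy_pack *p;
	size_t nr = 0;

	assert(c != NULL);

	/* Built with different flags? Throw it away rather than reuse it. */
	if (c->packs && c->flags != flags) {
		free(c->packs);
		c->packs = NULL;
	}
	if (c->packs)
		return c->packs;

	/* First pass: count matching packs so we can allocate once. */
	for (p = all; p; p = p->next)
		if (toy_is_kept(p, flags))
			nr++;

	c->packs = malloc((nr + 1) * sizeof(*c->packs));
	if (!c->packs)
		abort();

	/* Second pass: fill the NULL-terminated array. */
	nr = 0;
	for (p = all; p; p = p->next)
		if (toy_is_kept(p, flags))
			c->packs[nr++] = p;
	c->packs[nr] = NULL;
	c->flags = flags;
	c->rebuilds++;
	return c->packs;
}
```

Repeated calls with the same flags reuse the array; a call with
different flags pays for one rebuild but never returns a list that was
computed for the wrong query.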

> I dunno. TBH, I kind of wonder if this whole patch is worth doing at
> all, given the underwhelming performance benefit (3% on the
> pathological 1000-pack case). When I had timed this strategy initially,
> it was more like 15%. I'm not sure where the savings went in the
> interim, or if it was a timing fluke.

Yeah, I dunno. It's certainly not hurting (I don't think the extra code
is all that complex, and the savings is at least non-zero), so I'm
inclined to keep it.

> > +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
> > +{
> > +	maybe_invalidate_kept_pack_cache(r, flags);
> > +
> > +	if (!r->objects->kept_pack_cache) {
> > +		struct packed_git **packs = NULL;
> > +		size_t nr = 0, alloc = 0;
> > +		struct packed_git *p;
> > +
> > +		/*
> > +		 * We want "all" packs here, because we need to cover ones that
> > +		 * are used by a midx, as well. We need to look in every one of
> > +		 * them (instead of the midx itself) to cover duplicates. It's
> > +		 * possible that an object is found in two packs that the midx
> > +		 * covers, one kept and one not kept, but the midx returns only
> > +		 * the non-kept version.
> > +		 */
> > +		for (p = get_all_packs(r); p; p = p->next) {
> > +			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
> > +			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
> > +				ALLOC_GROW(packs, nr + 1, alloc);
> > +				packs[nr++] = p;
> > +			}
> > +		}
> > +		ALLOC_GROW(packs, nr + 1, alloc);
> > +		packs[nr] = NULL;
> > +
> > +		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
> > +		r->objects->kept_pack_cache->packs = packs;
> > +		r->objects->kept_pack_cache->flags = flags;
> > +	}
>
> Is there any reason not to just embed the kept_pack_cache struct inside
> the object_store? It's one less pointer to deal with. I wonder if this
> is a holdover from an attempt to have multiple caches.
>
> (I also think it would be reasonable if we wanted to hide the definition
> of the cache struct from callers, but we don't seem to do that).

Not a holdover, just designed to avoid adding too many extra fields to
the object-store. I don't feel strongly, but I do think hiding the
definition is a good idea, so I'll inline it.

> > @@ -2109,7 +2123,8 @@ int has_object_pack(const struct object_id *oid)
> >  	return find_pack_entry(the_repository, oid, &e);
> >  }
> >
> > -int has_object_kept_pack(const struct object_id *oid, unsigned flags)
> > +int has_object_kept_pack(const struct object_id *oid,
> > +			 unsigned flags)
> >  {
> >  	struct pack_entry e;
> >  	return find_kept_pack_entry(the_repository, oid, flags, &e);
>
> This seems like a stray change.

Good eyes, thanks.

>
> > diff --git a/packfile.h b/packfile.h
> > index 624327f64d..eb56db2a7b 100644
> > --- a/packfile.h
> > +++ b/packfile.h
> > @@ -161,10 +161,6 @@ int packed_object_info(struct repository *r,
> >  void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
> >  const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
> >
> > -#define ON_DISK_KEEP_PACKS 1
> > -#define IN_CORE_KEEP_PACKS 2
> > -#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
>
> I notice that when the constants moved, we didn't keep an equivalent of
> ALL_KEEP_PACKS. Maybe we didn't need it in the first place in patch 1?

Yeah, we didn't need it to begin with. I'll drop it accordingly.

>   BTW, I absolutely hate the complication that all of this on-disk
>   versus in-core keep distinction brings to this code. And I wondered
>   what it was really doing for us and whether we could get rid of it.
>   But I think we do need it: a common case may be to avoid using
>   --honor-pack-keep (because you don't want to deal with racy .keep
>   writes from incoming receive-pack processes), but use in-core ones for
>   something like --stdin-packs. So we do need to respect one and not the
>   other.
>
>   I do wonder if things would be simpler if pack-objects simply kept its
>   own list of "in core" packs in a separate array. But that is really
>   just another form of the same problem, I guess.

Yeah, the complexity is awfully hard to reason about, but you're right
that here it is necessary.

> -Peff

Thanks,
Taylor

* Re: [PATCH v2 8/8] builtin/repack.c: add '--geometric' option
  2021-02-17 18:17     ` Jeff King
@ 2021-02-17 20:01       ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 20:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 01:17:16PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:59:25PM -0500, Taylor Blau wrote:
>
> > Often it is useful to both:
> >
> >   - have relatively few packfiles in a repository, and
> >
> >   - avoid having so few packfiles in a repository that we repack its
> >     entire contents regularly
> >
> > This patch implements a '--geometric=<n>' option in 'git repack'. This
> > allows the caller to specify that they would like each pack to be at
> > least a factor times as large as the previous largest pack (by object
> > count).
> >
> > Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
> > ..., up to Pn. Each packfile Pi has an object count equal to
> > 'objects(Pi)'. With a geometric factor of 'r', it should be that:
> >
> >   objects(Pi) > r*objects(P(i-1))
> >
> > for all i in (1, n], where the packs are sorted by
> >
> >   objects(P1) <= objects(P2) <= ... <= objects(Pn).
>
> Just devil's advocating for a moment.
>
> [large push becoming the biggest pack in a repository]
>
>   - it may have been more carefully packed (e.g., with a larger window
>     size, using "-f", etc) than the packs we got from pushes. We do
>     _mostly_ retain the deltas when we roll up the packs, so it probably
>     only has a small impact in practice (I'd expect in a few cases we'd
>     throw away deltas because a pushed pack contains a duplicate of its
>     base object that we added via --fix-thin).

Yeah, agreed.

> So I suspect it's probably OK in practice. These cases would happen
> rarely, and the impact would not be all that big. The bitmap thing I'd
> worry the most about. As part of a larger strategy involving a midx it
> is taken care of, but people using just this new feature may not realize
> that. The bitmaps of course are "just" an optimization, but it's hard to
> say how dire things are when they don't exist. For many situations,
> probably not very dire. But I know that on our servers, when repos lack
> bitmaps, people notice the performance degradation.
>
> On the other hand, by definition this happens in a case where there are
> more objects that have just been pushed (and are therefore not
> bitmapped) than existed already. So you _already_ have a performance
> problem either way until you get bitmap coverage of those new objects.

I almost split my reply between this and the above paragraph to say
exactly this. I think in this case you'd want to rewrite your bitmap
from scratch either way (whether you were using multi-pack or
traditional reachability bitmaps).

> > --- a/Documentation/git-repack.txt
> > +++ b/Documentation/git-repack.txt
> > @@ -165,6 +165,17 @@ depth is 4095.
> >  	Pass the `--delta-islands` option to `git-pack-objects`, see
> >  	linkgit:git-pack-objects[1].
> >
> > +-g=<factor>::
> > +--geometric=<factor>::
> > +	Arrange resulting pack structure so that each successive pack
> > +	contains at least `<factor>` times the number of objects as the
> > +	next-largest pack.
> > ++
> > +`git repack` ensures this by determining a "cut" of packfiles that need to be
> > +repacked into one in order to ensure a geometric progression. It picks the
> > +smallest set of packfiles such that as many of the larger packfiles (by count of
> > +objects contained in that pack) may be left intact.
>
> I think we might need to make clear in the documentation how this
> differs from other repacks, in that it is not considering reachability
> at all. I like the term "roll up" to describe what is happening, but we
> probably need to define that term clearly, as well.

All fair suggestions, thanks.
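
To make "roll up" a little more concrete for anyone reading along, the
idea of picking a cut purely by object count (no reachability involved)
can be sketched roughly as below. This is a simplified illustration
under stated assumptions, not the code in builtin/repack.c:
`toy_geometric_split()` and `toy_cmp_size()` are hypothetical names, the
input is just an array of per-pack object counts, the "at least
`<factor>` times" wording from the documentation is used for the
comparison, and multiplication overflow is ignored.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_cmp_size(const void *a, const void *b)
{
	size_t x = *(const size_t *)a, y = *(const size_t *)b;
	return (x > y) - (x < y);
}

/*
 * Sorts "counts" ascending and returns how many packs, starting from
 * the smallest, would be rolled up into one new pack. Packs at index
 * >= the return value are left untouched.
 */
static size_t toy_geometric_split(size_t *counts, size_t nr, size_t factor)
{
	size_t i, split, total = 0;

	assert(factor > 0);
	if (!nr)
		return 0;

	qsort(counts, nr, sizeof(*counts), toy_cmp_size);

	/* Walk down from the largest pack until a pair breaks the progression. */
	for (i = nr - 1; i > 0; i--)
		if (counts[i] < factor * counts[i - 1])
			break;

	/*
	 * If pair (i-1, i) broke the progression, pack i cannot stay either,
	 * so everything up to and including i is rolled up.
	 */
	split = i ? i + 1 : 0;

	for (i = 0; i < split; i++)
		total += counts[i];

	/* The new pack may now be too big for its neighbor; absorb as needed. */
	while (split < nr && counts[split] < factor * total) {
		total += counts[split];
		split++;
	}
	return split;
}
```

With a factor of 2, counts of 1, 2, 3 and 100 objects give a cut of 3:
the three small packs are combined into one 6-object pack and the
100-object pack is left alone. Counts of 1, 2, 4 and 8 already form a
progression, so the cut is 0 and nothing is rewritten.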

Thanks,
Taylor

* Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-17 19:54       ` Taylor Blau
@ 2021-02-17 20:25         ` Jeff King
  2021-02-17 20:29           ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-17 20:25 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 02:54:33PM -0500, Taylor Blau wrote:

> > OK, so we keep a single cache based on the flags, and then if somebody
> > ever asks for different flags, we throw it away. That's probably OK for
> > our purposes, since we wouldn't expect multiple callers within a single
> > process.
> > [...some alternatives]
> 
> All interesting ideas. In this patch (and by the end of the series)
> callers that use the kept pack cache never ask for the cache with a
> different set of flags. IOW, there isn't a situation where a caller
> would populate the in-core kept pack cache, and then suddenly ask for
> both in-core and on-disk packs to be kept.
> 
> So all of this code is defensive in case that were to change, and
> suddenly we'd be returning subtly wrong results. I could imagine that
> being kind of a nasty bug to track down, so detecting and invalidating
> the cache would make it a non-issue.

Yeah, I agree that the current crop of callers does not care. And I am
glad we are not leaving a booby-trap for later programmers with respect
to correctness (by virtue of the invalidation function). But it does
feel like we are leaving one for performance: they very well might
not realize the cache is doing worse than nothing.

Would just doing:

  if (cache.packs && cache.flags != flags)
	BUG("kept-pack-cache cannot handle multiple queries in a single process");

be a better solution? That is not helping anyone towards a world where
we gracefully handle back-and-forth queries. But it makes it abundantly
clear when such a thing would become necessary.

> > Is there any reason not to just embed the kept_pack_cache struct inside
> > the object_store? It's one less pointer to deal with. I wonder if this
> > is a holdover from an attempt to have multiple caches.
> >
> > (I also think it would be reasonable if we wanted to hide the definition
> > of the cache struct from callers, but we don't seem to do that).
> 
> Not a holdover, just designed to avoid adding too many extra fields to
> the object-store. I don't feel strongly, but I do think hiding the
> definition is a good idea, so I'll inline it.

This response confuses me a bit. Hiding the definition from callers
would mean _keeping_ it as a pointer, but putting the definition into
packfile.c, where nobody outside that file could see it (at least that
is what I meant by hiding).

But inlining it to me implies embedding the struct (not a pointer to it)
in "struct object_store", defining the struct at the point we define the
struct field which uses it.

I am fine with either, to be clear. I'm just confused which you are
proposing to do. :)

-Peff

* Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-17 20:25         ` Jeff King
@ 2021-02-17 20:29           ` Taylor Blau
  2021-02-17 21:43             ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-17 20:29 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 03:25:17PM -0500, Jeff King wrote:
> Would just doing:
>
>   if (cache.packs && cache.flags != flags)
> 	BUG("kept-pack-cache cannot handle multiple queries in a single process");
>
> be a better solution? That is not helping anyone towards a world where
> we gracefully handle back-and-forth queries. But it makes it abundantly
> clear when such a thing would become necessary.

I dunno. I can certainly see its merits, but I have to imagine that
anybody who cares enough about the performance will be able to find our
conversation here. Assuming that's the case, I would rather have the
kept-pack cache handle multiple queries before BUG()-ing.

> > > Is there any reason not to just embed the kept_pack_cache struct inside
> > > the object_store? It's one less pointer to deal with. I wonder if this
> > > is a holdover from an attempt to have multiple caches.
> > >
> > > (I also think it would be reasonable if we wanted to hide the definition
> > > of the cache struct from callers, but we don't seem to do that).
> >
> > Not a holdover, just designed to avoid adding too many extra fields to
> > the object-store. I don't feel strongly, but I do think hiding the
> > definition is a good idea, so I'll inline it.
>
> This response confuses me a bit. Hiding the definition from callers
> would mean _keeping_ it as a pointer, but putting the definition into
> packfile.c, where nobody outside that file could see it (at least that
> is what I meant by hiding).
>
> But inlining it to me implies embedding the struct (not a pointer to it)
> in "struct object_store", defining the struct at the point we define the
> struct field which uses it.
>
> I am fine with either, to be clear. I'm just confused which you are
> proposing to do. :)

Probably because I changed my mind in the middle of writing it ;). I'm
proposing embedding the definition of the struct into the definition of
object_store, and then operating on its fields (from within packfile.c).
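
Since the two terms got tangled above, here is a small sketch of the two
layouts being contrasted. It is illustrative only: every `toy_*` name is
made up for this example, and the field names are borrowed from the
quoted patch rather than from whatever the next revision ends up using.

```c
#include <assert.h>
#include <stddef.h>

struct toy_packed_git;		/* stand-in for git's "struct packed_git" */

/* (a) "Hidden": callers only ever see an opaque pointer. */
struct toy_hidden_cache;	/* would be defined privately in one .c file */
struct toy_store_hidden {
	struct toy_hidden_cache *kept_pack_cache;
};

/* (b) "Embedded": the fields live directly inside the store. */
struct toy_store_embedded {
	struct {
		struct toy_packed_git **packs;
		unsigned flags;
	} kept_pack_cache;
};

/*
 * With (b) there is no separately allocated cache object; "is the cache
 * populated?" is just a check on the embedded "packs" field.
 */
static int toy_cache_is_populated(const struct toy_store_embedded *s)
{
	assert(s != NULL);
	return s->kept_pack_cache.packs != NULL;
}
```

Layout (b) saves one allocation and one pointer dereference, at the
cost of exposing the field names to anything that includes the header.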

Thanks,
Taylor

* Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-17 20:29           ` Taylor Blau
@ 2021-02-17 21:43             ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-17 21:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 03:29:50PM -0500, Taylor Blau wrote:

> On Wed, Feb 17, 2021 at 03:25:17PM -0500, Jeff King wrote:
> > Would just doing:
> >
> >   if (cache.packs && cache.flags != flags)
> > 	BUG("kept-pack-cache cannot handle multiple queries in a single process");
> >
> > be a better solution? That is not helping anyone towards a world where
> > we gracefully handle back-and-forth queries. But it makes it abundantly
> > clear when such a thing would become necessary.
> 
> I dunno. I can certainly see its merits, but I have to imagine that
> anybody who cares enough about the performance will be able to find our
> conversation here. Assuming that's the case, I would rather have the
> kept-pack cache handle multiple queries before BUG()-ing.

OK. I am on the fence, and you are the author, so I'm happy to go with
your preference.

I'm not quite as optimistic that somebody would find this conversation,
if only because they have to know to look for it. I could easily see
somebody adding a find_kept_in_pack() without thinking too hard about
it. OTOH, I find it quite unlikely that anybody would use a different
set of flags within the same process, so it would probably Just Work for
them regardless. :)

> > This response confuses me a bit. Hiding the definition from callers
> > would mean _keeping_ it as a pointer, but putting the definition into
> > packfile.c, where nobody outside that file could see it (at least that
> > is what I meant by hiding).
> >
> > But inlining it to me implies embedding the struct (not a pointer to it)
> > in "struct object_store", defining the struct at the point we define the
> > struct field which uses it.
> >
> > I am fine with either, to be clear. I'm just confused which you are
> > proposing to do. :)
> 
> Probably because I changed my mind in the middle of writing it ;). I'm
> proposing embedding the definition of the struct into the definition of
> object_store, and then operating on its fields (from within packfile.c).

OK, that sounds great to me (and arguably produces more efficient code,
since we avoid a pointer dereference, though I doubt it matters in
practice). Thanks for clarifying.

-Peff

* [PATCH v3 0/8] repack: support repacking into a geometric sequence
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (11 preceding siblings ...)
  2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
@ 2021-02-18  3:14 ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
                     ` (8 more replies)
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
  13 siblings, 9 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Here is another updated version of mine and Peff's series to add a new 'git
repack --geometric' mode which supports repacking a repository into a geometric
progression of packs by object count.

(A previous version of this series depended on 'jk/p5303-sed-portability-fix',
but that topic has since been merged to 'master'. This series has been updated
to apply based on 'master' accordingly).

The series has not changed substantially since v2, but a range-diff is included
below for convenience. The most notable change is that the new tests in p5303
were reworked to provide a more equivalent comparison.

Beyond that, some minor code clean-up (embedding the kept-pack cache, making the
'--no-kept-objects' option of 'rev-list' undocumented, etc.) has been applied to
address Peff's review.

Thanks in advance for another look at this series. I'm hopeful that this version
is in a good state to be queued so that it can make the 2.31 release, and users
can start playing with it.

Jeff King (4):
  p5303: add missing &&-chains
  p5303: measure time to repack with keep
  builtin/pack-objects.c: rewrite honor-pack-keep logic
  packfile: add kept-pack cache for find_kept_pack_entry()

Taylor Blau (4):
  packfile: introduce 'find_kept_pack_entry()'
  revision: learn '--no-kept-objects'
  builtin/pack-objects.c: add '--stdin-packs' option
  builtin/repack.c: add '--geometric' option

 Documentation/git-pack-objects.txt |  10 +
 Documentation/git-repack.txt       |  22 ++
 builtin/pack-objects.c             | 329 ++++++++++++++++++++++++-----
 builtin/repack.c                   | 187 +++++++++++++++-
 object-store.h                     |   5 +
 packfile.c                         |  67 ++++++
 packfile.h                         |   5 +
 revision.c                         |  15 ++
 revision.h                         |   4 +
 t/perf/p5303-many-packs.sh         |  36 +++-
 t/t5300-pack-object.sh             |  97 +++++++++
 t/t6114-keep-packs.sh              |  69 ++++++
 t/t7703-repack-geometric.sh        | 137 ++++++++++++
 13 files changed, 921 insertions(+), 62 deletions(-)
 create mode 100755 t/t6114-keep-packs.sh
 create mode 100755 t/t7703-repack-geometric.sh

Range-diff against v2:
[rebased onto 'master']
13:  f7186147eb !  1:  aa94edf39b packfile: introduce 'find_kept_pack_entry()'
    @@ packfile.h: int packed_object_info(struct repository *r,

     +#define ON_DISK_KEEP_PACKS 1
     +#define IN_CORE_KEEP_PACKS 2
    -+#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
     +
      /*
       * Iff a pack file in the given repository contains the object named by sha1,
14:  ddc2896caa !  2:  82f6b45463 revision: learn '--no-kept-objects'
    @@ Commit message
         certain packs alone (for e.g., when doing a geometric repack that has
         some "large" packs which are kept in-core that it wants to leave alone).

    +    Note that this option is not guaranteed to produce exactly the set of
    +    objects that aren't in kept packs, since it's possible the traversal
    +    order may end up in a situation where a non-kept ancestor was "cut off"
    +    by a kept object (at which point we would stop traversing). But, we
    +    don't care about absolute correctness here, since this will eventually
    +    be used as a purely additive guide in an upcoming new repack mode.
    +
    +    Explicitly avoid documenting this new flag, since it is only used
    +    internally. In theory we could avoid even adding it to rev-list, but being
    +    able to spell this option out on the command-line makes some special
    +    cases easier to test without promising to keep it behaving consistently
    +    forever. Those tricky cases are exercised in t6114.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

    - ## Documentation/rev-list-options.txt ##
    -@@ Documentation/rev-list-options.txt: ifdef::git-rev-list[]
    - 	Only useful with `--objects`; print the object IDs that are not
    - 	in packs.
    -
    -+--no-kept-objects[=<kind>]::
    -+	Halts the traversal as soon as an object in a kept pack is
    -+	found. If `<kind>` is `on-disk`, only packs with a corresponding
    -+	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
    -+	with their in-core kept state set are ignored. Otherwise, both
    -+	kinds of kept packs are ignored.
    -+
    - --object-names::
    - 	Only useful with `--objects`; print the names of the object IDs
    - 	that are found. This is the default behavior.
    -
    - ## list-objects.c ##
    -@@ list-objects.c: static void traverse_trees_and_blobs(struct traversal_context *ctx,
    - 			ctx->show_object(obj, name, ctx->show_data);
    - 			continue;
    - 		}
    -+		if (ctx->revs->no_kept_objects) {
    -+			struct pack_entry e;
    -+			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
    -+						 ctx->revs->keep_pack_cache_flags,
    -+						 &e))
    -+				continue;
    -+		}
    - 		if (!path)
    - 			path = "";
    - 		if (obj->type == OBJ_TREE) {
    -
      ## revision.c ##
     @@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
      		revs->unpacked = 1;
15:  c96b1bf995 !  3:  033e4e3f67 builtin/pack-objects.c: add '--stdin-packs' option
    @@ Documentation/git-pack-objects.txt: base-name::
      	can be useful to send new tags to native Git clients.

     +--stdin-packs::
    -+	Read the basenames of packfiles from the standard input, instead
    -+	of object names or revision arguments. The resulting pack
    -+	contains all objects listed in the included packs (those not
    -+	beginning with `^`), excluding any objects listed in the
    -+	excluded packs (beginning with `^`).
    ++	Read the basenames of packfiles (e.g., `pack-1234abcd.pack`)
    ++	from the standard input, instead of object names or revision
    ++	arguments. The resulting pack contains all objects listed in the
    ++	included packs (those not beginning with `^`), excluding any
    ++	objects listed in the excluded packs (beginning with `^`).
     ++
     +Incompatible with `--revs`, or options that imply `--revs` (such as
     +`--all`), with the exception of `--unpacked`, which is compatible.
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
      	return git_default_config(k, v, cb);
      }

    ++/* Counters for trace2 output when in --stdin-packs mode. */
     +static int stdin_packs_found_nr;
     +static int stdin_packs_hints_nr;
     +
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +
     +	display_progress(progress_state, ++nr_seen);
     +
    ++	if (have_duplicate_entry(oid, 0))
    ++		return 0;
    ++
     +	ofs = nth_packed_object_offset(p, pos);
    ++	if (!want_object_in_pack(oid, 0, &p, &ofs))
    ++		return 0;
     +
     +	oi.typep = &type;
     +	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +		add_pending_oid(revs, NULL, oid, 0);
     +	}
     +
    -+	if (have_duplicate_entry(oid, 0))
    -+		return 0;
    -+
    -+	if (!want_object_in_pack(oid, 0, &p, &ofs))
    -+		return 0;
    -+
     +	stdin_packs_found_nr++;
     +
     +	create_object_entry(oid, type, 0, 0, 0, p, ofs);
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +
     +static void show_commit_pack_hint(struct commit *commit, void *_data)
     +{
    ++	/* nothing to do; commits don't have a namehash */
     +}
     +
     +static void show_object_pack_hint(struct object *object, const char *name,
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +	stdin_packs_hints_nr++;
     +}
     +
    ++static int pack_mtime_cmp(const void *_a, const void *_b)
    ++{
    ++	struct packed_git *a = ((const struct string_list_item*)_a)->util;
    ++	struct packed_git *b = ((const struct string_list_item*)_b)->util;
    ++
    ++	if (a->mtime < b->mtime)
    ++		return -1;
    ++	else if (b->mtime < a->mtime)
    ++		return 1;
    ++	else
    ++		return 0;
    ++}
    ++
     +static void read_packs_list_from_stdin(void)
     +{
     +	struct strbuf buf = STRBUF_INIT;
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +			die(_("could not find pack '%s'"), item->string);
     +		p->pack_keep_in_core = 1;
     +	}
    ++
    ++	/*
    ++	 * Order packs by ascending mtime; use QSORT directly to access the
    ++	 * string_list_item's ->util pointer, which string_list_sort() does not
    ++	 * provide.
    ++	 */
    ++	QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp);
    ++
     +	for_each_string_list_item(item, &include_packs) {
     +		struct packed_git *p = item->util;
     +		if (!p)
16:  a46b7002b4 =  4:  f9a5faf773 p5303: add missing &&-chains
17:  b5081c01b5 !  5:  181c104a03 p5303: measure time to repack with keep
    @@ Metadata
      ## Commit message ##
         p5303: measure time to repack with keep

    -    This is the same as the regular repack test, except that we mark the
    -    single base pack as "kept" and use --assume-kept-packs-closed. The
    -    theory is that this should be faster than the normal repack, because
    -    we'll have fewer objects to traverse and process.
    +    Add two new tests to measure repack performance. Both tests split the
    +    repository into synthetic "pushes", and then leave the remaining objects
    +    in a big base pack.

    -    Here are some timings on a recent clone of the kernel. In the
    -    single-pack case, there is nothing do since there are no non-excluded
    -    packs:
    +    The first new test marks an empty pack as "kept" and then passes
    +    --honor-pack-keep to avoid including objects in it. That doesn't change
    +    the resulting pack, but it does let us compare to the normal repack case
    +    to see how much overhead we add to check whether objects are kept or
    +    not.

    -      5303.5: repack (1)                          57.42(54.88+10.64)
    -      5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)
    +    The other test is of --stdin-packs, which gives us a sense of how that
    +    number scales based on the number of packs we provide as input. In each
    +    of those tests, the empty pack isn't considered, but the residual pack
    +    (objects that were left over and not included in one of the synthetic
    +    push packs) is marked as kept.

    -    and in the 50-pack case, it is much faster to use `--stdin-packs`, since
    -    we avoid having to consider any objects in the excluded pack:
    +    (Note that in the single-pack case of the --stdin-packs test, there is
    +    nothing to do since there are no non-excluded packs).

    -      5303.10: repack (50)                        71.26(88.24+4.96)
    -      5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)
    +    Here are some timings on a recent clone of the kernel:

    -    but our improvements vanish as we approach 1000 packs.
    +      5303.5: repack (1)                          57.26(54.59+10.84)
    +      5303.6: repack with kept (1)                57.33(54.80+10.51)

    -      5303.15: repack (1000)                      215.64(491.33+14.80)
    -      5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)
    +    in the 50-pack case, things start to slow down:
    +
    +      5303.11: repack (50)                        71.54(88.57+4.84)
    +      5303.12: repack with kept (50)              85.12(102.05+4.94)
    +
    +    and by the time we hit 1,000 packs, things are substantially worse, even
    +    though the resulting pack produced is the same:
    +
    +      5303.17: repack (1000)                      216.87(490.79+14.57)
    +      5303.18: repack with kept (1000)            665.63(938.87+15.76)
    +
    +    Likewise, the scaling is pretty extreme on --stdin-packs:
    +
    +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
    +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
    +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)

         That's because the code paths around handling .keep files are known to
         scale badly; they look in every single pack file to find each object.
    @@ t/perf/p5303-many-packs.sh: repack_into_n () {
     +		git pack-objects --delta-base-offset --revs staging/pack
     +	) &&
     +	test_export base_pack &&
    ++
    ++	# create an empty packfile
    ++	empty_pack=$(git pack-objects staging/pack </dev/null) &&
    ++	test_export empty_pack &&

      	# and then incrementals between each pair of commits
      	last= &&
    @@ t/perf/p5303-many-packs.sh: do
      		  --stdout </dev/null >/dev/null
      	'
     +
    ++	test_perf "repack with kept ($nr_packs)" '
    ++		git pack-objects --keep-true-parents \
    ++		  --keep-pack=pack-$empty_pack.pack \
    ++		  --honor-pack-keep --non-empty --all \
    ++		  --reflog --indexed-objects --delta-base-offset \
    ++		  --stdout </dev/null >/dev/null
    ++	'
    ++
     +	test_perf "repack with --stdin-packs ($nr_packs)" '
     +		git pack-objects \
     +		  --keep-true-parents \
18:  c3868c7df9 !  6:  67af143fd1 builtin/pack-objects.c: rewrite honor-pack-keep logic
    @@ Commit message
         packs are actually kept.

         Note that we have to re-order the logic a bit here; we can deal with the
    -    "kept" situation completely, and then just fall back to the "--local"
    -    question. It might be worth having a similar optimized function to look
    -    at only local packs.
    +    disqualifying situations first (e.g., finding the object in a non-local
    +    pack with --local), then "kept" situation(s), and then just fall back to
    +    other "--local" conditions.

         Here are the results from p5303 (measurements again taken on the
         kernel):

    -      Test                                        HEAD^                    HEAD
    +      Test                                        HEAD^                   HEAD
           -----------------------------------------------------------------------------------------------
    -      5303.5: repack (1)                          57.42(54.88+10.64)       57.44(54.71+10.78) +0.0%
    -      5303.6: repack with --stdin-packs (1)       0.01(0.01+0.00)          0.01(0.00+0.01) +0.0%
    -      5303.10: repack (50)                        71.26(88.24+4.96)        71.32(88.38+4.90) +0.1%
    -      5303.11: repack with --stdin-packs (50)     3.49(11.82+0.28)         3.43(11.81+0.22) -1.7%
    -      5303.15: repack (1000)                      215.64(491.33+14.80)     215.59(493.75+14.62) -0.0%
    -      5303.16: repack with --stdin-packs (1000)   198.79(380.51+7.97)      131.44(314.24+8.11) -33.9%
    -
    -    So our --stdin-packs case with many packs is now finally faster than the
    -    non-keep case (because it gets the speed benefit of looking at fewer
    -    objects, but not as big a penalty for looking at many packs).
    +      5303.5: repack (1)                          57.26(54.59+10.84)      57.34(54.66+10.88) +0.1%
    +      5303.6: repack with kept (1)                57.33(54.80+10.51)      57.38(54.83+10.49) +0.1%
    +      5303.11: repack (50)                        71.54(88.57+4.84)       71.70(88.99+4.74) +0.2%
    +      5303.12: repack with kept (50)              85.12(102.05+4.94)      72.58(89.61+4.78) -14.7%
    +      5303.17: repack (1000)                      216.87(490.79+14.57)    217.19(491.72+14.25) +0.1%
    +      5303.18: repack with kept (1000)            665.63(938.87+15.76)    246.12(520.07+14.93) -63.0%
    +
    +    and the --stdin-packs timings:
    +
    +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)         0.00(0.00+0.00) -100.0%
    +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)        3.43(11.75+0.24) -2.8%
    +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)     130.50(307.15+7.66) -33.4%
    +
    +    So our repack with an empty .keep pack is roughly as fast as one without
    +    a .keep pack up to 50 packs. But the --stdin-packs case scales a little
    +    better, too.
    +
    +    Notably, it is faster than a repack of the same size and a kept pack. It
    +    looks at fewer objects, of course, but the penalty for looking at many
    +    packs isn't as costly.

         Signed-off-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    @@ builtin/pack-objects.c: static int have_duplicate_entry(const struct object_id *
      	if (exclude)
      		return 1;
     @@ builtin/pack-objects.c: static int want_found_object(int exclude, struct packed_git *p)
    + 	 * make sure no copy of this object appears in _any_ pack that makes us
    + 	 * to omit the object, so we need to check all the packs.
    + 	 *
    +-	 * We can however first check whether these options can possible matter;
    ++	 * We can however first check whether these options can possibly matter;
    + 	 * if they do not matter we know we want the object in generated pack.
      	 * Otherwise, we signal "-1" at the end to tell the caller that we do
      	 * not know either way, and it needs to check more packs.
      	 */
     -	if (!ignore_packed_keep_on_disk &&
     -	    !ignore_packed_keep_in_core &&
     -	    (!local || !have_non_local_packs))
    +-		return 1;
    +
    ++	/*
    ++	 * Objects in packs borrowed from elsewhere are discarded regardless of
    ++	 * if they appear in other packs that weren't borrowed.
    ++	 */
    + 	if (local && !p->pack_local)
    + 		return 0;
    +-	if (p->pack_local &&
    +-	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
    +-	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
    +-		return 0;
     +
     +	/*
    -+	 * Handle .keep first, as we have a fast(er) path there.
    ++	 * Then handle .keep first, as we have a fast(er) path there.
     +	 */
     +	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
     +		/*
    @@ builtin/pack-objects.c: static int want_found_object(int exclude, struct packed_
     +	 * keep-packs, or the object is not in one. Keep checking other
     +	 * conditions...
     +	 */
    -+
     +	if (!local || !have_non_local_packs)
    - 		return 1;
    --
    - 	if (local && !p->pack_local)
    - 		return 0;
    --	if (p->pack_local &&
    --	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
    --	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
    --		return 0;
    ++		return 1;

      	/* we don't know yet; keep looking for more packs */
      	return -1;
19:  f1c07324f6 !  7:  e9e04b95e7 packfile: add kept-pack cache for find_kept_pack_entry()
    @@ Commit message
           - we don't have to worry about any packed_git being removed; we always
             keep the old structs around, even after reprepare_packed_git()

    +    We do defensively invalidate the cache in case the set of kept packs
    +    being asked for changes (e.g., only in-core kept packs were cached, but
    +    suddenly the caller also wants on-disk kept packs, too). In theory we
    +    could build all three caches and switch between them, but it's not
    +    necessary, since this patch (and series) never changes the set of kept
    +    packs that it wants to inspect from the cache.
    +
    +    So that "optimization" is more about being defensive in the face of
    +    future changes than it is about asking for multiple kinds of kept packs
    +    in this patch.
    +
         Here are p5303 results (as always, measured against the kernel):

           Test                                        HEAD^                   HEAD
    -      ----------------------------------------------------------------------------------------------
    -      5303.5: repack (1)                          57.44(54.71+10.78)      57.06(54.29+10.96) -0.7%
    -      5303.6: repack with --stdin-packs (1)       0.01(0.00+0.01)         0.01(0.01+0.00) +0.0%
    -      5303.10: repack (50)                        71.32(88.38+4.90)       71.47(88.60+5.04) +0.2%
    -      5303.11: repack with --stdin-packs (50)     3.43(11.81+0.22)        3.49(12.21+0.26) +1.7%
    -      5303.15: repack (1000)                      215.59(493.75+14.62)    217.41(495.36+14.85) +0.8%
    -      5303.16: repack with --stdin-packs (1000)   131.44(314.24+8.11)     126.75(309.88+8.09) -3.6%
    +      -----------------------------------------------------------------------------------------------
    +      5303.5: repack (1)                          57.34(54.66+10.88)      56.98(54.36+10.98) -0.6%
    +      5303.6: repack with kept (1)                57.38(54.83+10.49)      57.17(54.97+10.26) -0.4%
    +      5303.11: repack (50)                        71.70(88.99+4.74)       71.62(88.48+5.08) -0.1%
    +      5303.12: repack with kept (50)              72.58(89.61+4.78)       71.56(88.80+4.59) -1.4%
    +      5303.17: repack (1000)                      217.19(491.72+14.25)    217.31(490.82+14.53) +0.1%
    +      5303.18: repack with kept (1000)            246.12(520.07+14.93)    217.08(490.37+15.10) -11.8%
    +
    +    and the --stdin-packs case, which scales a little bit better (although
    +    not by that much even at 1,000 packs):
    +
    +      5303.7: repack with --stdin-packs (1)       0.00(0.00+0.00)         0.00(0.00+0.00) =
    +      5303.13: repack with --stdin-packs (50)     3.43(11.75+0.24)        3.43(11.69+0.30) +0.0%
    +      5303.19: repack with --stdin-packs (1000)   130.50(307.15+7.66)     125.13(301.36+8.04) -4.1%

         Signed-off-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

    - ## builtin/pack-objects.c ##
    -@@ builtin/pack-objects.c: static int want_found_object(const struct object_id *oid, int exclude,
    - 		 */
    - 		unsigned flags = 0;
    - 		if (ignore_packed_keep_on_disk)
    --			flags |= ON_DISK_KEEP_PACKS;
    -+			flags |= CACHE_ON_DISK_KEEP_PACKS;
    - 		if (ignore_packed_keep_in_core)
    --			flags |= IN_CORE_KEEP_PACKS;
    -+			flags |= CACHE_IN_CORE_KEEP_PACKS;
    -
    - 		if (ignore_packed_keep_on_disk && p->pack_keep)
    - 			return 0;
    -@@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
    - 	 * an optimization during delta selection.
    - 	 */
    - 	revs.no_kept_objects = 1;
    --	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
    -+	revs.keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
    - 	revs.blob_objects = 1;
    - 	revs.tree_objects = 1;
    - 	revs.tag_objects = 1;
    -
      ## object-store.h ##
    -@@ object-store.h: static inline int pack_map_entry_cmp(const void *unused_cmp_data,
    - 	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
    - }
    -
    -+#define CACHE_ON_DISK_KEEP_PACKS 1
    -+#define CACHE_IN_CORE_KEEP_PACKS 2
    -+
    -+struct kept_pack_cache {
    -+	struct packed_git **packs;
    -+	unsigned flags;
    -+};
    -+
    - struct raw_object_store {
    - 	/*
    - 	 * Set of all object directories; the main directory is first (and
     @@ object-store.h: struct raw_object_store {
      	/* A most-recently-used ordered version of the packed_git list. */
      	struct list_head packed_git_mru;

    -+	struct kept_pack_cache *kept_pack_cache;
    ++	struct {
    ++		struct packed_git **packs;
    ++		unsigned flags;
    ++	} kept_pack_cache;
     +
      	/*
      	 * A map of packfiles to packed_git structs for tracking which
    @@ packfile.c: static int find_one_pack_entry(struct repository *r,
     +					     unsigned flags)
      {
     -	return find_one_pack_entry(r, oid, e, 0);
    -+	if (!r->objects->kept_pack_cache)
    ++	if (!r->objects->kept_pack_cache.packs)
     +		return;
    -+	if (r->objects->kept_pack_cache->flags == flags)
    ++	if (r->objects->kept_pack_cache.flags == flags)
     +		return;
    -+	free(r->objects->kept_pack_cache->packs);
    -+	FREE_AND_NULL(r->objects->kept_pack_cache);
    ++	FREE_AND_NULL(r->objects->kept_pack_cache.packs);
    ++	r->objects->kept_pack_cache.flags = 0;
     +}
     +
     +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
     +{
     +	maybe_invalidate_kept_pack_cache(r, flags);
     +
    -+	if (!r->objects->kept_pack_cache) {
    ++	if (!r->objects->kept_pack_cache.packs) {
     +		struct packed_git **packs = NULL;
     +		size_t nr = 0, alloc = 0;
     +		struct packed_git *p;
    @@ packfile.c: static int find_one_pack_entry(struct repository *r,
     +		 * the non-kept version.
     +		 */
     +		for (p = get_all_packs(r); p; p = p->next) {
    -+			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
    -+			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
    ++			if ((p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) ||
    ++			    (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS))) {
     +				ALLOC_GROW(packs, nr + 1, alloc);
     +				packs[nr++] = p;
     +			}
    @@ packfile.c: static int find_one_pack_entry(struct repository *r,
     +		ALLOC_GROW(packs, nr + 1, alloc);
     +		packs[nr] = NULL;
     +
    -+		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
    -+		r->objects->kept_pack_cache->packs = packs;
    -+		r->objects->kept_pack_cache->flags = flags;
    ++		r->objects->kept_pack_cache.packs = packs;
    ++		r->objects->kept_pack_cache.flags = flags;
     +	}
     +
    -+	return r->objects->kept_pack_cache->packs;
    ++	return r->objects->kept_pack_cache.packs;
      }

      int find_kept_pack_entry(struct repository *r,
    @@ packfile.c: int find_kept_pack_entry(struct repository *r,
      }

      int has_object_pack(const struct object_id *oid)
    -@@ packfile.c: int has_object_pack(const struct object_id *oid)
    - 	return find_pack_entry(the_repository, oid, &e);
    - }
    -
    --int has_object_kept_pack(const struct object_id *oid, unsigned flags)
    -+int has_object_kept_pack(const struct object_id *oid,
    -+			 unsigned flags)
    - {
    - 	struct pack_entry e;
    - 	return find_kept_pack_entry(the_repository, oid, flags, &e);
    -
    - ## packfile.h ##
    -@@ packfile.h: int packed_object_info(struct repository *r,
    - void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
    - const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
    -
    --#define ON_DISK_KEEP_PACKS 1
    --#define IN_CORE_KEEP_PACKS 2
    --#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
    --
    - /*
    -  * Iff a pack file in the given repository contains the object named by sha1,
    -  * return true and store its location to e.
    -
    - ## revision.c ##
    -@@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
    - 		die(_("--unpacked=<packfile> no longer supported"));
    - 	} else if (!strcmp(arg, "--no-kept-objects")) {
    - 		revs->no_kept_objects = 1;
    --		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
    --		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
    -+		revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
    -+		revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
    - 	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
    - 		revs->no_kept_objects = 1;
    - 		if (!strcmp(optarg, "in-core"))
    --			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
    -+			revs->keep_pack_cache_flags |= CACHE_IN_CORE_KEEP_PACKS;
    - 		if (!strcmp(optarg, "on-disk"))
    --			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
    -+			revs->keep_pack_cache_flags |= CACHE_ON_DISK_KEEP_PACKS;
    - 	} else if (!strcmp(arg, "-r")) {
    - 		revs->diff = 1;
    - 		revs->diffopt.flags.recursive = 1;
20:  d5561585c2 !  8:  bd492ec142 builtin/repack.c: add '--geometric' option
    @@ Documentation/git-repack.txt: depth is 4095.
     +	contains at least `<factor>` times the number of objects as the
     +	next-largest pack.
     ++
    -+`git repack` ensures this by determining a "cut" of packfiles that need to be
    -+repacked into one in order to ensure a geometric progression. It picks the
    -+smallest set of packfiles such that as many of the larger packfiles (by count of
    -+objects contained in that pack) may be left intact.
    ++`git repack` ensures this by determining a "cut" of packfiles that need
    ++to be repacked into one in order to ensure a geometric progression. It
    ++picks the smallest set of packfiles such that as many of the larger
    ++packfiles (by count of objects contained in that pack) may be left
    ++intact.
    +++
    ++Unlike other repack modes, the set of objects to pack is determined
    ++uniquely by the set of packs being "rolled-up"; in other words, the
    ++packs determined to need to be combined in order to restore a geometric
    ++progression.
    +++
    ++Loose objects are implicitly included in this "roll-up", without respect
    ++to their reachability. This is subject to change in the future. This
    ++option (implying a drastically different repack mode) is not guaranteed
    ++to work with all other combinations of option to `git repack`.
     +
      Configuration
      -------------
--
2.30.0.667.g81c0cbc6fd


* [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()'
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 2/8] revision: learn '--no-kept-objects' Taylor Blau
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s).

In particular, a new 'git repack' mode which ensures the resulting
packs form a geometric progression by object count will mark packs that
it does not want to repack as "kept in-core", and it will want to halt a
reachability traversal as soon as it visits an object in any of the kept
packs. But, it does not want to halt the traversal at non-kept, or
.keep packs.

The obvious alternative is 'find_pack_entry()', but this doesn't quite
suffice since it only returns the first pack it finds, which may or may
not be kept (and the MRU cache makes it unpredictable which one you'll
get if the object exists in more than one pack).

Short of that, you could walk over all packs looking for the object in
each one, but it scales with the number of packs, which may be
prohibitive.

Introduce 'find_kept_pack_entry()', a function which is like
'find_pack_entry()', but only fills in objects in the kept packs.

Handle packs which have .keep files, as well as in-core kept packs
separately, since certain callers will want to distinguish one from the
other. (Though on-disk and in-core kept packs share the adjective
"kept", it is best to think of the two sets as independent.)

There is a gotcha when looking up objects that are duplicated in kept
and non-kept packs, particularly when the MIDX stores the non-kept
version and the caller asked for kept objects only. This could be
resolved by teaching the MIDX to resolve duplicates by always favoring
the kept pack (if one exists), but this breaks an assumption in existing
MIDXs, and so it would require a format change.

The benefit to changing the MIDX in this way is marginal, so we instead
have a more thorough check here which is explained with a comment.
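
As an illustration only, the kept-only lookup can be modeled on a toy
pack list. This is a minimal sketch, not the code in this patch:
'struct toy_pack', 'toy_pack_has()', 'toy_pack_is_kept()' and
'toy_find_pack()' are hypothetical names, and objects are plain
integers rather than object ids. Only the two flag values are taken
from the patch.

```c
#include <stddef.h>

#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2

/* Hypothetical stand-in for 'struct packed_git'; only the fields used here. */
struct toy_pack {
	const char *name;
	unsigned pack_keep:1;         /* pack has a .keep file on disk */
	unsigned pack_keep_in_core:1; /* pack was marked kept by this process */
	const int *objects;           /* toy object ids stored in this pack */
	size_t nr;
};

/* Linear scan of one pack for a toy object id. */
static int toy_pack_has(const struct toy_pack *p, int oid)
{
	size_t i;
	for (i = 0; i < p->nr; i++)
		if (p->objects[i] == oid)
			return 1;
	return 0;
}

/* Does 'p' count as kept under the kinds requested in 'flags'? */
static int toy_pack_is_kept(const struct toy_pack *p, unsigned flags)
{
	return ((flags & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
	       ((flags & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core);
}

/*
 * Return the first pack containing 'oid', or NULL. With flags == 0 any
 * pack qualifies; otherwise packs that are not kept in the requested
 * sense are skipped, even if they contain the object.
 */
static const struct toy_pack *toy_find_pack(const struct toy_pack *packs,
					    size_t nr, int oid, unsigned flags)
{
	size_t i;
	for (i = 0; i < nr; i++) {
		if (flags && !toy_pack_is_kept(&packs[i], flags))
			continue;
		if (toy_pack_has(&packs[i], oid))
			return &packs[i];
	}
	return NULL;
}
```

In this model, an object duplicated in a non-kept pack and a kept pack
resolves to the kept copy whenever a kept-only lookup is requested,
regardless of the order of the pack list.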

Callers will be added in subsequent patches.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.h |  5 +++++
 2 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 1fec12ac5f..7f84f221ce 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2042,7 +2042,10 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static int find_one_pack_entry(struct repository *r,
+			       const struct object_id *oid,
+			       struct pack_entry *e,
+			       int kept_only)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2052,26 +2055,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (fill_midx_entry(r, oid, e, m))
+		if (!fill_midx_entry(r, oid, e, m))
+			continue;
+
+		if (!kept_only)
+			return 1;
+
+		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
+		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
-			list_move(&p->mru, &r->objects->packed_git_mru);
-			return 1;
+		if (p->multi_pack_index && !kept_only) {
+			/*
+			 * If this pack is covered by the MIDX, we'd have found
+			 * the object already in the loop above if it was here,
+			 * so don't bother looking.
+			 *
+			 * The exception is if we are looking only at kept
+			 * packs. An object can be present in two packs covered
+			 * by the MIDX, one kept and one not-kept. And as the
+			 * MIDX points to only one copy of each object, it might
+			 * have returned only the non-kept version above. We
+			 * have to check again to be thorough.
+			 */
+			continue;
+		}
+		if (!kept_only ||
+		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
+		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
+			if (fill_pack_entry(oid, e, p)) {
+				list_move(&p->mru, &r->objects->packed_git_mru);
+				return 1;
+			}
 		}
 	}
 	return 0;
 }
 
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+{
+	return find_one_pack_entry(r, oid, e, 0);
+}
+
+int find_kept_pack_entry(struct repository *r,
+			 const struct object_id *oid,
+			 unsigned flags,
+			 struct pack_entry *e)
+{
+	/*
+	 * Load all packs, including midx packs, since our "kept" strategy
+	 * relies on that. We're relying on the side effect of it setting up
+	 * r->objects->packed_git, which is a little ugly.
+	 */
+	get_all_packs(r);
+	return find_one_pack_entry(r, oid, e, flags);
+}
+
 int has_object_pack(const struct object_id *oid)
 {
 	struct pack_entry e;
 	return find_pack_entry(the_repository, oid, &e);
 }
 
+int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+{
+	struct pack_entry e;
+	return find_kept_pack_entry(the_repository, oid, flags, &e);
+}
+
 int has_pack_index(const unsigned char *sha1)
 {
 	struct stat st;
diff --git a/packfile.h b/packfile.h
index 4cfec9e8d3..3ae117a8ae 100644
--- a/packfile.h
+++ b/packfile.h
@@ -162,13 +162,18 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
+#define ON_DISK_KEEP_PACKS 1
+#define IN_CORE_KEEP_PACKS 2
+
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
  */
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e);
+int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e);
 
 int has_object_pack(const struct object_id *oid);
+int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
 int has_pack_index(const unsigned char *sha1);
 
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 2/8] revision: learn '--no-kept-objects'
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwards.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle: handling the distinction
between in-core and on-disk kept packs. That is: what kinds of kept pack
should stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).
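
The mapping from the option's spelling to flag bits can be sketched as
follows. This is an illustrative standalone helper, not the code added
to handle_revision_opt(): 'parse_no_kept_objects()' is a hypothetical
name, and unlike the patch (which accepts any value after the '=' and
simply sets no flags for an unknown one) this sketch rejects
unrecognized values.

```c
#include <string.h>

#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2

/*
 * Hypothetical helper: translate '--no-kept-objects[=<on-disk|in-core>]'
 * into flag bits OR-ed into '*flags'. Returns 1 if 'arg' was recognized
 * and 0 otherwise (in which case '*flags' is left untouched).
 */
static int parse_no_kept_objects(const char *arg, unsigned *flags)
{
	static const char prefix[] = "--no-kept-objects";
	size_t len = sizeof(prefix) - 1;

	if (strncmp(arg, prefix, len))
		return 0;
	arg += len;

	if (!*arg) {
		/* The bare option stops at both kinds of kept pack. */
		*flags |= ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS;
		return 1;
	}
	if (!strcmp(arg, "=in-core")) {
		*flags |= IN_CORE_KEEP_PACKS;
		return 1;
	}
	if (!strcmp(arg, "=on-disk")) {
		*flags |= ON_DISK_KEEP_PACKS;
		return 1;
	}
	return 0;
}
```

A geometric repack would correspond to the "=in-core" spelling here,
since it marks the packs it wants to leave alone as kept in-core.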

Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.

Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it to rev-list, but
being able to spell this option out on the command-line makes some
special cases easier to test without promising to keep it behaving
consistently forever. Those tricky cases are exercised in t6114.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 revision.c            | 15 ++++++++++
 revision.h            |  4 +++
 t/t6114-keep-packs.sh | 69 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

diff --git a/revision.c b/revision.c
index 3efd994160..f9311f3a73 100644
--- a/revision.c
+++ b/revision.c
@@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3792,6 +3802,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index e6be3c845e..a20a530d52 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@ struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -317,6 +318,9 @@ struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 2/8] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 4/8] p5303: add missing &&-chains Taylor Blau
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

In an upcoming commit, 'git repack' will want to create a pack comprised
of all of the objects in some packs (the included packs) excluding any
objects in some other packs (the excluded packs).

This caller could iterate those packs themselves and feed the objects it
finds to 'git pack-objects' directly over stdin, but this approach has a
few downsides:

  - It requires every caller that wants to drive 'git pack-objects' in
    this way to implement pack iteration themselves. This forces the
    caller to think about details like what order objects are fed to
    pack-objects, which callers would likely rather not do.

  - If the set of objects in included packs is large, it requires
    sending a lot of data over a pipe, which is inefficient.

  - The caller is forced to keep track of the excluded objects, too, and
    make sure that it doesn't send any objects that appear in both
    included and excluded packs.

But the biggest downside is the lack of a reachability traversal.
Because the caller passes in a list of objects directly, those objects
don't get a namehash assigned to them, which can have a negative impact
on the delta selection process, causing 'git pack-objects' to fail to
find good deltas even when they exist.

The caller could formulate a reachability traversal themselves, but the
only way to drive 'git pack-objects' in this way is to do a full
traversal, and then remove objects in the excluded packs after the
traversal is complete. This can be detrimental to callers who care
about performance, especially in repositories with many objects.

Introduce 'git pack-objects --stdin-packs' which remedies these four
concerns.

'git pack-objects --stdin-packs' expects a list of pack names on stdin,
where 'pack-xyz.pack' denotes that pack as included, and
'^pack-xyz.pack' denotes it as excluded. The resulting pack includes all
objects that are present in at least one included pack, and aren't
present in any excluded pack.
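
The input convention and the resulting membership rule can be sketched
in a few lines. This is a simplified illustration under stated
assumptions, not the implementation in builtin/pack-objects.c:
'enum pack_role', 'classify_pack_line()' and 'want_object()' are
hypothetical names, and the line is assumed to have its trailing
newline already stripped.

```c
/* Role of a pack named on stdin, as described above. */
enum pack_role { PACK_INCLUDED, PACK_EXCLUDED };

/*
 * Hypothetical helper: classify one input line. A leading '^' marks
 * the pack as excluded; anything else names an included pack. Returns
 * a pointer to the pack name with any '^' prefix skipped.
 */
static const char *classify_pack_line(const char *line, enum pack_role *role)
{
	if (*line == '^') {
		*role = PACK_EXCLUDED;
		return line + 1;
	}
	*role = PACK_INCLUDED;
	return line;
}

/*
 * Membership rule for the resulting pack: an object is wanted if it is
 * present in at least one included pack and in no excluded pack.
 */
static int want_object(int in_included_pack, int in_excluded_pack)
{
	return in_included_pack && !in_excluded_pack;
}
```

For example, feeding 'pack-xyz.pack' and '^pack-abc.pack' would pack
every object of pack-xyz that does not also appear in pack-abc.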

To address the delta selection problem, 'git pack-objects --stdin-packs'
works as follows. First, it assembles a list of objects that it is going
to pack, as above. Then, a reachability traversal is started, whose tips
are any commits mentioned in included packs. Upon visiting an object, we
find its corresponding object_entry in the to_pack list, and set its
namehash parameter appropriately.

To avoid the traversal visiting more objects than it needs to, the
traversal is halted upon encountering an object which can be found in an
excluded pack (by marking the excluded packs as kept in-core, and
passing --no-kept-objects=in-core to the revision machinery).

This can cause the traversal to halt early, for example if an object in
an included pack is an ancestor of ones in excluded packs. But stopping
early is OK, since filling in the namehash fields of objects in the
to_pack list is only additive (i.e., having it helps the delta selection
process, but leaving it blank doesn't impact the correctness of the
resulting pack).

Even still, it is unlikely that this hurts us much in practice, since
the 'git repack --geometric' caller (which is introduced in a later
commit) marks small packs as included, and large ones as excluded.
During ordinary use, the small packs usually represent pushes after a
large repack, and so are unlikely to be ancestors of objects that
already exist in the repository.

(I found it convenient while developing this patch to have 'git
pack-objects' report the number of objects which were visited and got
their namehash fields filled in during traversal. This is also included
in the below patch via trace2 data lines).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  10 ++
 builtin/pack-objects.c             | 198 ++++++++++++++++++++++++++++-
 t/t5300-pack-object.sh             |  97 ++++++++++++++
 3 files changed, 303 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..df533c3b19 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -85,6 +85,16 @@ base-name::
 	reference was included in the resulting packfile.  This
 	can be useful to send new tags to native Git clients.
 
+--stdin-packs::
+	Read the basenames of packfiles (e.g., `pack-1234abcd.pack`)
+	from the standard input, instead of object names or revision
+	arguments. The resulting pack contains all objects listed in the
+	included packs (those not beginning with `^`), excluding any
+	objects listed in the excluded packs (beginning with `^`).
++
+Incompatible with `--revs`, or options that imply `--revs` (such as
+`--all`), with the exception of `--unpacked`, which is compatible.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6d62aaf59a..e766a4a43b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2986,6 +2986,186 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 	return git_default_config(k, v, cb);
 }
 
+/* Counters for trace2 output when in --stdin-packs mode. */
+static int stdin_packs_found_nr;
+static int stdin_packs_hints_nr;
+
+static int add_object_entry_from_pack(const struct object_id *oid,
+				      struct packed_git *p,
+				      uint32_t pos,
+				      void *_data)
+{
+	struct rev_info *revs = _data;
+	struct object_info oi = OBJECT_INFO_INIT;
+	off_t ofs;
+	enum object_type type;
+
+	display_progress(progress_state, ++nr_seen);
+
+	if (have_duplicate_entry(oid, 0))
+		return 0;
+
+	ofs = nth_packed_object_offset(p, pos);
+	if (!want_object_in_pack(oid, 0, &p, &ofs))
+		return 0;
+
+	oi.typep = &type;
+	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
+		die(_("could not get type of object %s in pack %s"),
+		    oid_to_hex(oid), p->pack_name);
+	else if (type == OBJ_COMMIT) {
+		/*
+		 * commits in included packs are used as starting points for the
+		 * subsequent revision walk
+		 */
+		add_pending_oid(revs, NULL, oid, 0);
+	}
+
+	stdin_packs_found_nr++;
+
+	create_object_entry(oid, type, 0, 0, 0, p, ofs);
+
+	return 0;
+}
+
+static void show_commit_pack_hint(struct commit *commit, void *_data)
+{
+	/* nothing to do; commits don't have a namehash */
+}
+
+static void show_object_pack_hint(struct object *object, const char *name,
+				  void *_data)
+{
+	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
+	if (!oe)
+		return;
+
+	/*
+	 * Our 'to_pack' list was constructed by iterating all objects packed in
+	 * included packs, and so doesn't have a non-zero hash field that you
+	 * would typically pick up during a reachability traversal.
+	 *
+	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
+	 * here using the name from the traversal in order to perhaps improve
+	 * the delta selection process.
+	 */
+	oe->hash = pack_name_hash(name);
+	oe->no_try_delta = name && no_try_delta(name);
+
+	stdin_packs_hints_nr++;
+}
+
+static int pack_mtime_cmp(const void *_a, const void *_b)
+{
+	struct packed_git *a = ((const struct string_list_item*)_a)->util;
+	struct packed_git *b = ((const struct string_list_item*)_b)->util;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	else if (b->mtime < a->mtime)
+		return 1;
+	else
+		return 0;
+}
+
+static void read_packs_list_from_stdin(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list include_packs = STRING_LIST_INIT_DUP;
+	struct string_list exclude_packs = STRING_LIST_INIT_DUP;
+	struct string_list_item *item = NULL;
+
+	struct packed_git *p;
+	struct rev_info revs;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	/*
+	 * Use a revision walk to fill in the namehash of objects in the include
+	 * packs. To save time, we'll avoid traversing through objects that are
+	 * in excluded packs.
+	 *
+	 * That may cause us to avoid populating all of the namehash fields of
+	 * all included objects, but our goal is best-effort, since this is only
+	 * an optimization during delta selection.
+	 */
+	revs.no_kept_objects = 1;
+	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+	revs.blob_objects = 1;
+	revs.tree_objects = 1;
+	revs.tag_objects = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '^')
+			string_list_append(&exclude_packs, buf.buf + 1);
+		else
+			string_list_append(&include_packs, buf.buf);
+
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&include_packs);
+	string_list_sort(&exclude_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+
+		item = string_list_lookup(&include_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&exclude_packs, pack_name);
+
+		if (item)
+			item->util = p;
+	}
+
+	/*
+	 * First handle all of the excluded packs, marking them as kept in-core
+	 * so that later calls to add_object_entry() discard any objects that
+	 * are also found in excluded packs.
+	 */
+	for_each_string_list_item(item, &exclude_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = 1;
+	}
+
+	/*
+	 * Order packs by ascending mtime; use QSORT directly to access the
+	 * string_list_item's ->util pointer, which string_list_sort() does not
+	 * provide.
+	 */
+	QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp);
+
+	for_each_string_list_item(item, &include_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		for_each_object_in_pack(p,
+					add_object_entry_from_pack,
+					&revs,
+					FOR_EACH_OBJECT_PACK_ORDER);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	traverse_commit_list(&revs,
+			     show_commit_pack_hint,
+			     show_object_pack_hint,
+			     NULL);
+
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
+			   stdin_packs_found_nr);
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_hints",
+			   stdin_packs_hints_nr);
+
+	strbuf_release(&buf);
+	string_list_clear(&include_packs, 0);
+	string_list_clear(&exclude_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3489,6 +3669,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
+	int stdin_packs = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -3539,6 +3720,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_SET_INT_F(0, "indexed-objects", &rev_list_index,
 			      N_("include objects referred to by the index"),
 			      1, PARSE_OPT_NONEG),
+		OPT_BOOL(0, "stdin-packs", &stdin_packs,
+			 N_("read packs from stdin")),
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
@@ -3645,7 +3828,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--indexed-objects");
 	}
-	if (rev_list_unpacked) {
+	if (rev_list_unpacked && !stdin_packs) {
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
@@ -3690,8 +3873,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (filter_options.choice) {
 		if (!pack_to_stdout)
 			die(_("cannot use --filter without --stdout"));
+		if (stdin_packs)
+			die(_("cannot use --filter with --stdin-packs"));
 	}
 
+	if (stdin_packs && use_internal_rev_list)
+		die(_("cannot use internal rev list with --stdin-packs"));
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -3750,7 +3938,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (progress)
 		progress_state = start_progress(_("Enumerating objects"), 0);
-	if (!use_internal_rev_list)
+	if (stdin_packs) {
+		/* avoids adding objects in excluded packs */
+		ignore_packed_keep_in_core = 1;
+		read_packs_list_from_stdin();
+		if (rev_list_unpacked)
+			add_unreachable_loose_objects();
+	} else if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
 		get_object_list(rp.nr, rp.v);
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 392201cabd..7138a54595 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -532,4 +532,101 @@ test_expect_success 'prefetch objects' '
 	test_line_count = 1 donelines
 '
 
+test_expect_success 'setup for --stdin-packs tests' '
+	git init stdin-packs &&
+	(
+		cd stdin-packs &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+
+		for id in A B C
+		do
+			git pack-objects .git/objects/pack/pack-$id \
+				--incremental --revs <<-EOF
+			refs/tags/$id
+			EOF
+		done &&
+
+		ls -la .git/objects/pack
+	)
+'
+
+test_expect_success '--stdin-packs with excluded packs' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		git pack-objects test --stdin-packs <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx)
+		) >expect.raw &&
+		git show-index <$(ls test-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --filter' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --stdout \
+			--filter=blob:none </dev/null 2>err &&
+		test_i18ngrep "cannot use --filter with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --revs' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --revs out \
+			</dev/null 2>err &&
+		test_i18ngrep "cannot use internal rev list with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs with loose objects' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		test_commit D && # loose
+
+		git pack-objects test2 --stdin-packs --unpacked <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx) &&
+			git rev-list --objects --no-object-names \
+				refs/tags/C..refs/tags/D
+
+		) >expect.raw &&
+		ls -la . &&
+		git show-index <$(ls test2-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 4/8] p5303: add missing &&-chains
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (2 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 5/8] p5303: measure time to repack with keep Taylor Blau
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

These are in a helper function, so the usual chain-lint doesn't notice
them. This function is still not perfect, as it has some git invocations
on the left-hand-side of the pipe, but its primary purpose is timing,
not finding bugs or correctness issues.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index ce0c42cc9f..d90d714923 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -28,11 +28,11 @@ repack_into_n () {
 			push @commits, $_ if $. % 5 == 1;
 		}
 		print reverse @commits;
-	' "$1" >pushes
+	' "$1" >pushes &&
 
 	# create base packfile
 	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack
+	git pack-objects --delta-base-offset --revs staging/pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 5/8] p5303: measure time to repack with keep
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (3 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 4/8] p5303: add missing &&-chains Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

Add two new tests to measure repack performance. Both tests split the
repository into synthetic "pushes", and then leave the remaining objects
in a big base pack.

The first new test marks an empty pack as "kept" and then passes
--honor-pack-keep to avoid including objects in it. That doesn't change
the resulting pack, but it does let us compare to the normal repack case
to see how much overhead we add to check whether objects are kept or
not.

The other test is of --stdin-packs, which gives us a sense of how that
number scales based on the number of packs we provide as input. In each
of those tests, the empty pack isn't considered, but the residual pack
(objects that were left over and not included in one of the synthetic
push packs) is marked as kept.

(Note that in the single-pack case of the --stdin-packs test, there is
nothing to do since there are no non-excluded packs).

Here are some timings on a recent clone of the kernel:

  5303.5: repack (1)                          57.26(54.59+10.84)
  5303.6: repack with kept (1)                57.33(54.80+10.51)

in the 50-pack case, things start to slow down:

  5303.11: repack (50)                        71.54(88.57+4.84)
  5303.12: repack with kept (50)              85.12(102.05+4.94)

and by the time we hit 1,000 packs, things are substantially worse, even
though the resulting pack produced is the same:

  5303.17: repack (1000)                      216.87(490.79+14.57)
  5303.18: repack with kept (1000)            665.63(938.87+15.76)

Likewise, the scaling is pretty extreme on --stdin-packs:

  5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
  5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
  5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)

That's because the code paths around handling .keep files are known to
scale badly; they look in every single pack file to find each object.
Our solution to that was to notice that most repos don't have keep
files, and to make that case a fast path. But as soon as you add a
single .keep, that part of pack-objects slows down again (even if we
have fewer objects total to look at).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index d90d714923..35c0cbdf49 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -31,8 +31,15 @@ repack_into_n () {
 	' "$1" >pushes &&
 
 	# create base packfile
-	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack &&
+	base_pack=$(
+		head -n 1 pushes |
+		git pack-objects --delta-base-offset --revs staging/pack
+	) &&
+	test_export base_pack &&
+
+	# create an empty packfile
+	empty_pack=$(git pack-objects staging/pack </dev/null) &&
+	test_export empty_pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
@@ -49,6 +56,12 @@ repack_into_n () {
 		last=$rev
 	done <pushes &&
 
+	(
+		find staging -type f -name 'pack-*.pack' |
+			xargs -n 1 basename | grep -v "$base_pack" &&
+		printf "^pack-%s.pack\n" $base_pack
+	) >stdin.packs
+
 	# and install the whole thing
 	rm -f .git/objects/pack/* &&
 	mv staging/* .git/objects/pack/
@@ -91,6 +104,23 @@ do
 		  --reflog --indexed-objects --delta-base-offset \
 		  --stdout </dev/null >/dev/null
 	'
+
+	test_perf "repack with kept ($nr_packs)" '
+		git pack-objects --keep-true-parents \
+		  --keep-pack=pack-$empty_pack.pack \
+		  --honor-pack-keep --non-empty --all \
+		  --reflog --indexed-objects --delta-base-offset \
+		  --stdout </dev/null >/dev/null
+	'
+
+	test_perf "repack with --stdin-packs ($nr_packs)" '
+		git pack-objects \
+		  --keep-true-parents \
+		  --stdin-packs \
+		  --non-empty \
+		  --delta-base-offset \
+		  --stdout <stdin.packs >/dev/null
+	'
 done
 
 # Measure pack loading with 10,000 packs.
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (4 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 5/8] p5303: measure time to repack with keep Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

Now that we have find_kept_pack_entry(), we don't have to manually keep
hunting through every pack to find a possible "kept" duplicate of the
object. This should be faster, assuming only a portion of your total
packs are actually kept.

Note that we have to re-order the logic a bit here; we can deal with the
disqualifying situations first (e.g., finding the object in a non-local
pack with --local), then "kept" situation(s), and then just fall back to
other "--local" conditions.
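
To make the new ordering concrete, here is a simplified, standalone
model of the decision (a sketch under assumptions, not the function in
this patch: 'struct want_query' and 'want_found_object_sketch()' are
made-up names, the two kinds of keep flag are collapsed into one, and
only the conditions visible in the hunk below are modelled):

```c
/* Hypothetical inputs standing in for pack-objects' globals and pack state. */
struct want_query {
	int exclude;              /* object is being excluded, not packed */
	int local;                /* --local was given */
	int ignore_kept;          /* objects in kept packs should be skipped */
	int have_non_local_packs; /* some pack is borrowed from elsewhere */
	int pack_local;           /* the pack we found the object in is local */
	int pack_kept;            /* ...and that pack is itself kept */
	int oid_in_kept_pack;     /* the object is in some (other) kept pack */
};

/* Returns 1 (want), 0 (do not want), or -1 (unknown; check more packs). */
int want_found_object_sketch(const struct want_query *q)
{
	if (q->exclude)
		return 1;

	/* 1. Disqualifying case: found in a borrowed pack under --local. */
	if (q->local && !q->pack_local)
		return 0;

	/* 2. Kept-pack checks, which can be answered without a full scan. */
	if (q->ignore_kept && (q->pack_kept || q->oid_in_kept_pack))
		return 0;

	/* 3. Nothing else can disqualify the object. */
	if (!q->local || !q->have_non_local_packs)
		return 1;

	/* Another copy might still live in a non-local pack. */
	return -1;
}
```

The point of step 2 is that a single "is this object in any kept pack?"
query replaces walking every pack and returning -1 along the way.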

Here are the results from p5303 (measurements again taken on the
kernel):

  Test                                        HEAD^                   HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.26(54.59+10.84)      57.34(54.66+10.88) +0.1%
  5303.6: repack with kept (1)                57.33(54.80+10.51)      57.38(54.83+10.49) +0.1%
  5303.11: repack (50)                        71.54(88.57+4.84)       71.70(88.99+4.74) +0.2%
  5303.12: repack with kept (50)              85.12(102.05+4.94)      72.58(89.61+4.78) -14.7%
  5303.17: repack (1000)                      216.87(490.79+14.57)    217.19(491.72+14.25) +0.1%
  5303.18: repack with kept (1000)            665.63(938.87+15.76)    246.12(520.07+14.93) -63.0%

and the --stdin-packs timings:

  5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)         0.00(0.00+0.00) -100.0%
  5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)        3.43(11.75+0.24) -2.8%
  5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)     130.50(307.15+7.66) -33.4%

So our repack with an empty .keep pack is roughly as fast as one without
a .keep pack up to 50 packs. But the --stdin-packs case scales a little
better, too.

Notably, it is faster than a repack of the same size and a kept pack. It
looks at fewer objects, of course, but the penalty for looking at many
packs isn't as costly.
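
For reference, the percentage column in these tables appears to be the
relative change in wall-clock time from HEAD^ to HEAD. A small sketch of
that arithmetic ('pct_change()' is a hypothetical helper, not part of
the perf suite):

```c
/* Relative change from 'before' to 'after', in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, pct_change(665.63, 246.12) is roughly -63.0, matching the
5303.18 row, and pct_change(85.12, 72.58) is roughly -14.7, matching
5303.12.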

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 131 ++++++++++++++++++++++++-----------------
 1 file changed, 78 insertions(+), 53 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e766a4a43b..be3ba60bc2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1188,7 +1188,8 @@ static int have_duplicate_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int want_found_object(const struct object_id *oid, int exclude,
+			     struct packed_git *p)
 {
 	if (exclude)
 		return 1;
@@ -1204,27 +1205,82 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 * make sure no copy of this object appears in _any_ pack that makes us
 	 * to omit the object, so we need to check all the packs.
 	 *
-	 * We can however first check whether these options can possible matter;
+	 * We can however first check whether these options can possibly matter;
 	 * if they do not matter we know we want the object in generated pack.
 	 * Otherwise, we signal "-1" at the end to tell the caller that we do
 	 * not know either way, and it needs to check more packs.
 	 */
-	if (!ignore_packed_keep_on_disk &&
-	    !ignore_packed_keep_in_core &&
-	    (!local || !have_non_local_packs))
-		return 1;
 
+	/*
+	 * Objects in packs borrowed from elsewhere are discarded regardless of
+	 * if they appear in other packs that weren't borrowed.
+	 */
 	if (local && !p->pack_local)
 		return 0;
-	if (p->pack_local &&
-	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
-	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
-		return 0;
+
+	/*
+	 * Then handle .keep first, as we have a fast(er) path there.
+	 */
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+		/*
+		 * Set the flags for the kept-pack cache to be the ones we want
+		 * to ignore.
+		 *
+		 * That is, if we are ignoring objects in on-disk keep packs,
+		 * then we want to search through the on-disk keep and ignore
+		 * the in-core ones.
+		 */
+		unsigned flags = 0;
+		if (ignore_packed_keep_on_disk)
+			flags |= ON_DISK_KEEP_PACKS;
+		if (ignore_packed_keep_in_core)
+			flags |= IN_CORE_KEEP_PACKS;
+
+		if (ignore_packed_keep_on_disk && p->pack_keep)
+			return 0;
+		if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+			return 0;
+		if (has_object_kept_pack(oid, flags))
+			return 0;
+	}
+
+	/*
+	 * At this point we know definitively that either we don't care about
+	 * keep-packs, or the object is not in one. Keep checking other
+	 * conditions...
+	 */
+	if (!local || !have_non_local_packs)
+		return 1;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
+static int want_object_in_pack_one(struct packed_git *p,
+				   const struct object_id *oid,
+				   int exclude,
+				   struct packed_git **found_pack,
+				   off_t *found_offset)
+{
+	off_t offset;
+
+	if (p == *found_pack)
+		offset = *found_offset;
+	else
+		offset = find_pack_entry_one(oid->hash, p);
+
+	if (offset) {
+		if (!*found_pack) {
+			if (!is_pack_valid(p))
+				return -1;
+			*found_offset = offset;
+			*found_pack = p;
+		}
+		return want_found_object(oid, exclude, p);
+	}
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
@@ -1252,7 +1308,7 @@ static int want_object_in_pack(const struct object_id *oid,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
+		want = want_found_object(oid, exclude, *found_pack);
 		if (want != -1)
 			return want;
 	}
@@ -1260,53 +1316,22 @@ static int want_object_in_pack(const struct object_id *oid,
 	for (m = get_multi_pack_index(the_repository); m; m = m->next) {
 		struct pack_entry e;
 		if (fill_midx_entry(the_repository, oid, &e, m)) {
-			struct packed_git *p = e.p;
-			off_t offset;
-
-			if (p == *found_pack)
-				offset = *found_offset;
-			else
-				offset = find_pack_entry_one(oid->hash, p);
-
-			if (offset) {
-				if (!*found_pack) {
-					if (!is_pack_valid(p))
-						continue;
-					*found_offset = offset;
-					*found_pack = p;
-				}
-				want = want_found_object(exclude, p);
-				if (want != -1)
-					return want;
-			}
-		}
-	}
-
-	list_for_each(pos, get_packed_git_mru(the_repository)) {
-		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		off_t offset;
-
-		if (p == *found_pack)
-			offset = *found_offset;
-		else
-			offset = find_pack_entry_one(oid->hash, p);
-
-		if (offset) {
-			if (!*found_pack) {
-				if (!is_pack_valid(p))
-					continue;
-				*found_offset = offset;
-				*found_pack = p;
-			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
-				list_move(&p->mru,
-					  get_packed_git_mru(the_repository));
+			want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
 			if (want != -1)
 				return want;
 		}
 	}
 
+	list_for_each(pos, get_packed_git_mru(the_repository)) {
+		struct packed_git *p = list_entry(pos, struct packed_git, mru);
+		want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+		if (!exclude && want > 0)
+			list_move(&p->mru,
+				  get_packed_git_mru(the_repository));
+		if (want != -1)
+			return want;
+	}
+
 	if (uri_protocols.nr) {
 		struct configured_exclusion *ex =
 			oidmap_get(&configured_exclusions, oid);
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (5 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-18  3:14   ` [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
  2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

In a recent patch we added a function 'find_kept_pack_entry()' to look
for an object only among kept packs.

While this function avoids doing any lookup work in non-kept packs, it
is still linear in the number of packs, since we have to traverse the
linked list of packs once per object. Let's cache a reduced version of
that list to save us time.

Note that this cache will last the lifetime of the program. We could
invalidate it on reprepare_packed_git(), but there's not much point in
being rigorous here:

  - we might already fail to notice new .keep packs showing up after the
    program starts. We only reprepare_packed_git() when we fail to find
    an object. But adding a new pack won't cause that to happen.
    Somebody repacking could add a new pack and delete an old one, but
    most of the time we'd have a descriptor or mmap open to the old
    pack anyway, so we might not even notice.

  - in pack-objects we already cache the .keep state at startup, since
    56dfeb6263 (pack-objects: compute local/ignore_pack_keep early,
    2016-07-29). So this is just extending that concept further.

  - we don't have to worry about any packed_git being removed; we always
    keep the old structs around, even after reprepare_packed_git()

We do defensively invalidate the cache in case the set of kept packs
being asked for changes (e.g., only in-core kept packs were cached, but
suddenly the caller also wants on-disk kept packs, too). In theory we
could build all three caches and switch between them, but it's not
necessary, since this patch (and series) never changes the set of kept
packs that it wants to inspect from the cache.

So that "optimization" is more about being defensive in the face of
future changes than it is about asking for multiple kinds of kept packs
in this patch.
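
The caching idea can be shown in isolation with a toy model (a sketch
only; the 'sketch_*' names, the 'builds' counter and the array-based
pack list are assumptions made for illustration and do not match the
real structs): build a NULL-terminated array of matching packs once,
reuse it while the requested flags stay the same, and drop it when they
change.

```c
#include <stdlib.h>

#define SKETCH_ON_DISK_KEEP (1u << 0)
#define SKETCH_IN_CORE_KEEP (1u << 1)

struct sketch_pack {
	int keep_on_disk;
	int keep_in_core;
};

struct sketch_cache {
	const struct sketch_pack **packs; /* NULL-terminated, or NULL if unset */
	unsigned flags;                   /* flags the cache was built for */
	unsigned builds;                  /* how many times we rebuilt it */
};

const struct sketch_pack **sketch_kept_packs(struct sketch_cache *c,
					     const struct sketch_pack *all,
					     size_t nr, unsigned flags)
{
	size_t i, n = 0;

	/* Defensive invalidation: a different set of kept packs is wanted. */
	if (c->packs && c->flags != flags) {
		free(c->packs);
		c->packs = NULL;
	}
	if (c->packs)
		return c->packs;

	c->packs = calloc(nr + 1, sizeof(*c->packs));
	if (!c->packs)
		return NULL;
	for (i = 0; i < nr; i++) {
		if (((flags & SKETCH_ON_DISK_KEEP) && all[i].keep_on_disk) ||
		    ((flags & SKETCH_IN_CORE_KEEP) && all[i].keep_in_core))
			c->packs[n++] = &all[i];
	}
	c->packs[n] = NULL;
	c->flags = flags;
	c->builds++;
	return c->packs;
}
```

A per-object lookup then only walks the (usually short) cached array
rather than the full list of packs.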

Here are p5303 results (as always, measured against the kernel):

  Test                                        HEAD^                   HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.34(54.66+10.88)      56.98(54.36+10.98) -0.6%
  5303.6: repack with kept (1)                57.38(54.83+10.49)      57.17(54.97+10.26) -0.4%
  5303.11: repack (50)                        71.70(88.99+4.74)       71.62(88.48+5.08) -0.1%
  5303.12: repack with kept (50)              72.58(89.61+4.78)       71.56(88.80+4.59) -1.4%
  5303.17: repack (1000)                      217.19(491.72+14.25)    217.31(490.82+14.53) +0.1%
  5303.18: repack with kept (1000)            246.12(520.07+14.93)    217.08(490.37+15.10) -11.8%

and the --stdin-packs case, which scales a little bit better (although
not by that much even at 1,000 packs):

  5303.7: repack with --stdin-packs (1)       0.00(0.00+0.00)         0.00(0.00+0.00) =
  5303.13: repack with --stdin-packs (50)     3.43(11.75+0.24)        3.43(11.69+0.30) +0.0%
  5303.19: repack with --stdin-packs (1000)   130.50(307.15+7.66)     125.13(301.36+8.04) -4.1%

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-store.h |  5 +++
 packfile.c     | 99 ++++++++++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 43 deletions(-)

diff --git a/object-store.h b/object-store.h
index 541dab0858..ec32c23dcb 100644
--- a/object-store.h
+++ b/object-store.h
@@ -153,6 +153,11 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	struct {
+		struct packed_git **packs;
+		unsigned flags;
+	} kept_pack_cache;
+
 	/*
 	 * A map of packfiles to packed_git structs for tracking which
 	 * packs have been loaded already.
diff --git a/packfile.c b/packfile.c
index 7f84f221ce..57d5b436fb 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2042,10 +2042,7 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int find_one_pack_entry(struct repository *r,
-			       const struct object_id *oid,
-			       struct pack_entry *e,
-			       int kept_only)
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2055,49 +2052,63 @@ static int find_one_pack_entry(struct repository *r,
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (!fill_midx_entry(r, oid, e, m))
-			continue;
-
-		if (!kept_only)
-			return 1;
-
-		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
-		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
+		if (fill_midx_entry(r, oid, e, m))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (p->multi_pack_index && !kept_only) {
-			/*
-			 * If this pack is covered by the MIDX, we'd have found
-			 * the object already in the loop above if it was here,
-			 * so don't bother looking.
-			 *
-			 * The exception is if we are looking only at kept
-			 * packs. An object can be present in two packs covered
-			 * by the MIDX, one kept and one not-kept. And as the
-			 * MIDX points to only one copy of each object, it might
-			 * have returned only the non-kept version above. We
-			 * have to check again to be thorough.
-			 */
-			continue;
-		}
-		if (!kept_only ||
-		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
-		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
-			if (fill_pack_entry(oid, e, p)) {
-				list_move(&p->mru, &r->objects->packed_git_mru);
-				return 1;
-			}
+		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
+			list_move(&p->mru, &r->objects->packed_git_mru);
+			return 1;
 		}
 	}
 	return 0;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static void maybe_invalidate_kept_pack_cache(struct repository *r,
+					     unsigned flags)
 {
-	return find_one_pack_entry(r, oid, e, 0);
+	if (!r->objects->kept_pack_cache.packs)
+		return;
+	if (r->objects->kept_pack_cache.flags == flags)
+		return;
+	FREE_AND_NULL(r->objects->kept_pack_cache.packs);
+	r->objects->kept_pack_cache.flags = 0;
+}
+
+static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+{
+	maybe_invalidate_kept_pack_cache(r, flags);
+
+	if (!r->objects->kept_pack_cache.packs) {
+		struct packed_git **packs = NULL;
+		size_t nr = 0, alloc = 0;
+		struct packed_git *p;
+
+		/*
+		 * We want "all" packs here, because we need to cover ones that
+		 * are used by a midx, as well. We need to look in every one of
+		 * them (instead of the midx itself) to cover duplicates. It's
+		 * possible that an object is found in two packs that the midx
+		 * covers, one kept and one not kept, but the midx returns only
+		 * the non-kept version.
+		 */
+		for (p = get_all_packs(r); p; p = p->next) {
+			if ((p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) ||
+			    (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS))) {
+				ALLOC_GROW(packs, nr + 1, alloc);
+				packs[nr++] = p;
+			}
+		}
+		ALLOC_GROW(packs, nr + 1, alloc);
+		packs[nr] = NULL;
+
+		r->objects->kept_pack_cache.packs = packs;
+		r->objects->kept_pack_cache.flags = flags;
+	}
+
+	return r->objects->kept_pack_cache.packs;
 }
 
 int find_kept_pack_entry(struct repository *r,
@@ -2105,13 +2116,15 @@ int find_kept_pack_entry(struct repository *r,
 			 unsigned flags,
 			 struct pack_entry *e)
 {
-	/*
-	 * Load all packs, including midx packs, since our "kept" strategy
-	 * relies on that. We're relying on the side effect of it setting up
-	 * r->objects->packed_git, which is a little ugly.
-	 */
-	get_all_packs(r);
-	return find_one_pack_entry(r, oid, e, flags);
+	struct packed_git **cache;
+
+	for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+		struct packed_git *p = *cache;
+		if (fill_pack_entry(oid, e, p))
+			return 1;
+	}
+
+	return 0;
 }
 
 int has_object_pack(const struct object_id *oid)
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v3 8/8] builtin/repack.c: add '--geometric' option
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (6 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
@ 2021-02-18  3:14   ` Taylor Blau
  2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
  8 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-18  3:14 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Often it is useful to both:

  - have relatively few packfiles in a repository, and

  - avoid having so few packfiles in a repository that we repack its
    entire contents regularly

This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to 'objects(Pi)'.
With a geometric factor of 'r', it should be that:

  objects(Pi) > r*objects(P(i-1))

for all i in (1, n], where the packs are sorted by

  objects(P1) <= objects(P2) <= ... <= objects(Pn).

Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:

  1. We assume that there is a cutoff of packs _before starting the
     repack_ where everything to the right of that cut-off already forms
     a geometric progression (or no cutoff exists and everything must be
     repacked).

  2. We assume that everything smaller than the cutoff count must be
     repacked. This forms our base assumption, but it can also cause
     even the "heavy" packs to get repacked, e.g., if we have 6
     packs containing the following number of objects:

       1, 1, 1, 2, 4, 32

     then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
     rolling up the first two packs into a pack with 2 objects. That
     breaks our progression and leaves us:

       2, 1, 2, 4, 32
         ^

     (where the '^' indicates the position of our split). To restore a
     progression, we move the split forward (towards larger packs)
     joining each pack into our new pack until a geometric progression
     is restored. Here, that looks like:

       2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
         ^                   ^                ^                   ^

This has the advantage of not repacking the heavy side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
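
The split computation described above can be sketched on plain object
counts. This is only an illustration: 'geometric_split()' is a
hypothetical helper that mirrors the two loops of split_pack_geometry()
in the diff below, with 'struct packed_git' replaced by an ascending
array of counts. Like the diff, it compares with '>=' when looking for
an existing progression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for split_pack_geometry(): "counts" holds the
 * object count of each pack, sorted in ascending order. Returns the
 * number of leading (smallest) packs that would be rolled up.
 */
static size_t geometric_split(const uint32_t *counts, size_t nr,
			      uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	assert(factor > 0);

	if (nr <= 1)
		return nr;

	/* Walk down from the largest pack while the progression holds. */
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if (counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}

	/* The larger pack of the last-compared pair is not in the progression. */
	if (split)
		split++;

	/* Size of the would-be new pack. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb heavy-side packs until the progression is restored. */
	for (i = split; i < nr; i++) {
		if (counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else
			break;
	}

	return split;
}
```

With the counts 1, 1, 1, 2, 4, 32 and a factor of 2, this returns 5:
the five smallest packs are combined into one pack of 9 objects, and
the pack with 32 objects is left alone.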

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  22 +++++
 builtin/repack.c             | 187 ++++++++++++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 137 +++++++++++++++++++++++++
 3 files changed, 342 insertions(+), 4 deletions(-)
 create mode 100755 t/t7703-repack-geometric.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 92f146d27d..21c7068925 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -165,6 +165,28 @@ depth is 4095.
 	Pass the `--delta-islands` option to `git-pack-objects`, see
 	linkgit:git-pack-objects[1].
 
+-g=<factor>::
+--geometric=<factor>::
+	Arrange resulting pack structure so that each successive pack
+	contains at least `<factor>` times the number of objects as the
+	next-largest pack.
++
+`git repack` ensures this by determining a "cut" of packfiles that need
+to be repacked into one in order to ensure a geometric progression. It
+picks the smallest set of packfiles such that as many of the larger
+packfiles (by count of objects contained in that pack) may be left
+intact.
++
+Unlike other repack modes, the set of objects to pack is determined
+uniquely by the set of packs being "rolled-up"; in other words, the
+packs determined to need to be combined in order to restore a geometric
+progression.
++
+Loose objects are implicitly included in this "roll-up", without respect
+to their reachability. This is subject to change in the future. This
+option (implying a drastically different repack mode) is not guarenteed
+to work with all other combinations of option to `git repack`).
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 01440de2d5..bcf280b10d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -297,6 +297,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
+struct pack_geometry {
+	struct packed_git **pack;
+	uint32_t pack_nr, pack_alloc;
+	uint32_t split;
+};
+
+static uint32_t geometry_pack_weight(struct packed_git *p)
+{
+	if (open_pack_index(p))
+		die(_("cannot open index for %s"), p->pack_name);
+	return p->num_objects;
+}
+
+static int geometry_cmp(const void *va, const void *vb)
+{
+	uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
+		 bw = geometry_pack_weight(*(struct packed_git **)vb);
+
+	if (aw < bw)
+		return -1;
+	if (aw > bw)
+		return 1;
+	return 0;
+}
+
+static void init_pack_geometry(struct pack_geometry **geometry_p)
+{
+	struct packed_git *p;
+	struct pack_geometry *geometry;
+
+	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
+	geometry = *geometry_p;
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (!pack_kept_objects && p->pack_keep)
+			continue;
+
+		ALLOC_GROW(geometry->pack,
+			   geometry->pack_nr + 1,
+			   geometry->pack_alloc);
+
+		geometry->pack[geometry->pack_nr] = p;
+		geometry->pack_nr++;
+	}
+
+	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
+}
+
+static void split_pack_geometry(struct pack_geometry *geometry, int factor)
+{
+	uint32_t i;
+	uint32_t split;
+	off_t total_size = 0;
+
+	if (geometry->pack_nr <= 1) {
+		geometry->split = geometry->pack_nr;
+		return;
+	}
+
+	split = geometry->pack_nr - 1;
+
+	/*
+	 * First, count the number of packs (in descending order of size) which
+	 * already form a geometric progression.
+	 */
+	for (i = geometry->pack_nr - 1; i > 0; i--) {
+		struct packed_git *ours = geometry->pack[i];
+		struct packed_git *prev = geometry->pack[i - 1];
+		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
+			split--;
+		else
+			break;
+	}
+
+	if (split) {
+		/*
+		 * Move the split one to the right, since the top element in the
+		 * last-compared pair can't be in the progression. Only do this
+		 * when we split in the middle of the array (otherwise if we got
+		 * to the end, then the split is in the right place).
+		 */
+		split++;
+	}
+
+	/*
+	 * Then, anything to the left of 'split' must be in a new pack. But,
+	 * creating that new pack may cause packs in the heavy half to no longer
+	 * form a geometric progression.
+	 *
+	 * Compute an expected size of the new pack, and then determine how many
+	 * packs in the heavy half need to be joined into it (if any) to restore
+	 * the geometric progression.
+	 */
+	for (i = 0; i < split; i++)
+		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = split; i < geometry->pack_nr; i++) {
+		struct packed_git *ours = geometry->pack[i];
+		if (geometry_pack_weight(ours) < factor * total_size) {
+			split++;
+			total_size += geometry_pack_weight(ours);
+		} else
+			break;
+	}
+
+	geometry->split = split;
+}
+
+static void clear_pack_geometry(struct pack_geometry *geometry)
+{
+	if (!geometry)
+		return;
+
+	free(geometry->pack);
+	geometry->pack_nr = 0;
+	geometry->pack_alloc = 0;
+	geometry->split = 0;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -304,6 +422,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
@@ -316,6 +435,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	int geometric_factor = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -356,6 +476,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
+		OPT_INTEGER('g', "geometric", &geometric_factor,
+			    N_("find a geometric progression with factor <N>")),
 		OPT_END()
 	};
 
@@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
 		die(_(incremental_bitmap_conflict_error));
 
+	if (geometric_factor) {
+		if (pack_everything)
+			die(_("--geometric is incompatible with -A, -a"));
+		init_pack_geometry(&geometry);
+		split_pack_geometry(geometry, geometric_factor);
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
@@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_pushf(&cmd.args, "--keep-pack=%s",
 			     keep_pack_list.items[i].string);
 	strvec_push(&cmd.args, "--non-empty");
-	strvec_push(&cmd.args, "--all");
-	strvec_push(&cmd.args, "--reflog");
-	strvec_push(&cmd.args, "--indexed-objects");
+	if (!geometry) {
+		/*
+		 * 'git pack-objects' will up all objects loose or packed
+		 * (either rolling them up or leaving them alone), so don't pass
+		 * these options.
+		 *
+		 * The implementation of 'git pack-objects --stdin-packs'
+		 * makes them redundant (and the two are incompatible).
+		 */
+		strvec_push(&cmd.args, "--all");
+		strvec_push(&cmd.args, "--reflog");
+		strvec_push(&cmd.args, "--indexed-objects");
+	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
 	if (write_bitmaps > 0)
@@ -429,17 +568,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
 		}
+	} else if (geometry) {
+		strvec_push(&cmd.args, "--stdin-packs");
+		strvec_push(&cmd.args, "--unpacked");
 	} else {
 		strvec_push(&cmd.args, "--unpacked");
 		strvec_push(&cmd.args, "--incremental");
 	}
 
-	cmd.no_stdin = 1;
+	if (geometry)
+		cmd.in = -1;
+	else
+		cmd.no_stdin = 1;
 
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
 
+	if (geometry) {
+		FILE *in = xfdopen(cmd.in, "w");
+		/*
+		 * The resulting pack should contain all objects in packs that
+		 * are going to be rolled up, but exclude objects in packs which
+		 * are being left alone.
+		 */
+		for (i = 0; i < geometry->split; i++)
+			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
+		for (i = geometry->split; i < geometry->pack_nr; i++)
+			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+		fclose(in);
+	}
+
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		if (line.len != the_hash_algo->hexsz)
@@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			if (!string_list_has_string(&names, sha1))
 				remove_redundant_pack(packdir, item->string);
 		}
+
+		if (geometry) {
+			struct strbuf buf = STRBUF_INIT;
+
+			uint32_t i;
+			for (i = 0; i < geometry->split; i++) {
+				struct packed_git *p = geometry->pack[i];
+				if (string_list_has_string(&names,
+							   hash_to_hex(p->hash)))
+					continue;
+
+				strbuf_reset(&buf);
+				strbuf_addstr(&buf, pack_basename(p));
+				strbuf_strip_suffix(&buf, ".pack");
+
+				remove_redundant_pack(packdir, buf.buf);
+			}
+			strbuf_release(&buf);
+		}
 		if (!po_args.quiet && isatty(2))
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
@@ -528,6 +706,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
 	return 0;
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
new file mode 100755
index 0000000000..96917fc163
--- /dev/null
+++ b/t/t7703-repack-geometric.sh
@@ -0,0 +1,137 @@
+#!/bin/sh
+
+test_description='git repack --geometric works correctly'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success '--geometric with no packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		git repack --geometric 2 >out &&
+		test_i18ngrep "Nothing new to pack" out
+	)
+'
+
+test_expect_success '--geometric with an intact progression' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		test_commit_bulk --start=4 4 && # 12 objects
+
+		find $objdir/pack -name "*.pack" | sort >expect &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >actual &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--geometric with small-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		find $objdir/pack -name "*.pack" | sort >small &&
+		test_commit_bulk --start=3 4 && # 12 objects
+		test_commit_bulk --start=7 8 && # 24 objects
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		# Three packs in total; two of the existing large ones, and one
+		# new one.
+		find $objdir/pack -name "*.pack" | sort >after &&
+		test_line_count = 3 after &&
+		comm -3 small before | tr -d "\t" >large &&
+		grep -qFf large after
+	)
+'
+
+test_expect_success '--geometric with small- and large-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# size(small1) + size(small2) > size(medium) / 2
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		test_commit_bulk --start=2 3 && # 7 objects
+		test_commit_bulk --start=6 9 && # 27 objects
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		find $objdir/pack -name "*.pack" | sort >after &&
+		comm -12 before after >untouched &&
+
+		# Two packs in total; the largest pack from before running "git
+		# repack", and one new one.
+		test_line_count = 1 untouched &&
+		test_line_count = 2 after
+	)
+'
+
+test_expect_success '--geometric ignores kept packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit kept && # 3 objects
+		test_commit pack && # 3 objects
+
+		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/kept
+		EOF
+		) &&
+		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/pack
+		^refs/tags/kept
+		EOF
+		) &&
+
+		# neither pack contains more than twice the number of objects in
+		# the other, so they should be combined. but, marking one as
+		# .kept on disk will "freeze" it, so the pack structure should
+		# remain unchanged.
+		touch $objdir/pack/pack-$KEPT.keep &&
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >after &&
+
+		# both packs should still exist
+		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
+		test_path_is_file $objdir/pack/pack-$PACK.pack &&
+
+		# and no new packs should be created
+		test_cmp before after &&
+
+		# Passing --pack-kept-objects causes packs with a .keep file to
+		# be repacked, too.
+		git repack --geometric 2 -d --pack-kept-objects &&
+
+		find $objdir/pack -name "*.pack" >after &&
+		test_line_count = 1 after
+	)
+'
+
+test_done
-- 
2.30.0.667.g81c0cbc6fd


* Re: [PATCH v3 0/8] repack: support repacking into a geometric sequence
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
                     ` (7 preceding siblings ...)
  2021-02-18  3:14   ` [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-02-23  0:31   ` Jeff King
  2021-02-23  1:06     ` Taylor Blau
  8 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-23  0:31 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Wed, Feb 17, 2021 at 10:14:11PM -0500, Taylor Blau wrote:

> Here is another updated version of mine and Peff's series to add a new 'git
> repack --geometric' mode which supports repacking a repository into a geometric
> progression of packs by object count.

Thanks. This version looks pretty good to me. I have a few inline
comments below. Mostly just observations, but there are a couple tiny nits
that I think may justify one more re-roll.

> 14:  ddc2896caa !  2:  82f6b45463 revision: learn '--no-kept-objects'
>     @@ Commit message
>          certain packs alone (for e.g., when doing a geometric repack that has
>          some "large" packs which are kept in-core that it wants to leave alone).
> 
>     +    Note that this option is not guaranteed to produce exactly the set of
>     +    objects that aren't in kept packs, since it's possible the traversal
>     +    order may end up in a situation where a non-kept ancestor was "cut off"
>     +    by a kept object (at which point we would stop traversing). But, we
>     +    don't care about absolute correctness here, since this will eventually
>     +    be used as a purely additive guide in an upcoming new repack mode.
>     +
>     +    Explicitly avoid documenting this new flag, since it is only used
>     +    internally. In theory we could avoid even adding it rev-list, but being
>     +    able to spell this option out on the command-line makes some special
>     +    cases easier to test without promising to keep it behaving consistently
>     +    forever. Those tricky cases are exercised in t6114.

We don't have a real procedure for marking something as "off limits" for
users. IMHO omitting it from the documentation and putting an explicit
note in the commit message is probably enough. It would be perhaps
stronger to mark it explicitly as "do not touch" in the documentation,
but then we are polluting the documentation. :)

>     @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
>      +			die(_("could not find pack '%s'"), item->string);
>      +		p->pack_keep_in_core = 1;
>      +	}
>     ++
>     ++	/*
>     ++	 * Order packs by ascending mtime; use QSORT directly to access the
>     ++	 * string_list_item's ->util pointer, which string_list_sort() does not
>     ++	 * provide.
>     ++	 */
>     ++	QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp);
>     ++

I wondered briefly if we should accept the order from the caller, and
make it responsible for any sorting. But in other instances, we are
happy to reorder objects internally for the sake of optimization, so it
probably makes sense here.

I also wondered if we could piggy-back on the sorting of packed_git,
which is already in reverse chronological order. But here our primary
structure is the string-list, so we lose that order.

I'm not sure if your sort function is going the right way, though. It
does:

>     ++static int pack_mtime_cmp(const void *_a, const void *_b)
>     ++{
>     ++        struct packed_git *a = ((const struct string_list_item*)_a)->util;
>     ++        struct packed_git *b = ((const struct string_list_item*)_b)->util;
>     ++
>     ++        if (a->mtime < b->mtime)
>     ++                return -1;
>     ++        else if (b->mtime < a->mtime)
>     ++                return 1;
>     ++        else
>     ++                return 0;
>     ++}
>     ++

Does that give us the packs in increasing chronological order, but then
decreasing chronological order within the packs themselves?
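
For the first half of that question, a minimal sketch of the comparator
convention may help. 'struct mtime_pack' and 'sort_by_mtime()' are
hypothetical stand-ins (not git's types) carrying only an mtime; the
comparator body follows the quoted pack_mtime_cmp(). It says nothing
about the order of objects within a pack, which this comparator does
not determine.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the one field of struct packed_git used here. */
struct mtime_pack {
	const char *name;
	time_t mtime;
};

/* Same shape as the quoted comparator: negative when "a" is older. */
static int mtime_pack_cmp(const void *va, const void *vb)
{
	const struct mtime_pack *a = va;
	const struct mtime_pack *b = vb;

	if (a->mtime < b->mtime)
		return -1;
	else if (b->mtime < a->mtime)
		return 1;
	else
		return 0;
}

/* Sort in place; the oldest pack ends up at index 0. */
static void sort_by_mtime(struct mtime_pack *packs, size_t nr)
{
	assert(packs && nr);
	qsort(packs, nr, sizeof(*packs), mtime_pack_cmp);
}
```

Because the comparator returns a negative value when the first argument
has the smaller mtime, qsort() places packs in ascending mtime order,
i.e. oldest pack first.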

> 17:  b5081c01b5 !  5:  181c104a03 p5303: measure time to repack with keep
>     @@ Metadata
>       ## Commit message ##
>          p5303: measure time to repack with keep
> 
>     -    This is the same as the regular repack test, except that we mark the
>     -    single base pack as "kept" and use --assume-kept-packs-closed. The
>     -    theory is that this should be faster than the normal repack, because
>     -    we'll have fewer objects to traverse and process.
>     +    Add two new tests to measure repack performance. Both test split the

s/test split/tests split/, I think.

>     +    in the 50-pack case, things start to slow down:
>     +
>     +      5303.11: repack (50)                        71.54(88.57+4.84)
>     +      5303.12: repack with kept (50)              85.12(102.05+4.94)
>     +
>     +    and by the time we hit 1,000 packs, things are substantially worse, even
>     +    though the resulting pack produced is the same:
>     +
>     +      5303.17: repack (1000)                      216.87(490.79+14.57)
>     +      5303.18: repack with kept (1000)            665.63(938.87+15.76)

OK, that's the kind of horrendous slowdown I knew we could demonstrate. :)
I'm excited to see the numbers improve in the next patch.

>     +    Likewise, the scaling is pretty extreme on --stdin-packs:
>     +
>     +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
>     +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
>     +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)
> 
>          That's because the code paths around handling .keep files are known to
>          scale badly; they look in every single pack file to find each object.

Your "that's because" is a little confusing to me. It certainly applies
to the repack vs repack-with-kept comparisons for a given number of
packs. But the scaling on the three --stdin-packs tests is high because
each subsequent test is being asked to do a lot more work. But they're
still cheaper than the matching "repack" case with a given number of
packs. Just not _as_ cheap as they would be if the kept code weren't so
slow.

Would it make sense to reorder those two paragraphs?

>     ++	test_perf "repack with kept ($nr_packs)" '
>     ++		git pack-objects --keep-true-parents \
>     ++		  --keep-pack=pack-$empty_pack.pack \
>     ++		  --honor-pack-keep --non-empty --all \
>     ++		  --reflog --indexed-objects --delta-base-offset \
>     ++		  --stdout </dev/null >/dev/null
>     ++	'

The new test itself looks sensible. I like using --keep-pack here to
avoid needing to do any other setup/cleanup. (It does assume that
on-disk and in-core keeps behave the same, but I'm fine with that
white-box assumption, especially for a perf test).

>     +      5303.5: repack (1)                          57.26(54.59+10.84)      57.34(54.66+10.88) +0.1%
>     +      5303.6: repack with kept (1)                57.33(54.80+10.51)      57.38(54.83+10.49) +0.1%
>     +      5303.11: repack (50)                        71.54(88.57+4.84)       71.70(88.99+4.74) +0.2%
>     +      5303.12: repack with kept (50)              85.12(102.05+4.94)      72.58(89.61+4.78) -14.7%
>     +      5303.17: repack (1000)                      216.87(490.79+14.57)    217.19(491.72+14.25) +0.1%
>     +      5303.18: repack with kept (1000)            665.63(938.87+15.76)    246.12(520.07+14.93) -63.0%

Nice. In each amount we are recovering almost all of the kept slowdown
seen between the repack and repack-with-kept cases. The remaining
slowdown is just from iterating that N-pack linked list, even though we
don't look in any of its .idx files.

>     +    and the --stdin-packs timings:
>     +
>     +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)         0.00(0.00+0.00) -100.0%
>     +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)        3.43(11.75+0.24) -2.8%
>     +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)     130.50(307.15+7.66) -33.4%

And of course we see an improvement here, too (as expected, but not as
dramatic because we are doing less work overall).

> 19:  f1c07324f6 !  7:  e9e04b95e7 packfile: add kept-pack cache for find_kept_pack_entry()
> [...]
>     +      5303.5: repack (1)                          57.34(54.66+10.88)      56.98(54.36+10.98) -0.6%
>     +      5303.6: repack with kept (1)                57.38(54.83+10.49)      57.17(54.97+10.26) -0.4%
>     +      5303.11: repack (50)                        71.70(88.99+4.74)       71.62(88.48+5.08) -0.1%
>     +      5303.12: repack with kept (50)              72.58(89.61+4.78)       71.56(88.80+4.59) -1.4%
>     +      5303.17: repack (1000)                      217.19(491.72+14.25)    217.31(490.82+14.53) +0.1%
>     +      5303.18: repack with kept (1000)            246.12(520.07+14.93)    217.08(490.37+15.10) -11.8%

And now we can see this patch carrying its weight much more than in the
previous iteration of the series. Good. Our N-pack linked list is now a
single element (just the kept pack), so we expect our repack-with-kept
times to match their non-kept partners. And they do.
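
The idea behind the cache can be sketched outside of git as follows.
This is an illustration only: 'struct sketch_pack', the SKETCH_* flags,
and 'collect_kept()' are hypothetical names standing in for
'struct packed_git', ON_DISK_KEEP_PACKS / IN_CORE_KEEP_PACKS, and the
patch's kept_pack_cache(). The sketch also omits the invalidation step
that the patch performs when the flags change.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_ON_DISK_KEEP (1u << 0)
#define SKETCH_IN_CORE_KEEP (1u << 1)

/* Hypothetical stand-in for the fields of struct packed_git used here. */
struct sketch_pack {
	struct sketch_pack *next;
	unsigned pack_keep:1,
		 pack_keep_in_core:1;
};

/*
 * Walk the full pack list once and return a NULL-terminated array of
 * only the packs matching "flags". Lookups can then iterate this short
 * array instead of the whole list. The caller frees the result.
 */
static struct sketch_pack **collect_kept(struct sketch_pack *all,
					 unsigned flags)
{
	struct sketch_pack **out;
	struct sketch_pack *p;
	size_t nr = 0, alloc = 0;

	for (p = all; p; p = p->next)
		alloc++;

	/* One extra slot for the terminating NULL (zeroed by calloc). */
	out = calloc(alloc + 1, sizeof(*out));
	if (!out)
		return NULL;

	for (p = all; p; p = p->next) {
		if ((p->pack_keep && (flags & SKETCH_ON_DISK_KEEP)) ||
		    (p->pack_keep_in_core && (flags & SKETCH_IN_CORE_KEEP)))
			out[nr++] = p;
	}
	assert(nr <= alloc);

	return out;
}
```

In the perf test there is exactly one kept pack, so the array holds a
single entry regardless of how many packs the repository has, which is
consistent with the kept and non-kept timings converging.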

>     +    and the --stdin-packs case, which scales a little bit better (although
>     +    not by that much even at 1,000 packs):
>     +
>     +      5303.7: repack with --stdin-packs (1)       0.00(0.00+0.00)         0.00(0.00+0.00) =
>     +      5303.13: repack with --stdin-packs (50)     3.43(11.75+0.24)        3.43(11.69+0.30) +0.0%
>     +      5303.19: repack with --stdin-packs (1000)   130.50(307.15+7.66)     125.13(301.36+8.04) -4.1%

And likewise this is less dramatic, but still nice to see.

> 20:  d5561585c2 !  8:  bd492ec142 builtin/repack.c: add '--geometric' option
>     @@ Documentation/git-repack.txt: depth is 4095.
> [...]
>     ++Unlike other repack modes, the set of objects to pack is determined
>     ++uniquely by the set of packs being "rolled-up"; in other words, the
>     ++packs determined to need to be combined in order to restore a geometric
>     ++progression.

And this is the "clarify roll-up" bit I asked for. Looks good.

>     ++Loose objects are implicitly included in this "roll-up", without respect
>     ++to their reachability. This is subject to change in the future. This
>     ++option (implying a drastically different repack mode) is not guarenteed
>     ++to work with all other combinations of option to `git repack`).

Likewise, this is a big improvement. But should it make it clear that
touching loose objects requires --unpacked? I.e., something like:

  When `--unpacked` is specified, loose objects are included in this
  "roll-up" without respect to their reachability...

Also, s/guarenteed/guaranteed/.

-Peff


* Re: [PATCH v3 0/8] repack: support repacking into a geometric sequence
  2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
@ 2021-02-23  1:06     ` Taylor Blau
  2021-02-23  1:42       ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  1:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster

On Mon, Feb 22, 2021 at 07:31:12PM -0500, Jeff King wrote:
> On Wed, Feb 17, 2021 at 10:14:11PM -0500, Taylor Blau wrote:
>
> > Here is another updated version of mine and Peff's series to add a new 'git
> > repack --geometric' mode which supports repacking a repository into a geometric
> > progression of packs by object count.
>
> Thanks. This version looks pretty good to me. I have a few inline
> comments below. Mostly just observations, but there are a couple tiny nits
> that I think may justify one more re-roll.

Thanks for taking a look; I agree that your comments do justify a
re-roll. But I think that one can be done without touching any of the
code (or maybe one line of code), depending on my question below.

Let's see...

> > [snip documentation]
>
> We don't have a real procedure for marking something as "off limits" for
> users. IMHO omitting it from the documentation and putting an explicit
> note in the commit message is probably enough. It would be perhaps
> stronger to mark it explicitly as "do not touch" in the documentation,
> but then we are polluting the documentation. :)

I agree; and the second paragraph in the quoted snippet is the "do not
touch" one. So I think this one is good as-is.

> I also wondered if we could piggy-back on the sorting of packed_git,
> which is already in reverse chronological order. But here our primary
> structure is the string-list, so we lose that order.
>
> I'm not sure if your sort function is going the right way, though. It
> does:
>
> >     ++static int pack_mtime_cmp(const void *_a, const void *_b)
> >     ++{
> >     ++        struct packed_git *a = ((const struct string_list_item*)_a)->util;
> >     ++        struct packed_git *b = ((const struct string_list_item*)_b)->util;
> >     ++
> >     ++        if (a->mtime < b->mtime)
> >     ++                return -1;
> >     ++        else if (b->mtime < a->mtime)
> >     ++                return 1;
> >     ++        else
> >     ++                return 0;
> >     ++}
> >     ++
>
> Does that give us the packs in increasing chronological order, but then
> decreasing chronological order within the packs themselves?

I agree we should be sorting and not blindly accepting the order that
the caller gave us, but...

"chronological order within the packs themselves" confuses me. I think
that you mean ordering objects within a pack by their offsets. If so,
then yes: this gives you the oldest pack first (and all of its objects
in their original order), then the second oldest (and all of its
objects) and so on.

Could you clarify a bit how you'd expect to sort the objects in two
packs?

> > 17:  b5081c01b5 !  5:  181c104a03 p5303: measure time to repack with keep
> >     @@ Metadata
> >       ## Commit message ##
> >          p5303: measure time to repack with keep
> >
> >     -    This is the same as the regular repack test, except that we mark the
> >     -    single base pack as "kept" and use --assume-kept-packs-closed. The
> >     -    theory is that this should be faster than the normal repack, because
> >     -    we'll have fewer objects to traverse and process.
> >     +    Add two new tests to measure repack performance. Both test split the
>
> s/test split/tests split/, I think.

Good eyes, thanks.

> >     +    Likewise, the scaling is pretty extreme on --stdin-packs:
> >     +
> >     +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
> >     +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
> >     +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)
> >
> >          That's because the code paths around handling .keep files are known to
> >          scale badly; they look in every single pack file to find each object.
>
> Your "that's because" is a little confusing to me. It certainly applies
> to the repack vs repack-with-kept comparisons for a given number of
> packs. But the scaling on the three --stdin-packs tests is high because
> each subsequent test is being asked to do a lot more work. But they're
> still cheaper than the matching "repack" case with a given number of
> packs. Just not _as_ cheap as they would be if the kept code weren't so
> slow.
>
> Would it make sense to reorder those two paragraphs?

I think so. I did add a tiny parenthetical after my "Likewise, the
scaling is pretty extreme [...]" to say "(but each subsequent test is
also being asked to do more work)".

> >     ++Loose objects are implicitly included in this "roll-up", without respect
> >     ++to their reachability. This is subject to change in the future. This
> >     ++option (implying a drastically different repack mode) is not guarenteed
> >     ++to work with all other combinations of option to `git repack`).
>
> Likewise, this is a big improvement. But should it make it clear that
> touching loose objects requires --unpacked? I.e., something like:
>
>   When `--unpacked` is specified, loose objects are included in this
>   "roll-up" without respect to their reachability...
>
> Also, s/guarenteed/guaranteed/.

Agreed on both, thanks.

Thanks,
Taylor


* Re: [PATCH v3 0/8] repack: support repacking into a geometric sequence
  2021-02-23  1:06     ` Taylor Blau
@ 2021-02-23  1:42       ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-23  1:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Mon, Feb 22, 2021 at 08:06:16PM -0500, Taylor Blau wrote:

> > >     ++static int pack_mtime_cmp(const void *_a, const void *_b)
> > >     ++{
> > >     ++        struct packed_git *a = ((const struct string_list_item*)_a)->util;
> > >     ++        struct packed_git *b = ((const struct string_list_item*)_b)->util;
> > >     ++
> > >     ++        if (a->mtime < b->mtime)
> > >     ++                return -1;
> > >     ++        else if (b->mtime < a->mtime)
> > >     ++                return 1;
> > >     ++        else
> > >     ++                return 0;
> > >     ++}
> > >     ++
> >
> > Does that give us the packs in increasing chronological order, but then
> > decreasing chronological order within the packs themselves?
> 
> I agree we should be sorting and not blindly accepting the order that
> the caller gave us, but...
> 
> "chronological order within the packs themselves" confuses me. I think
> that you mean ordering objects within a pack by their offsets. If so,
> then yes: this gives you the oldest pack first (and all of its objects
> in their original order), then the second oldest (and all of its
> objects) and so on.
> 
> Could you clarify a bit how you'd expect to sort the objects in two
> packs?

Yes, by "within the packs themselves" I meant the physical order of
objects within an individual pack (sorted by their offsets, as we'd get
from for_each_object_in_pack). We would generally expect that to be
"newest first" within a given pack (modulo some other heuristics, but we
generally follow traversal order from rev-list).

So if the packs themselves are in oldest-first order, won't that create
a weird discontinuity at the pack boundaries?

E.g., imagine we have a linear sequence of commits A..Z in chronological
order, stored in two packs of equal size. Something like:

  tick=1234567890
  commit() {
    tick=$((tick+10))
    export GIT_COMMITTER_DATE="@$tick +0000"
    git commit --allow-empty -m $1
  }

  for i in $(perl -le 'print for A..M'); do commit $i; done
  git repack -d
  sleep 5
  for i in $(perl -le 'print for N..Z'); do commit $i; done
  git repack -d

Since "repack -d" will use a traversal to decide which objects to pack,
the two packs will have their commits in reverse chronological order:
M..A and Z..N. You can verify that with:

  for idx in $(ls -rt .git/objects/pack/*.idx); do
    stat --format='==> %y %n' $idx
    git show-index <$idx |
    sort -n |
    awk '{print $2}' |
    git --no-pager log --no-walk=unsorted --stdin --format=%s
  done

And if we then ran "git repack -ad" to make a new pack, it would be in
newest-to-oldest Z..A order.

But if instead we concatenate the packs after sorting them in
oldest-first order, we'll end up with a pack that contains M..A, then
Z..N. We instead want newest packs first (and then newest objects within
that pack, which is the pack order), then oldest.

In other words, I think your comparison function should be reversed
(return "1" when a->mtime < b->mtime).

(Of course these orders aren't perfect; in a real pack you'd have
non-commit objects, and we'd tweak the write order to keep delta
families together, etc. But our "best guess" should keep packs and
objects-within-packs consistent in newest-first order).
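
As a minimal sketch of the reversed comparison suggested above: the
struct and function names below are hypothetical stand-ins (the real
code compares 'struct packed_git' entries reached through a
string_list_item), so treat this as an illustration of the sort
direction only.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the one field of 'struct packed_git' used here. */
struct toy_mtime_pack {
	const char *name;
	time_t mtime;
};

/* Newest pack first: a sorts after b when a is older than b. */
static int toy_pack_mtime_cmp(const void *_a, const void *_b)
{
	const struct toy_mtime_pack *a = _a;
	const struct toy_mtime_pack *b = _b;

	if (a->mtime < b->mtime)
		return 1;
	else if (b->mtime < a->mtime)
		return -1;
	else
		return 0;
}

/* Sort an array of packs so that the most recently modified comes first. */
static void toy_sort_packs_newest_first(struct toy_mtime_pack *packs, size_t nr)
{
	qsort(packs, nr, sizeof(*packs), toy_pack_mtime_cmp);
}
```

With packs sorted this way, concatenating each pack's objects in pack
order gives Z..N followed by M..A in the example above.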

-Peff


* [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
                   ` (12 preceding siblings ...)
  2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
@ 2021-02-23  2:24 ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
                     ` (9 more replies)
  13 siblings, 10 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Here's a very lightly modified version of v3 of mine and Peff's series
to add a new 'git repack --geometric' mode. Almost nothing has changed
since last time, with the exception of:

  - Packs listed over standard input to 'git pack-objects --stdin-packs'
    are sorted in descending mtime order (and objects are strung
    together in pack order as before) so that objects are laid out
    roughly newest-to-oldest in the resulting pack.

  - Swapped the order of two paragraphs in patch 5 to make the perf
    results clearer.

  - Mention '--unpacked' specifically in the documentation for 'git
    repack --geometric'.

  - Typo fixes.

Range-diff is below. It would be good to start merging this down since
we have a release candidate coming up soon, and I'd rather focus future
reviewer efforts on the multi-pack reverse index and bitmaps series
instead of this one.

Jeff King (4):
  p5303: add missing &&-chains
  p5303: measure time to repack with keep
  builtin/pack-objects.c: rewrite honor-pack-keep logic
  packfile: add kept-pack cache for find_kept_pack_entry()

Taylor Blau (4):
  packfile: introduce 'find_kept_pack_entry()'
  revision: learn '--no-kept-objects'
  builtin/pack-objects.c: add '--stdin-packs' option
  builtin/repack.c: add '--geometric' option

 Documentation/git-pack-objects.txt |  10 +
 Documentation/git-repack.txt       |  23 ++
 builtin/pack-objects.c             | 333 ++++++++++++++++++++++++-----
 builtin/repack.c                   | 187 +++++++++++++++-
 object-store.h                     |   5 +
 packfile.c                         |  67 ++++++
 packfile.h                         |   5 +
 revision.c                         |  15 ++
 revision.h                         |   4 +
 t/perf/p5303-many-packs.sh         |  36 +++-
 t/t5300-pack-object.sh             |  97 +++++++++
 t/t6114-keep-packs.sh              |  69 ++++++
 t/t7703-repack-geometric.sh        | 137 ++++++++++++
 13 files changed, 926 insertions(+), 62 deletions(-)
 create mode 100755 t/t6114-keep-packs.sh
 create mode 100755 t/t7703-repack-geometric.sh

Range-diff against v3:
1:  aa94edf39b = 1:  bb674e5119 packfile: introduce 'find_kept_pack_entry()'
2:  82f6b45463 = 2:  c85a915597 revision: learn '--no-kept-objects'
3:  033e4e3f67 ! 3:  649cf9020b builtin/pack-objects.c: add '--stdin-packs' option
    @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
     +	struct packed_git *a = ((const struct string_list_item*)_a)->util;
     +	struct packed_git *b = ((const struct string_list_item*)_b)->util;
     +
    ++	/*
    ++	 * order packs by descending mtime so that objects are laid out
    ++	 * roughly as newest-to-oldest
    ++	 */
     +	if (a->mtime < b->mtime)
    -+		return -1;
    -+	else if (b->mtime < a->mtime)
     +		return 1;
    ++	else if (b->mtime < a->mtime)
    ++		return -1;
     +	else
     +		return 0;
     +}
4:  f9a5faf773 = 4:  6de9f0c52b p5303: add missing &&-chains
5:  181c104a03 ! 5:  94e4f3ee3a p5303: measure time to repack with keep
    @@ Metadata
      ## Commit message ##
         p5303: measure time to repack with keep
     
    -    Add two new tests to measure repack performance. Both test split the
    +    Add two new tests to measure repack performance. Both tests split the
         repository into synthetic "pushes", and then leave the remaining objects
         in a big base pack.
     
    @@ Commit message
           5303.17: repack (1000)                      216.87(490.79+14.57)
           5303.18: repack with kept (1000)            665.63(938.87+15.76)
     
    -    Likewise, the scaling is pretty extreme on --stdin-packs:
    -
    -      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
    -      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
    -      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)
    -
         That's because the code paths around handling .keep files are known to
         scale badly; they look in every single pack file to find each object.
         Our solution to that was to notice that most repos don't have keep
    @@ Commit message
         single .keep, that part of pack-objects slows down again (even if we
         have fewer objects total to look at).
     
    +    Likewise, the scaling is pretty extreme on --stdin-packs (but each
    +    subsequent test is also being asked to do more work):
    +
    +      5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
    +      5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
    +      5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)
    +
         Signed-off-by: Jeff King <peff@peff.net>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
6:  67af143fd1 = 6:  a116587fb2 builtin/pack-objects.c: rewrite honor-pack-keep logic
7:  e9e04b95e7 = 7:  db9f07ec1a packfile: add kept-pack cache for find_kept_pack_entry()
8:  bd492ec142 ! 8:  51f57d5da2 builtin/repack.c: add '--geometric' option
    @@ Documentation/git-repack.txt: depth is 4095.
     +packs determined to need to be combined in order to restore a geometric
     +progression.
     ++
    -+Loose objects are implicitly included in this "roll-up", without respect
    -+to their reachability. This is subject to change in the future. This
    -+option (implying a drastically different repack mode) is not guarenteed
    -+to work with all other combinations of option to `git repack`).
    ++When `--unpacked` is specified, loose objects are implicitly included in
    ++this "roll-up", without respect to their reachability. This is subject
    ++to change in the future. This option (implying a drastically different
    ++repack mode) is not guaranteed to work with all other combinations of
    ++option to `git repack`).
     +
      Configuration
      -------------
-- 
2.30.0.667.g81c0cbc6fd


* [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()'
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 2/8] revision: learn '--no-kept-objects' Taylor Blau
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s).

In particular, a new 'git repack' mode which ensures the resulting
packs form a geometric progression by object count will mark packs that it
does not want to repack as "kept in-core", and it will want to halt a
reachability traversal as soon as it visits an object in any of the kept
packs. But, it does not want to halt the traversal at non-kept, or
.keep packs.

The obvious alternative is 'find_pack_entry()', but this doesn't quite
suffice since it only returns the first pack it finds, which may or may
not be kept (and the mru cache makes it unpredictable which one you'll
get if there are options).

Short of that, you could walk over all packs looking for the object in
each one, but it scales with the number of packs, which may be
prohibitive.

Introduce 'find_kept_pack_entry()', a function which is like
'find_pack_entry()', but only fills in objects in the kept packs.
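
The way the 'flags' argument selects which kind of kept pack counts can
be sketched on its own. The two flag values and the two field names
match the diff in this patch; the struct and the helper are hypothetical
and exist only for illustration.

```c
/* Values as defined in packfile.h by this patch. */
#define ON_DISK_KEEP_PACKS 1
#define IN_CORE_KEEP_PACKS 2

/* Hypothetical stand-in for the two "kept" bits of 'struct packed_git'. */
struct toy_kept_pack {
	unsigned pack_keep:1;         /* has a .keep file on disk */
	unsigned pack_keep_in_core:1; /* marked kept by the running process */
};

/*
 * Return 1 if the pack counts as "kept" for the kinds requested in
 * flags, 0 otherwise. In this sketch flags == 0 matches no pack (the
 * patch's internal helper instead treats 0 as "any pack").
 */
static int toy_pack_is_kept(const struct toy_kept_pack *p, unsigned flags)
{
	return ((flags & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
	       ((flags & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core);
}
```

A caller that only cares about packs frozen by the current process would
pass IN_CORE_KEEP_PACKS alone, and a pack that merely has a .keep file
would then not match.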

Handle packs which have .keep files, as well as in-core kept packs
separately, since certain callers will want to distinguish one from the
other. (Though on-disk and in-core kept packs share the adjective
"kept", it is best to think of the two sets as independent.)

There is a gotcha when looking up objects that are duplicated in kept
and non-kept packs, particularly when the MIDX stores the non-kept
version and the caller asked for kept objects only. This could be
resolved by teaching the MIDX to resolve duplicates by always favoring
the kept pack (if one exists), but this breaks an assumption in existing
MIDXs, and so it would require a format change.

The benefit to changing the MIDX in this way is marginal, so we instead
have a more thorough check here which is explained with a comment.

Callers will be added in subsequent patches.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.h |  5 +++++
 2 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 1fec12ac5f..7f84f221ce 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2042,7 +2042,10 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static int find_one_pack_entry(struct repository *r,
+			       const struct object_id *oid,
+			       struct pack_entry *e,
+			       int kept_only)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2052,26 +2055,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (fill_midx_entry(r, oid, e, m))
+		if (!fill_midx_entry(r, oid, e, m))
+			continue;
+
+		if (!kept_only)
+			return 1;
+
+		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
+		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
-			list_move(&p->mru, &r->objects->packed_git_mru);
-			return 1;
+		if (p->multi_pack_index && !kept_only) {
+			/*
+			 * If this pack is covered by the MIDX, we'd have found
+			 * the object already in the loop above if it was here,
+			 * so don't bother looking.
+			 *
+			 * The exception is if we are looking only at kept
+			 * packs. An object can be present in two packs covered
+			 * by the MIDX, one kept and one not-kept. And as the
+			 * MIDX points to only one copy of each object, it might
+			 * have returned only the non-kept version above. We
+			 * have to check again to be thorough.
+			 */
+			continue;
+		}
+		if (!kept_only ||
+		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
+		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
+			if (fill_pack_entry(oid, e, p)) {
+				list_move(&p->mru, &r->objects->packed_git_mru);
+				return 1;
+			}
 		}
 	}
 	return 0;
 }
 
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+{
+	return find_one_pack_entry(r, oid, e, 0);
+}
+
+int find_kept_pack_entry(struct repository *r,
+			 const struct object_id *oid,
+			 unsigned flags,
+			 struct pack_entry *e)
+{
+	/*
+	 * Load all packs, including midx packs, since our "kept" strategy
+	 * relies on that. We're relying on the side effect of it setting up
+	 * r->objects->packed_git, which is a little ugly.
+	 */
+	get_all_packs(r);
+	return find_one_pack_entry(r, oid, e, flags);
+}
+
 int has_object_pack(const struct object_id *oid)
 {
 	struct pack_entry e;
 	return find_pack_entry(the_repository, oid, &e);
 }
 
+int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+{
+	struct pack_entry e;
+	return find_kept_pack_entry(the_repository, oid, flags, &e);
+}
+
 int has_pack_index(const unsigned char *sha1)
 {
 	struct stat st;
diff --git a/packfile.h b/packfile.h
index 4cfec9e8d3..3ae117a8ae 100644
--- a/packfile.h
+++ b/packfile.h
@@ -162,13 +162,18 @@ int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
+#define ON_DISK_KEEP_PACKS 1
+#define IN_CORE_KEEP_PACKS 2
+
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
  */
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e);
+int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e);
 
 int has_object_pack(const struct object_id *oid);
+int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
 int has_pack_index(const unsigned char *sha1);
 
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 2/8] revision: learn '--no-kept-objects'
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwards.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
between in-core and on-disk kept packs. That is: what kinds of kept pack
should stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).

Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.
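
The "cut off" situation described above can be shown with a toy model.
This is not git's revision walker: the struct, the single-parent
history, and both helpers are hypothetical, and they model only the
paragraph's description of stopping at a kept object.

```c
#include <stddef.h>

/* Hypothetical toy commit with at most one parent. */
struct toy_commit {
	int parent; /* index of the parent in the array, -1 for a root */
	int kept;   /* non-zero if the commit is found in a kept pack */
};

/*
 * Walk history from 'tip', halting at the first kept commit. Returns
 * the number of non-kept commits that were visited.
 */
static int toy_walk_until_kept(const struct toy_commit *commits, int tip)
{
	int visited = 0;
	int i;

	for (i = tip; i >= 0; i = commits[i].parent) {
		if (commits[i].kept)
			break; /* stop: do not look past a kept object */
		visited++;
	}
	return visited;
}

/* Count every non-kept commit, regardless of how it is reached. */
static int toy_count_non_kept(const struct toy_commit *commits, size_t nr)
{
	int count = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (!commits[i].kept)
			count++;
	return count;
}
```

For a history A <- B <- C where only B is kept, the walk from C visits
one non-kept commit even though two exist, because A sits behind B.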

Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it to rev-list, but being
able to spell this option out on the command-line makes some special
cases easier to test without promising to keep it behaving consistently
forever. Those tricky cases are exercised in t6114.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 revision.c            | 15 ++++++++++
 revision.h            |  4 +++
 t/t6114-keep-packs.sh | 69 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

diff --git a/revision.c b/revision.c
index b78733f508..dca2b8c801 100644
--- a/revision.c
+++ b/revision.c
@@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3795,6 +3805,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index e6be3c845e..a20a530d52 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@ struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -317,6 +318,9 @@ struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 2/8] revision: learn '--no-kept-objects' Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  8:07     ` Junio C Hamano
  2021-02-23  2:25   ` [PATCH v4 4/8] p5303: add missing &&-chains Taylor Blau
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

In an upcoming commit, 'git repack' will want to create a pack comprised
of all of the objects in some packs (the included packs) excluding any
objects in some other packs (the excluded packs).

This caller could iterate those packs themselves and feed the objects it
finds to 'git pack-objects' directly over stdin, but this approach has a
few downsides:

  - It requires every caller that wants to drive 'git pack-objects' in
    this way to implement pack iteration themselves. This forces the
    caller to think about details like what order objects are fed to
    pack-objects, which callers would likely rather not do.

  - If the set of objects in included packs is large, it requires
    sending a lot of data over a pipe, which is inefficient.

  - The caller is forced to keep track of the excluded objects, too, and
    make sure that it doesn't send any objects that appear in both
    included and excluded packs.

But the biggest downside is the lack of a reachability traversal.
Because the caller passes in a list of objects directly, those objects
don't get a namehash assigned to them, which can have a negative impact
on the delta selection process, causing 'git pack-objects' to fail to
find good deltas even when they exist.

The caller could formulate a reachability traversal themselves, but the
only way to drive 'git pack-objects' in this way is to do a full
traversal, and then remove objects in the excluded packs after the
traversal is complete. This can be detrimental to callers who care
about performance, especially in repositories with many objects.

Introduce 'git pack-objects --stdin-packs' which remedies these four
concerns.

'git pack-objects --stdin-packs' expects a list of pack names on stdin,
where 'pack-xyz.pack' denotes that pack as included, and
'^pack-xyz.pack' denotes it as excluded. The resulting pack includes all
objects that are present in at least one included pack, and aren't
present in any excluded pack.
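
The input convention and the selection rule above can be sketched in a
few lines. Both helpers are hypothetical and are not the functions added
by this patch; they only restate the '^' prefix rule and the
"included and not excluded" condition.

```c
/*
 * Hypothetical helper: classify one line of --stdin-packs input.
 * Stores the pack name (without any leading '^') in *name and returns
 * 1 for an excluded pack, 0 for an included one.
 */
static int toy_parse_stdin_pack_line(const char *line, const char **name)
{
	if (line[0] == '^') {
		*name = line + 1;
		return 1;
	}
	*name = line;
	return 0;
}

/*
 * An object belongs in the output if it is present in at least one
 * included pack and in no excluded pack.
 */
static int toy_want_object(int in_included, int in_excluded)
{
	return in_included && !in_excluded;
}
```

So an object that appears in both 'pack-xyz.pack' and '^pack-abc.pack'
is left out of the resulting pack.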

To address the delta selection problem, 'git pack-objects --stdin-packs'
works as follows. First, it assembles a list of objects that it is going
to pack, as above. Then, a reachability traversal is started, whose tips
are any commits mentioned in included packs. Upon visiting an object, we
find its corresponding object_entry in the to_pack list, and set its
namehash parameter appropriately.

To avoid the traversal visiting more objects than it needs to, the
traversal is halted upon encountering an object which can be found in an
excluded pack (by marking the excluded packs as kept in-core, and
passing --no-kept-objects=in-core to the revision machinery).

This can cause the traversal to halt early, for example if an object in
an included pack is an ancestor of ones in excluded packs. But stopping
early is OK, since filling in the namehash fields of objects in the
to_pack list is only additive (i.e., having it helps the delta selection
process, but leaving it blank doesn't impact the correctness of the
resulting pack).

Even still, it is unlikely that this hurts us much in practice, since
the 'git repack --geometric' caller (which is introduced in a later
commit) marks small packs as included, and large ones as excluded.
During ordinary use, the small packs usually represent pushes after a
large repack, and so are unlikely to be ancestors of objects that
already exist in the repository.

(I found it convenient while developing this patch to have 'git
pack-objects' report the number of objects which were visited and got
their namehash fields filled in during traversal. This is also included
in the below patch via trace2 data lines.)

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  10 ++
 builtin/pack-objects.c             | 202 ++++++++++++++++++++++++++++-
 t/t5300-pack-object.sh             |  97 ++++++++++++++
 3 files changed, 307 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..df533c3b19 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -85,6 +85,16 @@ base-name::
 	reference was included in the resulting packfile.  This
 	can be useful to send new tags to native Git clients.
 
+--stdin-packs::
+	Read the basenames of packfiles (e.g., `pack-1234abcd.pack`)
+	from the standard input, instead of object names or revision
+	arguments. The resulting pack contains all objects listed in the
+	included packs (those not beginning with `^`), excluding any
+	objects listed in the excluded packs (beginning with `^`).
++
+Incompatible with `--revs`, or options that imply `--revs` (such as
+`--all`), with the exception of `--unpacked`, which is compatible.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6d62aaf59a..6ee8e40665 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2986,6 +2986,190 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 	return git_default_config(k, v, cb);
 }
 
+/* Counters for trace2 output when in --stdin-packs mode. */
+static int stdin_packs_found_nr;
+static int stdin_packs_hints_nr;
+
+static int add_object_entry_from_pack(const struct object_id *oid,
+				      struct packed_git *p,
+				      uint32_t pos,
+				      void *_data)
+{
+	struct rev_info *revs = _data;
+	struct object_info oi = OBJECT_INFO_INIT;
+	off_t ofs;
+	enum object_type type;
+
+	display_progress(progress_state, ++nr_seen);
+
+	if (have_duplicate_entry(oid, 0))
+		return 0;
+
+	ofs = nth_packed_object_offset(p, pos);
+	if (!want_object_in_pack(oid, 0, &p, &ofs))
+		return 0;
+
+	oi.typep = &type;
+	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
+		die(_("could not get type of object %s in pack %s"),
+		    oid_to_hex(oid), p->pack_name);
+	else if (type == OBJ_COMMIT) {
+		/*
+		 * commits in included packs are used as starting points for the
+		 * subsequent revision walk
+		 */
+		add_pending_oid(revs, NULL, oid, 0);
+	}
+
+	stdin_packs_found_nr++;
+
+	create_object_entry(oid, type, 0, 0, 0, p, ofs);
+
+	return 0;
+}
+
+static void show_commit_pack_hint(struct commit *commit, void *_data)
+{
+	/* nothing to do; commits don't have a namehash */
+}
+
+static void show_object_pack_hint(struct object *object, const char *name,
+				  void *_data)
+{
+	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
+	if (!oe)
+		return;
+
+	/*
+	 * Our 'to_pack' list was constructed by iterating all objects packed in
+	 * included packs, and so doesn't have a non-zero hash field that you
+	 * would typically pick up during a reachability traversal.
+	 *
+	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
+	 * fields here in order to perhaps improve the delta selection
+	 * process.
+	 */
+	oe->hash = pack_name_hash(name);
+	oe->no_try_delta = name && no_try_delta(name);
+
+	stdin_packs_hints_nr++;
+}
+
+static int pack_mtime_cmp(const void *_a, const void *_b)
+{
+	struct packed_git *a = ((const struct string_list_item*)_a)->util;
+	struct packed_git *b = ((const struct string_list_item*)_b)->util;
+
+	/*
+	 * order packs by descending mtime so that objects are laid out
+	 * roughly as newest-to-oldest
+	 */
+	if (a->mtime < b->mtime)
+		return 1;
+	else if (b->mtime < a->mtime)
+		return -1;
+	else
+		return 0;
+}
+
+static void read_packs_list_from_stdin(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list include_packs = STRING_LIST_INIT_DUP;
+	struct string_list exclude_packs = STRING_LIST_INIT_DUP;
+	struct string_list_item *item = NULL;
+
+	struct packed_git *p;
+	struct rev_info revs;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	/*
+	 * Use a revision walk to fill in the namehash of objects in the include
+	 * packs. To save time, we'll avoid traversing through objects that are
+	 * in excluded packs.
+	 *
+	 * That may cause us to avoid populating all of the namehash fields of
+	 * all included objects, but our goal is best-effort, since this is only
+	 * an optimization during delta selection.
+	 */
+	revs.no_kept_objects = 1;
+	revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+	revs.blob_objects = 1;
+	revs.tree_objects = 1;
+	revs.tag_objects = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '^')
+			string_list_append(&exclude_packs, buf.buf + 1);
+		else
+			string_list_append(&include_packs, buf.buf);
+
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&include_packs);
+	string_list_sort(&exclude_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+
+		item = string_list_lookup(&include_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&exclude_packs, pack_name);
+
+		if (item)
+			item->util = p;
+	}
+
+	/*
+	 * First handle all of the excluded packs, marking them as kept in-core
+	 * so that later calls to add_object_entry() discard any objects that
+	 * are also found in excluded packs.
+	 */
+	for_each_string_list_item(item, &exclude_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = 1;
+	}
+
+	/*
+	 * Order packs by descending mtime; use QSORT directly to access the
+	 * string_list_item's ->util pointer, which string_list_sort() does not
+	 * provide.
+	 */
+	QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp);
+
+	for_each_string_list_item(item, &include_packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		for_each_object_in_pack(p,
+					add_object_entry_from_pack,
+					&revs,
+					FOR_EACH_OBJECT_PACK_ORDER);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	traverse_commit_list(&revs,
+			     show_commit_pack_hint,
+			     show_object_pack_hint,
+			     NULL);
+
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
+			   stdin_packs_found_nr);
+	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_hints",
+			   stdin_packs_hints_nr);
+
+	strbuf_release(&buf);
+	string_list_clear(&include_packs, 0);
+	string_list_clear(&exclude_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3489,6 +3673,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	struct strvec rp = STRVEC_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	int rev_list_index = 0;
+	int stdin_packs = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -3539,6 +3724,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_SET_INT_F(0, "indexed-objects", &rev_list_index,
 			      N_("include objects referred to by the index"),
 			      1, PARSE_OPT_NONEG),
+		OPT_BOOL(0, "stdin-packs", &stdin_packs,
+			 N_("read packs from stdin")),
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
@@ -3645,7 +3832,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--indexed-objects");
 	}
-	if (rev_list_unpacked) {
+	if (rev_list_unpacked && !stdin_packs) {
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
@@ -3690,8 +3877,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (filter_options.choice) {
 		if (!pack_to_stdout)
 			die(_("cannot use --filter without --stdout"));
+		if (stdin_packs)
+			die(_("cannot use --filter with --stdin-packs"));
 	}
 
+	if (stdin_packs && use_internal_rev_list)
+		die(_("cannot use internal rev list with --stdin-packs"));
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -3750,7 +3942,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (progress)
 		progress_state = start_progress(_("Enumerating objects"), 0);
-	if (!use_internal_rev_list)
+	if (stdin_packs) {
+		/* avoids adding objects in excluded packs */
+		ignore_packed_keep_in_core = 1;
+		read_packs_list_from_stdin();
+		if (rev_list_unpacked)
+			add_unreachable_loose_objects();
+	} else if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
 		get_object_list(rp.nr, rp.v);
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 392201cabd..7138a54595 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -532,4 +532,101 @@ test_expect_success 'prefetch objects' '
 	test_line_count = 1 donelines
 '
 
+test_expect_success 'setup for --stdin-packs tests' '
+	git init stdin-packs &&
+	(
+		cd stdin-packs &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+
+		for id in A B C
+		do
+			git pack-objects .git/objects/pack/pack-$id \
+				--incremental --revs <<-EOF
+			refs/tags/$id
+			EOF
+		done &&
+
+		ls -la .git/objects/pack
+	)
+'
+
+test_expect_success '--stdin-packs with excluded packs' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		git pack-objects test --stdin-packs <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx)
+		) >expect.raw &&
+		git show-index <$(ls test-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --filter' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --stdout \
+			--filter=blob:none </dev/null 2>err &&
+		test_i18ngrep "cannot use --filter with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs is incompatible with --revs' '
+	(
+		cd stdin-packs &&
+		test_must_fail git pack-objects --stdin-packs --revs out \
+			</dev/null 2>err &&
+		test_i18ngrep "cannot use internal rev list with --stdin-packs" err
+	)
+'
+
+test_expect_success '--stdin-packs with loose objects' '
+	(
+		cd stdin-packs &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+
+		test_commit D && # loose
+
+		git pack-objects test2 --stdin-packs --unpacked <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx) &&
+			git rev-list --objects --no-object-names \
+				refs/tags/C..refs/tags/D
+
+		) >expect.raw &&
+		ls -la . &&
+		git show-index <$(ls test2-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 4/8] p5303: add missing &&-chains
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (2 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 5/8] p5303: measure time to repack with keep Taylor Blau
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

These are in a helper function, so the usual chain-lint doesn't notice
them. This function is still not perfect, as it has some git invocations
on the left-hand-side of the pipe, but its primary purpose is timing,
not finding bugs or correctness issues.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index ce0c42cc9f..d90d714923 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -28,11 +28,11 @@ repack_into_n () {
 			push @commits, $_ if $. % 5 == 1;
 		}
 		print reverse @commits;
-	' "$1" >pushes
+	' "$1" >pushes &&
 
 	# create base packfile
 	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack
+	git pack-objects --delta-base-offset --revs staging/pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 5/8] p5303: measure time to repack with keep
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (3 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 4/8] p5303: add missing &&-chains Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

Add two new tests to measure repack performance. Both tests split the
repository into synthetic "pushes", and then leave the remaining objects
in a big base pack.

The first new test marks an empty pack as "kept" and then passes
--honor-pack-keep to avoid including objects in it. That doesn't change
the resulting pack, but it does let us compare to the normal repack case
to see how much overhead we add to check whether objects are kept or
not.

The other test is of --stdin-packs, which gives us a sense of how that
number scales based on the number of packs we provide as input. In each
of those tests, the empty pack isn't considered, but the residual pack
(objects that were left over and not included in one of the synthetic
push packs) is marked as kept.

(Note that in the single-pack case of the --stdin-packs test, there is
nothing to do since there are no non-excluded packs.)

Here are some timings on a recent clone of the kernel:

  5303.5: repack (1)                          57.26(54.59+10.84)
  5303.6: repack with kept (1)                57.33(54.80+10.51)

in the 50-pack case, things start to slow down:

  5303.11: repack (50)                        71.54(88.57+4.84)
  5303.12: repack with kept (50)              85.12(102.05+4.94)

and by the time we hit 1,000 packs, things are substantially worse, even
though the resulting pack produced is the same:

  5303.17: repack (1000)                      216.87(490.79+14.57)
  5303.18: repack with kept (1000)            665.63(938.87+15.76)

That's because the code paths around handling .keep files are known to
scale badly; they look in every single pack file to find each object.
Our solution to that was to notice that most repos don't have keep
files, and to make that case a fast path. But as soon as you add a
single .keep, that part of pack-objects slows down again (even if we
have fewer objects total to look at).

Likewise, the scaling is pretty extreme on --stdin-packs (but each
subsequent test is also being asked to do more work):

  5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)
  5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)
  5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5303-many-packs.sh | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index d90d714923..35c0cbdf49 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -31,8 +31,15 @@ repack_into_n () {
 	' "$1" >pushes &&
 
 	# create base packfile
-	head -n 1 pushes |
-	git pack-objects --delta-base-offset --revs staging/pack &&
+	base_pack=$(
+		head -n 1 pushes |
+		git pack-objects --delta-base-offset --revs staging/pack
+	) &&
+	test_export base_pack &&
+
+	# create an empty packfile
+	empty_pack=$(git pack-objects staging/pack </dev/null) &&
+	test_export empty_pack &&
 
 	# and then incrementals between each pair of commits
 	last= &&
@@ -49,6 +56,12 @@ repack_into_n () {
 		last=$rev
 	done <pushes &&
 
+	(
+		find staging -type f -name 'pack-*.pack' |
+			xargs -n 1 basename | grep -v "$base_pack" &&
+		printf "^pack-%s.pack\n" $base_pack
+	) >stdin.packs &&
+
 	# and install the whole thing
 	rm -f .git/objects/pack/* &&
 	mv staging/* .git/objects/pack/
@@ -91,6 +104,23 @@ do
 		  --reflog --indexed-objects --delta-base-offset \
 		  --stdout </dev/null >/dev/null
 	'
+
+	test_perf "repack with kept ($nr_packs)" '
+		git pack-objects --keep-true-parents \
+		  --keep-pack=pack-$empty_pack.pack \
+		  --honor-pack-keep --non-empty --all \
+		  --reflog --indexed-objects --delta-base-offset \
+		  --stdout </dev/null >/dev/null
+	'
+
+	test_perf "repack with --stdin-packs ($nr_packs)" '
+		git pack-objects \
+		  --keep-true-parents \
+		  --stdin-packs \
+		  --non-empty \
+		  --delta-base-offset \
+		  --stdout <stdin.packs >/dev/null
+	'
 done
 
 # Measure pack loading with 10,000 packs.
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (4 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 5/8] p5303: measure time to repack with keep Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

Now that we have find_kept_pack_entry(), we don't have to manually keep
hunting through every pack to find a possible "kept" duplicate of the
object. This should be faster, assuming only a portion of your total
packs are actually kept.

Note that we have to re-order the logic a bit here; we can deal with the
disqualifying situations first (e.g., finding the object in a non-local
pack with --local), then "kept" situation(s), and then just fall back to
other "--local" conditions.
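
The resulting order of checks can be modeled roughly as follows. This is
a simplified sketch and not the code in the diff below: the 'wf_' names
are hypothetical, the on-disk and in-core keep flags are folded into one
'ignore_keep' option, 'has_kept_copy' stands in for the result of
has_object_kept_pack(), and the 'exclude' argument is left out.

```c
/* Hypothetical, flattened view of the options that matter here. */
struct wf_opts {
	int local;                /* --local was given */
	int ignore_keep;          /* ignoring objects in kept packs */
	int have_non_local_packs; /* some packs are borrowed */
};

/* Hypothetical facts about the pack the object was found in. */
struct wf_found {
	int pack_local;    /* found pack is local */
	int pack_kept;     /* found pack is itself kept */
	int has_kept_copy; /* some kept pack holds the object */
};

/* Returns 1 (want), 0 (do not want), or -1 (check more packs). */
static int wf_want_found_object(const struct wf_opts *o,
				const struct wf_found *f)
{
	/* 1. Disqualifying situation: borrowed pack under --local. */
	if (o->local && !f->pack_local)
		return 0;

	/* 2. "Kept" situations, which have the fast(er) path. */
	if (o->ignore_keep && (f->pack_kept || f->has_kept_copy))
		return 0;

	/* 3. Fall back to the remaining --local conditions. */
	if (!o->local || !o->have_non_local_packs)
		return 1;

	return -1;
}
```

The point of the ordering is that step 2 can answer the "kept" question
for all packs at once, so only the --local case still needs the caller
to keep looking.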

Here are the results from p5303 (measurements again taken on the
kernel):

  Test                                        HEAD^                   HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.26(54.59+10.84)      57.34(54.66+10.88) +0.1%
  5303.6: repack with kept (1)                57.33(54.80+10.51)      57.38(54.83+10.49) +0.1%
  5303.11: repack (50)                        71.54(88.57+4.84)       71.70(88.99+4.74) +0.2%
  5303.12: repack with kept (50)              85.12(102.05+4.94)      72.58(89.61+4.78) -14.7%
  5303.17: repack (1000)                      216.87(490.79+14.57)    217.19(491.72+14.25) +0.1%
  5303.18: repack with kept (1000)            665.63(938.87+15.76)    246.12(520.07+14.93) -63.0%

and the --stdin-packs timings:

  5303.7: repack with --stdin-packs (1)       0.01(0.01+0.00)         0.00(0.00+0.00) -100.0%
  5303.13: repack with --stdin-packs (50)     3.53(12.07+0.24)        3.43(11.75+0.24) -2.8%
  5303.19: repack with --stdin-packs (1000)   195.83(371.82+8.10)     130.50(307.15+7.66) -33.4%

So our repack with an empty .keep pack is roughly as fast as one without
a .keep pack up to 50 packs. But the --stdin-packs case scales a little
better, too.

Notably, it is faster than a repack with the same number of packs and a
kept pack. It looks at fewer objects, of course, but the penalty for
looking at many packs isn't as high.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 131 ++++++++++++++++++++++++-----------------
 1 file changed, 78 insertions(+), 53 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ee8e40665..8cb32763b7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1188,7 +1188,8 @@ static int have_duplicate_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int want_found_object(const struct object_id *oid, int exclude,
+			     struct packed_git *p)
 {
 	if (exclude)
 		return 1;
@@ -1204,27 +1205,82 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 * make sure no copy of this object appears in _any_ pack that makes us
 	 * to omit the object, so we need to check all the packs.
 	 *
-	 * We can however first check whether these options can possible matter;
+	 * We can however first check whether these options can possibly matter;
 	 * if they do not matter we know we want the object in generated pack.
 	 * Otherwise, we signal "-1" at the end to tell the caller that we do
 	 * not know either way, and it needs to check more packs.
 	 */
-	if (!ignore_packed_keep_on_disk &&
-	    !ignore_packed_keep_in_core &&
-	    (!local || !have_non_local_packs))
-		return 1;
 
+	/*
+	 * Objects in packs borrowed from elsewhere are discarded regardless of
+	 * whether they appear in other packs that weren't borrowed.
+	 */
 	if (local && !p->pack_local)
 		return 0;
-	if (p->pack_local &&
-	    ((ignore_packed_keep_on_disk && p->pack_keep) ||
-	     (ignore_packed_keep_in_core && p->pack_keep_in_core)))
-		return 0;
+
+	/*
+	 * Then handle .keep first, as we have a fast(er) path there.
+	 */
+	if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) {
+		/*
+		 * Set the flags for the kept-pack cache to be the ones we want
+		 * to ignore.
+		 *
+		 * That is, if we are ignoring objects in on-disk keep packs,
+		 * then we want to search through the on-disk keep and ignore
+		 * the in-core ones.
+		 */
+		unsigned flags = 0;
+		if (ignore_packed_keep_on_disk)
+			flags |= ON_DISK_KEEP_PACKS;
+		if (ignore_packed_keep_in_core)
+			flags |= IN_CORE_KEEP_PACKS;
+
+		if (ignore_packed_keep_on_disk && p->pack_keep)
+			return 0;
+		if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+			return 0;
+		if (has_object_kept_pack(oid, flags))
+			return 0;
+	}
+
+	/*
+	 * At this point we know definitively that either we don't care about
+	 * keep-packs, or the object is not in one. Keep checking other
+	 * conditions...
+	 */
+	if (!local || !have_non_local_packs)
+		return 1;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
+static int want_object_in_pack_one(struct packed_git *p,
+				   const struct object_id *oid,
+				   int exclude,
+				   struct packed_git **found_pack,
+				   off_t *found_offset)
+{
+	off_t offset;
+
+	if (p == *found_pack)
+		offset = *found_offset;
+	else
+		offset = find_pack_entry_one(oid->hash, p);
+
+	if (offset) {
+		if (!*found_pack) {
+			if (!is_pack_valid(p))
+				return -1;
+			*found_offset = offset;
+			*found_pack = p;
+		}
+		return want_found_object(oid, exclude, p);
+	}
+	return -1;
+}
+
 /*
  * Check whether we want the object in the pack (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
@@ -1252,7 +1308,7 @@ static int want_object_in_pack(const struct object_id *oid,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
+		want = want_found_object(oid, exclude, *found_pack);
 		if (want != -1)
 			return want;
 	}
@@ -1260,53 +1316,22 @@ static int want_object_in_pack(const struct object_id *oid,
 	for (m = get_multi_pack_index(the_repository); m; m = m->next) {
 		struct pack_entry e;
 		if (fill_midx_entry(the_repository, oid, &e, m)) {
-			struct packed_git *p = e.p;
-			off_t offset;
-
-			if (p == *found_pack)
-				offset = *found_offset;
-			else
-				offset = find_pack_entry_one(oid->hash, p);
-
-			if (offset) {
-				if (!*found_pack) {
-					if (!is_pack_valid(p))
-						continue;
-					*found_offset = offset;
-					*found_pack = p;
-				}
-				want = want_found_object(exclude, p);
-				if (want != -1)
-					return want;
-			}
-		}
-	}
-
-	list_for_each(pos, get_packed_git_mru(the_repository)) {
-		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		off_t offset;
-
-		if (p == *found_pack)
-			offset = *found_offset;
-		else
-			offset = find_pack_entry_one(oid->hash, p);
-
-		if (offset) {
-			if (!*found_pack) {
-				if (!is_pack_valid(p))
-					continue;
-				*found_offset = offset;
-				*found_pack = p;
-			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
-				list_move(&p->mru,
-					  get_packed_git_mru(the_repository));
+			want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
 			if (want != -1)
 				return want;
 		}
 	}
 
+	list_for_each(pos, get_packed_git_mru(the_repository)) {
+		struct packed_git *p = list_entry(pos, struct packed_git, mru);
+		want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+		if (!exclude && want > 0)
+			list_move(&p->mru,
+				  get_packed_git_mru(the_repository));
+		if (want != -1)
+			return want;
+	}
+
 	if (uri_protocols.nr) {
 		struct configured_exclusion *ex =
 			oidmap_get(&configured_exclusions, oid);
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry()
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (5 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

From: Jeff King <peff@peff.net>

In a recent patch we added a function 'find_kept_pack_entry()' to look
for an object only among kept packs.

While this function avoids doing any lookup work in non-kept packs, it
is still linear in the number of packs, since we have to traverse the
linked list of packs once per object. Let's cache a reduced version of
that list to save us time.

Note that this cache will last the lifetime of the program. We could
invalidate it on reprepare_packed_git(), but there's not much point in
being rigorous here:

  - we might already fail to notice new .keep packs showing up after the
    program starts. We only reprepare_packed_git() when we fail to find
    an object. But adding a new pack won't cause that to happen.
    Somebody repacking could add a new pack and delete an old one, but
    most of the time we'd have a descriptor or mmap open to the old
    pack anyway, so we might not even notice.

  - in pack-objects we already cache the .keep state at startup, since
    56dfeb6263 (pack-objects: compute local/ignore_pack_keep early,
    2016-07-29). So this is just extending that concept further.

  - we don't have to worry about any packed_git being removed; we always
    keep the old structs around, even after reprepare_packed_git()

We do defensively invalidate the cache in case the set of kept packs
being asked for changes (e.g., only in-core kept packs were cached, but
suddenly the caller also wants on-disk kept packs). In theory we
could build all three caches and switch between them, but it's not
necessary, since this patch (and series) never changes the set of kept
packs that it wants to inspect from the cache.

So that "optimization" is more about being defensive in the face of
future changes than it is about asking for multiple kinds of kept packs
in this patch.
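
For readers who want the shape of the idea without reading the diff,
here is a minimal standalone sketch of a flag-keyed cache with the same
defensive invalidation. It is an illustration only: the 'kc_' names and
the 'rebuilds' counter are hypothetical and do not appear in the patch.

```c
#include <stdlib.h>

#define KC_ON_DISK_KEEP (1u << 0)
#define KC_IN_CORE_KEEP (1u << 1)

/* Hypothetical pack in a singly linked list of all packs. */
struct kc_pack {
	struct kc_pack *next;
	int keep_on_disk;
	int keep_in_core;
};

struct kc_cache {
	struct kc_pack **packs; /* NULL-terminated; NULL if not built */
	unsigned flags;         /* flags the cache was built for */
	int rebuilds;           /* illustration only: times (re)built */
};

/*
 * Return the NULL-terminated array of packs matching 'flags', building
 * it on first use and rebuilding it only if 'flags' changed. Returns
 * NULL if allocation fails.
 */
static struct kc_pack **kc_kept_packs(struct kc_cache *cache,
				      struct kc_pack *all, unsigned flags)
{
	struct kc_pack *p;
	size_t nr = 0, total = 0;

	/* Defensive invalidation: a different set of packs is wanted. */
	if (cache->packs && cache->flags != flags) {
		free(cache->packs);
		cache->packs = NULL;
		cache->flags = 0;
	}
	if (cache->packs)
		return cache->packs;

	for (p = all; p; p = p->next)
		total++;
	/* calloc() zeroes the array, which provides the terminator. */
	cache->packs = calloc(total + 1, sizeof(*cache->packs));
	if (!cache->packs)
		return NULL;

	for (p = all; p; p = p->next) {
		if (((flags & KC_ON_DISK_KEEP) && p->keep_on_disk) ||
		    ((flags & KC_IN_CORE_KEEP) && p->keep_in_core))
			cache->packs[nr++] = p;
	}
	cache->flags = flags;
	cache->rebuilds++;
	return cache->packs;
}
```

Repeated lookups with the same flags walk only the reduced array; the
full list of packs is traversed again only when the flags change.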

Here are p5303 results (as always, measured against the kernel):

  Test                                        HEAD^                   HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.34(54.66+10.88)      56.98(54.36+10.98) -0.6%
  5303.6: repack with kept (1)                57.38(54.83+10.49)      57.17(54.97+10.26) -0.4%
  5303.11: repack (50)                        71.70(88.99+4.74)       71.62(88.48+5.08) -0.1%
  5303.12: repack with kept (50)              72.58(89.61+4.78)       71.56(88.80+4.59) -1.4%
  5303.17: repack (1000)                      217.19(491.72+14.25)    217.31(490.82+14.53) +0.1%
  5303.18: repack with kept (1000)            246.12(520.07+14.93)    217.08(490.37+15.10) -11.8%

and the --stdin-packs case, which scales a little bit better (although
not by that much even at 1,000 packs):

  5303.7: repack with --stdin-packs (1)       0.00(0.00+0.00)         0.00(0.00+0.00) =
  5303.13: repack with --stdin-packs (50)     3.43(11.75+0.24)        3.43(11.69+0.30) +0.0%
  5303.19: repack with --stdin-packs (1000)   130.50(307.15+7.66)     125.13(301.36+8.04) -4.1%

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-store.h |  5 +++
 packfile.c     | 99 ++++++++++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 43 deletions(-)

diff --git a/object-store.h b/object-store.h
index 541dab0858..ec32c23dcb 100644
--- a/object-store.h
+++ b/object-store.h
@@ -153,6 +153,11 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	struct {
+		struct packed_git **packs;
+		unsigned flags;
+	} kept_pack_cache;
+
 	/*
 	 * A map of packfiles to packed_git structs for tracking which
 	 * packs have been loaded already.
diff --git a/packfile.c b/packfile.c
index 7f84f221ce..57d5b436fb 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2042,10 +2042,7 @@ static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-static int find_one_pack_entry(struct repository *r,
-			       const struct object_id *oid,
-			       struct pack_entry *e,
-			       int kept_only)
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2055,49 +2052,63 @@ static int find_one_pack_entry(struct repository *r,
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (!fill_midx_entry(r, oid, e, m))
-			continue;
-
-		if (!kept_only)
-			return 1;
-
-		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
-		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
+		if (fill_midx_entry(r, oid, e, m))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (p->multi_pack_index && !kept_only) {
-			/*
-			 * If this pack is covered by the MIDX, we'd have found
-			 * the object already in the loop above if it was here,
-			 * so don't bother looking.
-			 *
-			 * The exception is if we are looking only at kept
-			 * packs. An object can be present in two packs covered
-			 * by the MIDX, one kept and one not-kept. And as the
-			 * MIDX points to only one copy of each object, it might
-			 * have returned only the non-kept version above. We
-			 * have to check again to be thorough.
-			 */
-			continue;
-		}
-		if (!kept_only ||
-		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
-		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
-			if (fill_pack_entry(oid, e, p)) {
-				list_move(&p->mru, &r->objects->packed_git_mru);
-				return 1;
-			}
+		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
+			list_move(&p->mru, &r->objects->packed_git_mru);
+			return 1;
 		}
 	}
 	return 0;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static void maybe_invalidate_kept_pack_cache(struct repository *r,
+					     unsigned flags)
 {
-	return find_one_pack_entry(r, oid, e, 0);
+	if (!r->objects->kept_pack_cache.packs)
+		return;
+	if (r->objects->kept_pack_cache.flags == flags)
+		return;
+	FREE_AND_NULL(r->objects->kept_pack_cache.packs);
+	r->objects->kept_pack_cache.flags = 0;
+}
+
+static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+{
+	maybe_invalidate_kept_pack_cache(r, flags);
+
+	if (!r->objects->kept_pack_cache.packs) {
+		struct packed_git **packs = NULL;
+		size_t nr = 0, alloc = 0;
+		struct packed_git *p;
+
+		/*
+		 * We want "all" packs here, because we need to cover ones that
+		 * are used by a midx, as well. We need to look in every one of
+		 * them (instead of the midx itself) to cover duplicates. It's
+		 * possible that an object is found in two packs that the midx
+		 * covers, one kept and one not kept, but the midx returns only
+		 * the non-kept version.
+		 */
+		for (p = get_all_packs(r); p; p = p->next) {
+			if ((p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) ||
+			    (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS))) {
+				ALLOC_GROW(packs, nr + 1, alloc);
+				packs[nr++] = p;
+			}
+		}
+		ALLOC_GROW(packs, nr + 1, alloc);
+		packs[nr] = NULL;
+
+		r->objects->kept_pack_cache.packs = packs;
+		r->objects->kept_pack_cache.flags = flags;
+	}
+
+	return r->objects->kept_pack_cache.packs;
 }
 
 int find_kept_pack_entry(struct repository *r,
@@ -2105,13 +2116,15 @@ int find_kept_pack_entry(struct repository *r,
 			 unsigned flags,
 			 struct pack_entry *e)
 {
-	/*
-	 * Load all packs, including midx packs, since our "kept" strategy
-	 * relies on that. We're relying on the side effect of it setting up
-	 * r->objects->packed_git, which is a little ugly.
-	 */
-	get_all_packs(r);
-	return find_one_pack_entry(r, oid, e, flags);
+	struct packed_git **cache;
+
+	for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+		struct packed_git *p = *cache;
+		if (fill_pack_entry(oid, e, p))
+			return 1;
+	}
+
+	return 0;
 }
 
 int has_object_pack(const struct object_id *oid)
-- 
2.30.0.667.g81c0cbc6fd



* [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (6 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
@ 2021-02-23  2:25   ` Taylor Blau
  2021-02-24 23:19     ` Junio C Hamano
  2021-02-23  3:39   ` [PATCH v4 0/8] repack: support repacking into a geometric sequence Jeff King
  2021-02-23  7:43   ` Junio C Hamano
  9 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-23  2:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster

Often it is useful to both:

  - have relatively few packfiles in a repository, and

  - avoid having so few packfiles in a repository that we repack its
    entire contents regularly

This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile Pi has an object count equal to
'objects(Pi)'. With a geometric factor of 'r', it should be that:

  objects(Pi) > r*objects(P(i-1))

for all i in (1, n], where the packs are sorted by

  objects(P1) <= objects(P2) <= ... <= objects(Pn).

Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:

  1. We assume that there is a cutoff of packs _before starting the
     repack_ where everything to the right of that cut-off already forms
     a geometric progression (or no cutoff exists and everything must be
     repacked).

  2. We assume that everything smaller than the cutoff count must be
     repacked. This forms our base assumption, but it can also cause
     even the "heavy" packs to get repacked, e.g., if we have 6
     packs containing the following number of objects:

       1, 1, 1, 2, 4, 32

     then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
     rolling up the first two packs into a pack with 2 objects. That
     breaks our progression and leaves us:

       2, 1, 2, 4, 32
         ^

     (where the '^' indicates the position of our split). To restore a
     progression, we move the split forward (towards larger packs)
     joining each pack into our new pack until a geometric progression
     is restored. Here, that looks like:

       2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
         ^                   ^                ^                   ^

This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
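
As a rough illustration of the two-step approximation above, here is a
stand-alone sketch that works on a plain array of object counts instead
of 'struct packed_git'. The helper name 'geometric_split' is made up for
this example and is not part of the patch. It follows the '>='
comparison used by split_pack_geometry() below, and widens the
multiplication to 64 bits as a precaution (that widening is my
assumption, not something the patch does).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): 'counts' holds object
 * counts sorted in ascending order. Returns how many of the smallest
 * packs would be rolled up into the single new pack.
 */
size_t geometric_split(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	if (nr <= 1)
		return nr;

	/* Step 1: walk down from the largest pack while the progression holds. */
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if (counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}

	/* The larger pack of the failing pair has to be rolled up as well. */
	if (split)
		split++;

	/*
	 * Step 2: the new pack may be too big for its neighbor, so keep
	 * absorbing packs until the progression is restored.
	 */
	for (i = 0; i < split; i++)
		total += counts[i];
	for (i = split; i < nr; i++) {
		if (counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else
			break;
	}

	return split;
}
```

With a factor of 2 and the counts '1, 1, 1, 2, 4, 32' from the example
above, this returns 5: the five smallest packs are combined into one
pack of 9 objects, and only the 32-object pack is left alone.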

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  23 +++++
 builtin/repack.c             | 187 ++++++++++++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 137 +++++++++++++++++++++++++
 3 files changed, 343 insertions(+), 4 deletions(-)
 create mode 100755 t/t7703-repack-geometric.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 92f146d27d..136da9fa0b 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -165,6 +165,29 @@ depth is 4095.
 	Pass the `--delta-islands` option to `git-pack-objects`, see
 	linkgit:git-pack-objects[1].
 
+-g=<factor>::
+--geometric=<factor>::
+	Arrange resulting pack structure so that each successive pack
+	contains at least `<factor>` times the number of objects as the
+	next-largest pack.
++
+`git repack` ensures this by determining a "cut" of packfiles that need
+to be repacked into one in order to ensure a geometric progression. It
+picks the smallest set of packfiles such that as many of the larger
+packfiles (by count of objects contained in that pack) may be left
+intact.
++
+Unlike other repack modes, the set of objects to pack is determined
+uniquely by the set of packs being "rolled-up"; in other words, the
+packs determined to need to be combined in order to restore a geometric
+progression.
++
+When `--unpacked` is specified, loose objects are implicitly included in
+this "roll-up", without respect to their reachability. This is subject
+to change in the future. This option (implying a drastically different
+repack mode) is not guaranteed to work with all other combinations of
+options to `git repack`.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 01440de2d5..bcf280b10d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -297,6 +297,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
 
+struct pack_geometry {
+	struct packed_git **pack;
+	uint32_t pack_nr, pack_alloc;
+	uint32_t split;
+};
+
+static uint32_t geometry_pack_weight(struct packed_git *p)
+{
+	if (open_pack_index(p))
+		die(_("cannot open index for %s"), p->pack_name);
+	return p->num_objects;
+}
+
+static int geometry_cmp(const void *va, const void *vb)
+{
+	uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
+		 bw = geometry_pack_weight(*(struct packed_git **)vb);
+
+	if (aw < bw)
+		return -1;
+	if (aw > bw)
+		return 1;
+	return 0;
+}
+
+static void init_pack_geometry(struct pack_geometry **geometry_p)
+{
+	struct packed_git *p;
+	struct pack_geometry *geometry;
+
+	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
+	geometry = *geometry_p;
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (!pack_kept_objects && p->pack_keep)
+			continue;
+
+		ALLOC_GROW(geometry->pack,
+			   geometry->pack_nr + 1,
+			   geometry->pack_alloc);
+
+		geometry->pack[geometry->pack_nr] = p;
+		geometry->pack_nr++;
+	}
+
+	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
+}
+
+static void split_pack_geometry(struct pack_geometry *geometry, int factor)
+{
+	uint32_t i;
+	uint32_t split;
+	off_t total_size = 0;
+
+	if (geometry->pack_nr <= 1) {
+		geometry->split = geometry->pack_nr;
+		return;
+	}
+
+	split = geometry->pack_nr - 1;
+
+	/*
+	 * First, count the number of packs (in descending order of size) which
+	 * already form a geometric progression.
+	 */
+	for (i = geometry->pack_nr - 1; i > 0; i--) {
+		struct packed_git *ours = geometry->pack[i];
+		struct packed_git *prev = geometry->pack[i - 1];
+		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
+			split--;
+		else
+			break;
+	}
+
+	if (split) {
+		/*
+		 * Move the split one to the right, since the top element in the
+		 * last-compared pair can't be in the progression. Only do this
+		 * when we split in the middle of the array (otherwise if we got
+		 * to the end, then the split is in the right place).
+		 */
+		split++;
+	}
+
+	/*
+	 * Then, anything to the left of 'split' must be in a new pack. But,
+	 * creating that new pack may cause packs in the heavy half to no longer
+	 * form a geometric progression.
+	 *
+	 * Compute an expected size of the new pack, and then determine how many
+	 * packs in the heavy half need to be joined into it (if any) to restore
+	 * the geometric progression.
+	 */
+	for (i = 0; i < split; i++)
+		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = split; i < geometry->pack_nr; i++) {
+		struct packed_git *ours = geometry->pack[i];
+		if (geometry_pack_weight(ours) < factor * total_size) {
+			split++;
+			total_size += geometry_pack_weight(ours);
+		} else
+			break;
+	}
+
+	geometry->split = split;
+}
+
+static void clear_pack_geometry(struct pack_geometry *geometry)
+{
+	if (!geometry)
+		return;
+
+	free(geometry->pack);
+	geometry->pack_nr = 0;
+	geometry->pack_alloc = 0;
+	geometry->split = 0;
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -304,6 +422,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
@@ -316,6 +435,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	int geometric_factor = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -356,6 +476,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("do not repack this pack")),
+		OPT_INTEGER('g', "geometric", &geometric_factor,
+			    N_("find a geometric progression with factor <N>")),
 		OPT_END()
 	};
 
@@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
 		die(_(incremental_bitmap_conflict_error));
 
+	if (geometric_factor) {
+		if (pack_everything)
+			die(_("--geometric is incompatible with -A, -a"));
+		init_pack_geometry(&geometry);
+		split_pack_geometry(geometry, geometric_factor);
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
@@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_pushf(&cmd.args, "--keep-pack=%s",
 			     keep_pack_list.items[i].string);
 	strvec_push(&cmd.args, "--non-empty");
-	strvec_push(&cmd.args, "--all");
-	strvec_push(&cmd.args, "--reflog");
-	strvec_push(&cmd.args, "--indexed-objects");
+	if (!geometry) {
+		/*
+		 * 'git pack-objects' will pick up all objects loose or packed
+		 * (either rolling them up or leaving them alone), so don't pass
+		 * these options.
+		 *
+		 * The implementation of 'git pack-objects --stdin-packs'
+		 * makes them redundant (and the two are incompatible).
+		 */
+		strvec_push(&cmd.args, "--all");
+		strvec_push(&cmd.args, "--reflog");
+		strvec_push(&cmd.args, "--indexed-objects");
+	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
 	if (write_bitmaps > 0)
@@ -429,17 +568,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
 			}
 		}
+	} else if (geometry) {
+		strvec_push(&cmd.args, "--stdin-packs");
+		strvec_push(&cmd.args, "--unpacked");
 	} else {
 		strvec_push(&cmd.args, "--unpacked");
 		strvec_push(&cmd.args, "--incremental");
 	}
 
-	cmd.no_stdin = 1;
+	if (geometry)
+		cmd.in = -1;
+	else
+		cmd.no_stdin = 1;
 
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
 
+	if (geometry) {
+		FILE *in = xfdopen(cmd.in, "w");
+		/*
+		 * The resulting pack should contain all objects in packs that
+		 * are going to be rolled up, but exclude objects in packs which
+		 * are being left alone.
+		 */
+		for (i = 0; i < geometry->split; i++)
+			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
+		for (i = geometry->split; i < geometry->pack_nr; i++)
+			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+		fclose(in);
+	}
+
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		if (line.len != the_hash_algo->hexsz)
@@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			if (!string_list_has_string(&names, sha1))
 				remove_redundant_pack(packdir, item->string);
 		}
+
+		if (geometry) {
+			struct strbuf buf = STRBUF_INIT;
+
+			uint32_t i;
+			for (i = 0; i < geometry->split; i++) {
+				struct packed_git *p = geometry->pack[i];
+				if (string_list_has_string(&names,
+							   hash_to_hex(p->hash)))
+					continue;
+
+				strbuf_reset(&buf);
+				strbuf_addstr(&buf, pack_basename(p));
+				strbuf_strip_suffix(&buf, ".pack");
+
+				remove_redundant_pack(packdir, buf.buf);
+			}
+			strbuf_release(&buf);
+		}
 		if (!po_args.quiet && isatty(2))
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
@@ -528,6 +706,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
 	return 0;
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
new file mode 100755
index 0000000000..96917fc163
--- /dev/null
+++ b/t/t7703-repack-geometric.sh
@@ -0,0 +1,137 @@
+#!/bin/sh
+
+test_description='git repack --geometric works correctly'
+
+. ./test-lib.sh
+
+GIT_TEST_MULTI_PACK_INDEX=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success '--geometric with no packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		git repack --geometric 2 >out &&
+		test_i18ngrep "Nothing new to pack" out
+	)
+'
+
+test_expect_success '--geometric with an intact progression' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		test_commit_bulk --start=4 4 && # 12 objects
+
+		find $objdir/pack -name "*.pack" | sort >expect &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >actual &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success '--geometric with small-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		find $objdir/pack -name "*.pack" | sort >small &&
+		test_commit_bulk --start=3 4 && # 12 objects
+		test_commit_bulk --start=7 8 && # 24 objects
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		# Three packs in total; two of the existing large ones, and one
+		# new one.
+		find $objdir/pack -name "*.pack" | sort >after &&
+		test_line_count = 3 after &&
+		comm -3 small before | tr -d "\t" >large &&
+		grep -qFf large after
+	)
+'
+
+test_expect_success '--geometric with small- and large-pack rollup' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# size(small1) + size(small2) > size(medium) / 2
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 1 && # 3 objects
+		test_commit_bulk --start=2 3 && # 7 objects
+		test_commit_bulk --start=6 9 && # 27 objects
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+
+		git repack --geometric 2 -d &&
+
+		find $objdir/pack -name "*.pack" | sort >after &&
+		comm -12 before after >untouched &&
+
+		# Two packs in total; the largest pack from before running "git
+		# repack", and one new one.
+		test_line_count = 1 untouched &&
+		test_line_count = 2 after
+	)
+'
+
+test_expect_success '--geometric ignores kept packs' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit kept && # 3 objects
+		test_commit pack && # 3 objects
+
+		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/kept
+		EOF
+		) &&
+		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
+		refs/tags/pack
+		^refs/tags/kept
+		EOF
+		) &&
+
+		# neither pack contains more than twice the number of objects in
+		# the other, so they should be combined. but, marking one as
+		# .keep on disk will "freeze" it, so the pack structure should
+		# remain unchanged.
+		touch $objdir/pack/pack-$KEPT.keep &&
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >after &&
+
+		# both packs should still exist
+		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
+		test_path_is_file $objdir/pack/pack-$PACK.pack &&
+
+		# and no new packs should be created
+		test_cmp before after &&
+
+		# Passing --pack-kept-objects causes packs with a .keep file to
+		# be repacked, too.
+		git repack --geometric 2 -d --pack-kept-objects &&
+
+		find $objdir/pack -name "*.pack" >after &&
+		test_line_count = 1 after
+	)
+'
+
+test_done
-- 
2.30.0.667.g81c0cbc6fd


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (7 preceding siblings ...)
  2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-02-23  3:39   ` Jeff King
  2021-02-23  7:43   ` Junio C Hamano
  9 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-23  3:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster

On Mon, Feb 22, 2021 at 09:24:59PM -0500, Taylor Blau wrote:

> Here's a very lightly modified version of v3 of mine and Peff's series
> to add a new 'git repack --geometric' mode. Almost nothing has changed
> since last time, with the exception of:
> 
>   - Packs listed over standard input to 'git pack-objects --stdin-packs'
>     are sorted in descending mtime order (and objects are strung
>     together in pack order as before) so that objects are laid out
>     roughly newest-to-oldest in the resulting pack.
> 
>   - Swapped the order of two paragraphs in patch 5 to make the perf
>     results clearer.
> 
>   - Mention '--unpacked' specifically in the documentation for 'git
>     repack --geometric'.
> 
>   - Typo fixes.

Thanks, this all looks great to me.

-Peff


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
                     ` (8 preceding siblings ...)
  2021-02-23  3:39   ` [PATCH v4 0/8] repack: support repacking into a geometric sequence Jeff King
@ 2021-02-23  7:43   ` Junio C Hamano
  2021-02-23 18:44     ` Jeff King
  9 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-23  7:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:


>     ++	/*
>     ++	 * order packs by descending mtime so that objects are laid out
>     ++	 * roughly as newest-to-oldest
>     ++	 */
>      +	if (a->mtime < b->mtime)
>      +		return 1;
>     ++	else if (b->mtime < a->mtime)
>     ++		return -1;
>      +	else
>      +		return 0;

I think this strategy makes sense when this repack using this new
feature is run for the first time in a repository that acquired many
packs over time.  I am not sure what happens after the feature is
used a few times---it won't always be the newest sets of packs that
will be rewritten, but sometimes older ones are also coalesced, and
when that happens the resulting pack that consists primarily of older
objects would end up having a more recent timestamp, no?

Even then, I do agree that newer to older would be beneficial most
of the time, so this is of course not an objection against this
particular sort order.
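
For anyone following along, the descending-mtime ordering quoted above
can be tried in isolation with something like the sketch below. The
'struct pack_info' type and both function names are stand-ins invented
for this example; only the three-way comparison itself is taken from the
quoted hunk.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for pack metadata; only 'mtime' matters here. */
struct pack_info {
	const char *name;
	time_t mtime;
};

/* qsort() comparator: packs with a newer (larger) mtime sort first. */
int pack_mtime_cmp(const void *va, const void *vb)
{
	const struct pack_info *a = va;
	const struct pack_info *b = vb;

	if (a->mtime < b->mtime)
		return 1;
	else if (b->mtime < a->mtime)
		return -1;
	else
		return 0;
}

/* Sort an array of packs so that the newest pack comes first. */
void sort_packs_newest_first(struct pack_info *packs, size_t nr)
{
	qsort(packs, nr, sizeof(*packs), pack_mtime_cmp);
}
```

Note that a pack produced by an earlier roll-up gets a fresh mtime even
if most of its objects are old, which is exactly the case raised above.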


* Re: [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
@ 2021-02-23  8:07     ` Junio C Hamano
  2021-02-23 18:51       ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-23  8:07 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> (I found it convenient while developing this patch to have 'git
> pack-objects' report the number of objects which were visited and got
> their namehash fields filled in during traversal. This is also included
> in the below patch via trace2 data lines).

It does sound like a well thought out strategy to give name-hash to
entries that we may have to find good delta bases afresh, while
stopping upon hitting parts of the history we won't have to (either
because they are in "excluded" packs, which you did here, or because
they can take advantage of the "reuse existing delta base" logic [*],
which we may want to look further into in future follow-on topics).


[Footnote]

* I presume that such a logic may, instead of stopping at an object
  that is in an excluded pack, stop at an object that is stored in
  the current pack as a delta and its base is also going to be
  packed (and the latter by definition is always true, I presume, as
  everything in the included pack would be packed)




* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23  7:43   ` Junio C Hamano
@ 2021-02-23 18:44     ` Jeff King
  2021-02-23 19:54       ` Martin Fick
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-23 18:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:

> >     ++	/*
> >     ++	 * order packs by descending mtime so that objects are laid out
> >     ++	 * roughly as newest-to-oldest
> >     ++	 */
> >      +	if (a->mtime < b->mtime)
> >      +		return 1;
> >     ++	else if (b->mtime < a->mtime)
> >     ++		return -1;
> >      +	else
> >      +		return 0;
> 
> I think this strategy makes sense when this repack using this new
> feature is run for the first time in a repository that acquired many
> packs over time.  I am not sure what happens after the feature is
> used a few times---it won't always be the newest sets of packs that
> will be rewritten, but sometimes older ones are also coalesced, and
> when that happens the resulting pack that consists primarily of older
> objects would end up having a more recent timestamp, no?

Yeah, this is definitely a heuristic that can get out of sync with
reality. I think in general if you have base pack A and somebody pushes
up B, C, and D in sequence, we're likely to roll up a single DBC (in
that order) pack. Further pushes E, F, G would have newer mtimes. So we
might get GFEDBC directly. Or we might get GFE and DBC, but the former
would still have a newer mtime, so we'd create GFEDBC on the next run.

The issues come from:

  - we are deciding what to roll up based on size. A big push might not
    get rolled up immediately, putting it out-of-sync with the rest of
    the rollups.

  - we are happy to manipulate pack mtimes under the hood as part of the
    freshen_*() code.

I think you probably wouldn't want to use this roll-up strategy all the
time (even though in theory it would eventually roll up to a single good
pack), just because it is based on heuristics like this. You'd want to
occasionally run a "real" repack that does a full traversal, possibly
pruning objects, etc.

And that's how we plan to use it at GitHub. I don't remember how much of
the root problem we've discussed on-list, but the crux of it is:
per-object costs including traversal can get really high on big
repositories. Our shared-storage repo for all torvalds/linux forks is on
the order of 45M objects, and some companies with large and active
private repositories are close to that. Traversing the object graph
takes 15+ minutes (plus another 15 for delta island marking). For busy
repositories, by the time you finish repacking, it's time to start
again. :)

> Even then, I do agree that newer to older would be beneficial most
> of the time, so this is of course not an objection against this
> particular sort order.

So yeah. I consider this best-effort for sure, and I think this sort
order is the best we can do without traversing.

OTOH, we _do_ actually do a partial traversal in this latest version of
the series. We could use that to impact the final write order. It
doesn't necessarily hit every object, though, so we'd still want to fall
back on this pack ordering heuristic. I'm content to punt on that
work for now, and leave it for a future series after we see how this
heuristic performs in practice.

-Peff


* Re: [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option
  2021-02-23  8:07     ` Junio C Hamano
@ 2021-02-23 18:51       ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-23 18:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee

On Tue, Feb 23, 2021 at 12:07:52AM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > (I found it convenient while developing this patch to have 'git
> > pack-objects' report the number of objects which were visited and got
> > their namehash fields filled in during traversal. This is also included
> > in the below patch via trace2 data lines).
> 
> It does sound like a well thought out strategy to give name-hash to
> entries that we may have to find good delta bases afresh, while
> stopping upon hitting parts of the history we won't have to (either
> because they are in "excluded" packs, which you did here, or because
> they can take advantage of the "reuse existing delta base" logic [*],
> which we may want to look further into in future follow-on topics).
> 
> [Footnote]
> 
> * I presume that such a logic may, instead of stopping at an object
>   that is in an excluded pack, stop at an object that is stored in
>   the current pack as a delta and its base is also going to be
>   packed (and the latter by definition is always true, I presume, as
>   everything in the included pack would be packed)

I'm not sure if using deltas as a heuristic for stopping traversal makes
sense. They don't necessarily correspond to the history graph, or to
what was pushed. E.g., if I see that tree X is a delta against tree Y,
then we might say: if Y is not excluded by being in one of the base
packs, then we will reuse the delta. We do not need the namehash of X,
since we already know its delta.

But that does not tell us anything about the subtrees and blobs
contained in X. We still want to traverse X in order to find out _their_
name hashes, because it is likely that we will need to delta some of
those.

Of course if you see a blob that is a delta that you plan to reuse, you
know you can stop there. But by the time you get to it, you already know
its namehash, and there is nothing left to traverse. :)

-Peff


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 18:44     ` Jeff King
@ 2021-02-23 19:54       ` Martin Fick
  2021-02-23 20:06         ` Taylor Blau
  2021-02-23 20:15         ` Jeff King
  0 siblings, 2 replies; 120+ messages in thread
From: Martin Fick @ 2021-02-23 19:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Tuesday, February 23, 2021 1:44:05 PM MST Jeff King wrote:
> On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:
> > >     ++	/*
> > >     ++	 * order packs by descending mtime so that objects are laid out
> > >     ++	 * roughly as newest-to-oldest
> > >     ++	 */
> > >      +	if (a->mtime < b->mtime)
> > >      +		return 1;
> > >     ++	else if (b->mtime < a->mtime)
> > >     ++		return -1;
> > >      +	else
> > >      +		return 0;
> > 
> > I think this strategy makes sense when this repack using this new
> > feature is run for the first time in a repository that acquired many
> > packs over time.  I am not sure what happens after the feature is
> > used a few times---it won't always be the newest sets of packs that
> > will be rewritten, but sometimes older ones are also coalesced, and
> > when that happens the resulting pack that consists primarily of older
> > objects would end up having a more recent timestamp, no?
> 
> Yeah, this is definitely a heuristic that can get out of sync with
> reality. I think in general if you have base pack A and somebody pushes
> up B, C, and D in sequence, we're likely to roll up a single DBC (in
> that order) pack. Further pushes E, F, G would have newer mtimes. So we
> might get GFEDBC directly. Or we might get GFE and DBC, but the former
> would still have a newer mtime, so we'd create GFEDBC on the next run.
> 
> The issues come from:
> 
>   - we are deciding what to roll up based on size. A big push might not
>     get rolled up immediately, putting it out-of-sync with the rest of
>     the rollups.

Would it make sense to somehow detect all new packs since the last rollup and 
always include them in the rollup no matter what their size? That is one thing 
that my git-exproll script did. One of the main reasons to do this was because 
newer packs tended to look big (I was using bytes to determine size), and 
newer packs were often bigger on disk compared to other packs with similar 
objects in them (I think you suggested this was due to the thickening of packs 
on receipt). Maybe roll up all packs with a timestamp "new enough", no matter 
how big they are?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 19:54       ` Martin Fick
@ 2021-02-23 20:06         ` Taylor Blau
  2021-02-23 21:57           ` Martin Fick
  2021-02-23 20:15         ` Jeff King
  1 sibling, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-23 20:06 UTC (permalink / raw)
  To: Martin Fick; +Cc: Jeff King, Junio C Hamano, Taylor Blau, git, dstolee

On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> Would it make sense to somehow detect all new packs since the last rollup and
> always include them in the rollup no matter what their size? That is one thing
> that my git-exproll script did.

I'm certainly not opposed, and this could certainly be done in an
additive way (i.e., after this series). I think the current approach has
nice properties, but I could also see "roll-up all packs that have
mtimes after xyz timestamp" being useful.

It would even be possible to reuse a lot of the geometric repack
machinery. Having a separate path to arrange packs by their mtimes and
determine the "split" at the pack whose mtime is nearest the provided one
would do exactly what you want.
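
To make that a bit more concrete, here is a rough sketch of what such a
path could look like. It is not part of this series: 'struct toy_pack',
'toy_mtime_cmp()' and 'toy_mtime_split()' are made-up names that work on
bare values instead of 'struct packed_git', and the cutoff timestamp is
assumed to be handed in by the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for 'struct packed_git'; illustration only. */
struct toy_pack {
	time_t mtime;
	uint32_t num_objects;
};

/* Order packs by descending mtime, i.e. newest first. */
static int toy_mtime_cmp(const void *va, const void *vb)
{
	const struct toy_pack *a = va, *b = vb;

	if (a->mtime < b->mtime)
		return 1;
	if (a->mtime > b->mtime)
		return -1;
	return 0;
}

/*
 * Sort 'packs' newest-first and return the number of packs whose mtime
 * is strictly newer than 'cutoff'. Packs [0, split) would be rolled up,
 * packs [split, nr) would be left alone.
 */
static size_t toy_mtime_split(struct toy_pack *packs, size_t nr, time_t cutoff)
{
	size_t split = 0;

	assert(packs || !nr);
	qsort(packs, nr, sizeof(*packs), toy_mtime_cmp);
	while (split < nr && packs[split].mtime > cutoff)
		split++;
	return split;
}
```

The object count is carried along but unused here, since this variant
deliberately ignores size when choosing the split.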

(As a side-note, reading the original threads about your git-exproll was
quite humbling, since it turns out all of the problems I thought were
hard had already been discussed eight years ago!)

Thanks,
Taylor

* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 19:54       ` Martin Fick
  2021-02-23 20:06         ` Taylor Blau
@ 2021-02-23 20:15         ` Jeff King
  2021-02-23 21:41           ` Martin Fick
  1 sibling, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-23 20:15 UTC (permalink / raw)
  To: Martin Fick; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:

> > Yeah, this is definitely a heuristic that can get out of sync with
> > reality. I think in general if you have base pack A and somebody pushes
> > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > would still have a newer mtime, so we'd create GFEDBC on the next run.
> > 
> > The issues come from:
> > 
> >   - we are deciding what to roll up based on size. A big push might not
> >     get rolled up immediately, putting it out-of-sync with the rest of
> >     the rollups.
> 
> Would it make sense to somehow detect all new packs since the last rollup and 
> always include them in the rollup no matter what their size? That is one thing 
> that my git-exproll script did. One of the main reasons to do this was because 
> newer packs tended to look big (I was using bytes to determine size), and 
> newer packs were often bigger on disk compared to other packs with similar 
> objects in them (I think you suggested this was due to the thickening of packs 
> on receipt). Maybe roll up all packs with a timestamp "new enough", no matter 
> how big they are?

That works against the "geometric" part of the strategy, which is trying
to roll up in a sequence that is amortized-linear. I.e., we are not
always rolling up everything outside of the base pack, but trying to
roll up little into medium, and then eventually medium into large. If
you roll up things that are "too big", then you end up rewriting the
bytes more often, and your amount of work becomes super-linear.

Now whether that matters all that much or not is perhaps another
discussion. The current strategy is mostly to repack all-into-one with
no base, which is the worst possible case. So just about any rollup
strategy will be an improvement. ;)

-Peff

* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 20:15         ` Jeff King
@ 2021-02-23 21:41           ` Martin Fick
  2021-02-23 21:53             ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Martin Fick @ 2021-02-23 21:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Tuesday, February 23, 2021 3:15:12 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > > Yeah, this is definitely a heuristic that can get out of sync with
> > > reality. I think in general if you have base pack A and somebody pushes
> > > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > > would still have a newer mtime, so we'd create GFEDBC on the next run.
> > > 
> > > The issues come from:
> > >   - we are deciding what to roll up based on size. A big push might not
> > >     get rolled up immediately, putting it out-of-sync with the rest of
> > >     the rollups.
> > 
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did. One of the main reasons to do
> > this was because newer packs tended to look big (I was using bytes to
> > determine size), and newer packs were often bigger on disk compared to
> > other packs with similar objects in them (I think you suggested this was
> > due to the thickening of packs on receipt). Maybe roll up all packs with
> > a timestamp "new enough", no matter how big they are?
> 
> That works against the "geometric" part of the strategy, which is trying
> to roll up in a sequence that is amortized-linear. I.e., we are not
> always rolling up everything outside of the base pack, but trying to
> roll up little into medium, and then eventually medium into large. If
> you roll up things that are "too big", then you end up rewriting the
> bytes more often, and your amount of work becomes super-linear.

I'm not sure I follow, it would seem to me that it would stay linear, and be 
at most rewriting each new packfile once more than previously? Are you 
envisioning more work than that?

> Now whether that matters all that much or not is perhaps another
> discussion. The current strategy is mostly to repack all-into-one with
> no base, which is the worst possible case. So just about any rollup
> strategy will be an improvement. ;)

+1 Yes, while anything would be an improvement, this series' approach is very 
good! Thanks for doing this!!

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 21:41           ` Martin Fick
@ 2021-02-23 21:53             ` Jeff King
  2021-02-24 18:13               ` Martin Fick
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2021-02-23 21:53 UTC (permalink / raw)
  To: Martin Fick; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:

> > > Would it make sense to somehow detect all new packs since the last rollup
> > > and always include them in the rollup no matter what their size? That is
> > > one thing that my git-exproll script did. One of the main reasons to do
> > > this was because newer packs tended to look big (I was using bytes to
> > > determine size), and newer packs were often bigger on disk compared to
> > > other packs with similar objects in them (I think you suggested this was
> > > due to the thickening of packs on receipt). Maybe roll up all packs with
> > > a timestamp "new enough", no matter how big they are?
> > 
> > That works against the "geometric" part of the strategy, which is trying
> > to roll up in a sequence that is amortized-linear. I.e., we are not
> > always rolling up everything outside of the base pack, but trying to
> > roll up little into medium, and then eventually medium into large. If
> > you roll up things that are "too big", then you end up rewriting the
> > bytes more often, and your amount of work becomes super-linear.
> 
> I'm not sure I follow, it would seem to me that it would stay linear, and be 
> at most rewriting each new packfile once more than previously? Are you 
> envisioning more work than that?

Maybe I don't understand what you're proposing.

The idea of the geometric repack is that by sorting by size and then
finding a "cutoff" within the size array, we can make sure that we roll
up a sufficiently small number of bytes in each roll-up that it ends up
linear in the size of the repo in the long run. But if we roll up
without regard to size, then our worst case is that the biggest pack is
the newest (imagine a repo with 10 small pushes and then one gigantic
one). So we roll that up with some small packs, doing effectively
O(size_of_repo) work. And then in the next roll up we do it again, and
so on. So we end up with O(size_of_repo * nr_rollups) total work. Which
is no better than having just done a full repack at each rollup.

Now I don't think we'd see that worst case in practice that much. And
depending on your definition of "new enough", you might keep nr_rollups
pretty small.
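
To put rough numbers on that, here is a toy model (an illustration with
made-up names, not code from the series). It assumes every push adds a
one-object pack and that a repack runs after each push, and it counts how
many objects get written in total, either rolling everything up each time
or using the same two-pass split as the patch. Real pushes vary in size,
so the counts only show the general shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_PACKS 1024

static int toy_cmp(const void *va, const void *vb)
{
	uint64_t a = *(const uint64_t *)va, b = *(const uint64_t *)vb;

	return (a > b) - (a < b);
}

/* Two-pass split on ascending object counts; rolls up packs [0, split). */
static size_t toy_split(const uint64_t *counts, size_t nr, uint64_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	if (nr <= 1)
		return nr;
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if (counts[i] >= factor * counts[i - 1])
			split--;
		else
			break;
	}
	if (split)
		split++;
	for (i = 0; i < split; i++)
		total += counts[i];
	for (i = split; i < nr; i++) {
		if (counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else {
			break;
		}
	}
	return split;
}

/*
 * Simulate 'pushes' single-object pushes with a repack after each one and
 * return the total number of objects written by all repacks. A 'factor'
 * of 0 means "repack all-into-one every time".
 */
static uint64_t toy_total_work(size_t pushes, uint64_t factor)
{
	uint64_t packs[TOY_MAX_PACKS], work = 0;
	size_t nr = 0, n, i, split;

	assert(pushes < TOY_MAX_PACKS);
	for (n = 0; n < pushes; n++) {
		uint64_t rolled = 0;

		packs[nr++] = 1; /* the new push */
		qsort(packs, nr, sizeof(*packs), toy_cmp);
		split = factor ? toy_split(packs, nr, factor) : nr;
		if (split < 2)
			continue; /* nothing to combine */

		for (i = 0; i < split; i++)
			rolled += packs[i];
		work += rolled;
		/* Replace packs [0, split) with the single rolled-up pack. */
		packs[0] = rolled;
		for (i = split; i < nr; i++)
			packs[i - split + 1] = packs[i];
		nr -= split - 1;
	}
	return work;
}
```

With 8 pushes this model writes 35 objects in total for all-into-one
versus 16 for a factor of 2.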

-Peff

* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 20:06         ` Taylor Blau
@ 2021-02-23 21:57           ` Martin Fick
  0 siblings, 0 replies; 120+ messages in thread
From: Martin Fick @ 2021-02-23 21:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, Junio C Hamano, git, dstolee

On Tuesday, February 23, 2021 3:06:22 PM MST Taylor Blau wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did.
> 
> I'm certainly not opposed, and this could certainly be done in an
> additive way (i.e., after this series). I think the current approach has
> nice properties, but I could also see "roll-up all packs that have
> mtimes after xyz timestamp" being useful.

Just to be clear, I meant to combine the two approaches. And yes, my 
suggestion would likely make more sense as an additive switch later on.
 
> It would even be possible to reuse a lot of the geometric repack
> machinery. Having a separate path to arrange packs by their mtimes and
> determine the "split" at the pack whose mtime is nearest the provided one
> would do exactly what you want.

I was thinking to keep all of your geometric repack machinery and only looking 
for the split point starting at the right most pack which is newer than the 
provided mtime, and then possibly enhancing the approach with a clever way to 
use the mtime of the last consolidation (maybe by touching a pack/.geometric 
file?).

> (As a side-note, reading the original threads about your git-exproll was
> quite humbling, since it turns out all of the problems I thought were
> hard had already been discussed eight years ago!)

Thanks, but I think you have likely done a much better job than what I did. 
Your approach of using object counts is likely much better as it should be 
stable, using byte counts is not. You are also solving only one problem at a 
time, that's probably better than my hodge-podge of at least 3 different 
problems. And the most important part of your approach as I understand it, is 
that it actually saves CPU time whereas my approach only saved IO.

Cheers,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-23 21:53             ` Jeff King
@ 2021-02-24 18:13               ` Martin Fick
  2021-02-26  6:23                 ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Martin Fick @ 2021-02-24 18:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Tuesday, February 23, 2021 4:53:21 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:
> > > > Would it make sense to somehow detect all new packs since the last
> > > > rollup and always include them in the rollup no matter what their
> > > > size? That is one thing that my git-exproll script did. One of the
> > > > main reasons to do this was because newer packs tended to look big (I
> > > > was using bytes to determine size), and newer packs were often bigger
> > > > on disk compared to other packs with similar objects in them (I think
> > > > you suggested this was due to the thickening of packs on receipt).
> > > > Maybe roll up all packs with a timestamp "new enough", no matter how
> > > > big they are?
> > > 
> > > That works against the "geometric" part of the strategy, which is trying
> > > to roll up in a sequence that is amortized-linear. I.e., we are not
> > > always rolling up everything outside of the base pack, but trying to
> > > roll up little into medium, and then eventually medium into large. If
> > > you roll up things that are "too big", then you end up rewriting the
> > > bytes more often, and your amount of work becomes super-linear.
> > 
> > I'm not sure I follow, it would seem to me that it would stay linear, and
> > be at most rewriting each new packfile once more than previously? Are you
> > envisioning more work than that?
> 
> Maybe I don't understand what you're proposing.
> 
> The idea of the geometric repack is that by sorting by size and then
> finding a "cutoff" within the size array, we can make sure that we roll
> up a sufficiently small number of bytes in each roll-up that it ends up
> linear in the size of the repo in the long run. But if we roll up
> without regard to size, then our worst case is that the biggest pack is
> the newest (imagine a repo with 10 small pushes and then one gigantic
> one). So we roll that up with some small packs, doing effectively
> O(size_of_repo) work.

This isn't quite a fair evaluation, it should be O(size_of_push) I think?

> And then in the next roll up we do it again, and so on. 
 
I should have clarified that the intent is to prevent this by specifying an 
mtime after the last rollup so that this should only ever happen once for new 
packfiles. It also means you probably need special logic to ensure this roll-up 
doesn't happen if there would only be one file in the rollup.

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
  2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
@ 2021-02-24 23:19     ` Junio C Hamano
  2021-02-24 23:43       ` Junio C Hamano
  2021-03-04 21:55       ` Taylor Blau
  0 siblings, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-24 23:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Taylor Blau <me@ttaylorr.com> writes:

> Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
> ..., up to Pn. Each packfile Pi has an object count equal to 'objects(Pi)'.
> With a geometric factor of 'r', it should be that:
>
>   objects(Pi) > r*objects(P(i-1))
>
> for all i in (1, n], where the packs are sorted by
>
>   objects(P1) <= objects(P2) <= ... <= objects(Pn).
>
> Since finding a true optimal repacking is NP-hard, we approximate it
> along two directions:
>
>   1. We assume that there is a cutoff of packs _before starting the
>      repack_ where everything to the right of that cut-off already forms
>      a geometric progression (or no cutoff exists and everything must be
>      repacked).

When you order existing packs like how you explained the next
"direction" below, do we assume loose ones would sit before
(i.e. "newer and smaller" than) all of the packs?

>   2. We assume that everything smaller than the cutoff count must be
>      repacked. This forms our base assumption, but it can also cause
>      even the "heavy" packs to get repacked, for e.g., if we have 6
>      packs containing the following number of objects:
>
>        1, 1, 1, 2, 4, 32
>
>      then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
>      rolling up the first two packs into a pack with 2 objects. That
>      breaks our progression and leaves us:
>
>        2, 1, 2, 4, 32
>          ^
>
>      (where the '^' indicates the position of our split). To restore a
>      progression, we move the split forward (towards larger packs)
>      joining each pack into our new pack until a geometric progression
>      is restored. Here, that looks like:
>
>        2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
>          ^                   ^                ^                   ^

This explanation is very intuitive and easy to understand (I assume
we aren't actually repacking 1+1 into 2 and then 2+1 into 3 and then
choosing to repack 3+2 to create 5, but we scan before doing any
repacking and decide to repack 2+1+2+4 into a single 9).

What is not so clear is how this picture changes depending on the
value of 'r'.
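
To experiment with that, a standalone sketch of the two passes may help.
This is only an illustration, not the code in the patch: it works on a
bare, already sorted array of object counts, does its comparisons in 64
bits, and 'toy_split()' is a made-up name. Under that reading, the
1, 1, 1, 2, 4, 32 example rolls up the first five packs for both r = 2
and r = 3, but rolls up all six for r = 4, since 32 < 4 * 9.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only: 'counts' holds per-pack object counts sorted in
 * ascending order. Returns the split: packs [0, split) get rolled up
 * into one new pack, packs [split, nr) are left alone.
 */
static size_t toy_split(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	assert(factor > 0);
	if (nr <= 1)
		return nr;

	/* Walk down from the largest pack while the progression holds. */
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if (counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}
	/*
	 * Stopped in the middle: the larger pack of the last-compared
	 * pair gets rolled up as well.
	 */
	if (split)
		split++;

	/* Estimate the size of the new pack... */
	for (i = 0; i < split; i++)
		total += counts[i];
	/* ...and absorb heavier packs until the progression is restored. */
	for (i = split; i < nr; i++) {
		if (counts[i] < factor * total) {
			split++;
			total += counts[i];
		} else {
			break;
		}
	}
	return split;
}
```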

> ... Another wrinkle
> is that we assume that loose, indexed, and reflog'd objects are
> insignificant, and lump them into any new pack that we create.

In the example of 2 above, these are treated as insignificant
compared to the first '1' in the 1+1+1+2+4+32, so the choice of
repacked packs are made by computing 1+1+1+2+4 and noticing that is
where we should stop, but we pack these insignificant ones together
with these repacked packs into the new pack that is supposed to
contain "9" objects?

> This can
> lead to non-idempotent results.

Let me try to follow aloud to see if I got this right.

If we start from 1+1+1+2+4+32+... (similarly to the example given to
explain 2 above, but with more larger packs---but the assumption
here is that everything larger than 32 is already in good
progression), depending on how many loose objects we have, the
result of packing 1+1+1+2+4+loose might not necessarily be 9 but 100
(collecting too many loose objects), and the set of packs would be
32+... (from before the "repack -g") plus a 100-object pack, not
9+32+... as the above explanation for 2 suggested.  Starting from
that state, re-running "repack -g" again would then have to repack
the packs that existed before the first repack (i.e. 32+...) into one.
In other words, the second "git repack -g" in back-to-back "git
repack -g && git repack -g" may not necessarily be a no-op.

Is that what you meant by non-idempotent?

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 01440de2d5..bcf280b10d 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -297,6 +297,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
>  #define ALL_INTO_ONE 1
>  #define LOOSEN_UNREACHABLE 2
>  
> +struct pack_geometry {
> +	struct packed_git **pack;
> +	uint32_t pack_nr, pack_alloc;
> +	uint32_t split;
> +};
> +
> +static uint32_t geometry_pack_weight(struct packed_git *p)
> +{
> +	if (open_pack_index(p))
> +		die(_("cannot open index for %s"), p->pack_name);
> +	return p->num_objects;
> +}
> +
> +static int geometry_cmp(const void *va, const void *vb)
> +{
> +	uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
> +		 bw = geometry_pack_weight(*(struct packed_git **)vb);
> +
> +	if (aw < bw)
> +		return -1;
> +	if (aw > bw)
> +		return 1;
> +	return 0;
> +}
> +
> +static void init_pack_geometry(struct pack_geometry **geometry_p)
> +{
> +	struct packed_git *p;
> +	struct pack_geometry *geometry;
> +
> +	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
> +	geometry = *geometry_p;
> +
> +	for (p = get_all_packs(the_repository); p; p = p->next) {
> +		if (!pack_kept_objects && p->pack_keep)
> +			continue;
> +
> +		ALLOC_GROW(geometry->pack,
> +			   geometry->pack_nr + 1,
> +			   geometry->pack_alloc);
> +
> +		geometry->pack[geometry->pack_nr] = p;
> +		geometry->pack_nr++;
> +	}
> +
> +	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
> +}

After calling this helper, we get geometry->pack[] that is sorted by
the number of objects in each pack, packs with fewer objects sort
before the ones with more objects.  OK.

> +static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> +{
> +	uint32_t i;
> +	uint32_t split;
> +	off_t total_size = 0;
> +
> +	if (geometry->pack_nr <= 1) {
> +		geometry->split = geometry->pack_nr;
> +		return;
> +	}

When there is a single pack (or no pack), we set the split to the
number of packs, i.e. 1 (or 0).  Let's keep reading with the need to
find out what split means in mind; at this point in the code it is
not yet clear if it points at the pack that will be part of the kept
set, or at the pack that is the last one among the repacked set.

> +	split = geometry->pack_nr - 1;
> +
> +	/*
> +	 * First, count the number of packs (in descending order of size) which
> +	 * already form a geometric progression.
> +	 */
> +	for (i = geometry->pack_nr - 1; i > 0; i--) {
> +		struct packed_git *ours = geometry->pack[i];
> +		struct packed_git *prev = geometry->pack[i - 1];
> +		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
> +			split--;
> +		else
> +			break;
> +	}

Instead of rolling up from smaller ones like explained in the log
message, we scan from the larger end and see where the existing
progression is broken.  When the loop breaks in the middle, the pack
at position 'i-1' (prev) is too big.

Why do we need to initialize 'split' before the loop and decrement
it?  Wouldn't it be equivalent to assign 'i' after the loop breaks
to 'split'?

In any case, after the loop breaks, the packs starting at position
'i+1' (one after ours when the loop broke) thru to the end of the
geometry->pack[] array are in good progression.  We have 'i' in
'split' at this point, so ...

> +	if (split) {
> +		/*
> +		 * Move the split one to the right, since the top element in the
> +		 * last-compared pair can't be in the progression. Only do this
> +		 * when we split in the middle of the array (otherwise if we got
> +		 * to the end, then the split is in the right place).
> +		 */
> +		split++;
> +	}

... we increment it.  It means geometry->pack[split] is small enough
relative to geometry->pack[split+1] and so on thru to the end of the
array.

What if split==0 when we exited the loop?  That would mean that
everything in the array was in good progression, which is in line
with the "in the middle" case.  Either way, the pack at 'split' and
later are in good progression.

> +	/*
> +	 * Then, anything to the left of 'split' must be in a new pack. But,
> +	 * creating that new pack may cause packs in the heavy half to no longer
> +	 * form a geometric progression.
> +	 *
> +	 * Compute an expected size of the new pack, and then determine how many
> +	 * packs in the heavy half need to be joined into it (if any) to restore
> +	 * the geometric progression.
> +	 */
> +	for (i = 0; i < split; i++)
> +		total_size += geometry_pack_weight(geometry->pack[i]);

We guesstimate the number of objects in the rolled-up pack to be
created.  Some objects may appear in multiple packs, but the number
of them ought to be insignificant.  OK.

> +	for (i = split; i < geometry->pack_nr; i++) {
> +		struct packed_git *ours = geometry->pack[i];
> +		if (geometry_pack_weight(ours) < factor * total_size) {

If the pack at the bottom end of the range we previously thought to
keep turns out to be too small, we'd also roll that one in, by
shifting the split point to the right.  And of course we update the
expected size of the new pack.  OK.

> +			split++;
> +			total_size += geometry_pack_weight(ours);
> +		} else
> +			break;
> +	}
> +
> +	geometry->split = split;

The code makes me wonder if we can compute all of the above in a
single pass, but that is purely an intellectual curiosity.  The
logic in the code is crystal clear (the "what if everything was
already in a good progression" case was the only part that made me
stop and think about the correctness of the logic) and the
implementation looks good, except for a few small nits:

 - why initialize 'split' so early before the first loop, which I
   already mentioned.

 - we know many numbers are in uint32_t because that is how
   packfiles limit their contents, but is it safe to perform the
   multiplication with factor and comparison in that type?
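
A small sketch of the second concern, using made-up helper names: when
the product is computed in 32 bits it is taken modulo 2^32, so a large
enough object count makes it wrap and the comparison flip. Widening one
operand before multiplying avoids that. Whether real packs get anywhere
near such counts is a separate question; this only illustrates the
arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Models the first-loop comparison with the product truncated to 32
 * bits, as happens when both operands are 32-bit unsigned values.
 */
static int toy_in_progression_narrow(uint32_t ours, uint32_t prev,
				     uint32_t factor)
{
	return ours >= (uint32_t)(factor * prev);
}

/* Same comparison with the product computed in 64 bits. */
static int toy_in_progression_wide(uint32_t ours, uint32_t prev,
				   uint32_t factor)
{
	assert(factor > 0);
	return ours >= (uint64_t)factor * prev;
}
```

For example, with prev = 2200000000, ours = 2500000000 and a factor of
2, the narrow product wraps to 105032704 and the pair is wrongly
accepted, while the wide version compares against 4400000000 and
rejects it.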

>  int cmd_repack(int argc, const char **argv, const char *prefix)
>  {
>  	struct child_process cmd = CHILD_PROCESS_INIT;
> @@ -304,6 +422,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	struct string_list names = STRING_LIST_INIT_DUP;
>  	struct string_list rollback = STRING_LIST_INIT_NODUP;
>  	struct string_list existing_packs = STRING_LIST_INIT_DUP;
> +	struct pack_geometry *geometry = NULL;
>  	struct strbuf line = STRBUF_INIT;
>  	int i, ext, ret;
>  	FILE *out;
> @@ -316,6 +435,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
>  	int no_update_server_info = 0;
>  	struct pack_objects_args po_args = {NULL};
> +	int geometric_factor = 0;
>  
>  	struct option builtin_repack_options[] = {
>  		OPT_BIT('a', NULL, &pack_everything,
> @@ -356,6 +476,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				N_("repack objects in packs marked with .keep")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>  				N_("do not repack this pack")),
> +		OPT_INTEGER('g', "geometric", &geometric_factor,
> +			    N_("find a geometric progression with factor <N>")),
>  		OPT_END()
>  	};
>  
> @@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
>  		die(_(incremental_bitmap_conflict_error));
>  
> +	if (geometric_factor) {
> +		if (pack_everything)
> +			die(_("--geometric is incompatible with -A, -a"));
> +		init_pack_geometry(&geometry);
> +		split_pack_geometry(geometry, geometric_factor);
> +	}
> +
>  	packdir = mkpathdup("%s/pack", get_object_directory());
>  	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
>  
> @@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		strvec_pushf(&cmd.args, "--keep-pack=%s",
>  			     keep_pack_list.items[i].string);
>  	strvec_push(&cmd.args, "--non-empty");
> -	strvec_push(&cmd.args, "--all");
> -	strvec_push(&cmd.args, "--reflog");
> -	strvec_push(&cmd.args, "--indexed-objects");
> +	if (!geometry) {
> +		/*
> +		 * 'git pack-objects' will up all objects loose or packed

"git pack-objects --stdin-packs" will?
What verb is missing in "will VERB up all objects"?

> +		 * (either rolling them up or leaving them alone), so don't pass
> +		 * these options.
> +		 *
> +		 * The implementation of 'git pack-objects --stdin-packs'
> +		 * makes them redundant (and the two are incompatible).

I am not sure if that is true.

More importantly, if you read this comment after you are done with
the series and no longer feel that geometric repacking is the most
important thing in the world, you'd realize that an important piece
of information is missing to help readers.  It talks about what
"geometric" code does (i.e. uses --stdin-packs hence no need to pass
these options) in a block that is for !geometric.

	We need to grab all reachable objects, including those that
	are reachable from reflogs and the index.

	When repacking into a geometric progression of packs,
	however, we ask 'git pack-objects --stdin-packs', and it is
	not about packing objects based on reachability but about
	repacking all the objects in specified packs and loose ones
	(indeed, --stdin-packs is incompatible with these options).

or something?  I suspect that --stdin-packs does not make --all and
others "redundant".  The operation is about creating a new pack out
of the objects contained in these packs, regardless of the objects'
reachability from the usual "refs, index and reflogs" anchor points,
no?

> +		 */
> +		strvec_push(&cmd.args, "--all");
> +		strvec_push(&cmd.args, "--reflog");
> +		strvec_push(&cmd.args, "--indexed-objects");
> +	}
>  	if (has_promisor_remote())
>  		strvec_push(&cmd.args, "--exclude-promisor-objects");
>  	if (write_bitmaps > 0)
> @@ -429,17 +568,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1");
>  			}
>  		}
> +	} else if (geometry) {
> +		strvec_push(&cmd.args, "--stdin-packs");
> +		strvec_push(&cmd.args, "--unpacked");
>  	} else {
>  		strvec_push(&cmd.args, "--unpacked");
>  		strvec_push(&cmd.args, "--incremental");
>  	}
>  
> -	cmd.no_stdin = 1;
> +	if (geometry)
> +		cmd.in = -1;
> +	else
> +		cmd.no_stdin = 1;

It is a bit sad that we need to do this before start_command() in
that the code structure does not make it clear why two modes have
different handling of the standard input stream, but I do not think
of anything better, so I'll let it pass.

>  	ret = start_command(&cmd);
>  	if (ret)
>  		return ret;
>  
> +	if (geometry) {
> +		FILE *in = xfdopen(cmd.in, "w");
> +		/*
> +		 * The resulting pack should contain all objects in packs that
> +		 * are going to be rolled up, but exclude objects in packs which
> +		 * are being left alone.
> +		 */
> +		for (i = 0; i < geometry->split; i++)
> +			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
> +		for (i = geometry->split; i < geometry->pack_nr; i++)
> +			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
> +		fclose(in);
> +	}
> +
>  	out = xfdopen(cmd.out, "r");
>  	while (strbuf_getline_lf(&line, out) != EOF) {
>  		if (line.len != the_hash_algo->hexsz)
> @@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  			if (!string_list_has_string(&names, sha1))
>  				remove_redundant_pack(packdir, item->string);
>  		}
> +
> +		if (geometry) {
> +			struct strbuf buf = STRBUF_INIT;
> +
> +			uint32_t i;
> +			for (i = 0; i < geometry->split; i++) {
> +				struct packed_git *p = geometry->pack[i];
> +				if (string_list_has_string(&names,
> +							   hash_to_hex(p->hash)))
> +					continue;
> +
> +				strbuf_reset(&buf);
> +				strbuf_addstr(&buf, pack_basename(p));
> +				strbuf_strip_suffix(&buf, ".pack");
> +
> +				remove_redundant_pack(packdir, buf.buf);
> +			}
> +			strbuf_release(&buf);
> +		}

Before this new code, we seem to remove all pre-existing packfiles
that are not in the output from the pack-objects already.  The only
reason that code does not harm the geometry case is we assume
get_non_kept_pack_filenames() call is never made while doing
geometric repack (iow, ALL_INTO_ONE is not set) and the list of
pre-existing packfiles &existing_packs is empty.  Am I reading the
code correctly?

 - It is a bit unnerving to learn (and it will be a maintenance
   burden in the future) that a variable whose name is
   existing_packs does not necessarily have a list of existing packs
   depending on the mode we are operating in.

 - The guard to make geometric incompatible with ALL_INTO_ONE does
   not mention ALL_INTO_ONE, even though that bit is what would
   corrupt the resulting repository if overlooked.  We
   probably need s/pack_everything/& \& ALL_INTO_ONE/ in the hunk
   below.

> @@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
>  		die(_(incremental_bitmap_conflict_error));
>  
> +	if (geometric_factor) {
> +		if (pack_everything)
> +			die(_("--geometric is incompatible with -A, -a"));
> +		init_pack_geometry(&geometry);
> +		split_pack_geometry(geometry, geometric_factor);
> +	}
> +
>  	packdir = mkpathdup("%s/pack", get_object_directory());
>  	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
>  

Other than that, it was a fun patch to read.

Thanks.

* Re: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
  2021-02-24 23:19     ` Junio C Hamano
@ 2021-02-24 23:43       ` Junio C Hamano
  2021-03-04 21:40         ` Taylor Blau
  2021-03-04 21:55       ` Taylor Blau
  1 sibling, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-24 23:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

Junio C Hamano <gitster@pobox.com> writes:

> Let me try to follow aloud to see if I got this right.
>
> If we start from 1+1+1+2+4+32+... (similarly to the example given to
> explain 2 above, but with more larger packs---but the assumption
> here is that everything larger than 32 is already in good
> progression), depending on how many loose objects we have, the
> result of packing 1+1+1+2+4+loose might not necessarily be 9 but 100
> (collecting too many loose objects), and the set of packs would be
> 32+... (from before the "repack -g") plus a 100-object pack, not
> 9+32+... as the above explanation for 2 suggested.  Starting from
> that state, re-running "repack -g" again would then have to repack
> the packs that existed before the first repack (i.e. 32+...) into one.
> In other words, the second "git repack -g" in back-to-back "git
> repack -g && git repack -g" may necessarily be a no-op.

"... may not necessarily be a no-op" is what I should have typed here.

> Is that what you meant by non-idempotent?

And I think it makes sense for the repack to be non-idempotent.
Once we have packs in good progression, the only way to make
progress is to keep rolling loose objects up into the smallest pack
until it grows larger than the geometry factor allows it to be
relative to the next smallest pack.


* Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence
  2021-02-24 18:13               ` Martin Fick
@ 2021-02-26  6:23                 ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2021-02-26  6:23 UTC (permalink / raw)
  To: Martin Fick; +Cc: Junio C Hamano, Taylor Blau, git, dstolee

On Wed, Feb 24, 2021 at 11:13:53AM -0700, Martin Fick wrote:

> > The idea of the geometric repack is that by sorting by size and then
> > finding a "cutoff" within the size array, we can make sure that we roll
> > up a sufficiently small number of bytes in each roll-up that it ends up
> > linear in the size of the repo in the long run. But if we roll up
> > without regard to size, then our worst case is that the biggest pack is
> > the newest (imagine a repo with 10 small pushes and then one gigantic
> > one). So we roll that up with some small packs, doing effectively
> > O(size_of_repo) work.
> 
> This isn't quite a fair evaluation, it should be O(size_of_push) I think?

Sorry, I had a longer example, but then cut it down in the name of
simplicity. But I think I made it too simple. :)

You can imagine more pushes after the gigantic one, in which case we'd
roll them up with the gigantic push. So that gigantic one is part of
multiple sequential rollups, until it is itself rolled up further.

But...

> > And then in the next roll up we do it again, and so on. 
>  
> I should have clarified that the intent is to prevent this by specifying an 
> mtime after the last rollup so that this should only ever happen once for new 
> packfiles. It also means you probably need special logic to ensure this roll-up 
> doesn't happen if there would only be one file in the rollup, 

Yes, I agree that if you record a cut point, and then avoid rolling up
across it, then you'd only consider the single push once. You probably
want to record the actual pack set rather than just an mtime cutoff,
though, since Git will update the mtime on packs sometimes (to freshen
them whenever it optimizes out an object write for an object in the
pack).

One of the nice things about looking only at the pack sizes is that you
don't have to record that cut point. :) But it's possible you'd want to
for other reasons (e.g., you may spend extra work to find good deltas in
your on-disk packs, so you want to know what is old and what is new in
order to discard on-disk deltas from pushed-up packs).

-Peff

* Re: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
  2021-02-24 23:43       ` Junio C Hamano
@ 2021-03-04 21:40         ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-03-04 21:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, dstolee

On Wed, Feb 24, 2021 at 03:43:34PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Let me try to follow aloud to see if I got this right.
> >
> > If we start from 1+1+1+2+4+32+... (similarly to the example given to
> > explain 2 above, but with more larger packs---but the assumption
> > here is that everything larger than 32 is already in good
> > progression), depending on how many loose objects we have, the
> > result of packing 1+1+1+2+4+loose might not necessarily be 9 but 100
> > (collecting too many loose objects), and the set of packs would be
> > 32+... (from before the "repack -g") plus a 100-object pack, not
> > 9+32+... as the above explanation for 2 suggested.  Starting from
> > that state, re-running "repack -g" again would then have to repack
> > the packs that existed before the first repack (i.e. 32+...) into one.
> > In other words, the second "git repack -g" in back-to-back "git
> > repack -g && git repack -g" may necessarily be a no-op.
>
> "... may not necessarily be a no-op" is what I should have typed here.

Exactly.

> > Is that what you meant by non-idempotent?
>
> And I think it makes sense for the repack to be non-idempotent.
> Once we have packs in good progression, the only way to make
> progress is to keep rolling loose objects up into the smallest pack
> until it grows larger than the geometry factor allows it to be
> relative to the next smallest pack.

Right again. It *would* be idempotent if we didn't push any new objects
into the repository (and repacked it with the same geometric factor once
more to clean up any inconsistencies after creating a pack with loose
objects), which is what you'd expect.

Of course, pushing new objects into the repository means that the
progression will either grow (i.e., because the smallest pack in an
existing progression was quite large, and so we have some space to grow
smaller packs before rolling up the larger one), or it will get rolled
up.
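
To make the back-to-back case concrete, here is a rough simulation of
the scenario discussed above. It is only a sketch under stated
assumptions, not the code from the patch: sketch_split(),
sketch_repack() and cmp_u32() are hypothetical names, packs are
reduced to bare object counts, the "..." tail is stood in for by a
single 128-object pack, and the loose-object count (91) is picked so
that the first new pack ends up with the 100 objects from the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *va, const void *vb)
{
	uint32_t a = *(const uint32_t *)va, b = *(const uint32_t *)vb;
	return (a > b) - (a < b);
}

/* Hypothetical: number of smallest packs to roll up (counts sorted ascending). */
static size_t sketch_split(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	if (nr <= 1)
		return nr;

	/* Find where the existing progression breaks, scanning from the top. */
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if ((uint64_t)counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}
	if (split)
		split++;

	/* Absorb larger packs until the combined pack fits the progression. */
	for (i = 0; i < split; i++)
		total += counts[i];
	for (i = split; i < nr; i++) {
		if ((uint64_t)counts[i] >= (uint64_t)factor * total)
			break;
		total += counts[i];
		split++;
	}
	return split;
}

/*
 * One simulated pass: roll the packs below the split, plus 'loose' loose
 * objects, into a single pack. 'counts' must have room for nr + 1 entries.
 * Returns the new number of packs; 'counts' is left sorted ascending.
 */
static size_t sketch_repack(uint32_t *counts, size_t nr, uint32_t factor,
			    uint32_t loose)
{
	size_t i, split;
	uint32_t rolled = loose;

	qsort(counts, nr, sizeof(*counts), cmp_u32);
	split = sketch_split(counts, nr, factor);
	for (i = 0; i < split; i++)
		rolled += counts[i];
	if (!rolled)
		return nr; /* nothing to write */

	/* Drop the rolled-up packs and put the new pack in their place. */
	memmove(counts + 1, counts + split, (nr - split) * sizeof(*counts));
	counts[0] = rolled;
	nr = nr - split + 1;
	qsort(counts, nr, sizeof(*counts), cmp_u32);
	return nr;
}
```

Starting from 1, 1, 1, 2, 4, 32, 128 with a factor of 2 and 91 loose
objects, the first pass leaves 32, 100, 128. A second pass with no new
objects is not a no-op: 128 is smaller than 2 * 100, so everything is
rolled into a single 260-object pack. Only after that does the set of
packs stop changing.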

Thanks,
Taylor

* Re: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
  2021-02-24 23:19     ` Junio C Hamano
  2021-02-24 23:43       ` Junio C Hamano
@ 2021-03-04 21:55       ` Taylor Blau
  1 sibling, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-03-04 21:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, dstolee

On Wed, Feb 24, 2021 at 03:19:30PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
> > ..., up to Pn. Each packfile Pi has an object count of 'objects(Pi)'.
> > With a geometric factor of 'r', it should be that:
> >
> >   objects(Pi) > r*objects(P(i-1))
> >
> > for all i in (1, n], where the packs are sorted by
> >
> >   objects(P1) <= objects(P2) <= ... <= objects(Pn).
> >
> > Since finding a true optimal repacking is NP-hard, we approximate it
> > along two directions:
> >
> >   1. We assume that there is a cutoff of packs _before starting the
> >      repack_ where everything to the right of that cut-off already forms
> >      a geometric progression (or no cutoff exists and everything must be
> >      repacked).
>
> When you order existing packs like how you explained the next
> "direction" below, do we assume loose ones would sit before
> (i.e. "newer and smaller" than) all of the packs?

Kind of. We don't consider them to be part of any pack when deciding
where to place the split (in other words, we don't consider them at
all until the subsequent repack by which time they are packed).

That's a fine assumption to make (as you note in the reply below this
one), since we'll eventually reach a geometric progression. This
approximation can be as wrong as there are loose objects (but hopefully
there aren't so many by the time we want to do a geometric repack).

> >   2. We assume that everything smaller than the cutoff count must be
> >      repacked. This forms our base assumption, but it can also cause
> >      even the "heavy" packs to get repacked, for e.g., if we have 6
> >      packs containing the following number of objects:
> >
> >        1, 1, 1, 2, 4, 32
> >
> >      then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
> >      rolling up the first two packs into a pack with 2 objects. That
> >      breaks our progression and leaves us:
> >
> >        2, 1, 2, 4, 32
> >          ^
> >
> >      (where the '^' indicates the position of our split). To restore a
> >      progression, we move the split forward (towards larger packs)
> >      joining each pack into our new pack until a geometric progression
> >      is restored. Here, that looks like:
> >
> >        2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
> >          ^                   ^                ^                   ^
>
> This explanation is very intuitive and easy to understand (I assume
> we aren't actually repacking 1+1 into 2 and then 2+1 into 3 and then
> choosing to repack 3+2 to create 5, but we scan before doing any
> repacking and decide to repack 2+1+2+4 into a single 9).

Correct, and thanks. The split is determined ahead of time before we
actually get to writing any new packs.

> What is not so clear is how this picture changes depending on the
> value of 'r'.

It only means that subsequent packs need to contain at least 'r' times
as many objects as the previous pack does.
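
As a worked illustration of how the factor changes the picture, here
is a minimal sketch of the split computation operating on bare object
counts. It follows the scan-then-absorb description from the log
message rather than reproducing the patch; sketch_split() is a
hypothetical name, and the comparison is widened to 64 bits purely to
keep the sketch free of overflow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: 'counts' stands in for the per-pack object counts,
 * sorted ascending. Returns how many of the smallest packs get rolled up.
 */
static size_t sketch_split(const uint32_t *counts, size_t nr, uint32_t factor)
{
	size_t i, split;
	uint64_t total = 0;

	if (nr <= 1)
		return nr;

	/* Scan from the largest pack down while the progression holds. */
	split = nr - 1;
	for (i = nr - 1; i > 0; i--) {
		if ((uint64_t)counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}
	if (split)
		split++; /* the larger pack of the failing pair is rolled up too */

	/* Everything below the split becomes one new pack of 'total' objects. */
	for (i = 0; i < split; i++)
		total += counts[i];

	/* Absorb larger packs until the new pack fits the progression. */
	for (i = split; i < nr; i++) {
		if ((uint64_t)counts[i] < (uint64_t)factor * total) {
			total += counts[i];
			split++;
		} else {
			break;
		}
	}
	return split;
}
```

With the counts 1, 1, 1, 2, 4, 32 this returns 5 for a factor of 2 or
3 (leaving 9, 32), and 6 for a factor of 4, since 32 is smaller than
4 * 9 and so the largest pack has to be rolled up as well.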

> > +static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> > +{
> > +	uint32_t i;
> > +	uint32_t split;
> > +	off_t total_size = 0;
> > +
> > +	if (geometry->pack_nr <= 1) {
> > +		geometry->split = geometry->pack_nr;
> > +		return;
> > +	}
>
> When there is a single pack (or no pack), we place the split to 1
> (let's keep reading with the need to find out what split means in
> mind; it is not yet clear if it points at the pack that will be part
> of the kept set, or at the pack that is the last one among the
> repacked set, at this point in the code).

Everything that is strictly less than the split will get repacked, which
upon reading this again means that we'll repack a repository containing
just a single pack again. That's wasteful, so we may in the future want
to adjust this to set the split to 0 regardless of whether we have zero
or one pack here.

> > +	split = geometry->pack_nr - 1;
> > +
> > +	/*
> > +	 * First, count the number of packs (in descending order of size) which
> > +	 * already form a geometric progression.
> > +	 */
> > +	for (i = geometry->pack_nr - 1; i > 0; i--) {
> > +		struct packed_git *ours = geometry->pack[i];
> > +		struct packed_git *prev = geometry->pack[i - 1];
> > +		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
> > +			split--;
> > +		else
> > +			break;
> > +	}
>
> Instead of rolling up from smaller ones like explained in the log
> message, we scan from the larger end and see where the existing
> progression is broken.  When the loop breaks in the middle, the pack
> at position 'i-1' (prev) is too big.
>
> Why do we need to initialize 'split' before the loop and decrement
> it?  Wouldn't it be equivalent to assign 'i' after the loop breaks
> to 'split'?

Yep, they are equivalent.
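
A small sketch of why the two forms agree, written against bare
object counts (both function names are made up for illustration, and
both assume nr >= 2, as guaranteed by the early return in the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* As in the patch: start at the top and decrement while pairs conform. */
static size_t split_by_decrement(const uint32_t *counts, size_t nr,
				 uint32_t factor)
{
	size_t i, split = nr - 1;

	for (i = nr - 1; i > 0; i--) {
		if ((uint64_t)counts[i] >= (uint64_t)factor * counts[i - 1])
			split--;
		else
			break;
	}
	return split;
}

/* The suggested alternative: the loop index itself is the answer. */
static size_t split_by_index(const uint32_t *counts, size_t nr,
			     uint32_t factor)
{
	size_t i;

	for (i = nr - 1; i > 0; i--)
		if ((uint64_t)counts[i] < (uint64_t)factor * counts[i - 1])
			break;
	return i;
}
```

Every successful comparison at position i leaves 'split' at i - 1,
which is exactly the value 'i' takes on the next iteration, so the two
stay in lockstep whether the loop breaks early or runs down to zero.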

> In any case, after the loop breaks, the packs starting at position
> 'i+1' (one after ours when the loop broke) thru to the end of the
> geometry->pack[] array are in good progression.  We have 'i' in
> 'split' at this point, so ...
>
> > +	if (split) {
> > +		/*
> > +		 * Move the split one to the right, since the top element in the
> > +		 * last-compared pair can't be in the progression. Only do this
> > +		 * when we split in the middle of the array (otherwise if we got
> > +		 * to the end, then the split is in the right place).
> > +		 */
> > +		split++;
> > +	}
>
> ... we increment it.  It means geometry->pack[split] is small enough
> relative to geometry->pack[split+1] and so on thru to the end of the
> array.
>
> What if split==0 when we exited the loop?  That would mean that the
> everything in the array was in good progression, which is in line
> with the "in the middle" case.  Either way, the pack at 'split' and
> later are in good progression.

Right (and ditto that we wouldn't do anything if split==0 in that case).

>  - we know many numbers are in uint32_t because that is how
>    packfiles limit their contents, but is it safe to perform the
>    multiplication with factor and comparison in that type?

We could arguably be more careful here, yes.
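
One way to be more careful, sketched here with hypothetical helper
names and assuming the weights are uint32_t object counts, is to
widen both sides to 64 bits before multiplying:

```c
#include <stdint.h>

/* 32-bit arithmetic, as in the quoted comparison: the product can wrap. */
static int conforms_narrow(uint32_t ours, uint32_t prev, uint32_t factor)
{
	return ours >= (uint32_t)(factor * prev);
}

/* Widened to 64 bits: the product of two 32-bit values always fits. */
static int conforms_wide(uint32_t ours, uint32_t prev, uint32_t factor)
{
	return (uint64_t)ours >= (uint64_t)factor * prev;
}
```

For example, with a factor of 2 and a previous pack of 3,000,000,000
objects, the narrow product wraps around to 1,705,032,704, so a
2,000,000,000-object pack would wrongly be treated as conforming. The
wide variant rejects it.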

> > @@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  		strvec_pushf(&cmd.args, "--keep-pack=%s",
> >  			     keep_pack_list.items[i].string);
> >  	strvec_push(&cmd.args, "--non-empty");
> > -	strvec_push(&cmd.args, "--all");
> > -	strvec_push(&cmd.args, "--reflog");
> > -	strvec_push(&cmd.args, "--indexed-objects");
> > +	if (!geometry) {
> > +		/*
> > +		 * 'git pack-objects' will up all objects loose or packed
>
> "git pack-objects --stdin-packs" will?
> What verb is missing in "will VERB up all objects"?

Likely I meant to say "roll" here before "up".

> > +		 * (either rolling them up or leaving them alone), so don't pass
> > +		 * these options.
> > +		 *
> > +		 * The implementation of 'git pack-objects --stdin-packs'
> > +		 * makes them redundant (and the two are incompatible).
>
> I am not sure if that is true.
>
> More importantly, if you read this comment after you are done with
> the series and no longer feel that geometric repacking is the most
> important thing in the world, you'd realize that an important piece
> of information is missing to help readers.  It talks about what
> "geometric" code does (i.e. uses --stdin-packs hence no need to pass
> these options) in a block that is for !geometric.
>
> 	We need to grab all reachable objects, including those that
> 	are reachable from reflogs and the index.
>
> 	When repacking into a geometric progression of packs,
> 	however, we ask 'git pack-objects --stdin-packs', and it is
> 	not about packing objects based on reachability but about
> 	repacking all the objects in specified packs and loose ones
> 	(indeed, --stdin-packs is incompatible with these options).
>
> or something?  I suspect that --stdin-packs does not make --all and
> others "redundant".  The operation is about creating a new pack out
> of the objects contained in these packs, regardless of the objects'
> reachability from the usual "refs, index and reflogs" anchor points,
> no?

Exactly right. And I am certainly in favor of your wording above. Since
this series is already on next, I'd be happy to pick this up with the
few other minor things above in a separate series to apply on top (but
since I don't think any of these are correctness issues, you should feel
free to continue merging this down in the meantime).

> > @@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  			if (!string_list_has_string(&names, sha1))
> >  				remove_redundant_pack(packdir, item->string);
> >  		}
> > +
> > +		if (geometry) {
> > +			struct strbuf buf = STRBUF_INIT;
> > +
> > +			uint32_t i;
> > +			for (i = 0; i < geometry->split; i++) {
> > +				struct packed_git *p = geometry->pack[i];
> > +				if (string_list_has_string(&names,
> > +							   hash_to_hex(p->hash)))
> > +					continue;
> > +
> > +				strbuf_reset(&buf);
> > +				strbuf_addstr(&buf, pack_basename(p));
> > +				strbuf_strip_suffix(&buf, ".pack");
> > +
> > +				remove_redundant_pack(packdir, buf.buf);
> > +			}
> > +			strbuf_release(&buf);
> > +		}
>
> Before this new code, we already seem to remove all pre-existing
> packfiles that are not in the output from pack-objects.  The only
> reason that code does not harm the geometry case is that we assume
> the get_non_kept_pack_filenames() call is never made while doing a
> geometric repack (iow, ALL_INTO_ONE is not set) and so the list of
> pre-existing packfiles &existing_packs is empty.  Am I reading the
> code correctly?
>
>  - It is a bit unnerving to learn (and it will be a maintenance
>    burden in the future) that a variable whose name is
>    existing_packs does not necessarily have a list of existing packs
>    depending on the mode we are operating in.
>
>  - The guard to make geometric incompatible with ALL_INTO_ONE does
>    not mention ALL_INTO_ONE, even though that bit is what would
>    corrupt the resulting repository if overlooked.  We
>    probably need s/pack_everything/& \& ALL_INTO_ONE/ in the hunk
>    below.

Eek, yes. This is because the geometric code takes its own view of the
pack directory when figuring out where to place the split line, and so it
seemed easier to have separate paths.

I'm not sure whether I maintain that that was a good idea in hindsight
;). Certainly it does create a little bit of a maintenance burden for
us. But they really are two different things: the geometric code really
wants to have the packs laid out in order of object count, while the
"existing" string_list wants packs laid out in lexicographic order of
their filename to check whether certain packs exist or not.
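
To illustrate the two views, here is a sketch with a made-up stand-in
for a pack (struct sketch_pack and both comparators are hypothetical,
not the structures used in builtin/repack.c). The same set of packs
sorts differently depending on which question is being asked:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a pack: just a name and an object count. */
struct sketch_pack {
	const char *name;
	uint32_t num_objects;
};

/* Ascending by object count: the order the geometric split wants. */
static int cmp_by_count(const void *va, const void *vb)
{
	const struct sketch_pack *a = va, *b = vb;

	if (a->num_objects < b->num_objects)
		return -1;
	if (a->num_objects > b->num_objects)
		return 1;
	return 0;
}

/* By name: the order a sorted string list wants for existence checks. */
static int cmp_by_name(const void *va, const void *vb)
{
	const struct sketch_pack *a = va, *b = vb;

	return strcmp(a->name, b->name);
}
```

Passing either comparator to qsort() gives one of the two layouts.
Neither order can serve both purposes, which is the underlying reason
the two code paths keep separate lists.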

> Other than that, it was a fun patch to read.

Thanks, I think the few suggestions you made here are good ones. I'll
put it on my to-do list of things to clean up in a separate little
series.

Since this is already in next, I would suggest continuing to merge it
down since none of these suggestions impact the patch's correctness.

Thanks,
Taylor

Thread overview (120+ messages):
2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-01-20 13:40   ` Derrick Stolee
2021-01-20 14:38     ` Taylor Blau
2021-01-29  2:33   ` Junio C Hamano
2021-01-29 18:38     ` Taylor Blau
2021-01-29 19:31     ` Jeff King
2021-01-29 20:20       ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
2021-01-29  3:10   ` Junio C Hamano
2021-01-29 19:13     ` Taylor Blau
2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
2021-01-29  3:21   ` Junio C Hamano
2021-01-29 19:19     ` Jeff King
2021-01-29 20:01       ` Taylor Blau
2021-01-29 20:25         ` Jeff King
2021-01-29 22:10           ` Taylor Blau
2021-01-29 22:57             ` Jeff King
2021-01-29 23:03             ` Junio C Hamano
2021-01-29 23:28               ` Taylor Blau
2021-02-02  3:04                 ` Taylor Blau
2021-01-29 23:31               ` Jeff King
2021-01-29 22:13           ` Junio C Hamano
2021-01-29 20:30       ` Junio C Hamano
2021-01-29 22:43         ` Jeff King
2021-01-29 22:53           ` Taylor Blau
2021-01-29 23:00             ` Jeff King
2021-01-29 23:10             ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 04/10] p5303: add missing &&-chains Taylor Blau
2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
2021-01-29  3:40   ` Junio C Hamano
2021-01-29 19:32     ` Jeff King
2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
2021-01-29 20:19         ` Eric Sunshine
2021-01-29 20:27           ` Jeff King
2021-01-29 20:36             ` Eric Sunshine
2021-01-29 22:11               ` Taylor Blau
2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
2021-01-29 22:10         ` Jeff King
2021-01-29 23:12           ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic Taylor Blau
2021-01-19 23:24 ` [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-01-19 23:24 ` [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin' Taylor Blau
2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
2021-01-20 13:59   ` Derrick Stolee
2021-01-20 14:34     ` Taylor Blau
2021-01-20 15:51       ` Derrick Stolee
2021-01-21  3:45     ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 10/10] builtin/repack.c: add '--geometric' option Taylor Blau
2021-01-20 14:05 ` [PATCH 00/10] repack: support repacking into a geometric sequence Derrick Stolee
2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-16 21:42     ` Jeff King
2021-02-16 21:48       ` Taylor Blau
2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-16 23:17     ` Jeff King
2021-02-17 18:35       ` Taylor Blau
2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-16 23:46     ` Jeff King
2021-02-17 18:59       ` Taylor Blau
2021-02-17 19:21         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-16 23:58     ` Jeff King
2021-02-17  0:02       ` Jeff King
2021-02-17 19:13       ` Taylor Blau
2021-02-17 19:25         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-17 16:05     ` Jeff King
2021-02-17 19:23       ` Taylor Blau
2021-02-17 19:29         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-17 17:11     ` Jeff King
2021-02-17 19:54       ` Taylor Blau
2021-02-17 20:25         ` Jeff King
2021-02-17 20:29           ` Taylor Blau
2021-02-17 21:43             ` Jeff King
2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-17 18:17     ` Jeff King
2021-02-17 20:01       ` Taylor Blau
2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-17 18:18     ` Jeff King
2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-18  3:14   ` [PATCH v3 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-18  3:14   ` [PATCH v3 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-18  3:14   ` [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-18  3:14   ` [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-18  3:14   ` [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  1:06     ` Taylor Blau
2021-02-23  1:42       ` Jeff King
2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-23  8:07     ` Junio C Hamano
2021-02-23 18:51       ` Jeff King
2021-02-23  2:25   ` [PATCH v4 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-23  2:25   ` [PATCH v4 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-23  2:25   ` [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-23  2:25   ` [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-24 23:19     ` Junio C Hamano
2021-02-24 23:43       ` Junio C Hamano
2021-03-04 21:40         ` Taylor Blau
2021-03-04 21:55       ` Taylor Blau
2021-02-23  3:39   ` [PATCH v4 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  7:43   ` Junio C Hamano
2021-02-23 18:44     ` Jeff King
2021-02-23 19:54       ` Martin Fick
2021-02-23 20:06         ` Taylor Blau
2021-02-23 21:57           ` Martin Fick
2021-02-23 20:15         ` Jeff King
2021-02-23 21:41           ` Martin Fick
2021-02-23 21:53             ` Jeff King
2021-02-24 18:13               ` Martin Fick
2021-02-26  6:23                 ` Jeff King
