git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack`
       [not found] <YyokIf%2FSd7SYztKQ@nand.local>
@ 2022-09-20 20:40 ` Taylor Blau
  2022-09-20 20:40   ` [PATCH v2 1/3] midx.c: compute pack_info array outside of fill_included_packs_batch() Taylor Blau
                     ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Taylor Blau @ 2022-09-20 20:40 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

Here's a few patches that replace the existing "feed each OID
one-by-one" approach to implement the `repack` sub-command of the
`multi-pack-index` builtin with one that uses `pack-objects`'s
`--stdin-packs` option.

There is an additional patch at the beginning of this series in order to
extract the mtime-sorted list of packs to rollup from their home in
`fill_included_packs_batch()`. There's also one more on the end to unify
the `include_pack` array into the `repack_info` struct(s) themselves.

The second patch is the substantive one. Thanks for all of the review so
far! :-)

Taylor Blau (3):
  midx.c: compute pack_info array outside of fill_included_packs_batch()
  midx.c: use `pack-objects --stdin-packs` when repacking
  midx.c: unify `include_pack` array into `pack_info` struct

 midx.c | 76 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 38 insertions(+), 38 deletions(-)

Range-diff against v1:
-:  ---------- > 1:  a501776f6e midx.c: compute pack_info array outside of fill_included_packs_batch()
1:  7e987eb9d7 ! 2:  4218d9e08a midx.c: use `pack-objects --stdin-packs` when repacking
    @@ midx.c: int midx_repack(struct repository *r, const char *object_dir, size_t bat
      	struct strbuf base_name = STRBUF_INIT;
     +	struct strbuf scratch = STRBUF_INIT;
      	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
    + 	struct repack_info *pack_info = NULL;
      
    - 	/*
     @@ midx.c: int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
      	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
      	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
    @@ midx.c: int midx_repack(struct repository *r, const char *object_dir, size_t bat
     -		struct object_id oid;
     -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
     +	for (i = 0; i < m->num_packs; i++) {
    -+		strbuf_reset(&scratch);
    ++		struct repack_info *info = &pack_info[i];
      
     -		if (!include_pack[pack_int_id])
     -			continue;
    -+		strbuf_addstr(&scratch, m->pack_names[i]);
    -+		strbuf_strip_suffix(&scratch, ".idx");
    -+		strbuf_addstr(&scratch, ".pack");
    ++		strbuf_reset(&scratch);
      
     -		nth_midxed_object_oid(&oid, m, i);
     -		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
    -+		fprintf(cmd_in, "%s%s\n", include_pack[i] ? "" : "^", scratch.buf);
    ++		strbuf_addstr(&scratch, m->pack_names[info->pack_int_id]);
    ++		strbuf_strip_suffix(&scratch, ".idx");
    ++		strbuf_addstr(&scratch, ".pack");
    ++
    ++		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
      	}
      	fclose(cmd_in);
     +	strbuf_release(&scratch);
-:  ---------- > 3:  81e9ccc323 midx.c: unify `include_pack` array into `pack_info` struct
-- 
2.37.0.1.g1379af2e9d

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/3] midx.c: compute pack_info array outside of fill_included_packs_batch()
  2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
@ 2022-09-20 20:40   ` Taylor Blau
  2022-09-20 20:40   ` [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2022-09-20 20:40 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

The fill_included_packs_batch() function needs to compute a list of
packs which are included in the MIDX, and then sort that list according
to the pack mtimes.

Since packs are listed in the `m->pack_names` array in SHA-1 order, we
have to reconstruct their mtimes by reading the `pack_mtime` member of
the struct `packed_git` in order to determine the value by which to
sort.

This list is only needed by the fill_included_packs_batch(), but the
subsequent commit will want to have this list accessible in this
function's caller, `midx_repack()`.

Prepare for that by teaching `midx_repack()` to compute, pass in, and
then free that list so it is available throughout the body of
`midx_repack()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/midx.c b/midx.c
index c27d0e5f15..db0a70c5af 100644
--- a/midx.c
+++ b/midx.c
@@ -1905,32 +1905,16 @@ static int fill_included_packs_all(struct repository *r,
 
 static int fill_included_packs_batch(struct repository *r,
 				     struct multi_pack_index *m,
+				     struct repack_info *pack_info,
 				     unsigned char *include_pack,
 				     size_t batch_size)
 {
 	uint32_t i, packs_to_repack;
 	size_t total_size;
-	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
 	int pack_kept_objects = 0;
 
 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);
 
-	for (i = 0; i < m->num_packs; i++) {
-		pack_info[i].pack_int_id = i;
-
-		if (prepare_midx_pack(r, m, i))
-			continue;
-
-		pack_info[i].mtime = m->packs[i]->mtime;
-	}
-
-	for (i = 0; batch_size && i < m->num_objects; i++) {
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
-		pack_info[pack_int_id].referenced_objects++;
-	}
-
-	QSORT(pack_info, m->num_packs, compare_by_mtime);
-
 	total_size = 0;
 	packs_to_repack = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
@@ -1957,8 +1941,6 @@ static int fill_included_packs_batch(struct repository *r,
 		include_pack[pack_int_id] = 1;
 	}
 
-	free(pack_info);
-
 	if (packs_to_repack < 2)
 		return 1;
 
@@ -1974,6 +1956,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
+	struct repack_info *pack_info = NULL;
 
 	/*
 	 * When updating the default for these configuration
@@ -1987,9 +1970,25 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		return 0;
 
 	CALLOC_ARRAY(include_pack, m->num_packs);
+	CALLOC_ARRAY(pack_info, m->num_packs);
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_info[i].pack_int_id = i;
+
+		if (prepare_midx_pack(r, m, i))
+			continue;
+
+		pack_info[i].mtime = m->packs[i]->mtime;
+	}
+
+	for (i = 0; batch_size && i < m->num_objects; i++)
+		pack_info[nth_midxed_pack_int_id(m, i)].referenced_objects++;
+
+	QSORT(pack_info, m->num_packs, compare_by_mtime);
 
 	if (batch_size) {
-		if (fill_included_packs_batch(r, m, include_pack, batch_size))
+		if (fill_included_packs_batch(r, m, pack_info, include_pack,
+					      batch_size))
 			goto cleanup;
 	} else if (fill_included_packs_all(r, m, include_pack))
 		goto cleanup;
@@ -2047,6 +2046,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	result = write_midx_internal(object_dir, NULL, NULL, NULL, NULL, flags);
 
 cleanup:
+	free(pack_info);
 	free(include_pack);
 	return result;
 }
-- 
@@ -1905,32 +1905,16 @@ static int fill_included_packs_all(struct repository *r,
 
 static int fill_included_packs_batch(struct repository *r,
 				     struct multi_pack_index *m,
+				     struct repack_info *pack_info,
 				     unsigned char *include_pack,
 				     size_t batch_size)
 {
 	uint32_t i, packs_to_repack;
 	size_t total_size;
-	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
 	int pack_kept_objects = 0;
 
 	repo_config_get_bool(r, "repack.packkeptobjects", &pack_kept_objects);
 
-	for (i = 0; i < m->num_packs; i++) {
-		pack_info[i].pack_int_id = i;
-
-		if (prepare_midx_pack(r, m, i))
-			continue;
-
-		pack_info[i].mtime = m->packs[i]->mtime;
-	}
-
-	for (i = 0; batch_size && i < m->num_objects; i++) {
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
-		pack_info[pack_int_id].referenced_objects++;
-	}
-
-	QSORT(pack_info, m->num_packs, compare_by_mtime);
-
 	total_size = 0;
 	packs_to_repack = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
@@ -1957,8 +1941,6 @@ static int fill_included_packs_batch(struct repository *r,
 		include_pack[pack_int_id] = 1;
 	}
 
-	free(pack_info);
-
 	if (packs_to_repack < 2)
 		return 1;
 
@@ -1974,6 +1956,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
+	struct repack_info *pack_info = NULL;
 
 	/*
 	 * When updating the default for these configuration
@@ -1987,9 +1970,25 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		return 0;
 
 	CALLOC_ARRAY(include_pack, m->num_packs);
+	CALLOC_ARRAY(pack_info, m->num_packs);
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_info[i].pack_int_id = i;
+
+		if (prepare_midx_pack(r, m, i))
+			continue;
+
+		pack_info[i].mtime = m->packs[i]->mtime;
+	}
+
+	for (i = 0; batch_size && i < m->num_objects; i++)
+		pack_info[nth_midxed_pack_int_id(m, i)].referenced_objects++;
+
+	QSORT(pack_info, m->num_packs, compare_by_mtime);
 
 	if (batch_size) {
-		if (fill_included_packs_batch(r, m, include_pack, batch_size))
+		if (fill_included_packs_batch(r, m, pack_info, include_pack,
+					      batch_size))
 			goto cleanup;
 	} else if (fill_included_packs_all(r, m, include_pack))
 		goto cleanup;
@@ -2047,6 +2046,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	result = write_midx_internal(object_dir, NULL, NULL, NULL, NULL, flags);
 
 cleanup:
+	free(pack_info);
 	free(include_pack);
 	return result;
 }
-- 
2.37.0.1.g1379af2e9d


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
  2022-09-20 20:40   ` [PATCH v2 1/3] midx.c: compute pack_info array outside of fill_included_packs_batch() Taylor Blau
@ 2022-09-20 20:40   ` Taylor Blau
  2022-09-23 18:23     ` Derrick Stolee
  2022-09-20 20:40   ` [PATCH v2 3/3] midx.c: unify `include_pack` array into `pack_info` struct Taylor Blau
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2022-09-20 20:40 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

The `git multi-pack-index repack` command aggregates all objects in
certain packs contained in the MIDX, largely dictated by the value of
its `--batch-size` parameter.

To create the new pack, the MIDX machinery spawns a `pack-objects`
sub-process, which it feeds the object IDs in each of the pack(s) that
it wants to aggregate.

This implementation, which dates back to ce1e4a105b (midx: implement
midx_repack(), 2019-06-10), predates the `--stdin-packs` mode in
`pack-objects`. This mode (which was introduced a couple of years later
in 339bce27f4 (builtin/pack-objects.c: add '--stdin-packs' option,
2021-02-22)) allows `pack-objects` callers to specify the contents of
the output pack by indicating which packs to aggregate.

This patch replaces the pre-`--stdin-packs` invocation (where each
object is given to `pack-objects` one by one) with the more modern
`--stdin-packs` option.

This allows us to avoid some CPU cycles serializing and deserializing
every object ID in all of the packs we're aggregating. It also avoids us
having to send a potentially large amount of data down to
`pack-objects`.

But more importantly, it generates slightly higher quality (read: more
tightly compressed) packs, because of the reachability traversal that
`--stdin-packs` does after the fact in order to gather namehash values
which seed the delta selection process.

In practice, this seems to add a slight amount of overhead (on the order
of a few seconds for git.git broken up into ~100 packs), in exchange for
a modest reduction (on the order of ~3.5%) in the resulting pack size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/midx.c b/midx.c
index db0a70c5af..5fbf964bba 100644
--- a/midx.c
+++ b/midx.c
@@ -1955,6 +1955,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
+	struct strbuf scratch = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
 	struct repack_info *pack_info = NULL;
 
@@ -1996,7 +1997,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", NULL);
 
 	strbuf_addstr(&base_name, object_dir);
 	strbuf_addstr(&base_name, "/pack/pack");
@@ -2025,17 +2026,19 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 
 	cmd_in = xfdopen(cmd.in, "w");
 
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+	for (i = 0; i < m->num_packs; i++) {
+		struct repack_info *info = &pack_info[i];
 
-		if (!include_pack[pack_int_id])
-			continue;
+		strbuf_reset(&scratch);
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		strbuf_addstr(&scratch, m->pack_names[info->pack_int_id]);
+		strbuf_strip_suffix(&scratch, ".idx");
+		strbuf_addstr(&scratch, ".pack");
+
+		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
+	strbuf_release(&scratch);
 
 	if (finish_command(&cmd)) {
 		error(_("could not finish pack-objects"));
-- 
@@ -1955,6 +1955,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
+	struct strbuf scratch = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
 	struct repack_info *pack_info = NULL;
 
@@ -1996,7 +1997,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", NULL);
 
 	strbuf_addstr(&base_name, object_dir);
 	strbuf_addstr(&base_name, "/pack/pack");
@@ -2025,17 +2026,19 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 
 	cmd_in = xfdopen(cmd.in, "w");
 
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+	for (i = 0; i < m->num_packs; i++) {
+		struct repack_info *info = &pack_info[i];
 
-		if (!include_pack[pack_int_id])
-			continue;
+		strbuf_reset(&scratch);
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		strbuf_addstr(&scratch, m->pack_names[info->pack_int_id]);
+		strbuf_strip_suffix(&scratch, ".idx");
+		strbuf_addstr(&scratch, ".pack");
+
+		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
+	strbuf_release(&scratch);
 
 	if (finish_command(&cmd)) {
 		error(_("could not finish pack-objects"));
-- 
2.37.0.1.g1379af2e9d


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/3] midx.c: unify `include_pack` array into `pack_info` struct
  2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
  2022-09-20 20:40   ` [PATCH v2 1/3] midx.c: compute pack_info array outside of fill_included_packs_batch() Taylor Blau
  2022-09-20 20:40   ` [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
@ 2022-09-20 20:40   ` Taylor Blau
  2022-09-20 21:54   ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Jeff King
  2022-09-21 19:09   ` Junio C Hamano
  4 siblings, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2022-09-20 20:40 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

The MIDX repack code uses a separate byte-array to track which packs are
"included" in the rollup. Now that an array of `repack_info` structs is
available for the lifetime of the `midx_repack()` function, we can
instead use a new bit in that struct's record instead of maintaining a
separate array.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/midx.c b/midx.c
index 5fbf964bba..59db9162ea 100644
--- a/midx.c
+++ b/midx.c
@@ -1865,6 +1865,7 @@ struct repack_info {
 	timestamp_t mtime;
 	uint32_t referenced_objects;
 	uint32_t pack_int_id;
+	unsigned include : 1;
 };
 
 static int compare_by_mtime(const void *a_, const void *b_)
@@ -1883,7 +1884,7 @@ static int compare_by_mtime(const void *a_, const void *b_)
 
 static int fill_included_packs_all(struct repository *r,
 				   struct multi_pack_index *m,
-				   unsigned char *include_pack)
+				   struct repack_info *pack_info)
 {
 	uint32_t i, count = 0;
 	int pack_kept_objects = 0;
@@ -1896,7 +1897,7 @@ static int fill_included_packs_all(struct repository *r,
 		if (!pack_kept_objects && m->packs[i]->pack_keep)
 			continue;
 
-		include_pack[i] = 1;
+		pack_info[i].include = 1;
 		count++;
 	}
 
@@ -1906,7 +1907,6 @@ static int fill_included_packs_all(struct repository *r,
 static int fill_included_packs_batch(struct repository *r,
 				     struct multi_pack_index *m,
 				     struct repack_info *pack_info,
-				     unsigned char *include_pack,
 				     size_t batch_size)
 {
 	uint32_t i, packs_to_repack;
@@ -1918,8 +1918,8 @@ static int fill_included_packs_batch(struct repository *r,
 	total_size = 0;
 	packs_to_repack = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
-		int pack_int_id = pack_info[i].pack_int_id;
-		struct packed_git *p = m->packs[pack_int_id];
+		struct repack_info *info = &pack_info[i];
+		struct packed_git *p = m->packs[info->pack_int_id];
 		size_t expected_size;
 
 		if (!p)
@@ -1938,7 +1938,8 @@ static int fill_included_packs_batch(struct repository *r,
 
 		packs_to_repack++;
 		total_size += expected_size;
-		include_pack[pack_int_id] = 1;
+
+		info->include = 1;
 	}
 
 	if (packs_to_repack < 2)
@@ -1951,7 +1952,6 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 {
 	int result = 0;
 	uint32_t i;
-	unsigned char *include_pack;
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
@@ -1970,7 +1970,6 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	if (!m)
 		return 0;
 
-	CALLOC_ARRAY(include_pack, m->num_packs);
 	CALLOC_ARRAY(pack_info, m->num_packs);
 
 	for (i = 0; i < m->num_packs; i++) {
@@ -1988,10 +1987,9 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	QSORT(pack_info, m->num_packs, compare_by_mtime);
 
 	if (batch_size) {
-		if (fill_included_packs_batch(r, m, pack_info, include_pack,
-					      batch_size))
+		if (fill_included_packs_batch(r, m, pack_info, batch_size))
 			goto cleanup;
-	} else if (fill_included_packs_all(r, m, include_pack))
+	} else if (fill_included_packs_all(r, m, pack_info))
 		goto cleanup;
 
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
@@ -2035,7 +2033,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		strbuf_strip_suffix(&scratch, ".idx");
 		strbuf_addstr(&scratch, ".pack");
 
-		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
+		fprintf(cmd_in, "%s%s\n", info->include ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
 	strbuf_release(&scratch);
@@ -2050,6 +2048,5 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 
 cleanup:
 	free(pack_info);
-	free(include_pack);
 	return result;
 }
-- 
@@ -1865,6 +1865,7 @@ struct repack_info {
 	timestamp_t mtime;
 	uint32_t referenced_objects;
 	uint32_t pack_int_id;
+	unsigned include : 1;
 };
 
 static int compare_by_mtime(const void *a_, const void *b_)
@@ -1883,7 +1884,7 @@ static int compare_by_mtime(const void *a_, const void *b_)
 
 static int fill_included_packs_all(struct repository *r,
 				   struct multi_pack_index *m,
-				   unsigned char *include_pack)
+				   struct repack_info *pack_info)
 {
 	uint32_t i, count = 0;
 	int pack_kept_objects = 0;
@@ -1896,7 +1897,7 @@ static int fill_included_packs_all(struct repository *r,
 		if (!pack_kept_objects && m->packs[i]->pack_keep)
 			continue;
 
-		include_pack[i] = 1;
+		pack_info[i].include = 1;
 		count++;
 	}
 
@@ -1906,7 +1907,6 @@ static int fill_included_packs_all(struct repository *r,
 static int fill_included_packs_batch(struct repository *r,
 				     struct multi_pack_index *m,
 				     struct repack_info *pack_info,
-				     unsigned char *include_pack,
 				     size_t batch_size)
 {
 	uint32_t i, packs_to_repack;
@@ -1918,8 +1918,8 @@ static int fill_included_packs_batch(struct repository *r,
 	total_size = 0;
 	packs_to_repack = 0;
 	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
-		int pack_int_id = pack_info[i].pack_int_id;
-		struct packed_git *p = m->packs[pack_int_id];
+		struct repack_info *info = &pack_info[i];
+		struct packed_git *p = m->packs[info->pack_int_id];
 		size_t expected_size;
 
 		if (!p)
@@ -1938,7 +1938,8 @@ static int fill_included_packs_batch(struct repository *r,
 
 		packs_to_repack++;
 		total_size += expected_size;
-		include_pack[pack_int_id] = 1;
+
+		info->include = 1;
 	}
 
 	if (packs_to_repack < 2)
@@ -1951,7 +1952,6 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 {
 	int result = 0;
 	uint32_t i;
-	unsigned char *include_pack;
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
@@ -1970,7 +1970,6 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	if (!m)
 		return 0;
 
-	CALLOC_ARRAY(include_pack, m->num_packs);
 	CALLOC_ARRAY(pack_info, m->num_packs);
 
 	for (i = 0; i < m->num_packs; i++) {
@@ -1988,10 +1987,9 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	QSORT(pack_info, m->num_packs, compare_by_mtime);
 
 	if (batch_size) {
-		if (fill_included_packs_batch(r, m, pack_info, include_pack,
-					      batch_size))
+		if (fill_included_packs_batch(r, m, pack_info, batch_size))
 			goto cleanup;
-	} else if (fill_included_packs_all(r, m, include_pack))
+	} else if (fill_included_packs_all(r, m, pack_info))
 		goto cleanup;
 
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
@@ -2035,7 +2033,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		strbuf_strip_suffix(&scratch, ".idx");
 		strbuf_addstr(&scratch, ".pack");
 
-		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
+		fprintf(cmd_in, "%s%s\n", info->include ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
 	strbuf_release(&scratch);
@@ -2050,6 +2048,5 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 
 cleanup:
 	free(pack_info);
-	free(include_pack);
 	return result;
 }
-- 
2.37.0.1.g1379af2e9d

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack`
  2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
                     ` (2 preceding siblings ...)
  2022-09-20 20:40   ` [PATCH v2 3/3] midx.c: unify `include_pack` array into `pack_info` struct Taylor Blau
@ 2022-09-20 21:54   ` Jeff King
  2022-09-21 19:09   ` Junio C Hamano
  4 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2022-09-20 21:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, derrickstolee

On Tue, Sep 20, 2022 at 04:40:14PM -0400, Taylor Blau wrote:

> Here's a few patches that replace the existing "feed each OID
> one-by-one" approach to implement the `repack` sub-command of the
> `multi-pack-index` builtin with one that uses `pack-objects`'s
> `--stdin-packs` option.
> 
> There is an additional patch at the beginning of this series in order to
> extract the mtime-sorted list of packs to rollup from their home in
> `fill_included_packs_batch()`. There's also one more on the end to unify
> the `include_pack` array into the `repack_info` struct(s) themselves.

The first and third make sense to me, and it looks like a nice cleanup.

The middle one looks fine, modulo all of our earlier discussion about
the extra traversal being fine. I think it would be good t get a review
from Stolee as the person who invented "git multi-index-pack repack" and
would understand its intended use. Presumably this is all weird "I have
a giant Windows repository" packing strategy. :)

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack`
  2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
                     ` (3 preceding siblings ...)
  2022-09-20 21:54   ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Jeff King
@ 2022-09-21 19:09   ` Junio C Hamano
  2022-09-23 18:29     ` Taylor Blau
  4 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2022-09-21 19:09 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, peff

Taylor Blau <me@ttaylorr.com> writes:

> Here's a few patches that replace the existing "feed each OID
> one-by-one" approach to implement the `repack` sub-command of the
> `multi-pack-index` builtin with one that uses `pack-objects`'s
> `--stdin-packs` option.

One question.  How is this series expected to interact with the
7-patch series about ignoring cruft pack while "midx repack" etc.?

I guess it is not so urgent during the -rc period, so I'll stop at
asking that question without reading further, at least for now.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20 20:40   ` [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
@ 2022-09-23 18:23     ` Derrick Stolee
  2022-09-23 18:28       ` Taylor Blau
  0 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2022-09-23 18:23 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, peff

On 9/20/2022 4:40 PM, Taylor Blau wrote:

> +	for (i = 0; i < m->num_packs; i++) {
> +		struct repack_info *info = &pack_info[i];
>  
> -		if (!include_pack[pack_int_id])
> -			continue;

...

> +		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);

Outside of how the object set is provided (a list of objects or a list of
packs) it seems this is equivalent. The only difference is that we are
writing a line for _every_ pack in the multi-pack-index, but we preface
the line with "^" to say "not this pack".

Do you know if there is any reason to do this explicitly? Does this
change the set of objects in any way (perhaps by not including
duplicates that are tracked in those other packs)?

Personally, I would have kept the "if (...) continue;" pattern, so maybe
you had a concrete idea why this approach is better.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-23 18:23     ` Derrick Stolee
@ 2022-09-23 18:28       ` Taylor Blau
  2022-09-23 19:05         ` Derrick Stolee
  0 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2022-09-23 18:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, gitster, peff

On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote:
> Do you know if there is any reason to do this explicitly? Does this
> change the set of objects in any way (perhaps by not including
> duplicates that are tracked in those other packs)?

Yes. The "^" lines become excluded packs from the perspective of the
follow-on reachability traversal to discover the namehashes. So as soon
as we hit an object contained in one of the packs marked as excluded,
we'll halt the traversal along that direction, since we know that we're
not going to pick up any objects in those packs.

So you could omit them, and you'd get the same resulting pack, but it
would take longer to generate since we wouldn't be stopping the
follow-on traversal as early as possible.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack`
  2022-09-21 19:09   ` Junio C Hamano
@ 2022-09-23 18:29     ` Taylor Blau
  0 siblings, 0 replies; 10+ messages in thread
From: Taylor Blau @ 2022-09-23 18:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, derrickstolee, peff

On Wed, Sep 21, 2022 at 12:09:31PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Here's a few patches that replace the existing "feed each OID
> > one-by-one" approach to implement the `repack` sub-command of the
> > `multi-pack-index` builtin with one that uses `pack-objects`'s
> > `--stdin-packs` option.
>
> One question.  How is this series expected to interact with the
> 7-patch series about ignoring cruft pack while "midx repack" etc.?

The first round merges cleanly, but the second round will have a slight
conflict. I'll send you a new version of this series based on whatever
is in `master` following the release.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-23 18:28       ` Taylor Blau
@ 2022-09-23 19:05         ` Derrick Stolee
  0 siblings, 0 replies; 10+ messages in thread
From: Derrick Stolee @ 2022-09-23 19:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, peff

On 9/23/2022 2:28 PM, Taylor Blau wrote:
> On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote:
>> Do you know if there is any reason to do this explicitly? Does this
>> change the set of objects in any way (perhaps by not including
>> duplicates that are tracked in those other packs)?
> 
> Yes. The "^" lines become excluded packs from the perspective of the
> follow-on reachability traversal to discover the namehashes. So as soon
> as we hit an object contained in one of the packs marked as excluded,
> we'll halt the traversal along that direction, since we know that we're
> not going to pick up any objects in those packs.
> 
> So you could omit them, and you'd get the same resulting pack, but it
> would take longer to generate since we wouldn't be stopping the
> follow-on traversal as early as possible.

Excellent. Thanks for explaining that!

-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-09-23 19:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <YyokIf%2FSd7SYztKQ@nand.local>
2022-09-20 20:40 ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Taylor Blau
2022-09-20 20:40   ` [PATCH v2 1/3] midx.c: compute pack_info array outside of fill_included_packs_batch() Taylor Blau
2022-09-20 20:40   ` [PATCH v2 2/3] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
2022-09-23 18:23     ` Derrick Stolee
2022-09-23 18:28       ` Taylor Blau
2022-09-23 19:05         ` Derrick Stolee
2022-09-20 20:40   ` [PATCH v2 3/3] midx.c: unify `include_pack` array into `pack_info` struct Taylor Blau
2022-09-20 21:54   ` [PATCH v2 0/3] midx: use `--stdin-packs` to implement `repack` Jeff King
2022-09-21 19:09   ` Junio C Hamano
2022-09-23 18:29     ` Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).