git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>, "Junio C Hamano" <gitster@pobox.com>,
	"René Scharfe" <l.s.r@web.de>
Subject: [PATCH v2 4/4] midx-write.c: use `--stdin-packs` when repacking
Date: Mon, 1 Apr 2024 17:16:44 -0400	[thread overview]
Message-ID: <b5d6ba5802aef6ddf1542f1b0efffe43c22436ba.1712006190.git.me@ttaylorr.com> (raw)
In-Reply-To: <cover.1712006190.git.me@ttaylorr.com>

When constructing a new pack `git multi-pack-index repack` provides a
list of objects which is the union of objects in all MIDX'd packs which
were "included" in the repack.

Though correct, this typically yields a poorly structured pack, since
providing the objects list over stdin does not give pack-objects a
chance to discover the namehash values for each object, leading to
sub-optimal delta selection.

We can use `--stdin-packs` instead, which has a couple of benefits:

  - it does a supplemental walk over objects in the supplied list of
    packs to discover their namehash, leading to higher-quality delta
    selection

  - it requires us to list far less data over stdin; instead of listing
    each object in the resulting pack, we need only list the
    constituent packs from which those objects were selected in the MIDX

Of course, this comes at a slight cost: though we save time on listing
packs versus objects over stdin[^1] (around ~650 milliseconds), we add a
non-trivial amount of time walking over the given objects in order to
find better deltas.

In general, this is likely to more closely match the user's expectations
(i.e. that packs generated via `git multi-pack-index repack` are written
with high-quality deltas). But if not, we can always introduce a new
option in pack-objects to disable the supplemental object walk, which
would yield a pure CPU-time savings, at the cost of the on-disk size of
the resulting pack.

[^1]: In a patched version of Git that doesn't perform the supplemental
  object walk in `pack-objects --stdin-packs`, we save around ~650ms
  (from 5.968 to 5.325 seconds) when running `git multi-pack-index
  repack --batch-size=0` on git.git with all objects packed, and all
  packs in a MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx-write.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx-write.c b/midx-write.c
index 960cc46250..65e69d2de7 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1474,7 +1474,8 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", "--non-empty",
+		     NULL);
 
 	strvec_pushf(&cmd.args, "%s/pack/pack", object_dir);
 
@@ -1498,16 +1499,15 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	}
 
 	cmd_in = xfdopen(cmd.in, "w");
-
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
-
-		if (!include_pack[pack_int_id])
+	for (i = 0; i < m->num_packs; i++) {
+		struct packed_git *p = m->packs[i];
+		if (!p)
 			continue;
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		if (include_pack[i])
+			fprintf(cmd_in, "%s\n", pack_basename(p));
+		else
+			fprintf(cmd_in, "^%s\n", pack_basename(p));
 	}
 	fclose(cmd_in);
 
-- 
2.44.0.330.g158d2a670b4


  parent reply	other threads:[~2024-04-01 21:17 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-25 17:24 [PATCH 00/11] midx: split MIDX writing routines into midx-write.c, cleanup Taylor Blau
2024-03-25 17:24 ` [PATCH 01/11] midx-write: initial commit Taylor Blau
2024-03-25 20:30   ` Junio C Hamano
2024-03-25 22:09     ` Taylor Blau
2024-03-25 17:24 ` [PATCH 02/11] midx: extern a pair of shared functions Taylor Blau
2024-03-25 17:24 ` [PATCH 03/11] midx: move `midx_repack` (and related functions) to midx-write.c Taylor Blau
2024-03-25 17:24 ` [PATCH 04/11] midx: move `expire_midx_packs` " Taylor Blau
2024-03-25 17:24 ` [PATCH 05/11] midx: move `write_midx_file_only` " Taylor Blau
2024-03-25 17:24 ` [PATCH 06/11] midx: move `write_midx_file` " Taylor Blau
2024-03-25 17:24 ` [PATCH 07/11] midx: move `write_midx_internal` (and related functions) " Taylor Blau
2024-03-25 17:24 ` [PATCH 08/11] midx-write.c: avoid directly managed temporary strbuf Taylor Blau
2024-03-25 20:33   ` Junio C Hamano
2024-03-25 22:11     ` Taylor Blau
2024-03-25 17:24 ` [PATCH 09/11] midx-write.c: factor out common want_included_pack() routine Taylor Blau
2024-03-25 20:36   ` Junio C Hamano
2024-03-27  8:29   ` Jeff King
2024-03-25 17:24 ` [PATCH 10/11] midx-write.c: check count of packs to repack after grouping Taylor Blau
2024-03-25 20:41   ` Junio C Hamano
2024-03-25 22:11     ` Taylor Blau
2024-03-25 17:24 ` [PATCH 11/11] midx-write.c: use `--stdin-packs` when repacking Taylor Blau
2024-03-27  8:37   ` Jeff King
2024-03-27  8:39 ` [PATCH 00/11] midx: split MIDX writing routines into midx-write.c, cleanup Jeff King
2024-04-01 21:16 ` [PATCH v2 0/4] " Taylor Blau
2024-04-01 21:16   ` [PATCH v2 1/4] midx-write: move writing-related functions from midx.c Taylor Blau
2024-04-01 21:16   ` [PATCH v2 2/4] midx-write.c: factor out common want_included_pack() routine Taylor Blau
2024-04-02 11:47     ` Patrick Steinhardt
2024-04-01 21:16   ` [PATCH v2 3/4] midx-write.c: check count of packs to repack after grouping Taylor Blau
2024-04-01 21:16   ` Taylor Blau [this message]
2024-04-01 21:45     ` [PATCH v2 4/4] midx-write.c: use `--stdin-packs` when repacking Junio C Hamano
2024-04-02 11:47     ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b5d6ba5802aef6ddf1542f1b0efffe43c22436ba.1712006190.git.me@ttaylorr.com \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).