git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Victoria Dye via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Victoria Dye <vdye@github.com>
Subject: Re: [PATCH] repack: respect --keep-pack with geometric repack
Date: Fri, 20 May 2022 13:00:30 -0400	[thread overview]
Message-ID: <YofJLv8+x5J7yPmf@nand.local> (raw)
In-Reply-To: <pull.1235.git.1653064572170.gitgitgadget@gmail.com>

Hi Victoria,

On Fri, May 20, 2022 at 04:36:12PM +0000, Victoria Dye via GitGitGadget wrote:
> From: Victoria Dye <vdye@github.com>
>
> Update 'repack' to ignore packs named on the command line with the
> '--keep-pack' option. Specifically, modify 'init_pack_geometry()' to treat
> command line-kept packs the same way it treats packs with an on-disk '.keep'
> file (that is, skip the pack and do not include it in the 'geometry'
> structure).
>
> Without this handling, a '--keep-pack' pack would be included in the
> 'geometry' structure. If the pack is *before* the geometry split line (with
> at least one other pack and/or loose objects present), 'repack' assumes the
> pack's contents are "rolled up" into another pack via 'pack-objects'.
> However, because the internally-invoked 'pack-objects' properly excludes
> '--keep-pack' objects, any new pack it creates will not contain the kept
> objects. Finally, 'repack' deletes the '--keep-pack' as "redundant" (since
> it assumes 'pack-objects' created a new pack with its contents), resulting
> in possible object loss and repository corruption.

Nicely found and explained. Having discussed this fix with you already
off-list, this approach (to treat kept packs as excluded from the list
of packs in the `geometry` structure regardless of whether they are kept
on disk or in-core) makes sense to me.

I left a couple of small notes on the patch below, but since I have some
patches that deal with a separate issue in the `git repack --geometric`
code coming, do you want to combine forces (and I can send a
lightly-reworked version of this patch as a part of my series)?

> @@ -332,17 +332,34 @@ static int geometry_cmp(const void *va, const void *vb)
>  	return 0;
>  }
>
> -static void init_pack_geometry(struct pack_geometry **geometry_p)
> +static void init_pack_geometry(struct pack_geometry **geometry_p,
> +			       struct string_list *existing_kept_packs)
>  {
>  	struct packed_git *p;
>  	struct pack_geometry *geometry;
> +	struct strbuf buf = STRBUF_INIT;
>
>  	*geometry_p = xcalloc(1, sizeof(struct pack_geometry));
>  	geometry = *geometry_p;
>
> +	string_list_sort(existing_kept_packs);

Would it be worth sorting this as early as in collect_pack_filenames()?
For our purposes in this patch, this works as-is, but it may be
defensive to try and minimize the time that list has unsorted contents.

>  	for (p = get_all_packs(the_repository); p; p = p->next) {
> -		if (!pack_kept_objects && p->pack_keep)
> -			continue;
> +		if (!pack_kept_objects) {
> +			if (p->pack_keep)
> +				continue;

(You mentioned this to me off-list, but I'll repeat it here since it
wasn't obvious to me on first read): this check for `p->pack_keep` isn't
strictly necessary, since any packs that have their `pack_keep` bit set
will appear in the `existing_kept_packs` list.

But it does give us a fast path to avoid having to check that list, so
it's worth checking that bit to avoid a slightly more expensive check
where possible.

> +			/*
> +			 * The pack may be kept via the --keep-pack option;
> +			 * check 'existing_kept_packs' to determine whether to
> +			 * ignore it.
> +			 */
> +			strbuf_reset(&buf);
> +			strbuf_addstr(&buf, pack_basename(p));
> +			strbuf_strip_suffix(&buf, ".pack");
> +
> +			if (string_list_has_string(existing_kept_packs, buf.buf))
> +				continue;

It's too bad that we have to do this check at all, and can't rely on the
`pack_keep_in_core` in the same way as we check `p->pack_keep`. But
lifting that restriction is a more invasive change, so I'm happy to
rely on the contents of existing_kept_packs here in the meantime.

> +		}
>
>  		ALLOC_GROW(geometry->pack,
>  			   geometry->pack_nr + 1,
> @@ -353,6 +370,7 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
>  	}
>
>  	QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
> +	strbuf_release(&buf);
>  }
>
>  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> @@ -714,17 +732,20 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		strbuf_release(&path);
>  	}
>
> +	packdir = mkpathdup("%s/pack", get_object_directory());
> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
> +
> +	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
> +			       &keep_pack_list);
> +

Makes sense; we have to initialize existing_kept_packs before arranging
the list of packs for the `--geometric` split. And presumably
`collect_pack_filenames()` relies on `packdir`, `packtmp_name`, and
`packtmp` being setup ahead of time, too.

>  	if (geometric_factor) {
>  		if (pack_everything)
>  			die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
> -		init_pack_geometry(&geometry);
> +		init_pack_geometry(&geometry, &existing_kept_packs);
>  		split_pack_geometry(geometry, geometric_factor);
>  	}
>
> -	packdir = mkpathdup("%s/pack", get_object_directory());
> -	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> -	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
> -
>  	sigchain_push_common(remove_pack_on_signal);
>
>  	prepare_pack_objects(&cmd, &po_args);
> @@ -764,9 +785,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	if (use_delta_islands)
>  		strvec_push(&cmd.args, "--delta-islands");
>
> -	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
> -			       &keep_pack_list);
> -
>  	if (pack_everything & ALL_INTO_ONE) {
>  		repack_promisor_objects(&po_args, &names);
>
> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index bdbbcbf1eca..f5ac23413d5 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -180,6 +180,40 @@ test_expect_success '--geometric ignores kept packs' '
>  	)
>  '
>
> +test_expect_success '--geometric ignores --keep-pack packs' '
> +	git init geometric &&
> +	test_when_finished "rm -fr geometric" &&
> +	(
> +		cd geometric &&
> +
> +		# Create two equal-sized packs
> +		test_commit kept && # 3 objects
> +		test_commit pack && # 3 objects
> +
> +		KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF
> +		refs/tags/kept
> +		EOF
> +		) &&
> +		PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF
> +		refs/tags/pack
> +		^refs/tags/kept
> +		EOF
> +		) &&

Nit; we don't care about the name of $PACK, so it would probably be fine
to avoid storing the `PACK` variable. We could write these packs with
just `git repack -d` after each `test_commit` (which would avoid us
having to call `prune-packed`).

Does it matter which one is kept? I don't think so, since AFAICT the
critical bit is that we mark one of the packs being rolled up as a
`--keep-pack`.

> +		# Prune loose objects that are now packed into PACK and KEEP
> +		git prune-packed &&
> +
> +		git repack --geometric 2 -dm --keep-pack=pack-$KEPT.pack >out &&
> +
> +		# Packs should not have changed (only one non-kept pack, no
> +		# loose objects), but midx should now exist.
> +		test_i18ngrep "Nothing new to pack" out &&

Nit; test_i18ngrep here should just be "grep".

> +		test_path_is_file $midx &&
> +		test_path_is_file $objdir/pack/pack-$KEPT.pack &&
> +		git fsck
> +	)


Thanks,
Taylor

  reply	other threads:[~2022-05-20 17:00 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-20 16:36 [PATCH] repack: respect --keep-pack with geometric repack Victoria Dye via GitGitGadget
2022-05-20 17:00 ` Taylor Blau [this message]
2022-05-20 17:27   ` Victoria Dye
2022-05-20 19:05     ` Taylor Blau
2022-05-20 20:41   ` Junio C Hamano
2022-05-20 22:12     ` Junio C Hamano
2022-05-20 22:20       ` Taylor Blau
2022-05-20 23:26     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YofJLv8+x5J7yPmf@nand.local \
    --to=me@ttaylorr.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).