git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Jeff King <peff@peff.net>, git@vger.kernel.org, dstolee@microsoft.com
Subject: Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
Date: Tue, 25 Aug 2020 11:14:41 -0400	[thread overview]
Message-ID: <6a34d7ee-8c6b-8c6c-93bd-0013dccccafb@gmail.com> (raw)
In-Reply-To: <20200825144146.GA7671@syl.lan>

On 8/25/2020 10:41 AM, Taylor Blau wrote:
> On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
>> The code in builtin/repack.c looks good for sure. I have a quick question
>> about this new test:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	git pack-objects $objdir/pack/pack <blob &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak
>> +'
>> +
>>
>> You create an arbitrary blob, and then add it to a pack-file. Do we
>> know that 'git repack' is definitely creating a new pack-file that makes
>> our manually-created pack-file redundant?
>>
>> My suggestion is to have the test check itself:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak &&
>> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
>> +'
>> +
>>
>> This test fails for me, on the 'test_path_is_missing'. Likely, the
>> blob is seen as already in a pack-file so is just pruned by 'git repack'
>> instead. I thought that perhaps we need to add a new pack ourselves that
>> overrides the small pack. Here is my attempt:
>>
>> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> 	git multi-pack-index write &&
>> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>>
>> 	# Write a new pack that is unknown to the multi-pack-index.
>> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
>> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
>> 	cat >blobs <<-EOF &&
>> 	$BLOB1
>> 	$BLOB2
>> 	EOF
>> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
>> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
>> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> 	test_cmp_bin $objdir/pack/multi-pack-index \
>> 		$objdir/pack/multi-pack-index.bak &&
>> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
>> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
>> '
>>
>> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
>> how to make sure your logic is tested. I saw that 'git repack' was writing
>> "nothing new to pack" in the output, so I also tested adding a few commits and
>> trying to force it to repack reachable data, but I cannot seem to trigger it
>> to create a new pack that overrides only one pack that is not in the MIDX.
>>
>> Likely, I just don't know how 'git rebase' works well enough to trigger this
>> behavior. But the test as-is is not testing what you want it to test.
> 
> I think this case might actually be impossible to tickle in a test. I
> thought that 'git repack -d' looked for existing packs whose objects are
> a subset of some new pack generated. But, it's much simpler than that:
> '-d' by itself just looks for packs that were already on disk with the
> same SHA-1 as a new pack, and it removes the old one.

If 'git repack' never calls remove_redundant_pack() unless we are doing
a "full repack", then we could simplify this logic:

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
-	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	struct multi_pack_index *m = get_multi_pack_index(the_repository);
+	strbuf_addf(&buf, "%s.pack", base_name);
+	if (m && midx_contains_pack(m, buf.buf))
+		clear_midx_file(the_repository);
+	strbuf_insertf(&buf, 0, "%s/", dir_name);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

to

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	clear_midx_file(the_repository);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

and get the same results as we are showing in these tests. This does
move us incrementally to a better situation: don't delete the MIDX
if we aren't deleting pack files. But, I think we can get around it.

> Note that 'git repack' uses 'git pack-objects' internally to find
> objects and generate a packfile. When calling 'git pack-objects', 'git
> repack -d' passes '--all' and '--unpacked', which means that there is no
> way we'd generate a new pack with the same SHA-1 as an existing pack
> ordinarily.
> 
> So, I think this case is impossible, or at least astronomically
> unlikely. What is more interesting (and untested) is that adding a _new_
> pack doesn't cause us to invalidate the MIDX. Here's a patch that does
> that:
> 
>   diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>   index 16a1ad040e..620f2058d6 100755
>   --- a/t/t5319-multi-pack-index.sh
>   +++ b/t/t5319-multi-pack-index.sh
>   @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
>           test_path_is_missing $objdir/pack/multi-pack-index
>    '
> 
>   -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>   -       git multi-pack-index write &&
>   -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>   -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>   -
>   -       # Write a new pack that is unknown to the multi-pack-index.
>   -       git hash-object -w </dev/null >blob &&
>   -       git pack-objects $objdir/pack/pack <blob &&
>   -
>   -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   -       test_cmp_bin $objdir/pack/multi-pack-index \
>   -               $objdir/pack/multi-pack-index.bak
>   +test_expect_success 'repack preserves multi-pack-index when creating packs' '
>   +       git init preserve &&
>   +       test_when_finished "rm -fr preserve" &&
>   +       (
>   +               cd preserve &&
>   +               midx=.git/objects/pack/multi-pack-index &&
>   +
>   +               test_commit "initial" &&
>   +               git repack -ad &&
>   +               git multi-pack-index write &&
>   +               ls .git/objects/pack | grep "\.pack$" >before &&
>   +
>   +               cp $midx $midx.bak &&
>   +
>   +               test_commit "another" &&
>   +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   +               ls .git/objects/pack | grep "\.pack$" >after &&
>   +
>   +               test_cmp_bin $midx.bak $midx &&
>   +               ! test_cmp before after
>   +       )
>    '

After looking at the callers to remote_redundant_pack() I noticed that it is only
called after inspecting the "names" struct, which contains the names of the packs
to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in
the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and
deleted. While this is very unlikely in the wild, it is definitely possible.

test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
	git init preserve &&
	test_when_finished "rm -fr preserve" &&
	(
		cd preserve &&
		midx=.git/objects/pack/multi-pack-index &&

		test_commit 1 &&
		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
		touch .git/objects/pack/pack-$HASH1.keep &&

		cat >pack-input <<-\EOF &&
		HEAD
		^HEAD~1
		EOF

		test_commit 2 &&
		HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
		touch .git/objects/pack/pack-$HASH2.keep &&

		git multi-pack-index write &&
		cp $midx $midx.bak &&

		test_commit 3 &&
		HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		test_commit 4 &&
		HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
		test_path_is_file .git/objects/pack/pack-$HASH1.pack &&
		test_path_is_file .git/objects/pack/pack-$HASH2.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH3.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH4.pack
       )
'

I believe this checks your condition properly enough.

Thanks,
-Stolee

  reply	other threads:[~2020-08-25 15:14 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-25  2:01 [PATCH] builtin/repack.c: invalidate MIDX only when necessary Taylor Blau
2020-08-25  2:26 ` Jeff King
2020-08-25  2:37   ` Taylor Blau
2020-08-25 13:14     ` Derrick Stolee
2020-08-25 14:41       ` Taylor Blau
2020-08-25 15:14         ` Derrick Stolee [this message]
2020-08-25 15:42           ` Taylor Blau
2020-08-25 16:56             ` Jeff King
2020-08-25 15:58   ` Junio C Hamano
2020-08-25 16:08     ` Taylor Blau
2020-08-25 16:18     ` Derrick Stolee
2020-08-25 17:34       ` Jeff King
2020-08-25 17:22     ` Jeff King
2020-08-25 18:05       ` Junio C Hamano
2020-08-25 18:27         ` Jeff King
2020-08-25 22:45           ` [PATCH] pack-redundant: gauge the usage before proposing its removal Junio C Hamano
2020-08-25 23:09             ` Taylor Blau
2020-08-25 23:22               ` Junio C Hamano
2020-08-26  1:17             ` [PATCH v1 0/3] War on dashed-git Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 1/3] transport-helper: do not run git-remote-ext etc. in dashed form Junio C Hamano
2020-08-26  1:24                 ` Eric Sunshine
2020-08-26  7:55                   ` Johannes Schindelin
2020-08-26 16:27                   ` Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 2/3] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26  1:28                 ` Eric Sunshine
2020-08-26  1:42                   ` Junio C Hamano
2020-08-26 16:08                   ` Junio C Hamano
2020-08-26 16:28                     ` Junio C Hamano
2020-08-26  8:02                 ` Johannes Schindelin
2020-08-26  1:17               ` [PATCH v1 3/3] git: catch an attempt to run "git-foo" Junio C Hamano
2020-08-26  1:19                 ` Junio C Hamano
2020-08-26  8:06                 ` Johannes Schindelin
2020-08-26 16:30                   ` Junio C Hamano
2020-08-28  2:13                 ` Johannes Schindelin
2020-08-28 22:03                   ` Junio C Hamano
2020-08-31  9:59                     ` Johannes Schindelin
2020-08-31 17:45                       ` Junio C Hamano
2020-12-20 15:25                   ` Johannes Schindelin
2020-12-21 22:24                     ` Junio C Hamano
2020-12-30  5:30                       ` Johannes Schindelin
2020-08-26  8:09               ` [PATCH v1 0/3] War on dashed-git Johannes Schindelin
2020-08-26 16:45                 ` Junio C Hamano
2020-08-26 19:46                   ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 1/2] transport-helper: do not run git-remote-ext etc. in " Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 2/2] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26 21:37                       ` [PATCH v2 3/2] credential-cache: use child_process.args Junio C Hamano
2020-08-26 22:25                         ` [PATCH] run_command: teach API users to use embedded 'args' more Junio C Hamano
2020-08-27  4:21                           ` Jeff King
2020-08-27  4:30                             ` Junio C Hamano
2020-08-27  4:31                             ` Eric Sunshine
2020-08-27  4:44                               ` Jeff King
2020-08-27  5:03                                 ` Eric Sunshine
2020-08-27  5:25                                   ` [PATCH] worktree: fix leak in check_clean_worktree() Jeff King
2020-08-27  5:56                                     ` Eric Sunshine
2020-08-27 15:31                                       ` Junio C Hamano
2020-08-27  4:13                         ` [PATCH v2 3/2] credential-cache: use child_process.args Jeff King
2020-08-27  4:22                           ` Jeff King
2020-08-27  4:31                           ` Junio C Hamano
2020-08-27  4:14                         ` Jeff King
2020-08-27 15:34                           ` Junio C Hamano
2020-08-31 22:56                         ` Junio C Hamano
2020-09-01  4:49                           ` Jeff King
2020-09-01 16:11                             ` Junio C Hamano
2020-08-27  0:57                     ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Derrick Stolee
2020-08-27  1:22                       ` Junio C Hamano
2020-08-28  9:14             ` [PATCH] pack-redundant: gauge the usage before proposing its removal Jeff King
2020-08-28 22:45               ` Junio C Hamano
2020-08-25  7:55 ` [PATCH] builtin/repack.c: invalidate MIDX only when necessary Son Luong Ngoc
2020-08-25 12:45   ` Derrick Stolee
2020-08-25 14:45   ` Taylor Blau
2020-08-25 16:04     ` [PATCH v2] " Taylor Blau
2020-08-26 20:51       ` Derrick Stolee
2020-08-26 20:54         ` Junio C Hamano
2020-08-25 16:47     ` [PATCH] " Jeff King
2020-08-25 17:10       ` Derrick Stolee
2020-08-25 17:29         ` Jeff King
2020-08-25 17:34           ` Taylor Blau
2020-08-25 17:42             ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6a34d7ee-8c6b-8c6c-93bd-0013dccccafb@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).