From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>, Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, dstolee@microsoft.com
Subject: Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
Date: Tue, 25 Aug 2020 09:14:19 -0400
Message-ID: <e7eb9fb6-f1ea-f932-efaa-7434ad809989@gmail.com>
In-Reply-To: <20200825023710.GA98081@syl.lan>

On 8/24/2020 10:37 PM, Taylor Blau wrote:
> On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
>> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
>>
>>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
>>> learned to remove a multi-pack-index file if it added or removed a pack
>>> from the object store.
>>>
>>> This mechanism is a little over-eager, since it is only necessary to
>>> drop a MIDX if 'git repack' removes a pack that the MIDX references.
>>> Adding a pack outside of the MIDX does not require invalidating the
>>> MIDX, and likewise for removing a pack the MIDX does not know about.
>>
>> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
>> or "git repack -Ad" is going to pack everything and delete the old
>> packs. So I think we'd want to remove a midx there.
>>
>> And "git repack -d" I think of as deleting only loose objects that we
>> just packed. But I guess it could also remove a pack that has now been
>> made redundant? That seems like a rare case in practice, but I suppose
>> is possible.
>
> Yeah, the patch message makes this sound more likely than it actually
> is, which I agree is very rare. I often write 'git repack' instead of
> 'git pack-objects' to slurp up everything loose into a new pack without
> having to list loose objects by name.
>
> That's the case that I really care about here: purely adding a new pack
> should not invalidate the existing MIDX.
>
>> Not exactly related to your fix, but kind of the flip side of it: would
>> we ever need to retain a midx that mentions some packs that still exist?
>>
>> E.g., imagine we have a midx that points to packs A and B, and
>> git-repack deletes B. By your logic above, we need to remove the midx
>> because now it points to objects in B which aren't accessible. But by
>> deleting it, could we be deleting the only thing that mentions the
>> objects in A?
>>
>> I _think_ the answer is "no", because we never went all-in on midx and
>> allowed deleting the matching .idx files for contained packs. So we'd
>> still have that A.idx, and we could just use the pack as normal. But
>> it's an interesting corner case if we ever do go in that direction.
>
> Agreed. Maybe a (admittedly somewhat large) #leftoverbits.
>
>> If you'll let me muse a bit more on midx-lifetime issues (which I've
>> never really thought about before just now):
>>
>> I'm also a little curious how bad it is to have a midx whose pack has
>> gone away. I guess we'd answer queries for "yes, we have this object"
>> even if we don't, which is bad. Though in practice we'd only delete
>> those packs if we have their objects elsewhere. And the pack code is
>> pretty good about retrying other copies of objects that can't be
>> accessed. Alternatively, I wonder if the midx-loading code ought to
>> check that all of the constituent packs are available.
>>
>> In that line of thinking, do we even need to delete midx files if one of
>> their packs goes away? The reading side probably ought to be able to
>> handle that gracefully.
>
> I think that this is probably the right direction, although I've only
> spent time in the MIDX code over the past couple of weeks, so I can't
> say with authority. It seems like it would be pretty annoying, though.
> For example, code that cares about listing all objects in a MIDX would
> have to check first whether the pack they're in still exists before
> emitting them. On top of that, there are more corner cases when object X
> exists in more than one pack, but some strict subset of those packs
> containing X have gone away.
>
> I don't think that it couldn't be done, though.
>
>> And the more interesting case is when you repack everything with "-ad"
>> or similar, at which point you shouldn't even need to look up what's in
>> the midx to see if you deleted its packs. The point of your operation is
>> to put it all-into-one, so you know the old midx should be discarded.
>>
>>> Teach 'git repack' to check for this by loading the MIDX, and checking
>>> whether the to-be-removed pack is known to the MIDX. This requires a
>>> slightly odd alteration to a test in t5319, which is explained with a
>>> comment.
>>
>> My above musings aside, this seems like an obvious improvement.
>>
>>> diff --git a/builtin/repack.c b/builtin/repack.c
>>> index 04c5ceaf7e..98fac03946 100644
>>> --- a/builtin/repack.c
>>> +++ b/builtin/repack.c
>>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
>>> static void remove_redundant_pack(const char *dir_name, const char *base_name)
>>> {
>>> struct strbuf buf = STRBUF_INIT;
>>> - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
>>> + struct multi_pack_index *m = get_multi_pack_index(the_repository);
>>> + strbuf_addf(&buf, "%s.pack", base_name);
>>> + if (m && midx_contains_pack(m, buf.buf))
>>> + clear_midx_file(the_repository);
>>> + strbuf_insertf(&buf, 0, "%s/", dir_name);
>>
>> Makes sense. midx_contains_pack() is a binary search, so we'll spend
>> O(n log n) effort deleting the packs (I wondered if this might be
>> accidentally quadratic over the number of packs).
>
> Right. The MIDX stores packs in lexicographic order, so checking them is
> O(log n), which we do at most 'n' times.
>
>> And after we clear, "m" will be NULL, so we'll do it at most once. Which
>> is why you can get rid of the manual "midx_cleared" flag from the
>> preimage.
>
> Yep. I thought briefly about passing 'm' as a parameter, but then you
> have to worry about a dangling reference to
> 'the_repository->objects->multi_pack_index' after calling
> 'clear_midx_file()', so it's easier to look it up each time.
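
To make the cost argument concrete, here is a minimal standalone sketch
of the two ideas discussed above: a binary search over sorted pack names,
and re-reading the MIDX pointer on every call so that it is cleared at
most once. This is not the code in midx.c; every toy_* name and struct is
hypothetical and exists only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a MIDX: pack names sorted with strcmp(). */
struct toy_midx {
	const char **pack_names;
	size_t nr;
};

/* Hypothetical stand-in for the repository's cached MIDX pointer. */
struct toy_repo {
	struct toy_midx *midx;	/* NULL once the MIDX has been cleared */
	int clears;		/* how many times the MIDX was dropped */
};

/* Binary search over the sorted names: O(log n) comparisons. */
static int toy_midx_contains_pack(const struct toy_midx *m, const char *name)
{
	size_t lo = 0, hi = m->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, m->pack_names[mid]);

		if (!cmp)
			return 1;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

static void toy_clear_midx(struct toy_repo *r)
{
	r->midx = NULL;
	r->clears++;
}

/*
 * Mirrors the shape of remove_redundant_pack(): re-read the MIDX
 * pointer on every call so a cleared MIDX is never dereferenced.
 */
static void toy_remove_redundant_pack(struct toy_repo *r, const char *pack_name)
{
	struct toy_midx *m = r->midx;

	if (m && toy_midx_contains_pack(m, pack_name))
		toy_clear_midx(r);
	/* the real code would go on to unlink the pack files here */
}
```

Calling toy_remove_redundant_pack() for n packs costs at most n binary
searches, and after the first hit the pointer is NULL, so later calls
skip the search entirely.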

The discussion in this thread matches my understanding of the
situation.

>> So the patch looks good to me.

The code in builtin/repack.c looks good for sure. I have a quick question
about this new test:

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+ git multi-pack-index write &&
+ cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+ test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+ # Write a new pack that is unknown to the multi-pack-index.
+ git hash-object -w </dev/null >blob &&
+ git pack-objects $objdir/pack/pack <blob &&
+
+ GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+ test_cmp_bin $objdir/pack/multi-pack-index \
+ $objdir/pack/multi-pack-index.bak
+'
+
You create an arbitrary blob, and then add it to a pack-file. Do we
know that 'git repack' is definitely creating a new pack-file that makes
our manually-created pack-file redundant?

My suggestion is to have the test check itself:

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+ git multi-pack-index write &&
+ cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+ test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+ # Write a new pack that is unknown to the multi-pack-index.
+ git hash-object -w </dev/null >blob &&
+ HASH=$(git pack-objects $objdir/pack/pack <blob) &&
+
+ GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+ test_cmp_bin $objdir/pack/multi-pack-index \
+ $objdir/pack/multi-pack-index.bak &&
+ test_path_is_missing $objdir/pack/pack-$HASH.pack
+'
+
This test fails for me, on the 'test_path_is_missing'. Likely, the
blob is seen as already in a pack-file so is just pruned by 'git repack'
instead. I thought that perhaps we need to add a new pack ourselves that
overrides the small pack. Here is my attempt:

test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
	git multi-pack-index write &&
	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&

	# Write a new pack that is unknown to the multi-pack-index.
	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
	cat >blobs <<-EOF &&
	$BLOB1
	$BLOB2
	EOF
	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
	test_cmp_bin $objdir/pack/multi-pack-index \
		$objdir/pack/multi-pack-index.bak &&
	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
	test_path_is_missing $objdir/pack/pack-$HASH1.pack
'

However, this _still_ fails on the "test_path_is_missing" line, so I'm not
sure how to make sure your logic is tested. I saw that 'git repack' was
writing "nothing new to pack" in the output, so I also tested adding a few
commits and trying to force it to repack reachable data, but I cannot seem
to trigger it to create a new pack that overrides only one pack that is not
in the MIDX.

Likely, I just don't know how 'git repack' works well enough to trigger this
behavior. But the test as-is is not testing what you want it to test.

Thanks,
-Stolee