git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, dstolee@microsoft.com
Subject: Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
Date: Mon, 24 Aug 2020 22:37:10 -0400	[thread overview]
Message-ID: <20200825023710.GA98081@syl.lan> (raw)
In-Reply-To: <20200825022614.GA1391422@coredump.intra.peff.net>

On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
>
> > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> > learned to remove a multi-pack-index file if it added or removed a pack
> > from the object store.
> >
> > This mechanism is a little over-eager, since it is only necessary to
> > drop a MIDX if 'git repack' removes a pack that the MIDX references.
> > Adding a pack outside of the MIDX does not require invalidating the
> > MIDX, and likewise for removing a pack the MIDX does not know about.
>
> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> or "git repack -Ad" is going to pack everything and delete the old
> packs. So I think we'd want to remove a midx there.
>
> And "git repack -d" I think of as deleting only loose objects that we
> just packed. But I guess it could also remove a pack that has now been
> made redundant? That seems like a rare case in practice, but I suppose
> is possible.

Yeah, the patch message makes this sound more likely than it actually
is, which I agree is very rare. I often write 'git repack' instead of
'git pack-objects' to slurp up everything loose into a new pack without
having to list loose objects by name.

That's the case that I really care about here: purely adding a new pack
should not invalidate the existing MIDX.

> Not exactly related to your fix, but kind of the flip side of it: would
> we ever need to retain a midx that mentions some packs that still exist?
>
> E.g., imagine we have a midx that points to packs A and B, and
> git-repack deletes B. By your logic above, we need to remove the midx
> because now it points to objects in B which aren't accessible. But by
> deleting it, could we be deleting the only thing that mentions the
> objects in A?
>
> I _think_ the answer is "no", because we never went all-in on midx and
> allowed deleting the matching .idx files for contained packs. So we'd
> still have that A.idx, and we could just use the pack as normal. But
> it's an interesting corner case if we ever do go in that direction.

Agreed. Maybe a (admittedly somewhat large) #leftoverbits.

> If you'll let me muse a bit more on midx-lifetime issues (which I've
> never really thought about before just now):
>
> I'm also a little curious how bad it is to have a midx whose pack has
> gone away. I guess we'd answer queries for "yes, we have this object"
> even if we don't, which is bad. Though in practice we'd only delete
> those packs if we have their objects elsewhere. And the pack code is
> pretty good about retrying other copies of objects that can't be
> accessed. Alternatively, I wonder if the midx-loading code ought to
> check that all of the constituent packs are available.
>
> In that line of thinking, do we even need to delete midx files if one of
> their packs goes away? The reading side probably ought to be able to
> handle that gracefully.

I think that this is probably the right direction, although I've only
spend time in the MIDX code over the past couple of weeks, so I can't
say with authority. It seems like it would be pretty annoying, though.
For example, code that cares about listing all objects in a MIDX would
have to check first whether the pack they're in still exists before
emitting them. On top of that, there are more corner cases when object X
exists in more than one pack, but some strict subset of those packs
containing X have gone away.

I don't think that it couldn't be done, though.

> And the more interesting case is when you repack everything with "-ad"
> or similar, at which point you shouldn't even need to look up what's in
> the midx to see if you deleted its packs. The point of your operation is
> to put it all-into-one, so you know the old midx should be discarded.
>
> > Teach 'git repack' to check for this by loading the MIDX, and checking
> > whether the to-be-removed pack is known to the MIDX. This requires a
> > slightly odd alternation to a test in t5319, which is explained with a
> > comment.
>
> My above musings aside, this seems like an obvious improvement.
>
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 04c5ceaf7e..98fac03946 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
> >  static void remove_redundant_pack(const char *dir_name, const char *base_name)
> >  {
> >  	struct strbuf buf = STRBUF_INIT;
> > -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> > +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> > +	strbuf_addf(&buf, "%s.pack", base_name);
> > +	if (m && midx_contains_pack(m, buf.buf))
> > +		clear_midx_file(the_repository);
> > +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>
> Makes sense. midx_contains_pack() is a binary search, so we'll spend
> O(n log n) effort deleting the packs (I wondered if this might be
> accidentally quadratic over the number of packs).

Right. The MIDX stores packs in lexographic order, so checking them is
O(log n), which we do at most 'n' times.

> And after we clear, "m" will be NULL, so we'll do it at most once. Which
> is why you can get rid of the manual "midx_cleared" flag from the
> preimage.

Yep. I thought briefly about passing 'm' as a parameter, but then you
have to worry about a dangling reference to
'the_repository->objects->multi_pack_index' after calling
'clear_midx_file()', so it's easier to look it up each time.

> So the patch looks good to me.

Thanks.

> -Peff

Thanks,
Taylor

  reply	other threads:[~2020-08-25  2:37 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-25  2:01 [PATCH] builtin/repack.c: invalidate MIDX only when necessary Taylor Blau
2020-08-25  2:26 ` Jeff King
2020-08-25  2:37   ` Taylor Blau [this message]
2020-08-25 13:14     ` Derrick Stolee
2020-08-25 14:41       ` Taylor Blau
2020-08-25 15:14         ` Derrick Stolee
2020-08-25 15:42           ` Taylor Blau
2020-08-25 16:56             ` Jeff King
2020-08-25 15:58   ` Junio C Hamano
2020-08-25 16:08     ` Taylor Blau
2020-08-25 16:18     ` Derrick Stolee
2020-08-25 17:34       ` Jeff King
2020-08-25 17:22     ` Jeff King
2020-08-25 18:05       ` Junio C Hamano
2020-08-25 18:27         ` Jeff King
2020-08-25 22:45           ` [PATCH] pack-redundant: gauge the usage before proposing its removal Junio C Hamano
2020-08-25 23:09             ` Taylor Blau
2020-08-25 23:22               ` Junio C Hamano
2020-08-26  1:17             ` [PATCH v1 0/3] War on dashed-git Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 1/3] transport-helper: do not run git-remote-ext etc. in dashed form Junio C Hamano
2020-08-26  1:24                 ` Eric Sunshine
2020-08-26  7:55                   ` Johannes Schindelin
2020-08-26 16:27                   ` Junio C Hamano
2020-08-26  1:17               ` [PATCH v1 2/3] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26  1:28                 ` Eric Sunshine
2020-08-26  1:42                   ` Junio C Hamano
2020-08-26 16:08                   ` Junio C Hamano
2020-08-26 16:28                     ` Junio C Hamano
2020-08-26  8:02                 ` Johannes Schindelin
2020-08-26  1:17               ` [PATCH v1 3/3] git: catch an attempt to run "git-foo" Junio C Hamano
2020-08-26  1:19                 ` Junio C Hamano
2020-08-26  8:06                 ` Johannes Schindelin
2020-08-26 16:30                   ` Junio C Hamano
2020-08-28  2:13                 ` Johannes Schindelin
2020-08-28 22:03                   ` Junio C Hamano
2020-08-31  9:59                     ` Johannes Schindelin
2020-08-31 17:45                       ` Junio C Hamano
2020-12-20 15:25                   ` Johannes Schindelin
2020-12-21 22:24                     ` Junio C Hamano
2020-12-30  5:30                       ` Johannes Schindelin
2020-08-26  8:09               ` [PATCH v1 0/3] War on dashed-git Johannes Schindelin
2020-08-26 16:45                 ` Junio C Hamano
2020-08-26 19:46                   ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 1/2] transport-helper: do not run git-remote-ext etc. in " Junio C Hamano
2020-08-26 19:46                     ` [PATCH v2 2/2] cvsexportcommit: do not run git programs " Junio C Hamano
2020-08-26 21:37                       ` [PATCH v2 3/2] credential-cache: use child_process.args Junio C Hamano
2020-08-26 22:25                         ` [PATCH] run_command: teach API users to use embedded 'args' more Junio C Hamano
2020-08-27  4:21                           ` Jeff King
2020-08-27  4:30                             ` Junio C Hamano
2020-08-27  4:31                             ` Eric Sunshine
2020-08-27  4:44                               ` Jeff King
2020-08-27  5:03                                 ` Eric Sunshine
2020-08-27  5:25                                   ` [PATCH] worktree: fix leak in check_clean_worktree() Jeff King
2020-08-27  5:56                                     ` Eric Sunshine
2020-08-27 15:31                                       ` Junio C Hamano
2020-08-27  4:13                         ` [PATCH v2 3/2] credential-cache: use child_process.args Jeff King
2020-08-27  4:22                           ` Jeff King
2020-08-27  4:31                           ` Junio C Hamano
2020-08-27  4:14                         ` Jeff King
2020-08-27 15:34                           ` Junio C Hamano
2020-08-31 22:56                         ` Junio C Hamano
2020-09-01  4:49                           ` Jeff King
2020-09-01 16:11                             ` Junio C Hamano
2020-08-27  0:57                     ` [PATCH v2 0/2] avoid running "git-subcmd" in the dashed form Derrick Stolee
2020-08-27  1:22                       ` Junio C Hamano
2020-08-28  9:14             ` [PATCH] pack-redundant: gauge the usage before proposing its removal Jeff King
2020-08-28 22:45               ` Junio C Hamano
2020-08-25  7:55 ` [PATCH] builtin/repack.c: invalidate MIDX only when necessary Son Luong Ngoc
2020-08-25 12:45   ` Derrick Stolee
2020-08-25 14:45   ` Taylor Blau
2020-08-25 16:04     ` [PATCH v2] " Taylor Blau
2020-08-26 20:51       ` Derrick Stolee
2020-08-26 20:54         ` Junio C Hamano
2020-08-25 16:47     ` [PATCH] " Jeff King
2020-08-25 17:10       ` Derrick Stolee
2020-08-25 17:29         ` Jeff King
2020-08-25 17:34           ` Taylor Blau
2020-08-25 17:42             ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200825023710.GA98081@syl.lan \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).