git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, dstolee@microsoft.com
Subject: Re: [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'
Date: Fri, 29 Jan 2021 17:43:32 -0500	[thread overview]
Message-ID: <YBSPlO/ki5vRNX0T@coredump.intra.peff.net> (raw)
In-Reply-To: <xmqq35yjtrip.fsf@gitster.c.googlers.com>

On Fri, Jan 29, 2021 at 12:30:38PM -0800, Junio C Hamano wrote:

> > I think this would generally happen if the .keep packs are generated
> > using something like "git repack -a", which packs everything reachable
> > together. So if you do:
> >
> >   git repack -ad
> >   touch .git/objects/pack/pack-whatever.keep
> >   ... some more packs come in, perhaps via pushes ...
> >   # imagine repack knew how to pass this along...
> >   git repack -a --assume-kept-packs-closed
> >
> > then you'd repack just the objects that aren't in the big pack.
> 
> Yeah.  As a tool to help the above workflow, where you are only
> creating another .keep out of youngest objects (i.e. those that are
> either loose or in non-kept packs), because by definition anything
> in .keep cannot be pointing back at these younger objects, it does
> make sense to take advantage of "the set of packs with .keep as a
> whole is closed".
> 
> It may become tricky once we start talking about creating a new
> .keep out of youngest objects PLUS a few young keep packs, though.

Right. You'd have to make sure the younger packs were also created with
that reachability in mind. I.e., if you are in a situation where you've
got:

  - a big "old" pack
  - N new packs from pushes

then you can't assume anything about the reachability for those
individual N packs. It would be wrong to split any of them into the
"keep" side. They need to have all their objects traversed until we hit
something in the old pack.

But if you have a situation with:

  - a big "old" pack
  - M packs made from previous rollups on top of the old pack
  - N new packs from pushes

Then I think you can still take the old pack + the M packs as a
cohesive unit closed under reachability.

The tricky part is knowing which packs are which (size is a heuristic,
but it can be wrong; people may make a big push to a previously-small
repository).

> Starting from all on-disk .keep packs, you'd mark them as in-core
> keep bit, then drop in-core keep bit from the few young keep packs
> that you intend to coalesce with the youngest objects---that is how
> I would imagine your repacking strategy would go.  The set of all
> the on-disk .keep packs may give us "closed" guarantee, but if we 
> exclude a few latest packs from that set, would the remainder still
> give us the "closed" guarantee we can take advantage of, in order to
> pack these youngest objects (including the ones in the kept packs
> that we are coalescing)?

Yeah, I think we are going along the same lines. Except I think it is
dangerous to use on-disk ".keep" as your marker, because we will racily
see incoming push packs with a ".keep" (which receive-pack/index-pack
use as a lockfile until the refs are updated).

So repack has to "somehow" get the list of which is which.

None of which is disputing your "it may become tricky", of course. ;) It
is exactly this trickiness that I am worried about. And I am not being
coy with "somehow", as if we have some custom not-yet-shared layer on
top of repack that tracks this. We are still figuring out whether this
is a good direction in the first place. :)

One of the things that led us to this reachability traversal, away from
"just suck up all of the objects from these packs", is that this is how
--unpacked works. I had always assumed it was implemented as "if it's
loose, then put it in the pack". But it's not. It's attached to the
revision traversal. And it actually gets some cases wrong!

It will walk every commit, so you don't have to worry about a packed
commit referring to an unpacked one. But it doesn't look at the trees of
the packed commits (for the quite obvious reason that doing so is orders
of magnitude more expensive). That means that if there is a packed
commit that refers to an unpacked blob (which is not referenced by an
unpacked commit), then "rev-list --unpacked" will not report it (and
likewise "git repack -d" would not pack it).

It's easy to create such a situation manually, but I've included a
more plausible sequence involving "--amend" and push/fetch unpackLimit
at the end of this email.

At the core, --unpacked is assuming certain things about reachability of
loose/packed objects that aren't necessarily true. And this
--assume-kept-pack-closed stuff is basically doing the same thing for a
particular set of packs (albeit more so; I believe the patches here cut
off traversal of parent pointers, not just commit-to-tree pointers).

One of the reasons I think nobody noticed with --unpacked is that the
stakes are pretty low. If our assumption is wrong, the worst case is
that a loose object remains unpacked in "repack -d". But we'd never
delete it based on that information (instead, git prune would do its own
traversal to find the correct reachability). And it would eventually get
picked up by "git repack -ad".

But for repacking, the general strategy is to put things you want to
keep into the new pack, and then delete the old ones (not marked as
keep, of course). So if our assumption is ever wrong, it means we'd
potentially drop packs that have reachable objects not found elsewhere,
and we'd end up corrupting the repository.

So I think the paths forward are either:

  - come up with an air-tight system of making sure that we know packs
    we claim are closed under reachability really are (perhaps some
    marker that says "I was generated by repack -a")

  - have a "roll-up" mode that does not care about reachability at all,
    and just takes any objects from a particular set of packs (plus
    probably loose objects)

I'm still thinking aloud here, and not really sure which is a better
path. I do feel like the failure modes for the second one are less
risky.

Anyway, here's the --unpacked example, if you're curious. It's based on
fetching, but you could invert it to do pushes (in which case it is
repacking in "parent" that gets the wrong result).

-- >8 --
# two repos, one a clone of the other
git init parent
git -C parent commit --allow-empty -m base
git clone parent child

# now there's a small fetch, which will get
# exploded into loose objects.
(
	cd parent
	echo small >small
	git add small
	git commit -m small
)
git -C child fetch

# We can verify that "rev-list --unpacked" reports these
# objects.
git -C child rev-list --objects --unpacked origin

# and now a bigger one that will remain a pack (we'll
# tweak unpackLimit instead of making a really big commit,
# but the concept is the same)
#
# There are two key things here:
#
#   - the "small" commit is no longer reachable, but the big one
#     still contains the "small" blob object. Using --amend is
#     a plausible mechanism for this happening.
#
#   - we are using bitmaps, which give us an exact answer for the set of
#     objects to send. Otherwise, pack-objects on the server actually
#     fails to notice the other side has told us it has the small blob, and
#     sends another copy of it.
(
	cd parent
	git repack -adb
	echo big >big
	git add big
	git commit --amend -m big
)
git -C child -c fetch.unpackLimit=1 fetch

# So now in the child we have a packed object
# whose ancestor is an unpacked one. rev-list
# now won't report the "small" blob (ac790413).
git.compile -C child rev-list --objects --unpacked origin

# Even though we can see that it's present only as an
# unpacked object.
show_objects() {
	for i in child/.git/objects/pack/*.idx; do
		git show-index <$i
	done | cut -d' ' -f2 | sed 's/^/packed: /'
	find child/.git/objects/??/* |
		perl -F/ -alne 'print " loose: $F[-2]$F[-1]"'
}

# If we were to do an incremental repack now, it wouldn't be packed.
# (Note we have to kill off the reflog, which still references the
# rewound commit).
rm -rf child/.git/logs
git -C child repack -d
show_objects

  reply	other threads:[~2021-01-29 22:51 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-01-20 13:40   ` Derrick Stolee
2021-01-20 14:38     ` Taylor Blau
2021-01-29  2:33   ` Junio C Hamano
2021-01-29 18:38     ` Taylor Blau
2021-01-29 19:31     ` Jeff King
2021-01-29 20:20       ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
2021-01-29  3:10   ` Junio C Hamano
2021-01-29 19:13     ` Taylor Blau
2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
2021-01-29  3:21   ` Junio C Hamano
2021-01-29 19:19     ` Jeff King
2021-01-29 20:01       ` Taylor Blau
2021-01-29 20:25         ` Jeff King
2021-01-29 22:10           ` Taylor Blau
2021-01-29 22:57             ` Jeff King
2021-01-29 23:03             ` Junio C Hamano
2021-01-29 23:28               ` Taylor Blau
2021-02-02  3:04                 ` Taylor Blau
2021-01-29 23:31               ` Jeff King
2021-01-29 22:13           ` Junio C Hamano
2021-01-29 20:30       ` Junio C Hamano
2021-01-29 22:43         ` Jeff King [this message]
2021-01-29 22:53           ` Taylor Blau
2021-01-29 23:00             ` Jeff King
2021-01-29 23:10             ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 04/10] p5303: add missing &&-chains Taylor Blau
2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
2021-01-29  3:40   ` Junio C Hamano
2021-01-29 19:32     ` Jeff King
2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
2021-01-29 20:19         ` Eric Sunshine
2021-01-29 20:27           ` Jeff King
2021-01-29 20:36             ` Eric Sunshine
2021-01-29 22:11               ` Taylor Blau
2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
2021-01-29 22:10         ` Jeff King
2021-01-29 23:12           ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic Taylor Blau
2021-01-19 23:24 ` [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-01-19 23:24 ` [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin' Taylor Blau
2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
2021-01-20 13:59   ` Derrick Stolee
2021-01-20 14:34     ` Taylor Blau
2021-01-20 15:51       ` Derrick Stolee
2021-01-21  3:45     ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 10/10] builtin/repack.c: add '--geometric' option Taylor Blau
2021-01-20 14:05 ` [PATCH 00/10] repack: support repacking into a geometric sequence Derrick Stolee
2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-16 21:42     ` Jeff King
2021-02-16 21:48       ` Taylor Blau
2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-16 23:17     ` Jeff King
2021-02-17 18:35       ` Taylor Blau
2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-16 23:46     ` Jeff King
2021-02-17 18:59       ` Taylor Blau
2021-02-17 19:21         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-16 23:58     ` Jeff King
2021-02-17  0:02       ` Jeff King
2021-02-17 19:13       ` Taylor Blau
2021-02-17 19:25         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-17 16:05     ` Jeff King
2021-02-17 19:23       ` Taylor Blau
2021-02-17 19:29         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-17 17:11     ` Jeff King
2021-02-17 19:54       ` Taylor Blau
2021-02-17 20:25         ` Jeff King
2021-02-17 20:29           ` Taylor Blau
2021-02-17 21:43             ` Jeff King
2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-17 18:17     ` Jeff King
2021-02-17 20:01       ` Taylor Blau
2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-17 18:18     ` Jeff King
2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-18  3:14   ` [PATCH v3 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-18  3:14   ` [PATCH v3 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-18  3:14   ` [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-18  3:14   ` [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-18  3:14   ` [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  1:06     ` Taylor Blau
2021-02-23  1:42       ` Jeff King
2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-23  8:07     ` Junio C Hamano
2021-02-23 18:51       ` Jeff King
2021-02-23  2:25   ` [PATCH v4 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-23  2:25   ` [PATCH v4 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-23  2:25   ` [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-23  2:25   ` [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-24 23:19     ` Junio C Hamano
2021-02-24 23:43       ` Junio C Hamano
2021-03-04 21:40         ` Taylor Blau
2021-03-04 21:55       ` Taylor Blau
2021-02-23  3:39   ` [PATCH v4 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  7:43   ` Junio C Hamano
2021-02-23 18:44     ` Jeff King
2021-02-23 19:54       ` Martin Fick
2021-02-23 20:06         ` Taylor Blau
2021-02-23 21:57           ` Martin Fick
2021-02-23 20:15         ` Jeff King
2021-02-23 21:41           ` Martin Fick
2021-02-23 21:53             ` Jeff King
2021-02-24 18:13               ` Martin Fick
2021-02-26  6:23                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YBSPlO/ki5vRNX0T@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).