From: Elijah Newren <newren@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>,
Jonathan Tan <jonathantanmy@google.com>,
Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: [PATCH] gc: do not warn about too many loose objects
Date: Mon, 16 Jul 2018 13:16:45 -0700
Message-ID: <CABPp-BGdzyhPkFYyocqArtMX8=cDKFuV88q3mboeaTDjt275Tw@mail.gmail.com>
In-Reply-To: <20180716195226.GB25189@sigill.intra.peff.net>

On Mon, Jul 16, 2018 at 12:52 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote:
>
>> The basic problem here, at least for us, is that gc has enough
>> information to know it could expunge some objects, but because of how
>> it is structured in terms of several substeps (reflog expiration,
>> repack, prune), the information is lost between the steps and it
>> instead writes them out as unreachable objects. If we could prune (or
>> avoid exploding) loose objects that are only reachable from reflog
>> entries that we are expiring, then the problem goes away for us. (I
>> totally understand that other repos may have enough unreachable
>> objects for other reasons that Peff's suggestion to just pack up
>> unreachable objects is still a really good idea. But on its own, it
>> seems like a waste since it's packing stuff that we know we could just
>> expunge.)
>
> No, we should have expunged everything that could be during the "repack"
> and "prune" steps. We feed the expiration time to repack, so that it
> knows to drop objects entirely instead of exploding them loose.

Um, except it doesn't actually do that. The testcase I provided shows
that it leaves around 10000 objects that are totally deletable and
were previously referenced only by reflog entries -- entries that gc
removed without removing the corresponding objects.

I will note that my testcase was slightly out-of-date; with current
git it needs a call to 'wait_for_background_gc_to_finish' right before
the 'git gc --quiet' to avoid erroring out.
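
A helper along these lines could serve that purpose. It is only a
sketch, not the helper from the testcase: the name wait_for_gc_lock,
its arguments, and the timeout are made up for illustration, and it
assumes that a running gc keeps a gc.pid file in the git directory
until it exits.

```shell
# Hypothetical helper (not from the testcase): block until no gc.pid file
# remains in the given git directory, or give up after a timeout.
# Assumption: a running "git gc" keeps $GIT_DIR/gc.pid until it exits.
wait_for_gc_lock() {
	gitdir=$1
	timeout=${2:-60}	# seconds to wait before giving up
	waited=0
	while test -f "$gitdir/gc.pid"
	do
		if test "$waited" -ge "$timeout"
		then
			echo "gc still running (or stale gc.pid) in $gitdir" >&2
			return 1
		fi
		sleep 1
		waited=$((waited + 1))
	done
	return 0
}
```

It might be called as 'wait_for_gc_lock "$(git rev-parse --git-dir)"'
right before the 'git gc --quiet'. The timeout is there because a
crashed gc could leave a stale gc.pid behind.
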
> You could literally just do:
>
> find .git/objects/?? -type f |
> perl -lne 'm{(..)/(.{38})$} and print "$1$2"' |
> git pack-objects .git/objects/pack/cruft-pack
>
> But:
>
>  - that will explode them out only to repack them, which is inefficient
>    (if they're already packed, you can probably reuse deltas, not to
>    mention the I/O savings)
>
>  - there's the question of how to handle timestamps. Some of those
>    objects may have been _about_ to expire, but now you've just put
>    them in a brand-new pack that adds another 2 weeks to their life
>
>  - the find above is sloppy, and will race with somebody adding new
>    objects to the repo
>
> So probably you want to have pack-objects write out the list of objects
> it _would_ explode, rather than exploding them. And then before
> git-repack deletes the old packs, put those into a new cruft pack. That
> _just_ leaves the timestamp issue (which is discussed at length in the
> thread I linked earlier).
>
>> git_actual_garbage_collect() {
>>     GITDIR=$(git rev-parse --git-dir)
>>
>>     # Record all revisions stored in reflog before and after gc
>>     git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
>>     git gc --auto
>>     wait_for_background_gc_to_finish
>>     git rev-list --no-walk --reflog >$GITDIR/gc.final-refs
>>
>>     # Find out which reflog entries were removed
>>     DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))
>
> This is too detailed, I think. There are other reasons to have
> unreachable objects than expired reflogs. I think you really just want
> to consider all unreachable objects (like the pack-objects thing I
> mentioned above).

Yes, like I said, it's a coarse workaround and I never had time to
create a real fix. But I thought the testcase might be useful as a
demonstration of how git gc leaves around loose objects that were
previously referenced by reflogs that gc itself pruned.
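
The before/after comparison in the quoted workaround can be tried in
isolation. Below is a minimal sketch of that one step, assuming two
files with one revision per line; the function name entries_removed is
hypothetical, and it sorts into temporary files rather than using
process substitution so that it also runs under a plain POSIX shell.

```shell
# Print the lines that appear in the first file but not in the second.
# Both inputs are sorted into temporary files first, because comm(1)
# expects sorted input. The function and file names are illustrative.
entries_removed() {
	before=$1
	after=$2
	tmp=$(mktemp -d) || return 1
	sort "$before" >"$tmp/before"
	sort "$after" >"$tmp/after"
	# -23 suppresses columns 2 and 3, leaving lines unique to the first file
	comm -23 "$tmp/before" "$tmp/after"
	rm -rf "$tmp"
}
```

With the files from the workaround this would be called as
'entries_removed "$GITDIR/gc.original-refs" "$GITDIR/gc.final-refs"'.
As noted in the quoted reply, the result covers only objects that
became unreachable through expired reflog entries, not unreachable
objects in general.
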