git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>,
	Jonathan Tan <jonathantanmy@google.com>,
	Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: [PATCH] gc: do not warn about too many loose objects
Date: Mon, 16 Jul 2018 13:16:45 -0700	[thread overview]
Message-ID: <CABPp-BGdzyhPkFYyocqArtMX8=cDKFuV88q3mboeaTDjt275Tw@mail.gmail.com> (raw)
In-Reply-To: <20180716195226.GB25189@sigill.intra.peff.net>

On Mon, Jul 16, 2018 at 12:52 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote:
>
>> The basic problem here, at least for us, is that gc has enough
>> information to know it could expunge some objects, but because of how
>> it is structured in terms of several substeps (reflog expiration,
>> repack, prune), the information is lost between the steps and it
>> instead writes them out as unreachable objects.  If we could prune (or
>> avoid exploding) loose objects that are only reachable from reflog
>> entries that we are expiring, then the problem goes away for us.  (I
>> totally understand that other repos may have enough unreachable
>> objects for other reasons that Peff's suggestion to just pack up
>> unreachable objects is still a really good idea.  But on its own, it
>> seems like a waste since it's packing stuff that we know we could just
>> expunge.)
>
> No, we should have expunged everything that could be during the "repack"
> and "prune" steps. We feed the expiration time to repack, so that it
> knows to drop objects entirely instead of exploding them loose.

Um, except it doesn't actually do that.  The testcase I provided shows
that it leaves around 10000 objects that are totally deletable and
were only previously referenced by reflog entries -- entries that gc
removed without removing the corresponding objects.


I will note that my testcase was slightly out-of-date; with current
git it needs a call to 'wait_for_background_gc_to_finish' right before
the 'git gc --quiet' to avoid erroring out.

> You
> could literally just do:
>
>   find .git/objects/?? -type f |
>   perl -lne 's{../.{38}$} and print "$1$2"' |
>   git pack-objects .git/objects/pack/cruft-pack
>
> But:
>
>   - that will explode them out only to repack them, which is inefficient
>     (if they're already packed, you can probably reuse deltas, not to
>     mention the I/O savings)
>
>   - there's the question of how to handle timestamps. Some of those
>     objects may have been _about_ to expire, but now you've just put
>     them in a brand-new pack that adds another 2 weeks to their life
>
>   - the find above is sloppy, and will race with somebody adding new
>     objects to the repo
>
> So probably you want to have pack-objects write out the list of objects
> it _would_ explode, rather than exploding them. And then before
> git-repack deletes the old packs, put those into a new cruft pack. That
> _just_ leaves the timestamp issue (which is discussed at length in the
> thread I linked earlier).
>
>> git_actual_garbage_collect() {
>>     GITDIR=$(git rev-parse --git-dir)
>>
>>     # Record all revisions stored in reflog before and after gc
>>     git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
>>     git gc --auto
>>     wait_for_background_gc_to_finish
>>     git rev-list --no-walk --reflog >$GITDIR/gc.final-refs
>>
>>     # Find out which reflog entries were removed
>>     DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))
>
> This is too detailed, I think. There are other reasons to have
> unreachable objects than expired reflogs. I think you really just want
> to consider all unreachable objects (like the pack-objects thing I
> mentioned above).

Yes, like I said, coarse workaround and I never had time to create a
real fix.  But I thought the testcase might be useful as a
demonstration of how git gc leaves around loose objects that were
previously reference by reflogs that gc itself pruned.

  reply	other threads:[~2018-07-16 20:16 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
2018-07-16 17:51 ` Jeff King
2018-07-16 18:22   ` Jonathan Nieder
2018-07-16 18:52     ` Jeff King
2018-07-16 19:09       ` Jonathan Nieder
2018-07-16 19:41         ` Jeff King
2018-07-16 19:54           ` Jonathan Nieder
2018-07-16 20:29             ` Jeff King
2018-07-16 20:37               ` Jonathan Nieder
2018-07-16 21:09                 ` Jeff King
2018-07-16 21:40                   ` Jonathan Nieder
2018-07-16 21:45                     ` Jeff King
2018-07-16 22:03                       ` Jonathan Nieder
2018-07-16 22:43                         ` Jeff King
2018-07-16 22:56                           ` Jonathan Nieder
2018-07-16 23:26                             ` Jeff King
2018-07-17  1:53                               ` Jonathan Nieder
2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
2018-07-17 14:03                                   ` Jonathan Nieder
2018-07-17 15:24                                     ` Ævar Arnfjörð Bjarmason
2018-07-17 20:27                                   ` Jeff King
2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
2018-07-18 17:29                                       ` Jeff King
2018-07-17 15:59                                 ` Duy Nguyen
2018-07-17 18:09                                 ` Junio C Hamano
2018-07-16 19:15 ` Elijah Newren
2018-07-16 19:19   ` Jonathan Nieder
2018-07-16 20:21     ` Elijah Newren
2018-07-16 20:35       ` Jeff King
2018-07-16 20:56         ` Jonathan Nieder
2018-07-16 21:12           ` Jeff King
2018-07-16 19:52   ` Jeff King
2018-07-16 20:16     ` Elijah Newren [this message]
2018-07-16 20:38       ` Jeff King
2018-07-16 21:09         ` Elijah Newren
2018-07-16 21:21           ` Jeff King
2018-07-16 22:07             ` Elijah Newren
2018-07-16 22:55               ` Jeff King
2018-07-16 23:06                 ` Elijah Newren
2018-07-16 21:31           ` Jonathan Nieder
2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
2018-07-17 18:19     ` Junio C Hamano
2018-07-17 19:58     ` Jeff King
2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
2018-07-17 18:22     ` Junio C Hamano
2018-07-17 19:59     ` Jeff King
2018-09-17 18:33       ` Jeff King
2018-09-17 18:40         ` Jonathan Nieder
2018-09-18 17:30           ` Jeff King
2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17 20:13     ` Jeff King
2018-07-18 16:21       ` Junio C Hamano
2018-07-18 17:22         ` Jeff King
2018-07-18 18:19           ` Junio C Hamano
2018-07-18 19:06             ` Jeff King
2018-07-18 19:55               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BGdzyhPkFYyocqArtMX8=cDKFuV88q3mboeaTDjt275Tw@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).