From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
Date: Sun, 18 Jun 2017 15:22:29 +0200 [thread overview]
Message-ID: <EDF2B923-8A5F-436E-BDB8-82249C6052ED@gmail.com> (raw)
In-Reply-To: <20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net>
> On 10 Jun 2017, at 10:06, Jeff King <peff@peff.net> wrote:
>
> On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:
>
>>> I agree the existing message isn't great. There should probably be a big
>>> advise() block explaining what's going on (and that expert users can
>>> disable).
>>
>> How about this?
>>
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index c2c61a57bb..12ee212544 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>> if (pack_garbage.nr > 0)
>> clean_pack_garbage();
>>
>> - if (auto_gc && too_many_loose_objects())
>> - warning(_("There are too many unreachable loose objects; "
>> - "run 'git prune' to remove them."));
>> + if (auto_gc && too_many_loose_objects()) {
>> + warning(_("Auto packing did not lead to optimal results as the "
>> + "repository contains too many unreachable objects."));
>> + advice(_("Unreachable objects are Git objects (commits, files, ...) "
>> + "that are not referenced by any branch or tag. This might happen "
>> + "if you use 'git rebase' or if you delete branches. Auto packing "
>> + "only prunes unreachable objects that are older than 2 weeks "
>> + "(default, overridable by the config variable 'gc.pruneExpire'). "
>> + "Please run 'git prune' to prune all unreachable objects for "
>> + "optimal repository performance."));
>> + }
>
> s/advice/advise/, of course. This probably be protected by a new entry
> in advice_config[] in advice.c.
>
> But I assume you are most interested in the text. I think it
> simultaneously goes into too much and too little detail. I think the
> warning itself should just say _what_ we observed: after garbage
> collection, there were still enough objects to trigger a gc. And then
> the hint doesn't need to go into the details of why we prune or what
> unreachable objects are. Those can be cross-referenced with other
> documentation. I think we need to focus on what the warning means, and
> whether and how they would correct it.
>
> Maybe:
>
> warning: too many loose objects remain after garbage collection
> hint: Automatic garbage collection is triggered when there are a
> hint: large number of unpacked objects in the repository. Unreachable
> hint: objects that are more recent than gc.pruneExpire are not
> hint: pruned. If there are too many of these recent loose
> hint: objects, automatic garbage collection may be triggered more
> hint: frequently than necessary. You may run "git prune" now to
> hint: prune all unreachable objects, regardless of their age.
>
> I was tempted to suggest that we find and report the correct "prune"
> cutoff that would let us avoid auto-gc. I.e., sort the unreachable
> objects by timestamp and find the cutoff that will drop enough to leave
> fewer than `gc.auto`. That in theory makes things a bit safer. That's
> probably not a good idea, though:
>
> 1. Telling the user to run `git prune --expire=37.minutes.ago` is
> just going to confuse them. We could hide it behind a command line
> option like `git prune --expire-auto-gc` or something, though.
>
> 2. Now that we try to keep recent chunks, the analysis isn't quite so
> easy. You may have a single recent commit that references a ton of
> old history, and only dropping that commit would help. So the
> analysis is harder than a simple sort-and-cutoff, but it also means
> that the prune times are likely to skew close to "now".
>
> 3. If we just show them how to prune the minimal amount, then they're
> likely to just hit this message again soon.
>
> So that's probably a dead end.
>
> To be honest, the fact that we have to write this warning at all is a
> sign that Git is not doing a very good job. The best place to spend
> effort would be to teach git-gc to pack all of the unreachable objects
> into a single "cruft" pack, so this problem doesn't happen at all (and
> it's way more efficient, as well).
>
> The big problem with that approach is that we lose individual-object
> timestamps. Each object just gets the timestamp of its surrounding pack,
> so as we continually ran auto-gc, the cruft-pack timestamp would get
> updated and we'd never drop objects. So we'd need some auxiliary file
> (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> can be as small as a 32- or 64-bit int per object, since we can just
> index it by the existing object list in the pack .idx.
Why can't we generate a new cruft-pack on every gc run that detects too
many unreachable objects? That would not be as efficient as a single
cruft-pack but it should be way more efficient than the individual
objects, no?
Plus, chances are that the existing cruft-packs are purged with the next
gc run anyways.
- Lars
> The trickiest part would be when an object's timestamp gets freshened
> (because somebody tried to write it again but we optimized out the
> write). Updating the timestamps in the .times file would probably work
> atomically, though we usually avoid writing in the middle of a file (we
> certainly can't portably do so via mmap, and I can't think of another
> case where we do seeked writes). It might be sufficient for objects in
> the cruft pack to just do the actual loose object write instead of
> optimizing it out. It should happen very rarely.
next prev parent reply other threads:[~2017-06-18 13:22 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-08 12:45 "There are too many unreachable loose objects" - why don't we run 'git prune' automatically? Lars Schneider
2017-06-09 5:27 ` Jeff King
2017-06-09 12:03 ` Lars Schneider
2017-06-10 8:06 ` Jeff King
2017-06-18 13:22 ` Lars Schneider [this message]
2017-06-20 14:08 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=EDF2B923-8A5F-436E-BDB8-82249C6052ED@gmail.com \
--to=larsxschneider@gmail.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).