git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
Date: Sun, 18 Jun 2017 15:22:29 +0200	[thread overview]
Message-ID: <EDF2B923-8A5F-436E-BDB8-82249C6052ED@gmail.com> (raw)
In-Reply-To: <20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net>


> On 10 Jun 2017, at 10:06, Jeff King <peff@peff.net> wrote:
> 
> On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:
> 
>>> I agree the existing message isn't great. There should probably be a big
>>> advise() block explaining what's going on (and that expert users can
>>> disable).
>> 
>> How about this?
>> 
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index c2c61a57bb..12ee212544 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>> 	if (pack_garbage.nr > 0)
>> 		clean_pack_garbage();
>> 
>> -	if (auto_gc && too_many_loose_objects())
>> -		warning(_("There are too many unreachable loose objects; "
>> -			"run 'git prune' to remove them."));
>> +	if (auto_gc && too_many_loose_objects()) {
>> +		warning(_("Auto packing did not lead to optimal results as the "
>> +			"repository contains too many unreachable objects."));
>> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
>> +			"that are not referenced by any branch or tag. This might happen "
>> +			"if you use 'git rebase' or if you delete branches. Auto packing "
>> +			"only prunes unreachable objects that are older than 2 weeks "
>> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
>> +			"Please run 'git prune' to prune all unreachable objects for "
>> +			"optimal repository performance."));
>> +	}
> 
> s/advice/advise/, of course. This probably be protected by a new entry
> in advice_config[] in advice.c.
> 
> But I assume you are most interested in the text. I think it
> simultaneously goes into too much and too little detail. I think the
> warning itself should just say _what_ we observed: after garbage
> collection, there were still enough objects to trigger a gc.  And then
> the hint doesn't need to go into the details of why we prune or what
> unreachable objects are. Those can be cross-referenced with other
> documentation. I think we need to focus on what the warning means, and
> whether and how they would correct it.
> 
> Maybe:
> 
>  warning: too many loose objects remain after garbage collection
>  hint: Automatic garbage collection is triggered when there are a
>  hint: large number of unpacked objects in the repository. Unreachable
>  hint: objects that are more recent than gc.pruneExpire are not
>  hint: pruned. If there are too many of these recent loose
>  hint: objects, automatic garbage collection may be triggered more
>  hint: frequently than necessary. You may run "git prune" now to
>  hint: prune all unreachable objects, regardless of their age.
> 
> I was tempted to suggest that we find and report the correct "prune"
> cutoff that would let us avoid auto-gc. I.e., sort the unreachable
> objects by timestamp and find the cutoff that will drop enough to leave
> fewer than `gc.auto`. That in theory makes things a bit safer. That's
> probably not a good idea, though:
> 
>  1. Telling the user to run `git prune --expire=37.minutes.ago` is
>     just going to confuse them. We could hide it behind a command line
>     option like `git prune --expire-auto-gc` or something, though.
> 
>  2. Now that we try to keep recent chunks, the analysis isn't quite so
>     easy. You may have a single recent commit that references a ton of
>     old history, and only dropping that commit would help. So the
>     analysis is harder than a simple sort-and-cutoff, but it also means
>     that the prune times are likely to skew close to "now".
> 
>  3. If we just show them how to prune the minimal amount, then they're
>     likely to just hit this message again soon.
> 
> So that's probably a dead end.
> 
> To be honest, the fact that we have to write this warning at all is a
> sign that Git is not doing a very good job. The best place to spend
> effort would be to teach git-gc to pack all of the unreachable objects
> into a single "cruft" pack, so this problem doesn't happen at all (and
> it's way more efficient, as well).
> 
> The big problem with that approach is that we lose individual-object
> timestamps. Each object just gets the timestamp of its surrounding pack,
> so as we continually ran auto-gc, the cruft-pack timestamp would get
> updated and we'd never drop objects. So we'd need some auxiliary file
> (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> can be as small as a 32- or 64-bit int per object, since we can just
> index it by the existing object list in the pack .idx.

Why can't we generate a new cruft-pack on every gc run that detects too
many unreachable objects? That would not be as efficient as a single
cruft-pack but it should be way more efficient than the individual
objects, no?

Plus, chances are that the existing cruft-packs are purged with the next
gc run anyways.

- Lars

> The trickiest part would be when an object's timestamp gets freshened
> (because somebody tried to write it again but we optimized out the
> write). Updating the timestamps in the .times file would probably work
> atomically, though we usually avoid writing in the middle of a file (we
> certainly can't portably do so via mmap, and I can't think of another
> case where we do seeked writes). It might be sufficient for objects in
> the cruft pack to just do the actual loose object write instead of
> optimizing it out. It should happen very rarely.


  reply	other threads:[~2017-06-18 13:22 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-08 12:45 "There are too many unreachable loose objects" - why don't we run 'git prune' automatically? Lars Schneider
2017-06-09  5:27 ` Jeff King
2017-06-09 12:03   ` Lars Schneider
2017-06-10  8:06     ` Jeff King
2017-06-18 13:22       ` Lars Schneider [this message]
2017-06-20 14:08         ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=EDF2B923-8A5F-436E-BDB8-82249C6052ED@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).