git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
Date: Sat, 10 Jun 2017 04:06:27 -0400
Message-ID: <20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net> (raw)
In-Reply-To: <7497DFA7-3F4E-4DB2-B31B-FDDEB2F30BB8@gmail.com>

On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:

> > I agree the existing message isn't great. There should probably be a big
> > advise() block explaining what's going on (and that expert users can
> > disable).
> 
> How about this?
> 
> diff --git a/builtin/gc.c b/builtin/gc.c
> index c2c61a57bb..12ee212544 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (pack_garbage.nr > 0)
>  		clean_pack_garbage();
>  
> -	if (auto_gc && too_many_loose_objects())
> -		warning(_("There are too many unreachable loose objects; "
> -			"run 'git prune' to remove them."));
> +	if (auto_gc && too_many_loose_objects()) {
> +		warning(_("Auto packing did not lead to optimal results as the "
> +			"repository contains too many unreachable objects."));
> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
> +			"that are not referenced by any branch or tag. This might happen "
> +			"if you use 'git rebase' or if you delete branches. Auto packing "
> +			"only prunes unreachable objects that are older than 2 weeks "
> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
> +			"Please run 'git prune' to prune all unreachable objects for "
> +			"optimal repository performance."));
> +	}

s/advice/advise/, of course. This probably be protected by a new entry
in advice_config[] in advice.c.

But I assume you are most interested in the text. I think it
simultaneously goes into too much and too little detail. I think the
warning itself should just say _what_ we observed: after garbage
collection, there were still enough objects to trigger a gc.  And then
the hint doesn't need to go into the details of why we prune or what
unreachable objects are. Those can be cross-referenced with other
documentation. I think we need to focus on what the warning means, and
whether and how they would correct it.

Maybe:

  warning: too many loose objects remain after garbage collection
  hint: Automatic garbage collection is triggered when there are a
  hint: large number of unpacked objects in the repository. Unreachable
  hint: objects that are more recent than gc.pruneExpire are not
  hint: pruned. If there are too many of these recent loose
  hint: objects, automatic garbage collection may be triggered more
  hint: frequently than necessary. You may run "git prune" now to
  hint: prune all unreachable objects, regardless of their age.

I was tempted to suggest that we find and report the correct "prune"
cutoff that would let us avoid auto-gc. I.e., sort the unreachable
objects by timestamp and find the cutoff that will drop enough to leave
fewer than `gc.auto`. That in theory makes things a bit safer. That's
probably not a good idea, though:

  1. Telling the user to run `git prune --expire=37.minutes.ago` is
     just going to confuse them. We could hide it behind a command line
     option like `git prune --expire-auto-gc` or something, though.

  2. Now that we try to keep recent chunks, the analysis isn't quite so
     easy. You may have a single recent commit that references a ton of
     old history, and only dropping that commit would help. So the
     analysis is harder than a simple sort-and-cutoff, but it also means
     that the prune times are likely to skew close to "now".

  3. If we just show them how to prune the minimal amount, then they're
     likely to just hit this message again soon.

So that's probably a dead end.

To be honest, the fact that we have to write this warning at all is a
sign that Git is not doing a very good job. The best place to spend
effort would be to teach git-gc to pack all of the unreachable objects
into a single "cruft" pack, so this problem doesn't happen at all (and
it's way more efficient, as well).

The big problem with that approach is that we lose individual-object
timestamps. Each object just gets the timestamp of its surrounding pack,
so as we continually ran auto-gc, the cruft-pack timestamp would get
updated and we'd never drop objects. So we'd need some auxiliary file
(e.g., pack-1234abcd.times) that stores the per-object timestamps. This
can be as small as a 32- or 64-bit int per object, since we can just
index it by the existing object list in the pack .idx.

The trickiest part would be when an object's timestamp gets freshened
(because somebody tried to write it again but we optimized out the
write). Updating the timestamps in the .times file would probably work
atomically, though we usually avoid writing in the middle of a file (we
certainly can't portably do so via mmap, and I can't think of another
case where we do seeked writes). It might be sufficient for objects in
the cruft pack to just do the actual loose object write instead of
optimizing it out. It should happen very rarely.

-Peff

  reply	other threads:[~2017-06-10  8:06 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-08 12:45 Lars Schneider
2017-06-09  5:27 ` Jeff King
2017-06-09 12:03   ` Lars Schneider
2017-06-10  8:06     ` Jeff King [this message]
2017-06-18 13:22       ` Lars Schneider
2017-06-20 14:08         ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=larsxschneider@gmail.com \
    --subject='Re: "There are too many unreachable loose objects" - why don'\''t we run '\''git prune'\'' automatically?' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git