git@vger.kernel.org mailing list mirror (one of many)
* "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
@ 2017-06-08 12:45 Lars Schneider
  2017-06-09  5:27 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Schneider @ 2017-06-08 12:45 UTC (permalink / raw)
  To: Git Mailing List

Hi,

I recently ran into "There are too many unreachable loose objects; run 
'git prune' to remove them." after an "Auto packing the repository in 
background for optimum performance." message.

This was introduced with a087cc9 "git-gc --auto: protect ourselves from 
accumulated cruft", but I don't really understand the commit message.

Why don't we call 'git prune' automatically? I thought Git would prune
unreachable objects after 90 days by default anyway. Is the warning 
about unreachable objects that are not yet 90 days old?

Thanks,
Lars

* Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
  2017-06-08 12:45 "There are too many unreachable loose objects" - why don't we run 'git prune' automatically? Lars Schneider
@ 2017-06-09  5:27 ` Jeff King
  2017-06-09 12:03   ` Lars Schneider
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2017-06-09  5:27 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List

On Thu, Jun 08, 2017 at 02:45:48PM +0200, Lars Schneider wrote:

> I recently ran into "There are too many unreachable loose objects; run 
> 'git prune' to remove them." after an "Auto packing the repository in 
> background for optimum performance." message.
> 
> This was introduced with a087cc9 "git-gc --auto: protect ourselves from 
> accumulated cruft", but I don't really understand the commit message.
> 
> Why don't we call 'git prune' automatically? I thought Git would prune
> unreachable objects after 90 days by default anyway. Is the warning 
> about unreachable objects that are not yet 90 days old?

We _do_ call "git prune", but we do so with whatever the configured
expiration time is (by default 2 weeks; the 90-day expiration is for
reflogs).

The problem is that auto-gc kicked in because there were a bunch of
loose objects, but after repacking and running "git prune" there were
still enough loose objects to trigger auto-gc. Which means every command
you run will do an auto-gc that never actually helps.
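
For reference, the "too many" check is only a sampling heuristic: count the
loose objects in a single fan-out directory and extrapolate, since SHA-1
names are evenly distributed. An illustrative sketch (not the actual
builtin/gc.c code):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Estimate the number of loose objects by sampling ".git/objects/17". */
static int estimate_loose_objects(const char *objdir)
{
        char path[4096];
        DIR *dir;
        struct dirent *ent;
        int num_loose = 0;

        snprintf(path, sizeof(path), "%s/17", objdir);
        dir = opendir(path);
        if (!dir)
                return 0;
        while ((ent = readdir(dir)) != NULL) {
                /* a loose object is 38 hex chars inside the 2-char fan-out dir */
                if (strspn(ent->d_name, "0123456789abcdef") == 38 &&
                    ent->d_name[38] == '\0')
                        num_loose++;
        }
        closedir(dir);
        return num_loose * 256; /* scale the one sampled directory to all 256 */
}

int main(void)
{
        printf("estimated loose objects: %d (gc.auto defaults to 6700)\n",
               estimate_loose_objects(".git/objects"));
        return 0;
}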

So you have two options:

  1. Wait until those objects expire (which may be up to 2 weeks,
     depending on how recent they are), at which point your auto-gc will
     finally delete them.

  2. Run "git prune". Without an argument it prunes everything now,
     with no expiration period.

I agree the existing message isn't great. There should probably be a big
advise() block explaining what's going on (and that expert users can
disable).

-Peff

* Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
  2017-06-09  5:27 ` Jeff King
@ 2017-06-09 12:03   ` Lars Schneider
  2017-06-10  8:06     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Schneider @ 2017-06-09 12:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List


> On 09 Jun 2017, at 07:27, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Jun 08, 2017 at 02:45:48PM +0200, Lars Schneider wrote:
> 
>> I recently ran into "There are too many unreachable loose objects; run 
>> 'git prune' to remove them." after an "Auto packing the repository in 
>> background for optimum performance." message.
>> 
>> This was introduced with a087cc9 "git-gc --auto: protect ourselves from 
>> accumulated cruft", but I don't really understand the commit message.
>> 
>> Why don't we call 'git prune' automatically? I thought Git would prune
>> unreachable objects after 90 days by default anyway. Is the warning 
>> about unreachable objects that are not yet 90 days old?
> 
> We _do_ call "git prune", but we do so with whatever the configured
> expiration time is (by default 2 weeks; the 90-day expiration is for
> reflogs).
> 
> The problem is that auto-gc kicked in because there were a bunch of
> loose objects, but after repacking and running "git prune" there were
> still enough loose objects to trigger auto-gc. Which means every command
> you run will do an auto-gc that never actually helps.
> 
> So you have two options:
> 
>  1. Wait until those objects expire (which may be up to 2 weeks,
>     depending on how recent they are), at which point your auto-gc will
>     finally delete them.
> 
>  2. Run "git prune". Without an argument it prunes everything now,
>     with no expiration period.
> 
> I agree the existing message isn't great. There should probably be a big
> advise() block explaining what's going on (and that expert users can
> disable).

How about this?

diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..12ee212544 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
-	if (auto_gc && too_many_loose_objects())
-		warning(_("There are too many unreachable loose objects; "
-			"run 'git prune' to remove them."));
+	if (auto_gc && too_many_loose_objects()) {
+		warning(_("Auto packing did not lead to optimal results as the "
+			"repository contains too many unreachable objects."));
+		advice(_("Unreachable objects are Git objects (commits, files, ...) "
+			"that are not referenced by any branch or tag. This might happen "
+			"if you use 'git rebase' or if you delete branches. Auto packing "
+			"only prunes unreachable objects that are older than 2 weeks "
+			"(default, overridable by the config variable 'gc.pruneExpire'). "
+			"Please run 'git prune' to prune all unreachable objects for "
+			"optimal repository performance."));
+	}
 
 	if (!daemonized)
 		unlink(git_path("gc.log"));
- Lars


* Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
  2017-06-09 12:03   ` Lars Schneider
@ 2017-06-10  8:06     ` Jeff King
  2017-06-18 13:22       ` Lars Schneider
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2017-06-10  8:06 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List

On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:

> > I agree the existing message isn't great. There should probably be a big
> > advise() block explaining what's going on (and that expert users can
> > disable).
> 
> How about this?
> 
> diff --git a/builtin/gc.c b/builtin/gc.c
> index c2c61a57bb..12ee212544 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (pack_garbage.nr > 0)
>  		clean_pack_garbage();
>  
> -	if (auto_gc && too_many_loose_objects())
> -		warning(_("There are too many unreachable loose objects; "
> -			"run 'git prune' to remove them."));
> +	if (auto_gc && too_many_loose_objects()) {
> +		warning(_("Auto packing did not lead to optimal results as the "
> +			"repository contains too many unreachable objects."));
> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
> +			"that are not referenced by any branch or tag. This might happen "
> +			"if you use 'git rebase' or if you delete branches. Auto packing "
> +			"only prunes unreachable objects that are older than 2 weeks "
> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
> +			"Please run 'git prune' to prune all unreachable objects for "
> +			"optimal repository performance."));
> +	}

s/advice/advise/, of course. This should probably be protected by a new
entry in advice_config[] in advice.c.
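
Something like this, perhaps (untested sketch; the "gcautoprune" /
advice.gcAutoPrune name and the message placeholders are made up):

/* advice.h */
extern int advice_gc_auto_prune;

/* advice.c */
int advice_gc_auto_prune = 1;

static struct {
        const char *name;
        int *preference;
} advice_config[] = {
        /* ...existing entries... */
        { "gcautoprune", &advice_gc_auto_prune },
};

/* builtin/gc.c, at the end of cmd_gc() */
if (auto_gc && too_many_loose_objects()) {
        warning(_("..."));        /* short statement of what we observed */
        if (advice_gc_auto_prune)
                advise(_("...")); /* longer hint; disabled via advice.gcAutoPrune */
}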

But I assume you are most interested in the text. I think it
simultaneously goes into too much and too little detail. I think the
warning itself should just say _what_ we observed: after garbage
collection, there were still enough objects to trigger a gc.  And then
the hint doesn't need to go into the details of why we prune or what
unreachable objects are. Those can be cross-referenced with other
documentation. I think we need to focus on what the warning means, and
whether and how they would correct it.

Maybe:

  warning: too many loose objects remain after garbage collection
  hint: Automatic garbage collection is triggered when there are a
  hint: large number of unpacked objects in the repository. Unreachable
  hint: objects that are more recent than gc.pruneExpire are not
  hint: pruned. If there are too many of these recent loose
  hint: objects, automatic garbage collection may be triggered more
  hint: frequently than necessary. You may run "git prune" now to
  hint: prune all unreachable objects, regardless of their age.

I was tempted to suggest that we find and report the correct "prune"
cutoff that would let us avoid auto-gc. I.e., sort the unreachable
objects by timestamp and find the cutoff that will drop enough to leave
fewer than `gc.auto`. That in theory makes things a bit safer. That's
probably not a good idea, though:

  1. Telling the user to run `git prune --expire=37.minutes.ago` is
     just going to confuse them. We could hide it behind a command line
     option like `git prune --expire-auto-gc` or something, though.

  2. Now that we try to keep recent chunks, the analysis isn't quite so
     easy. You may have a single recent commit that references a ton of
     old history, and only dropping that commit would help. So the
     analysis is harder than a simple sort-and-cutoff, but it also means
     that the prune times are likely to skew close to "now".

  3. If we just show them how to prune the minimal amount, then they're
     likely to just hit this message again soon.

So that's probably a dead end.

To be honest, the fact that we have to write this warning at all is a
sign that Git is not doing a very good job. The best place to spend
effort would be to teach git-gc to pack all of the unreachable objects
into a single "cruft" pack, so this problem doesn't happen at all (and
it's way more efficient, as well).

The big problem with that approach is that we lose individual-object
timestamps. Each object just gets the timestamp of its surrounding pack,
so as we continually ran auto-gc, the cruft-pack timestamp would get
updated and we'd never drop objects. So we'd need some auxiliary file
(e.g., pack-1234abcd.times) that stores the per-object timestamps. This
can be as small as a 32- or 64-bit int per object, since we can just
index it by the existing object list in the pack .idx.
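
Concretely, such a sidecar could be nothing more than a flat array of
timestamps parallel to the .idx object list. A purely hypothetical sketch
of reading one (no such file or format exists today):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical "pack-1234abcd.times" reader: one 64-bit timestamp per
 * object, stored in the same order as the entries in the pack .idx, so
 * looking up an object's mtime is just an array index.
 */
static uint64_t *read_pack_times(const char *path, uint32_t nr_objects)
{
        FILE *fp = fopen(path, "rb");
        uint64_t *mtimes;

        if (!fp)
                return NULL;
        mtimes = malloc(nr_objects * sizeof(*mtimes));
        if (!mtimes || fread(mtimes, sizeof(*mtimes), nr_objects, fp) != nr_objects) {
                free(mtimes);
                mtimes = NULL;
        }
        fclose(fp);
        return mtimes; /* mtimes[i] belongs to the i-th object in .idx order */
}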

The trickiest part would be when an object's timestamp gets freshened
(because somebody tried to write it again but we optimized out the
write). Updating the timestamps in the .times file would probably work
atomically, though we usually avoid writing in the middle of a file (we
certainly can't portably do so via mmap, and I can't think of another
case where we do seeked writes). It might be sufficient for objects in
the cruft pack to just do the actual loose object write instead of
optimizing it out. It should happen very rarely.
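
With a flat format like the one sketched above, freshening would boil down
to a single seeked write, e.g. via POSIX pwrite(). Here is a hypothetical
sketch (ignoring byte order and locking); it is exactly the kind of
in-place update we normally avoid:

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Freshen the recorded mtime of the idx_pos-th object in a .times file. */
static int freshen_pack_time(const char *path, uint32_t idx_pos, uint64_t now)
{
        int fd = open(path, O_WRONLY);
        ssize_t written;

        if (fd < 0)
                return -1;
        written = pwrite(fd, &now, sizeof(now), (off_t)idx_pos * sizeof(now));
        close(fd);
        return written == (ssize_t)sizeof(now) ? 0 : -1;
}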

-Peff

* Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
  2017-06-10  8:06     ` Jeff King
@ 2017-06-18 13:22       ` Lars Schneider
  2017-06-20 14:08         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Schneider @ 2017-06-18 13:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List


> On 10 Jun 2017, at 10:06, Jeff King <peff@peff.net> wrote:
> 
> On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:
> 
>>> I agree the existing message isn't great. There should probably be a big
>>> advise() block explaining what's going on (and that expert users can
>>> disable).
>> 
>> How about this?
>> 
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index c2c61a57bb..12ee212544 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>> 	if (pack_garbage.nr > 0)
>> 		clean_pack_garbage();
>> 
>> -	if (auto_gc && too_many_loose_objects())
>> -		warning(_("There are too many unreachable loose objects; "
>> -			"run 'git prune' to remove them."));
>> +	if (auto_gc && too_many_loose_objects()) {
>> +		warning(_("Auto packing did not lead to optimal results as the "
>> +			"repository contains too many unreachable objects."));
>> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
>> +			"that are not referenced by any branch or tag. This might happen "
>> +			"if you use 'git rebase' or if you delete branches. Auto packing "
>> +			"only prunes unreachable objects that are older than 2 weeks "
>> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
>> +			"Please run 'git prune' to prune all unreachable objects for "
>> +			"optimal repository performance."));
>> +	}
> 
> s/advice/advise/, of course. This should probably be protected by a new
> entry in advice_config[] in advice.c.
> 
> But I assume you are most interested in the text. I think it
> simultaneously goes into too much and too little detail. I think the
> warning itself should just say _what_ we observed: after garbage
> collection, there were still enough objects to trigger a gc.  And then
> the hint doesn't need to go into the details of why we prune or what
> unreachable objects are. Those can be cross-referenced with other
> documentation. I think we need to focus on what the warning means, and
> whether and how they would correct it.
> 
> Maybe:
> 
>  warning: too many loose objects remain after garbage collection
>  hint: Automatic garbage collection is triggered when there are a
>  hint: large number of unpacked objects in the repository. Unreachable
>  hint: objects that are more recent than gc.pruneExpire are not
>  hint: pruned. If there are too many of these recent loose
>  hint: objects, automatic garbage collection may be triggered more
>  hint: frequently than necessary. You may run "git prune" now to
>  hint: prune all unreachable objects, regardless of their age.
> 
> I was tempted to suggest that we find and report the correct "prune"
> cutoff that would let us avoid auto-gc. I.e., sort the unreachable
> objects by timestamp and find the cutoff that will drop enough to leave
> fewer than `gc.auto`. That in theory makes things a bit safer. That's
> probably not a good idea, though:
> 
>  1. Telling the user to run `git prune --expire=37.minutes.ago` is
>     just going to confuse them. We could hide it behind a command line
>     option like `git prune --expire-auto-gc` or something, though.
> 
>  2. Now that we try to keep recent chunks, the analysis isn't quite so
>     easy. You may have a single recent commit that references a ton of
>     old history, and only dropping that commit would help. So the
>     analysis is harder than a simple sort-and-cutoff, but it also means
>     that the prune times are likely to skew close to "now".
> 
>  3. If we just show them how to prune the minimal amount, then they're
>     likely to just hit this message again soon.
> 
> So that's probably a dead end.
> 
> To be honest, the fact that we have to write this warning at all is a
> sign that Git is not doing a very good job. The best place to spend
> effort would be to teach git-gc to pack all of the unreachable objects
> into a single "cruft" pack, so this problem doesn't happen at all (and
> it's way more efficient, as well).
> 
> The big problem with that approach is that we lose individual-object
> timestamps. Each object just gets the timestamp of its surrounding pack,
> so as we continually ran auto-gc, the cruft-pack timestamp would get
> updated and we'd never drop objects. So we'd need some auxiliary file
> (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> can be as small as a 32- or 64-bit int per object, since we can just
> index it by the existing object list in the pack .idx.

Why can't we generate a new cruft-pack on every gc run that detects too
many unreachable objects? That would not be as efficient as a single
cruft-pack but it should be way more efficient than the individual
objects, no?

Plus, chances are that the existing cruft-packs are purged with the next
gc run anyway.

- Lars

> The trickiest part would be when an object's timestamp gets freshened
> (because somebody tried to write it again but we optimized out the
> write). Updating the timestamps in the .times file would probably work
> atomically, though we usually avoid writing in the middle of a file (we
> certainly can't portably do so via mmap, and I can't think of another
> case where we do seeked writes). It might be sufficient for objects in
> the cruft pack to just do the actual loose object write instead of
> optimizing it out. It should happen very rarely.


* Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
  2017-06-18 13:22       ` Lars Schneider
@ 2017-06-20 14:08         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-06-20 14:08 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List

On Sun, Jun 18, 2017 at 03:22:29PM +0200, Lars Schneider wrote:

> > To be honest, the fact that we have to write this warning at all is a
> > sign that Git is not doing a very good job. The best place to spend
> > effort would be to teach git-gc to pack all of the unreachable objects
> > into a single "cruft" pack, so this problem doesn't happen at all (and
> > it's way more efficient, as well).
> > 
> > The big problem with that approach is that we lose individual-object
> > timestamps. Each object just gets the timestamp of its surrounding pack,
> > so as we continually ran auto-gc, the cruft-pack timestamp would get
> > updated and we'd never drop objects. So we'd need some auxiliary file
> > (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> > can be as small as a 32- or 64-bit int per object, since we can just
> > index it by the existing object list in the pack .idx.
> 
> Why can't we generate a new cruft-pack on every gc run that detects too
> many unreachable objects? That would not be as efficient as a single
> cruft-pack but it should be way more efficient than the individual
> objects, no?
> 
> Plus, chances are that the existing cruft-packs are purged with the next
> gc run anyway.

Interesting idea. Here are some thoughts in random order.

That loses some delta opportunities between the cruft packs, but that's
certainly no worse than the all-loose storage we have today.

One nice aspect is that it means cruft objects don't incur any I/O cost
during a repack.

It doesn't really solve the "too many loose objects after gc" problem.
It just punts it to "too many packs after gc". This is likely to be
better because the number of packs would scale with the number of gc
runs, rather than the number of crufty objects. But there would still be
corner cases if you run gc frequently. Maybe that would be acceptable.

I'm not sure how the pruning process would work, especially with respect
to objects reachable from other unreachable-but-recent objects. Right
now the repack-and-delete procedure is done by git-repack, and is
basically:

  1. Get a list of all of the current packs.

  2. Ask pack-objects to pack everything into a new pack. Normally this
     is reachable objects, but we also include recent objects and
     objects reachable from recent objects. And of course with "-k" all
     objects are kept.

  3. Delete everything in the list from (1), under the assumption that
     anything worth keeping was repacked in step (2), and anything else
     is OK to drop.

So if there are regular packs and cruft packs, we'd have to know in step
3 which are which. We'd delete the regular ones, whose objects have all
been migrated to the new pack (either a "real" one or a cruft one), but
keep the crufty ones whose timestamps are still fresh.
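
In code, the step-3 change might look something like this (hypothetical
sketch; the is_cruft flag, the mtime bookkeeping, and remove_pack() are
made up for illustration):

#include <stdio.h>
#include <time.h>

struct old_pack {
        const char *name;  /* e.g. "pack-1234abcd" */
        int is_cruft;      /* written by an earlier gc as a cruft pack */
        time_t mtime;      /* timestamp of the pack file */
};

static void remove_pack(const char *name)
{
        /* in git this would unlink the .pack, .idx, etc.; stubbed here */
        printf("removing %s\n", name);
}

/* Step 3: drop repacked packs, but keep cruft packs that are still fresh. */
static void delete_old_packs(struct old_pack *packs, int nr, time_t expire)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (packs[i].is_cruft && packs[i].mtime > expire)
                        continue; /* crufty but not yet expired: keep it */
                remove_pack(packs[i].name);
        }
}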

That's a small change, and works except for one thing: the reachable
from recent objects. You can't just delete a whole cruft pack. Some of
its objects may be reachable from objects in other cruft packs that
we're keeping. In other words, you have cruft packs where you want to
keep half of the objects they contain. How do you do that?

I think you'd have to make pack-objects aware of the concept of cruft
packs, and that it should include reachable-from-recent objects in the
new pack only if they're in a cruft pack that is going to be deleted. So
those objects would be "rescued" from the cruft pack before it goes away
and migrated to the new cruft pack. That would effectively refresh their
timestamp, but that's fine. They're reachable from objects with that
fresh timestamp already, so effectively they couldn't be deleted until
that timestamp is hit.

So I think it's do-able, but it is a little complicated.

-Peff
