From: Jeff King <peff@peff.net>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?
Date: Tue, 20 Jun 2017 10:08:38 -0400
Message-ID: <20170620140837.fq3wxb63lnqay6xz@sigill.intra.peff.net>
In-Reply-To: <EDF2B923-8A5F-436E-BDB8-82249C6052ED@gmail.com>

On Sun, Jun 18, 2017 at 03:22:29PM +0200, Lars Schneider wrote:

> > To be honest, the fact that we have to write this warning at all is a
> > sign that Git is not doing a very good job. The best place to spend
> > effort would be to teach git-gc to pack all of the unreachable objects
> > into a single "cruft" pack, so this problem doesn't happen at all (and
> > it's way more efficient, as well).
> > 
> > The big problem with that approach is that we lose individual-object
> > timestamps. Each object just gets the timestamp of its surrounding pack,
> > so as we continually ran auto-gc, the cruft-pack timestamp would get
> > updated and we'd never drop objects. So we'd need some auxiliary file
> > (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> > can be as small as a 32- or 64-bit int per object, since we can just
> > index it by the existing object list in the pack .idx.
> 
> Why can't we generate a new cruft-pack on every gc run that detects too
> many unreachable objects? That would not be as efficient as a single
> cruft-pack but it should be way more efficient than the individual
> objects, no?
> 
> Plus, chances are that the existing cruft-packs are purged with the next
> gc run anyways.

Interesting idea. Here are some thoughts in random order.

That loses some delta opportunities between the cruft packs, but that's
certainly no worse than the all-loose storage we have today.

One nice aspect is that it means cruft objects don't incur any I/O cost
during a repack.

It doesn't really solve the "too many loose objects after gc" problem.
It just punts it to "too many packs after gc". This is likely to be
better because the number of packs would scale with the number of gc
runs, rather than the number of crufty objects. But there would still be
corner cases if you run gc frequently. Maybe that would be acceptable.

I'm not sure how the pruning process would work, especially with respect
to objects reachable from other unreachable-but-recent objects. Right
now the repack-and-delete procedure is done by git-repack, and is
basically (a rough code sketch follows the list):

  1. Get a list of all of the current packs.

  2. Ask pack-objects to pack everything into a new pack. Normally this
     is reachable objects, but we also include recent objects and
     objects reachable from recent objects. And of course with "-k" all
     objects are kept.

  3. Delete everything in the list from (1), under the assumption that
     anything worth keeping was repacked in step (2), and anything else
     is OK to drop.
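
Modeled very roughly in Python (a sketch only; pack_objects() is a
stand-in for the real object traversal, and none of these names are
actual git internals):

  def repack(existing_packs, pack_objects):
      old_packs = list(existing_packs)   # step 1: snapshot current packs

      # step 2: one new pack with everything worth keeping: reachable
      # objects, recent objects, and objects reachable from recent ones
      new_pack = pack_objects()

      # step 3: drop the snapshot; anything worth keeping is in new_pack
      for pack in old_packs:
          existing_packs.remove(pack)
      existing_packs.append(new_pack)

  packs = ["pack-old1", "pack-old2"]
  repack(packs, lambda: "pack-new")      # packs is now ["pack-new"]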

So if there are regular packs and cruft packs, we'd have to know in step
3 which are which. We'd delete the regular ones, whose objects have all
been migrated to the new pack (either a "real" one or a cruft one), but
keep the crufty ones whose timestamps are still fresh.
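
Concretely, the adjusted step 3 might look something like this (again a
rough Python model; is_cruft() and mtime() are invented stand-ins, and
the cutoff would come from gc.pruneExpire):

  import time

  PRUNE_EXPIRE = 14 * 24 * 3600          # e.g. gc.pruneExpire=2.weeks.ago

  def step3(old_packs, is_cruft, mtime, now=None):
      now = now if now is not None else time.time()
      kept = []
      for pack in old_packs:
          if not is_cruft(pack):
              continue                   # regular pack: fully migrated, delete
          if now - mtime(pack) > PRUNE_EXPIRE:
              continue                   # cruft pack old enough to expire
          kept.append(pack)              # fresh cruft pack: keep untouched
      return kept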

That's a small change, and it works except for one thing: the
reachable-from-recent objects. You can't just delete a whole cruft
pack. Some of
its objects may be reachable from objects in other cruft packs that
we're keeping. In other words, you have cruft packs where you want to
keep half of the objects they contain. How do you do that?

I think you'd have to make pack-objects aware of the concept of cruft
packs, and that it should include reachable-from-recent objects in the
new pack only if they're in a cruft pack that is going to be deleted. So
those objects would be "rescued" from the cruft pack before it goes away
and migrated to the new cruft pack. That would effectively refresh their
timestamp, but that's fine. They're reachable from objects with that
fresh timestamp already, so effectively they couldn't be deleted until
that timestamp is hit.
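
As a model of that selection rule (same caveats as above; this is a
sketch, not actual pack-objects code):

  def select_rescues(reachable_from_recent, containing_pack,
                     cruft_packs_to_delete):
      # Migrate an unreachable-but-recent object into the new cruft pack
      # only if its current home is a cruft pack that is going away.
      rescued = []
      for obj in reachable_from_recent:
          if containing_pack(obj) in cruft_packs_to_delete:
              # this refreshes the object's timestamp, which is fine: a
              # recent object points at it already
              rescued.append(obj)
          # objects in cruft packs we're keeping stay where they are
      return rescued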

So I think it's do-able, but it is a little complicated.
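
As an aside, the auxiliary timestamp file I mentioned above could really
be that trivial: the pack .idx already gives each object a stable
position, so pack-1234abcd.times could be just one fixed-width timestamp
per object, in .idx order. A sketch in Python (this is not a format git
actually knows about):

  import struct

  def write_times(path, mtimes):
      # one big-endian uint32 mtime per object, in .idx object order
      with open(path, "wb") as f:
          for t in mtimes:
              f.write(struct.pack(">I", t))

  def read_time(path, nth_object):
      # fixed-width records make random access a single seek away
      with open(path, "rb") as f:
          f.seek(4 * nth_object)
          return struct.unpack(">I", f.read(4))[0]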

-Peff
