git@vger.kernel.org list mirror (unofficial, one of many)
From: Jeff King <peff@peff.net>
To: Duy Nguyen <pclouds@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Christian Couder" <christian.couder@gmail.com>
Subject: Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
Date: Sat, 13 Jan 2018 04:58:49 -0500
Message-ID: <20180113095849.GA30511@sigill.intra.peff.net>
In-Reply-To: <20180112142305.GA338@ash>

On Fri, Jan 12, 2018 at 09:23:05PM +0700, Duy Nguyen wrote:

> > > Why can't we generate a new cruft-pack on every gc run that
> > > detects too many unreachable objects? That would not be as
> > > efficient as a single cruft-pack but it should be way more
> > > efficient than the individual objects, no?
> > > 
> > > Plus, chances are that the existing cruft-packs are purged with
> > > the next gc run anyways.
> > 
> > Interesting idea. Here are some thoughts in random order.
> > 
> > That loses some delta opportunities between the cruft packs, but
> > that's certainly no worse than the all-loose storage we have today.
> 
> Does it also affect deltas when we copy some objects to the new
> repacked pack (e.g. some objects in the cruft pack getting referenced
> again)? I remember we do reuse deltas sometimes but not in very
> detail. I guess we probably won't suffer any suboptimal deltas ...

We always reuse deltas that are coming from one pack into another pack,
unless the base isn't present in the new pack. So we'd retain existing
deltas. The only thing you'd miss out on is that two versions of a file
in two separate cruft packs could not be delta'd against each other.
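
A minimal sketch of the reuse rule described above, as a toy model rather
than git's pack-objects code. StoredObject, delta_base and can_reuse_delta
are hypothetical names made up for illustration; the only rule modeled is
"keep an existing delta if its base also lands in the new pack".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class StoredObject:
    """One object as stored in a source pack (hypothetical model)."""
    oid: str
    delta_base: Optional[str] = None  # oid of the delta base, None if stored whole


def can_reuse_delta(obj: StoredObject, new_pack_oids: Set[str]) -> bool:
    """An existing delta is reusable only if its base also goes into the new pack."""
    return obj.delta_base is not None and obj.delta_base in new_pack_oids


# "child" is already stored as a delta against "base" in some source pack.
source_pack = [StoredObject("base"), StoredObject("child", delta_base="base")]

# Both objects are copied to the new pack, so the delta carries over as-is.
new_pack = {"base", "child"}
reused = [o.oid for o in source_pack if can_reuse_delta(o, new_pack)]

# If the base is left behind, the delta cannot be copied verbatim.
orphaned = can_reuse_delta(source_pack[1], {"child"})
```

In this model an object stored whole is never "reused as a delta"; whether
a fresh delta gets computed for it is a separate question that the sketch
does not cover.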

> > One nice aspect is that it means cruft objects don't incur any I/O
> > cost during a repack.
> 
> But cruft packs do incur object lookup cost since we still go through
> all packs linearly. The multi-pack index being discussed recently
> would help. But even without that, packs are sorted by mtime and old
> cruft packs won't affect as much I guess, as long as there aren't a
> zillion cruft packs around. Then even prepare_packed_git() is hit.

The cruft packs should behave pretty well with the mru list. We'd never
ask for an object in such a pack during normal operations, so they'd end
up at the end of the list (the big exception is abbreviation, which has
to look in every single pack).
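
To illustrate the point, here is a toy most-recently-used pack list. It is
a sketch under simplifying assumptions, not git's actual data structure:
Pack, PackStore, find and abbrev_candidates are hypothetical names, and
object ids are short strings.

```python
class Pack:
    def __init__(self, name, oids):
        self.name = name
        self.oids = set(oids)


class PackStore:
    """Toy most-recently-used pack list (illustrative only)."""

    def __init__(self, packs):
        self.mru = list(packs)
        self.probes = 0  # total number of packs examined by find()

    def find(self, oid):
        for i, pack in enumerate(self.mru):
            self.probes += 1
            if oid in pack.oids:
                # The pack that answered moves to the front of the list.
                self.mru.insert(0, self.mru.pop(i))
                return pack
        return None

    def abbrev_candidates(self, prefix):
        # Abbreviation has to consult every pack, cruft or not,
        # because any of them could hold a colliding prefix.
        return sorted(o for p in self.mru for o in p.oids if o.startswith(prefix))


store = PackStore([
    Pack("cruft-1", ["x1"]),
    Pack("cruft-2", ["x2"]),
    Pack("main", ["a", "b", "c"]),
])
store.find("a")  # walks past both cruft packs before hitting "main"
store.find("b")  # "main" is now first, so this takes a single probe
order = [p.name for p in store.mru]
```

After the first hit, lookups for reachable objects no longer touch the
cruft packs at all; only the prefix scan still pays for them.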

I'm not sure how many cruft packs you'd end up with in practice. If it's
one per auto-gc, then probably you're only generating one every few
days, and cleaning up old ones as you go.

I do still kind of favor having a single cruft pack, though, just
because it makes it simpler to reason about these sorts of things (but
then you need to mark individual object timestamps).

> > I'm not sure how the pruning process would work, especially with
> > respect to objects reachable from other unreachable-but-recent
> > objects. Right now the repack-and-delete procedure is done by
> > git-repack, and is basically:
> > 
> >   1. Get a list of all of the current packs.
> > 
> >   2. Ask pack-objects to pack everything into a new pack. Normally this
> >      is reachable objects, but we also include recent objects and
> >      objects reachable from recent objects. And of course with "-k" all
> >      objects are kept.
> > 
> >   3. Delete everything in the list from (1), under the assumption that
> >      anything worth keeping was repacked in step (2), and anything else
> >      is OK to drop.
> > 
> > So if there are regular packs and cruft packs, we'd have to know in
> > step 3 which are which. We'd delete the regular ones, whose objects
> > have all been migrated to the new pack (either a "real" one or a
> > cruft one), but keep the crufty ones whose timestamps are still
> > fresh.
> > 
> > That's a small change, and works except for one thing: the reachable
> > from recent objects. You can't just delete a whole cruft pack. Some
> > of its objects may be reachable from objects in other cruft packs
> > that we're keeping. In other words, you have cruft packs where you
> > want to keep half of the objects they contain. How do you do that?
> 
> Do we have to? Those reachable from recent objects must have ended up
> in the new pack created at step 2, correct? Which means we can safely
> remove the whole pack.

No, I think I just wrote (2) poorly. We repack the reachable objects,
but the recent ones (and things reachable only from them) are not
actually packed, but turned loose.

And of course in a cruft-packed world they'd end up in a cruft pack.
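
The corrected version of step (2) can be sketched as a three-way split.
This is a simplified model, not what git-repack actually runs: the graph
(edges), the timestamps (mtimes), the cutoff and the function names are
all assumptions made for illustration.

```python
def closure(starts, edges):
    """Return every object reachable from `starts` by following `edges`."""
    seen, stack = set(), list(starts)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(edges.get(oid, ()))
    return seen


def partition_for_repack(edges, mtimes, ref_tips, cutoff):
    """Split objects into (new pack, loose-or-cruft, dropped)."""
    reachable = closure(ref_tips, edges)
    # Unreachable objects that are still young enough to keep.
    recent = {o for o, t in mtimes.items() if o not in reachable and t >= cutoff}
    # Recent objects plus anything reachable only from them.
    kept_unreachable = closure(recent, edges) - reachable
    dropped = set(mtimes) - reachable - kept_unreachable
    return reachable, kept_unreachable, dropped


edges = {"C2": ["C1"], "X": ["Y"]}  # C2 -> C1 on a ref; X -> Y unreachable
mtimes = {"C1": 10, "C2": 20, "X": 95, "Y": 5, "Z": 5}
packed, loose_or_cruft, dropped = partition_for_repack(
    edges, mtimes, ref_tips=["C2"], cutoff=90
)
```

Here Y is old but survives because the recent X points at it, while Z has
nothing recent pointing at it and is dropped. Today the middle set is
turned loose; with cruft packs it would go into a cruft pack instead.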

> Those reachable from other cruft packs. I'm not sure if it's different
> from when these objects are loose. If a loose object A depends on B,
> but B is much older than A, then B may get pruned anyway while A stays
> (does not sound right if A gets reused).

Hopefully not, after d3038d22f9 (prune: keep objects reachable from
recent objects, 2014-10-15). :)
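
A small sketch of the difference, using the A and B from the question.
It models only the idea named in that commit's subject line, not the
actual prune code; the prune() function, its flag and the integer
timestamps are hypothetical.

```python
def prune(edges, mtimes, cutoff, keep_reachable_from_recent=True):
    """Return the set of unreachable objects that pruning would delete.

    `mtimes` covers only objects already known to be unreachable from refs.
    """
    keep = {oid for oid, t in mtimes.items() if t >= cutoff}
    if keep_reachable_from_recent:
        stack = list(keep)
        while stack:
            for child in edges.get(stack.pop(), ()):
                if child not in keep:
                    keep.add(child)
                    stack.append(child)
    return set(mtimes) - keep


edges = {"A": ["B"]}                 # loose object A depends on B
mtimes = {"A": 95, "B": 5, "C": 5}   # B and C are both older than the cutoff

age_only = prune(edges, mtimes, cutoff=90, keep_reachable_from_recent=False)
with_fix = prune(edges, mtimes, cutoff=90)
```

With age as the only criterion, B is deleted out from under A. With the
reachability walk, B stays as long as A does, and only C goes away.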

-Peff
