git@vger.kernel.org mailing list mirror (one of many)
* git gc --auto yelling at users where a repo legitimately has >6700 loose objects
@ 2018-01-11 21:33 Ævar Arnfjörð Bjarmason
  2018-01-12 12:07 ` Duy Nguyen
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-01-11 21:33 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Christian Couder

I recently dropped our gc.auto=0 setting and the nightly aggressive
repack script on our big monorepo across our infra, relying instead on
git gc --auto in the background to just do its thing.

I didn't want users to wait for git-gc, and I'd written this nightly
cronjob before git-gc learned to detach to the background.

But now I have git-gc on some servers yelling at users on every pull
command:

    warning: There are too many unreachable loose objects; run 'git prune' to remove them.

The reason is that I have all the values at git's default settings, and
there legitimately are >~6700 loose objects that were created in the
last 2 weeks.
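
For anyone wanting to check their own numbers, the count and the knobs
involved can be inspected like this (the defaults in the comments are
git's shipped values; git config prints nothing for an unset key):

    git count-objects -v | grep '^count:'   # loose objects; gc --auto
                                            # trips once this exceeds
                                            # gc.auto (by estimate)
    git config gc.auto                      # default: 6700
    git config gc.pruneExpire               # default: 2.weeks.ago
    git config gc.logExpiry                 # default: 1.day.ago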

For those rusty on git-gc's defaults, this is what it looks like in this
scenario:

 1. User runs "git pull"
 2. git gc --auto is called, there are >6700 loose objects
 3. it forks into the background, tries to prune and repack, objects
    older than gc.pruneExpire (2.weeks.ago) are pruned.
 4. At the end of all this, we check *again* if we have >6700 objects,
    if we do we print "run 'git prune'" to .git/gc.log, and will just
    emit that error for the next day before trying again, at which point
    we unlink the gc.log and retry, see gc.logExpiry.

Right now I've just worked around this by setting gc.pruneExpire to a
lower value (4.days.ago). But there's a larger issue to be addressed
here, and I'm not sure how.
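
I.e.:

    git config gc.pruneExpire 4.days.ago   # down from the 2.weeks.ago default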

When the warning was added in [1] it didn't know to detach to the
background yet, that came in [2], shortly after came gc.log in [3].

We could add another gc.auto-like limit, which could be set at some
higher value than gc.auto. "Hey if I have more than 6700 loose objects,
prune the >2wks old ones, but if at the end there's still >6700 I don't
want to hear about it unless there's >6700*N".
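
Sketched in shell, with a hypothetical gc.warnFactor knob standing in
for that N (no such knob exists today):

    loose=$(git count-objects | awk '{print $1}')
    auto=$(git config gc.auto || echo 6700)
    factor=$(git config gc.warnFactor || echo 2)   # hypothetical knob
    if [ "$loose" -gt $((auto * factor)) ]
    then
        echo "warning: there are too many unreachable loose objects" >&2
    fi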

I thought I'd just add that, but the details of how to pass that message
around get nasty. With that solution we *also* don't want git gc to
start churning in the background once we reach >6700 objects, so we need
something like gc.logExpiry which defers the gc until the next day. We
might need to create .git/gc-waitabit.marker, ew.

More generally, these hard limits seem contrary to what the user cares
about. E.g. I suspect that most of these loose objects come from
branches since deleted in upstream, whose objects could have a different
retention policy.

Or we could say "I want 2 weeks of objects, but if that runs against the
6700 limit just keep the latest 6700/2".

1. a087cc9819 ("git-gc --auto: protect ourselves from accumulated
   cruft", 2007-09-17)
2. 9f673f9477 ("gc: config option for running --auto in background",
   2014-02-08)
3. 329e6e8794 ("gc: save log from daemonized gc --auto and print it next
   time", 2015-09-19)


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-11 21:33 git gc --auto yelling at users where a repo legitimately has >6700 loose objects Ævar Arnfjörð Bjarmason
@ 2018-01-12 12:07 ` Duy Nguyen
  2018-01-12 13:41   ` Duy Nguyen
  2018-01-12 14:44   ` Ævar Arnfjörð Bjarmason
  2018-01-12 13:46 ` Jeff King
  2018-02-08 16:23 ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 9+ messages in thread
From: Duy Nguyen @ 2018-01-12 12:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, Christian Couder

On Fri, Jan 12, 2018 at 4:33 AM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> For those rusty on git-gc's defaults, this is what it looks like in this
> scenario:
>
>  1. User runs "git pull"
>  2. git gc --auto is called, there are >6700 loose objects
>  3. it forks into the background, tries to prune and repack, objects
>     older than gc.pruneExpire (2.weeks.ago) are pruned.
>  4. At the end of all this, we check *again* if we have >6700 objects,
>     if we do we print "run 'git prune'" to .git/gc.log, and will just
>     emit that error for the next day before trying again, at which point
>     we unlink the gc.log and retry, see gc.logExpiry.
>
> Right now I've just worked around this by setting gc.pruneExpire to a
> lower value (4.days.ago). But there's a larger issue to be addressed
> here, and I'm not sure how.
>
> When the warning was added in [1] it didn't know to detach to the
> background yet, that came in [2], shortly after came gc.log in [3].
>
> We could add another gc.auto-like limit, which could be set at some
> higher value than gc.auto. "Hey if I have more than 6700 loose objects,
> prune the >2wks old ones, but if at the end there's still >6700 I don't
> want to hear about it unless there's >6700*N".

Yes it's about time we make too_many_loose_objects() more accurate and
complain less, especially when the complaint is useless.

> I thought I'd just add that, but the details of how to pass that message
> around get nasty. With that solution we *also* don't want git gc to
> start churning in the background once we reach >6700 objects, so we need
> something like gc.logExpiry which defers the gc until the next day. We
> might need to create .git/gc-waitabit.marker, ew.

Hmm.. could we save the info from the last run to help the next one?
If the last gc --auto (which does try to remove some loose objects)
leaves 6700 objects still loose, then it's "clear" that the next run
may also leave those loose. If we save that number somewhere (gc.log
too?) too_many_loose_objects() can read back and subtract it from the
estimation and may decide not to do gc at all since the number of
loose-and-prunable objects is below threshold.

The problem is of course that these 6700 will gradually become
prunable over time. We can't just subtract the same constant forever.
Perhaps we can do something based on gc.pruneExpire?

Say gc.pruneExpire specifies that objects are kept for two weeks; we
assume these objects' creation times are spread evenly over the 14
days. So after one day, 6700/14 objects should be prunable and counted
in the too_many_loose_objects() estimation. A gc --auto run two weeks
after the first one would count all loose objects as prunable again.
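
In shell arithmetic, with all of the numbers assumed for illustration:

    leftover=6700   # loose objects the previous gc --auto could not prune
    days_since=1    # days since that run
    window=14       # the gc.pruneExpire window, in days
    echo $(( leftover * days_since / window ))   # ~478 newly prunable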

> More generally, these hard limits seem contrary to what the user cares
> about. E.g. I suspect that most of these loose objects come from
> branches since deleted in upstream, whose objects could have a different
> retention policy.

Er.. what retention policy? I think gc.pruneExpire is the only thing
that can keep loose objects around?

BTW

> But now I have git-gc on some servers yelling at users on every pull
> command:
>
>    warning: There are too many unreachable loose objects; run 'git prune' to remove them.

Why do we yell at the users when some maintenance thing is supposed to
be done on the server side? If this is the case, should gc have some
way to yell at the admin instead?
-- 
Duy


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-12 12:07 ` Duy Nguyen
@ 2018-01-12 13:41   ` Duy Nguyen
  2018-01-12 14:44   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 9+ messages in thread
From: Duy Nguyen @ 2018-01-12 13:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, Christian Couder

On Fri, Jan 12, 2018 at 7:07 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> More generally, these hard limits seem contrary to what the user cares
>> about. E.g. I suspect that most of these loose objects come from
>> branches since deleted in upstream, whose objects could have a different
>> retention policy.
>
> Er.. what retention policy? I think gc.pruneExpire is the only thing
> that can keep loose objects around?

Er... I think I know what you meant now. Loose objects can come from
three sources: worktree (git-hash-object and friends),
git-unpack-objects and unreachable objects in packs released back by
git-repack.

The last one could be a result of a branch deletion like you said.
Depending on the branch size, you could release back a large number of
objects in loose form at the same time. This really skews my "create
time distributed equally" model, and the new estimation in
too_many_loose_objects() probably won't help you much either. If only
we had a way to count all these objects as "one"... but putting these
back in a pack hurts obj lookup performance...
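
For illustration, one way each source shows up (the branch and pack
names are invented):

    echo data | git hash-object -w --stdin   # 1. plumbing writes a loose
                                             #    object directly
    git unpack-objects < small-fetch.pack    # 2. small transfers can get
                                             #    exploded into loose objects
    git branch -D topic && git repack -A -d  # 3. repack -A ejects newly
                                             #    unreachable objects as loose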
-- 
Duy


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-11 21:33 git gc --auto yelling at users where a repo legitimately has >6700 loose objects Ævar Arnfjörð Bjarmason
  2018-01-12 12:07 ` Duy Nguyen
@ 2018-01-12 13:46 ` Jeff King
  2018-01-12 14:23   ` Duy Nguyen
  2018-02-08 16:23 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2018-01-12 13:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Christian Couder

On Thu, Jan 11, 2018 at 10:33:15PM +0100, Ævar Arnfjörð Bjarmason wrote:

>  4. At the end of all this, we check *again* if we have >6700 objects,
>     if we do we print "run 'git prune'" to .git/gc.log, and will just
>     emit that error for the next day before trying again, at which point
>     we unlink the gc.log and retry, see gc.logExpiry.
> 
> Right now I've just worked around this by setting gc.pruneExpire to a
> lower value (4.days.ago). But there's a larger issue to be addressed
> here, and I'm not sure how.

IMHO the right solution is to stop exploding loose objects, and instead
write them all into a "cruft" pack. That's more efficient, to boot
(since it doesn't waste inodes, and may even retain deltas between cruft
objects).

But there are some tricks around timestamps. I wrote up some thoughts
in:

  https://public-inbox.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/

and downthread from there.
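
A crude manual approximation of the idea, ignoring those timestamp
subtleties and assuming a quiet repo (the pack-cruft base name is just
an invented convention):

    git repack -a -d                   # everything reachable into one pack
    git fsck --unreachable --no-reflogs |
      awk '/^unreachable/ { print $3 }' |
      git pack-objects .git/objects/pack/pack-cruft   # pack the leftovers
    git prune-packed                   # drop loose copies that are now packed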

-Peff


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-12 13:46 ` Jeff King
@ 2018-01-12 14:23   ` Duy Nguyen
  2018-01-13  9:58     ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Duy Nguyen @ 2018-01-12 14:23 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Christian Couder

On Fri, Jan 12, 2018 at 08:46:09AM -0500, Jeff King wrote:
> On Thu, Jan 11, 2018 at 10:33:15PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> >  4. At the end of all this, we check *again* if we have >6700 objects,
> >     if we do we print "run 'git prune'" to .git/gc.log, and will just
> >     emit that error for the next day before trying again, at which point
> >     we unlink the gc.log and retry, see gc.logExpiry.
> > 
> > Right now I've just worked around this by setting gc.pruneExpire to a
> > lower value (4.days.ago). But there's a larger issue to be addressed
> > here, and I'm not sure how.
> 
> IMHO the right solution is to stop exploding loose objects, and instead
> write them all into a "cruft" pack. That's more efficient, to boot
> (since it doesn't waste inodes, and may even retain deltas between cruft
> objects).
> 
> But there are some tricks around timestamps. I wrote up some thoughts
> in:
> 
>   https://public-inbox.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
> 
> and downthread from there.

My thoughts were moving towards that "multiple cruft packs" idea in
your last email of that thread [1]. I'll quote it here so people don't
have to open the link:

> > Why can't we generate a new cruft-pack on every gc run that
> > detects too many unreachable objects? That would not be as
> > efficient as a single cruft-pack but it should be way more
> > efficient than the individual objects, no?
> > 
> > Plus, chances are that the existing cruft-packs are purged with
> > the next gc run anyways.
> 
> Interesting idea. Here are some thoughts in random order.
> 
> That loses some delta opportunities between the cruft packs, but
> that's certainly no worse than the all-loose storage we have today.

Does it also affect deltas when we copy some objects to the new
repacked pack (e.g. some objects in the cruft pack getting referenced
again)? I remember we reuse deltas sometimes, but I don't recall the
details. I guess we probably won't suffer any suboptimal deltas ...

> 
> One nice aspect is that it means cruft objects don't incur any I/O
> cost during a repack.

But cruft packs do incur object lookup cost, since we still go through
all packs linearly. The multi-pack index being discussed recently would
help. But even without that, packs are sorted by mtime, so old cruft
packs shouldn't hurt as much, I guess, as long as there aren't a
zillion cruft packs around; at that point even prepare_packed_git()
takes a hit.

> It doesn't really solve the "too many loose objects after gc"
> problem.  It just punts it to "too many packs after gc". This is
> likely to be better because the number of packs would scale with the
> number of gc runs, rather than the number of crufty objects. But
> there would still be corner cases if you run gc frequently. Maybe
> that would be acceptable.
> 
> I'm not sure how the pruning process would work, especially with
> respect to objects reachable from other unreachable-but-recent
> objects. Right now the repack-and-delete procedure is done by
> git-repack, and is basically:
> 
>   1. Get a list of all of the current packs.
> 
>   2. Ask pack-objects to pack everything into a new pack. Normally this
>      is reachable objects, but we also include recent objects and
>      objects reachable from recent objects. And of course with "-k" all
>      objects are kept.
> 
>   3. Delete everything in the list from (1), under the assumption that
>      anything worth keeping was repacked in step (2), and anything else
>      is OK to drop.
> 
> So if there are regular packs and cruft packs, we'd have to know in
> step 3 which are which. We'd delete the regular ones, whose objects
> have all been migrated to the new pack (either a "real" one or a
> cruft one), but keep the crufty ones whose timestamps are still
> fresh.
> 
> That's a small change, and works except for one thing: the reachable
> from recent objects. You can't just delete a whole cruft pack. Some
> of its objects may be reachable from objects in other cruft packs
> that we're keeping. In other words, you have cruft packs where you
> want to keep half of the objects they contain. How do you do that?

Do we have to? Those reachable from recent objects must have ended up
in the new pack created at step 2, correct? Which means we can safely
remove the whole pack.

As for those reachable from other cruft packs, I'm not sure it's
different from when these objects are loose. If a loose object A
depends on B, but B is much older than A, then B may get pruned anyway
while A stays (does not sound right if A gets reused).

> I think you'd have to make pack-objects aware of the concept of
> cruft packs, and that it should include reachable-from-recent
> objects in the new pack only if they're in a cruft pack that is
> going to be deleted. So those objects would be "rescued" from the
> cruft pack before it goes away and migrated to the new cruft
> pack. That would effectively refresh their timestamp, but that's
> fine. They're reachable from objects with that fresh timestamp
> already, so effectively they couldn't be deleted until that
> timestamp is hit.
> 
> So I think it's do-able, but it is a little complicated.

[1] https://public-inbox.org/git/20170620140837.fq3wxb63lnqay6xz@sigill.intra.peff.net/
--
Duy


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-12 12:07 ` Duy Nguyen
  2018-01-12 13:41   ` Duy Nguyen
@ 2018-01-12 14:44   ` Ævar Arnfjörð Bjarmason
  2018-01-13 10:07     ` Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-01-12 14:44 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Junio C Hamano, Christian Couder


On Fri, Jan 12 2018, Duy Nguyen jotted:

> On Fri, Jan 12, 2018 at 4:33 AM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> For those rusty on git-gc's defaults, this is what it looks like in this
>> scenario:
>>
>>  1. User runs "git pull"
>>  2. git gc --auto is called, there are >6700 loose objects
>>  3. it forks into the background, tries to prune and repack, objects
>>     older than gc.pruneExpire (2.weeks.ago) are pruned.
>>  4. At the end of all this, we check *again* if we have >6700 objects,
>>     if we do we print "run 'git prune'" to .git/gc.log, and will just
>>     emit that error for the next day before trying again, at which point
>>     we unlink the gc.log and retry, see gc.logExpiry.
>>
>> Right now I've just worked around this by setting gc.pruneExpire to a
>> lower value (4.days.ago). But there's a larger issue to be addressed
>> here, and I'm not sure how.
>>
>> When the warning was added in [1] it didn't know to detach to the
>> background yet, that came in [2], shortly after came gc.log in [3].
>>
>> We could add another gc.auto-like limit, which could be set at some
>> higher value than gc.auto. "Hey if I have more than 6700 loose objects,
>> prune the >2wks old ones, but if at the end there's still >6700 I don't
>> want to hear about it unless there's >6700*N".
>
> Yes it's about time we make too_many_loose_objects() more accurate and
> complain less, especially when the complaint is useless.
>
>> I thought I'd just add that, but the details of how to pass that message
>> around get nasty. With that solution we *also* don't want git gc to
>> start churning in the background once we reach >6700 objects, so we need
>> something like gc.logExpiry which defers the gc until the next day. We
>> might need to create .git/gc-waitabit.marker, ew.
>
> Hmm.. could we save the info from the last run to help the next one?
> If the last gc --auto (which does try to remove some loose objects)
> leaves 6700 objects still loose, then it's "clear" that the next run
> may also leave those loose. If we save that number somewhere (gc.log
> too?) too_many_loose_objects() can read back and subtract it from the
> estimation and may decide not to do gc at all since the number of
> loose-and-prunable objects is below threshold.
>
> The problem is of course that these 6700 will gradually become
> prunable over time. We can't just subtract the same constant forever.
> Perhaps we can do something based on gc.pruneExpire?
>
> Say gc.pruneExpire specifies that objects are kept for two weeks; we
> assume these objects' creation times are spread evenly over the 14
> days. So after one day, 6700/14 objects should be prunable and counted
> in the too_many_loose_objects() estimation. A gc --auto run two weeks
> after the first one would count all loose objects as prunable again.
>
>> More generally, these hard limits seem contrary to what the user cares
>> about. E.g. I suspect that most of these loose objects come from
>> branches since deleted in upstream, whose objects could have a different
>> retention policy.
>
> Er.. what retention policy? I think gc.pruneExpire is the only thing
> that can keep loose objects around?

You answered this yourself in
CACsJy8CUYosOGK5tn0C=t=SkbS-fyaSxp536zx+9jh_O+WNaEQ@mail.gmail.com;
yeah, I mean loose objects from branch deletions.

More generally, the reason we even have the 2 week limit is to pick a
good trade-off between performance and not losing someone's work that
they e.g. "git add"-ed but never committed.

I'm suggesting (but don't know if this is worth it, especially given
Jeff's comments) that one smarter approach might be to track where the
objects came from (e.g. by keeping reflogs for deleted upstream branches
for $expiry_time).

Then we could immediately delete loose objects we got from upstream
branches (or delete them more aggressively), while treating objects that
were originally created in the local repository differently.

>> But now I have git-gc on some servers yelling at users on every pull
>> command:
>>
>>    warning: There are too many unreachable loose objects; run 'git prune' to remove them.
>
> Why do we yell at the users when some maintenance thing is supposed to
> be done on the server side? If this is the case, should gc have some
> way to yell at the admin instead?

Sorry I didn't clarify this: it's a shared server (rollout system with
staged checkouts) that users log into and stage/test a rollout from the
git repo, so not the git server.

Because it's a shared repo there's a lot more loose object churn,
mostly due to pulling more often (and thus more branches that later get
deleted), but also from rebasing and whatnot in the rollout repo.


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-12 14:23   ` Duy Nguyen
@ 2018-01-13  9:58     ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-01-13  9:58 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Christian Couder

On Fri, Jan 12, 2018 at 09:23:05PM +0700, Duy Nguyen wrote:

> > > Why can't we generate a new cruft-pack on every gc run that
> > > detects too many unreachable objects? That would not be as
> > > efficient as a single cruft-pack but it should be way more
> > > efficient than the individual objects, no?
> > > 
> > > Plus, chances are that the existing cruft-packs are purged with
> > > the next gc run anyways.
> > 
> > Interesting idea. Here are some thoughts in random order.
> > 
> > That loses some delta opportunities between the cruft packs, but
> > that's certainly no worse than the all-loose storage we have today.
> 
> Does it also affect deltas when we copy some objects to the new
> repacked pack (e.g. some objects in the cruft pack getting referenced
> again)? I remember we reuse deltas sometimes, but I don't recall the
> details. I guess we probably won't suffer any suboptimal deltas ...

We always reuse deltas that are coming from one pack into another pack,
unless the base isn't present in the new pack. So we'd retain existing
deltas. What you'd miss out on is just that two versions of a file in
two separate cruft packs could not be delta'd together.

> > One nice aspect is that it means cruft objects don't incur any I/O
> > cost during a repack.
> 
> But cruft packs do incur object lookup cost, since we still go
> through all packs linearly. The multi-pack index being discussed
> recently would help. But even without that, packs are sorted by mtime,
> so old cruft packs shouldn't hurt as much, I guess, as long as there
> aren't a zillion cruft packs around; at that point even
> prepare_packed_git() takes a hit.

The cruft packs should behave pretty well with the mru list. We'd never
ask for an object in such a pack during normal operations, so they'd end
up at the end of the list (the big exception is abbreviation, which has
to look in every single pack).

I'm not sure how many cruft packs you'd end up with in practice. If it's
one per auto-gc, then probably you're only generating one every few
days, and cleaning up old ones as you go.

I do still kind of favor having a single cruft pack, though, just
because it makes it simpler to reason about these sorts of things (but
then you need to mark individual object timestamps).

> > I'm not sure how the pruning process would work, especially with
> > respect to objects reachable from other unreachable-but-recent
> > objects. Right now the repack-and-delete procedure is done by
> > git-repack, and is basically:
> > 
> >   1. Get a list of all of the current packs.
> > 
> >   2. Ask pack-objects to pack everything into a new pack. Normally this
> >      is reachable objects, but we also include recent objects and
> >      objects reachable from recent objects. And of course with "-k" all
> >      objects are kept.
> > 
> >   3. Delete everything in the list from (1), under the assumption that
> >      anything worth keeping was repacked in step (2), and anything else
> >      is OK to drop.
> > 
> > So if there are regular packs and cruft packs, we'd have to know in
> > step 3 which are which. We'd delete the regular ones, whose objects
> > have all been migrated to the new pack (either a "real" one or a
> > cruft one), but keep the crufty ones whose timestamps are still
> > fresh.
> > 
> > That's a small change, and works except for one thing: the reachable
> > from recent objects. You can't just delete a whole cruft pack. Some
> > of its objects may be reachable from objects in other cruft packs
> > that we're keeping. In other words, you have cruft packs where you
> > want to keep half of the objects they contain. How do you do that?
> 
> Do we have to? Those reachable from recent objects must have ended up
> in the new pack created at step 2, correct? Which means we can safely
> remove the whole pack.

No, I think I just wrote (2) poorly. We repack the reachable objects,
but the recent ones (and things reachable only from them) are not
actually packed, but turned loose.

And of course in a cruft-packed world they'd end up in a cruft pack.

> As for those reachable from other cruft packs, I'm not sure it's
> different from when these objects are loose. If a loose object A
> depends on B, but B is much older than A, then B may get pruned anyway
> while A stays (does not sound right if A gets reused).

Hopefully not, after d3038d22f9 (prune: keep objects reachable from
recent objects, 2014-10-15). :)

-Peff


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-12 14:44   ` Ævar Arnfjörð Bjarmason
@ 2018-01-13 10:07     ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2018-01-13 10:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, Git Mailing List, Junio C Hamano, Christian Couder

On Fri, Jan 12, 2018 at 03:44:26PM +0100, Ævar Arnfjörð Bjarmason wrote:

> More generally, the reason we even have the 2 week limit is to pick a
> good trade-off between performance and not losing someone's work that
> they e.g. "git add"-ed but never committed.
> 
> I'm suggesting (but don't know if this is worth it, especially given
> Jeff's comments) that one smarter approach might be to track where the
> objects came from (e.g. by keeping reflogs for deleted upstream branches
> for $expiry_time).

I don't think reflogs would help here. We consider reflog'd objects as
"reachable". So you'd still have something like this:

  1. You delete a branch. Reflog still mentions its commits.

  2. You run gc (or auto-gc). Those objects are still retained in the
     main pack due to the reachability rule. This may happen multiple
     times, and each time their "timestamp" is updated, because it is
     really just the timestamp of the containing pack.

  3. 30 days later, you run another gc. The reflog is now past its
     expiration and is deleted, and now those objects are unreachable.
     This gc turns them loose, but it still considers them "recent" as
     of the last gc you ran, due to the timestamp thing above.
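
The expiry knobs driving step 3, for reference (the values in the
comments are git's defaults):

    git config gc.reflogExpire              # 90.days.ago, reachable entries
    git config gc.reflogExpireUnreachable   # 30.days.ago, entries no longer
                                            # reachable from the branch tip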

-Peff


* Re: git gc --auto yelling at users where a repo legitimately has >6700 loose objects
  2018-01-11 21:33 git gc --auto yelling at users where a repo legitimately has >6700 loose objects Ævar Arnfjörð Bjarmason
  2018-01-12 12:07 ` Duy Nguyen
  2018-01-12 13:46 ` Jeff King
@ 2018-02-08 16:23 ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-02-08 16:23 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Christian Couder


On Thu, Jan 11 2018, Ævar Arnfjörð Bjarmason jotted:

> I recently dropped our gc.auto=0 setting and the nightly aggressive
> repack script on our big monorepo across our infra, relying instead on
> git gc --auto in the background to just do its thing.
>
> I didn't want users to wait for git-gc, and I'd written this nightly
> cronjob before git-gc learned to detach to the background.
>
> But now I have git-gc on some servers yelling at users on every pull
> command:
>
>     warning: There are too many unreachable loose objects; run 'git prune' to remove them.
>
> The reason is that I have all the values at git's default settings, and
> there legitimately are >~6700 loose objects that were created in the
> last 2 weeks.
>
> For those rusty on git-gc's defaults, this is what it looks like in this
> scenario:
>
>  1. User runs "git pull"
>  2. git gc --auto is called, there are >6700 loose objects
>  3. it forks into the background, tries to prune and repack, objects
>     older than gc.pruneExpire (2.weeks.ago) are pruned.
>  4. At the end of all this, we check *again* if we have >6700 objects,
>     if we do we print "run 'git prune'" to .git/gc.log, and will just
>     emit that error for the next day before trying again, at which point
>     we unlink the gc.log and retry, see gc.logExpiry.
>
> Right now I've just worked around this by setting gc.pruneExpire to a
> lower value (4.days.ago). But there's a larger issue to be addressed
> here, and I'm not sure how.
>
> When the warning was added in [1] it didn't know to detach to the
> background yet, that came in [2], shortly after came gc.log in [3].
>
> We could add another gc.auto-like limit, which could be set at some
> higher value than gc.auto. "Hey if I have more than 6700 loose objects,
> prune the >2wks old ones, but if at the end there's still >6700 I don't
> want to hear about it unless there's >6700*N".
>
> I thought I'd just add that, but the details of how to pass that message
> around get nasty. With that solution we *also* don't want git gc to
> start churning in the background once we reach >6700 objects, so we need
> something like gc.logExpiry which defers the gc until the next day. We
> might need to create .git/gc-waitabit.marker, ew.
>
> More generally, these hard limits seem contrary to what the user cares
> about. E.g. I suspect that most of these loose objects come from
> branches since deleted in upstream, whose objects could have a different
> retention policy.
>
> Or we could say "I want 2 weeks of objects, but if that runs against the
> 6700 limit just keep the latest 6700/2".
>
> 1. a087cc9819 ("git-gc --auto: protect ourselves from accumulated
>    cruft", 2007-09-17)
> 2. 9f673f9477 ("gc: config option for running --auto in background",
>    2014-02-08)
> 3. 329e6e8794 ("gc: save log from daemonized gc --auto and print it next
>    time", 2015-09-19)

My just-sent "How to produce a loose ref+size explosion via pruning +
git-gc", <87fu6bmr0j.fsf@evledraar.gmail.com>
(https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/),
shows an easy way to reproduce this.

After the steps outlined there git-gc --auto will end up in a state
where it'll start telling the user off for having too many loose
objects.
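
For the impatient, a stand-in recipe along those lines (an assumed
reconstruction, not the exact steps from the linked message; slow, but
simple):

    git init repro && cd repro
    i=0
    while [ $i -lt 20000 ]   # well past the gc.auto threshold of 6700
    do
        echo $i | git hash-object -w --stdin >/dev/null
        i=$((i + 1))
    done
    git gc --auto   # detaches; the objects are recent, so none are pruned
    # once the background gc finishes, later invocations replay its
    # complaint from .git/gc.log until gc.logExpiry lapses:
    git gc --auto   # warning: There are too many unreachable loose objects...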
