[PATCH] gc: do not warn about too many loose objects

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] gc: do not warn about too many loose objects
@ 2018-07-16 17:27 Jonathan Tan
  2018-07-16 17:51 ` Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Jonathan Tan @ 2018-07-16 17:27 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
cruft", 2007-09-17), the user was warned if there were too many
unreachable loose objects. This made sense at the time, because gc
couldn't prune them safely. But subsequently, git prune learned the
ability to not prune recently created loose objects, making pruning able
to be done more safely, and gc was made to automatically prune old
unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
2.weeks.ago" by default", 2008-03-12).

This makes the warning unactionable by the user, as any loose objects
left are not deleted yet because of safety, and "git prune" is not a
command that the user is recommended to run directly anyway.

Therefore, remove this warning.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This was noticed when a daemonized gc run wrote this warning to the log
file, and returned 0; but a subsequent run merely read the log file, saw
that it is non-empty and returned -1 (which is inconsistent in that such
a run should return 0, as it did the first time).
---
 builtin/gc.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index ccfb1ceaeb..fc3b553651 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -641,10 +641,6 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();

-	if (auto_gc && too_many_loose_objects())
-		warning(_("There are too many unreachable loose objects; "
-			"run 'git prune' to remove them."));
-
 	if (!daemonized)
 		unlink(git_path("gc.log"));

-- 
2.18.0.203.gfac676dfb9-goog

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
@ 2018-07-16 17:51 ` Jeff King
  2018-07-16 18:22   ` Jonathan Nieder
  2018-07-16 19:15 ` Elijah Newren
  2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
  2 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 17:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Mon, Jul 16, 2018 at 10:27:17AM -0700, Jonathan Tan wrote:

> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
> cruft", 2007-09-17), the user was warned if there were too many
> unreachable loose objects. This made sense at the time, because gc
> couldn't prune them safely. But subsequently, git prune learned the
> ability to not prune recently created loose objects, making pruning able
> to be done more safely, and gc was made to automatically prune old
> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
> 2.weeks.ago" by default", 2008-03-12).
> 
> This makes the warning unactionable by the user, as any loose objects
> left are not deleted yet because of safety, and "git prune" is not a
> command that the user is recommended to run directly anyway.

I'm not sure if this tells the whole story. You may still run into a
case where auto-gc runs over and over during the grace period, because
you have a bunch of objects which are not reachable and not yet ready to
be expired. So a gc cannot make forward progress until the 2-week
expiration, and the way to break the cycle is to run an immediate
"prune".

So while I completely agree that it's not a great thing to encourage
users to blindly run "git prune", I think it _is_ actionable.

But...

> This was noticed when a daemonized gc run wrote this warning to the log
> file, and returned 0; but a subsequent run merely read the log file, saw
> that it is non-empty and returned -1 (which is inconsistent in that such
> a run should return 0, as it did the first time).

Yes, this got much worse with daemonized gc. The warning blocks further
runs. And then even after the 2-week period, you still cannot make
further progress until the user steps in! I think we dealt with that in
a831c06a2b (gc: ignore old gc.log files, 2017-02-10). So now we won't
run gc for a day, but eventually we may make further progress.

So the warning _is_ still doing something useful there (it's blocking
immediate auto-gc and kicking in the 1-day threshold).

I agree that the "-1" return is a little funny. Perhaps on the reading
side we should detect that the log contains only a "warning" line and
adjust our exit code accordingly.

Ultimately, I think Git should avoid keeping unreachable objects as
loose in the first place, which would make this whole problem go away.
There's some discussion in this thread:

  https://public-inbox.org/git/87inc89j38.fsf@evledraar.gmail.com/

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 17:51 ` Jeff King
@ 2018-07-16 18:22   ` Jonathan Nieder
  2018-07-16 18:52     ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 18:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Hi,

Jeff King wrote:
> On Mon, Jul 16, 2018 at 10:27:17AM -0700, Jonathan Tan wrote:

>> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
>> cruft", 2007-09-17), the user was warned if there were too many
>> unreachable loose objects. This made sense at the time, because gc
>> couldn't prune them safely. But subsequently, git prune learned the
>> ability to not prune recently created loose objects, making pruning able
>> to be done more safely, and gc was made to automatically prune old
>> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
>> 2.weeks.ago" by default", 2008-03-12).
>>
>> This makes the warning unactionable by the user, as any loose objects
>> left are not deleted yet because of safety, and "git prune" is not a
>> command that the user is recommended to run directly anyway.
>
> I'm not sure if this tells the whole story. You may still run into a
> case where auto-gc runs over and over during the grace period, because
> you have a bunch of objects which are not reachable and not yet ready to
> be expired. So a gc cannot make forward progress until the 2-week
> expiration, and the way to break the cycle is to run an immediate
> "prune".
>
> So while I completely agree that it's not a great thing to encourage
> users to blindly run "git prune", I think it _is_ actionable.

To flesh this out a little more: what user action do you suggest?  Could
we carry out that action automatically?

I mentally make a distinction between this warning from "git gc
--auto" and the warning from commands like "git gui".  The former was
saying "Please run 'git prune', because I'm not going to do it", and
as Jonathan noticed, that's not true any more.  The latter says

	This repository currently has approximately %i loose objects.

	To maintain optimal performance it is strongly recommended
	that you compress the database.

	Compress the database now?

which is relevant to the current operation (slowly reading the
repository) and easy to act on (choose "yes").

[...]
> I agree that the "-1" return is a little funny. Perhaps on the reading
> side we should detect that the log contains only a "warning" line and
> adjust our exit code accordingly.

Yes, this is a real problem, and it would affect any other warning
(or even GIT_TRACE=1 output) produced by gc --auto as well.  I think we
should consider it a serious bug, separate from the symptom Jonathan is
fixing.

Unfortunately I don't have a great idea about how to fix it.  Should
we avoid writing warnings to gc.log in daemon mode?  That would
prevent the user from ever seeing the warnings, but because of the
nature of a warning, that might be reasonable.  Should we put warnings
in a separate log file with different semantics?  Should we change the
format of gc.log to allow more structured information (include 'gc'
exit code) and perform a two-stage migration to the new format (first
learn to read the new format, then switch to writing it)?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 18:22   ` Jonathan Nieder
@ 2018-07-16 18:52     ` Jeff King
  2018-07-16 19:09       ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 18:52 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 11:22:07AM -0700, Jonathan Nieder wrote:

> > I'm not sure if this tells the whole story. You may still run into a
> > case where auto-gc runs over and over during the grace period, because
> > you have a bunch of objects which are not reachable and not yet ready to
> > be expired. So a gc cannot make forward progress until the 2-week
> > expiration, and the way to break the cycle is to run an immediate
> > "prune".
> >
> > So while I completely agree that it's not a great thing to encourage
> > users to blindly run "git prune", I think it _is_ actionable.
> 
> To flesh this out a little more: what user action do you suggest?  Could
> we carry out that action automatically?

Er, the action is to run "git prune", like the warning says. :)

And no, I do not think we should run it automatically. It deletes
objects without respect to the grace period, which may not be what the
user wants. (And yes, the existing warning is terrible because it does
not at all make clear that following its advice may be dangerous).

> I mentally make a distinction between this warning from "git gc
> --auto" and the warning from commands like "git gui".  The former was
> saying "Please run 'git prune', because I'm not going to do it", and
> as Jonathan noticed, that's not true any more.  The latter says
> 
> 	This repository currently has approximately %i loose objects.
> 
> 	To maintain optimal performance it is strongly recommended
> 	that you compress the database.
> 
> 	Compress the database now?
> 
> which is relevant to the current operation (slowly reading the
> repository) and easy to act on (choose "yes").

I don't think those warnings are the same. The warning from "git gui"
is: you may benefit from running gc.

The warning that is deleted by this patch is: you _just_ ran gc, and hey
we even did it automatically for you, but we're still in a funky state
afterwards. You might need to clean up this state.

(As an aside, I think the git-gui warning pre-dates "gc --auto", and
probably should just run that rather than implementing its own
heuristic).

> > I agree that the "-1" return is a little funny. Perhaps on the reading
> > side we should detect that the log contains only a "warning" line and
> > adjust our exit code accordingly.
> 
> Yes, this is a real problem, and it would affect any other warning
> (or even GIT_TRACE=1 output) produced by gc --auto as well.  I think we
> should consider it a serious bug, separate from the symptom Jonathan is
> fixing.
> 
> Unfortunately I don't have a great idea about how to fix it.  Should
> we avoid writing warnings to gc.log in daemon mode?  That would
> prevent the user from ever seeing the warnings, but because of the
> nature of a warning, that might be reasonable.

If you do that without anything further, then it will break the
protection against repeatedly running auto-gc, as I described in the
previous email.

> Should we put warnings
> in a separate log file with different semantics?  Should we change the
> format of gc.log to allow more structured information (include 'gc'
> exit code) and perform a two-stage migration to the new format (first
> learn to read the new format, then switch to writing it)?

Any of those would work similarly to the "just detect warnings" I
suggested earlier, with respect to keeping the "1 day" expiration
intact, so I'd be OK with them. In theory they'd be more robust than
scraping the "warning:" prefix. But in practice, I think you have to
resort to scraping anyway, if you are interested in treating warnings
from sub-processes the same way.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 18:52     ` Jeff King
@ 2018-07-16 19:09       ` Jonathan Nieder
  2018-07-16 19:41         ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 19:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Hi,

Jeff King wrote:
> On Mon, Jul 16, 2018 at 11:22:07AM -0700, Jonathan Nieder wrote:
>> Jeff King wrote:

>>> So while I completely agree that it's not a great thing to encourage
>>> users to blindly run "git prune", I think it _is_ actionable.
>>
>> To flesh this out a little more: what user action do you suggest?  Could
>> we carry out that action automatically?
>
> Er, the action is to run "git prune", like the warning says. :)

I don't think we want to recommend that, especially when "git gc --auto"
does the right thing automatically.

Can you convince me I'm wrong?

[...]
>> I mentally make a distinction between this warning from "git gc
>> --auto" and the warning from commands like "git gui".
[...]
> I don't think those warnings are the same. The warning from "git gui"
> is: you may benefit from running gc.
>
> The warning that is deleted by this patch is: you _just_ ran gc, and hey
> we even did it automatically for you, but we're still in a funky state
> afterwards. You might need to clean up this state.

This sounds awful.  It sounds to me like you're saying "git gc --auto"
is saying "I just did the wrong thing, and here is how you can clean
up after me".  That's not how I want a program to behave.

But there's a simpler explanation for what "git gc --auto" was trying
to say, which Jonathan described.

[...]
>> Yes, this is a real problem, and it would affect any other warning
>> (or even GIT_TRACE=1 output) produced by gc --auto as well.  I think we
>> should consider it a serious bug, separate from the symptom Jonathan is
>> fixing.
>>
>> Unfortunately I don't have a great idea about how to fix it.  Should
>> we avoid writing warnings to gc.log in daemon mode?  That would
>> prevent the user from ever seeing the warnings, but because of the
>> nature of a warning, that might be reasonable.
>
> If you do that without anything further, then it will break the
> protection against repeatedly running auto-gc, as I described in the
> previous email.

Do you mean ratelimiting for the message, or do you actually mean
repeatedly running auto-gc itself?

If we suppress warnings, there would still be a gc.log while "git gc
--auto" is running, just as though there had been no warnings at all.
I believe this is close to the intended behavior; it's the same as
what you'd get without daemon mode, except that you lose the warning.

[...]
>> Should we put warnings
>> in a separate log file with different semantics?  Should we change the
>> format of gc.log to allow more structured information (include 'gc'
>> exit code) and perform a two-stage migration to the new format (first
>> learn to read the new format, then switch to writing it)?
>
> Any of those would work similarly to the "just detect warnings" I
> suggested earlier, with respect to keeping the "1 day" expiration
> intact, so I'd be OK with them. In theory they'd be more robust than
> scraping the "warning:" prefix. But in practice, I think you have to
> resort to scraping anyway, if you are interested in treating warnings
> from sub-processes the same way.

Can you say more about this for me?  I am not understanding what
you're saying necessitates scraping the output.  I would strongly
prefer to avoid scraping the output.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
  2018-07-16 17:51 ` Jeff King
@ 2018-07-16 19:15 ` Elijah Newren
  2018-07-16 19:19   ` Jonathan Nieder
  2018-07-16 19:52   ` Jeff King
  2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
  2 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 19:15 UTC (permalink / raw)
  To: git; +Cc: jonathantanmy, peff, jrnieder, Elijah Newren


On Mon, Jul 16, 2018 at 10:27 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
> cruft", 2007-09-17), the user was warned if there were too many
> unreachable loose objects. This made sense at the time, because gc
> couldn't prune them safely. But subsequently, git prune learned the
> ability to not prune recently created loose objects, making pruning able
> to be done more safely, and gc was made to automatically prune old
> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
> 2.weeks.ago" by default", 2008-03-12).
...
>
> ---
> This was noticed when a daemonized gc run wrote this warning to the log
> file, and returned 0; but a subsequent run merely read the log file, saw
> that it is non-empty and returned -1 (which is inconsistent in that such
> a run should return 0, as it did the first time).

Yeah, we've hit this several times too.  I even created a testcase and a
workaround, though I never got it into proper submission form.

The basic problem here, at least for us, is that gc has enough
information to know it could expunge some objects, but because of how
it is structured in terms of several substeps (reflog expiration,
repack, prune), the information is lost between the steps and it
instead writes them out as unreachable objects.  If we could prune (or
avoid exploding) loose objects that are only reachable from reflog
entries that we are expiring, then the problem goes away for us.  (I
totally understand that other repos may have enough unreachable
objects for other reasons that Peff's suggestion to just pack up
unreachable objects is still a really good idea.  But on its own, it
seems like a waste since it's packing stuff that we know we could just
expunge.)

Anyway, my very rough testcase is below.  The workaround is the
git_actual_garbage_collect() function (minus the call to
wait_for_background_gc_to_finish).

Elijah

---


wait_for_background_gc_to_finish() {
    while ( ps -ef | grep -v grep | grep --quiet git.gc.--auto ); do
        sleep 1;
    done
}

git_standard_garbage_collect() {
    # Current git gc sprays unreachable objects back in loose form; this is
    # fine in many cases, but is annoying when done with objects which
    # newly become unreachable because of something else git-gc does and
    # git-gc doesn't clean them up.
    git gc --auto
    wait_for_background_gc_to_finish
}

git_actual_garbage_collect() {
    GITDIR=$(git rev-parse --git-dir)

    # Record all revisions stored in reflog before and after gc
    git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
    git gc --auto
    wait_for_background_gc_to_finish
    git rev-list --no-walk --reflog >$GITDIR/gc.final-refs

    # Find out which reflog entries were removed
    DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))

    # Get the list of objects which used to be reachable, but were made
    # unreachable due to gc's reflog expiration.  To get these, I need
    # the intersection of things reachable from $DELETED_REFS and things
    # which are unreachable now.
    git rev-list --objects $DELETED_REFS --not --all --reflog | awk '{print $1}' >$GITDIR/gc.previously-reachable-objects
    git prune --expire=now --dry-run | awk '{print $1}' > $GITDIR/gc.unreachable-objects

    # Delete all the previously-reachable-objects made unreachable by the
    # reflog expiration done by git gc
    comm -12 <(sort $GITDIR/gc.unreachable-objects) <(sort $GITDIR/gc.previously-reachable-objects) | sed -e "s#^\(..\)#$GITDIR/objects/\1/#" | xargs rm
}


test -d testcase && { echo "testcase exists; exiting"; exit 1; }
git init testcase/
cd testcase

# Create a basic commit
echo hello >world
git add world
git commit -q -m "Initial"

# Create a commit with lots of files
for i in {0000..9999}; do echo $i >$i; done
git add [0-9]*
git commit --quiet -m "Lots of numbers"

# Pack it all up
git gc --quiet

# Stop referencing the garbage
git reset --quiet --hard HEAD~1

# Pretend we did all the above stuff 30 days ago
for rlog in $(find .git/logs -type f); do
  # Subtract 3E6 (just over 30 days) from every date (assuming dates have
  # exactly 10 digits, which just happens to be valid...right now at least)
  perl -i -ne '/(.*?)(\b[0-9]{10}\b)(.*)/ && print $1,$2-3000000,$3,"\n"' $rlog
done

# HOWEVER: note that the pack is new; if we make the pack old, the old objects
# will get pruned for us.  But it is quite common to have new packfiles with
# old-and-soon-to-be-unreferenced-objects because frequent gc's mean moving
# the objects to new packfiles often, and eventually the reflog is expired.
# If you want to test them being part of an old packfile, uncomment this:
#   find .git/objects/pack -type f | xargs touch -t 200001010101

# Create 50 packfiles in the current repo so that 'git gc --auto' will
# trigger `git repack -A -d -l` instead of just `git repack -d -l`
for i in {01..50}; do
  git fast-export master | sed -e s/Initial/Initi$i/ | git -c fastimport.unpacklimit=0 fast-import --quiet |& grep -v "Not updating refs/heads/master"
done

echo "*** Before gc, reflog refers to garbage-collectible commits: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
echo "*** Before gc, everything is packed with no loose objects: ***"
git count-objects -v

git_standard_garbage_collect  # Just `git gc --auto`
#git_actual_garbage_collect    # What I really want from `git gc --auto`

echo -e "\n\n*** After gc, commit garbage collected and objects made loose: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
git count-objects -v

echo -e "\n\n*** Now we can trigger the 'too many unreachable' error: ***"
git gc --auto

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:15 ` Elijah Newren
@ 2018-07-16 19:19   ` Jonathan Nieder
  2018-07-16 20:21     ` Elijah Newren
  2018-07-16 19:52   ` Jeff King
  1 sibling, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 19:19 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, jonathantanmy, peff

Hi,

Elijah Newren wrote:

> The basic problem here, at least for us, is that gc has enough
> information to know it could expunge some objects, but because of how
> it is structured in terms of several substeps (reflog expiration,
> repack, prune), the information is lost between the steps and it
> instead writes them out as unreachable objects.  If we could prune (or
> avoid exploding) loose objects that are only reachable from reflog
> entries that we are expiring, then the problem goes away for us.

My understanding is that exploding the objects is intentional behavior,
to avoid a race where objects are newly referenced while they are being
pruned.

I am not a fan of that behavior.  It's still racy.  But when we've
brought it up in the past, the consensus seems to have been that it's
better than nothing.  Documentation/technical/hash-function-transition.txt
section "Loose objects and unreachable objects" describes a way to
eliminate the race.

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:09       ` Jonathan Nieder
@ 2018-07-16 19:41         ` Jeff King
  2018-07-16 19:54           ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 19:41 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 12:09:49PM -0700, Jonathan Nieder wrote:

> >>> So while I completely agree that it's not a great thing to encourage
> >>> users to blindly run "git prune", I think it _is_ actionable.
> >>
> >> To flesh this out a little more: what user action do you suggest?  Could
> >> we carry out that action automatically?
> >
> > Er, the action is to run "git prune", like the warning says. :)
> 
> I don't think we want to recommend that, especially when "git gc --auto"
> does the right thing automatically.

But that's the point. This warning is written literally after running
"git gc --auto" _didn't_ do the right thing. Yes, it would be nicer if
it could do the right thing. But it doesn't yet know how to.

See the thread I linked earlier on putting unreachable objects into a
pack, which I think is the real solution.

> > The warning that is deleted by this patch is: you _just_ ran gc, and hey
> > we even did it automatically for you, but we're still in a funky state
> > afterwards. You might need to clean up this state.
> 
> This sounds awful.  It sounds to me like you're saying "git gc --auto"
> is saying "I just did the wrong thing, and here is how you can clean
> up after me".  That's not how I want a program to behave.

Sure, it would be nice if it did the right thing. Nobody has written
that yet. Until they do, we have to deal with the fallout.

> > If you do that without anything further, then it will break the
> > protection against repeatedly running auto-gc, as I described in the
> > previous email.
> 
> Do you mean ratelimiting for the message, or do you actually mean
> repeatedly running auto-gc itself?
> 
> If we suppress warnings, there would still be a gc.log while "git gc
> --auto" is running, just as though there had been no warnings at all.
> I believe this is close to the intended behavior; it's the same as
> what you'd get without daemon mode, except that you lose the warning.

I mean that if you do not write a persistent log, then "gc --auto" will
do an unproductive gc every time it is invoked. I.e., it will see "oh,
there are too many loose objects", and then waste a bunch of CPU every
time you commit.

> > Any of those would work similarly to the "just detect warnings" I
> > suggested earlier, with respect to keeping the "1 day" expiration
> > intact, so I'd be OK with them. In theory they'd be more robust than
> > scraping the "warning:" prefix. But in practice, I think you have to
> > resort to scraping anyway, if you are interested in treating warnings
> > from sub-processes the same way.
> 
> Can you say more about this for me?  I am not understanding what
> you're saying necessitates scraping the output.  I would strongly
> prefer to avoid scraping the output.

A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune")
with their stderr redirected into the logfile. If you want to have
warnings go somewhere else, then either:

  - you need some way to tell those sub-commands to send the warnings
    elsewhere (i.e., _not_ stderr)

    or

  - you have to post-process the output they send to separate warnings
    from other errors. Right now that means scraping. If you are
    proposing a system of machine-readable output, it would need to work
    not just for git-gc, but for every sub-command it runs.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:15 ` Elijah Newren
  2018-07-16 19:19   ` Jonathan Nieder
@ 2018-07-16 19:52   ` Jeff King
  2018-07-16 20:16     ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 19:52 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, jonathantanmy, jrnieder

On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote:

> The basic problem here, at least for us, is that gc has enough
> information to know it could expunge some objects, but because of how
> it is structured in terms of several substeps (reflog expiration,
> repack, prune), the information is lost between the steps and it
> instead writes them out as unreachable objects.  If we could prune (or
> avoid exploding) loose objects that are only reachable from reflog
> entries that we are expiring, then the problem goes away for us.  (I
> totally understand that other repos may have enough unreachable
> objects for other reasons that Peff's suggestion to just pack up
> unreachable objects is still a really good idea.  But on its own, it
> seems like a waste since it's packing stuff that we know we could just
> expunge.)

No, we should have expunged everything that could be during the "repack"
and "prune" steps. We feed the expiration time to repack, so that it
knows to drop objects entirely instead of exploding them loose.

So the leftovers really are objects that cannot be deleted yet. You
could literally just do:

  find .git/objects/?? -type f |
  perl -lne 's{../.{38}$} and print "$1$2"' |
  git pack-objects .git/objects/pack/cruft-pack

But:

  - that will explode them out only to repack them, which is inefficient
    (if they're already packed, you can probably reuse deltas, not to
    mention the I/O savings)

  - there's the question of how to handle timestamps. Some of those
    objects may have been _about_ to expire, but now you've just put
    them in a brand-new pack that adds another 2 weeks to their life

  - the find above is sloppy, and will race with somebody adding new
    objects to the repo

So probably you want to have pack-objects write out the list of objects
it _would_ explode, rather than exploding them. And then before
git-repack deletes the old packs, put those into a new cruft pack. That
_just_ leaves the timestamp issue (which is discussed at length in the
thread I linked earlier).

> git_actual_garbage_collect() {
>     GITDIR=$(git rev-parse --git-dir)
> 
>     # Record all revisions stored in reflog before and after gc
>     git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
>     git gc --auto
>     wait_for_background_gc_to_finish
>     git rev-list --no-walk --reflog >$GITDIR/gc.final-refs
> 
>     # Find out which reflog entries were removed
>     DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))

This is too detailed, I think. There are other reasons to have
unreachable objects than expired reflogs. I think you really just want
to consider all unreachable objects (like the pack-objects thing I
mentioned above).

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:41         ` Jeff King
@ 2018-07-16 19:54           ` Jonathan Nieder
  2018-07-16 20:29             ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 19:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King wrote:
> On Mon, Jul 16, 2018 at 12:09:49PM -0700, Jonathan Nieder wrote:
>> Jeff King wrote:

>>> Er, the action is to run "git prune", like the warning says. :)
>>
>> I don't think we want to recommend that, especially when "git gc --auto"
>> does the right thing automatically.
>
> But that's the point. This warning is written literally after running
> "git gc --auto" _didn't_ do the right thing. Yes, it would be nicer if
> it could do the right thing. But it doesn't yet know how to.

I think we fundamentally disagree, and I would like your help getting
past this impasse.

Even restricting attention to the warning, not the exit code, I think
you are saying the current behavior is acceptable.  I do not believe
it is.  I think you are saying Jonathan's patch makes the behavior
worse, and I'm not seeing it.  Can you describe an example user
interaction where the current behavior would be better than the
behavior after Jonathan's patch?  That might help make this discussion
more concrete.

[...]
> See the thread I linked earlier on putting unreachable objects into a
> pack, which I think is the real solution.

Have you looked over the discussion in "Loose objects and unreachable
objects" in Documentation/technical/hash-function-transition.txt?  Do
you have thoughts on it (preferrably in a separate thread)?

[...]
>> This sounds awful.  It sounds to me like you're saying "git gc --auto"
>> is saying "I just did the wrong thing, and here is how you can clean
>> up after me".  That's not how I want a program to behave.
>
> Sure, it would be nice if it did the right thing. Nobody has written
> that yet. Until they do, we have to deal with the fallout.

"git gc --auto" is already doing the right thing.

[...]
> I mean that if you do not write a persistent log, then "gc --auto" will
> do an unproductive gc every time it is invoked. I.e., it will see "oh,
> there are too many loose objects", and then waste a bunch of CPU every
> time you commit.

If so, then this would already be the behavior in non daemon mode.
Are you saying this accidental effect of daemon mode is in fact
intentional?

[...]
>>>                                 But in practice, I think you have to
>>> resort to scraping anyway, if you are interested in treating warnings
>>> from sub-processes the same way.
>>
>> Can you say more about this for me?  I am not understanding what
>> you're saying necessitates scraping the output.  I would strongly
>> prefer to avoid scraping the output.
>
> A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune")
> with their stderr redirected into the logfile. If you want to have
> warnings go somewhere else, then either:
>
>   - you need some way to tell those sub-commands to send the warnings
>     elsewhere (i.e., _not_ stderr)
>
>     or
>
>   - you have to post-process the output they send to separate warnings
>     from other errors. Right now that means scraping. If you are
>     proposing a system of machine-readable output, it would need to work
>     not just for git-gc, but for every sub-command it runs.

      or

    - you can simply record and trust the exit code

A simple way to do that without changing formats is to truncate the
file when exiting with status 0.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:52   ` Jeff King
@ 2018-07-16 20:16     ` Elijah Newren
  2018-07-16 20:38       ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 20:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 12:52 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote:
>
>> The basic problem here, at least for us, is that gc has enough
>> information to know it could expunge some objects, but because of how
>> it is structured in terms of several substeps (reflog expiration,
>> repack, prune), the information is lost between the steps and it
>> instead writes them out as unreachable objects.  If we could prune (or
>> avoid exploding) loose objects that are only reachable from reflog
>> entries that we are expiring, then the problem goes away for us.  (I
>> totally understand that other repos may have enough unreachable
>> objects for other reasons that Peff's suggestion to just pack up
>> unreachable objects is still a really good idea.  But on its own, it
>> seems like a waste since it's packing stuff that we know we could just
>> expunge.)
>
> No, we should have expunged everything that could be during the "repack"
> and "prune" steps. We feed the expiration time to repack, so that it
> knows to drop objects entirely instead of exploding them loose.

Um, except it doesn't actually do that.  The testcase I provided shows
that it leaves around 10000 objects that are totally deletable and
were only previously referenced by reflog entries -- entries that gc
removed without removing the corresponding objects.


I will note that my testcase was slightly out-of-date; with current
git it needs a call to 'wait_for_background_gc_to_finish' right before
the 'git gc --quiet' to avoid erroring out.

> You
> could literally just do:
>
>   find .git/objects/?? -type f |
>   perl -lne 's{../.{38}$} and print "$1$2"' |
>   git pack-objects .git/objects/pack/cruft-pack
>
> But:
>
>   - that will explode them out only to repack them, which is inefficient
>     (if they're already packed, you can probably reuse deltas, not to
>     mention the I/O savings)
>
>   - there's the question of how to handle timestamps. Some of those
>     objects may have been _about_ to expire, but now you've just put
>     them in a brand-new pack that adds another 2 weeks to their life
>
>   - the find above is sloppy, and will race with somebody adding new
>     objects to the repo
>
> So probably you want to have pack-objects write out the list of objects
> it _would_ explode, rather than exploding them. And then before
> git-repack deletes the old packs, put those into a new cruft pack. That
> _just_ leaves the timestamp issue (which is discussed at length in the
> thread I linked earlier).
>
>> git_actual_garbage_collect() {
>>     GITDIR=$(git rev-parse --git-dir)
>>
>>     # Record all revisions stored in reflog before and after gc
>>     git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
>>     git gc --auto
>>     wait_for_background_gc_to_finish
>>     git rev-list --no-walk --reflog >$GITDIR/gc.final-refs
>>
>>     # Find out which reflog entries were removed
>>     DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))
>
> This is too detailed, I think. There are other reasons to have
> unreachable objects than expired reflogs. I think you really just want
> to consider all unreachable objects (like the pack-objects thing I
> mentioned above).

Yes, like I said, coarse workaround and I never had time to create a
real fix.  But I thought the testcase might be useful as a
demonstration of how git gc leaves around loose objects that were
previously reference by reflogs that gc itself pruned.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:19   ` Jonathan Nieder
@ 2018-07-16 20:21     ` Elijah Newren
  2018-07-16 20:35       ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 20:21 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git Mailing List, Jonathan Tan, Jeff King

Hi,

On Mon, Jul 16, 2018 at 12:19 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Elijah Newren wrote:
>
>> The basic problem here, at least for us, is that gc has enough
>> information to know it could expunge some objects, but because of how
>> it is structured in terms of several substeps (reflog expiration,
>> repack, prune), the information is lost between the steps and it
>> instead writes them out as unreachable objects.  If we could prune (or
>> avoid exploding) loose objects that are only reachable from reflog
>> entries that we are expiring, then the problem goes away for us.
>
> My understanding is that exploding the objects is intentional behavior,
> to avoid a race where objects are newly referenced while they are being
> pruned.
>
> I am not a fan of that behavior.  It's still racy.  But when we've
> brought it up in the past, the consensus seems to have been that it's
> better than nothing.  Documentation/technical/hash-function-transition.txt
> section "Loose objects and unreachable objects" describes a way to
> eliminate the race.

Ah, that's good to know and at least makes sense.  It seems somewhat
odd, though; loose objects that are two weeks old are just as
susceptible to being referenced anew by new commits, so the default of
running 'git prune --expire=2.weeks.ago' as gc currently does would
also be unsafe, wouldn't it?  Why is that any more or less unsafe than
pruning objects only referenced by reflog entries that are more than
90 days old?

Elijah

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 19:54           ` Jonathan Nieder
@ 2018-07-16 20:29             ` Jeff King
  2018-07-16 20:37               ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 20:29 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 12:54:31PM -0700, Jonathan Nieder wrote:

> Even restricting attention to the warning, not the exit code, I think
> you are saying the current behavior is acceptable.  I do not believe
> it is.

I didn't say that at all. The current situation sucks, and I think the
right solution is to pack the unreachable objects instead of exploding
them.

> I think you are saying Jonathan's patch makes the behavior
> worse, and I'm not seeing it.  Can you describe an example user
> interaction where the current behavior would be better than the
> behavior after Jonathan's patch?  That might help make this discussion
> more concrete.

It makes it worse because there is nothing to throttle a "gc --auto"
that is making no progress.

Try this:

-- >8 --
#!/bin/sh

rm -rf repo

# new mostly-empty repo
git init repo
cd repo
git commit --allow-empty -m base

# make a bunch of unreachable blobs
for i in $(seq 7000); do
  echo "blob"
  echo "data <<EOD"
  echo "cruft $i"
  echo "EOD"
done |
git fast-import

# This gc will explode them (doing a "gc --auto" isn't sufficient here, because
# we don't have enough _other_ material in the repo to trigger a gc. But you
# can imagine that a normal repo would eventually "gc --auto".
git gc

# Now simulate some real work in the repo.
for i in $(seq 50); do
  git commit --allow-empty -m "commit $i"
  sleep 30
done
-- 8< --

With the current code, that produces a bunch of annoying warnings for
every commit ("I couldn't gc because the last gc reported a warning").
But after Jonathan's patch, every single commit will do a full gc of the
repository. In this tiny repository that's relatively quick, but in a
real repo it's a ton of CPU and I/O, all for nothing.

> [...]
> > See the thread I linked earlier on putting unreachable objects into a
> > pack, which I think is the real solution.
> 
> Have you looked over the discussion in "Loose objects and unreachable
> objects" in Documentation/technical/hash-function-transition.txt?  Do
> you have thoughts on it (preferrably in a separate thread)?

It seems to propose putting the unreachable objects into a pack. Which
yes, I absolutely agree with (as I thought I'd been saying in every
single email in this thread).

> > I mean that if you do not write a persistent log, then "gc --auto" will
> > do an unproductive gc every time it is invoked. I.e., it will see "oh,
> > there are too many loose objects", and then waste a bunch of CPU every
> > time you commit.
> 
> If so, then this would already be the behavior in non daemon mode.
> Are you saying this accidental effect of daemon mode is in fact
> intentional?

I'm not sure if I'd call it intentional, since I don't recall ever
seeing this side effect discussed. But since daemonizing is the default,
I think that's the case people have focused on when they hit annoying
cases. E.g., a831c06a2b (gc: ignore old gc.log files, 2017-02-10).

I'd consider the fact that "gc --auto" with gc.autoDetach=false will
repeatedly do useless work to be a minor bug. But I think Jonathan's
patch makes it even worse because we do not even tell people that
useless work is being done (while making them wait for each gc to
finish!). Even if we were going to remove this message to help the
daemonized case, I think we'd want to retain it for the non-daemon case.

> > A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune")
> > with their stderr redirected into the logfile. If you want to have
> > warnings go somewhere else, then either:
> >
> >   - you need some way to tell those sub-commands to send the warnings
> >     elsewhere (i.e., _not_ stderr)
> >
> >     or
> >
> >   - you have to post-process the output they send to separate warnings
> >     from other errors. Right now that means scraping. If you are
> >     proposing a system of machine-readable output, it would need to work
> >     not just for git-gc, but for every sub-command it runs.
> 
>       or
> 
>     - you can simply record and trust the exit code
> 
> A simple way to do that without changing formats is to truncate the
> file when exiting with status 0.

That's a different behavior than what we do now (and what was suggested
earlier), in that it assumes that anything written to stderr is OK for
gc to hide from the user if the process exits with code zero.

That's probably OK in most cases, though I wonder if there are corner
cases. For example, if you have a corrupt ref, we used to say "error:
refs/heads/foo does not point to a valid object!" but otherwise ignore
it. These days git-gc sets REF_PARANOIA=1, so we'll actually barf on a
corrupt ref. But I wonder if there are other cases lurking.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:21     ` Elijah Newren
@ 2018-07-16 20:35       ` Jeff King
  2018-07-16 20:56         ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 20:35 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Jonathan Nieder, Git Mailing List, Jonathan Tan

On Mon, Jul 16, 2018 at 01:21:40PM -0700, Elijah Newren wrote:

> > My understanding is that exploding the objects is intentional behavior,
> > to avoid a race where objects are newly referenced while they are being
> > pruned.
> >
> > I am not a fan of that behavior.  It's still racy.  But when we've
> > brought it up in the past, the consensus seems to have been that it's
> > better than nothing.  Documentation/technical/hash-function-transition.txt
> > section "Loose objects and unreachable objects" describes a way to
> > eliminate the race.
> 
> Ah, that's good to know and at least makes sense.  It seems somewhat
> odd, though; loose objects that are two weeks old are just as
> susceptible to being referenced anew by new commits, so the default of
> running 'git prune --expire=2.weeks.ago' as gc currently does would
> also be unsafe, wouldn't it?  Why is that any more or less unsafe than
> pruning objects only referenced by reflog entries that are more than
> 90 days old?

The 2-week safety isn't primarily about things which just became
unreferenced.  It's about things which are in the act of being
referenced.

Imagine a "git commit" racing with a "git prune". The commit has to
create an object, and then it will update a ref to point to it. But
between those two actions, prune may racily delete the object!
The mtime grace period is what makes that work.

Using 2 weeks is sort of ridiculous for that. But it also helps with
manual recovery (e.g., imagine a blob added to the index but never
committed; 3 days later you may want to try to recover your old work).

And you're correct that a new git-commit may still reference an old
object (e.g., a blob that's 5 seconds shy of being 2 weeks old that
you're including in a new commit). That's why we retain non-fresh
objects that are referenced from fresh ones (so as long as you made the
new commit recently, it transitively infers freshness on the old blob),
and why we fresh mtimes when we elide a write for an existing object.

That's _still_ not race-proof, because none of these operations is
atomic. git-prune can decide the blob is unfresh at the exact moment
you're creating the commit object.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:29             ` Jeff King
@ 2018-07-16 20:37               ` Jonathan Nieder
  2018-07-16 21:09                 ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 20:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King wrote:
> On Mon, Jul 16, 2018 at 12:54:31PM -0700, Jonathan Nieder wrote:

>> Even restricting attention to the warning, not the exit code, I think
>> you are saying the current behavior is acceptable.  I do not believe
>> it is.
>
> I didn't say that at all. The current situation sucks,

Thanks for clarifying.  That helps.

>                                                        and I think the
> right solution is to pack the unreachable objects instead of exploding
> them.

That seems like a huge widening in scope relative to what this
original patch tackled.  I'm not aware of a way to do this without
breaking compatibility with the current (broken) race prevention.  If
you're saying that breaking compatibility in that way is okay with
you, then great!

[...]
>> I think you are saying Jonathan's patch makes the behavior
>> worse, and I'm not seeing it.  Can you describe an example user
>> interaction where the current behavior would be better than the
>> behavior after Jonathan's patch?  That might help make this discussion
>> more concrete.
>
> It makes it worse because there is nothing to throttle a "gc --auto"
> that is making no progress.
[...]
> With the current code, that produces a bunch of annoying warnings for
> every commit ("I couldn't gc because the last gc reported a warning").
> But after Jonathan's patch, every single commit will do a full gc of the
> repository. In this tiny repository that's relatively quick, but in a
> real repo it's a ton of CPU and I/O, all for nothing.

I see.  Do I understand correctly that if we find a way to print the
warning but not error out, that would be good enough for you?

[...]
>> Have you looked over the discussion in "Loose objects and unreachable
>> objects" in Documentation/technical/hash-function-transition.txt?  Do
>> you have thoughts on it (preferrably in a separate thread)?
>
> It seems to propose putting the unreachable objects into a pack. Which
> yes, I absolutely agree with (as I thought I'd been saying in every
> single email in this thread).

I figured you were proposing something like
https://public-inbox.org/git/20180113100734.GA30698@sigill.intra.peff.net/,
which is still racy (because it does not handle the freshening in a safe
way).

[...]
>           Even if we were going to remove this message to help the
> daemonized case, I think we'd want to retain it for the non-daemon case.

Interesting.  That should be doable, e.g. following the approach
described below.

[...]
>> A simple way to do that without changing formats is to truncate the
>> file when exiting with status 0.
>
> That's a different behavior than what we do now (and what was suggested
> earlier), in that it assumes that anything written to stderr is OK for
> gc to hide from the user if the process exits with code zero.
>
> That's probably OK in most cases, though I wonder if there are corner
> cases. For example, if you have a corrupt ref, we used to say "error:
> refs/heads/foo does not point to a valid object!" but otherwise ignore
> it. These days git-gc sets REF_PARANOIA=1, so we'll actually barf on a
> corrupt ref. But I wonder if there are other cases lurking.

What decides it for me is that the user did not invoke "git gc --auto"
explicitly, so anything "git gc --auto" prints is tangential to what
the user was trying to do.  If the gc failed, that is worth telling
them, but if e.g. it encountered a disk I/O error reading and
succeeded on retry (to make up a fake example), then that's likely
worth logging to syslog but it's not something the user asked to be
directly informed about.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:16     ` Elijah Newren
@ 2018-07-16 20:38       ` Jeff King
  2018-07-16 21:09         ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 20:38 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 01:16:45PM -0700, Elijah Newren wrote:

> >> The basic problem here, at least for us, is that gc has enough
> >> information to know it could expunge some objects, but because of how
> >> it is structured in terms of several substeps (reflog expiration,
> >> repack, prune), the information is lost between the steps and it
> >> instead writes them out as unreachable objects.  If we could prune (or
> >> avoid exploding) loose objects that are only reachable from reflog
> >> entries that we are expiring, then the problem goes away for us.  (I
> >> totally understand that other repos may have enough unreachable
> >> objects for other reasons that Peff's suggestion to just pack up
> >> unreachable objects is still a really good idea.  But on its own, it
> >> seems like a waste since it's packing stuff that we know we could just
> >> expunge.)
> >
> > No, we should have expunged everything that could be during the "repack"
> > and "prune" steps. We feed the expiration time to repack, so that it
> > knows to drop objects entirely instead of exploding them loose.
> 
> Um, except it doesn't actually do that.  The testcase I provided shows
> that it leaves around 10000 objects that are totally deletable and
> were only previously referenced by reflog entries -- entries that gc
> removed without removing the corresponding objects.

What's your definition of "totally deletable"?

If the pack they were in is less than 2 weeks old, then they are
unreachable but "fresh", and are intentionally retained. If it was older
than 2 weeks, they should have been deleted.

If you don't like that policy, you can set gc.pruneExpire to something
lower (including "now", but watch out, as that can race with other
operations!). But I don't think that's an implementation short-coming.
It's intentional to keep those objects. It's just that the format
they're in is inefficient and confuses later gc runs.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:35       ` Jeff King
@ 2018-07-16 20:56         ` Jonathan Nieder
  2018-07-16 21:12           ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 20:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, Git Mailing List, Jonathan Tan

Jeff King wrote:
> On Mon, Jul 16, 2018 at 01:21:40PM -0700, Elijah Newren wrote:
>> Jonathan Nieder wrote:

>>> My understanding is that exploding the objects is intentional behavior,
>>> to avoid a race where objects are newly referenced while they are being
>>> pruned.
[...]
>> Ah, that's good to know and at least makes sense.  It seems somewhat
>> odd, though; loose objects that are two weeks old are just as
>> susceptible to being referenced anew by new commits, so the default of
>> running 'git prune --expire=2.weeks.ago' as gc currently does would
>> also be unsafe, wouldn't it?  Why is that any more or less unsafe than
>> pruning objects only referenced by reflog entries that are more than
>> 90 days old?

Part of the answer is that this safety feature applies even when
reflogs are not in use.  Another part is that as you say, you can
start referencing an object at the same time as the reflog entry
pointing to it is expiring.

[...]
>                                    That's why we retain non-fresh
> objects that are referenced from fresh ones (so as long as you made the
> new commit recently, it transitively infers freshness on the old blob),
> and why we fresh mtimes when we elide a write for an existing object.
>
> That's _still_ not race-proof, because none of these operations is
> atomic.

Indeed.  One of the advantages of using a packfile for unreachable
objects is that metadata associated with that packfile can be updated
atomically.

I don't believe we are very good at transitively propagating freshness
today, by the way.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:37               ` Jonathan Nieder
@ 2018-07-16 21:09                 ` Jeff King
  2018-07-16 21:40                   ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 21:09 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 01:37:53PM -0700, Jonathan Nieder wrote:

> >                                                        and I think the
> > right solution is to pack the unreachable objects instead of exploding
> > them.
> 
> That seems like a huge widening in scope relative to what this
> original patch tackled.

I'm not at all saying that Jonathan needs to work on that. It's a big
topic that we've been putting off for years. I _am_ saying that I don't
think his patch as-is should be merged, as it makes the daemonized
behavior here worse.

> I'm not aware of a way to do this without
> breaking compatibility with the current (broken) race prevention.  If
> you're saying that breaking compatibility in that way is okay with
> you, then great!

It depends what you're trying to fix. If you're trying to fix auto-gc
outputting an annoying message (or repeatedly doing useless invocations
of pack-objects), you don't have to break compatibility at all. You just
have to put the objects into a pack instead of exploding them loose.
Worst case if you use an old implementation to run "git gc", it may
accidentally freshen a cruft pack's timestamp.

If you want to improve the raciness, then yes, I think you need stronger
rules around marking packs as cruft/garbage, and making operations that
want to use those objects copy them out to a non-cruft location (like
you describe in the hash transition document). There you risk corruption
if an old implementation fails to make the non-cruft copy. But that's
really no worse than we are today.

So I don't even really see it as a backwards-compatibility break. But
even if it were, I'd probably still be fine with it. This is a
local-repo thing, and in the absolute worst case we could bump
core.repositoryformatversion (though again, I don't even think that
would be necessary since it would degrade to behaving just as the
current code does).

> > With the current code, that produces a bunch of annoying warnings for
> > every commit ("I couldn't gc because the last gc reported a warning").
> > But after Jonathan's patch, every single commit will do a full gc of the
> > repository. In this tiny repository that's relatively quick, but in a
> > real repo it's a ton of CPU and I/O, all for nothing.
> 
> I see.  Do I understand correctly that if we find a way to print the
> warning but not error out, that would be good enough for you?

Yes. I thought that's what I proposed originally. ;)

The key thing is that the presence of the warning/log still suppress
the actual gc for the gc.logExpiry period.

> >> Have you looked over the discussion in "Loose objects and unreachable
> >> objects" in Documentation/technical/hash-function-transition.txt?  Do
> >> you have thoughts on it (preferrably in a separate thread)?
> >
> > It seems to propose putting the unreachable objects into a pack. Which
> > yes, I absolutely agree with (as I thought I'd been saying in every
> > single email in this thread).
> 
> I figured you were proposing something like
> https://public-inbox.org/git/20180113100734.GA30698@sigill.intra.peff.net/,
> which is still racy (because it does not handle the freshening in a safe
> way).

That message is just me telling somebody that their idea _won't_ work. ;)

But I assume you mean the general idea of putting things in a cruft
pack[1]. Yes, I was only suggesting going that far as a solution to the
auto-gc woes. Solving races is another issue entirely (though obviously
it may make sense to build on the single-pack work, if it exists).

[1] The best message discussion on that is I think:

      https://public-inbox.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/

    which I think I linked earlier, so possibly that is even what you
    meant. :)

> What decides it for me is that the user did not invoke "git gc --auto"
> explicitly, so anything "git gc --auto" prints is tangential to what
> the user was trying to do.  If the gc failed, that is worth telling
> them, but if e.g. it encountered a disk I/O error reading and
> succeeded on retry (to make up a fake example), then that's likely
> worth logging to syslog but it's not something the user asked to be
> directly informed about.

I'm not sure I agree. If a repack discovered that you had a bit
corruption on disk, but you happened to have another copy of the object
available that made the repack succeed, I'd like to know that I'm
getting bit corruptions, and earlier is better. I think we need to
surface that information to the user eventually, and I don't think we
can count on syslog being universally available.

By definition we're daemonized here, so we can't count on even having
access to the user's terminal. But it would make more sense to me if the
logic were:

  - at the end of a "gc --auto" run, gc should write a machine-readable
    indication of its exit code (either as a final line in the log file,
    or to a gc.exitcode file). The warnings/errors remain intermingled
    in the logfile.

  - on the _next_ run, if:

    - the exit code was non-zero, dump the log output to stderr and exit
      with the same code. This is the same as the current behavior (with
      the exception that we propagate the code instead of returning -1,
      which is a minor bonus).

    - the exit code was zero, report to the user synchronously that the
      last run found some oddities, and dump the log. Proceed with the
      rest of auto-gc (either exiting 0 if nothing to be done, or doing
      useful work).

      Optionally, delete the log after showing it so the user is not
      spammed with the warning on every commit. Though for the warning
      under discussion, this does mean that if they don't take any
      action, they'll end up running a useless gc every _other_ commit.
      So I'd probably keep the log and let it expire naturally after the
      1-day gc.logExpiry period.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:38       ` Jeff King
@ 2018-07-16 21:09         ` Elijah Newren
  2018-07-16 21:21           ` Jeff King
  2018-07-16 21:31           ` Jonathan Nieder
  0 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 21:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 1:38 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 01:16:45PM -0700, Elijah Newren wrote:
>
>> >> The basic problem here, at least for us, is that gc has enough
>> >> information to know it could expunge some objects, but because of how
>> >> it is structured in terms of several substeps (reflog expiration,
>> >> repack, prune), the information is lost between the steps and it
>> >> instead writes them out as unreachable objects.  If we could prune (or
>> >> avoid exploding) loose objects that are only reachable from reflog
>> >> entries that we are expiring, then the problem goes away for us.  (I
>> >> totally understand that other repos may have enough unreachable
>> >> objects for other reasons that Peff's suggestion to just pack up
>> >> unreachable objects is still a really good idea.  But on its own, it
>> >> seems like a waste since it's packing stuff that we know we could just
>> >> expunge.)
>> >
>> > No, we should have expunged everything that could be during the "repack"
>> > and "prune" steps. We feed the expiration time to repack, so that it
>> > knows to drop objects entirely instead of exploding them loose.
>>
>> Um, except it doesn't actually do that.  The testcase I provided shows
>> that it leaves around 10000 objects that are totally deletable and
>> were only previously referenced by reflog entries -- entries that gc
>> removed without removing the corresponding objects.
>
> What's your definition of "totally deletable"?

The point of gc is to: expire reflogs, repack referenced objects, and
delete loose objects that (1) are no longer referenced and (2) are
"old enough".

The "old enough" bit is the special part here.  Before the gc, those
objects were referenced (only by old reflog entries) and were found in
a pack.  git currently writes those objects out to disk and gives them
the age of the packfile they are contained in, which I think is the
source of the bug.  We have a reference for those objects from the
reflogs, know they aren't referenced anywhere else, so those objects
to me are the age of those reflog entries: 90 days.  As such, they are
"old enough" and should be deleted.

With big repos, it's easy to get into situations where there are well
more than 10000 objects satisfying these conditions.  In fact, it's
not all that rare that the repo has far more loose objects after a git
gc than before.

I never got around to fixing it properly, though, because 'git prune'
is such a handy workaround that for now.  Having people nuke all their
loose objects is a bit risky, but the only other workaround people had
was to re-clone (waiting the requisite 20+ minutes for the repo to
download) and throw away their old clone.  (Though some people even
went that route, IIRC.)

> If the pack they were in is less than 2 weeks old, then they are
> unreachable but "fresh", and are intentionally retained. If it was older
> than 2 weeks, they should have been deleted.

That's what's currently implemented, yes, but as above I think that
logic is bad.

> If you don't like that policy, you can set gc.pruneExpire to something
> lower (including "now", but watch out, as that can race with other
> operations!). But I don't think that's an implementation short-coming.
> It's intentional to keep those objects. It's just that the format
> they're in is inefficient and confuses later gc runs.

I'm aware of pruneExpire, but I think it's the wrong way to handle this.

I totally agree with your general plan to put unreferenced loose
objects into a pack.  However, I don't think these objects should be
part of that pack; they should just be deleted instead.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 20:56         ` Jonathan Nieder
@ 2018-07-16 21:12           ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2018-07-16 21:12 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Elijah Newren, Git Mailing List, Jonathan Tan

On Mon, Jul 16, 2018 at 01:56:46PM -0700, Jonathan Nieder wrote:

> I don't believe we are very good at transitively propagating freshness
> today, by the way.

Perhaps I don't understand what you mean by transitive here. But we
should be traversing from any fresh object and marking any non-fresh
ones that are reachable from it to be saved. If you know of a case where
we're not, there's a bug, and I'd be happy to look into it.

The only problem I know about is the utter lack of atomicity.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:09         ` Elijah Newren
@ 2018-07-16 21:21           ` Jeff King
  2018-07-16 22:07             ` Elijah Newren
  2018-07-16 21:31           ` Jonathan Nieder
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 21:21 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 02:09:43PM -0700, Elijah Newren wrote:

> >> Um, except it doesn't actually do that.  The testcase I provided shows
> >> that it leaves around 10000 objects that are totally deletable and
> >> were only previously referenced by reflog entries -- entries that gc
> >> removed without removing the corresponding objects.
> >
> > What's your definition of "totally deletable"?
> 
> The point of gc is to: expire reflogs, repack referenced objects, and
> delete loose objects that (1) are no longer referenced and (2) are
> "old enough".
> 
> The "old enough" bit is the special part here.  Before the gc, those
> objects were referenced (only by old reflog entries) and were found in
> a pack.  git currently writes those objects out to disk and gives them
> the age of the packfile they are contained in, which I think is the
> source of the bug.  We have a reference for those objects from the
> reflogs, know they aren't referenced anywhere else, so those objects
> to me are the age of those reflog entries: 90 days.  As such, they are
> "old enough" and should be deleted.

OK, I see what you're saying, but...

> I never got around to fixing it properly, though, because 'git prune'
> is such a handy workaround that for now.  Having people nuke all their
> loose objects is a bit risky, but the only other workaround people had
> was to re-clone (waiting the requisite 20+ minutes for the repo to
> download) and throw away their old clone.  (Though some people even
> went that route, IIRC.)

If we were to delete those objects, wouldn't it be exactly the same as
running "git prune"? Or setting your gc.pruneExpire to "now" or even "5
minutes"?  Or are you concerned with taking other objects along for the
ride that weren't part of old reflogs? I think that's a valid concern,
but it's also an issue for objects which were previously referenced in
a reflog, but are part of another current operation.

Also, what do you do if there weren't reflogs in the repo? Or the
reflogs were deleted (e.g., because branch deletion drops the associated
reflog entirely)?

> With big repos, it's easy to get into situations where there are well
> more than 10000 objects satisfying these conditions.  In fact, it's
> not all that rare that the repo has far more loose objects after a git
> gc than before.

Yes, this is definitely a wart and I think is worth addressing.

> I totally agree with your general plan to put unreferenced loose
> objects into a pack.  However, I don't think these objects should be
> part of that pack; they should just be deleted instead.

I assume by "these objects" you mean ones which used to be reachable
from a reflog, but that reflog entry just expired.  I think you'd be
sacrificing some race-safety in that case.

If the objects went into a pack under a race-proof scheme, would that
actually bother you? Is it the 10,000 objects that's a problem, or is it
the horrible I/O from exploding them coupled with the annoying auto-gc
behavior?

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:09         ` Elijah Newren
  2018-07-16 21:21           ` Jeff King
@ 2018-07-16 21:31           ` Jonathan Nieder
  1 sibling, 0 replies; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 21:31 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Jeff King, Git Mailing List, Jonathan Tan

Elijah Newren wrote:

> I totally agree with your general plan to put unreferenced loose
> objects into a pack.  However, I don't think these objects should be
> part of that pack; they should just be deleted instead.

This might be the wrong thread to discuss it, but did you follow the
reference/prune race that Peff mentioned?  The simplest cure I'm aware
of to it does involve writing those objects to a pack.  The idea is to
enforce a straightforward contract:

There are two kinds of packs: GC and UNREACHABLE_GARBAGE.

Every object in a GC pack has a minimum lifetime of <ttl> (let's say
"1 days") from the time they are read.  If you start making use of an
object from a GC pack (e.g. by creating a new object referencing it),
you have three days to ensure it's referenced.

Each UNREACHABLE_GARBAGE pack has a <ttl> (let's say "3 days") from
the time it is created.  Objects in an UNREACHABLE_GARBAGE have no
minimum ttl from the time they are read.  If you want to start making
use of an object from an UNREACHABLE_GARBAGE pack (e.g. by creating a
new object referencing it), then copy it and everything it references
to a GC pack.

To avoid a proliferation of UNREACHABLE_GARBAGE packs, there's a rule
for coalescing them, but that's not relevant here.

It is perfectly possible for an object in a GC pack to reference an
object in an UNREACHABLE_GARBAGE pack via writes racing with gc, but
that's fine --- on the next gc run, the unreachable garbage objects
get copied to a GC pack.

We've been using this on a JGit DfsRepository based server for > 2
years now and it's been working well.  More details are in the "Loose
objects and unreachable objects" section in Documentation/technical/
mentioned before.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:09                 ` Jeff King
@ 2018-07-16 21:40                   ` Jonathan Nieder
  2018-07-16 21:45                     ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 21:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King wrote:
> On Mon, Jul 16, 2018 at 01:37:53PM -0700, Jonathan Nieder wrote:
>> Jeff King wrote:

>>> With the current code, that produces a bunch of annoying warnings for
>>> every commit ("I couldn't gc because the last gc reported a warning").
>>> But after Jonathan's patch, every single commit will do a full gc of the
>>> repository. In this tiny repository that's relatively quick, but in a
>>> real repo it's a ton of CPU and I/O, all for nothing.
>>
>> I see.  Do I understand correctly that if we find a way to print the
>> warning but not error out, that would be good enough for you?
>
> Yes. I thought that's what I proposed originally. ;)
>
> The key thing is that the presence of the warning/log still suppress
> the actual gc for the gc.logExpiry period.

Ah, now I think I see the source of the misunderstanding.

When I initially read the docs, I also assumed that

	If the file gc.log exists, then git gc --auto won’t run unless
	that file is more than gc.logExpiry old.

meant "git gc --auto" would be skipped (and thus silently succeed) when
the file is less than gc.logExpiry old.  But that assumption was wrong:
it errors out!

This makes scripting difficult, since arbitrary commands that
incidentally run "git gc --auto" will fail in this state, until gc.log
expires (but at that point they'll fail again anyway).

For comparison, in non-daemon mode, there is nothing enforcing the
kind of ratelimiting you are talking about.  It is an accident of
history.  If we want this kind of ratelimiting, I'd rather we build it
deliberately instead of relying on this accident.

Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:40                   ` Jonathan Nieder
@ 2018-07-16 21:45                     ` Jeff King
  2018-07-16 22:03                       ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 21:45 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 02:40:03PM -0700, Jonathan Nieder wrote:

> > The key thing is that the presence of the warning/log still suppress
> > the actual gc for the gc.logExpiry period.
> 
> Ah, now I think I see the source of the misunderstanding.
> 
> When I initially read the docs, I also assumed that
> 
> 	If the file gc.log exists, then git gc --auto won’t run unless
> 	that file is more than gc.logExpiry old.
> 
> meant "git gc --auto" would be skipped (and thus silently succeed) when
> the file is less than gc.logExpiry old.  But that assumption was wrong:
> it errors out!

Right. That's the mysterious "-1" error code, and I agree that part is
confusing.

> This makes scripting difficult, since arbitrary commands that
> incidentally run "git gc --auto" will fail in this state, until gc.log
> expires (but at that point they'll fail again anyway).

I don't think any command should report failure of its _own_ operation
if "gc --auto" failed. And grepping around the source code shows that we
typically ignore it.

> For comparison, in non-daemon mode, there is nothing enforcing the
> kind of ratelimiting you are talking about.  It is an accident of
> history.  If we want this kind of ratelimiting, I'd rather we build it
> deliberately instead of relying on this accident.

What I was trying to say earlier is that we _did_ build this
rate-limiting, and I think it is a bug that the non-daemon case does not
rate-limit (but nobody noticed, because the default is daemonizing).

So the fix is not "rip out the rate-limiting in daemon mode", but rather
"extend it to the non-daemon case".

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:45                     ` Jeff King
@ 2018-07-16 22:03                       ` Jonathan Nieder
  2018-07-16 22:43                         ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 22:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King wrote:

> I don't think any command should report failure of its _own_ operation
> if "gc --auto" failed. And grepping around the source code shows that we
> typically ignore it.

Oh, good point.  In non-daemon mode, we don't let "gc --auto" failure
cause the invoking command to fail, but in daemon mode we do.  That
should be a straightforward fix; patch coming in a moment.

> On Mon, Jul 16, 2018 at 02:40:03PM -0700, Jonathan Nieder wrote:

>> For comparison, in non-daemon mode, there is nothing enforcing the
>> kind of ratelimiting you are talking about.  It is an accident of
>> history.  If we want this kind of ratelimiting, I'd rather we build it
>> deliberately instead of relying on this accident.
>
> What I was trying to say earlier is that we _did_ build this
> rate-limiting, and I think it is a bug that the non-daemon case does not
> rate-limit (but nobody noticed, because the default is daemonizing).
>
> So the fix is not "rip out the rate-limiting in daemon mode", but rather
> "extend it to the non-daemon case".

Can you point me to some discussion about building that rate-limiting?
The commit message for v2.12.2~17^2 (gc: ignore old gc.log files,
2017-02-10) definitely doesn't describe that as its intent.

This is the kind of review that Dscho often complains about, where
someone tries to fix something small but significant to users and gets
told to build something larger that was not their itch instead.

The comments about the "Why is 'git commit' so slow?" experience and
how having the warning helps with that are well taken.  I think we
should be able to find a way to keep the warning in a v2 of this
patch.  But the rest about rate-limiting and putting unreachable
objects in packs etc as a blocker for this are demoralizing, since
they gives the feeling that even if I handle the cases that are
handled today well, it will never be enough for the project unless I
solve the larger problems that were already there.

Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 21:21           ` Jeff King
@ 2018-07-16 22:07             ` Elijah Newren
  2018-07-16 22:55               ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 22:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 2:21 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 02:09:43PM -0700, Elijah Newren wrote:

>> The point of gc is to: expire reflogs, repack referenced objects, and
>> delete loose objects that (1) are no longer referenced and (2) are
>> "old enough".
>>
>> The "old enough" bit is the special part here.  Before the gc, those
>> objects were referenced (only by old reflog entries) and were found in
>> a pack.  git currently writes those objects out to disk and gives them
>> the age of the packfile they are contained in, which I think is the
>> source of the bug.  We have a reference for those objects from the
>> reflogs, know they aren't referenced anywhere else, so those objects
>> to me are the age of those reflog entries: 90 days.  As such, they are
>> "old enough" and should be deleted.
>
> OK, I see what you're saying, but...
>
>> I never got around to fixing it properly, though, because 'git prune'
>> is such a handy workaround that for now.  Having people nuke all their
>> loose objects is a bit risky, but the only other workaround people had
>> was to re-clone (waiting the requisite 20+ minutes for the repo to
>> download) and throw away their old clone.  (Though some people even
>> went that route, IIRC.)
>
> If we were to delete those objects, wouldn't it be exactly the same as
> running "git prune"? Or setting your gc.pruneExpire to "now" or even "5
> minutes"?  Or are you concerned with taking other objects along for the
> ride that weren't part of old reflogs? I think that's a valid concern,

Yes, I was worried about taking other objects along for the ride that
weren't part of old reflogs.

> but it's also an issue for objects which were previously referenced in
> a reflog, but are part of another current operation.

I'm not certain what you're referring to here.

> Also, what do you do if there weren't reflogs in the repo? Or the
> reflogs were deleted (e.g., because branch deletion drops the associated
> reflog entirely)?

Yes, there are issues this rule won't help with, but in my experience
it was a primary (if not sole) actual cause in practice.  (I believe I
even said elsewhere in this thread that I knew there were unreachable
objects for other reasons and they might also become large in number).
At $DAYJOB we've had multiple people including myself hit the "too
many unreachable loose objects" nasty loop issue (some of us multiple
different times), and as far as I can tell, most (perhaps even all) of
them would have been avoided by just "properly" deleting garbage as
per my object-age-is-reflog-age-if-not-otherwise-referenced rule.

>> With big repos, it's easy to get into situations where there are well
>> more than 10000 objects satisfying these conditions.  In fact, it's
>> not all that rare that the repo has far more loose objects after a git
>> gc than before.
>
> Yes, this is definitely a wart and I think is worth addressing.
>
>> I totally agree with your general plan to put unreferenced loose
>> objects into a pack.  However, I don't think these objects should be
>> part of that pack; they should just be deleted instead.
>
> I assume by "these objects" you mean ones which used to be reachable
> from a reflog, but that reflog entry just expired.  I think you'd be
> sacrificing some race-safety in that case.

Is that inherently any more race unsafe than 'git prune
--expire=2.weeks.ago'?  I thought it'd be racy in the same ways, and
thus a tradeoff folks are already accepting (at least implicitly) when
running git-gc.  Since these objects are actually 90 days old rather
than a mere two weeks, it actually seemed more safe to me.  But maybe
I'm overlooking something with the pack->loose transition that makes
it more racy?

> If the objects went into a pack under a race-proof scheme, would that
> actually bother you? Is it the 10,000 objects that's a problem, or is it
> the horrible I/O from exploding them coupled with the annoying auto-gc
> behavior?

Yeah, good point.  It's mostly the annoying auto-gc behavior and the
horrible I/O of future git operations from having lots of loose
objects.  They've already been paying the cost of storing those
objects in packed form for 90 days; a few more won't hurt much.  I'd
be slightly annoyed knowing that we're storing garbage we don't need
to be, but I don't think it's a real issue.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 22:03                       ` Jonathan Nieder
@ 2018-07-16 22:43                         ` Jeff King
  2018-07-16 22:56                           ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 22:43 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 03:03:06PM -0700, Jonathan Nieder wrote:

> Jeff King wrote:
> 
> > I don't think any command should report failure of its _own_ operation
> > if "gc --auto" failed. And grepping around the source code shows that we
> > typically ignore it.
> 
> Oh, good point.  In non-daemon mode, we don't let "gc --auto" failure
> cause the invoking command to fail, but in daemon mode we do.  That
> should be a straightforward fix; patch coming in a moment.

OK, that definitely sounds like a bug. I'm still confused how that could
happen, though, since from the caller's perspective they ignore git-gc's
exit code either way. I guess I'll see in your patch. :)

> > What I was trying to say earlier is that we _did_ build this
> > rate-limiting, and I think it is a bug that the non-daemon case does not
> > rate-limit (but nobody noticed, because the default is daemonizing).
> >
> > So the fix is not "rip out the rate-limiting in daemon mode", but rather
> > "extend it to the non-daemon case".
> 
> Can you point me to some discussion about building that rate-limiting?
> The commit message for v2.12.2~17^2 (gc: ignore old gc.log files,
> 2017-02-10) definitely doesn't describe that as its intent.

I think that commit is a loosening of the rate-limiting (because we'd
refuse to progress for something that was actually time-based). But the
original stopping comes from this discussion, I think:

  https://public-inbox.org/git/xmqqlhijznpm.fsf@gitster.dls.corp.google.com/

(I didn't read the whole thread, but that was what I hit by blaming the
log code and then tracing that back to the list).

> This is the kind of review that Dscho often complains about, where
> someone tries to fix something small but significant to users and gets
> told to build something larger that was not their itch instead.

I don't know how to say more emphatically that I am not asking anyone to
build something larger (like cruft packfiles). I'm just trying to bring
up an impact that the author didn't consider (and that IMHO would be a
regression). Isn't that what reviews are for?

I only mention packfiles because as the discussion turns to "well, all
of these solutions are mediocre hacks" (because they absolutely are),
it's important to realize that there _is_ a right solution, and we even
already know about it. Even if we don't work on it now, knowing that
it's there makes it easier to decide about the various hacks.

> The comments about the "Why is 'git commit' so slow?" experience and
> how having the warning helps with that are well taken.  I think we
> should be able to find a way to keep the warning in a v2 of this
> patch.  But the rest about rate-limiting and putting unreachable
> objects in packs etc as a blocker for this are demoralizing, since
> they gives the feeling that even if I handle the cases that are
> handled today well, it will never be enough for the project unless I
> solve the larger problems that were already there.

I really don't know why we are having such trouble communicating. I've
tried to make it clear several times that the pack thing is not
something I expect your or Jonathan Tan to work on, but obviously I
failed. I'd be _delighted_ if you wanted to work on it, but AFAICT this
patch is purely motivated by:

  1. there's a funny exit code thing going on (0 on the first run, -1 on
     the second)

  2. the warning is not actionable by users

I disagree with the second, and I think we've discussed easy solutions
for the first.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 22:07             ` Elijah Newren
@ 2018-07-16 22:55               ` Jeff King
  2018-07-16 23:06                 ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 22:55 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 03:07:34PM -0700, Elijah Newren wrote:

> > If we were to delete those objects, wouldn't it be exactly the same as
> > running "git prune"? Or setting your gc.pruneExpire to "now" or even "5
> > minutes"?  Or are you concerned with taking other objects along for the
> > ride that weren't part of old reflogs? I think that's a valid concern,
> 
> Yes, I was worried about taking other objects along for the ride that
> weren't part of old reflogs.
> 
> > but it's also an issue for objects which were previously referenced in
> > a reflog, but are part of another current operation.
> 
> I'm not certain what you're referring to here.

I mean that an ongoing operation could refer to a blob that just
recently became unreachable via reflog pruning. And then we would delete
it, leaving the repository corrupt.

One of the protections we have against that is that if I ask to write
blob XYZ and we find that we already have the object, Git will freshen
the mtime of the loose object or pack that contains it, protecting it
from pruning. But with your suggestion, we'd delete it immediately,
regardless of the mtime of the containing pack.

Another way to think of it is this: a reflog mentioning an object does
not mean it is the exclusive user of that object. So when we expire it,
that does not mean that it is OK to delete it immediately; there may be
other users.

> > Also, what do you do if there weren't reflogs in the repo? Or the
> > reflogs were deleted (e.g., because branch deletion drops the associated
> > reflog entirely)?
> 
> Yes, there are issues this rule won't help with, but in my experience
> it was a primary (if not sole) actual cause in practice.  (I believe I
> even said elsewhere in this thread that I knew there were unreachable
> objects for other reasons and they might also become large in number).
> At $DAYJOB we've had multiple people including myself hit the "too
> many unreachable loose objects" nasty loop issue (some of us multiple
> different times), and as far as I can tell, most (perhaps even all) of
> them would have been avoided by just "properly" deleting garbage as
> per my object-age-is-reflog-age-if-not-otherwise-referenced rule.

I agree with you that this is a frequent cause, and probably even the
primary one. But my concern is that your loosening increases the risk of
corruption for other cases.

> > I assume by "these objects" you mean ones which used to be reachable
> > from a reflog, but that reflog entry just expired.  I think you'd be
> > sacrificing some race-safety in that case.
> 
> Is that inherently any more race unsafe than 'git prune
> --expire=2.weeks.ago'?  I thought it'd be racy in the same ways, and
> thus a tradeoff folks are already accepting (at least implicitly) when
> running git-gc.  Since these objects are actually 90 days old rather
> than a mere two weeks, it actually seemed more safe to me.  But maybe
> I'm overlooking something with the pack->loose transition that makes
> it more racy?

I think it's worse in terms of race-safety because you're losing one of
the signals that users of the objects can provide to git-prune to tell
it the object is useful: updating the mtime. So yes, you think of the
objects as "90 days old", but we don't know if there are other users.
Has somebody else been accessing them in the meantime?

To be honest, I'm not sure how meaningful that distinction is in
practice. The current scheme is still racy, even if the windows are
shorter in some circumstances. But it seems like cruft packfiles are
a similar amount of work to your scheme, cover more cases, and are
slightly safer. And possibly even give us a long-term route to true
race-free pruning (via the "garbage pack" mark that Jonathan mentioned).

Assuming you buy into the idea that objects in a cruft-pack are not
hurting anything aside from a little wasted storage for up to 2 weeks
(which it sounds like you're at least partially on board with ;) ).

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 22:43                         ` Jeff King
@ 2018-07-16 22:56                           ` Jonathan Nieder
  2018-07-16 23:26                             ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-16 22:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King wrote:
> On Mon, Jul 16, 2018 at 03:03:06PM -0700, Jonathan Nieder wrote:

>> Oh, good point.  In non-daemon mode, we don't let "gc --auto" failure
>> cause the invoking command to fail, but in daemon mode we do.  That
>> should be a straightforward fix; patch coming in a moment.
>
> OK, that definitely sounds like a bug. I'm still confused how that could
> happen, though, since from the caller's perspective they ignore git-gc's
> exit code either way. I guess I'll see in your patch. :)

Alas, I just misremembered.  What I was remembering is that gc --auto
does

	if (auto_gc && too_many_loose_objects())
		warning(_("There are too many unreachable loose objects; "
			"run 'git prune' to remove them."));
	return 0;

which means that too_many_loose_objects is not an error in undaemonized
mode, while it is in daemonized mode.  But we've already discussed that.

The calling command in the motivating example is Android's "repo" tool:

            bare_git.gc('--auto')

from https://gerrit-review.googlesource.com/c/git-repo/+/10598/.  I
think it's reasonable that it expects a status code of 0 in the normal
case.  So life is less simple than I hoped.

[...]
>> Can you point me to some discussion about building that rate-limiting?
>> The commit message for v2.12.2~17^2 (gc: ignore old gc.log files,
>> 2017-02-10) definitely doesn't describe that as its intent.
>
> I think that commit is a loosening of the rate-limiting (because we'd
> refuse to progress for something that was actually time-based). But the
> original stopping comes from this discussion, I think:
>
>   https://public-inbox.org/git/xmqqlhijznpm.fsf@gitster.dls.corp.google.com/

Interesting!  It looks like that thread anticipated the problems we've
seen here.  Three years without having to have fixed it is a good run,
I suppose.

The discussion of stopping there appears to be primarily about
stopping in the error case, not rate-limiting in the success or
warning case.

Here's a patch for the 'return -1' thing.

-- >8 --
Subject: gc: exit with status 128 on failure

A value of -1 returned from cmd_gc gets propagated to exit(),
resulting in an exit status of 255.  Use die instead for a clearer
error message and a controlled exit.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
 builtin/gc.c  | 36 ++++++++++++++++--------------------
 t/t6500-gc.sh |  2 +-
 2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index ccfb1ceaeb..2bebc52bda 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -438,10 +438,10 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 	return NULL;
 }
 
-static int report_last_gc_error(void)
+static void report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
-	int ret = 0;
+	ssize_t ret;
 	struct stat st;
 	char *gc_log_path = git_pathdup("gc.log");
 
@@ -449,16 +449,17 @@ static int report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;
 
-		ret = error_errno(_("Can't stat %s"), gc_log_path);
-		goto done;
+		die_errno(_("cannot stat '%s'"), gc_log_path);
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
 		goto done;
 
 	ret = strbuf_read_file(&sb, gc_log_path, 0);
+	if (ret < 0)
+		die_errno(_("cannot read '%s'"), gc_log_path);
 	if (ret > 0)
-		ret = error(_("The last gc run reported the following. "
+		die(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
 			       "and remove %s.\n"
 			       "Automatic cleanup will not be performed "
@@ -468,20 +469,18 @@ static int report_last_gc_error(void)
 	strbuf_release(&sb);
 done:
 	free(gc_log_path);
-	return ret;
 }
 
-static int gc_before_repack(void)
+static void gc_before_repack(void)
 {
 	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
+		die(FAILED_RUN, pack_refs_cmd.argv[0]);
 
 	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, reflog.argv[0]);
+		die(FAILED_RUN, reflog.argv[0]);
 
 	pack_refs = 0;
 	prune_reflogs = 0;
-	return 0;
 }
 
 int cmd_gc(int argc, const char **argv, const char *prefix)
@@ -562,13 +561,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
 		if (detach_auto) {
-			if (report_last_gc_error())
-				return -1;
+			report_last_gc_error(); /* dies on error */
 
 			if (lock_repo_for_gc(force, &pid))
 				return 0;
-			if (gc_before_repack())
-				return -1;
+			gc_before_repack(); /* dies on failure */
 			delete_tempfile(&pidfile);
 
 			/*
@@ -608,12 +605,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		atexit(process_log_file_at_exit);
 	}
 
-	if (gc_before_repack())
-		return -1;
+	gc_before_repack();
 
 	if (!repository_format_precious_objects) {
 		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, repack.argv[0]);
+			die(FAILED_RUN, repack.argv[0]);
 
 		if (prune_expire) {
 			argv_array_push(&prune, prune_expire);
@@ -623,18 +619,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 				argv_array_push(&prune,
 						"--exclude-promisor-objects");
 			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
-				return error(FAILED_RUN, prune.argv[0]);
+				die(FAILED_RUN, prune.argv[0]);
 		}
 	}
 
 	if (prune_worktrees_expire) {
 		argv_array_push(&prune_worktrees, prune_worktrees_expire);
 		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, prune_worktrees.argv[0]);
+			die(FAILED_RUN, prune_worktrees.argv[0]);
 	}
 
 	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, rerere.argv[0]);
+		die(FAILED_RUN, rerere.argv[0]);
 
 	report_garbage = report_pack_garbage;
 	reprepare_packed_git(the_repository);
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 818435f04e..c474a94a9f 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -117,7 +117,7 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
 	test_config gc.autodetach true &&
 	echo fleem >.git/gc.log &&
 	test_must_fail git gc --auto 2>err &&
-	test_i18ngrep "^error:" err &&
+	test_i18ngrep "^fatal:" err &&
 	test_config gc.logexpiry 5.days &&
 	test-tool chmtime =-345600 .git/gc.log &&
 	test_must_fail git gc --auto &&
-- 
2.18.0.233.g985f88cf7e


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 22:55               ` Jeff King
@ 2018-07-16 23:06                 ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2018-07-16 23:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List, Jonathan Tan, Jonathan Nieder

On Mon, Jul 16, 2018 at 3:55 PM, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 16, 2018 at 03:07:34PM -0700, Elijah Newren wrote:
>
>> > If we were to delete those objects, wouldn't it be exactly the same as
>> > running "git prune"? Or setting your gc.pruneExpire to "now" or even "5
>> > minutes"?  Or are you concerned with taking other objects along for the
>> > ride that weren't part of old reflogs? I think that's a valid concern,
>>
>> Yes, I was worried about taking other objects along for the ride that
>> weren't part of old reflogs.
>>
>> > but it's also an issue for objects which were previously referenced in
>> > a reflog, but are part of another current operation.
>>
>> I'm not certain what you're referring to here.
>
> I mean that an ongoing operation could refer to a blob that just
> recently became unreachable via reflog pruning. And then we would delete
> it, leaving the repository corrupt.
>
> One of the protections we have against that is that if I ask to write
> blob XYZ and we find that we already have the object, Git will freshen
> the mtime of the loose object or pack that contains it, protecting it
> from pruning. But with your suggestion, we'd delete it immediately,
> regardless of the mtime of the containing pack.
>
> Another way to think of it is this: a reflog mentioning an object does
> not mean it is the exclusive user of that object. So when we expire it,
> that does not mean that it is OK to delete it immediately; there may be
> other users.
>
>> > Also, what do you do if there weren't reflogs in the repo? Or the
>> > reflogs were deleted (e.g., because branch deletion drops the associated
>> > reflog entirely)?
>>
>> Yes, there are issues this rule won't help with, but in my experience
>> it was a primary (if not sole) actual cause in practice.  (I believe I
>> even said elsewhere in this thread that I knew there were unreachable
>> objects for other reasons and they might also become large in number).
>> At $DAYJOB we've had multiple people including myself hit the "too
>> many unreachable loose objects" nasty loop issue (some of us multiple
>> different times), and as far as I can tell, most (perhaps even all) of
>> them would have been avoided by just "properly" deleting garbage as
>> per my object-age-is-reflog-age-if-not-otherwise-referenced rule.
>
> I agree with you that this is a frequent cause, and probably even the
> primary one. But my concern is that your loosening increases the risk of
> corruption for other cases.
>
>> > I assume by "these objects" you mean ones which used to be reachable
>> > from a reflog, but that reflog entry just expired.  I think you'd be
>> > sacrificing some race-safety in that case.
>>
>> Is that inherently any more race unsafe than 'git prune
>> --expire=2.weeks.ago'?  I thought it'd be racy in the same ways, and
>> thus a tradeoff folks are already accepting (at least implicitly) when
>> running git-gc.  Since these objects are actually 90 days old rather
>> than a mere two weeks, it actually seemed more safe to me.  But maybe
>> I'm overlooking something with the pack->loose transition that makes
>> it more racy?
>
> I think it's worse in terms of race-safety because you're losing one of
> the signals that users of the objects can provide to git-prune to tell
> it the object is useful: updating the mtime. So yes, you think of the
> objects as "90 days old", but we don't know if there are other users.
> Has somebody else been accessing them in the meantime?
>
> To be honest, I'm not sure how meaningful that distinction is in
> practice. The current scheme is still racy, even if the windows are
> shorter in some circumstances. But it seems like cruft packfiles are
> a similar amount of work to your scheme, cover more cases, and are
> slightly safer. And possibly even give us a long-term route to true
> race-free pruning (via the "garbage pack" mark that Jonathan mentioned).
>
> Assuming you buy into the idea that objects in a cruft-pack are not
> hurting anything aside from a little wasted storage for up to 2 weeks
> (which it sounds like you're at least partially on board with ;) ).

Ah, I see now.  Thanks (to you and Jonathan) for the thorough
explanations.  I'm totally on board now.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 22:56                           ` Jonathan Nieder
@ 2018-07-16 23:26                             ` Jeff King
  2018-07-17  1:53                               ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-16 23:26 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git

On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote:

> The calling command in the motivating example is Android's "repo" tool:
> 
>             bare_git.gc('--auto')
> 
> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/.  I
> think it's reasonable that it expects a status code of 0 in the normal
> case.  So life is less simple than I hoped.

IMHO it should ignore the return code, since that's what Git does
itself. And at any rate, you'd miss errors from daemonized gc's (at
least until the _next_ one runs and propagates the error code).

> Interesting!  It looks like that thread anticipated the problems we've
> seen here.  Three years without having to have fixed it is a good run,
> I suppose.
> 
> The discussion of stopping there appears to be primarily about
> stopping in the error case, not rate-limiting in the success or
> warning case.

I think the two are essentially the same. The point is that we cannot
make further progress by re-running the gc again, so we should not.

Amusingly, the warning we're talking about is the exact reason that the
logging in that thread was added.  329e6e8794 (gc: save log from
daemonized gc --auto and print it next time, 2015-09-19) says:

  The latest in this set is, as the result of daemonizing, stderr is
  closed and all warnings are lost. This warning at the end of cmd_gc()
  is particularly important because it tells the user how to avoid "gc
  --auto" running repeatedly. Because stderr is closed, the user does
  not know, naturally they complain about 'gc --auto' wasting CPU.

> -- >8 --
> Subject: gc: exit with status 128 on failure
> 
> A value of -1 returned from cmd_gc gets propagated to exit(),
> resulting in an exit status of 255.  Use die instead for a clearer
> error message and a controlled exit.

I agree it's better to not pass -1 to exit(). But I thought the original
motivation was the fact that we were returning non-zero in this case at
all?  (And I thought you and I both agreed with that motivation).

So is this meant to be a preparatory fix until somebody is interested in
fixing that?

> -static int gc_before_repack(void)
> +static void gc_before_repack(void)
>  {
>  	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
> +		die(FAILED_RUN, pack_refs_cmd.argv[0]);
>  
>  	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, reflog.argv[0]);
> +		die(FAILED_RUN, reflog.argv[0]);

Dscho is going to yell at you about replacing error returns with die().  ;)

I wonder if it would be simpler to just reinterpret the "-1" into "128"
in cmd_gc(). I thought we had talked about having run_builtin() do that
at one point, but I don't think we ever did. I dunno. I'm fine with it
either way.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-16 23:26                             ` Jeff King
@ 2018-07-17  1:53                               ` Jonathan Nieder
  2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
                                                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17  1:53 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

Hi,

Jeff King wrote:
> On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote:

>> The calling command in the motivating example is Android's "repo" tool:
>> 
>>             bare_git.gc('--auto')
>> 
>> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/.  I
>> think it's reasonable that it expects a status code of 0 in the normal
>> case.  So life is less simple than I hoped.
>
> IMHO it should ignore the return code, since that's what Git does
> itself. And at any rate, you'd miss errors from daemonized gc's (at
> least until the _next_ one runs and propagates the error code).

That suggests a possible improvement.  If all callers should be
ignoring the exit code, can we make the exit code in daemonize mode
unconditionally zero in this kind of case?

That doesn't really solve the problem:

 1. "gc --auto" would produce noise *every time it runs* until gc.log
    is removed, for example via expiry

 2. "gc --auto" would not do any garbage collection until gc.log is
    removed, for example by expiry

 3. "gc --auto" would still not ratelimit itself, for example when
    there are a large number of loose unreachable objects that is not
    large enough to hit the loose object threshold.

but maybe it's better than the present state.

To solve (1) and (2), we could introduce a gc.warnings file that
behaves like gc.log except that it only gets printed once and then
self-destructs, allowing gc --auto to proceed.  To solve (3), we could
introduce a gc.lastrun file that is touched whenever "gc --auto" runs
successfully and make "gc --auto" a no-op for a while after each run.

-- >8 --
Subject: gc: do not return error for prior errors in daemonized mode

Some build machines started failing to fetch updated source using
"repo sync", with error

  error: The last gc run reported the following. Please correct the root cause
  and remove /build/.repo/projects/tools/git.git/gc.log.
  Automatic cleanup will not be performed until the file is removed.

  warning: There are too many unreachable loose objects; run 'git prune' to remove them.

The cause takes some time to describe.

In v2.0.0-rc0~145^2 (gc: config option for running --auto in
background, 2014-02-08), "git gc --auto" learned to run in the
background instead of blocking the invoking command.  In this mode, it
closed stderr to avoid interleaving output with any subsequent
commands, causing warnings like the above to be swallowed; v2.6.3~24^2
(gc: save log from daemonized gc --auto and print it next time,
2015-09-19) addressed this by storing any diagnostic output in
.git/gc.log and allowing the next "git gc --auto" run to print it.

To avoid wasteful repeated fruitless gcs, when gc.log is present, the
subsequent "gc --auto" would die after print its contents.  Most git
commands, such as "git fetch", ignore the exit status from "git gc
--auto" so all is well at this point: the user gets to see the error
message, and the fetch succeeds, without a wasteful additional attempt
at an automatic gc.

External tools like repo[1], though, do care about the exit status
from "git gc --auto".  In non-daemonized mode, the exit status is
straightforward: if there is an error, it is nonzero, but after a
warning like the above, the status is zero.  The daemonized mode, as a
side effect of the other properties provided, offers a very strange
exit code convention:

 - if no housekeeping was required, the exit status is 0

 - the first real run, after forking into the background, returns exit
   status 0 unconditionally.  The parent process has no way to know
   whether gc will succeed.

 - if there is any diagnostic output in gc.log, subsequent runs return
   a nonzero exit status to indicate that gc was not triggered.

There's nothing for the calling program to act on on the basis of that
error.  Use status 0 consistently instead, to indicate that we decided
not to run a gc (just like if no housekeeping was required).  This
way, repo and similar tools can get the benefit of the same behavior
as tools like "git fetch" that ignore the exit status from gc --auto.

Once the period of time described by gc.pruneExpire elapses, the
unreachable loose objects will be removed by "git gc --auto"
automatically.

[1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/

Reported-by: Andrii Dehtiarov <adehtiarov@google.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
 Documentation/config.txt |  3 ++-
 builtin/gc.c             | 16 +++++++++++-----
 t/t6500-gc.sh            |  6 +++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1cc18a828c..5eaa4aaa7d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
 gc.autoPackLimit and gc.bigPackThreshold should be respected again.
 
 gc.logExpiry::
-	If the file gc.log exists, then `git gc --auto` won't run
+	If the file gc.log exists, then `git gc --auto` will print
+	its content and exit with status zero instead of running
 	unless that file is more than 'gc.logExpiry' old.  Default is
 	"1.day".  See `gc.pruneExpire` for more ways to specify its
 	value.
diff --git a/builtin/gc.c b/builtin/gc.c
index 2bebc52bda..484ab21b8c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 	return NULL;
 }
 
-static void report_last_gc_error(void)
+static int report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
 	ssize_t ret;
@@ -449,7 +449,7 @@ static void report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;
 
-		die_errno(_("cannot stat '%s'"), gc_log_path);
+		return error_errno(_("cannot stat '%s'"), gc_log_path);
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
@@ -457,9 +457,9 @@ static void report_last_gc_error(void)
 
 	ret = strbuf_read_file(&sb, gc_log_path, 0);
 	if (ret < 0)
-		die_errno(_("cannot read '%s'"), gc_log_path);
+		return error_errno(_("cannot read '%s'"), gc_log_path);
 	if (ret > 0)
-		die(_("The last gc run reported the following. "
+		return error(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
 			       "and remove %s.\n"
 			       "Automatic cleanup will not be performed "
@@ -469,6 +469,7 @@ static void report_last_gc_error(void)
 	strbuf_release(&sb);
 done:
 	free(gc_log_path);
+	return 0;
 }
 
 static void gc_before_repack(void)
@@ -561,7 +562,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
 		if (detach_auto) {
-			report_last_gc_error(); /* dies on error */
+			if (report_last_gc_error())
+				/*
+				 * A previous gc failed. We've reported the
+				 * error, so there's nothing left to do.
+				 */
+				return 0;
 
 			if (lock_repo_for_gc(force, &pid))
 				return 0;
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index c474a94a9f..3e62df616c 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
 	test_config gc.autopacklimit 1 &&
 	test_config gc.autodetach true &&
 	echo fleem >.git/gc.log &&
-	test_must_fail git gc --auto 2>err &&
-	test_i18ngrep "^fatal:" err &&
+	git gc --auto 2>err &&
+	test_i18ngrep "^error:" err &&
 	test_config gc.logexpiry 5.days &&
 	test-tool chmtime =-345600 .git/gc.log &&
-	test_must_fail git gc --auto &&
+	git gc --auto &&
 	test_config gc.logexpiry 2.days &&
 	run_and_wait_for_auto_gc &&
 	ls .git/objects/pack/pack-*.pack >packs &&
-- 
2.18.0.233.g985f88cf7e


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode
  2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
  2018-07-16 17:51 ` Jeff King
  2018-07-16 19:15 ` Elijah Newren
@ 2018-07-17  6:51 ` Jonathan Nieder
  2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
                     ` (2 more replies)
  2 siblings, 3 replies; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17  6:51 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy

Hi,

Jonathan Tan wrote:

> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
> cruft", 2007-09-17), the user was warned if there were too many
> unreachable loose objects. This made sense at the time, because gc
> couldn't prune them safely. But subsequently, git prune learned the
> ability to not prune recently created loose objects, making pruning able
> to be done more safely, and gc was made to automatically prune old
> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
> 2.weeks.ago" by default", 2008-03-12).
>
> This makes the warning unactionable by the user, as any loose objects
> left are not deleted yet because of safety, and "git prune" is not a
> command that the user is recommended to run directly anyway.

I agree that given the better alternatives we have now, "git prune" is
not such a great option these days.  E.g. should it say

	struct strbuf now = STRBUF_INIT;
	date_stamp(&now);
	...

	"run 'git gc --prune=%s' to remove them", now.buf);

?

> This was noticed when a daemonized gc run wrote this warning to the log
> file, and returned 0; but a subsequent run merely read the log file, saw
> that it is non-empty and returned -1 (which is inconsistent in that such
> a run should return 0, as it did the first time).

Here's a series to address that motivating case.  Thanks for the
careful analysis and to Peff for the patient explanations.

Sincerely,
Jonathan Nieder (3):
  gc: improve handling of errors reading gc.log
  gc: exit with status 128 on failure
  gc: do not return error for prior errors in daemonized mode

 Documentation/config.txt |  3 ++-
 builtin/gc.c             | 53 ++++++++++++++++++++++++++--------------
 t/t6500-gc.sh            |  6 ++---
 3 files changed, 40 insertions(+), 22 deletions(-)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/3] gc: improve handling of errors reading gc.log
  2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
@ 2018-07-17  6:53   ` Jonathan Nieder
  2018-07-17 18:19     ` Junio C Hamano
  2018-07-17 19:58     ` Jeff King
  2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
  2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
  2 siblings, 2 replies; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17  6:53 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy

A collection of minor error handling fixes:

- use an error message in lower case, following the usual style
- quote filenames in error messages to make them easier to read and to
  decrease translation load by matching other 'stat' error messages
- check for and report errors from 'read', too
- avoid being confused by a gc.log larger than INT_MAX bytes

Noticed by code inspection.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Split out from
https://public-inbox.org/git/20180716225639.GK11513@aiede.svl.corp.google.com/.

 builtin/gc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index ccfb1ceaeb..d69fc4c0b0 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -442,6 +442,7 @@ static int report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
 	int ret = 0;
+	ssize_t len;
 	struct stat st;
 	char *gc_log_path = git_pathdup("gc.log");
 
@@ -449,15 +450,17 @@ static int report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;
 
-		ret = error_errno(_("Can't stat %s"), gc_log_path);
+		ret = error_errno(_("cannot stat '%s'"), gc_log_path);
 		goto done;
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
 		goto done;
 
-	ret = strbuf_read_file(&sb, gc_log_path, 0);
-	if (ret > 0)
+	len = strbuf_read_file(&sb, gc_log_path, 0);
+	if (len < 0)
+		ret = error_errno(_("cannot read '%s'"), gc_log_path);
+	else if (len > 0)
 		ret = error(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
 			       "and remove %s.\n"
-- 
2.18.0.233.g985f88cf7e


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 2/3] gc: exit with status 128 on failure
  2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
  2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
@ 2018-07-17  6:54   ` Jonathan Nieder
  2018-07-17 18:22     ` Junio C Hamano
  2018-07-17 19:59     ` Jeff King
  2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
  2 siblings, 2 replies; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17  6:54 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy

A value of -1 returned from cmd_gc gets propagated to exit(),
resulting in an exit status of 255.  Use die instead for a clearer
error message and a controlled exit.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
As in
https://public-inbox.org/git/20180716225639.GK11513@aiede.svl.corp.google.com/.
The only change is splitting out patch 1/3.

 builtin/gc.c  | 35 ++++++++++++++---------------------
 t/t6500-gc.sh |  2 +-
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index d69fc4c0b0..95c8afd07b 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -438,10 +438,9 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 	return NULL;
 }
 
-static int report_last_gc_error(void)
+static void report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
-	int ret = 0;
 	ssize_t len;
 	struct stat st;
 	char *gc_log_path = git_pathdup("gc.log");
@@ -450,8 +449,7 @@ static int report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;
 
-		ret = error_errno(_("cannot stat '%s'"), gc_log_path);
-		goto done;
+		die_errno(_("cannot stat '%s'"), gc_log_path);
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
@@ -459,9 +457,9 @@ static int report_last_gc_error(void)
 
 	len = strbuf_read_file(&sb, gc_log_path, 0);
 	if (len < 0)
-		ret = error_errno(_("cannot read '%s'"), gc_log_path);
+		die_errno(_("cannot read '%s'"), gc_log_path);
 	else if (len > 0)
-		ret = error(_("The last gc run reported the following. "
+		die(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
 			       "and remove %s.\n"
 			       "Automatic cleanup will not be performed "
@@ -471,20 +469,18 @@ static int report_last_gc_error(void)
 	strbuf_release(&sb);
 done:
 	free(gc_log_path);
-	return ret;
 }
 
-static int gc_before_repack(void)
+static void gc_before_repack(void)
 {
 	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
+		die(FAILED_RUN, pack_refs_cmd.argv[0]);
 
 	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, reflog.argv[0]);
+		die(FAILED_RUN, reflog.argv[0]);
 
 	pack_refs = 0;
 	prune_reflogs = 0;
-	return 0;
 }
 
 int cmd_gc(int argc, const char **argv, const char *prefix)
@@ -565,13 +561,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
 		if (detach_auto) {
-			if (report_last_gc_error())
-				return -1;
+			report_last_gc_error(); /* dies on error */
 
 			if (lock_repo_for_gc(force, &pid))
 				return 0;
-			if (gc_before_repack())
-				return -1;
+			gc_before_repack(); /* dies on failure */
 			delete_tempfile(&pidfile);
 
 			/*
@@ -611,12 +605,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		atexit(process_log_file_at_exit);
 	}
 
-	if (gc_before_repack())
-		return -1;
+	gc_before_repack();
 
 	if (!repository_format_precious_objects) {
 		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, repack.argv[0]);
+			die(FAILED_RUN, repack.argv[0]);
 
 		if (prune_expire) {
 			argv_array_push(&prune, prune_expire);
@@ -626,18 +619,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 				argv_array_push(&prune,
 						"--exclude-promisor-objects");
 			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
-				return error(FAILED_RUN, prune.argv[0]);
+				die(FAILED_RUN, prune.argv[0]);
 		}
 	}
 
 	if (prune_worktrees_expire) {
 		argv_array_push(&prune_worktrees, prune_worktrees_expire);
 		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, prune_worktrees.argv[0]);
+			die(FAILED_RUN, prune_worktrees.argv[0]);
 	}
 
 	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, rerere.argv[0]);
+		die(FAILED_RUN, rerere.argv[0]);
 
 	report_garbage = report_pack_garbage;
 	reprepare_packed_git(the_repository);
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 818435f04e..c474a94a9f 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -117,7 +117,7 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
 	test_config gc.autodetach true &&
 	echo fleem >.git/gc.log &&
 	test_must_fail git gc --auto 2>err &&
-	test_i18ngrep "^error:" err &&
+	test_i18ngrep "^fatal:" err &&
 	test_config gc.logexpiry 5.days &&
 	test-tool chmtime =-345600 .git/gc.log &&
 	test_must_fail git gc --auto &&
-- 
2.18.0.233.g985f88cf7e


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
  2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
  2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
@ 2018-07-17  6:57   ` Jonathan Nieder
  2018-07-17 20:13     ` Jeff King
  2 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17  6:57 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

Some build machines started consistently failing to fetch updated
source using "repo sync", with error

  error: The last gc run reported the following. Please correct the root cause
  and remove /build/.repo/projects/tools/git.git/gc.log.
  Automatic cleanup will not be performed until the file is removed.

  warning: There are too many unreachable loose objects; run 'git prune' to remove them.

The cause takes some time to describe.

In v2.0.0-rc0~145^2 (gc: config option for running --auto in
background, 2014-02-08), "git gc --auto" learned to run in the
background instead of blocking the invoking command.  In this mode, it
closed stderr to avoid interleaving output with any subsequent
commands, causing warnings like the above to be swallowed; v2.6.3~24^2
(gc: save log from daemonized gc --auto and print it next time,
2015-09-19) addressed that by storing any diagnostic output in
.git/gc.log and allowing the next "git gc --auto" run to print it.

To avoid wasteful repeated fruitless gcs, when gc.log is present, the
subsequent "gc --auto" would die after printing its contents.  Most
git commands, such as "git fetch", ignore the exit status from "git gc
--auto" so all is well at this point: the user gets to see the error
message, and the fetch succeeds, without a wasteful additional attempt
at an automatic gc.

External tools like repo[1], though, do care about the exit status
from "git gc --auto".  In non-daemonized mode, the exit status is
straightforward: if there is an error, it is nonzero, but after a
warning like the above, the status is zero.  The daemonized mode, as a
side effect of the other properties provided, offers a very strange
exit code convention:

 - if no housekeeping was required, the exit status is 0

 - the first real run, after forking into the background, returns exit
   status 0 unconditionally.  The parent process has no way to know
   whether gc will succeed.

 - if there is any diagnostic output in gc.log, subsequent runs return
   a nonzero exit status to indicate that gc was not triggered.

There's nothing for the calling program to act on on the basis of that
error.  Use status 0 consistently instead, to indicate that we decided
not to run a gc (just like if no housekeeping was required).  This
way, repo and similar tools can get the benefit of the same behavior
as tools like "git fetch" that ignore the exit status from gc --auto.

Once the period of time described by gc.pruneExpire elapses, the
unreachable loose objects will be removed by "git gc --auto"
automatically.

[1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/

Reported-by: Andrii Dehtiarov <adehtiarov@google.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Changes from v1:
- minor commit message tweaks
- using warning() instead of error()
- avoiding leaking gc_log_path in report_last_gc_error
- only changing the exit status when reporting a prior error, not a
  new one

Thoughts of all kinds welcome, as always.

Sincerely,
Jonathan

 Documentation/config.txt |  3 ++-
 builtin/gc.c             | 33 +++++++++++++++++++++++++++------
 t/t6500-gc.sh            |  6 +++---
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1cc18a828c..5eaa4aaa7d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
 gc.autoPackLimit and gc.bigPackThreshold should be respected again.
 
 gc.logExpiry::
-	If the file gc.log exists, then `git gc --auto` won't run
+	If the file gc.log exists, then `git gc --auto` will print
+	its content and exit with status zero instead of running
 	unless that file is more than 'gc.logExpiry' old.  Default is
 	"1.day".  See `gc.pruneExpire` for more ways to specify its
 	value.
diff --git a/builtin/gc.c b/builtin/gc.c
index 95c8afd07b..ce8a663a01 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -438,9 +438,15 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 	return NULL;
 }
 
-static void report_last_gc_error(void)
+/*
+ * Returns 0 if there was no previous error and gc can proceed, 1 if
+ * gc should not proceed due to an error in the last run. Prints a
+ * message and returns -1 if an error occured while reading gc.log
+ */
+static int report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
+	int ret = 0;
 	ssize_t len;
 	struct stat st;
 	char *gc_log_path = git_pathdup("gc.log");
@@ -449,7 +455,8 @@ static void report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;
 
-		die_errno(_("cannot stat '%s'"), gc_log_path);
+		ret = error_errno(_("cannot stat '%s'"), gc_log_path);
+		goto done;
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
@@ -457,18 +464,26 @@ static void report_last_gc_error(void)
 
 	len = strbuf_read_file(&sb, gc_log_path, 0);
 	if (len < 0)
-		die_errno(_("cannot read '%s'"), gc_log_path);
-	else if (len > 0)
-		die(_("The last gc run reported the following. "
+		ret = error_errno(_("cannot read '%s'"), gc_log_path);
+	else if (len > 0) {
+		/*
+		 * A previous gc failed.  Report the error, and don't
+		 * bother with an automatic gc run since it is likely
+		 * to fail in the same way.
+		 */
+		warning(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
 			       "and remove %s.\n"
 			       "Automatic cleanup will not be performed "
 			       "until the file is removed.\n\n"
 			       "%s"),
 			    gc_log_path, sb.buf);
+		ret = 1;
+	}
 	strbuf_release(&sb);
 done:
 	free(gc_log_path);
+	return ret;
 }
 
 static void gc_before_repack(void)
@@ -561,7 +576,13 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
 		if (detach_auto) {
-			report_last_gc_error(); /* dies on error */
+			int ret = report_last_gc_error();
+			if (ret < 0)
+				/* an I/O error occured, already reported */
+				exit(128);
+			if (ret == 1)
+				/* Last gc --auto failed. Skip this one. */
+				return 0;
 
 			if (lock_repo_for_gc(force, &pid))
 				return 0;
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index c474a94a9f..a222efdbe1 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
 	test_config gc.autopacklimit 1 &&
 	test_config gc.autodetach true &&
 	echo fleem >.git/gc.log &&
-	test_must_fail git gc --auto 2>err &&
-	test_i18ngrep "^fatal:" err &&
+	git gc --auto 2>err &&
+	test_i18ngrep "^warning:" err &&
 	test_config gc.logexpiry 5.days &&
 	test-tool chmtime =-345600 .git/gc.log &&
-	test_must_fail git gc --auto &&
+	git gc --auto &&
 	test_config gc.logexpiry 2.days &&
 	run_and_wait_for_auto_gc &&
 	ls .git/objects/pack/pack-*.pack >packs &&
-- 
2.18.0.233.g985f88cf7e


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17  1:53                               ` Jonathan Nieder
@ 2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
  2018-07-17 14:03                                   ` Jonathan Nieder
  2018-07-17 20:27                                   ` Jeff King
  2018-07-17 15:59                                 ` Duy Nguyen
  2018-07-17 18:09                                 ` Junio C Hamano
  2 siblings, 2 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-17  8:59 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff King, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy


On Tue, Jul 17 2018, Jonathan Nieder wrote:

> Hi,
>
> Jeff King wrote:
>> On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote:
>
>>> The calling command in the motivating example is Android's "repo" tool:
>>>
>>>             bare_git.gc('--auto')
>>>
>>> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/.  I
>>> think it's reasonable that it expects a status code of 0 in the normal
>>> case.  So life is less simple than I hoped.
>>
>> IMHO it should ignore the return code, since that's what Git does
>> itself. And at any rate, you'd miss errors from daemonized gc's (at
>> least until the _next_ one runs and propagates the error code).

I've read this whole thread, and as Peff noted I've barked up this
particular tree before[1] without coming up with a solution myself.

So please don't take the following as critique of any way of moving
forward, I'm just trying to poke holes in what you're doing to make sure
we don't have regressions to the currently (sucky) logic.

> That suggests a possible improvement.  If all callers should be
> ignoring the exit code, can we make the exit code in daemonize mode
> unconditionally zero in this kind of case?

That doesn't make sense to me. Just because git itself is happily
ignoring the exit code I don't think that should mean there shouldn't be
a meaningful exit code. Why don't you just ignore the exit code in the
repo tool as well?

Now e.g. automation I have to see if git-gc ---auto is having issues
can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet
of servers, but will need to start caring if stderr was emitted to.

I don't care if we e.g. have a 'git gc --auto --exit-code' similar to
what git-diff does, but it doesn't make sense to me that we *know* we
can't background the gc due to a previous error and then always return
0. Having to parse STDERR to see if it *really* failed is un-unixy,
let's use exit codes. That's what they're for.

> That doesn't really solve the problem:
>
>  1. "gc --auto" would produce noise *every time it runs* until gc.log
>     is removed, for example via expiry
>
>  2. "gc --auto" would not do any garbage collection until gc.log is
>     removed, for example by expiry
>
>  3. "gc --auto" would still not ratelimit itself, for example when
>     there are a large number of loose unreachable objects that is not
>     large enough to hit the loose object threshold.
>
> but maybe it's better than the present state.
>
> To solve (1) and (2), we could introduce a gc.warnings file that
> behaves like gc.log except that it only gets printed once and then
> self-destructs, allowing gc --auto to proceed.

I think you're conflating two things here in a way that (if I'm reading
this correctly) produces a pretty bad regression for users.

 a) If we have something in the gc.log we keep yelling at users until we
    reach the gc.logExpiry, this was the subject of my thread back in
    January. This sucks, and should be fixed somehow.

    Maybe we should only emit the warning once, e.g. creating a
    .git/gc.log.wasemitted marker and as long as its ctime is later than
    the mtime on .git/gc.log we don't emit it again).

    But that also creates the UX problem that it's easy to miss this
    message in the middle of some huge "fetch" output. Perhaps we should
    just start emitting this as part of "git status" or something (or
    solve the root cause, as Peff notes...).

 b) We *also* use this presence of a gc.log as a marker for "we failed
    too recently, let's not try again until after a day".

The semantics of b) are very useful, and just because they're tied up
with a) right now, let's not throw out b) just because we're trying to
solve a).

We have dev machines with limited I/O & CPU/memory that occasionally
break the gc.auto limit, I really don't want those churning a background
"git gc" on every fetch/commit etc. until we're finally below the
gc.auto limit (possibly days later), which would be a side-effect of
printing the .git/gc.log once and then removing it.

> To solve (3), we could
> introduce a gc.lastrun file that is touched whenever "gc --auto" runs
> successfully and make "gc --auto" a no-op for a while after each run.

This would work around my concern with b) above in most cases by
introducing a purely time-based rate limit, but I'm uneasy about this
change in git-gc semantics.

Right now one thing we do right is we always try to look at the actual
state of the repo with too_many_packs() and too_many_loose_objects().

We don't assume the state of your repo hasn't drastically changed
recently. That means that we do the right thing and gc when the repo
needs it, not just because we GC'd recently enough.

With these proposed semantics we'd skip a needed GC (potentially for
days, depending on the default) just because we recently ran one.

In many environments, such as on busy servers, we could have tens of
thousands of packs by this point, since this facility would (presumably)
bypass both gc.autoPackLimit and gc.auto in favor of only running gc at
a maximum of every N minutes, similarly in a local checkout I could have
a crapload of loose objects because I ran a big rebase or a
filter-branch on one of my branches.

1. https://public-inbox.org/git/87inc89j38.fsf@evledraar.gmail.com/

> -- >8 --
> Subject: gc: do not return error for prior errors in daemonized mode
>
> Some build machines started failing to fetch updated source using
> "repo sync", with error
>
>   error: The last gc run reported the following. Please correct the root cause
>   and remove /build/.repo/projects/tools/git.git/gc.log.
>   Automatic cleanup will not be performed until the file is removed.
>
>   warning: There are too many unreachable loose objects; run 'git prune' to remove them.
>
> The cause takes some time to describe.
>
> In v2.0.0-rc0~145^2 (gc: config option for running --auto in
> background, 2014-02-08), "git gc --auto" learned to run in the
> background instead of blocking the invoking command.  In this mode, it
> closed stderr to avoid interleaving output with any subsequent
> commands, causing warnings like the above to be swallowed; v2.6.3~24^2
> (gc: save log from daemonized gc --auto and print it next time,
> 2015-09-19) addressed this by storing any diagnostic output in
> .git/gc.log and allowing the next "git gc --auto" run to print it.
>
> To avoid wasteful repeated fruitless gcs, when gc.log is present, the
> subsequent "gc --auto" would die after print its contents.  Most git
> commands, such as "git fetch", ignore the exit status from "git gc
> --auto" so all is well at this point: the user gets to see the error
> message, and the fetch succeeds, without a wasteful additional attempt
> at an automatic gc.
>
> External tools like repo[1], though, do care about the exit status
> from "git gc --auto".  In non-daemonized mode, the exit status is
> straightforward: if there is an error, it is nonzero, but after a
> warning like the above, the status is zero.  The daemonized mode, as a
> side effect of the other properties provided, offers a very strange
> exit code convention:
>
>  - if no housekeeping was required, the exit status is 0
>
>  - the first real run, after forking into the background, returns exit
>    status 0 unconditionally.  The parent process has no way to know
>    whether gc will succeed.
>
>  - if there is any diagnostic output in gc.log, subsequent runs return
>    a nonzero exit status to indicate that gc was not triggered.
>
> There's nothing for the calling program to act on on the basis of that
> error.  Use status 0 consistently instead, to indicate that we decided
> not to run a gc (just like if no housekeeping was required).  This
> way, repo and similar tools can get the benefit of the same behavior
> as tools like "git fetch" that ignore the exit status from gc --auto.
>
> Once the period of time described by gc.pruneExpire elapses, the
> unreachable loose objects will be removed by "git gc --auto"
> automatically.
>
> [1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/
>
> Reported-by: Andrii Dehtiarov <adehtiarov@google.com>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---
>  Documentation/config.txt |  3 ++-
>  builtin/gc.c             | 16 +++++++++++-----
>  t/t6500-gc.sh            |  6 +++---
>  3 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 1cc18a828c..5eaa4aaa7d 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
>  gc.autoPackLimit and gc.bigPackThreshold should be respected again.
>
>  gc.logExpiry::
> -	If the file gc.log exists, then `git gc --auto` won't run
> +	If the file gc.log exists, then `git gc --auto` will print
> +	its content and exit with status zero instead of running
>  	unless that file is more than 'gc.logExpiry' old.  Default is
>  	"1.day".  See `gc.pruneExpire` for more ways to specify its
>  	value.
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 2bebc52bda..484ab21b8c 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
>  	return NULL;
>  }
>
> -static void report_last_gc_error(void)
> +static int report_last_gc_error(void)
>  {
>  	struct strbuf sb = STRBUF_INIT;
>  	ssize_t ret;
> @@ -449,7 +449,7 @@ static void report_last_gc_error(void)
>  		if (errno == ENOENT)
>  			goto done;
>
> -		die_errno(_("cannot stat '%s'"), gc_log_path);
> +		return error_errno(_("cannot stat '%s'"), gc_log_path);
>  	}
>
>  	if (st.st_mtime < gc_log_expire_time)
> @@ -457,9 +457,9 @@ static void report_last_gc_error(void)
>
>  	ret = strbuf_read_file(&sb, gc_log_path, 0);
>  	if (ret < 0)
> -		die_errno(_("cannot read '%s'"), gc_log_path);
> +		return error_errno(_("cannot read '%s'"), gc_log_path);
>  	if (ret > 0)
> -		die(_("The last gc run reported the following. "
> +		return error(_("The last gc run reported the following. "
>  			       "Please correct the root cause\n"
>  			       "and remove %s.\n"
>  			       "Automatic cleanup will not be performed "
> @@ -469,6 +469,7 @@ static void report_last_gc_error(void)
>  	strbuf_release(&sb);
>  done:
>  	free(gc_log_path);
> +	return 0;
>  }
>
>  static void gc_before_repack(void)
> @@ -561,7 +562,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
>  		}
>  		if (detach_auto) {
> -			report_last_gc_error(); /* dies on error */
> +			if (report_last_gc_error())
> +				/*
> +				 * A previous gc failed. We've reported the
> +				 * error, so there's nothing left to do.
> +				 */
> +				return 0;
>
>  			if (lock_repo_for_gc(force, &pid))
>  				return 0;
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index c474a94a9f..3e62df616c 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>  	test_config gc.autopacklimit 1 &&
>  	test_config gc.autodetach true &&
>  	echo fleem >.git/gc.log &&
> -	test_must_fail git gc --auto 2>err &&
> -	test_i18ngrep "^fatal:" err &&
> +	git gc --auto 2>err &&
> +	test_i18ngrep "^error:" err &&
>  	test_config gc.logexpiry 5.days &&
>  	test-tool chmtime =-345600 .git/gc.log &&
> -	test_must_fail git gc --auto &&
> +	git gc --auto &&
>  	test_config gc.logexpiry 2.days &&
>  	run_and_wait_for_auto_gc &&
>  	ls .git/objects/pack/pack-*.pack >packs &&

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
@ 2018-07-17 14:03                                   ` Jonathan Nieder
  2018-07-17 15:24                                     ` Ævar Arnfjörð Bjarmason
  2018-07-17 20:27                                   ` Jeff King
  1 sibling, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-07-17 14:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

Hi Ævar,

Ævar Arnfjörð Bjarmason wrote:
> On Tue, Jul 17 2018, Jonathan Nieder wrote:

>> That suggests a possible improvement.  If all callers should be
>> ignoring the exit code, can we make the exit code in daemonize mode
>> unconditionally zero in this kind of case?
>
> That doesn't make sense to me. Just because git itself is happily
> ignoring the exit code I don't think that should mean there shouldn't be
> a meaningful exit code. Why don't you just ignore the exit code in the
> repo tool as well?

I don't maintain the repo tool.  I am just trying to improve git to
make some users' lives better.

repo worked fine for years, until gc's daemonized mode broke it.  I
don't think it makes sense for us to declare that it has been holding
git gc wrong for all that time before then.

> Now e.g. automation I have to see if git-gc ---auto is having issues
> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet
> of servers, but will need to start caring if stderr was emitted to.

Thanks for bringing this up.  The thing is, the current exit code is
not useful for that purpose.  It doesn't report a failure until the
*next* `git gc --auto` run, when it is too late to be useful.

See the commit message on the proposed patch
https://public-inbox.org/git/20180717065740.GD177907@aiede.svl.corp.google.com/
for more on that subject.

> I don't care if we e.g. have a 'git gc --auto --exit-code' similar to
> what git-diff does, but it doesn't make sense to me that we *know* we
> can't background the gc due to a previous error and then always return
> 0. Having to parse STDERR to see if it *really* failed is un-unixy,
> let's use exit codes. That's what they're for.

Do you know of anyone trying to use the exit code from gc --auto in
this way?

It sounds to me like for what you're proposing, a separate

	git gc --auto --last-run-failed

command that a tool could use to check the outcome from the previous
gc --auto run without triggering a new one would be best.

[...]
> I think you're conflating two things here in a way that (if I'm reading
> this correctly) produces a pretty bad regression for users.

The patch doesn't touch the "ratelimiting" behavior at all, so I'm not
sure what I'm doing to regress users.

[...]
>> To solve (3), we could
>> introduce a gc.lastrun file that is touched whenever "gc --auto" runs
>> successfully and make "gc --auto" a no-op for a while after each run.
>
> This would work around my concern with b) above in most cases by
> introducing a purely time-based rate limit, but I'm uneasy about this
> change in git-gc semantics.
[..]
> With these proposed semantics we'd skip a needed GC (potentially for
> days, depending on the default) just because we recently ran one.
>
> In many environments, such as on busy servers, we could have tens of
> thousands of packs by this point, since this facility would (presumably)
> bypass both gc.autoPackLimit and gc.auto in favor of only running gc at
> a maximum of every N minutes, similarly in a local checkout I could have
> a crapload of loose objects because I ran a big rebase or a
> filter-branch on one of my branches.

I think we all agree that getting rid of the hack that 'explodes'
unreachable objects into loose objects is the best way forward, long
term.

Even in that future, some ratelimiting may be useful, though.  I also
suspect that we're never going to achieve a perfect set of defaults
that works well for both humans and servers, though we can try.

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17 14:03                                   ` Jonathan Nieder
@ 2018-07-17 15:24                                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-17 15:24 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff King, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

On Tue, Jul 17 2018, Jonathan Nieder wrote:

> Hi Ævar,
>
> Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Jul 17 2018, Jonathan Nieder wrote:
>
>>> That suggests a possible improvement.  If all callers should be
>>> ignoring the exit code, can we make the exit code in daemonize mode
>>> unconditionally zero in this kind of case?
>>
>> That doesn't make sense to me. Just because git itself is happily
>> ignoring the exit code I don't think that should mean there shouldn't be
>> a meaningful exit code. Why don't you just ignore the exit code in the
>> repo tool as well?
>
> I don't maintain the repo tool.  I am just trying to improve git to
> make some users' lives better.
>
> repo worked fine for years, until gc's daemonized mode broke it.  I
> don't think it makes sense for us to declare that it has been holding
> git gc wrong for all that time before then.

Before the sequence of commits noted at the bottom of my
https://public-inbox.org/git/87inc89j38.fsf@evledraar.gmail.com/ plus
a831c06a2b ("gc: ignore old gc.log files", 2017-02-10) we wouldn't do
that, that's true.

We'd just happily run GC again pointlessly even though it wasn't going
to accomplish anything, in cases where you really did have ~>6700 loose
objects that weren't going to be pruned.

I don't think it makes sense to revert those patches and go back to the
much more naïve (and user waiting there twiddling his thumbs while it
runs) method, but I also don't think making no distinction between the
following states:

 1. gc --auto has nothing to do
 2. gc --auto has something to do, will fork and try to do it
 3. gc --auto has something to do, but notices that gc has been failing
    before and can't do anything now.

I think #3 should exit with non-zero.

Yes, before this whole backgrounding etc. we wouldn't have exited with
non-zero either, we'd just print a thing to STDERR.

But with backgrounding we implicitly introduced a new state, which is we
won't do *any* maintenance at all, including consolidating packfiles
(very important for servers), so I think it makes sense to exit with
non-zero since gc can't run at all.

>> Now e.g. automation I have to see if git-gc ---auto is having issues
>> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet
>> of servers, but will need to start caring if stderr was emitted to.
>
> Thanks for bringing this up.  The thing is, the current exit code is
> not useful for that purpose.  It doesn't report a failure until the
> *next* `git gc --auto` run, when it is too late to be useful.
>
> See the commit message on the proposed patch
> https://public-inbox.org/git/20180717065740.GD177907@aiede.svl.corp.google.com/
> for more on that subject.

Right. I know. What I mean is now I can (and do) use it to run 'git gc
--auto' across our server fleet and see whether I have any of #3, or
whether it's all #1 or #2. If there's nothing to do in #1 that's fine,
and it just so happens that I'll run gc due to #2 that's also fine, but
I'd like to see if gc really is stuck.

This of course relies on them having other users / scripts doing normal
git commands which would trigger previous 'git gc --auto' runs.

>> I don't care if we e.g. have a 'git gc --auto --exit-code' similar to
>> what git-diff does, but it doesn't make sense to me that we *know* we
>> can't background the gc due to a previous error and then always return
>> 0. Having to parse STDERR to see if it *really* failed is un-unixy,
>> let's use exit codes. That's what they're for.
>
> Do you know of anyone trying to use the exit code from gc --auto in
> this way?
>
> It sounds to me like for what you're proposing, a separate
>
> 	git gc --auto --last-run-failed
>
> command that a tool could use to check the outcome from the previous
> gc --auto run without triggering a new one would be best.

Yeah this is admittedly a bit of a poweruser thing I'm doing, so I don't
mind if it's hidden behind something like that in principle, but it
seems wrong to exit with zero in this (#3) case:

    $ git gc --auto; echo $?
    Auto packing the repository in background for optimum performance.
    See "git help gc" for manual housekeeping.
    error: The last gc run reported the following. Please correct the root cause
    and remove .git/gc.log.
    Automatic cleanup will not be performed until the file is removed.

    warning: There are too many unreachable loose objects; run 'git prune' to remove them.

    255

As a previous (good) patch in this series notes that shouldn't be 255,
let's fix that, but I don't see how emitting an "error" and "warning"
saying we're broken and can't gc at all here in conjunction with exiting
with zero makes sense.

> [...]
>> I think you're conflating two things here in a way that (if I'm reading
>> this correctly) produces a pretty bad regression for users.
>
> The patch doesn't touch the "ratelimiting" behavior at all, so I'm not
> sure what I'm doing to regress users.

Sorry about being unclear again, this is not a comment on this patch,
but your paragraph beginning with "To solve[...]", i.e. "this patch
doesn't do X, but in the future maybe we'll...". I'm pointing out
potential caveats with that, I realize you have not posted an
implementation of that yet.

> [...]
>>> To solve (3), we could
>>> introduce a gc.lastrun file that is touched whenever "gc --auto" runs
>>> successfully and make "gc --auto" a no-op for a while after each run.
>>
>> This would work around my concern with b) above in most cases by
>> introducing a purely time-based rate limit, but I'm uneasy about this
>> change in git-gc semantics.
> [..]
>> With these proposed semantics we'd skip a needed GC (potentially for
>> days, depending on the default) just because we recently ran one.
>>
>> In many environments, such as on busy servers, we could have tens of
>> thousands of packs by this point, since this facility would (presumably)
>> bypass both gc.autoPackLimit and gc.auto in favor of only running gc at
>> a maximum of every N minutes, similarly in a local checkout I could have
>> a crapload of loose objects because I ran a big rebase or a
>> filter-branch on one of my branches.
>
> I think we all agree that getting rid of the hack that 'explodes'
> unreachable objects into loose objects is the best way forward, long
> term.
>
> Even in that future, some ratelimiting may be useful, though.  I also
> suspect that we're never going to achieve a perfect set of defaults
> that works well for both humans and servers, though we can try.

Having read this whole thing over again, and with some time to reflect
on it, I think the best way forward for now, in lieu of the bigger
solution of consolidating loose objects into a pack, is to split up this
special case of warning due to too many loose objects at the end, and
any other errors we may print during GC.

As noted above if our policy for gc.auto is legitimately exceeded, we
deal with that badly by stopping all gc, includin gc that would just run
due to too_many_packs().

Instead we should note that we had an error due to too many loose
objects last time, but still try to run if the too_many_packs()
condition is satisfied.

Now we're throwing the baby out with the bathwater and not consolidating
packs (for a default period of 2 weeks!) just because some operation
produced a lot of loose objects, and not writing bitmaps, commit graph
etc.

It would also be nice to expose via an exit code "do we need to gc?" and
"is gc failing here?" separately from the (currently working) "run gc or
report last error + code" mode we have with "gc --auto".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17  1:53                               ` Jonathan Nieder
  2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
@ 2018-07-17 15:59                                 ` Duy Nguyen
  2018-07-17 18:09                                 ` Junio C Hamano
  2 siblings, 0 replies; 57+ messages in thread
From: Duy Nguyen @ 2018-07-17 15:59 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, Jonathan Tan, Git Mailing List, adehtiarov

On Tue, Jul 17, 2018 at 3:53 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
> Subject: gc: do not return error for prior errors in daemonized mode
>
> Some build machines started failing to fetch updated source using
> "repo sync", with error
>
>   error: The last gc run reported the following. Please correct the root cause
>   and remove /build/.repo/projects/tools/git.git/gc.log.
>   Automatic cleanup will not be performed until the file is removed.
>
>   warning: There are too many unreachable loose objects; run 'git prune' to remove them.
>
> The cause takes some time to describe.
>
> In v2.0.0-rc0~145^2 (gc: config option for running --auto in
> background, 2014-02-08), "git gc --auto" learned to run in the
> background instead of blocking the invoking command.  In this mode, it
> closed stderr to avoid interleaving output with any subsequent
> commands, causing warnings like the above to be swallowed; v2.6.3~24^2
> (gc: save log from daemonized gc --auto and print it next time,
> 2015-09-19) addressed this by storing any diagnostic output in
> .git/gc.log and allowing the next "git gc --auto" run to print it.
>
> To avoid wasteful repeated fruitless gcs, when gc.log is present, the
> subsequent "gc --auto" would die after print its contents.  Most git
> commands, such as "git fetch", ignore the exit status from "git gc
> --auto" so all is well at this point: the user gets to see the error
> message, and the fetch succeeds, without a wasteful additional attempt
> at an automatic gc.
>
> External tools like repo[1], though, do care about the exit status
> from "git gc --auto".  In non-daemonized mode, the exit status is
> straightforward: if there is an error, it is nonzero, but after a
> warning like the above, the status is zero.  The daemonized mode, as a
> side effect of the other properties provided, offers a very strange
> exit code convention:
>
>  - if no housekeeping was required, the exit status is 0
>
>  - the first real run, after forking into the background, returns exit
>    status 0 unconditionally.  The parent process has no way to know
>    whether gc will succeed.
>
>  - if there is any diagnostic output in gc.log, subsequent runs return
>    a nonzero exit status to indicate that gc was not triggered.
>
> There's nothing for the calling program to act on on the basis of that
> error.  Use status 0 consistently instead, to indicate that we decided
> not to run a gc (just like if no housekeeping was required).  This
> way, repo and similar tools can get the benefit of the same behavior
> as tools like "git fetch" that ignore the exit status from gc --auto.

This background gc thing is pretty much designed for interactive use.
When it's scripted, you probably should avoid it. Then you can fork a
new process by yourself and have much better control if you still want
"background" gc. So an alternative here is to turn on or off
gc.autodetach based on interactiveness (i.e. having tty). We can add a
new (and default) value "auto" to gc.autodetach for this purpose.

In automated scripts, it will always run in non-damonized mode. You
will get non-zero exit code when real errors occur. You don't worry
about hanging processes. Rate limit is also thrown out in this mode if
I'm not mistaken, but it could be added back and more tailored for
server needs.

> Once the period of time described by gc.pruneExpire elapses, the
> unreachable loose objects will be removed by "git gc --auto"
> automatically.
>
> [1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/
>
> Reported-by: Andrii Dehtiarov <adehtiarov@google.com>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---
>  Documentation/config.txt |  3 ++-
>  builtin/gc.c             | 16 +++++++++++-----
>  t/t6500-gc.sh            |  6 +++---
>  3 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 1cc18a828c..5eaa4aaa7d 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
>  gc.autoPackLimit and gc.bigPackThreshold should be respected again.
>
>  gc.logExpiry::
> -       If the file gc.log exists, then `git gc --auto` won't run
> +       If the file gc.log exists, then `git gc --auto` will print
> +       its content and exit with status zero instead of running
>         unless that file is more than 'gc.logExpiry' old.  Default is
>         "1.day".  See `gc.pruneExpire` for more ways to specify its
>         value.
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 2bebc52bda..484ab21b8c 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
>         return NULL;
>  }
>
> -static void report_last_gc_error(void)
> +static int report_last_gc_error(void)
>  {
>         struct strbuf sb = STRBUF_INIT;
>         ssize_t ret;
> @@ -449,7 +449,7 @@ static void report_last_gc_error(void)
>                 if (errno == ENOENT)
>                         goto done;
>
> -               die_errno(_("cannot stat '%s'"), gc_log_path);
> +               return error_errno(_("cannot stat '%s'"), gc_log_path);
>         }
>
>         if (st.st_mtime < gc_log_expire_time)
> @@ -457,9 +457,9 @@ static void report_last_gc_error(void)
>
>         ret = strbuf_read_file(&sb, gc_log_path, 0);
>         if (ret < 0)
> -               die_errno(_("cannot read '%s'"), gc_log_path);
> +               return error_errno(_("cannot read '%s'"), gc_log_path);
>         if (ret > 0)
> -               die(_("The last gc run reported the following. "
> +               return error(_("The last gc run reported the following. "
>                                "Please correct the root cause\n"
>                                "and remove %s.\n"
>                                "Automatic cleanup will not be performed "
> @@ -469,6 +469,7 @@ static void report_last_gc_error(void)
>         strbuf_release(&sb);
>  done:
>         free(gc_log_path);
> +       return 0;
>  }
>
>  static void gc_before_repack(void)
> @@ -561,7 +562,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>                         fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
>                 }
>                 if (detach_auto) {
> -                       report_last_gc_error(); /* dies on error */
> +                       if (report_last_gc_error())
> +                               /*
> +                                * A previous gc failed. We've reported the
> +                                * error, so there's nothing left to do.
> +                                */
> +                               return 0;
>
>                         if (lock_repo_for_gc(force, &pid))
>                                 return 0;
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index c474a94a9f..3e62df616c 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>         test_config gc.autopacklimit 1 &&
>         test_config gc.autodetach true &&
>         echo fleem >.git/gc.log &&
> -       test_must_fail git gc --auto 2>err &&
> -       test_i18ngrep "^fatal:" err &&
> +       git gc --auto 2>err &&
> +       test_i18ngrep "^error:" err &&
>         test_config gc.logexpiry 5.days &&
>         test-tool chmtime =-345600 .git/gc.log &&
> -       test_must_fail git gc --auto &&
> +       git gc --auto &&
>         test_config gc.logexpiry 2.days &&
>         run_and_wait_for_auto_gc &&
>         ls .git/objects/pack/pack-*.pack >packs &&
> --
> 2.18.0.233.g985f88cf7e
>


-- 
Duy

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17  1:53                               ` Jonathan Nieder
  2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
  2018-07-17 15:59                                 ` Duy Nguyen
@ 2018-07-17 18:09                                 ` Junio C Hamano
  2 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2018-07-17 18:09 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff King, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

Jonathan Nieder <jrnieder@gmail.com> writes:

> That doesn't really solve the problem:
>
>  1. "gc --auto" would produce noise *every time it runs* until gc.log
>     is removed, for example via expiry
>
>  2. "gc --auto" would not do any garbage collection until gc.log is
>     removed, for example by expiry
>
>  3. "gc --auto" would still not ratelimit itself, for example when
>     there are a large number of loose unreachable objects that is not
>     large enough to hit the loose object threshold.
>
> but maybe it's better than the present state.
>
> To solve (1) and (2), we could introduce a gc.warnings file that
> behaves like gc.log except that it only gets printed once and then
> self-destructs, allowing gc --auto to proceed.  

That makes it not rate-limit, no?

> To solve (3), we could
> introduce a gc.lastrun file that is touched whenever "gc --auto" runs
> successfully and make "gc --auto" a no-op for a while after each run.

Does absolute time-interval make sense here?  Some repositories may
be busily churning new objects and for them 5-minute interval may be
too infrequent, while other repositories may be relatively slow and
once a day may be sufficient for them.  I wonder if we can somehow
auto-tune this.

> To avoid wasteful repeated fruitless gcs, when gc.log is present, the
> subsequent "gc --auto" would die after print its contents.  Most git

s/print/&ing/

> commands, such as "git fetch", ignore the exit status from "git gc
> --auto" so all is well at this point: the user gets to see the error
> message, and the fetch succeeds, without a wasteful additional attempt
> at an automatic gc.

The above, by saying "Most git commands", leaves readers want to
know "then what are minority git commands that do not correctly do
so?"  If you are not going to answer that question, it probably is
better not to even say "Most".

> External tools like repo[1], though, do care about the exit status
> from "git gc --auto".  In non-daemonized mode, the exit status is
> straightforward: if there is an error, it is nonzero, but after a
> warning like the above, the status is zero.  The daemonized mode, as a
> side effect of the other properties provided, offers a very strange
> exit code convention:
>
>  - if no housekeeping was required, the exit status is 0

OK.

>  - the first real run, after forking into the background, returns exit
>    status 0 unconditionally.  The parent process has no way to know
>    whether gc will succeed.

Understandable; allowing the parent to exit before we know the
outcome of the gc is the whole point of backgrounding.

>  - if there is any diagnostic output in gc.log, subsequent runs return
>    a nonzero exit status to indicate that gc was not triggered.

This is unfortunate.

> There's nothing for the calling program to act on on the basis of that
> error.  Use status 0 consistently instead, to indicate that we decided
> not to run a gc (just like if no housekeeping was required).  

s/.$/ in the last case./  And I fully agree with the reasoning.

> This
> way, repo and similar tools can get the benefit of the same behavior
> as tools like "git fetch" that ignore the exit status from gc --auto.
>
> Once the period of time described by gc.pruneExpire elapses, the
> unreachable loose objects will be removed by "git gc --auto"
> automatically.
>
> [1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/
>
> Reported-by: Andrii Dehtiarov <adehtiarov@google.com>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---
>  Documentation/config.txt |  3 ++-
>  builtin/gc.c             | 16 +++++++++++-----
>  t/t6500-gc.sh            |  6 +++---
>  3 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 1cc18a828c..5eaa4aaa7d 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
>  gc.autoPackLimit and gc.bigPackThreshold should be respected again.
>  
>  gc.logExpiry::
> -	If the file gc.log exists, then `git gc --auto` won't run
> +	If the file gc.log exists, then `git gc --auto` will print
> +	its content and exit with status zero instead of running
>  	unless that file is more than 'gc.logExpiry' old.  Default is
>  	"1.day".  See `gc.pruneExpire` for more ways to specify its
>  	value.
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 2bebc52bda..484ab21b8c 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
>  	return NULL;
>  }
>  
> -static void report_last_gc_error(void)
> +static int report_last_gc_error(void)
>  {
>  	struct strbuf sb = STRBUF_INIT;
>  	ssize_t ret;
> @@ -449,7 +449,7 @@ static void report_last_gc_error(void)
>  		if (errno == ENOENT)
>  			goto done;
>  
> -		die_errno(_("cannot stat '%s'"), gc_log_path);
> +		return error_errno(_("cannot stat '%s'"), gc_log_path);
>  	}
>  
>  	if (st.st_mtime < gc_log_expire_time)
> @@ -457,9 +457,9 @@ static void report_last_gc_error(void)
>  
>  	ret = strbuf_read_file(&sb, gc_log_path, 0);
>  	if (ret < 0)
> -		die_errno(_("cannot read '%s'"), gc_log_path);
> +		return error_errno(_("cannot read '%s'"), gc_log_path);
>  	if (ret > 0)
> -		die(_("The last gc run reported the following. "
> +		return error(_("The last gc run reported the following. "
>  			       "Please correct the root cause\n"
>  			       "and remove %s.\n"
>  			       "Automatic cleanup will not be performed "
> @@ -469,6 +469,7 @@ static void report_last_gc_error(void)
>  	strbuf_release(&sb);
>  done:
>  	free(gc_log_path);
> +	return 0;
>  }
>  
>  static void gc_before_repack(void)
> @@ -561,7 +562,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
>  		}
>  		if (detach_auto) {
> -			report_last_gc_error(); /* dies on error */
> +			if (report_last_gc_error())
> +				/*
> +				 * A previous gc failed. We've reported the
> +				 * error, so there's nothing left to do.
> +				 */
> +				return 0;
>  
>  			if (lock_repo_for_gc(force, &pid))
>  				return 0;
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index c474a94a9f..3e62df616c 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>  	test_config gc.autopacklimit 1 &&
>  	test_config gc.autodetach true &&
>  	echo fleem >.git/gc.log &&
> -	test_must_fail git gc --auto 2>err &&
> -	test_i18ngrep "^fatal:" err &&
> +	git gc --auto 2>err &&
> +	test_i18ngrep "^error:" err &&
>  	test_config gc.logexpiry 5.days &&
>  	test-tool chmtime =-345600 .git/gc.log &&
> -	test_must_fail git gc --auto &&
> +	git gc --auto &&
>  	test_config gc.logexpiry 2.days &&
>  	run_and_wait_for_auto_gc &&
>  	ls .git/objects/pack/pack-*.pack >packs &&

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/3] gc: improve handling of errors reading gc.log
  2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
@ 2018-07-17 18:19     ` Junio C Hamano
  2018-07-17 19:58     ` Jeff King
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2018-07-17 18:19 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy

Jonathan Nieder <jrnieder@gmail.com> writes:

> - avoid being confused by a gc.log larger than INT_MAX bytes

;-)

>
> Noticed by code inspection.
>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---

Looks good.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] gc: exit with status 128 on failure
  2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
@ 2018-07-17 18:22     ` Junio C Hamano
  2018-07-17 19:59     ` Jeff King
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2018-07-17 18:22 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Jeff King, Elijah Newren,
	Nguyễn Thái Ngọc Duy

Jonathan Nieder <jrnieder@gmail.com> writes:

> A value of -1 returned from cmd_gc gets propagated to exit(),
> resulting in an exit status of 255.  Use die instead for a clearer
> error message and a controlled exit.
>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---
> As in
> https://public-inbox.org/git/20180716225639.GK11513@aiede.svl.corp.google.com/.
> The only change is splitting out patch 1/3.
>
>  builtin/gc.c  | 35 ++++++++++++++---------------------
>  t/t6500-gc.sh |  2 +-
>  2 files changed, 15 insertions(+), 22 deletions(-)

I double checked "return' in cmd_gc() and this patch catches them
all.  Good clean-up.

Thanks.

> diff --git a/builtin/gc.c b/builtin/gc.c
> index d69fc4c0b0..95c8afd07b 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -438,10 +438,9 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
>  	return NULL;
>  }
>  
> -static int report_last_gc_error(void)
> +static void report_last_gc_error(void)
>  {
>  	struct strbuf sb = STRBUF_INIT;
> -	int ret = 0;
>  	ssize_t len;
>  	struct stat st;
>  	char *gc_log_path = git_pathdup("gc.log");
> @@ -450,8 +449,7 @@ static int report_last_gc_error(void)
>  		if (errno == ENOENT)
>  			goto done;
>  
> -		ret = error_errno(_("cannot stat '%s'"), gc_log_path);
> -		goto done;
> +		die_errno(_("cannot stat '%s'"), gc_log_path);
>  	}
>  
>  	if (st.st_mtime < gc_log_expire_time)
> @@ -459,9 +457,9 @@ static int report_last_gc_error(void)
>  
>  	len = strbuf_read_file(&sb, gc_log_path, 0);
>  	if (len < 0)
> -		ret = error_errno(_("cannot read '%s'"), gc_log_path);
> +		die_errno(_("cannot read '%s'"), gc_log_path);
>  	else if (len > 0)
> -		ret = error(_("The last gc run reported the following. "
> +		die(_("The last gc run reported the following. "
>  			       "Please correct the root cause\n"
>  			       "and remove %s.\n"
>  			       "Automatic cleanup will not be performed "
> @@ -471,20 +469,18 @@ static int report_last_gc_error(void)
>  	strbuf_release(&sb);
>  done:
>  	free(gc_log_path);
> -	return ret;
>  }
>  
> -static int gc_before_repack(void)
> +static void gc_before_repack(void)
>  {
>  	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
> +		die(FAILED_RUN, pack_refs_cmd.argv[0]);
>  
>  	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, reflog.argv[0]);
> +		die(FAILED_RUN, reflog.argv[0]);
>  
>  	pack_refs = 0;
>  	prune_reflogs = 0;
> -	return 0;
>  }
>  
>  int cmd_gc(int argc, const char **argv, const char *prefix)
> @@ -565,13 +561,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
>  		}
>  		if (detach_auto) {
> -			if (report_last_gc_error())
> -				return -1;
> +			report_last_gc_error(); /* dies on error */
>  
>  			if (lock_repo_for_gc(force, &pid))
>  				return 0;
> -			if (gc_before_repack())
> -				return -1;
> +			gc_before_repack(); /* dies on failure */
>  			delete_tempfile(&pidfile);
>  
>  			/*
> @@ -611,12 +605,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  		atexit(process_log_file_at_exit);
>  	}
>  
> -	if (gc_before_repack())
> -		return -1;
> +	gc_before_repack();
>  
>  	if (!repository_format_precious_objects) {
>  		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
> -			return error(FAILED_RUN, repack.argv[0]);
> +			die(FAILED_RUN, repack.argv[0]);
>  
>  		if (prune_expire) {
>  			argv_array_push(&prune, prune_expire);
> @@ -626,18 +619,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  				argv_array_push(&prune,
>  						"--exclude-promisor-objects");
>  			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
> -				return error(FAILED_RUN, prune.argv[0]);
> +				die(FAILED_RUN, prune.argv[0]);
>  		}
>  	}
>  
>  	if (prune_worktrees_expire) {
>  		argv_array_push(&prune_worktrees, prune_worktrees_expire);
>  		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
> -			return error(FAILED_RUN, prune_worktrees.argv[0]);
> +			die(FAILED_RUN, prune_worktrees.argv[0]);
>  	}
>  
>  	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, rerere.argv[0]);
> +		die(FAILED_RUN, rerere.argv[0]);
>  
>  	report_garbage = report_pack_garbage;
>  	reprepare_packed_git(the_repository);
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index 818435f04e..c474a94a9f 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -117,7 +117,7 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>  	test_config gc.autodetach true &&
>  	echo fleem >.git/gc.log &&
>  	test_must_fail git gc --auto 2>err &&
> -	test_i18ngrep "^error:" err &&
> +	test_i18ngrep "^fatal:" err &&
>  	test_config gc.logexpiry 5.days &&
>  	test-tool chmtime =-345600 .git/gc.log &&
>  	test_must_fail git gc --auto &&

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/3] gc: improve handling of errors reading gc.log
  2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
  2018-07-17 18:19     ` Junio C Hamano
@ 2018-07-17 19:58     ` Jeff King
  1 sibling, 0 replies; 57+ messages in thread
From: Jeff King @ 2018-07-17 19:58 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy

On Mon, Jul 16, 2018 at 11:53:21PM -0700, Jonathan Nieder wrote:

> A collection of minor error handling fixes:
> 
> - use an error message in lower case, following the usual style
> - quote filenames in error messages to make them easier to read and to
>   decrease translation load by matching other 'stat' error messages
> - check for and report errors from 'read', too
> - avoid being confused by a gc.log larger than INT_MAX bytes
> 
> Noticed by code inspection.
> 
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>

Thanks, this all looks obviously good.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] gc: exit with status 128 on failure
  2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
  2018-07-17 18:22     ` Junio C Hamano
@ 2018-07-17 19:59     ` Jeff King
  2018-09-17 18:33       ` Jeff King
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-17 19:59 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy

On Mon, Jul 16, 2018 at 11:54:16PM -0700, Jonathan Nieder wrote:

> A value of -1 returned from cmd_gc gets propagated to exit(),
> resulting in an exit status of 255.  Use die instead for a clearer
> error message and a controlled exit.

This feels a little funny because we know we're going to turn some of
these back in the next patch. That said, I'm OK with it, since this
version is already written.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
@ 2018-07-17 20:13     ` Jeff King
  2018-07-18 16:21       ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-17 20:13 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

On Mon, Jul 16, 2018 at 11:57:40PM -0700, Jonathan Nieder wrote:

> External tools like repo[1], though, do care about the exit status
> from "git gc --auto".  In non-daemonized mode, the exit status is
> straightforward: if there is an error, it is nonzero, but after a
> warning like the above, the status is zero.  The daemonized mode, as a
> side effect of the other properties provided, offers a very strange
> exit code convention:
> 
>  - if no housekeeping was required, the exit status is 0
> 
>  - the first real run, after forking into the background, returns exit
>    status 0 unconditionally.  The parent process has no way to know
>    whether gc will succeed.
> 
>  - if there is any diagnostic output in gc.log, subsequent runs return
>    a nonzero exit status to indicate that gc was not triggered.
> 
> There's nothing for the calling program to act on on the basis of that
> error.  Use status 0 consistently instead, to indicate that we decided
> not to run a gc (just like if no housekeeping was required).  This
> way, repo and similar tools can get the benefit of the same behavior
> as tools like "git fetch" that ignore the exit status from gc --auto.

I think this is a good change.

In theory it might be useful to propagate the original exit code (_not_
"did we see a warning or an error", but the true original exit code. But
as you note, it's not deterministic anyway (we'd miss that exit code on
the first run, or even any simultaneous runs that exit early due to lock
contention). So it's clear that callers can't really do anything robust
based on the exit code of a daemonized "gc --auto".

I still think that "repo" should probably stop respecting the exit code.
But that's no excuse for Git not to have a sensible exit code in the
first place.

The patch itself looks good overall. A few comments (none of which I
think even requires a fix):

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 1cc18a828c..5eaa4aaa7d 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
>  gc.autoPackLimit and gc.bigPackThreshold should be respected again.
>  
>  gc.logExpiry::
> -	If the file gc.log exists, then `git gc --auto` won't run
> +	If the file gc.log exists, then `git gc --auto` will print
> +	its content and exit with status zero instead of running
>  	unless that file is more than 'gc.logExpiry' old.  Default is
>  	"1.day".  See `gc.pruneExpire` for more ways to specify its
>  	value.

Yeah, this is definitely worth documenting. I was surprised there's no
discussion of daemonization at all in git-gc(1). I don't think adding it
is a blocker for this series, though.

>  static void gc_before_repack(void)
> @@ -561,7 +576,13 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
>  		}
>  		if (detach_auto) {
> -			report_last_gc_error(); /* dies on error */
> +			int ret = report_last_gc_error();
> +			if (ret < 0)
> +				/* an I/O error occured, already reported */
> +				exit(128);

We have a few exit(128)'s sprinkled throughout the code-base, and I
always wonder if they will one day go stale if we change the code that
die() uses. But it probably doesn't matter, and anyway I don't think
there is a better way to do this currently.

I would have written "return 128" since the other case arm also returns,
but I really cannot think of a reason to prefer one over the other.

> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index c474a94a9f..a222efdbe1 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -116,11 +116,11 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>  	test_config gc.autopacklimit 1 &&
>  	test_config gc.autodetach true &&
>  	echo fleem >.git/gc.log &&
> -	test_must_fail git gc --auto 2>err &&
> -	test_i18ngrep "^fatal:" err &&
> +	git gc --auto 2>err &&
> +	test_i18ngrep "^warning:" err &&
>  	test_config gc.logexpiry 5.days &&
>  	test-tool chmtime =-345600 .git/gc.log &&
> -	test_must_fail git gc --auto &&
> +	git gc --auto &&

Nice. At first I thought this was changing an existing test to cover the
new case (which I usually frown on), but it is just that your patch is
intentionally changing the case covered here. So this is the right thing
to do.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
  2018-07-17 14:03                                   ` Jonathan Nieder
@ 2018-07-17 20:27                                   ` Jeff King
  2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-17 20:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

On Tue, Jul 17, 2018 at 10:59:50AM +0200, Ævar Arnfjörð Bjarmason wrote:

> That doesn't make sense to me. Just because git itself is happily
> ignoring the exit code I don't think that should mean there shouldn't be
> a meaningful exit code. Why don't you just ignore the exit code in the
> repo tool as well?
> 
> Now e.g. automation I have to see if git-gc ---auto is having issues
> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet
> of servers, but will need to start caring if stderr was emitted to.

If you're daemonizing gc, though, there are a number of cases where the
exit code is not propagated. If you really care about the outcome,
you're probably better off either:

  - doing synchronous gc's, which will still return a meaningful code
    after Jonathan's patches

  - inspecting the log yourself. I know that comes close to the un-unixy
    stderr thing. But in this case, the presence of a non-empty log is
    literally the on-disk bit for "the previous run failed". And doing
    so catches all of the daemonized cases, even the "first one" that
    you'd miss by paying attention to the immediate exit code.

    This will treat the zero-exit-code "warning" case as an error. I'm
    not against propagating the true original error code, if somebody
    wants to work on it. But I think Jonathan's patches are a strict
    improvement over the current situation, and a patch to propagate
    could come on top.

> I think you're conflating two things here in a way that (if I'm reading
> this correctly) produces a pretty bad regression for users.
> 
>  a) If we have something in the gc.log we keep yelling at users until we
>     reach the gc.logExpiry, this was the subject of my thread back in
>     January. This sucks, and should be fixed somehow.
> 
>     Maybe we should only emit the warning once, e.g. creating a
>     .git/gc.log.wasemitted marker and as long as its ctime is later than
>     the mtime on .git/gc.log we don't emit it again).
> 
>     But that also creates the UX problem that it's easy to miss this
>     message in the middle of some huge "fetch" output. Perhaps we should
>     just start emitting this as part of "git status" or something (or
>     solve the root cause, as Peff notes...).

I kind of like that "git status" suggestion. Of course many users run
"git status" more often than "git commit", so it may actually spam them
more!

>  b) We *also* use this presence of a gc.log as a marker for "we failed
>     too recently, let's not try again until after a day".
> 
> The semantics of b) are very useful, and just because they're tied up
> with a) right now, let's not throw out b) just because we're trying to
> solve a).

Yeah. In general my concern was the handling of (b), which I think this
last crop of patches is fine with. As far as showing the message
repeatedly or not, I don't have a strong opinion (it could even be
configurable, once it's split from the "marker" behavior).

> Right now one thing we do right is we always try to look at the actual
> state of the repo with too_many_packs() and too_many_loose_objects().
> 
> We don't assume the state of your repo hasn't drastically changed
> recently. That means that we do the right thing and gc when the repo
> needs it, not just because we GC'd recently enough.
> 
> With these proposed semantics we'd skip a needed GC (potentially for
> days, depending on the default) just because we recently ran one.

Yeah, I agree that deferring repeated gc's based on time is going to run
into pathological corner cases. OTOH, what we've found at GitHub is that
"gc --auto" is quite insufficient for scheduling maintenance anyway
(e.g., if a machine gets pushes to 100 separate repositories in quick
succession, you probably want to queue and throttle any associated gc).

But there are probably more mid-size sites that are big enough to have
corner cases, but not so big that "gc --auto" is hopeless. ;)

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-17 20:27                                   ` Jeff King
@ 2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
  2018-07-18 17:29                                       ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-18 13:11 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy


On Tue, Jul 17 2018, Jeff King wrote:

> On Tue, Jul 17, 2018 at 10:59:50AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> That doesn't make sense to me. Just because git itself is happily
>> ignoring the exit code I don't think that should mean there shouldn't be
>> a meaningful exit code. Why don't you just ignore the exit code in the
>> repo tool as well?
>>
>> Now e.g. automation I have to see if git-gc ---auto is having issues
>> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet
>> of servers, but will need to start caring if stderr was emitted to.
>
> If you're daemonizing gc, though, there are a number of cases where the
> exit code is not propagated. If you really care about the outcome,
> you're probably better off either:

In theory a lot of the other stuff can fail, but in practice I've only
seen it get stuck due to running out of disk space (easily detected in
other ways), and because of having too many loose objects (e.g. due to,
but not limited to
https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/).

>   - doing synchronous gc's, which will still return a meaningful code
>     after Jonathan's patches

(Reply to this and "we've found at GitHub..." later in your mail)

I'm targeting clients (dev machines, laptops, staging boxes) which have
clones of repos, some put in place by automation, some manually cloned
by users in ~.

So it's much easier to rely on shipping a /etc/gitconfig than setting
gc.auto=0 and having some system-wide job that goes and hunts for local
git repos to manually gc.

It's also a big advantage that it's driven by user activity, because
it's an implicit guard of only_do_this_if_the_user_is_active_here()
before "git-gc" on a fleet of machines, which when some of those get
their disk space from shared NFS resources cuts down on overall I/O.

>   - inspecting the log yourself. I know that comes close to the un-unixy
>     stderr thing. But in this case, the presence of a non-empty log is
>     literally the on-disk bit for "the previous run failed". And doing
>     so catches all of the daemonized cases, even the "first one" that
>     you'd miss by paying attention to the immediate exit code.
>
>     This will treat the zero-exit-code "warning" case as an error. I'm
>     not against propagating the true original error code, if somebody
>     wants to work on it. But I think Jonathan's patches are a strict
>     improvement over the current situation, and a patch to propagate
>     could come on top.

Yeah, I can check gc.log, if others are of a different opinion about
this #3 case at the end of the day I don't mind if it returns 0. It just
doesn't make any conceptual sense to me.

>> I think you're conflating two things here in a way that (if I'm reading
>> this correctly) produces a pretty bad regression for users.
>>
>>  a) If we have something in the gc.log we keep yelling at users until we
>>     reach the gc.logExpiry, this was the subject of my thread back in
>>     January. This sucks, and should be fixed somehow.
>>
>>     Maybe we should only emit the warning once, e.g. creating a
>>     .git/gc.log.wasemitted marker and as long as its ctime is later than
>>     the mtime on .git/gc.log we don't emit it again).
>>
>>     But that also creates the UX problem that it's easy to miss this
>>     message in the middle of some huge "fetch" output. Perhaps we should
>>     just start emitting this as part of "git status" or something (or
>>     solve the root cause, as Peff notes...).
>
> I kind of like that "git status" suggestion. Of course many users run
> "git status" more often than "git commit", so it may actually spam them
> more!

Maybe piggy-packing on the advice facility ...

>>  b) We *also* use this presence of a gc.log as a marker for "we failed
>>     too recently, let's not try again until after a day".
>>
>> The semantics of b) are very useful, and just because they're tied up
>> with a) right now, let's not throw out b) just because we're trying to
>> solve a).
>
> Yeah. In general my concern was the handling of (b), which I think this
> last crop of patches is fine with. As far as showing the message
> repeatedly or not, I don't have a strong opinion (it could even be
> configurable, once it's split from the "marker" behavior).
>
>> Right now one thing we do right is we always try to look at the actual
>> state of the repo with too_many_packs() and too_many_loose_objects().
>>
>> We don't assume the state of your repo hasn't drastically changed
>> recently. That means that we do the right thing and gc when the repo
>> needs it, not just because we GC'd recently enough.
>>
>> With these proposed semantics we'd skip a needed GC (potentially for
>> days, depending on the default) just because we recently ran one.
>
> Yeah, I agree that deferring repeated gc's based on time is going to run
> into pathological corner cases. OTOH, what we've found at GitHub is that
> "gc --auto" is quite insufficient for scheduling maintenance anyway
> (e.g., if a machine gets pushes to 100 separate repositories in quick
> succession, you probably want to queue and throttle any associated gc).

I'm sure you have better solutions for this at GitHub, but as an aside
it might be interesting to add some sort of gc flock + retry setting for
this use-case, i.e. even if you had 100 concurrent gc's due to
too_many_*(), they'd wait + retry until they could get the flock on a
given file.

Then again this is probably too specific, and could be done with a
pre-auto-gc hook too..

> But there are probably more mid-size sites that are big enough to have
> corner cases, but not so big that "gc --auto" is hopeless. ;)

FWIW I don't deal with gc on the server at all these days (that's our
internal GitLab's instance problem to solve). I just mentioned the edge
case of a growing number of pack files on the server upthread as
something that we'd be shipping with git if we had time-based backoff
semantics.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-17 20:13     ` Jeff King
@ 2018-07-18 16:21       ` Junio C Hamano
  2018-07-18 17:22         ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2018-07-18 16:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

Jeff King <peff@peff.net> writes:

>> There's nothing for the calling program to act on on the basis of that
>> error.  Use status 0 consistently instead, to indicate that we decided
>> not to run a gc (just like if no housekeeping was required).  This
>> way, repo and similar tools can get the benefit of the same behavior
>> as tools like "git fetch" that ignore the exit status from gc --auto.
>
> I think this is a good change.
>
> In theory it might be useful to propagate the original exit code (_not_
> "did we see a warning or an error", but the true original exit code. But
> as you note, it's not deterministic anyway (we'd miss that exit code on
> the first run, or even any simultaneous runs that exit early due to lock
> contention). So it's clear that callers can't really do anything robust
> based on the exit code of a daemonized "gc --auto".
>
> I still think that "repo" should probably stop respecting the exit code.
> But that's no excuse for Git not to have a sensible exit code in the
> first place.

I am not yet convinced that this last step to exit with 0 is a good
change, even though I can understand that it would be more
convenient, as there currently is no easy way for the calling script
to tell two error cases apart.

I think the "sensible exit code" you mention would be something like
"1 for hard error, 2 for 'I am punting as I see there were previous
errors---you may want to examine your repository'".  

If we did that from day one and documented that behaviour, nobody
would have complained, but adopting that suddenly is of course a
breaking change.

Perhaps we should exit with 2 (not 0) in that "previous error" case
by default, and then have a configuration knob to turn that 2 into 0
for those who cannot easily modify the calling script?  That way, we
by default will *not* break those who have been paying attention to
zero-ness of the exit status, we allow those who want to treat this
"prior error" case as if there were no error with just a knob, and
then those who are willing to update their script can tell two cases
by the exit status and act differently.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-18 16:21       ` Junio C Hamano
@ 2018-07-18 17:22         ` Jeff King
  2018-07-18 18:19           ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-18 17:22 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

On Wed, Jul 18, 2018 at 09:21:18AM -0700, Junio C Hamano wrote:

> > I still think that "repo" should probably stop respecting the exit code.
> > But that's no excuse for Git not to have a sensible exit code in the
> > first place.
> 
> I am not yet convinced that this last step to exit with 0 is a good
> change, even though I can understand that it would be more
> convenient, as there currently is no easy way for the calling script
> to tell two error cases apart.
> 
> I think the "sensible exit code" you mention would be something like
> "1 for hard error, 2 for 'I am punting as I see there were previous
> errors---you may want to examine your repository'".  
> 
> If we did that from day one and documented that behaviour, nobody
> would have complained, but adopting that suddenly is of course a
> breaking change.
> 
> Perhaps we should exit with 2 (not 0) in that "previous error" case
> by default, and then have a configuration knob to turn that 2 into 0
> for those who cannot easily modify the calling script?  That way, we
> by default will *not* break those who have been paying attention to
> zero-ness of the exit status, we allow those who want to treat this
> "prior error" case as if there were no error with just a knob, and
> then those who are willing to update their script can tell two cases
> by the exit status and act differently.

I think we have been exiting non-zero with "previous errors" for some
time with the daemonizing code. It was just spelled "-1" instead of "2".
So just jumping right there does not mean any regression from the
current state, I don't think (but it also does not fix existing scripts
like "repo" that check the code).

I agree the config you suggest would give people the tools to make that
case work. But it somehow rubs me the wrong way. Can you imagine the
perspective of a user who is told "oh, your script breaks? Just try
setting this option to ignore error codes in this one particular
situation". It feels like a weird hack, because it is.

It's also still inconsistent in the daemonize case. The run that yields
the error won't return a non-zero exit. But the next run will exit with
"2".

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH] gc: do not warn about too many loose objects
  2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
@ 2018-07-18 17:29                                       ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2018-07-18 17:29 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Jonathan Tan, git, Andrii Dehtiarov,
	Nguyễn Thái Ngọc Duy

On Wed, Jul 18, 2018 at 03:11:38PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Yeah, I agree that deferring repeated gc's based on time is going to run
> > into pathological corner cases. OTOH, what we've found at GitHub is that
> > "gc --auto" is quite insufficient for scheduling maintenance anyway
> > (e.g., if a machine gets pushes to 100 separate repositories in quick
> > succession, you probably want to queue and throttle any associated gc).
> 
> I'm sure you have better solutions for this at GitHub, but as an aside
> it might be interesting to add some sort of gc flock + retry setting for
> this use-case, i.e. even if you had 100 concurrent gc's due to
> too_many_*(), they'd wait + retry until they could get the flock on a
> given file.
> 
> Then again this is probably too specific, and could be done with a
> pre-auto-gc hook too..

Yeah, I think any multi-repo solution is getting way too specific for
Git, and the best thing we can do is provide a hook. I agree you could
probably do this today with a pre-auto-gc hook (if it skips gc it would
just queue itself and return non-zero). Or even just make a mark in a
database that says "there was some activity here".

Since we have so much other infrastructure sitting between the user and
Git anyway, we do that marking at a separate layer which is already
talking to a database. ;)

Anyway, I do agree with your general notion that this isn't the right
approach for many situations, and auto-gc is a better solution for many
cases.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-18 17:22         ` Jeff King
@ 2018-07-18 18:19           ` Junio C Hamano
  2018-07-18 19:06             ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2018-07-18 18:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

Jeff King <peff@peff.net> writes:

>> Perhaps we should exit with 2 (not 0) in that "previous error" case
>> by default, and then have a configuration knob to turn that 2 into 0
>> for those who cannot easily modify the calling script?  That way, we
>> by default will *not* break those who have been paying attention to
>> zero-ness of the exit status, we allow those who want to treat this
>> "prior error" case as if there were no error with just a knob, and
>> then those who are willing to update their script can tell two cases
>> by the exit status and act differently.
>
> I think we have been exiting non-zero with "previous errors" for some
> time with the daemonizing code. It was just spelled "-1" instead of "2".
> So just jumping right there does not mean any regression from the
> current state, I don't think (but it also does not fix existing scripts
> like "repo" that check the code).

Correct.  That is why I suggested that default setting.  

However.

If we change it so that by default we exit with 0 but allow those
who care to see 2 with a knob, that would be a regression to those
who actually were acting on the non-zero exit, but they no longer
have to scrape the log/warning output if they choose to set the
knob, so the "convenience" factor might argue in favor of the "by
default we return 0 but you can choose to get the 'sensible exit
codes'" approach.  I do not have a strong opinion either way wrt
which one should be the default.

> I agree the config you suggest would give people the tools to make that
> case work. But it somehow rubs me the wrong way. Can you imagine the
> perspective of a user who is told "oh, your script breaks? Just try
> setting this option to ignore error codes in this one particular
> situation". It feels like a weird hack, because it is.

Well, the message is "your script is broken (and has always been
when 'gc --auto' saw that the previous round failed), but now we
have two ways to help you correct.  An easier but broken way is to
turn this knob to turn 2 to 0 to hide the issue under the rug; or
you can notice the difference between $?==1 and $?==2 and act
differently".  "turn 2 to 0" is a weird hack, but they are also
given the option of doing the right thing, so...

> It's also still inconsistent in the daemonize case. The run that yields
> the error won't return a non-zero exit. But the next run will exit with
> "2".

I do not see this particular "inconsistency" a problem.  The run
that first discovers the problem by definition cannot return with
non-zero; not waiting until finding out the outcome is the whole
point of running in the background and giving control back early.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-18 18:19           ` Junio C Hamano
@ 2018-07-18 19:06             ` Jeff King
  2018-07-18 19:55               ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-07-18 19:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

On Wed, Jul 18, 2018 at 11:19:01AM -0700, Junio C Hamano wrote:

> > It's also still inconsistent in the daemonize case. The run that yields
> > the error won't return a non-zero exit. But the next run will exit with
> > "2".
> 
> I do not see this particular "inconsistency" a problem.  The run
> that first discovers the problem by definition cannot return with
> non-zero; not waiting until finding out the outcome is the whole
> point of running in the background and giving control back early.

I guess I just see it as encouraging a non-robust flow. I can see
somebody thinking they should care about "gc --auto" errors, because
they can turn up latent repository corruption (because most operations
try to only look at the objects they need to, and gc inherently
considers every object). But doing so would give a very delayed
notification for a repository that is handled infrequently. So finding
out about a corruption that we detected might takes weeks or months.

I dunno. If your primary motivation is not finding latent problems, but
loudly complaining when gc does not make forward progress, I suppose it
makes more sense.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] gc: do not return error for prior errors in daemonized mode
  2018-07-18 19:06             ` Jeff King
@ 2018-07-18 19:55               ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2018-07-18 19:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy, Andrii Dehtiarov

Jeff King <peff@peff.net> writes:

> On Wed, Jul 18, 2018 at 11:19:01AM -0700, Junio C Hamano wrote:
>
>> > It's also still inconsistent in the daemonize case. The run that yields
>> > the error won't return a non-zero exit. But the next run will exit with
>> > "2".
>> 
>> I do not see this particular "inconsistency" a problem.  The run
>> that first discovers the problem by definition cannot return with
>> non-zero; not waiting until finding out the outcome is the whole
>> point of running in the background and giving control back early.
>
> I guess I just see it as encouraging a non-robust flow. I can see
> somebody thinking they should care about "gc --auto" errors, because
> they can turn up latent repository corruption (because most operations
> try to only look at the objects they need to, and gc inherently
> considers every object). But doing so would give a very delayed
> notification for a repository that is handled infrequently. So finding
> out about a corruption that we detected might takes weeks or months.
>
> I dunno. If your primary motivation is not finding latent problems, but
> loudly complaining when gc does not make forward progress, I suppose it
> makes more sense.

I am not sure either.  If your primary motivation is to protect
yourself from ignoring gc failures that could be an indication of
possible repository corruption, you wouldn't be running it
backgrounded in the first place, I woudl guess.  And if you are
backgrounding, with "exit with 2, not hide it with 0" approach, you
would have a better chance to notice, no?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] gc: exit with status 128 on failure
  2018-07-17 19:59     ` Jeff King
@ 2018-09-17 18:33       ` Jeff King
  2018-09-17 18:40         ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2018-09-17 18:33 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy

On Tue, Jul 17, 2018 at 03:59:47PM -0400, Jeff King wrote:

> On Mon, Jul 16, 2018 at 11:54:16PM -0700, Jonathan Nieder wrote:
> 
> > A value of -1 returned from cmd_gc gets propagated to exit(),
> > resulting in an exit status of 255.  Use die instead for a clearer
> > error message and a controlled exit.
> 
> This feels a little funny because we know we're going to turn some of
> these back in the next patch. That said, I'm OK with it, since this
> version is already written.

There's discussion elsewhere[1] of applying just up to patch 2.

Do we still want to convert these cases to die() as their end-state?
Another alternative would be to just convert a "-1" return to 128 or
similar at the level of cmd_gc(), which avoids the 255 weirdness.

Doing so also keeps the error messages the same as (as "error" and not
"fatal"). I'm not sure which we prefer.

It also makes the code more flexible and lib-ifiable (since the caller
can decide how to handle the errors). That probably doesn't matter much
since this is all static-local to builtin/gc.c, though I suppose in
theory we could eventually do parts of "gc --auto" without spawning a
separate process.

-Peff

[1] https://public-inbox.org/git/20180917182210.GB3894@sigill.intra.peff.net/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] gc: exit with status 128 on failure
  2018-09-17 18:33       ` Jeff King
@ 2018-09-17 18:40         ` Jonathan Nieder
  2018-09-18 17:30           ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2018-09-17 18:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy

Hi,

Jeff King wrote:
> On Tue, Jul 17, 2018 at 03:59:47PM -0400, Jeff King wrote:
>> On Mon, Jul 16, 2018 at 11:54:16PM -0700, Jonathan Nieder wrote:

>>> A value of -1 returned from cmd_gc gets propagated to exit(),
>>> resulting in an exit status of 255.  Use die instead for a clearer
>>> error message and a controlled exit.
>>
>> This feels a little funny because we know we're going to turn some of
>> these back in the next patch. That said, I'm OK with it, since this
>> version is already written.
>
> There's discussion elsewhere[1] of applying just up to patch 2.
>
> Do we still want to convert these cases to die() as their end-state?

IMHO yes, we do.  die() is the function that you can use to exit with
a fatal error.

If we want to get rid of die(), that would be a tree-wide effort, not
something that should hold up this patch.

[...]
> It also makes the code more flexible and lib-ifiable (since the caller
> can decide how to handle the errors). That probably doesn't matter much
> since this is all static-local to builtin/gc.c,

Exactly.  I'm a strong believer in http://wiki.c2.com/?YouArentGonnaNeedIt.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] gc: exit with status 128 on failure
  2018-09-17 18:40         ` Jonathan Nieder
@ 2018-09-18 17:30           ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2018-09-18 17:30 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Elijah Newren,
	Nguyễn Thái Ngọc Duy

On Mon, Sep 17, 2018 at 11:40:12AM -0700, Jonathan Nieder wrote:

> > There's discussion elsewhere[1] of applying just up to patch 2.
> >
> > Do we still want to convert these cases to die() as their end-state?
> 
> IMHO yes, we do.  die() is the function that you can use to exit with
> a fatal error.
> 
> If we want to get rid of die(), that would be a tree-wide effort, not
> something that should hold up this patch.

But that was sort of my question. I think there are people who _do_ want
to get rid of most die() calls (like Dscho), and there is a tree-wide
effort that is happening slowly to lib-ify. Your patch goes in the
opposite direction.

That said, I think there are actually two cases in your patch.

The calls to "return error()" or even just "return -1" in cmd_gc() seem
like obvious candidates for die(). We're at the top of the stack, and
anybody lib-ifying at that level is going to need to extract bits into
reusable functions anyway.

I more wondered about helpers like report_last_gc_error() and
gc_before_repack().

> > It also makes the code more flexible and lib-ifiable (since the caller
> > can decide how to handle the errors). That probably doesn't matter much
> > since this is all static-local to builtin/gc.c,
> 
> Exactly.  I'm a strong believer in http://wiki.c2.com/?YouArentGonnaNeedIt.

I only half-agree that this is YAGNI. If it were "let's punt on making
this code friendlier to lib-ification", I'd agree more completely. But
it's actually taking an active step in the opposite direction.

I dunno. It's probably not worth spending too much more time discussing,
and I'm OK either way.  I mostly just wanted to raise the issue since
dropping patch 3 changes the balance to me.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2018-09-18 17:30 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-16 17:27 [PATCH] gc: do not warn about too many loose objects Jonathan Tan
2018-07-16 17:51 ` Jeff King
2018-07-16 18:22   ` Jonathan Nieder
2018-07-16 18:52     ` Jeff King
2018-07-16 19:09       ` Jonathan Nieder
2018-07-16 19:41         ` Jeff King
2018-07-16 19:54           ` Jonathan Nieder
2018-07-16 20:29             ` Jeff King
2018-07-16 20:37               ` Jonathan Nieder
2018-07-16 21:09                 ` Jeff King
2018-07-16 21:40                   ` Jonathan Nieder
2018-07-16 21:45                     ` Jeff King
2018-07-16 22:03                       ` Jonathan Nieder
2018-07-16 22:43                         ` Jeff King
2018-07-16 22:56                           ` Jonathan Nieder
2018-07-16 23:26                             ` Jeff King
2018-07-17  1:53                               ` Jonathan Nieder
2018-07-17  8:59                                 ` Ævar Arnfjörð Bjarmason
2018-07-17 14:03                                   ` Jonathan Nieder
2018-07-17 15:24                                     ` Ævar Arnfjörð Bjarmason
2018-07-17 20:27                                   ` Jeff King
2018-07-18 13:11                                     ` Ævar Arnfjörð Bjarmason
2018-07-18 17:29                                       ` Jeff King
2018-07-17 15:59                                 ` Duy Nguyen
2018-07-17 18:09                                 ` Junio C Hamano
2018-07-16 19:15 ` Elijah Newren
2018-07-16 19:19   ` Jonathan Nieder
2018-07-16 20:21     ` Elijah Newren
2018-07-16 20:35       ` Jeff King
2018-07-16 20:56         ` Jonathan Nieder
2018-07-16 21:12           ` Jeff King
2018-07-16 19:52   ` Jeff King
2018-07-16 20:16     ` Elijah Newren
2018-07-16 20:38       ` Jeff King
2018-07-16 21:09         ` Elijah Newren
2018-07-16 21:21           ` Jeff King
2018-07-16 22:07             ` Elijah Newren
2018-07-16 22:55               ` Jeff King
2018-07-16 23:06                 ` Elijah Newren
2018-07-16 21:31           ` Jonathan Nieder
2018-07-17  6:51 ` [PATCH v2 0/3] gc --auto: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17  6:53   ` [PATCH 1/3] gc: improve handling of errors reading gc.log Jonathan Nieder
2018-07-17 18:19     ` Junio C Hamano
2018-07-17 19:58     ` Jeff King
2018-07-17  6:54   ` [PATCH 2/3] gc: exit with status 128 on failure Jonathan Nieder
2018-07-17 18:22     ` Junio C Hamano
2018-07-17 19:59     ` Jeff King
2018-09-17 18:33       ` Jeff King
2018-09-17 18:40         ` Jonathan Nieder
2018-09-18 17:30           ` Jeff King
2018-07-17  6:57   ` [PATCH 3/3] gc: do not return error for prior errors in daemonized mode Jonathan Nieder
2018-07-17 20:13     ` Jeff King
2018-07-18 16:21       ` Junio C Hamano
2018-07-18 17:22         ` Jeff King
2018-07-18 18:19           ` Junio C Hamano
2018-07-18 19:06             ` Jeff King
2018-07-18 19:55               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).