git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC] git gc "--prune=now" semantics considered harmful
@ 2018-05-26 21:49 Linus Torvalds
  2018-05-26 23:31 ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2018-05-26 21:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List


So this is a RFC patch, I'm not sure how much people really care, but I 
find the current behavior of "git gc --prune=now" to be unnecessarily 
dangerous.

There's two issues with it:

 (a) parse_expiry_date() considers "now" to be special, and it actually
     doesn't mean "now" at all, it means "everything".

 (b) the date parsing isn't actually done "now", it's done *after* gc has 
     already run, and we run "git prune --expire". So even if (a) wasn't 
     true, "--prune=now" wouldn't actually mean "now" when the user 
    expects it to happen, but "after doing repacking".

I actually think that the "parse_expiry_date()" behavior makes sense 
within the context of "git prune --expire", so I'm not really complaining 
about (a) per se. I just think that what makes sense within the context of 
"git prune" does *not* necessarily make sense within the context of "git 
gc".

Why do I care? I end up doing lots of random things in my local 
repository, and I also end up wanting to keep my repository fairly clean, 
so I tend to do "git gc --prune=now" to just make sure everything is 
packed and I've gotten rid of all the temporary stuff that so often 
happens when doing lots of git merges (which is what I do). 

You won't see those temporary objects for the usual trivial merges, but 
whenever you have a real recursive merge with automated conflict 
resolution, there will be things like those temporary merge-only objects 
for the 3-way base merge state. 

Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But 
it's what I've gotten used to, and it's at least not entirely insane.

But at least once now, I've done that "git gc" at the end of the day, and 
a new pull request comes in, so I do the "git pull" without even thinking 
about the fact that "git gc" is still running.

And then the "--prune=now" behavior is actually really pretty dangerous. 
Because it will prune *all* unreachable objects, even if they are only 
*currently* unreachable because they are in the process of being unpacked 
by the concurrent "git fetch" (and I didn't check - I might just have been 
unlocky, bit I think "git prune" ignores FETCH_HEAD).

So I actually would much prefer that foir git gc, "--prune=now" means

 (a) "now"

 (b) now at the _start_ of the "git gc" operation, not the time at
     the _end_ of the operation when we've already spent a minute or
     two doing repacking and are now doing the final pruning.

anyway, with that explanation in mind, I'm appending a patch that is 
pretty small and does that. It's a bit hacky, but I think it still makes 
sense.

Comments?

Note that this really isn't likely very noticeable on most projects. When 
I do "git gc" on a fairly well-packed repo of git itself, it takes under 
4s for me. So the window for that whole "do git pull at the same time" is 
simply not much of an issue.

For the kernel, "git gc" takes a minute and a half on the same machine 
(again, this is already a packed repo, it can be worse). So there's a much 
bigger window there to do something stupid,

             Linus

---
 builtin/gc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c4777b244..98368c8b5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (argc > 0)
 		usage_with_options(builtin_gc_usage, builtin_gc_options);
 
-	if (prune_expire && parse_expiry_date(prune_expire, &dummy))
-		die(_("failed to parse prune expiry value %s"), prune_expire);
+	if (prune_expire) {
+		if (!strcmp(prune_expire, "now"))
+			prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
+		if (parse_expiry_date(prune_expire, &dummy))
+			die(_("failed to parse prune expiry value %s"), prune_expire);
+	}
 
 	if (aggressive) {
 		argv_array_push(&repack, "-f");

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC] git gc "--prune=now" semantics considered harmful
  2018-05-26 21:49 [RFC] git gc "--prune=now" semantics considered harmful Linus Torvalds
@ 2018-05-26 23:31 ` Junio C Hamano
  2018-05-27  1:27   ` Linus Torvalds
  2018-06-01  7:04   ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2018-05-26 23:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But 
> it's what I've gotten used to, and it's at least not entirely insane.

FWIW, my end-of-day ritual is to do repack -a -d -f with a large
window and a small depth, followed by prune, which boils down to
about the same.  So I'd hope that is not entirely insane.  I however
do not think I bother with an explicit --expire=now when running the
prune, though.

In any case, that makes two of us, and the suggested patch protects
only one of the two ;-)

> But at least once now, I've done that "git gc" at the end of the day, and 
> a new pull request comes in, so I do the "git pull" without even thinking 
> about the fact that "git gc" is still running.

*That* is something I don't do.  After all, I am fully aware that I
have started end-of-day ritual by that time, so I won't even look at
a new patch (or a pull request for that matter).

> So I actually would much prefer that foir git gc, "--prune=now" means
>
>  (a) "now"
>
>  (b) now at the _start_ of the "git gc" operation, not the time at
>      the _end_ of the operation when we've already spent a minute or
>      two doing repacking and are now doing the final pruning.
>
> anyway, with that explanation in mind, I'm appending a patch that is 
> pretty small and does that. It's a bit hacky, but I think it still makes 
> sense.
>
> Comments?

Closing the possiblity of racing a running "gc" and new object
creation like the above generally makes sense, I would think,
whether the creation is due to 'pull/fetch', 'add', or even 'push'.

I however have to wonder if there are opposite "oops" end-user
operation we also need to worry about, i.e. we are doing a large-ish
fetch, and get bored and run a gc fron another terminal.  Perhaps
*that* is a bit too stupid to worry about?  Auto-gc deliberately
does not use 'now' because it wants to leave a grace period to avoid
exactly that kind of race.

> diff --git a/builtin/gc.c b/builtin/gc.c
> index c4777b244..98368c8b5 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (argc > 0)
>  		usage_with_options(builtin_gc_usage, builtin_gc_options);
>  
> -	if (prune_expire && parse_expiry_date(prune_expire, &dummy))
> -		die(_("failed to parse prune expiry value %s"), prune_expire);
> +	if (prune_expire) {
> +		if (!strcmp(prune_expire, "now"))
> +			prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
> +		if (parse_expiry_date(prune_expire, &dummy))
> +			die(_("failed to parse prune expiry value %s"), prune_expire);
> +	}
>  
>  	if (aggressive) {
>  		argv_array_push(&repack, "-f");

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] git gc "--prune=now" semantics considered harmful
  2018-05-26 23:31 ` Junio C Hamano
@ 2018-05-27  1:27   ` Linus Torvalds
  2018-06-01  7:04   ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2018-05-27  1:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Sat, May 26, 2018 at 4:31 PM Junio C Hamano <gitster@pobox.com> wrote:

> *That* is something I don't do.  After all, I am fully aware that I
> have started end-of-day ritual by that time, so I won't even look at
> a new patch (or a pull request for that matter).

Sounds like you're more organized about the end-of-day ritual than I am.
For me the gc is not quite so structured.

> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc fron another terminal.  Perhaps
> *that* is a bit too stupid to worry about?  Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.

For me, a "pull" never takes that long.  Sure, any manual merging and the
writing of the commit message might take a while, but it's "foreground"
activity for me, I'd not start a gc in the middle of it.

So at least to me, doing "git fsck --full" and "git gc --prune=now" are
somewhat special because they take a while and tend to be background things
that I "start and forget" about (the same way I sometimes start and forget
a kernel build).

Which is why that current "git gc --prune=now" behavior seems a bit
dangerous.

                Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] git gc "--prune=now" semantics considered harmful
  2018-05-26 23:31 ` Junio C Hamano
  2018-05-27  1:27   ` Linus Torvalds
@ 2018-06-01  7:04   ` Jeff King
  2018-06-01 11:07     ` Linus Torvalds
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2018-06-01  7:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On Sun, May 27, 2018 at 08:31:14AM +0900, Junio C Hamano wrote:

> > So I actually would much prefer that foir git gc, "--prune=now" means
> >
> >  (a) "now"
> >
> >  (b) now at the _start_ of the "git gc" operation, not the time at
> >      the _end_ of the operation when we've already spent a minute or
> >      two doing repacking and are now doing the final pruning.
> >
> > anyway, with that explanation in mind, I'm appending a patch that is 
> > pretty small and does that. It's a bit hacky, but I think it still makes 
> > sense.
> >
> > Comments?
> 
> Closing the possiblity of racing a running "gc" and new object
> creation like the above generally makes sense, I would think,
> whether the creation is due to 'pull/fetch', 'add', or even 'push'.

I think Linus's suggestion is an obvious improvement. It does shorten
the window for confusing things to happen, and I think it makes things
much easier to reason about if all parts of the gc are using the same
timestamp.

Regarding the implementation:

> > -	if (prune_expire && parse_expiry_date(prune_expire, &dummy))
> > -		die(_("failed to parse prune expiry value %s"), prune_expire);
> > +	if (prune_expire) {
> > +		if (!strcmp(prune_expire, "now"))
> > +			prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
> > +		if (parse_expiry_date(prune_expire, &dummy))
> > +			die(_("failed to parse prune expiry value %s"), prune_expire);
> > +	}

We'd also accept relative times like "5.minutes.ago" (in fact, the
default is a relative 2.weeks.ago, though it's long enough that the
difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
much). So we probably ought to just normalize _everything_ without even
bothering to match "now". It's a noop for non-relative times, but that's
OK.

> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc fron another terminal.  Perhaps
> *that* is a bit too stupid to worry about?  Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.

There are still possibilities for a race, even with the grace period.
You can have an unreferenced 2-week-old object sitting on disk, and
somebody can choose to reference it at the same time as we are pruning
it. My freshness patches from a few years ago made things a bit better:

  - when we optimize out the write of an existing object, we now at
    least update its timestamp

  - we consider non-fresh objects reachable from fresh ones to also be
    fresh

But fundamentally none of this is atomic. You can have an old tree, and
while you're pruning somebody writes a new commit referencing it and
sticks that in a ref. It's more common if your grace period is "now",
but it can still happen with any grace period.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] git gc "--prune=now" semantics considered harmful
  2018-06-01  7:04   ` Jeff King
@ 2018-06-01 11:07     ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2018-06-01 11:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Fri, Jun 1, 2018 at 2:04 AM Jeff King <peff@peff.net> wrote:
>
> We'd also accept relative times like "5.minutes.ago" (in fact, the
> default is a relative 2.weeks.ago, though it's long enough that the
> difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
> much). So we probably ought to just normalize _everything_ without even
> bothering to match "now". It's a noop for non-relative times, but that's
> OK.

Heh. That's what I tried to do at first.

It's surprisingly hard.

You can't normalize it as a date, because we have a few times that
aren't expressible as dates because they are just the maximum value
(ie "all").

And then I tried to just express it just as a standard numerical
value, which we accept on input. But we *only* accept that if it's
more than eight digits.

And regardless, you need to special-case "now", since
parse_expiry_date() special cases it.

Or you'd need to just make another version of parse_expiry_date() entirely.

End result: I only special-cased "now".

> There are still possibilities for a race, even with the grace period.

Agreed. But you *really* have to work at it.

              Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-06-01 11:07 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-26 21:49 [RFC] git gc "--prune=now" semantics considered harmful Linus Torvalds
2018-05-26 23:31 ` Junio C Hamano
2018-05-27  1:27   ` Linus Torvalds
2018-06-01  7:04   ` Jeff King
2018-06-01 11:07     ` Linus Torvalds

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).