* [RFC] git gc "--prune=now" semantics considered harmful
@ 2018-05-26 21:49 Linus Torvalds
2018-05-26 23:31 ` Junio C Hamano
0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2018-05-26 21:49 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
So this is a RFC patch, I'm not sure how much people really care, but I
find the current behavior of "git gc --prune=now" to be unnecessarily
dangerous.
There's two issues with it:
(a) parse_expiry_date() considers "now" to be special, and it actually
doesn't mean "now" at all, it means "everything".
(b) the date parsing isn't actually done "now", it's done *after* gc has
already run, and we run "git prune --expire". So even if (a) wasn't
true, "--prune=now" wouldn't actually mean "now" when the user
expects it to happen, but "after doing repacking".
I actually think that the "parse_expiry_date()" behavior makes sense
within the context of "git prune --expire", so I'm not really complaining
about (a) per se. I just think that what makes sense within the context of
"git prune" does *not* necessarily make sense within the context of "git
gc".
Why do I care? I end up doing lots of random things in my local
repository, and I also end up wanting to keep my repository fairly clean,
so I tend to do "git gc --prune=now" to just make sure everything is
packed and I've gotten rid of all the temporary stuff that so often
happens when doing lots of git merges (which is what I do).
You won't see those temporary objects for the usual trivial merges, but
whenever you have a real recursive merge with automated conflict
resolution, there will be things like those temporary merge-only objects
for the 3-way base merge state.
Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But
it's what I've gotten used to, and it's at least not entirely insane.
But at least once now, I've done that "git gc" at the end of the day, and
a new pull request comes in, so I do the "git pull" without even thinking
about the fact that "git gc" is still running.
And then the "--prune=now" behavior is actually really pretty dangerous.
Because it will prune *all* unreachable objects, even if they are only
*currently* unreachable because they are in the process of being unpacked
by the concurrent "git fetch" (and I didn't check - I might just have been
unlocky, bit I think "git prune" ignores FETCH_HEAD).
So I actually would much prefer that foir git gc, "--prune=now" means
(a) "now"
(b) now at the _start_ of the "git gc" operation, not the time at
the _end_ of the operation when we've already spent a minute or
two doing repacking and are now doing the final pruning.
anyway, with that explanation in mind, I'm appending a patch that is
pretty small and does that. It's a bit hacky, but I think it still makes
sense.
Comments?
Note that this really isn't likely very noticeable on most projects. When
I do "git gc" on a fairly well-packed repo of git itself, it takes under
4s for me. So the window for that whole "do git pull at the same time" is
simply not much of an issue.
For the kernel, "git gc" takes a minute and a half on the same machine
(again, this is already a packed repo, it can be worse). So there's a much
bigger window there to do something stupid,
Linus
---
builtin/gc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/builtin/gc.c b/builtin/gc.c
index c4777b244..98368c8b5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (argc > 0)
usage_with_options(builtin_gc_usage, builtin_gc_options);
- if (prune_expire && parse_expiry_date(prune_expire, &dummy))
- die(_("failed to parse prune expiry value %s"), prune_expire);
+ if (prune_expire) {
+ if (!strcmp(prune_expire, "now"))
+ prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
+ if (parse_expiry_date(prune_expire, &dummy))
+ die(_("failed to parse prune expiry value %s"), prune_expire);
+ }
if (aggressive) {
argv_array_push(&repack, "-f");
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [RFC] git gc "--prune=now" semantics considered harmful
2018-05-26 21:49 [RFC] git gc "--prune=now" semantics considered harmful Linus Torvalds
@ 2018-05-26 23:31 ` Junio C Hamano
2018-05-27 1:27 ` Linus Torvalds
2018-06-01 7:04 ` Jeff King
0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2018-05-26 23:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But
> it's what I've gotten used to, and it's at least not entirely insane.
FWIW, my end-of-day ritual is to do repack -a -d -f with a large
window and a small depth, followed by prune, which boils down to
about the same. So I'd hope that is not entirely insane. I however
do not think I bother with an explicit --expire=now when running the
prune, though.
In any case, that makes two of us, and the suggested patch protects
only one of the two ;-)
> But at least once now, I've done that "git gc" at the end of the day, and
> a new pull request comes in, so I do the "git pull" without even thinking
> about the fact that "git gc" is still running.
*That* is something I don't do. After all, I am fully aware that I
have started end-of-day ritual by that time, so I won't even look at
a new patch (or a pull request for that matter).
> So I actually would much prefer that foir git gc, "--prune=now" means
>
> (a) "now"
>
> (b) now at the _start_ of the "git gc" operation, not the time at
> the _end_ of the operation when we've already spent a minute or
> two doing repacking and are now doing the final pruning.
>
> anyway, with that explanation in mind, I'm appending a patch that is
> pretty small and does that. It's a bit hacky, but I think it still makes
> sense.
>
> Comments?
Closing the possiblity of racing a running "gc" and new object
creation like the above generally makes sense, I would think,
whether the creation is due to 'pull/fetch', 'add', or even 'push'.
I however have to wonder if there are opposite "oops" end-user
operation we also need to worry about, i.e. we are doing a large-ish
fetch, and get bored and run a gc fron another terminal. Perhaps
*that* is a bit too stupid to worry about? Auto-gc deliberately
does not use 'now' because it wants to leave a grace period to avoid
exactly that kind of race.
> diff --git a/builtin/gc.c b/builtin/gc.c
> index c4777b244..98368c8b5 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
> if (argc > 0)
> usage_with_options(builtin_gc_usage, builtin_gc_options);
>
> - if (prune_expire && parse_expiry_date(prune_expire, &dummy))
> - die(_("failed to parse prune expiry value %s"), prune_expire);
> + if (prune_expire) {
> + if (!strcmp(prune_expire, "now"))
> + prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
> + if (parse_expiry_date(prune_expire, &dummy))
> + die(_("failed to parse prune expiry value %s"), prune_expire);
> + }
>
> if (aggressive) {
> argv_array_push(&repack, "-f");
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] git gc "--prune=now" semantics considered harmful
2018-05-26 23:31 ` Junio C Hamano
@ 2018-05-27 1:27 ` Linus Torvalds
2018-06-01 7:04 ` Jeff King
1 sibling, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2018-05-27 1:27 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Sat, May 26, 2018 at 4:31 PM Junio C Hamano <gitster@pobox.com> wrote:
> *That* is something I don't do. After all, I am fully aware that I
> have started end-of-day ritual by that time, so I won't even look at
> a new patch (or a pull request for that matter).
Sounds like you're more organized about the end-of-day ritual than I am.
For me the gc is not quite so structured.
> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc fron another terminal. Perhaps
> *that* is a bit too stupid to worry about? Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.
For me, a "pull" never takes that long. Sure, any manual merging and the
writing of the commit message might take a while, but it's "foreground"
activity for me, I'd not start a gc in the middle of it.
So at least to me, doing "git fsck --full" and "git gc --prune=now" are
somewhat special because they take a while and tend to be background things
that I "start and forget" about (the same way I sometimes start and forget
a kernel build).
Which is why that current "git gc --prune=now" behavior seems a bit
dangerous.
Linus
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] git gc "--prune=now" semantics considered harmful
2018-05-26 23:31 ` Junio C Hamano
2018-05-27 1:27 ` Linus Torvalds
@ 2018-06-01 7:04 ` Jeff King
2018-06-01 11:07 ` Linus Torvalds
1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2018-06-01 7:04 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List
On Sun, May 27, 2018 at 08:31:14AM +0900, Junio C Hamano wrote:
> > So I actually would much prefer that foir git gc, "--prune=now" means
> >
> > (a) "now"
> >
> > (b) now at the _start_ of the "git gc" operation, not the time at
> > the _end_ of the operation when we've already spent a minute or
> > two doing repacking and are now doing the final pruning.
> >
> > anyway, with that explanation in mind, I'm appending a patch that is
> > pretty small and does that. It's a bit hacky, but I think it still makes
> > sense.
> >
> > Comments?
>
> Closing the possiblity of racing a running "gc" and new object
> creation like the above generally makes sense, I would think,
> whether the creation is due to 'pull/fetch', 'add', or even 'push'.
I think Linus's suggestion is an obvious improvement. It does shorten
the window for confusing things to happen, and I think it makes things
much easier to reason about if all parts of the gc are using the same
timestamp.
Regarding the implementation:
> > - if (prune_expire && parse_expiry_date(prune_expire, &dummy))
> > - die(_("failed to parse prune expiry value %s"), prune_expire);
> > + if (prune_expire) {
> > + if (!strcmp(prune_expire, "now"))
> > + prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601));
> > + if (parse_expiry_date(prune_expire, &dummy))
> > + die(_("failed to parse prune expiry value %s"), prune_expire);
> > + }
We'd also accept relative times like "5.minutes.ago" (in fact, the
default is a relative 2.weeks.ago, though it's long enough that the
difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
much). So we probably ought to just normalize _everything_ without even
bothering to match "now". It's a noop for non-relative times, but that's
OK.
> I however have to wonder if there are opposite "oops" end-user
> operation we also need to worry about, i.e. we are doing a large-ish
> fetch, and get bored and run a gc fron another terminal. Perhaps
> *that* is a bit too stupid to worry about? Auto-gc deliberately
> does not use 'now' because it wants to leave a grace period to avoid
> exactly that kind of race.
There are still possibilities for a race, even with the grace period.
You can have an unreferenced 2-week-old object sitting on disk, and
somebody can choose to reference it at the same time as we are pruning
it. My freshness patches from a few years ago made things a bit better:
- when we optimize out the write of an existing object, we now at
least update its timestamp
- we consider non-fresh objects reachable from fresh ones to also be
fresh
But fundamentally none of this is atomic. You can have an old tree, and
while you're pruning somebody writes a new commit referencing it and
sticks that in a ref. It's more common if your grace period is "now",
but it can still happen with any grace period.
-Peff
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] git gc "--prune=now" semantics considered harmful
2018-06-01 7:04 ` Jeff King
@ 2018-06-01 11:07 ` Linus Torvalds
0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2018-06-01 11:07 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, Git Mailing List
On Fri, Jun 1, 2018 at 2:04 AM Jeff King <peff@peff.net> wrote:
>
> We'd also accept relative times like "5.minutes.ago" (in fact, the
> default is a relative 2.weeks.ago, though it's long enough that the
> difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter
> much). So we probably ought to just normalize _everything_ without even
> bothering to match "now". It's a noop for non-relative times, but that's
> OK.
Heh. That's what I tried to do at first.
It's surprisingly hard.
You can't normalize it as a date, because we have a few times that
aren't expressible as dates because they are just the maximum value
(ie "all").
And then I tried to just express it just as a standard numerical
value, which we accept on input. But we *only* accept that if it's
more than eight digits.
And regardless, you need to special-case "now", since
parse_expiry_date() special cases it.
Or you'd need to just make another version of parse_expiry_date() entirely.
End result: I only special-cased "now".
> There are still possibilities for a race, even with the grace period.
Agreed. But you *really* have to work at it.
Linus
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-06-01 11:07 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-26 21:49 [RFC] git gc "--prune=now" semantics considered harmful Linus Torvalds
2018-05-26 23:31 ` Junio C Hamano
2018-05-27 1:27 ` Linus Torvalds
2018-06-01 7:04 ` Jeff King
2018-06-01 11:07 ` Linus Torvalds
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).