* [PATCH] gc: ignore old gc.log files @ 2017-02-09 2:02 David Turner 2017-02-09 3:23 ` Jeff King 2017-02-09 4:51 ` Eric Wong 0 siblings, 2 replies; 5+ messages in thread From: David Turner @ 2017-02-09 2:02 UTC (permalink / raw) To: git; +Cc: peff, pclouds, David Turner The intent of automatic gc is to have a git repository be relatively low-maintenance from a server-operator perspective. Of course, large operators like GitHub will need a more complicated management strategy, but for ordinary usage, git should just work. In this commit, git learns to ignore gc.log files which are older than (by default) one day old. It also learns about a config, gc.maxLogAge to manage this. So git should never get itself into a state where it refuses to do any maintenance, just because at some point some piece of the maintenance didn't make progress. That might still happen (e.g. because the repo is corrupt), but at the very least it won't be because Git is too dumb to try again. Signed-off-by: David Turner <dturner@twosigma.com> Helped-by: Jeff King <peff@peff.net> --- Documentation/config.txt | 5 +++++ builtin/gc.c | 15 ++++++++++++++- t/t6500-gc.sh | 13 +++++++++++++ 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index fc5a28a32..6751371cf 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1402,6 +1402,11 @@ gc.autoDetach:: Make `git gc --auto` return immediately and run in background if the system supports it. Default is true. +gc.maxLogAge:: + If the file gc.log exists, then `git gc --auto` won't run + unless that file is more than maxLogAge seconds old. Default + is 86400, one day. + gc.packRefs:: Running `git pack-refs` in a repository renders it unclonable by Git versions prior to 1.5.1.2 over dumb diff --git a/builtin/gc.c b/builtin/gc.c index 331f21926..62fc84815 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -33,6 +33,7 @@ static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; static int detach_auto = 1; +static int gc_max_log_age_seconds = 86400; static const char *prune_expire = "2.weeks.ago"; static const char *prune_worktrees_expire = "3.months.ago"; @@ -111,6 +112,7 @@ static void gc_config(void) git_config_get_int("gc.auto", &gc_auto_threshold); git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit); git_config_get_bool("gc.autodetach", &detach_auto); + git_config_get_int("gc.maxlogage", &gc_max_log_age_seconds); git_config_date_string("gc.pruneexpire", &prune_expire); git_config_date_string("gc.worktreepruneexpire", &prune_worktrees_expire); git_config(git_default_config, NULL); @@ -291,8 +293,19 @@ static int report_last_gc_error(void) { struct strbuf sb = STRBUF_INIT; int ret; + struct stat st; + const char *gc_log_path = git_path("gc.log"); + + if (stat(gc_log_path, &st)) { + if (errno == ENOENT) + return 0; + return error(_("Can't read %s"), gc_log_path); + } + + if (time(NULL) - st.st_mtime > gc_max_log_age_seconds) + return 0; - ret = strbuf_read_file(&sb, git_path("gc.log"), 0); + ret = strbuf_read_file(&sb, gc_log_path, 0); if (ret > 0) return error(_("The last gc run reported the following. " "Please correct the root cause\n" diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh index 1762dfa6a..b69c2c190 100755 --- a/t/t6500-gc.sh +++ b/t/t6500-gc.sh @@ -67,5 +67,18 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre test_line_count = 2 new # There is one new pack and its .idx ' +test_expect_success 'background auto gc does not run if gc.log is present and recent but does if it is old' ' + keep=$(ls .git/objects/pack/*.pack|head -1|sed -e "s/pack$/keep/") && + test_commit foo && + test_commit bar && + git repack && + test_config gc.autopacklimit 1 && + test_config gc.autodetach true && + echo fleem> .git/gc.log && + test_must_fail git gc --auto 2>err && + test_i18ngrep "^error:" err && + test-chmtime =-86401 .git/gc.log && + git gc --auto +' test_done -- 2.11.GIT ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] gc: ignore old gc.log files 2017-02-09 2:02 [PATCH] gc: ignore old gc.log files David Turner @ 2017-02-09 3:23 ` Jeff King 2017-02-09 23:57 ` Philip Oakley 2017-02-09 4:51 ` Eric Wong 1 sibling, 1 reply; 5+ messages in thread From: Jeff King @ 2017-02-09 3:23 UTC (permalink / raw) To: David Turner; +Cc: git, pclouds On Wed, Feb 08, 2017 at 09:02:22PM -0500, David Turner wrote: > The intent of automatic gc is to have a git repository be relatively > low-maintenance from a server-operator perspective. Of course, large > operators like GitHub will need a more complicated management strategy, > but for ordinary usage, git should just work. > > In this commit, git learns to ignore gc.log files which are older than > (by default) one day old. It also learns about a config, gc.maxLogAge > to manage this. > > So git should never get itself into a state where it refuses to do any > maintenance, just because at some point some piece of the maintenance > didn't make progress. That might still happen (e.g. because the repo > is corrupt), but at the very least it won't be because Git is too dumb > to try again. Sounds like a good goal and approach. > +gc.maxLogAge:: > + If the file gc.log exists, then `git gc --auto` won't run > + unless that file is more than maxLogAge seconds old. Default > + is 86400, one day. For other time-based config, we use approxidate with a relative time, like "1 day ago". I think it would make sense for this to match, as it makes the config a little more readable. You can follow the prune_expire example which is right below your new config variable in all of the hunks of your patch. Though I think ultimately that isn't parsed inside gc, so you'd eventually look at how "prune --expire" is handled (which I think is via parse_expiry_date()). > @@ -291,8 +293,19 @@ static int report_last_gc_error(void) > { > struct strbuf sb = STRBUF_INIT; > int ret; > + struct stat st; > + const char *gc_log_path = git_path("gc.log"); I usually try to avoid assigning the result of git_path(), because it's a static buffer. In this case it's fine, because you don't call any complex calls during the lifetime of the variable. But I consider assigning to be a bad practice. Using git_pathdup() or git_path_buf() is safer (but you do need to remember to free() as appropriate). Speaking of leaks, I think report_last_gc_error() will always leak the strbuf contents when it is non-empty. > + if (stat(gc_log_path, &st)) { > + if (errno == ENOENT) > + return 0; > + return error(_("Can't read %s"), gc_log_path); > + } I'm not quite clear on the life-cycle of this log file. I think the current code works like this: - if a non-empty gc.log exists, we bail - when we daemonize, we take a lock via gc.log.lock - if we wrote anything to the lockfile log, then we move it into place (essentially blocking further auto-gc) - otherwise, we rollback the lockfile and leave gc.log untouched That leaves a few quirks with your new strategy: - if our new run was unsuccessful (as judged by having non-empty log output), we'd probably want to overwrite the old logfile with our new log. Following the current-code logic we do, which is good. - if our new run is successful (empty log), we'll leave the old, crufty log in place. Probably process_log_file() should unlink() the original gc.log while holding the lock. And here are a few things I noticed that I think aren't caused by your patch, but are in the same area and might be worth examining: - we're not very atomic. After a day, two simultaneous auto-gc's might both ignore the gc.log file and continue to run. I don't think it matters, though. One of them will win the race to pick up gc.log.lock, and the other will immediately fail. - It looks like we clear the gc.log file only under another detached auto-gc. It seems like manually running a successful "git gc" should clear it, too. - We block further gc only based on the presence of data on stderr from the sub-programs. But _not_ if the sub-programs fail. So a program silently exiting with code 128 will stop further gc processing, but not prevent another auto-gc from running. Which is...weird. Maybe this can't happen, though, because I think we write our own error() in such cases, which makes such a failure inherently non-silent. > + if (time(NULL) - st.st_mtime > gc_max_log_age_seconds) > + return 0; Hmm. What happens if the file has a timestamp in the future due to clock skew? As long as time_t is signed, I think it's OK, but if it wraps, it does the wrong thing here. You could rearrange the subtraction to handle this. But I think if you start using approxidate, it will give you an a cutoff time, and you can just do: if (st.st_mtime < gc_log_expiration) return 0; /* too old to care about */ > - ret = strbuf_read_file(&sb, git_path("gc.log"), 0); > + ret = strbuf_read_file(&sb, gc_log_path, 0); I would have written this as an open(), followed by an fstat() on the file we opened, and then finally reading the contents if it's fresh enough. But I'm not sure if that level of atomicity matters. We're not doing any of this under lock. -Peff ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] gc: ignore old gc.log files 2017-02-09 3:23 ` Jeff King @ 2017-02-09 23:57 ` Philip Oakley 2017-02-10 0:36 ` Jeff King 0 siblings, 1 reply; 5+ messages in thread From: Philip Oakley @ 2017-02-09 23:57 UTC (permalink / raw) To: Jeff King, David Turner; +Cc: git, pclouds From: "Jeff King" <peff@peff.net> > On Wed, Feb 08, 2017 at 09:02:22PM -0500, David Turner wrote: > >> The intent of automatic gc is to have a git repository be relatively >> low-maintenance from a server-operator perspective. Of course, large >> operators like GitHub will need a more complicated management strategy, >> but for ordinary usage, git should just work. >> >> In this commit, git learns to ignore gc.log files which are older than >> (by default) one day old. It also learns about a config, gc.maxLogAge >> to manage this. >> >> So git should never get itself into a state where it refuses to do any >> maintenance, just because at some point some piece of the maintenance >> didn't make progress. That might still happen (e.g. because the repo >> is corrupt), but at the very least it won't be because Git is too dumb >> to try again. > > Sounds like a good goal and approach. > >> +gc.maxLogAge:: >> + If the file gc.log exists, then `git gc --auto` won't run >> + unless that file is more than maxLogAge seconds old. Default >> + is 86400, one day. Is there a reason why one day is chosen? If maintenance staff are available 24/7 then a shorter time would be appropriate, but if it's a 5 day work week then they may want longer. Is there a particular case it targets? > > For other time-based config, we use approxidate with a relative time, > like "1 day ago". I think it would make sense for this to match, as it > makes the config a little more readable. > > You can follow the prune_expire example which is right below your new > config variable in all of the hunks of your patch. Though I think > ultimately that isn't parsed inside gc, so you'd eventually look at how > "prune --expire" is handled (which I think is via parse_expiry_date()). > [...] Philip ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] gc: ignore old gc.log files 2017-02-09 23:57 ` Philip Oakley @ 2017-02-10 0:36 ` Jeff King 0 siblings, 0 replies; 5+ messages in thread From: Jeff King @ 2017-02-10 0:36 UTC (permalink / raw) To: Philip Oakley; +Cc: David Turner, git, pclouds On Thu, Feb 09, 2017 at 11:57:12PM -0000, Philip Oakley wrote: > > > +gc.maxLogAge:: > > > + If the file gc.log exists, then `git gc --auto` won't run > > > + unless that file is more than maxLogAge seconds old. Default > > > + is 86400, one day. > > Is there a reason why one day is chosen? If maintenance staff are available > 24/7 then a shorter time would be appropriate, but if it's a 5 day work week > then they may want longer. Is there a particular case it targets? I'm pretty sure the one-day time limit isn't scientific. It's just a number we've been throwing around. I'm not sure what maintenance staff matters, though. It basically needs long enough that we're not doing _too_ many fruitless gc's, because it wastes resources. But you'd prefer to not go too long without a gc for a repository that needs it. The root cause of the error could be any number of issues. But for the case that David cares about most, you basically want to keep trying until the too-many-objects condition goes away. That's usually on a 2-week timer. So trying once per day to see if the 2-week timer feels about right. That's certainly not science, but hopefully it at least frames the general ballpark. One possible option would be to auto-scale it with the pruneExpire time. I don't know if people actually tweak that value or not. -Peff ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] gc: ignore old gc.log files 2017-02-09 2:02 [PATCH] gc: ignore old gc.log files David Turner 2017-02-09 3:23 ` Jeff King @ 2017-02-09 4:51 ` Eric Wong 1 sibling, 0 replies; 5+ messages in thread From: Eric Wong @ 2017-02-09 4:51 UTC (permalink / raw) To: David Turner; +Cc: git, peff, pclouds David Turner <dturner@twosigma.com> wrote: > + echo fleem> .git/gc.log && A minor nit: echo fleem >.git/gc.log && Otherwise, it's good to know there's attention paid to this issue. I've been ignoring cron mails from my mirrors for too long :x Thanks. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2017-02-10 0:36 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-02-09 2:02 [PATCH] gc: ignore old gc.log files David Turner 2017-02-09 3:23 ` Jeff King 2017-02-09 23:57 ` Philip Oakley 2017-02-10 0:36 ` Jeff King 2017-02-09 4:51 ` Eric Wong
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).