git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] gc: ignore old gc.log files
@ 2017-02-09  2:02 David Turner
  2017-02-09  3:23 ` Jeff King
  2017-02-09  4:51 ` Eric Wong
  0 siblings, 2 replies; 5+ messages in thread
From: David Turner @ 2017-02-09  2:02 UTC (permalink / raw)
  To: git; +Cc: peff, pclouds, David Turner

The intent of automatic gc is to have a git repository be relatively
low-maintenance from a server-operator perspective.  Of course, large
operators like GitHub will need a more complicated management strategy,
but for ordinary usage, git should just work.

In this commit, git learns to ignore gc.log files which are older than
(by default) one day old.  It also learns about a config, gc.maxLogAge
to manage this.

So git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.  That might still happen (e.g. because the repo
is corrupt), but at the very least it won't be because Git is too dumb
to try again.

Signed-off-by: David Turner <dturner@twosigma.com>
Helped-by: Jeff King <peff@peff.net>
---
 Documentation/config.txt |  5 +++++
 builtin/gc.c             | 15 ++++++++++++++-
 t/t6500-gc.sh            | 13 +++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..6751371cf 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,11 @@ gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
 	if the system supports it. Default is true.
 
+gc.maxLogAge::
+	If the file gc.log exists, then `git gc --auto` won't run
+	unless that file is more than maxLogAge seconds old.  Default
+	is 86400, one day.
+
 gc.packRefs::
 	Running `git pack-refs` in a repository renders it
 	unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..62fc84815 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,7 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static int gc_max_log_age_seconds = 86400;
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -111,6 +112,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_int("gc.maxlogage", &gc_max_log_age_seconds);
 	git_config_date_string("gc.pruneexpire", &prune_expire);
 	git_config_date_string("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config(git_default_config, NULL);
@@ -291,8 +293,19 @@ static int report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
 	int ret;
+	struct stat st;
+	const char *gc_log_path = git_path("gc.log");
+
+	if (stat(gc_log_path, &st)) {
+		if (errno == ENOENT)
+			return 0;
+		return error(_("Can't read %s"), gc_log_path);
+	}
+
+	if (time(NULL) - st.st_mtime > gc_max_log_age_seconds)
+		return 0;
 
-	ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
+	ret = strbuf_read_file(&sb, gc_log_path, 0);
 	if (ret > 0)
 		return error(_("The last gc run reported the following. "
 			       "Please correct the root cause\n"
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 1762dfa6a..b69c2c190 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -67,5 +67,18 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
 	test_line_count = 2 new # There is one new pack and its .idx
 '
 
+test_expect_success 'background auto gc does not run if gc.log is present and recent but does if it is old' '
+	keep=$(ls .git/objects/pack/*.pack|head -1|sed -e "s/pack$/keep/") &&
+	test_commit foo &&
+	test_commit bar &&
+	git repack &&
+	test_config gc.autopacklimit 1 &&
+	test_config gc.autodetach true &&
+	echo fleem> .git/gc.log &&
+	test_must_fail git gc --auto 2>err &&
+	test_i18ngrep "^error:" err &&
+	test-chmtime =-86401 .git/gc.log &&
+	git gc --auto
+'
 
 test_done
-- 
2.11.GIT


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] gc: ignore old gc.log files
  2017-02-09  2:02 [PATCH] gc: ignore old gc.log files David Turner
@ 2017-02-09  3:23 ` Jeff King
  2017-02-09 23:57   ` Philip Oakley
  2017-02-09  4:51 ` Eric Wong
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2017-02-09  3:23 UTC (permalink / raw)
  To: David Turner; +Cc: git, pclouds

On Wed, Feb 08, 2017 at 09:02:22PM -0500, David Turner wrote:

> The intent of automatic gc is to have a git repository be relatively
> low-maintenance from a server-operator perspective.  Of course, large
> operators like GitHub will need a more complicated management strategy,
> but for ordinary usage, git should just work.
> 
> In this commit, git learns to ignore gc.log files which are older than
> (by default) one day old.  It also learns about a config, gc.maxLogAge
> to manage this.
> 
> So git should never get itself into a state where it refuses to do any
> maintenance, just because at some point some piece of the maintenance
> didn't make progress.  That might still happen (e.g. because the repo
> is corrupt), but at the very least it won't be because Git is too dumb
> to try again.

Sounds like a good goal and approach.

> +gc.maxLogAge::
> +	If the file gc.log exists, then `git gc --auto` won't run
> +	unless that file is more than maxLogAge seconds old.  Default
> +	is 86400, one day.

For other time-based config, we use approxidate with a relative time,
like "1 day ago". I think it would make sense for this to match, as it
makes the config a little more readable.

You can follow the prune_expire example which is right below your new
config variable in all of the hunks of your patch. Though I think
ultimately that isn't parsed inside gc, so you'd eventually look at how
"prune --expire" is handled (which I think is via parse_expiry_date()).

> @@ -291,8 +293,19 @@ static int report_last_gc_error(void)
>  {
>  	struct strbuf sb = STRBUF_INIT;
>  	int ret;
> +	struct stat st;
> +	const char *gc_log_path = git_path("gc.log");

I usually try to avoid assigning the result of git_path(), because it's
a static buffer. In this case it's fine, because you don't call any
complex calls during the lifetime of the variable. But I consider
assigning to be a bad practice.

Using git_pathdup() or git_path_buf() is safer (but you do need to
remember to free() as appropriate).

Speaking of leaks, I think report_last_gc_error() will always leak the
strbuf contents when it is non-empty.

> +	if (stat(gc_log_path, &st)) {
> +		if (errno == ENOENT)
> +			return 0;
> +		return error(_("Can't read %s"), gc_log_path);
> +	}

I'm not quite clear on the life-cycle of this log file.

I think the current code works like this:

  - if a non-empty gc.log exists, we bail

  - when we daemonize, we take a lock via gc.log.lock

  - if we wrote anything to the lockfile log, then we move it into place
    (essentially blocking further auto-gc)

  - otherwise, we rollback the lockfile and leave gc.log untouched

That leaves a few quirks with your new strategy:

  - if our new run was unsuccessful (as judged by having non-empty log
    output), we'd probably want to overwrite the old logfile with our
    new log. Following the current-code logic we do, which is good.

  - if our new run is successful (empty log), we'll leave the old,
    crufty log in place. Probably process_log_file() should unlink() the
    original gc.log while holding the lock.

And here are a few things I noticed that I think aren't caused by your
patch, but are in the same area and might be worth examining:

  - we're not very atomic. After a day, two simultaneous auto-gc's might
    both ignore the gc.log file and continue to run. I don't think it
    matters, though. One of them will win the race to pick up
    gc.log.lock, and the other will immediately fail.

  - It looks like we clear the gc.log file only under another detached
    auto-gc. It seems like manually running a successful "git gc" should
    clear it, too.

  - We block further gc only based on the presence of data on stderr
    from the sub-programs. But _not_ if the sub-programs fail. So a
    program silently exiting with code 128 will stop further gc
    processing, but not prevent another auto-gc from running. Which
    is...weird. Maybe this can't happen, though, because I think we
    write our own error() in such cases, which makes such a failure
    inherently non-silent.

> +	if (time(NULL) - st.st_mtime > gc_max_log_age_seconds)
> +		return 0;

Hmm. What happens if the file has a timestamp in the future due to clock
skew? As long as time_t is signed, I think it's OK, but if it wraps, it
does the wrong thing here. You could rearrange the subtraction to handle
this. But I think if you start using approxidate, it will give you an
a cutoff time, and you can just do:

  if (st.st_mtime < gc_log_expiration)
	return 0; /* too old to care about */

> -	ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
> +	ret = strbuf_read_file(&sb, gc_log_path, 0);

I would have written this as an open(), followed by an fstat() on the
file we opened, and then finally reading the contents if it's fresh
enough. But I'm not sure if that level of atomicity matters. We're not
doing any of this under lock.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gc: ignore old gc.log files
  2017-02-09  2:02 [PATCH] gc: ignore old gc.log files David Turner
  2017-02-09  3:23 ` Jeff King
@ 2017-02-09  4:51 ` Eric Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2017-02-09  4:51 UTC (permalink / raw)
  To: David Turner; +Cc: git, peff, pclouds

David Turner <dturner@twosigma.com> wrote:
> +	echo fleem> .git/gc.log &&

A minor nit:

	echo fleem >.git/gc.log &&

Otherwise, it's good to know there's attention paid to this
issue.  I've been ignoring cron mails from my mirrors for too
long :x  Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gc: ignore old gc.log files
  2017-02-09  3:23 ` Jeff King
@ 2017-02-09 23:57   ` Philip Oakley
  2017-02-10  0:36     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Philip Oakley @ 2017-02-09 23:57 UTC (permalink / raw)
  To: Jeff King, David Turner; +Cc: git, pclouds

From: "Jeff King" <peff@peff.net>
> On Wed, Feb 08, 2017 at 09:02:22PM -0500, David Turner wrote:
>
>> The intent of automatic gc is to have a git repository be relatively
>> low-maintenance from a server-operator perspective.  Of course, large
>> operators like GitHub will need a more complicated management strategy,
>> but for ordinary usage, git should just work.
>>
>> In this commit, git learns to ignore gc.log files which are older than
>> (by default) one day old.  It also learns about a config, gc.maxLogAge
>> to manage this.
>>
>> So git should never get itself into a state where it refuses to do any
>> maintenance, just because at some point some piece of the maintenance
>> didn't make progress.  That might still happen (e.g. because the repo
>> is corrupt), but at the very least it won't be because Git is too dumb
>> to try again.
>
> Sounds like a good goal and approach.
>
>> +gc.maxLogAge::
>> + If the file gc.log exists, then `git gc --auto` won't run
>> + unless that file is more than maxLogAge seconds old.  Default
>> + is 86400, one day.

Is there a reason why one day is chosen? If maintenance staff are available 
24/7 then a shorter time would be appropriate, but if it's a 5 day work week 
then they may want longer. Is there a particular case it targets?

>
> For other time-based config, we use approxidate with a relative time,
> like "1 day ago". I think it would make sense for this to match, as it
> makes the config a little more readable.
>
> You can follow the prune_expire example which is right below your new
> config variable in all of the hunks of your patch. Though I think
> ultimately that isn't parsed inside gc, so you'd eventually look at how
> "prune --expire" is handled (which I think is via parse_expiry_date()).
>
[...]

Philip 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] gc: ignore old gc.log files
  2017-02-09 23:57   ` Philip Oakley
@ 2017-02-10  0:36     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2017-02-10  0:36 UTC (permalink / raw)
  To: Philip Oakley; +Cc: David Turner, git, pclouds

On Thu, Feb 09, 2017 at 11:57:12PM -0000, Philip Oakley wrote:

> > > +gc.maxLogAge::
> > > + If the file gc.log exists, then `git gc --auto` won't run
> > > + unless that file is more than maxLogAge seconds old.  Default
> > > + is 86400, one day.
> 
> Is there a reason why one day is chosen? If maintenance staff are available
> 24/7 then a shorter time would be appropriate, but if it's a 5 day work week
> then they may want longer. Is there a particular case it targets?

I'm pretty sure the one-day time limit isn't scientific. It's just a
number we've been throwing around.

I'm not sure what maintenance staff matters, though. It basically needs
long enough that we're not doing _too_ many fruitless gc's, because it
wastes resources. But you'd prefer to not go too long without a gc for a
repository that needs it.

The root cause of the error could be any number of issues. But for the
case that David cares about most, you basically want to keep trying
until the too-many-objects condition goes away. That's usually on a
2-week timer. So trying once per day to see if the 2-week timer feels
about right.

That's certainly not science, but hopefully it at least frames the
general ballpark.

One possible option would be to auto-scale it with the pruneExpire time.
I don't know if people actually tweak that value or not.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-02-10  0:36 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-09  2:02 [PATCH] gc: ignore old gc.log files David Turner
2017-02-09  3:23 ` Jeff King
2017-02-09 23:57   ` Philip Oakley
2017-02-10  0:36     ` Jeff King
2017-02-09  4:51 ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).