git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Git List" <git@vger.kernel.org>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Michael Haggerty" <mhagger@alum.mit.edu>,
	"Stefan Beller" <stefanbeller@gmail.com>,
	"Jeff King" <peff@peff.net>,
	"Jonathan Nieder" <jrnieder@gmail.com>
Subject: Re: BUG: Race condition due to reflog expiry in "gc"
Date: Wed, 13 Mar 2019 11:28:39 +0100
Message-ID: <87sgvrax0o.fsf@evledraar.gmail.com>
In-Reply-To: <xmqq1s3bh7ky.fsf@gitster-ct.c.googlers.com>


On Wed, Mar 13 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> I'm still working on fixing a race condition I encountered in "gc"
>> recently, but am not 100% sure of the fix. So I thought I'd send a
>> braindump of what I have so far in case it jolts any memories.
>>
>> The problem is that when we "gc" we run "reflog expire --all". This
>> iterates over the reflogs, and takes a *.lock for each reference.
>>
>> It'll fail intermittently in two ways:
>>
>>  1. If something is concurrently committing to the repo it'll fail
>>     because we for a tiny amount of time hold a HEAD.lock file, so HEAD
>>     can't be updated.
>>
>>  2. On a repository that's just being "git fetch"'d by some concurrent
>>     process the "gc" will fail, because the ref's SHA1 has changed,
>>     which we inspect as we acquire the lock.
>
> Both sound like very much expected and expectable outcomes.  I am
> not sure how they need to be called bugs.

Let's leave aside that I started the subject with "BUG:" and let me
rephrase.

I was under the impression that git-gc was supposed to support operating
on a repository that's concurrently being modified, as long as you don't
set the likes of gc.pruneExpire too aggressively.

When running "gc" in a loop without "git reflog expire --all", and
watching the repository being GC'd with:

    fswatch -l 0.1 -t -r . 2>&1 | grep lock

we only create .git/MERGE_RR.lock, .git/gc.pid.lock and
.git/packed-refs.lock. These are all things that would only cause another
concurrent GC to fail, not a normal git command.

So the only reason a concurrent commit (case #1) fails is because of the
refs being locked during the reflog iteration, and similarly "gc" itself
will fail due to a concurrently updating ref (case #2).
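To make case #1 concrete, here is a minimal sketch of the general
lockfile idea: whoever creates "<path>.lock" exclusively holds the lock,
and anyone else fails until it is removed. This is only an illustration
and not git's lockfile code; take_lock() and release_lock() are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to take "<path>.lock" by creating it exclusively, so that a
 * second taker fails with EEXIST while the first still holds it.
 * Returns an open fd on success, or -1 with errno set.
 */
int take_lock(const char *path, char *lock_path, size_t len)
{
	int n;

	assert(path && lock_path);
	n = snprintf(lock_path, len, "%s.lock", path);
	if (n < 0 || (size_t)n >= len) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
}

/* Release the lock by closing and removing the lockfile. */
void release_lock(int fd, const char *lock_path)
{
	close(fd);
	unlink(lock_path);
}
```

In this model a writer that wants to update HEAD while the expiry holds
HEAD.lock gets EEXIST from take_lock(), however briefly the lock is held.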

It seems that first of all we need this; I'll submit it as a separate
patch sometime soon:

    diff --git a/builtin/gc.c b/builtin/gc.c
    index 020f725acc..ae488646e1 100644
    --- a/builtin/gc.c
    +++ b/builtin/gc.c
    @@ -127,6 +127,12 @@ static void gc_config(void)
     			pack_refs = git_config_bool("gc.packrefs", value);
     	}

    +	if (!git_config_get_value("gc.reflogexpire", &value) && value &&
    +	    !strcmp(value, "never") &&
    +	    !git_config_get_value("gc.reflogexpireunreachable", &value) && value &&
    +	    !strcmp(value, "never"))
    +		prune_reflogs = 0;
    +
     	git_config_get_int("gc.aggressivewindow", &aggressive_window);
     	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
     	git_config_get_int("gc.auto", &gc_auto_threshold);

I.e. now even if your gc.* config says you don't want the reflogs
touched, we still pointlessly iterate over all of them. The case I'm
running into (a variant of #2) is one solved by that patch, i.e. I'm
fine with "gc" just keeping the reflogs forever as a workaround in this
case.
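The condition in that patch can be restated as a small standalone
predicate. This is a sketch only: should_prune_reflogs() is a
hypothetical helper, and NULL stands in for "not configured" rather than
going through git's config API.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if reflog expiry should run, 0 if it can be skipped.
 * Expiry is skipped only when BOTH settings are the literal string
 * "never"; an unset value (NULL) or any other value keeps it enabled.
 */
int should_prune_reflogs(const char *expire, const char *expire_unreachable)
{
	int both_never =
		expire && !strcmp(expire, "never") &&
		expire_unreachable && !strcmp(expire_unreachable, "never");

	return !both_never;
}
```

Setting only one of the two values to "never" is not enough under this
rule, since the other setting could still expire entries.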

Something like that should have been added back in 62aad1849f ("gc
--auto: do not lock refs in the background", 2014-05-25), i.e. now the
"prune_reflogs" variable is never used, it's just cargo-culted from a
copy/pasting of the "pack_refs" code.

In other "gc" phases in "pack-objects" and "prune" we also look at the
reflogs. This obviously bad patch ignores them entirely:

    diff --git a/builtin/prune.c b/builtin/prune.c
    index 97613eccb5..bccee7813e 100644
    --- a/builtin/prune.c
    +++ b/builtin/prune.c
    @@ -41,7 +41,7 @@ static void perform_reachability_traversal(struct rev_info *revs)

     	if (show_progress)
     		progress = start_delayed_progress(_("Checking connectivity"), 0);
    -	mark_reachable_objects(revs, 1, expire, progress);
    +	mark_reachable_objects(revs, 0, expire, progress);
     	stop_progress(&progress);
     	initialized = 1;
     }
    diff --git a/builtin/repack.c b/builtin/repack.c
    index 67f8978043..618ffbfe0a 100644
    --- a/builtin/repack.c
    +++ b/builtin/repack.c
    @@ -364,7 +364,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
     				 keep_pack_list.items[i].string);
     	argv_array_push(&cmd.args, "--non-empty");
     	argv_array_push(&cmd.args, "--all");
    -	argv_array_push(&cmd.args, "--reflog");
     	argv_array_push(&cmd.args, "--indexed-objects");
     	if (repository_format_partial_clone)
     		argv_array_push(&cmd.args, "--exclude-promisor-objects");

I'm just including that as illustration that add_reflogs_to_pending() in
revision.c during "gc" already iterates over the reflogs without locking
anything, but of course it's just reading them.

So one thing that would mitigate things a lot is if
files_reflog_expire() and its call to expire_reflog_ent() via
refs_for_each_reflog_ent() would lazily acquire the lock on the ref.

Digging a bit further that's actually what we're doing now since
4ff0f01cb7 ("refs: retry acquiring reference locks for 100ms",
2017-08-21).

But this runs into the logic we've had for a long time, or since your
bda3a31cc7 ("reflog-expire: Avoid creating new files in a directory
inside readdir(3) loop", 2008-01-25) where we first loop over all the
refs in the process of finding the reflogs, and then will try to lock
those refs at those expected SHA-1s. If they've changed in the meantime
we error out and don't clean up the lockfile.

So just this fixes that:

    diff --git a/refs/files-backend.c b/refs/files-backend.c
    index ef053f716c..b6576f28b9 100644
    --- a/refs/files-backend.c
    +++ b/refs/files-backend.c
    @@ -3037,7 +3037,7 @@ static int files_reflog_expire(struct ref_store *ref_store,
     	 * reference itself, plus we might need to update the
     	 * reference if --updateref was specified:
     	 */
    -	lock = lock_ref_oid_basic(refs, refname, oid,
    +	lock = lock_ref_oid_basic(refs, refname, NULL,
     				  NULL, NULL, REF_NO_DEREF,
     				  &type, &err);
     	if (!lock) {

Which seems sensible to me. We'll still get the lock, we just don't
assert that the refname we use to get the lock must be at that
SHA-1. We'll still use it for the purposes of expiry.
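As a toy model of the difference between passing the OID and passing
NULL, consider a lock function that only verifies the ref's value when
it is given one. Everything here (struct toy_ref, toy_lock_ref()) is
hypothetical and much simpler than lock_ref_oid_basic(); it only shows
the shape of the check.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a ref: a name, the object ID it points at, a lock bit. */
struct toy_ref {
	const char *name;
	const char *oid_hex;
	int locked;
};

/*
 * Lock "ref". If "expected_oid" is non-NULL, also insist that the ref
 * still points there, failing with EBUSY otherwise. Passing NULL skips
 * that verification and only takes the lock.
 * Returns 0 on success, -1 with errno set on failure.
 */
int toy_lock_ref(struct toy_ref *ref, const char *expected_oid)
{
	if (ref->locked) {
		errno = EEXIST;
		return -1;
	}
	if (expected_oid && strcmp(ref->oid_hex, expected_oid)) {
		errno = EBUSY;
		return -1;
	}
	ref->locked = 1;
	return 0;
}
```

With a ref that moved between enumeration and locking, the call with the
stale OID fails while the call with NULL succeeds, which is the
behavior change the one-line patch is after.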

But maybe I've missed some caveat in reflog_expiry_prepare() and friends
and we really do need the reflog at that OID. If so, this would suck less:

    diff --git a/builtin/reflog.c b/builtin/reflog.c
    index 4d3430900d..4bb272fdc8 100644
    --- a/builtin/reflog.c
    +++ b/builtin/reflog.c
    @@ -625,12 +625,16 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
     		free_worktrees(worktrees);
     		for (i = 0; i < collected.nr; i++) {
     			struct collected_reflog *e = collected.e[i];
    +			int st;
     			set_reflog_expiry_param(&cb.cmd, explicit_expiry, e->reflog);
    -			status |= reflog_expire(e->reflog, &e->oid, flags,
    -						reflog_expiry_prepare,
    -						should_expire_reflog_ent,
    -						reflog_expiry_cleanup,
    -						&cb);
    +			st = reflog_expire(e->reflog, &e->oid, flags,
    +					   reflog_expiry_prepare,
    +					   should_expire_reflog_ent,
    +					   reflog_expiry_cleanup,
    +					   &cb);
    +			if (st == -2)
    +				continue;
    +			status |= st;
     			free(e);
     		}
     		free(collected.e);
    diff --git a/refs/files-backend.c b/refs/files-backend.c
    index ef053f716c..8b0b6b7b85 100644
    --- a/refs/files-backend.c
    +++ b/refs/files-backend.c
    @@ -3041,6 +3041,11 @@ static int files_reflog_expire(struct ref_store *ref_store,
     				  NULL, NULL, REF_NO_DEREF,
     				  &type, &err);
     	if (!lock) {
    +		if (errno == EBUSY) {
    +			warning("cannot lock ref '%s': %s. Skipping!", refname, err.buf);
    +			strbuf_release(&err);
    +			return -2;
    +		}
     		error("cannot lock ref '%s': %s", refname, err.buf);
     		strbuf_release(&err);
     		return -1;

I.e. we just detect the EBUSY that verify_lock() sets if the OID doesn't
match, and don't prune that reflog. As seen above "pack-objects" and
"prune" will still iterate over the same logs later for the purposes of
reachability, so this shouldn't get us into a corrupt state due to
throwing away objects referenced in those logs, we'll just prune fewer
things than we could have.
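The skip-and-continue handling can be sketched outside of git like
this. The names EXPIRE_SKIPPED, expire_all() and demo_expire() are
assumptions made for this example; only the -2 convention and the
"status |= st" accumulation come from the patch above.

```c
#include <stddef.h>
#include <string.h>

#define EXPIRE_SKIPPED (-2) /* hypothetical name for the patch's -2 */

/* Per-reflog expiry callback: 0 on success, -1 on error, -2 if skipped. */
typedef int (*expire_fn)(const char *reflog);

/*
 * Expire each reflog, OR-ing real failures into the status but
 * ignoring skipped entries. *skipped counts the reflogs left alone.
 */
int expire_all(const char **reflogs, size_t nr, expire_fn expire,
	       size_t *skipped)
{
	int status = 0;
	size_t i;

	*skipped = 0;
	for (i = 0; i < nr; i++) {
		int st = expire(reflogs[i]);

		if (st == EXPIRE_SKIPPED) {
			(*skipped)++;
			continue;
		}
		status |= st;
	}
	return status;
}

/* Illustrative stand-in: one ref is busy, one is broken, the rest are fine. */
int demo_expire(const char *reflog)
{
	if (!strcmp(reflog, "refs/heads/busy"))
		return EXPIRE_SKIPPED;
	if (!strcmp(reflog, "refs/heads/broken"))
		return -1;
	return 0;
}
```

A busy ref then leaves the overall status at 0, while a genuine error
still makes the whole run report failure.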

So I think I'll use the first patch noted above as a hack to solve the
narrow problem I have now, but any comments on the above are most
welcome. I'm not very familiar with the ref code in case that wasn't
obvious already.

B.t.w. the mention of f3b661f766 ("expire_reflog(): use a lock_file for
rewriting the reflog file", 2014-12-12) upthread is irrelevant. That's a
commit where we use the lockfile code to write out the *new* reflog,
which is unrelated to all of this.

