[PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them
@ 2019-05-10  8:41 Eric Rannaud
  2019-05-15  5:30 ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Rannaud @ 2019-05-10  8:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeremy Serror, Eric Rannaud

At a checkpoint, fast-import resets all branches and tags on disk to the
OID that we have in memory. If --force is not given, only branch resets
that result in fast forwards with respect to the state on disk are
allowed.

With this approach, even for refs that fast-import has not been
commanded to change for a long time (or ever), at each checkpoint, we
will systematically reset them to the last state this process knows
about, a state which may have been set before the previous checkpoint.
Any changes made to these refs by a different process since the last
checkpoint will be overwritten.

1> is one process, 2> is another:

	1> $ git fast-import --force
	1> reset refs/heads/master
	1> from $A
	1> checkpoint
	2> $ git branch -f refs/heads/master $B
	1> reset refs/heads/tip
	1> from $C
	1> checkpoint

At this point master points again to $A.

This problem is mitigated somewhat for branches when --force is not
specified (the default), as requiring a fast forward means in practice
that concurrent external changes are likely to be preserved. But it's
not foolproof either:

	1> $ git fast-import
	1> reset refs/heads/master
	1> from $A
	1> checkpoint
	2> $ git branch -f refs/heads/master refs/heads/master~1
	1> reset refs/heads/tip
	1> from $C
	1> checkpoint

At this point, master points again to $A, not $A~1 as requested by the
second process.

We now keep track of whether branches and tags have been changed by this
fast-import process since our last checkpoint (or startup). At the next
checkpoint, only refs and tags that new commands have changed are
written to disk.
---
 fast-import.c          |  14 ++++++
 t/t9300-fast-import.sh | 106 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 104 insertions(+), 16 deletions(-)


v2 and v1 were sent in Oct 2018. Only difference since is that the new
test cases in t9300 don't use sleep to implement the asynchronous dance.

Thanks.


diff --git a/fast-import.c b/fast-import.c
index f38d04fa5851..e9c3ea23ae42 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -103,6 +103,7 @@ struct branch {
 	uintmax_t num_notes;
 	unsigned active : 1;
 	unsigned delete : 1;
+	unsigned changed : 1;
 	unsigned pack_id : PACK_ID_BITS;
 	struct object_id oid;
 };
@@ -110,6 +111,7 @@ struct branch {
 struct tag {
 	struct tag *next_tag;
 	const char *name;
+	unsigned changed : 1;
 	unsigned int pack_id;
 	struct object_id oid;
 };
@@ -581,6 +583,7 @@ static struct branch *new_branch(const char *name)
 	b->branch_tree.versions[1].mode = S_IFDIR;
 	b->num_notes = 0;
 	b->active = 0;
+	b->changed = 0;
 	b->pack_id = MAX_PACK_ID;
 	branch_table[hc] = b;
 	branch_count++;
@@ -1571,6 +1574,10 @@ static int update_branch(struct branch *b)
 	struct object_id old_oid;
 	struct strbuf err = STRBUF_INIT;
 
+	if (!b->changed)
+		return 0;
+	b->changed = 0;
+
 	if (is_null_oid(&b->oid)) {
 		if (b->delete)
 			delete_ref(NULL, b->name, NULL, 0);
@@ -1636,6 +1643,10 @@ static void dump_tags(void)
 		goto cleanup;
 	}
 	for (t = first_tag; t; t = t->next_tag) {
+		if (!t->changed)
+			continue;
+		t->changed = 0;
+
 		strbuf_reset(&ref_name);
 		strbuf_addf(&ref_name, "refs/tags/%s", t->name);
 
@@ -2679,6 +2690,7 @@ static void parse_new_commit(const char *arg)
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;
+	b->changed = 1;
 	b->last_commit = object_count_by_type[OBJ_COMMIT];
 }
 
@@ -2763,6 +2775,7 @@ static void parse_new_tag(const char *arg)
 		t->pack_id = MAX_PACK_ID;
 	else
 		t->pack_id = pack_id;
+	t->changed = 1;
 }
 
 static void parse_reset_branch(const char *arg)
@@ -2783,6 +2796,7 @@ static void parse_reset_branch(const char *arg)
 		b = new_branch(arg);
 	read_next_command();
 	parse_from(b);
+	b->changed = 1;
 	if (command_buf.len > 0)
 		unread_command_buf = 1;
 }
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c4046..12abaebb22b6 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3121,14 +3121,23 @@ test_expect_success 'U: validate root delete result' '
 ### series V (checkpoint)
 ###
 
-# The commands in input_file should not produce any output on the file
-# descriptor set with --cat-blob-fd (or stdout if unspecified).
+# Instructions can be sent from $input_file to background_import() loop via the
+# fast-import progress command.
+# 
+# 	progress do: shell
+# Parse the next progress line and invoke it as a shell command.
+# 
+# 	progress do: checkpoint and stop
+# Send checkpoint and wait for its completion.
+# 
+# 	progress do: stop
+# Internal instruction.
 #
 # To make sure you're observing the side effects of checkpoint *before*
 # fast-import terminates (and thus writes out its state), check that the
 # fast-import process is still running using background_import_still_running
 # *after* evaluating the test conditions.
-background_import_then_checkpoint () {
+background_import () {
 	options=$1
 	input_file=$2
 
@@ -3153,22 +3162,30 @@ background_import_then_checkpoint () {
 	# pipes writing sequence. We want to assume that the write below could
 	# block, e.g. if fast-import blocks writing its own output to &9
 	# because there is no reader on &9 yet.
-	(
-		cat "$input_file"
-		echo "checkpoint"
-		echo "progress checkpoint"
-	) >&8 &
+	cat "$input_file" >&8 &
 
 	error=1 ;# assume the worst
 	while read output <&9
 	do
-		if test "$output" = "progress checkpoint"
+		if test "$output" = "progress do: shell"
+		then
+			read output <&9
+			shell_cmd="$(echo $output |sed -e 's/^progress //')"
+			$shell_cmd
+		elif test "$output" = "progress do: checkpoint and stop"
+		then
+			(
+				echo "checkpoint"
+				echo "progress do: stop"
+			) >&8 &
+		elif test "$output" = "progress do: stop"
 		then
 			error=0
 			break
+		else
+			# otherwise ignore cruft
+			echo >&2 "cruft: $output"
 		fi
-		# otherwise ignore cruft
-		echo >&2 "cruft: $output"
 	done
 
 	if test $error -eq 1
@@ -3189,10 +3206,11 @@ test_expect_success PIPE 'V: checkpoint helper does not get stuck with extra out
 	cat >input <<-INPUT_END &&
 	progress foo
 	progress bar
+	progress do: checkpoint and stop
 
 	INPUT_END
 
-	background_import_then_checkpoint "" input &&
+	background_import "" input &&
 	background_import_still_running
 '
 
@@ -3201,9 +3219,11 @@ test_expect_success PIPE 'V: checkpoint updates refs after reset' '
 	reset refs/heads/V
 	from refs/heads/U
 
+	progress do: checkpoint and stop
+
 	INPUT_END
 
-	background_import_then_checkpoint "" input &&
+	background_import "" input &&
 	test "$(git rev-parse --verify V)" = "$(git rev-parse --verify U)" &&
 	background_import_still_running
 '
@@ -3216,9 +3236,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit' '
 	data 0
 	from refs/heads/U
 
+	progress do: checkpoint and stop
+
 	INPUT_END
 
-	background_import_then_checkpoint "--export-marks=marks.actual" input &&
+	background_import "--export-marks=marks.actual" input &&
 
 	echo ":1 $(git rev-parse --verify V)" >marks.expected &&
 
@@ -3237,9 +3259,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit (no
 	data 0
 	from refs/heads/U
 
+	progress do: checkpoint and stop
+
 	INPUT_END
 
-	background_import_then_checkpoint "--export-marks=marks.actual" input &&
+	background_import "--export-marks=marks.actual" input &&
 
 	echo ":2 $(git rev-parse --verify V2)" >marks.expected &&
 
@@ -3255,13 +3279,63 @@ test_expect_success PIPE 'V: checkpoint updates tags after tag' '
 	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
 	data 0
 
+	progress do: checkpoint and stop
+
 	INPUT_END
 
-	background_import_then_checkpoint "" input &&
+	background_import "" input &&
 	git show-ref -d Vtag &&
 	background_import_still_running
 '
 
+test_expect_success PIPE 'V: checkpoint only resets changed branches' '
+	cat >input <<-INPUT_END &&
+	reset refs/heads/V3
+	from refs/heads/U
+
+	checkpoint
+
+	progress do: shell
+	progress git branch -f V3 V
+
+	reset refs/heads/V4
+	from refs/heads/U
+
+	progress do: checkpoint and stop
+
+	INPUT_END
+
+	background_import "--force" input &&
+	test "$(git rev-parse --verify V3)" = "$(git rev-parse --verify V)" &&
+	background_import_still_running
+'
+
+test_expect_success PIPE 'V: checkpoint only updates changed tags' '
+	cat >input <<-INPUT_END &&
+	tag Vtag2
+	from refs/heads/U
+	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	data 0
+
+	checkpoint
+
+	progress do: shell
+	progress git tag -f Vtag2 V
+
+	tag Vtag3
+	from refs/heads/U
+	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	data 0
+
+	progress do: checkpoint and stop
+
+	INPUT_END
+
+	background_import "" input &&
+	test "$(git show-ref -d Vtag2 |tail -1 |awk \{print\ \$1\})" = "$(git rev-parse --verify V)" &&
+	background_import_still_running
+'
+
 ###
 ### series W (get-mark and empty orphan commits)
 ###
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them
  2019-05-10  8:41 [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them Eric Rannaud
@ 2019-05-15  5:30 ` Junio C Hamano
  2019-05-15  6:57   ` Eric Rannaud
  2019-05-16  0:35   ` Elijah Newren
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2019-05-15  5:30 UTC (permalink / raw)
  To: Eric Rannaud, Elijah Newren; +Cc: git, Jeremy Serror

"Eric Rannaud" <e@nanocritical.com> writes:

> We now keep track of whether branches and tags have been changed by this
> fast-import process since our last checkpoint (or startup). At the next
> checkpoint, only refs and tags that new commands have changed are
> written to disk.

And when we notice that we have been competing with some other
process and the ref that we wanted to update has already been
updated to a different value, what happens?  That should be part of
the description of the new behaviour (aka "fix") written here.

> ---

Missing sign-off.

It sounds like a worthwhile thing to do.

It seems that Elijah has been active in fast-import/export area in
the last 12 months, so let's borrow a second set of eyes from him.

Thanks.

>  fast-import.c          |  14 ++++++
>  t/t9300-fast-import.sh | 106 ++++++++++++++++++++++++++++++++++-------
>  2 files changed, 104 insertions(+), 16 deletions(-)
>
>
> v2 and v1 were sent in Oct 2018. Only difference since is that the new
> test cases in t9300 don't use sleep to implement the asynchronous dance.
>
> Thanks.
>
>
> diff --git a/fast-import.c b/fast-import.c
> index f38d04fa5851..e9c3ea23ae42 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -103,6 +103,7 @@ struct branch {
>  	uintmax_t num_notes;
>  	unsigned active : 1;
>  	unsigned delete : 1;
> +	unsigned changed : 1;
>  	unsigned pack_id : PACK_ID_BITS;
>  	struct object_id oid;
>  };
> @@ -110,6 +111,7 @@ struct branch {
>  struct tag {
>  	struct tag *next_tag;
>  	const char *name;
> +	unsigned changed : 1;
>  	unsigned int pack_id;
>  	struct object_id oid;
>  };
> @@ -581,6 +583,7 @@ static struct branch *new_branch(const char *name)
>  	b->branch_tree.versions[1].mode = S_IFDIR;
>  	b->num_notes = 0;
>  	b->active = 0;
> +	b->changed = 0;
>  	b->pack_id = MAX_PACK_ID;
>  	branch_table[hc] = b;
>  	branch_count++;
> @@ -1571,6 +1574,10 @@ static int update_branch(struct branch *b)
>  	struct object_id old_oid;
>  	struct strbuf err = STRBUF_INIT;
>  
> +	if (!b->changed)
> +		return 0;
> +	b->changed = 0;
> +
>  	if (is_null_oid(&b->oid)) {
>  		if (b->delete)
>  			delete_ref(NULL, b->name, NULL, 0);
> @@ -1636,6 +1643,10 @@ static void dump_tags(void)
>  		goto cleanup;
>  	}
>  	for (t = first_tag; t; t = t->next_tag) {
> +		if (!t->changed)
> +			continue;
> +		t->changed = 0;
> +
>  		strbuf_reset(&ref_name);
>  		strbuf_addf(&ref_name, "refs/tags/%s", t->name);
>  
> @@ -2679,6 +2690,7 @@ static void parse_new_commit(const char *arg)
>  
>  	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
>  		b->pack_id = pack_id;
> +	b->changed = 1;
>  	b->last_commit = object_count_by_type[OBJ_COMMIT];
>  }
>  
> @@ -2763,6 +2775,7 @@ static void parse_new_tag(const char *arg)
>  		t->pack_id = MAX_PACK_ID;
>  	else
>  		t->pack_id = pack_id;
> +	t->changed = 1;
>  }
>  
>  static void parse_reset_branch(const char *arg)
> @@ -2783,6 +2796,7 @@ static void parse_reset_branch(const char *arg)
>  		b = new_branch(arg);
>  	read_next_command();
>  	parse_from(b);
> +	b->changed = 1;
>  	if (command_buf.len > 0)
>  		unread_command_buf = 1;
>  }
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 3668263c4046..12abaebb22b6 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -3121,14 +3121,23 @@ test_expect_success 'U: validate root delete result' '
>  ### series V (checkpoint)
>  ###
>  
> -# The commands in input_file should not produce any output on the file
> -# descriptor set with --cat-blob-fd (or stdout if unspecified).
> +# Instructions can be sent from $input_file to background_import() loop via the
> +# fast-import progress command.
> +# 
> +# 	progress do: shell
> +# Parse the next progress line and invoke it as a shell command.
> +# 
> +# 	progress do: checkpoint and stop
> +# Send checkpoint and wait for its completion.
> +# 
> +# 	progress do: stop
> +# Internal instruction.
>  #
>  # To make sure you're observing the side effects of checkpoint *before*
>  # fast-import terminates (and thus writes out its state), check that the
>  # fast-import process is still running using background_import_still_running
>  # *after* evaluating the test conditions.
> -background_import_then_checkpoint () {
> +background_import () {
>  	options=$1
>  	input_file=$2
>  
> @@ -3153,22 +3162,30 @@ background_import_then_checkpoint () {
>  	# pipes writing sequence. We want to assume that the write below could
>  	# block, e.g. if fast-import blocks writing its own output to &9
>  	# because there is no reader on &9 yet.
> -	(
> -		cat "$input_file"
> -		echo "checkpoint"
> -		echo "progress checkpoint"
> -	) >&8 &
> +	cat "$input_file" >&8 &
>  
>  	error=1 ;# assume the worst
>  	while read output <&9
>  	do
> -		if test "$output" = "progress checkpoint"
> +		if test "$output" = "progress do: shell"
> +		then
> +			read output <&9
> +			shell_cmd="$(echo $output |sed -e 's/^progress //')"
> +			$shell_cmd
> +		elif test "$output" = "progress do: checkpoint and stop"
> +		then
> +			(
> +				echo "checkpoint"
> +				echo "progress do: stop"
> +			) >&8 &
> +		elif test "$output" = "progress do: stop"
>  		then
>  			error=0
>  			break
> +		else
> +			# otherwise ignore cruft
> +			echo >&2 "cruft: $output"
>  		fi
> -		# otherwise ignore cruft
> -		echo >&2 "cruft: $output"
>  	done
>  
>  	if test $error -eq 1
> @@ -3189,10 +3206,11 @@ test_expect_success PIPE 'V: checkpoint helper does not get stuck with extra out
>  	cat >input <<-INPUT_END &&
>  	progress foo
>  	progress bar
> +	progress do: checkpoint and stop
>  
>  	INPUT_END
>  
> -	background_import_then_checkpoint "" input &&
> +	background_import "" input &&
>  	background_import_still_running
>  '
>  
> @@ -3201,9 +3219,11 @@ test_expect_success PIPE 'V: checkpoint updates refs after reset' '
>  	reset refs/heads/V
>  	from refs/heads/U
>  
> +	progress do: checkpoint and stop
> +
>  	INPUT_END
>  
> -	background_import_then_checkpoint "" input &&
> +	background_import "" input &&
>  	test "$(git rev-parse --verify V)" = "$(git rev-parse --verify U)" &&
>  	background_import_still_running
>  '
> @@ -3216,9 +3236,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit' '
>  	data 0
>  	from refs/heads/U
>  
> +	progress do: checkpoint and stop
> +
>  	INPUT_END
>  
> -	background_import_then_checkpoint "--export-marks=marks.actual" input &&
> +	background_import "--export-marks=marks.actual" input &&
>  
>  	echo ":1 $(git rev-parse --verify V)" >marks.expected &&
>  
> @@ -3237,9 +3259,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit (no
>  	data 0
>  	from refs/heads/U
>  
> +	progress do: checkpoint and stop
> +
>  	INPUT_END
>  
> -	background_import_then_checkpoint "--export-marks=marks.actual" input &&
> +	background_import "--export-marks=marks.actual" input &&
>  
>  	echo ":2 $(git rev-parse --verify V2)" >marks.expected &&
>  
> @@ -3255,13 +3279,63 @@ test_expect_success PIPE 'V: checkpoint updates tags after tag' '
>  	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>  	data 0
>  
> +	progress do: checkpoint and stop
> +
>  	INPUT_END
>  
> -	background_import_then_checkpoint "" input &&
> +	background_import "" input &&
>  	git show-ref -d Vtag &&
>  	background_import_still_running
>  '
>  
> +test_expect_success PIPE 'V: checkpoint only resets changed branches' '
> +	cat >input <<-INPUT_END &&
> +	reset refs/heads/V3
> +	from refs/heads/U
> +
> +	checkpoint
> +
> +	progress do: shell
> +	progress git branch -f V3 V
> +
> +	reset refs/heads/V4
> +	from refs/heads/U
> +
> +	progress do: checkpoint and stop
> +
> +	INPUT_END
> +
> +	background_import "--force" input &&
> +	test "$(git rev-parse --verify V3)" = "$(git rev-parse --verify V)" &&
> +	background_import_still_running
> +'
> +
> +test_expect_success PIPE 'V: checkpoint only updates changed tags' '
> +	cat >input <<-INPUT_END &&
> +	tag Vtag2
> +	from refs/heads/U
> +	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +	data 0
> +
> +	checkpoint
> +
> +	progress do: shell
> +	progress git tag -f Vtag2 V
> +
> +	tag Vtag3
> +	from refs/heads/U
> +	tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +	data 0
> +
> +	progress do: checkpoint and stop
> +
> +	INPUT_END
> +
> +	background_import "" input &&
> +	test "$(git show-ref -d Vtag2 |tail -1 |awk \{print\ \$1\})" = "$(git rev-parse --verify V)" &&
> +	background_import_still_running
> +'
> +
>  ###
>  ### series W (get-mark and empty orphan commits)
>  ###

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them
  2019-05-15  5:30 ` Junio C Hamano
@ 2019-05-15  6:57   ` Eric Rannaud
  2019-05-16  0:35   ` Elijah Newren
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Rannaud @ 2019-05-15  6:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elijah Newren, git, Jeremy Serror

On Tue, May 14, 2019 at 10:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Eric Rannaud" <e@nanocritical.com> writes:
>
> > We now keep track of whether branches and tags have been changed by this
> > fast-import process since our last checkpoint (or startup). At the next
> > checkpoint, only refs and tags that new commands have changed are
> > written to disk.
>
> And when we notice that we have been competing with some other
> process and the ref that we wanted to update has already been
> updated to a different value, what happens?  That should be part of
> the description of the new behaviour (aka "fix") written here.

This isn't addressed by this change, that behavior remains the same as
before. This change does not introduce a new failure mode. That is, we
still overwrite external changes if fast-import has itself been
instructed to change the ref since the last checkpoint (but without
--force it must be a fast forward wrt the current state on disk).

With this change applied:

        1> $ git fast-import --force
        1> reset refs/heads/master
        1> from $A
        1> checkpoint
        2> $ git branch -f refs/heads/master $B
        1> reset refs/heads/master
        1> from $C
        1> checkpoint

master points to $C.

I can update the commit message with more details.

We could provide more complete transactional semantics, I suppose,
i.e. only update a ref if:
    - (1) fast-import has been instructed to change it (changed == 1)
since the last checkpoint (this patch),
    - and (2) the ref on disk is identical to the one fast-import
wrote at the previous checkpoint (or is identical to the one loaded
from disk if we never modified it).
Would we also need a command to get fast-import to read some/all refs
from disk and update its internal state? But maybe this is going too
far.

In any case, the sequence above would then fail. But how should we
fail? The same way we do for checkpoints which would result in non-FF
when --force is not supplied? That is: don't update this ref, print a
warning, update other refs, continue processing the input, then
eventually exit with 1.

One problem with the current warning is that it goes to stderr and
there is no reliable way to detect on --cat-blob-fd that a checkpoint
failed. It is typical to use things like:

        checkpoint
        progress checkpoint

and wait to read "progress checkpoint" on --cat-blob-fd to know when a
checkpoint is complete. It works fine if checkpoints cannot fail. For
a long-running fast-import process, an error would not necessarily be
noticed until EOF or "done", and you wouldn't easily be able to tell
which checkpoint actually failed.

If we want to improve things beyond what this patch does, I think that
a reasonable balance would be:

    - (a) improve the existing FF-only transaction by providing an
immediate error on --cat-blob-fd if a checkpoint fails (the error
should specify the list of refs that were not updated); for simplicity
of processing the output of fast-import, we likely want to also print
something for successful checkpoints: "git fast-import
--checkpoint-success", and checkpoint prints on --cat-blob-fd:

        [checkpoint failed <ref> LF]*
        checkpoint complete LF

    - (b) add a --force option to the "checkpoint" command, bypassing
the FF-only check on demand so that an interactive program can retry a
checkpoint after reading the state on disk, and maybe eventually give
up and force the update.


> > ---
>
> Missing sign-off.


Thanks. I will resend with a better explanation and sign-off.


On Tue, May 14, 2019 at 10:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Eric Rannaud" <e@nanocritical.com> writes:
>
> > We now keep track of whether branches and tags have been changed by this
> > fast-import process since our last checkpoint (or startup). At the next
> > checkpoint, only refs and tags that new commands have changed are
> > written to disk.
>
> And when we notice that we have been competing with some other
> process and the ref that we wanted to update has already been
> updated to a different value, what happens?  That should be part of
> the description of the new behaviour (aka "fix") written here.
>
> > ---
>
> Missing sign-off.
>
> It sounds like a worthwhile thing to do.
>
> It seems that Elijah has been active in fast-import/export area in
> the last 12 months, so let's borrow a second set of eyes from him.
>
> Thanks.
>
> >  fast-import.c          |  14 ++++++
> >  t/t9300-fast-import.sh | 106 ++++++++++++++++++++++++++++++++++-------
> >  2 files changed, 104 insertions(+), 16 deletions(-)
> >
> >
> > v2 and v1 were sent in Oct 2018. Only difference since is that the new
> > test cases in t9300 don't use sleep to implement the asynchronous dance.
> >
> > Thanks.
> >
> >
> > diff --git a/fast-import.c b/fast-import.c
> > index f38d04fa5851..e9c3ea23ae42 100644
> > --- a/fast-import.c
> > +++ b/fast-import.c
> > @@ -103,6 +103,7 @@ struct branch {
> >       uintmax_t num_notes;
> >       unsigned active : 1;
> >       unsigned delete : 1;
> > +     unsigned changed : 1;
> >       unsigned pack_id : PACK_ID_BITS;
> >       struct object_id oid;
> >  };
> > @@ -110,6 +111,7 @@ struct branch {
> >  struct tag {
> >       struct tag *next_tag;
> >       const char *name;
> > +     unsigned changed : 1;
> >       unsigned int pack_id;
> >       struct object_id oid;
> >  };
> > @@ -581,6 +583,7 @@ static struct branch *new_branch(const char *name)
> >       b->branch_tree.versions[1].mode = S_IFDIR;
> >       b->num_notes = 0;
> >       b->active = 0;
> > +     b->changed = 0;
> >       b->pack_id = MAX_PACK_ID;
> >       branch_table[hc] = b;
> >       branch_count++;
> > @@ -1571,6 +1574,10 @@ static int update_branch(struct branch *b)
> >       struct object_id old_oid;
> >       struct strbuf err = STRBUF_INIT;
> >
> > +     if (!b->changed)
> > +             return 0;
> > +     b->changed = 0;
> > +
> >       if (is_null_oid(&b->oid)) {
> >               if (b->delete)
> >                       delete_ref(NULL, b->name, NULL, 0);
> > @@ -1636,6 +1643,10 @@ static void dump_tags(void)
> >               goto cleanup;
> >       }
> >       for (t = first_tag; t; t = t->next_tag) {
> > +             if (!t->changed)
> > +                     continue;
> > +             t->changed = 0;
> > +
> >               strbuf_reset(&ref_name);
> >               strbuf_addf(&ref_name, "refs/tags/%s", t->name);
> >
> > @@ -2679,6 +2690,7 @@ static void parse_new_commit(const char *arg)
> >
> >       if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
> >               b->pack_id = pack_id;
> > +     b->changed = 1;
> >       b->last_commit = object_count_by_type[OBJ_COMMIT];
> >  }
> >
> > @@ -2763,6 +2775,7 @@ static void parse_new_tag(const char *arg)
> >               t->pack_id = MAX_PACK_ID;
> >       else
> >               t->pack_id = pack_id;
> > +     t->changed = 1;
> >  }
> >
> >  static void parse_reset_branch(const char *arg)
> > @@ -2783,6 +2796,7 @@ static void parse_reset_branch(const char *arg)
> >               b = new_branch(arg);
> >       read_next_command();
> >       parse_from(b);
> > +     b->changed = 1;
> >       if (command_buf.len > 0)
> >               unread_command_buf = 1;
> >  }
> > diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> > index 3668263c4046..12abaebb22b6 100755
> > --- a/t/t9300-fast-import.sh
> > +++ b/t/t9300-fast-import.sh
> > @@ -3121,14 +3121,23 @@ test_expect_success 'U: validate root delete result' '
> >  ### series V (checkpoint)
> >  ###
> >
> > -# The commands in input_file should not produce any output on the file
> > -# descriptor set with --cat-blob-fd (or stdout if unspecified).
> > +# Instructions can be sent from $input_file to background_import() loop via the
> > +# fast-import progress command.
> > +#
> > +#    progress do: shell
> > +# Parse the next progress line and invoke it as a shell command.
> > +#
> > +#    progress do: checkpoint and stop
> > +# Send checkpoint and wait for its completion.
> > +#
> > +#    progress do: stop
> > +# Internal instruction.
> >  #
> >  # To make sure you're observing the side effects of checkpoint *before*
> >  # fast-import terminates (and thus writes out its state), check that the
> >  # fast-import process is still running using background_import_still_running
> >  # *after* evaluating the test conditions.
> > -background_import_then_checkpoint () {
> > +background_import () {
> >       options=$1
> >       input_file=$2
> >
> > @@ -3153,22 +3162,30 @@ background_import_then_checkpoint () {
> >       # pipes writing sequence. We want to assume that the write below could
> >       # block, e.g. if fast-import blocks writing its own output to &9
> >       # because there is no reader on &9 yet.
> > -     (
> > -             cat "$input_file"
> > -             echo "checkpoint"
> > -             echo "progress checkpoint"
> > -     ) >&8 &
> > +     cat "$input_file" >&8 &
> >
> >       error=1 ;# assume the worst
> >       while read output <&9
> >       do
> > -             if test "$output" = "progress checkpoint"
> > +             if test "$output" = "progress do: shell"
> > +             then
> > +                     read output <&9
> > +                     shell_cmd="$(echo $output |sed -e 's/^progress //')"
> > +                     $shell_cmd
> > +             elif test "$output" = "progress do: checkpoint and stop"
> > +             then
> > +                     (
> > +                             echo "checkpoint"
> > +                             echo "progress do: stop"
> > +                     ) >&8 &
> > +             elif test "$output" = "progress do: stop"
> >               then
> >                       error=0
> >                       break
> > +             else
> > +                     # otherwise ignore cruft
> > +                     echo >&2 "cruft: $output"
> >               fi
> > -             # otherwise ignore cruft
> > -             echo >&2 "cruft: $output"
> >       done
> >
> >       if test $error -eq 1
> > @@ -3189,10 +3206,11 @@ test_expect_success PIPE 'V: checkpoint helper does not get stuck with extra out
> >       cat >input <<-INPUT_END &&
> >       progress foo
> >       progress bar
> > +     progress do: checkpoint and stop
> >
> >       INPUT_END
> >
> > -     background_import_then_checkpoint "" input &&
> > +     background_import "" input &&
> >       background_import_still_running
> >  '
> >
> > @@ -3201,9 +3219,11 @@ test_expect_success PIPE 'V: checkpoint updates refs after reset' '
> >       reset refs/heads/V
> >       from refs/heads/U
> >
> > +     progress do: checkpoint and stop
> > +
> >       INPUT_END
> >
> > -     background_import_then_checkpoint "" input &&
> > +     background_import "" input &&
> >       test "$(git rev-parse --verify V)" = "$(git rev-parse --verify U)" &&
> >       background_import_still_running
> >  '
> > @@ -3216,9 +3236,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit' '
> >       data 0
> >       from refs/heads/U
> >
> > +     progress do: checkpoint and stop
> > +
> >       INPUT_END
> >
> > -     background_import_then_checkpoint "--export-marks=marks.actual" input &&
> > +     background_import "--export-marks=marks.actual" input &&
> >
> >       echo ":1 $(git rev-parse --verify V)" >marks.expected &&
> >
> > @@ -3237,9 +3259,11 @@ test_expect_success PIPE 'V: checkpoint updates refs and marks after commit (no
> >       data 0
> >       from refs/heads/U
> >
> > +     progress do: checkpoint and stop
> > +
> >       INPUT_END
> >
> > -     background_import_then_checkpoint "--export-marks=marks.actual" input &&
> > +     background_import "--export-marks=marks.actual" input &&
> >
> >       echo ":2 $(git rev-parse --verify V2)" >marks.expected &&
> >
> > @@ -3255,13 +3279,63 @@ test_expect_success PIPE 'V: checkpoint updates tags after tag' '
> >       tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> >       data 0
> >
> > +     progress do: checkpoint and stop
> > +
> >       INPUT_END
> >
> > -     background_import_then_checkpoint "" input &&
> > +     background_import "" input &&
> >       git show-ref -d Vtag &&
> >       background_import_still_running
> >  '
> >
> > +test_expect_success PIPE 'V: checkpoint only resets changed branches' '
> > +     cat >input <<-INPUT_END &&
> > +     reset refs/heads/V3
> > +     from refs/heads/U
> > +
> > +     checkpoint
> > +
> > +     progress do: shell
> > +     progress git branch -f V3 V
> > +
> > +     reset refs/heads/V4
> > +     from refs/heads/U
> > +
> > +     progress do: checkpoint and stop
> > +
> > +     INPUT_END
> > +
> > +     background_import "--force" input &&
> > +     test "$(git rev-parse --verify V3)" = "$(git rev-parse --verify V)" &&
> > +     background_import_still_running
> > +'
> > +
> > +test_expect_success PIPE 'V: checkpoint only updates changed tags' '
> > +     cat >input <<-INPUT_END &&
> > +     tag Vtag2
> > +     from refs/heads/U
> > +     tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> > +     data 0
> > +
> > +     checkpoint
> > +
> > +     progress do: shell
> > +     progress git tag -f Vtag2 V
> > +
> > +     tag Vtag3
> > +     from refs/heads/U
> > +     tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> > +     data 0
> > +
> > +     progress do: checkpoint and stop
> > +
> > +     INPUT_END
> > +
> > +     background_import "" input &&
> > +     test "$(git show-ref -d Vtag2 |tail -1 |awk \{print\ \$1\})" = "$(git rev-parse --verify V)" &&
> > +     background_import_still_running
> > +'
> > +
> >  ###
> >  ### series W (get-mark and empty orphan commits)
> >  ###



-- 
Eric Rannaud <e@nanocritical.com>
Nanocritical, CEO

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them
  2019-05-15  5:30 ` Junio C Hamano
  2019-05-15  6:57   ` Eric Rannaud
@ 2019-05-16  0:35   ` Elijah Newren
  2019-05-16  5:54     ` Eric Rannaud
  1 sibling, 1 reply; 5+ messages in thread
From: Elijah Newren @ 2019-05-16  0:35 UTC (permalink / raw)
  To: Eric Rannaud; +Cc: Git Mailing List, Jeremy Serror, Junio C Hamano

On Tue, May 14, 2019 at 10:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Eric Rannaud" <e@nanocritical.com> writes:
>
> > We now keep track of whether branches and tags have been changed by this
> > fast-import process since our last checkpoint (or startup). At the next
> > checkpoint, only refs and tags that new commands have changed are
> > written to disk.
>
> And when we notice that we have been competing with some other
> process and the ref that we wanted to update has already been
> updated to a different value, what happens?  That should be part of
> the description of the new behaviour (aka "fix") written here.
>
> > ---
>
> Missing sign-off.
>
> It sounds like a worthwhile thing to do.
>
> It seems that Elijah has been active in fast-import/export area in
> the last 12 months, so let's borrow a second set of eyes from him.

I glanced over the fast-import.c portion of the code and it seemed
sane, modulo this question about handling of failure to update
branches/tags during a checkpoint.  Didn't read the testcase portion
of the patch as closely yet, but I think the big question is failures
during checkpoints.

First, I have to admit to having never used checkpoints, and I'm
struggling a little bit to understand the usecase here.

For context on my angle, I've been always using --force and 'feature
done', which ensures that either all branches and tags successfully
update, or fast-import terminates with having made no changes (other
than perhaps having written a garbage packfile that can be pruned).
I'm particularly worried about the frontend hitting errors and not
completing its output.  With that in mind, I was surprised to notice
that explicit checkpoints update branches and tags; it seemed
incompatible with --done to me, though it turns out to only be
incompatible with my intended use (no partial updates).

In your case, you are clearly interested in partial updates for some
purpose.  It's easiest to figure out the 'right' behavior for partial
updates in conjunction with --force: as you state, if someone else
updates some ref after a checkpoint and the input stream continues to
change the ref, then tough luck for the external party; we were told
to `--force` so we overwrite.  But the case without --force is most
interesting, as Junio highlighted.  I don't think that case has been
thought about deeply in the past, for two reasons: (1) the fast-import
documentation is somewhat dismissive of checkpoints ("given that a 30
GiB Subversion repository can be loaded into Git through fast-import
in about 3 hours, explicit checkpointing may not be necessary") , (2)
I think most folks assume the repository isn't going to be interacted
with during the fast-import process.  However, the whole reason for
your patch is because checkpoints are in use *and* people are updating
the repo during the fast-import process, so this area that was likely
overlooked in the past now is pretty important to get right.

Looking at the fast-import code, it appears that update_branch and
update_tag both set a global 'failure' flag whenever a branch cannot
be updated due to it not being a fast-forward.  It prints a "warning"
but this is likely because it wants to update whatever branches and
tags it can and report issues on any it can't, and then after that's
done then exit.  Since branches and tags are typically just updated at
the very end of the process (again, based on the documentation
suggesting checkpoints aren't necessary), the end of program exit can
also be thought of as the "exit immediately after failing branch/tag
updates".  My current suspicion is that the current checkpointing code
not checking the 'failure' flag was an oversight.  If the user doesn't
want branches/tags forcibly updated, and they do want checkpoints when
the branches/tags get updated as input comes in, and there is the
possibility that external folks modify thing, then I think it makes
sense to abort the import once the checkpoint is finished (i.e. update
whatever branches/tags we can and then error out); continuing to
process the input stream seems wrong, because:
  * the reason to use checkpoints is because the fast-import process
is taking a long time; if it was quick, there'd be no reason to
manually checkpoint.  Therefore...
  * the user may be forced to backup and redo the long-lived
fast-import process anyway; we've cost them time by not erroring out
immediately
  * we allow errors to accumulate (other branches could have failures
to update later in the fast-import process), causing users to have to
figure out whether one or many failures were reported and then have to
try to track down if they have one or multiple causes
  * the users are less likely to notice early "warning" without
erroring out, but the cause (someone else updating a ref) may be
harder to diagnose the longer it has been since the external ref
update.

That's my impression looking over your change and reading over the
relevant bits of fast-import.c, but as I noted before I don't
understand the usecase and perhaps there's something important I'm not
grasping based on that.  It'd be great if you could provide more
details to help us determine if we need to update the checkpoint code
to appropriately handle errors.

Hope that helps,
Elijah

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them
  2019-05-16  0:35   ` Elijah Newren
@ 2019-05-16  5:54     ` Eric Rannaud
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Rannaud @ 2019-05-16  5:54 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jeremy Serror, Junio C Hamano

On Wed, May 15, 2019 at 5:36 PM Elijah Newren <newren@gmail.com> wrote:
> with during the fast-import process.  However, the whole reason for
> your patch is because checkpoints are in use *and* people are updating
> the repo during the fast-import process, so this area that was likely
> overlooked in the past now is pretty important to get right.

A little bit of background on our particular use case: fast-import is
used to perform frequent history (re)writing and is a great (and fast)
way to modify many refs without having to touch a checkout. For us, an
alternative would be to use libgit but this is simpler, multiprocess
and non-blocking, and doesn't have any additional dependencies.

We use --force: our fast-import frontend independently watches the
repo for updates so we are generally aware of external updates to the
repo. (And if we race with an external change, we can live with that
-- although being able to finely control which ref updates we want to
force and which ones we expect to be FF would be great, but that's for
another patch.)

We would like to use checkpoints to keep fast-import running, which
has the advantage of keeping some Git state in memory. Restarting git
fast-import every time we want to make a change is doable, but more
expensive.

With the current behavior, we have to shut down fast-import after
every checkpoint (in other words, we can't really use them) and
restart when needed, ensuring we don't overwrite refs we didn't intend
to change this time. Alternatively, we could maintain an up-to-date
(wrt .git/ on disk) mapping ref -> sha1 for all refs we've ever
changed, and make sure to precede any checkpoint with a bunch of
intended-to-be-no-op reset commands, which effectively let fast-import
know about external changes to refs that we do not wish to overwrite.

But this patch seemed like a better path forward.

> Looking at the fast-import code, it appears that update_branch and
> update_tag both set a global 'failure' flag whenever a branch cannot
> be updated due to it not being a fast-forward.  It prints a "warning"
> but this is likely because it wants to update whatever branches and
> tags it can and report issues on any it can't, and then after that's
> done then exit.  Since branches and tags are typically just updated at
> the very end of the process (again, based on the documentation
> suggesting checkpoints aren't necessary), the end of program exit can
> also be thought of as the "exit immediately after failing branch/tag
> updates".  My current suspicion is that the current checkpointing code
> not checking the 'failure' flag was an oversight.  If the user doesn't
> want branches/tags forcibly updated, and they do want checkpoints when
> the branches/tags get updated as input comes in, and there is the
> possibility that external folks modify thing, then I think it makes
> sense to abort the import once the checkpoint is finished (i.e. update
> whatever branches/tags we can and then error out); continuing to
> process the input stream seems wrong, because:
>   * the reason to use checkpoints is because the fast-import process
> is taking a long time; if it was quick, there'd be no reason to
> manually checkpoint.  Therefore...
>   * the user may be forced to backup and redo the long-lived
> fast-import process anyway; we've cost them time by not erroring out
> immediately
>   * we allow errors to accumulate (other branches could have failures
> to update later in the fast-import process), causing users to have to
> figure out whether one or many failures were reported and then have to
> try to track down if they have one or multiple causes
>   * the users are less likely to notice early "warning" without
> erroring out, but the cause (someone else updating a ref) may be
> harder to diagnose the longer it has been since the external ref
> update.

We use --force so I don't have an immediate interest in how checkpoint
errors are handled.

In a separate response to Junio, I suggested an alternative mechanism
to signal checkpoint conflicts on --cat-blob-fd, but I think it would
generally be fine if fast-import were to simply exit on such an error.
To reliably figure out which updates made it to disk and which didn't,
one can obtain expected OIDs with get-mark before initiating a
checkpoint, and compare with refs on disk after an error.

I'm not sure if such a change should be behind a flag or not. I think
anyone running without --force does care if a ref update is rejected;
and we're unlikely to break much by erroring out early. But maybe some
users out there don't check the exit status of fast-import and haven't
noticed that some branch they don't really care about is not being
updated successfully while everything else gets updated fine?

Thanks for the thoughtful analysis.

Eric

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-05-16  5:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-10  8:41 [PATCH v3] fast-import: checkpoint: only write out refs and tags if we changed them Eric Rannaud
2019-05-15  5:30 ` Junio C Hamano
2019-05-15  6:57   ` Eric Rannaud
2019-05-16  0:35   ` Elijah Newren
2019-05-16  5:54     ` Eric Rannaud

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).