git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v1] convert: add "status=delayed" to filter process protocol
@ 2017-01-08 19:17 larsxschneider
  2017-01-08 20:14 ` Torsten Bögershausen
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: larsxschneider @ 2017-01-08 19:17 UTC (permalink / raw)
  To: git; +Cc: gitster, e, jnareb, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Some `clean` / `smudge` filters might require a significant amount of
time to process a single blob. During this process the Git checkout
operation is blocked and Git needs to wait until the filter is done to
continue with the checkout.

Teach the filter process protocol (introduced in edcc858) to accept the
status "delayed" as response to a filter request. Upon this response Git
continues with the checkout operation and asks the filter to process the
blob again after all other blobs have been processed.

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---

You can find the RFC thread for this topic here:
http://public-inbox.org/git/D10F7C47-14E8-465B-8B7A-A09A1B28A39F@gmail.com/

Notes:
    Base Commit: e05806da9e (master)
    Diff on Web: https://github.com/larsxschneider/git/commit/ea25a1834b
    Checkout:    git fetch https://github.com/larsxschneider/git filter-process/delay-v1 && git checkout ea25a1834b

    Interdiff (filter-process/delay-v0..filter-process/delay-v1):

 Documentation/gitattributes.txt |  9 +++++++
 apply.c                         |  2 +-
 archive.c                       |  2 +-
 builtin/cat-file.c              |  2 +-
 builtin/checkout.c              |  1 +
 cache.h                         |  1 +
 convert.c                       | 33 ++++++++++++++-----------
 convert.h                       |  2 +-
 diff.c                          |  2 +-
 entry.c                         | 40 +++++++++++++++++++++++++++----
 merge-recursive.c               |  2 +-
 t/t0021-conversion.sh           | 53 +++++++++++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 19 +++++++++++++++
 unpack-trees.c                  |  1 +
 14 files changed, 145 insertions(+), 24 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index e0b66c1220..f6bad8db40 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -473,6 +473,15 @@ packet:          git< 0000  # empty content!
 packet:          git< 0000  # empty list, keep "status=success" unchanged!
 ------------------------

+If the request cannot be fulfilled within a reasonable amount of time
+then the filter can respond with a "delayed" status and a flush packet.
+Git will perform the same request at a later point in time, again. The
+filter can delay a response multiple times for a single request.
+------------------------
+packet:          git< status=delayed
+packet:          git< 0000
+------------------------
+
 In case the filter cannot or does not want to process the content,
 it is expected to respond with an "error" status.
 ------------------------
diff --git a/apply.c b/apply.c
index 2ed808d429..aa7e6e4359 100644
--- a/apply.c
+++ b/apply.c
@@ -4328,7 +4328,7 @@ static int try_create_file(const char *path, unsigned int mode, const char *buf,
 	if (fd < 0)
 		return 1;

-	if (convert_to_working_tree(path, buf, size, &nbuf)) {
+	if (convert_to_working_tree(path, buf, size, &nbuf, NULL)) {
 		size = nbuf.len;
 		buf  = nbuf.buf;
 	}
diff --git a/archive.c b/archive.c
index 01751e574b..90d02463ef 100644
--- a/archive.c
+++ b/archive.c
@@ -77,7 +77,7 @@ void *sha1_file_to_archive(const struct archiver_args *args,
 		size_t size = 0;

 		strbuf_attach(&buf, buffer, *sizep, *sizep + 1);
-		convert_to_working_tree(path, buf.buf, buf.len, &buf);
+		convert_to_working_tree(path, buf.buf, buf.len, &buf, NULL);
 		if (commit)
 			format_subst(commit, buf.buf, buf.len, &buf);
 		buffer = strbuf_detach(&buf, &size);
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 30383e9eb4..4e3e31a00b 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -35,7 +35,7 @@ static int filter_object(const char *path, unsigned mode,
 			     oid_to_hex(oid), path);
 	if ((type == OBJ_BLOB) && S_ISREG(mode)) {
 		struct strbuf strbuf = STRBUF_INIT;
-		if (convert_to_working_tree(path, *buf, *size, &strbuf)) {
+		if (convert_to_working_tree(path, *buf, *size, &strbuf, NULL)) {
 			free(*buf);
 			*size = strbuf.len;
 			*buf = strbuf_detach(&strbuf, NULL);
diff --git a/builtin/checkout.c b/builtin/checkout.c
index bfe685c198..42885cfe92 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -369,6 +369,7 @@ static int checkout_paths(const struct checkout_opts *opts,
 			pos = skip_same_name(ce, pos) - 1;
 		}
 	}
+	errs |= checkout_delayed_entries(&state);

 	if (write_locked_index(&the_index, lock_file, COMMIT_LOCK))
 		die(_("unable to write new index file"));
diff --git a/cache.h b/cache.h
index a50a61a197..5d119ffc6b 100644
--- a/cache.h
+++ b/cache.h
@@ -1375,6 +1375,7 @@ struct checkout {

 #define TEMPORARY_FILENAME_LENGTH 25
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
+extern int checkout_delayed_entries(const struct checkout *state);

 struct cache_def {
 	struct strbuf path;
diff --git a/convert.c b/convert.c
index 4e17e45ed2..4075a58e64 100644
--- a/convert.c
+++ b/convert.c
@@ -672,7 +672,7 @@ static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, cons
 }

 static int apply_multi_file_filter(const char *path, const char *src, size_t len,
-				   int fd, struct strbuf *dst, const char *cmd,
+				   int fd, struct strbuf *dst, int *delayed, const char *cmd,
 				   const unsigned int wanted_capability)
 {
 	int err;
@@ -738,9 +738,14 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
 		goto done;

 	read_multi_file_filter_status(process->out, &filter_status);
-	err = strcmp(filter_status.buf, "success");
-	if (err)
+	if (delayed && !strcmp(filter_status.buf, "delayed")) {
+		*delayed = 1;
 		goto done;
+	} else {
+		err = strcmp(filter_status.buf, "success");
+		if (err)
+			goto done;
+	}

 	err = read_packetized_to_strbuf(process->out, &nbuf) < 0;
 	if (err)
@@ -787,8 +792,8 @@ static struct convert_driver {
 } *user_convert, **user_convert_tail;

 static int apply_filter(const char *path, const char *src, size_t len,
-			int fd, struct strbuf *dst, struct convert_driver *drv,
-			const unsigned int wanted_capability)
+			int fd, struct strbuf *dst, int *delayed,
+			struct convert_driver *drv, const unsigned int wanted_capability)
 {
 	const char *cmd = NULL;

@@ -806,7 +811,7 @@ static int apply_filter(const char *path, const char *src, size_t len,
 	if (cmd && *cmd)
 		return apply_single_file_filter(path, src, len, fd, dst, cmd);
 	else if (drv->process && *drv->process)
-		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
+		return apply_multi_file_filter(path, src, len, fd, dst, delayed, drv->process, wanted_capability);

 	return 0;
 }
@@ -1152,7 +1157,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;

-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
+	return apply_filter(path, NULL, 0, -1, NULL, NULL, ca.drv, CAP_CLEAN);
 }

 const char *get_convert_attr_ascii(const char *path)
@@ -1189,7 +1194,7 @@ int convert_to_git(const char *path, const char *src, size_t len,

 	convert_attrs(&ca, path);

-	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN);
+	ret |= apply_filter(path, src, len, -1, dst, NULL, ca.drv, CAP_CLEAN);
 	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

@@ -1214,7 +1219,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	assert(ca.drv);
 	assert(ca.drv->clean || ca.drv->process);

-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN))
+	if (!apply_filter(path, NULL, 0, fd, dst, NULL, ca.drv, CAP_CLEAN))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -1222,7 +1227,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 }

 static int convert_to_working_tree_internal(const char *path, const char *src,
-					    size_t len, struct strbuf *dst,
+					    size_t len, struct strbuf *dst, int *delayed,
 					    int normalizing)
 {
 	int ret = 0, ret_filter = 0;
@@ -1248,21 +1253,21 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}

-	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, CAP_SMUDGE);
+	ret_filter = apply_filter(path, src, len, -1, dst, delayed, ca.drv, CAP_SMUDGE);
 	if (!ret_filter && ca.drv && ca.drv->required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);

 	return ret | ret_filter;
 }

-int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
+int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst, int *delayed)
 {
-	return convert_to_working_tree_internal(path, src, len, dst, 0);
+	return convert_to_working_tree_internal(path, src, len, dst, delayed, 0);
 }

 int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst)
 {
-	int ret = convert_to_working_tree_internal(path, src, len, dst, 1);
+	int ret = convert_to_working_tree_internal(path, src, len, dst, NULL, 1);
 	if (ret) {
 		src = dst->buf;
 		len = dst->len;
diff --git a/convert.h b/convert.h
index 82871a11d5..72f24078de 100644
--- a/convert.h
+++ b/convert.h
@@ -41,7 +41,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 extern int convert_to_git(const char *path, const char *src, size_t len,
 			  struct strbuf *dst, enum safe_crlf checksafe);
 extern int convert_to_working_tree(const char *path, const char *src,
-				   size_t len, struct strbuf *dst);
+				   size_t len, struct strbuf *dst, int *delayed);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,
 			      struct strbuf *dst);
 static inline int would_convert_to_git(const char *path)
diff --git a/diff.c b/diff.c
index 84dba60c40..243cd0ffdf 100644
--- a/diff.c
+++ b/diff.c
@@ -2960,7 +2960,7 @@ static void prep_temp_blob(const char *path, struct diff_tempfile *temp,
 	if (fd < 0)
 		die_errno("unable to create temp-file");
 	if (convert_to_working_tree(path,
-			(const char *)blob, (size_t)size, &buf)) {
+			(const char *)blob, (size_t)size, &buf, NULL)) {
 		blob = buf.buf;
 		size = buf.len;
 	}
diff --git a/entry.c b/entry.c
index c6eea240b6..43f3ae7b69 100644
--- a/entry.c
+++ b/entry.c
@@ -2,6 +2,14 @@
 #include "blob.h"
 #include "dir.h"
 #include "streaming.h"
+#include "list.h"
+
+static LIST_HEAD(delayed_cache_entry_queue_head);
+
+struct delayed_cache_entry {
+	struct cache_entry *ce;
+	struct list_head node;
+};

 static void create_directories(const char *path, int path_len,
 			       const struct checkout *state)
@@ -140,12 +148,13 @@ static int write_entry(struct cache_entry *ce,
 		       char *path, const struct checkout *state, int to_tempfile)
 {
 	unsigned int ce_mode_s_ifmt = ce->ce_mode & S_IFMT;
-	int fd, ret, fstat_done = 0;
+	int fd, ret, fstat_done = 0, delayed = 0;
 	char *new;
 	struct strbuf buf = STRBUF_INIT;
 	unsigned long size;
 	size_t wrote, newsize = 0;
 	struct stat st;
+	struct delayed_cache_entry *delayed_ce;

 	if (ce_mode_s_ifmt == S_IFREG) {
 		struct stream_filter *filter = get_stream_filter(ce->name,
@@ -178,10 +187,17 @@ static int write_entry(struct cache_entry *ce,
 		 * Convert from git internal format to working tree format
 		 */
 		if (ce_mode_s_ifmt == S_IFREG &&
-		    convert_to_working_tree(ce->name, new, size, &buf)) {
+		    convert_to_working_tree(ce->name, new, size, &buf, &delayed)) {
 			free(new);
-			new = strbuf_detach(&buf, &newsize);
-			size = newsize;
+			if (delayed) {
+				delayed_ce = xmalloc(sizeof(*delayed_ce));
+				delayed_ce->ce = ce;
+				list_add_tail(&delayed_ce->node, &delayed_cache_entry_queue_head);
+				goto finish;
+			} else {
+				new = strbuf_detach(&buf, &newsize);
+				size = newsize;
+			}
 		}

 		fd = open_output_fd(path, ce, to_tempfile);
@@ -291,3 +307,19 @@ int checkout_entry(struct cache_entry *ce,
 	create_directories(path.buf, path.len, state);
 	return write_entry(ce, path.buf, state, 0);
 }
+
+int checkout_delayed_entries(const struct checkout *state)
+{
+	struct delayed_cache_entry *head;
+	int errs = 0;
+
+	while (!list_empty(&delayed_cache_entry_queue_head)) {
+		head = list_first_entry(&delayed_cache_entry_queue_head,
+			struct delayed_cache_entry, node);
+		list_del(&head->node);
+		head->ce->ce_flags &= ~CE_UPDATE;
+		errs |= checkout_entry(head->ce, state, NULL);
+	}
+
+	return errs;
+}
diff --git a/merge-recursive.c b/merge-recursive.c
index d327209443..73dd33e61f 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -807,7 +807,7 @@ static int update_file_flags(struct merge_options *o,
 		}
 		if (S_ISREG(mode)) {
 			struct strbuf strbuf = STRBUF_INIT;
-			if (convert_to_working_tree(path, buf, size, &strbuf)) {
+			if (convert_to_working_tree(path, buf, size, &strbuf, NULL)) {
 				free(buf);
 				size = strbuf.len;
 				buf = strbuf_detach(&strbuf, NULL);
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 161f560446..8ae5b1a521 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -701,4 +701,57 @@ test_expect_success PERL 'invalid process filter must fail (and not hang!)' '
 	)
 '

+test_expect_success PERL 'delayed checkout in process filter' '
+	test_config_global filter.protocol.process "rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+		echo "*.r filter=protocol" >.gitattributes &&
+		cp "$TEST_ROOT/test.o" test.r &&
+		cp "$TEST_ROOT/test.o" test-delay1.r &&
+		cp "$TEST_ROOT/test.o" test-delay3.r &&
+		git add . &&
+		git commit -m "test commit 1"
+	) &&
+
+	S=$(file_size repo/test.r) &&
+	rm -rf repo-cloned &&
+	filter_git clone repo repo-cloned &&
+	cat >expected.log <<-EOF &&
+		START
+		init handshake complete
+		IN: smudge test.r $S [OK] -- OUT: $S . [OK]
+		IN: smudge test-delay1.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay1.r $S [OK] -- OUT: $S . [OK]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S . [OK]
+		STOP
+	EOF
+	test_cmp_count expected.log repo-cloned/rot13-filter.log &&
+
+	(
+		cd repo-cloned &&
+		rm *.r rot13-filter.log &&
+		filter_git checkout . &&
+		cat >expected.log <<-EOF &&
+			START
+			init handshake complete
+			IN: smudge test.r $S [OK] -- OUT: $S . [OK]
+			IN: smudge test-delay1.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay1.r $S [OK] -- OUT: $S . [OK]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S . [OK]
+			STOP
+		EOF
+		test_cmp_count expected.log rot13-filter.log
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
index 617f581e56..b10be1e543 100644
--- a/t/t0021/rot13-filter.pl
+++ b/t/t0021/rot13-filter.pl
@@ -17,6 +17,10 @@
 #     operation then the filter signals that it cannot or does not want
 #     to process the file and any file after that is processed with the
 #     same command.
+# (5) If data with a pathname that is a key in the DELAY hash is
+#     processed (e.g. 'test-delay1.r') then the filter signals n times
+#     to Git that the processing is delayed (n being the value of the
+#     DELAY hash key).
 #

 use strict;
@@ -25,6 +29,12 @@ use IO::File;

 my $MAX_PACKET_CONTENT_SIZE = 65516;
 my @capabilities            = @ARGV;
+my $DELAY3 = 3;
+my $DELAY1 = 1;
+
+my %DELAY;
+$DELAY{'test-delay1.r'} = 1;
+$DELAY{'test-delay3.r'} = 3;

 open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";

@@ -166,6 +176,15 @@ while (1) {
 		packet_txt_write("status=abort");
 		packet_flush();
 	}
+	elsif ( $command eq "smudge" and
+		    exists $DELAY{$pathname} and
+		    $DELAY{$pathname} gt 0 ) {
+		$DELAY{$pathname} = $DELAY{$pathname} - 1;
+		print $debug "[DELAYED]\n";
+		$debug->flush();
+		packet_txt_write("status=delayed");
+		packet_flush();
+	}
 	else {
 		packet_txt_write("status=success");
 		packet_flush();
diff --git a/unpack-trees.c b/unpack-trees.c
index 7a6df99d10..662b75f72a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -268,6 +268,7 @@ static int check_updates(struct unpack_trees_options *o,
 			}
 		}
 	}
+	errs |= checkout_delayed_entries(state);
 	stop_progress(&progress);
 	if (o->update)
 		git_attr_set_direction(GIT_ATTR_CHECKIN, NULL);
--
2.11.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 19:17 [PATCH v1] convert: add "status=delayed" to filter process protocol larsxschneider
@ 2017-01-08 20:14 ` Torsten Bögershausen
  2017-01-11  9:48   ` Lars Schneider
  2017-01-08 20:45 ` Eric Wong
  2017-01-08 23:42 ` Junio C Hamano
  2 siblings, 1 reply; 15+ messages in thread
From: Torsten Bögershausen @ 2017-01-08 20:14 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, gitster, e, jnareb

On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschneider@gmail.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob. During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.
> 
> Teach the filter process protocol (introduced in edcc858) to accept the
> status "delayed" as response to a filter request. Upon this response Git
> continues with the checkout operation and asks the filter to process the
> blob again after all other blobs have been processed.
> 
> Git has a multiple code paths that checkout a blob. Support delayed
> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> 

Some feeling tells me that it may be better to leave convert_to_working_tree() as it is.
And change convert_to_working_tree_internal as suggested:

int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
 {
-	return convert_to_working_tree_internal(path, src, len, dst, 0);
+	return convert_to_working_tree_internal(path, src, len, dst, NULL, 0);
 }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 19:17 [PATCH v1] convert: add "status=delayed" to filter process protocol larsxschneider
  2017-01-08 20:14 ` Torsten Bögershausen
@ 2017-01-08 20:45 ` Eric Wong
  2017-01-11  9:51   ` Lars Schneider
  2017-01-08 23:42 ` Junio C Hamano
  2 siblings, 1 reply; 15+ messages in thread
From: Eric Wong @ 2017-01-08 20:45 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, gitster, jnareb

larsxschneider@gmail.com wrote:
> +++ b/t/t0021/rot13-filter.pl

> +$DELAY{'test-delay1.r'} = 1;
> +$DELAY{'test-delay3.r'} = 3;
> 
>  open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";
> 
> @@ -166,6 +176,15 @@ while (1) {
>  		packet_txt_write("status=abort");
>  		packet_flush();
>  	}
> +	elsif ( $command eq "smudge" and
> +		    exists $DELAY{$pathname} and
> +		    $DELAY{$pathname} gt 0 ) {

Use '>' for numeric comparisons.  'gt' is for strings (man perlop)

Sidenote, staying <= 80 columns for the rest of the changes is
strongly preferred, some of us need giant fonts.  I think what
Torsten said about introducing a new *_internal function can
also help with that.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 19:17 [PATCH v1] convert: add "status=delayed" to filter process protocol larsxschneider
  2017-01-08 20:14 ` Torsten Bögershausen
  2017-01-08 20:45 ` Eric Wong
@ 2017-01-08 23:42 ` Junio C Hamano
  2017-01-10 22:11   ` Jakub Narębski
                     ` (2 more replies)
  2 siblings, 3 replies; 15+ messages in thread
From: Junio C Hamano @ 2017-01-08 23:42 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, e, jnareb

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob. During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.
>
> Teach the filter process protocol (introduced in edcc858) to accept the
> status "delayed" as response to a filter request. Upon this response Git
> continues with the checkout operation and asks the filter to process the
> blob again after all other blobs have been processed.

Hmm, I would have expected that the basic flow would become

	for each paths to be processed:
		convert-to-worktree to buf
		if not delayed:
			do the caller's thing to use buf
		else:
			remember path

	for each delayed paths:
		ensure filter process finished processing for path
		fetch the thing to buf from the process
		do the caller's thing to use buf

and that would make quite a lot of sense.  However, what is actually
implemented is a bit disappointing from that point of view.  While
its first part is the same as above, the latter part instead does:

	for each delayed paths:
		checkout the path

Presumably, checkout_entry() does the "ensure that the process is
done converting" (otherwise the result is simply buggy), but what
disappoints me is that this does not allow callers that call
"convert-to-working-tree", whose interface is obtain the bytestream 
in-core in the working tree representation, given an object in the
object-db representation in an in-core buffer, to _use_ the result
of the conversion.  The caller does not have a chance to even see
the result as it is written straight to the filesystem, once it
calls checkout_delayed_entries().

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 23:42 ` Junio C Hamano
@ 2017-01-10 22:11   ` Jakub Narębski
  2017-01-10 23:33     ` Taylor Blau
  2017-01-11 10:20     ` Lars Schneider
  2017-01-11  9:43   ` Lars Schneider
       [not found]   ` <20170109233816.GA70151@Ida>
  2 siblings, 2 replies; 15+ messages in thread
From: Jakub Narębski @ 2017-01-10 22:11 UTC (permalink / raw)
  To: Junio C Hamano, Lars Schneider; +Cc: git, Eric Wong, Jakub Narebski

W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
> larsxschneider@gmail.com writes:
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.

Lars, what is expected use case for this feature; that is when do you
think this problem may happen?  Is it something that happened IRL?

>>
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
> 
> Hmm, I would have expected that the basic flow would become
> 
> 	for each paths to be processed:
> 		convert-to-worktree to buf
> 		if not delayed:
> 			do the caller's thing to use buf
> 		else:
> 			remember path
> 
> 	for each delayed paths:
> 		ensure filter process finished processing for path
> 		fetch the thing to buf from the process
> 		do the caller's thing to use buf

I would expect here to have a kind of event loop, namely

        while there are delayed paths:
                get path that is ready from filter
                fetch the thing to buf (supporting "delayed")
                if path done
                        do the caller's thing to use buf 
                        (e.g. finish checkout path, eof convert, etc.)

We can either trust filter process to tell us when it finished sending
delayed paths, or keep list of paths that are being delayed in Git.

> 
> and that would make quite a lot of sense.  However, what is actually
> implemented is a bit disappointing from that point of view.  While
> its first part is the same as above, the latter part instead does:
> 
> 	for each delayed paths:
> 		checkout the path
> 
> Presumably, checkout_entry() does the "ensure that the process is
> done converting" (otherwise the result is simply buggy), but what
> disappoints me is that this does not allow callers that call
> "convert-to-working-tree", whose interface is obtain the bytestream 
> in-core in the working tree representation, given an object in the
> object-db representation in an in-core buffer, to _use_ the result
> of the conversion.  The caller does not have a chance to even see
> the result as it is written straight to the filesystem, once it
> calls checkout_delayed_entries().
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-10 22:11   ` Jakub Narębski
@ 2017-01-10 23:33     ` Taylor Blau
  2017-01-11 10:20     ` Lars Schneider
  1 sibling, 0 replies; 15+ messages in thread
From: Taylor Blau @ 2017-01-10 23:33 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Junio C Hamano, Lars Schneider, git, Eric Wong

On Tue, Jan 10, 2017 at 11:11:01PM +0100, Jakub Narębski wrote:
> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
> > larsxschneider@gmail.com writes:
> >> From: Lars Schneider <larsxschneider@gmail.com>
> >>
> >> Some `clean` / `smudge` filters might require a significant amount of
> >> time to process a single blob. During this process the Git checkout
> >> operation is blocked and Git needs to wait until the filter is done to
> >> continue with the checkout.
>
> Lars, what is expected use case for this feature; that is when do you
> think this problem may happen?  Is it something that happened IRL?
>
> >>
> >> Teach the filter process protocol (introduced in edcc858) to accept the
> >> status "delayed" as response to a filter request. Upon this response Git
> >> continues with the checkout operation and asks the filter to process the
> >> blob again after all other blobs have been processed.
> >
> > Hmm, I would have expected that the basic flow would become
> >
> > 	for each paths to be processed:
> > 		convert-to-worktree to buf
> > 		if not delayed:
> > 			do the caller's thing to use buf
> > 		else:
> > 			remember path
> >
> > 	for each delayed paths:
> > 		ensure filter process finished processing for path
> > 		fetch the thing to buf from the process
> > 		do the caller's thing to use buf
>
> I would expect here to have a kind of event loop, namely
>
>         while there are delayed paths:
>                 get path that is ready from filter
>                 fetch the thing to buf (supporting "delayed")
>                 if path done
>                         do the caller's thing to use buf
>                         (e.g. finish checkout path, eof convert, etc.)
>
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

This makes a lot of sense to me. The "get path that is ready from filter" should
block until the filter has data that it is ready to send. This way Git isn't
wasting time in a busy-loop asking whether the filter has data ready to be sent.
It also means that if the filter has one large chunk that it's ready to write,
Git can work on that while the filter continues to process more data,
theoretically improving the performance of checkouts with many large delayed
objects.

>
> >
> > and that would make quite a lot of sense.  However, what is actually
> > implemented is a bit disappointing from that point of view.  While
> > its first part is the same as above, the latter part instead does:
> >
> > 	for each delayed paths:
> > 		checkout the path
> >
> > Presumably, checkout_entry() does the "ensure that the process is
> > done converting" (otherwise the result is simply buggy), but what
> > disappoints me is that this does not allow callers that call
> > "convert-to-working-tree", whose interface is obtain the bytestream
> > in-core in the working tree representation, given an object in the
> > object-db representation in an in-core buffer, to _use_ the result
> > of the conversion.  The caller does not have a chance to even see
> > the result as it is written straight to the filesystem, once it
> > calls checkout_delayed_entries().
> >
>

In addition to the above, I'd also like to investigate adding a "no more items"
message into the filter protocol. This would be useful for filters that
batch delayed items into groups. In particular, if the batch size is `N`, and Git
sends `2N-1` items, the second batch will be under-filled. The filter on the
other end needs some mechanism to send the second batch, even though it hasn't
hit max capacity.

Specifically, this is how Git LFS implements object transfers for data it does
not have locally, but I'm sure that this sort of functionality would be useful
for other filter implementations as well.

--
Thanks,
Taylor Blau

ttaylorr@github.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 23:42 ` Junio C Hamano
  2017-01-10 22:11   ` Jakub Narębski
@ 2017-01-11  9:43   ` Lars Schneider
  2017-01-11 20:45     ` Junio C Hamano
       [not found]   ` <20170109233816.GA70151@Ida>
  2 siblings, 1 reply; 15+ messages in thread
From: Lars Schneider @ 2017-01-11  9:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git mailing list, Eric Wong, Jakub Narębski,
	Torsten Bögershausen, Taylor Blau


> On 09 Jan 2017, at 00:42, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>> 
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
> 
> Hmm, I would have expected that the basic flow would become
> 
> 	for each paths to be processed:
> 		convert-to-worktree to buf
> 		if not delayed:
> 			do the caller's thing to use buf
> 		else:
> 			remember path
> 
> 	for each delayed paths:
> 		ensure filter process finished processing for path
> 		fetch the thing to buf from the process
> 		do the caller's thing to use buf
> 
> and that would make quite a lot of sense.  However, what is actually
> implemented is a bit disappointing from that point of view.  While
> its first part is the same as above, the latter part instead does:
> 
> 	for each delayed paths:
> 		checkout the path
> 
> Presumably, checkout_entry() does the "ensure that the process is
> done converting" (otherwise the result is simply buggy), but what
> disappoints me is that this does not allow callers that call
> "convert-to-working-tree", whose interface is obtain the bytestream 
> in-core in the working tree representation, given an object in the
> object-db representation in an in-core buffer, to _use_ the result
> of the conversion.  The caller does not have a chance to even see
> the result as it is written straight to the filesystem, once it
> calls checkout_delayed_entries().

I am not sure I can follow you here. A caller of "convert_to_working_tree"
would indeed see filtered result. Consider the following example. The 
filter delays the conversion twice and responds with the filtered results
on the third call:

CALL:     int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 1, *dst==''

CALL:     int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 1, *dst==''

CALL:     int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 0, *dst=='FILTERED_CONTENT'

I implemented the "checkout_delayed_entries" function in v1 because
it solved the problem with minimal changes in the existing code. Our previous 
discussion made me think that this is the preferred way:

     I do not think we want to see such a rewrite all over the
     codepaths.  It might be OK to add such a "these entries are known
     to be delayed" list in struct checkout so that the above becomes
     more like this:

       for (i = 0; i < active_nr; i++)
          checkout_entry(active_cache[i], state, NULL);
     + checkout_entry_finish(state);

     That is, addition of a single "some of the checkout_entry() calls
     done so far might have been lazy, and I'll give them a chance to
     clean up" might be palatable.  Anything more than that on the
     caller side is not.

c.f. http://public-inbox.org/git/xmqqvavotych.fsf@gitster.mtv.corp.google.com/

Thanks,
Lars




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 20:14 ` Torsten Bögershausen
@ 2017-01-11  9:48   ` Lars Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Lars Schneider @ 2017-01-11  9:48 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git mailing list, Junio C Hamano, Eric Wong, Jakub Narębski,
	Taylor Blau


> On 08 Jan 2017, at 21:14, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschneider@gmail.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>> 
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
>> 
>> Git has a multiple code paths that checkout a blob. Support delayed
>> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> 
> 
> Some feeling tells me that it may be better to leave convert_to_working_tree() as it is.
> And change convert_to_working_tree_internal as suggested:
> 
> int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
> {
> -	return convert_to_working_tree_internal(path, src, len, dst, 0);
> +	return convert_to_working_tree_internal(path, src, len, dst, NULL, 0);
> }

If I do this then I would have no way to communicate to the caller that the
processing is delayed. Consequently the caller would not know that an additional
call is necessary to fetch the result.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-08 20:45 ` Eric Wong
@ 2017-01-11  9:51   ` Lars Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Lars Schneider @ 2017-01-11  9:51 UTC (permalink / raw)
  To: Eric Wong; +Cc: git, gitster, jnareb


> On 08 Jan 2017, at 21:45, Eric Wong <e@80x24.org> wrote:
> 
> larsxschneider@gmail.com wrote:
>> +++ b/t/t0021/rot13-filter.pl
> 
>> +$DELAY{'test-delay1.r'} = 1;
>> +$DELAY{'test-delay3.r'} = 3;
>> 
>> open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";
>> 
>> @@ -166,6 +176,15 @@ while (1) {
>> 		packet_txt_write("status=abort");
>> 		packet_flush();
>> 	}
>> +	elsif ( $command eq "smudge" and
>> +		    exists $DELAY{$pathname} and
>> +		    $DELAY{$pathname} gt 0 ) {
> 
> Use '>' for numeric comparisons.  'gt' is for strings (man perlop)

Still learning Perl :-)


> Sidenote, staying <= 80 columns for the rest of the changes is
> strongly preferred, some of us need giant fonts.  I think what
> Torsten said about introducing a new *_internal function can
> also help with that.

OK! 

Thank you,
Lars

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
       [not found]   ` <20170109233816.GA70151@Ida>
@ 2017-01-11 10:13     ` Lars Schneider
  2017-01-11 17:59       ` Taylor Blau
  0 siblings, 1 reply; 15+ messages in thread
From: Lars Schneider @ 2017-01-11 10:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, e, jnareb


> On 10 Jan 2017, at 00:38, Taylor Blau <ttaylorr@github.com> wrote:
> 
> I've been considering some alternative approaches in order to make the
> communication between Git and any extension that implements this protocol more
> intuitive.
> 
> In particular, I'm considering alternatives to:
> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
> 
> As I understand it, the above sequence of steps would force Git to either:
> 
> a) loop over all delayed paths and ask the filter if it's done processing,
>   creating a busy-loop between the filter and Git, or...
> b) loop over all delayed paths sequentially, checking out each path in sequence
> 
> I would like to avoid both of those situations, and instead opt for an
> asynchronous approach. In (a), the protocol is far too chatty. In (b), the
> protocol is much less chatty, but forces the checkout to be the very last step,
> which has negative performance implications on checkouts with many large files.
> 
> For instance, checking out several multi-gigabyte files one after the other
> means that a significant amount of time is lost while the filter has some of the
> items ready. Instead of checking them out as they become available, Git waits
> until the very end when they are all available.
> 
> I think it would be preferable for the protocol to specify a sort of "done"
> signal against each path such that Git could check out delayed paths as they
> become available. If implemented this way, Git could checkout files
> asynchronously, while the filter continues to do work on the other end.

In v1 I implemented a) with the busy-loop problem in mind. 

My thinking was this:

If the filter sees at least one filter request twice then the filter knows that
Git has already requested all files that require filtering. At that point the
filter could just block the "delayed" answer to the latest filter request until
at least one of the previously delayed requests can be fulfilled. Then the filter
answers "delay" to Git until Git requests the blob that can be fulfilled. This
process cycles until all requests can be fulfilled. Wouldn't that work?

I think a "done" message by the filter is not easy. Right now the protocol works 
in a mode were Git always asks and the filter always answers. I believe changing
the filter to be able to initiate a "done" message would complicated the protocol.


> Additionally, the protocol should specify a sentinel "no more entries" value
> that could be sent from Git to the filter to signal that there are no more files
> to checkout. Some filters may implement mechanisms for converting files that
> require a signal to know when all files have been sent. Specifically, Git LFS
> (https://git-lfs.github.com) batches files to be transferred together, and needs
> to know when all files have been announced to truncate and send the last batch,
> if it is not yet full. I'm sure other filter implementations use a similar
> mechanism and would benefit from this as well.

I agree. I think the filter already has this info implicitly as explained above
but an explicit message would be better!

Thanks,
Lars

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-10 22:11   ` Jakub Narębski
  2017-01-10 23:33     ` Taylor Blau
@ 2017-01-11 10:20     ` Lars Schneider
  2017-01-11 14:53       ` Jakub Narębski
  1 sibling, 1 reply; 15+ messages in thread
From: Lars Schneider @ 2017-01-11 10:20 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Git mailing list, Eric Wong, Taylor Blau


> On 10 Jan 2017, at 23:11, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>> larsxschneider@gmail.com writes:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> Some `clean` / `smudge` filters might require a significant amount of
>>> time to process a single blob. During this process the Git checkout
>>> operation is blocked and Git needs to wait until the filter is done to
>>> continue with the checkout.
> 
> Lars, what is expected use case for this feature; that is when do you
> think this problem may happen?  Is it something that happened IRL?

Yes, this problem happens every day with filters that perform network
requests (e.g. GitLFS). In GitLFS we even implemented Git wrapper
commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
The ultimate goal of this patch is to be able to get rid of the wrapper 
commands.


>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>> status "delayed" as response to a filter request. Upon this response Git
>>> continues with the checkout operation and asks the filter to process the
>>> blob again after all other blobs have been processed.
>> 
>> Hmm, I would have expected that the basic flow would become
>> 
>> 	for each paths to be processed:
>> 		convert-to-worktree to buf
>> 		if not delayed:
>> 			do the caller's thing to use buf
>> 		else:
>> 			remember path
>> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
> 
> I would expect here to have a kind of event loop, namely
> 
>        while there are delayed paths:
>                get path that is ready from filter
>                fetch the thing to buf (supporting "delayed")
>                if path done
>                        do the caller's thing to use buf 
>                        (e.g. finish checkout path, eof convert, etc.)
> 
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

I could implement "get path that is ready from filter" but wouldn't
that complicate the filter protocol? I think we can use the protocol pretty 
much as if with the strategy outlined here:
http://public-inbox.org/git/F533857D-9B51-44C1-8889-AA0542AD8250@gmail.com/


Thanks,
Lars

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-11 10:20     ` Lars Schneider
@ 2017-01-11 14:53       ` Jakub Narębski
  2017-01-11 20:41         ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Narębski @ 2017-01-11 14:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, Git mailing list, Eric Wong, Taylor Blau

W dniu 11.01.2017 o 11:20, Lars Schneider pisze: 
> On 10 Jan 2017, at 23:11, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>>> larsxschneider@gmail.com writes:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>>
>>>> Some `clean` / `smudge` filters might require a significant amount of
>>>> time to process a single blob. During this process the Git checkout
>>>> operation is blocked and Git needs to wait until the filter is done to
>>>> continue with the checkout.
>>
>> Lars, what is expected use case for this feature; that is when do you
>> think this problem may happen?  Is it something that happened IRL?
> 
> Yes, this problem happens every day with filters that perform network
> requests (e.g. GitLFS). 

Do I understand it correctly that the expected performance improvement
thanks to this feature is possible only if there is some amount of
parallelism and concurrency in the filter?  That is, filter can be sending
one blob to Git while processing other one, or filter can be fetching blobs
in parallel.

This means that filter process works as a kind of (de)multiplexer for
fetching and/or processing blob contents, I think.

> [...] In GitLFS we even implemented Git wrapper
> commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
> The ultimate goal of this patch is to be able to get rid of the wrapper 
> commands.

I'm sorry, I don't see it how the wrapper helps here.

> 
>>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>>> status "delayed" as response to a filter request. Upon this response Git
>>>> continues with the checkout operation and asks the filter to process the
>>>> blob again after all other blobs have been processed.
>>>
>>> Hmm, I would have expected that the basic flow would become
>>>
>>> 	for each paths to be processed:
>>> 		convert-to-worktree to buf
>>> 		if not delayed:
>>> 			do the caller's thing to use buf
>>> 		else:
>>> 			remember path
>>>
>>> 	for each delayed paths:
>>> 		ensure filter process finished processing for path
>>> 		fetch the thing to buf from the process
>>> 		do the caller's thing to use buf
>>
>> I would expect here to have a kind of event loop, namely
>>
>>        while there are delayed paths:
>>                get path that is ready from filter
>>                fetch the thing to buf (supporting "delayed")
>>                if path done
>>                        do the caller's thing to use buf 
>>                        (e.g. finish checkout path, eof convert, etc.)
>>
>> We can either trust filter process to tell us when it finished sending
>> delayed paths, or keep list of paths that are being delayed in Git.
> 
> I could implement "get path that is ready from filter" but wouldn't
> that complicate the filter protocol? I think we can use the protocol pretty 
> much as if with the strategy outlined here:
> http://public-inbox.org/git/F533857D-9B51-44C1-8889-AA0542AD8250@gmail.com/

You are talking about the "busy-loop" solution, isn't it?  In the
same notation, it would look like this:

          while there are delayed paths:
                  for each delayed path:
                          request path from filter [1]
                          fetch the thing (supporting "delayed") [2]
                          if path done
                                  do the caller's thing to use buf
                                  remove it from delayed paths list


Footnotes:
----------
1) We don't send the Git-side contents of blob again, isn't it?
   So we need some protocol extension / new understanding anyway.
   for example that we don't send contents if we request path again.
2) If path is not ready at all, filter protocol would send status=delayed
   with empty contents.  This means that we would immediately go to the
   next path, if there is one.

There are some cases where busy loop is preferable, but I don't think
it is the case here.


The event loop solution would require additional protocol extension,
but I don't think those complicate protocol too much:

A. Git would need to signal filter process that it has sent all paths,
and that it should be sending delayed paths when they are ready.  This
could be done for example with "command=continue".

    packet:          git> command=continue


B. Filter driver, in the event-loop phase, when (de)multiplexing fetching
or processing of data, it would need now to initialize transfer, instead
of waiting for Git to ask.  It could look like this:

    packet:          git< status=resumed [3]
    packet:          git< pathname=file/to/be/resumed [4]
    packet:          git< 0000
    packet:          git< SMUDGED_CONTENT_CONTINUED
    packet:          git< 0000
    packet:          git< 0000  # empty list, means "status=success" [5]

Footnotes:
----------
3.) It could be "status=success", "status=delayed", "command=resumed", etc.
4.) In the future we can add byte at which we resume, size of file, etc.
5.) Of course sending reminder of contents may be further delayed.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-11 10:13     ` Lars Schneider
@ 2017-01-11 17:59       ` Taylor Blau
  0 siblings, 0 replies; 15+ messages in thread
From: Taylor Blau @ 2017-01-11 17:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, git, e, jnareb

On Wed, Jan 11, 2017 at 11:13:00AM +0100, Lars Schneider wrote:
>
> In v1 I implemented a) with the busy-loop problem in mind.
>
> My thinking was this:
>
> If the filter sees at least one filter request twice then the filter knows that
> Git has already requested all files that require filtering. At that point the
> filter could just block the "delayed" answer to the latest filter request until
> at least one of the previously delayed requests can be fulfilled. Then the filter
> answers "delay" to Git until Git requests the blob that can be fulfilled. This
> process cycles until all requests can be fulfilled. Wouldn't that work?
>
> I think a "done" message by the filter is not easy. Right now the protocol works
> in a mode were Git always asks and the filter always answers. I believe changing
> the filter to be able to initiate a "done" message would complicated the protocol.
>
> > Additionally, the protocol should specify a sentinel "no more entries" value
> > that could be sent from Git to the filter to signal that there are no more files
> > to checkout. Some filters may implement mechanisms for converting files that
> > require a signal to know when all files have been sent. Specifically, Git LFS
> > (https://git-lfs.github.com) batches files to be transferred together, and needs
> > to know when all files have been announced to truncate and send the last batch,
> > if it is not yet full. I'm sure other filter implementations use a similar
> > mechanism and would benefit from this as well.
>
> I agree. I think the filter already has this info implicitly as explained above
> but an explicit message would be better!

This works, but forces some undesirable constraints:

- The filter has to keep track of all items in the checkout to determine how
  many times Git has asked about them. This is probably fine, though annoying.
  Cases where this is not fine is when there are _many_ items in the checkout
  forcing filter implementations to keep track of large sets of data.
- The protocol is now _must_ ask for the status of checkout items in a
  particular order. If the assumption is that "Git has sent me all items in the
  checkout by the point I see Git ask for the status of an item more than once",
  then Git _cannot_ ask for the status of a delayed item while there are still
  more items to checkout. This could be OK, so long as the protocol commits to
  it and documents this behavior.

What are your thoughts?

--
Thanks,
Taylor Blau

ttaylorr@github.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-11 14:53       ` Jakub Narębski
@ 2017-01-11 20:41         ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2017-01-11 20:41 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Lars Schneider, Git mailing list, Eric Wong, Taylor Blau

Jakub Narębski <jnareb@gmail.com> writes:

>> Yes, this problem happens every day with filters that perform network
>> requests (e.g. GitLFS). 
>
> Do I understand it correctly that the expected performance improvement
> thanks to this feature is possible only if there is some amount of
> parallelism and concurrency in the filter?  That is, filter can be sending
> one blob to Git while processing other one, or filter can be fetching blobs
> in parallel.

The first-object latency may not be helped, but by allowing
"delayed", the latency to retrieve the second and subsequent objects
can be hidden, I would think.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
  2017-01-11  9:43   ` Lars Schneider
@ 2017-01-11 20:45     ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2017-01-11 20:45 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git mailing list, Eric Wong, Jakub Narębski,
	Torsten Bögershausen, Taylor Blau

Lars Schneider <larsxschneider@gmail.com> writes:

>> Hmm, I would have expected that the basic flow would become
>> 
>> 	for each paths to be processed:
>> 		convert-to-worktree to buf
>> 		if not delayed:
>> 			do the caller's thing to use buf
>> 		else:
>> 			remember path
>> 
>> 	for each delayed paths:
>> 		ensure filter process finished processing for path
>> 		fetch the thing to buf from the process
>> 		do the caller's thing to use buf
>> 
>> and that would make quite a lot of sense.  However, what is actually
>> implemented is a bit disappointing from that point of view.  While
>> its first part is the same as above, the latter part instead does:
>> 
>> 	for each delayed paths:
>> 		checkout the path
>> ...
>
> I am not sure I can follow you here.
> ...
> I implemented the "checkout_delayed_entries" function in v1 because
> it solved the problem with minimal changes in the existing code. Our previous 
> discussion made me think that this is the preferred way:
>
>      I do not think we want to see such a rewrite all over the
>      codepaths.  It might be OK to add such a "these entries are known
>      to be delayed" list in struct checkout so that the above becomes
>      more like this:
>
>        for (i = 0; i < active_nr; i++)
>           checkout_entry(active_cache[i], state, NULL);
>      + checkout_entry_finish(state);
>
>      That is, addition of a single "some of the checkout_entry() calls
>      done so far might have been lazy, and I'll give them a chance to
>      clean up" might be palatable.  Anything more than that on the
>      caller side is not.

But that is apples-and-oranges comparision, no?  The old discussion
assumes there is no "caller's thing to use buf" other than "checkout
to the working tree", which is why the function its main loop calls
is "checkout_entry()" and the caller does not see the contents of
the filtered blob at all.  In that context, checkout_entry_finish()
that equally does not let the caller see the contents of the
filtered blob is quite appropriate.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-01-11 20:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-08 19:17 [PATCH v1] convert: add "status=delayed" to filter process protocol larsxschneider
2017-01-08 20:14 ` Torsten Bögershausen
2017-01-11  9:48   ` Lars Schneider
2017-01-08 20:45 ` Eric Wong
2017-01-11  9:51   ` Lars Schneider
2017-01-08 23:42 ` Junio C Hamano
2017-01-10 22:11   ` Jakub Narębski
2017-01-10 23:33     ` Taylor Blau
2017-01-11 10:20     ` Lars Schneider
2017-01-11 14:53       ` Jakub Narębski
2017-01-11 20:41         ` Junio C Hamano
2017-01-11  9:43   ` Lars Schneider
2017-01-11 20:45     ` Junio C Hamano
     [not found]   ` <20170109233816.GA70151@Ida>
2017-01-11 10:13     ` Lars Schneider
2017-01-11 17:59       ` Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).