[PATCH v2] convert: add "status=delayed" to filter process protocol

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v2] convert: add "status=delayed" to filter process protocol
@ 2017-02-26 18:48 Lars Schneider
  2017-02-27  9:58 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Schneider @ 2017-02-26 18:48 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, tboegi, e, jnareb, ttaylorr

Some `clean` / `smudge` filters might require a significant amount of
time to process a single blob. During this process the Git checkout
operation is blocked and Git needs to wait until the filter is done to
continue with the checkout.

Teach the filter process protocol (introduced in edcc858) to accept the
status "delayed" as response to a filter request. Upon this response Git
continues with the checkout operation and asks the filter to process the
blob again after all other blobs have been processed.

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---

Hi,

in v1 Junio criticized the "convert.h" interface of this patch [1].
After talking to Peff I think I understand Junio's point and I would
like to get your feedback on the new approach here. Please ignore all
changes behind async_convert_to_working_tree() and async_filter_finish()
for now as I plan to change the implementation as soon as the interface
is in an acceptable state.

The new interface also addresses Torsten's feedback and leaves
convert_to_working_tree() as is [2].

I also use '>' for numeric comparisons in Perl as suggested by Eric [3].

Please note, I rebased the patch to v2.12 as v1 did not apply clean on
master anymore.

Thanks,
Lars

[1] http://public-inbox.org/git/xmqqa8b115ll.fsf@gitster.mtv.corp.google.com/
[2] http://public-inbox.org/git/20170108201415.GA3569@tb-raspi/
[3] http://public-inbox.org/git/20170108204517.GA13779@starla/


RFC: http://public-inbox.org/git/D10F7C47-14E8-465B-8B7A-A09A1B28A39F@gmail.com/
 v1: http://public-inbox.org/git/20170108191736.47359-1-larsxschneider@gmail.com/


Notes:
    Base Ref: v2.12.0
    Web-Diff: https://github.com/larsxschneider/git/commit/13d5b37021
    Checkout: git fetch https://github.com/larsxschneider/git filter-process/delay-v2 && git checkout 13d5b37021

 Documentation/gitattributes.txt |  9 ++++++
 builtin/checkout.c              |  1 +
 cache.h                         |  1 +
 convert.c                       | 68 +++++++++++++++++++++++++++++++++--------
 convert.h                       | 13 ++++++++
 entry.c                         | 29 +++++++++++++++---
 t/t0021-conversion.sh           | 53 ++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 19 ++++++++++++
 unpack-trees.c                  |  1 +
 9 files changed, 176 insertions(+), 18 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index e0b66c1220..f6bad8db40 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -473,6 +473,15 @@ packet:          git< 0000  # empty content!
 packet:          git< 0000  # empty list, keep "status=success" unchanged!
 ------------------------

+If the request cannot be fulfilled within a reasonable amount of time
+then the filter can respond with a "delayed" status and a flush packet.
+Git will perform the same request at a later point in time, again. The
+filter can delay a response multiple times for a single request.
+------------------------
+packet:          git< status=delayed
+packet:          git< 0000
+------------------------
+
 In case the filter cannot or does not want to process the content,
 it is expected to respond with an "error" status.
 ------------------------
diff --git a/builtin/checkout.c b/builtin/checkout.c
index f174f50303..742e8742cd 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -369,6 +369,7 @@ static int checkout_paths(const struct checkout_opts *opts,
 			pos = skip_same_name(ce, pos) - 1;
 		}
 	}
+	errs |= checkout_delayed_entries(&state);

 	if (write_locked_index(&the_index, lock_file, COMMIT_LOCK))
 		die(_("unable to write new index file"));
diff --git a/cache.h b/cache.h
index 61fc86e6d7..66dde99a79 100644
--- a/cache.h
+++ b/cache.h
@@ -1434,6 +1434,7 @@ struct checkout {

 #define TEMPORARY_FILENAME_LENGTH 25
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
+extern int checkout_delayed_entries(const struct checkout *state);

 struct cache_def {
 	struct strbuf path;
diff --git a/convert.c b/convert.c
index 4e17e45ed2..24d29f5c53 100644
--- a/convert.c
+++ b/convert.c
@@ -4,6 +4,7 @@
 #include "quote.h"
 #include "sigchain.h"
 #include "pkt-line.h"
+#include "list.h"

 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -38,6 +39,13 @@ struct text_stat {
 	unsigned printable, nonprintable;
 };

+static LIST_HEAD(delayed_item_queue_head);
+
+struct delayed_item {
+	void* item;
+	struct list_head node;
+};
+
 static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
 {
 	unsigned long i;
@@ -672,7 +680,7 @@ static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, cons
 }

 static int apply_multi_file_filter(const char *path, const char *src, size_t len,
-				   int fd, struct strbuf *dst, const char *cmd,
+				   int fd, struct strbuf *dst, int *delayed, const char *cmd,
 				   const unsigned int wanted_capability)
 {
 	int err;
@@ -738,9 +746,14 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len
 		goto done;

 	read_multi_file_filter_status(process->out, &filter_status);
-	err = strcmp(filter_status.buf, "success");
-	if (err)
+	if (delayed && !strcmp(filter_status.buf, "delayed")) {
+		*delayed = 1;
 		goto done;
+	} else {
+		err = strcmp(filter_status.buf, "success");
+		if (err)
+			goto done;
+	}

 	err = read_packetized_to_strbuf(process->out, &nbuf) < 0;
 	if (err)
@@ -787,8 +800,8 @@ static struct convert_driver {
 } *user_convert, **user_convert_tail;

 static int apply_filter(const char *path, const char *src, size_t len,
-			int fd, struct strbuf *dst, struct convert_driver *drv,
-			const unsigned int wanted_capability)
+			int fd, struct strbuf *dst, int *delayed,
+			struct convert_driver *drv, const unsigned int wanted_capability)
 {
 	const char *cmd = NULL;

@@ -806,7 +819,7 @@ static int apply_filter(const char *path, const char *src, size_t len,
 	if (cmd && *cmd)
 		return apply_single_file_filter(path, src, len, fd, dst, cmd);
 	else if (drv->process && *drv->process)
-		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
+		return apply_multi_file_filter(path, src, len, fd, dst, delayed, drv->process, wanted_capability);

 	return 0;
 }
@@ -1152,7 +1165,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;

-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
+	return apply_filter(path, NULL, 0, -1, NULL, NULL, ca.drv, CAP_CLEAN);
 }

 const char *get_convert_attr_ascii(const char *path)
@@ -1189,7 +1202,7 @@ int convert_to_git(const char *path, const char *src, size_t len,

 	convert_attrs(&ca, path);

-	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN);
+	ret |= apply_filter(path, src, len, -1, dst, NULL, ca.drv, CAP_CLEAN);
 	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

@@ -1214,7 +1227,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	assert(ca.drv);
 	assert(ca.drv->clean || ca.drv->process);

-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN))
+	if (!apply_filter(path, NULL, 0, fd, dst, NULL, ca.drv, CAP_CLEAN))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -1222,7 +1235,7 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 }

 static int convert_to_working_tree_internal(const char *path, const char *src,
-					    size_t len, struct strbuf *dst,
+					    size_t len, struct strbuf *dst, int *delayed,
 					    int normalizing)
 {
 	int ret = 0, ret_filter = 0;
@@ -1248,21 +1261,50 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}

-	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, CAP_SMUDGE);
+	ret_filter = apply_filter(path, src, len, -1, dst, delayed, ca.drv, CAP_SMUDGE);
 	if (!ret_filter && ca.drv && ca.drv->required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);

 	return ret | ret_filter;
 }

+int async_convert_to_working_tree(const char *path, const char *src,
+								  size_t len, struct strbuf *dst, void *item)
+{
+	int delayed = 0;
+	struct delayed_item *delayed_item;
+	if (convert_to_working_tree_internal(path, src, len, dst, &delayed, 0)) {
+		if (delayed) {
+			delayed_item = xmalloc(sizeof(*delayed_item));
+			delayed_item->item = item;
+			list_add_tail(&delayed_item->node, &delayed_item_queue_head);
+			return ASYNC_FILTER_DELAYED;
+		}
+		return ASYNC_FILTER_SUCCESS;
+	}
+	return ASYNC_FILTER_FAIL;
+}
+
+void* async_filter_finish(void)
+{
+	struct delayed_item *head;
+	if (!list_empty(&delayed_item_queue_head)) {
+		head = list_first_entry(&delayed_item_queue_head,
+			struct delayed_item, node);
+		list_del(&head->node);
+		return head->item;
+	}
+	return NULL;
+}
+
 int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
 {
-	return convert_to_working_tree_internal(path, src, len, dst, 0);
+	return convert_to_working_tree_internal(path, src, len, dst, NULL, 0);
 }

 int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst)
 {
-	int ret = convert_to_working_tree_internal(path, src, len, dst, 1);
+	int ret = convert_to_working_tree_internal(path, src, len, dst, NULL, 1);
 	if (ret) {
 		src = dst->buf;
 		len = dst->len;
diff --git a/convert.h b/convert.h
index 82871a11d5..acc016de9f 100644
--- a/convert.h
+++ b/convert.h
@@ -4,6 +4,15 @@
 #ifndef CONVERT_H
 #define CONVERT_H

+enum async_filter {
+	ASYNC_FILTER_SUCCESS = 0,
+	ASYNC_FILTER_FAIL = 1,
+	ASYNC_FILTER_DELAYED = 2
+};
+
+extern enum async_filter async_filter;
+
+
 enum safe_crlf {
 	SAFE_CRLF_FALSE = 0,
 	SAFE_CRLF_FAIL = 1,
@@ -42,6 +51,10 @@ extern int convert_to_git(const char *path, const char *src, size_t len,
 			  struct strbuf *dst, enum safe_crlf checksafe);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
+extern int async_convert_to_working_tree(const char *path, const char *src,
+					 size_t len, struct strbuf *dst,
+					 void *item);
+extern void* async_filter_finish(void);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,
 			      struct strbuf *dst);
 static inline int would_convert_to_git(const char *path)
diff --git a/entry.c b/entry.c
index c6eea240b6..d15e69a55e 100644
--- a/entry.c
+++ b/entry.c
@@ -177,11 +177,17 @@ static int write_entry(struct cache_entry *ce,
 		/*
 		 * Convert from git internal format to working tree format
 		 */
-		if (ce_mode_s_ifmt == S_IFREG &&
-		    convert_to_working_tree(ce->name, new, size, &buf)) {
-			free(new);
-			new = strbuf_detach(&buf, &newsize);
-			size = newsize;
+		if (ce_mode_s_ifmt == S_IFREG) {
+			ret = async_convert_to_working_tree(ce->name, new, size, &buf, ce);
+			if (ret == ASYNC_FILTER_SUCCESS) {
+				free(new);
+				new = strbuf_detach(&buf, &newsize);
+				size = newsize;
+			}
+			else if (ret == ASYNC_FILTER_DELAYED) {
+				free(new);
+				goto finish;
+			}
 		}

 		fd = open_output_fd(path, ce, to_tempfile);
@@ -291,3 +297,16 @@ int checkout_entry(struct cache_entry *ce,
 	create_directories(path.buf, path.len, state);
 	return write_entry(ce, path.buf, state, 0);
 }
+
+int checkout_delayed_entries(const struct checkout *state)
+{
+	struct cache_entry *ce;
+	int errs = 0;
+
+	while ((ce = async_filter_finish())) {
+		ce->ce_flags &= ~CE_UPDATE;
+		errs |= checkout_entry(ce, state, NULL);
+	}
+
+	return errs;
+}
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 161f560446..8ae5b1a521 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -701,4 +701,57 @@ test_expect_success PERL 'invalid process filter must fail (and not hang!)' '
 	)
 '

+test_expect_success PERL 'delayed checkout in process filter' '
+	test_config_global filter.protocol.process "rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+		echo "*.r filter=protocol" >.gitattributes &&
+		cp "$TEST_ROOT/test.o" test.r &&
+		cp "$TEST_ROOT/test.o" test-delay1.r &&
+		cp "$TEST_ROOT/test.o" test-delay3.r &&
+		git add . &&
+		git commit -m "test commit 1"
+	) &&
+
+	S=$(file_size repo/test.r) &&
+	rm -rf repo-cloned &&
+	filter_git clone repo repo-cloned &&
+	cat >expected.log <<-EOF &&
+		START
+		init handshake complete
+		IN: smudge test.r $S [OK] -- OUT: $S . [OK]
+		IN: smudge test-delay1.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay1.r $S [OK] -- OUT: $S . [OK]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+		IN: smudge test-delay3.r $S [OK] -- OUT: $S . [OK]
+		STOP
+	EOF
+	test_cmp_count expected.log repo-cloned/rot13-filter.log &&
+
+	(
+		cd repo-cloned &&
+		rm *.r rot13-filter.log &&
+		filter_git checkout . &&
+		cat >expected.log <<-EOF &&
+			START
+			init handshake complete
+			IN: smudge test.r $S [OK] -- OUT: $S . [OK]
+			IN: smudge test-delay1.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay1.r $S [OK] -- OUT: $S . [OK]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S [DELAYED]
+			IN: smudge test-delay3.r $S [OK] -- OUT: $S . [OK]
+			STOP
+		EOF
+		test_cmp_count expected.log rot13-filter.log
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
index 617f581e56..ece0d314b4 100644
--- a/t/t0021/rot13-filter.pl
+++ b/t/t0021/rot13-filter.pl
@@ -17,6 +17,10 @@
 #     operation then the filter signals that it cannot or does not want
 #     to process the file and any file after that is processed with the
 #     same command.
+# (5) If data with a pathname that is a key in the DELAY hash is
+#     processed (e.g. 'test-delay1.r') then the filter signals n times
+#     to Git that the processing is delayed (n being the value of the
+#     DELAY hash key).
 #

 use strict;
@@ -25,6 +29,12 @@ use IO::File;

 my $MAX_PACKET_CONTENT_SIZE = 65516;
 my @capabilities            = @ARGV;
+my $DELAY3 = 3;
+my $DELAY1 = 1;
+
+my %DELAY;
+$DELAY{'test-delay1.r'} = 1;
+$DELAY{'test-delay3.r'} = 3;

 open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";

@@ -166,6 +176,15 @@ while (1) {
 		packet_txt_write("status=abort");
 		packet_flush();
 	}
+	elsif ( $command eq "smudge" and
+		    exists $DELAY{$pathname} and
+		    $DELAY{$pathname} > 0 ) {
+		$DELAY{$pathname} = $DELAY{$pathname} - 1;
+		print $debug "[DELAYED]\n";
+		$debug->flush();
+		packet_txt_write("status=delayed");
+		packet_flush();
+	}
 	else {
 		packet_txt_write("status=success");
 		packet_flush();
diff --git a/unpack-trees.c b/unpack-trees.c
index 3a8ee19fe8..6b3246db03 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -315,6 +315,7 @@ static int check_updates(struct unpack_trees_options *o)
 			}
 		}
 	}
+	errs |= checkout_delayed_entries(&state);
 	stop_progress(&progress);
 	if (o->update)
 		git_attr_set_direction(GIT_ATTR_CHECKIN, NULL);

base-commit: e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
--
2.11.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-26 18:48 [PATCH v2] convert: add "status=delayed" to filter process protocol Lars Schneider
@ 2017-02-27  9:58 ` Jeff King
  2017-02-27 10:32   ` Lars Schneider
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-02-27  9:58 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, gitster, tboegi, e, jnareb, ttaylorr

On Sun, Feb 26, 2017 at 07:48:16PM +0100, Lars Schneider wrote:

> +If the request cannot be fulfilled within a reasonable amount of time
> +then the filter can respond with a "delayed" status and a flush packet.
> +Git will perform the same request at a later point in time, again. The
> +filter can delay a response multiple times for a single request.
> +------------------------
> +packet:          git< status=delayed
> +packet:          git< 0000
> +------------------------
> +

So Git just asks for the same content again? I see two issues with that:

  1. Does git have to feed the blob content again? That can be expensive
     to access or to keep around in memory.

  2. What happens when the item isn't ready on the second request? I can
     think of a few options:

       a. The filter immediately says "nope, still delayed". But then
          Git ends up busy-looping with "is this one ready yet?"

       b. The filter blocks until the item is ready. But then if other
	  items _are_ ready, Git cannot work on processing them. We lose
	  parallelism.

       c. You could do a hybrid: block until _some_ item is ready, and
          then issue "delayed" responses for everything that isn't
	  ready. Then if you assume that Git is looping over and over
	  through the set of objects, it will either block or pick up
	  _something_ on each loop.

	  But it makes a quadratic number of requests in the worst case.
	  E.g., imagine you have N items and the last one is available
	  first, then the second-to-last, and so on. You'll ask N times,
	  then N-1, then N-2, and so on.

I think it would be much more efficient to do something like:

  [Git issues a request and gives it an opaque index id]
  git> command=smudge
  git> pathname=foo
  git> index=0
  git> 0000
  git> CONTENT
  git> 0000

  [The data isn't ready yet, so the filter tells us so...]
  git< status=delayed
  git< 0000

  [Git may make other requests, that are either served or delayed]
  git> command=smudge
  git> pathname=foo
  git> index=1
  git> 0000
  git< status=success
  git< 0000
  git< CONTENT
  git< 0000

  [Now Git has processed all of the items, and each one either has its
   final status, or has been marked as delayed. So we ask for a delayed
   item]
  git> command=wait
  git> 0000

  [Some time may pass if nothing is ready. But eventually we get...]
  git< status=success
  git< index=0
  git< 0000
  git< CONTENT
  git< 0000

From Git's side, the loop is something like:

  while (delayed_items > 0) {
	/* issue a wait, and get back the status/index pair */
	status = send_wait(&index);
	delayed_items--;

	/*
	 * use "index" to find the right item in our list of files;
	 * the format can be opaque to the filter, so we could index
	 * it however we like. But probably numeric indices in an array
	 * are the simplest.
	 */
	assert(index > 0 && index < nr_items);
	item[index].status = status;
	if (status == SUCCESS)
		read_content(&item[index]);
  }

and the filter side just attaches the "index" string to whatever its
internal queue structure is, and feeds it back verbatim when processing
that item finishes.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-27  9:58 ` Jeff King
@ 2017-02-27 10:32   ` Lars Schneider
  2017-02-27 10:53     ` Jeff King
  2017-02-27 22:11     ` Jakub Narębski
  0 siblings, 2 replies; 7+ messages in thread
From: Lars Schneider @ 2017-02-27 10:32 UTC (permalink / raw)
  To: Jeff King
  Cc: Git List, Junio C Hamano, Torsten Bögershausen, Eric Wong,
	Jakub Narębski, ttaylorr


> On 27 Feb 2017, at 10:58, Jeff King <peff@peff.net> wrote:
> 
> On Sun, Feb 26, 2017 at 07:48:16PM +0100, Lars Schneider wrote:
> 
>> +If the request cannot be fulfilled within a reasonable amount of time
>> +then the filter can respond with a "delayed" status and a flush packet.
>> +Git will perform the same request at a later point in time, again. The
>> +filter can delay a response multiple times for a single request.
>> +------------------------
>> +packet:          git< status=delayed
>> +packet:          git< 0000
>> +------------------------
>> +
> 
> So Git just asks for the same content again? I see two issues with that:
> 
>  1. Does git have to feed the blob content again? That can be expensive
>     to access or to keep around in memory.
> 
>  2. What happens when the item isn't ready on the second request? I can
>     think of a few options:
> 
>       a. The filter immediately says "nope, still delayed". But then
>          Git ends up busy-looping with "is this one ready yet?"
> 
>       b. The filter blocks until the item is ready. But then if other
> 	  items _are_ ready, Git cannot work on processing them. We lose
> 	  parallelism.
> 
>       c. You could do a hybrid: block until _some_ item is ready, and
>          then issue "delayed" responses for everything that isn't
> 	  ready. Then if you assume that Git is looping over and over
> 	  through the set of objects, it will either block or pick up
> 	  _something_ on each loop.
> 
> 	  But it makes a quadratic number of requests in the worst case.
> 	  E.g., imagine you have N items and the last one is available
> 	  first, then the second-to-last, and so on. You'll ask N times,
> 	  then N-1, then N-2, and so on.

I completely agree - I need to change that. However, the goal of the v2
iteration was to get the "convert" interface in an acceptable state.
That's what I intended to say in the patch comment section:

    "Please ignore all changes behind async_convert_to_working_tree() and 
     async_filter_finish() for now as I plan to change the implementation 
     as soon as the interface is in an acceptable state."

> 
> I think it would be much more efficient to do something like:
> 
>  [Git issues a request and gives it an opaque index id]
>  git> command=smudge
>  git> pathname=foo
>  git> index=0
>  git> 0000
>  git> CONTENT
>  git> 0000
> 
>  [The data isn't ready yet, so the filter tells us so...]
>  git< status=delayed
>  git< 0000
> 
>  [Git may make other requests, that are either served or delayed]
>  git> command=smudge
>  git> pathname=foo
>  git> index=1
>  git> 0000
>  git< status=success
>  git< 0000
>  git< CONTENT
>  git< 0000
> 
>  [Now Git has processed all of the items, and each one either has its
>   final status, or has been marked as delayed. So we ask for a delayed
>   item]
>  git> command=wait
>  git> 0000
> 
>  [Some time may pass if nothing is ready. But eventually we get...]
>  git< status=success
>  git< index=0
>  git< 0000
>  git< CONTENT
>  git< 0000
> 
> From Git's side, the loop is something like:
> 
>  while (delayed_items > 0) {
> 	/* issue a wait, and get back the status/index pair */
> 	status = send_wait(&index);
> 	delayed_items--;
> 
> 	/*
> 	 * use "index" to find the right item in our list of files;
> 	 * the format can be opaque to the filter, so we could index
> 	 * it however we like. But probably numeric indices in an array
> 	 * are the simplest.
> 	 */
> 	assert(index > 0 && index < nr_items);
> 	item[index].status = status;
> 	if (status == SUCCESS)
> 		read_content(&item[index]);
>  }
> 
> and the filter side just attaches the "index" string to whatever its
> internal queue structure is, and feeds it back verbatim when processing
> that item finishes.

That could work! I had something like that in mind:

I teach Git a new command "list_completed" or similar. The filter
blocks this call until at least one item is ready for Git. 
Then the filter responds with a list of paths that identify the
"ready items". Then Git asks for these ready items just with the
path and not with any content. Could that work? Wouldn't the path
be "unique" to identify a blob per filter run?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-27 10:32   ` Lars Schneider
@ 2017-02-27 10:53     ` Jeff King
  2017-04-09 18:17       ` Lars Schneider
  2017-02-27 22:11     ` Jakub Narębski
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-02-27 10:53 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen, Eric Wong,
	Jakub Narębski, ttaylorr

On Mon, Feb 27, 2017 at 11:32:47AM +0100, Lars Schneider wrote:

> I completely agree - I need to change that. However, the goal of the v2
> iteration was to get the "convert" interface in an acceptable state.
> That's what I intended to say in the patch comment section:
> 
>     "Please ignore all changes behind async_convert_to_working_tree() and 
>      async_filter_finish() for now as I plan to change the implementation 
>      as soon as the interface is in an acceptable state."

Ah, sorry, I missed that. I would think the underlying approach would
influence the interface to some degree. But as long as the interface
is sufficiently abstract, I think it gives you enough flexibility.

> > From Git's side, the loop is something like:
> > 
> >  while (delayed_items > 0) {
> > 	/* issue a wait, and get back the status/index pair */
> > 	status = send_wait(&index);
> > 	delayed_items--;
> > 
> > 	/*
> > 	 * use "index" to find the right item in our list of files;
> > 	 * the format can be opaque to the filter, so we could index
> > 	 * it however we like. But probably numeric indices in an array
> > 	 * are the simplest.
> > 	 */
> > 	assert(index > 0 && index < nr_items);
> > 	item[index].status = status;
> > 	if (status == SUCCESS)
> > 		read_content(&item[index]);
> >  }
> > 
> > and the filter side just attaches the "index" string to whatever its
> > internal queue structure is, and feeds it back verbatim when processing
> > that item finishes.
> 
> That could work! I had something like that in mind:
> 
> I teach Git a new command "list_completed" or similar. The filter
> blocks this call until at least one item is ready for Git. 
> Then the filter responds with a list of paths that identify the
> "ready items". Then Git asks for these ready items just with the
> path and not with any content. Could that work? Wouldn't the path
> be "unique" to identify a blob per filter run?

I think that could work, though I think there are few minor downsides
compared to what I wrote above:

  - if you respond with "these items are ready", and then make Git ask
    for each again, it's an extra round-trip for each set of ready
    items. You could just say "an item is ready; here it is" in a single
    response. For a local pipe the latency is probably negligible,
    though.

  - using paths as the index would probably work, but it means Git has
    to use the path to find the "struct checkout_entry" again. Which
    might mean a hashmap (though if you have them all in a sorted list,
    I guess you could also do a binary search).

  - Using an explicit index communicates to the filter not only what the
    index is, but also that Git is prepared to accept a delayed response
    for the item. For backwards compatibility, the filter would probably
    advertise "I have the 'delayed' capability", and then Git could
    choose to use it or not on a per-item basis. Realistically it would
    not change from item to item, but rather operation to operation. So
    that means we can easily convert the call-sites in Git to the async
    approach incrementally. As each one is converted, it turns on the
    flag that causes the filter code to send the "index" tag.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-27 10:32   ` Lars Schneider
  2017-02-27 10:53     ` Jeff King
@ 2017-02-27 22:11     ` Jakub Narębski
  2017-04-09 18:41       ` Lars Schneider
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Narębski @ 2017-02-27 22:11 UTC (permalink / raw)
  To: Lars Schneider, Jeff King
  Cc: Git List, Junio C Hamano, Torsten Bögershausen, Eric Wong,
	Taylor Blau

W dniu 27.02.2017 o 11:32, Lars Schneider pisze:
> 
>> On 27 Feb 2017, at 10:58, Jeff King <peff@peff.net> wrote:
>>
>> On Sun, Feb 26, 2017 at 07:48:16PM +0100, Lars Schneider wrote:
>>
>>> +If the request cannot be fulfilled within a reasonable amount of time
>>> +then the filter can respond with a "delayed" status and a flush packet.
>>> +Git will perform the same request at a later point in time, again. The
>>> +filter can delay a response multiple times for a single request.
>>> +------------------------
>>> +packet:          git< status=delayed
>>> +packet:          git< 0000
>>> +------------------------

Is it something that happens instead of filter process sending the contents
of file, or is it something that happens after sending some part of the
contents (maybe empty) instead of empty list to keep "status=success"
unchanged or instead of "status=error" if there was a problem processing
file?

>>> +
>>
>> So Git just asks for the same content again? I see two issues with that:
>>
>>  1. Does git have to feed the blob content again? That can be expensive
>>     to access or to keep around in memory.
>>
>>  2. What happens when the item isn't ready on the second request? I can
>>     think of a few options:
>>
>>       a. The filter immediately says "nope, still delayed". But then
>>          Git ends up busy-looping with "is this one ready yet?"
>>
>>       b. The filter blocks until the item is ready. But then if other
>> 	  items _are_ ready, Git cannot work on processing them. We lose
>> 	  parallelism.
>>
>>       c. You could do a hybrid: block until _some_ item is ready, and
>>          then issue "delayed" responses for everything that isn't
>> 	  ready. Then if you assume that Git is looping over and over
>> 	  through the set of objects, it will either block or pick up
>> 	  _something_ on each loop.
>>
>> 	  But it makes a quadratic number of requests in the worst case.
>> 	  E.g., imagine you have N items and the last one is available
>> 	  first, then the second-to-last, and so on. You'll ask N times,
>> 	  then N-1, then N-2, and so on.

The current solution is a 'busy loop' one that I wrote about[1][2],
see below.

> 
> I completely agree - I need to change that. However, the goal of the v2
> iteration was to get the "convert" interface in an acceptable state.
> That's what I intended to say in the patch comment section:
> 
>     "Please ignore all changes behind async_convert_to_working_tree() and 
>      async_filter_finish() for now as I plan to change the implementation 
>      as soon as the interface is in an acceptable state."

I think that it is more important to start with a good abstraction,
and the proposal for protocol, rather than getting bogged down in
implementation details that may change as the idea for protocol
extension changes.

>>
>> I think it would be much more efficient to do something like:
>>
>>  [Git issues a request and gives it an opaque index id]
>>  git> command=smudge
>>  git> pathname=foo
>>  git> index=0
>>  git> 0000
>>  git> CONTENT
>>  git> 0000
>>
>>  [The data isn't ready yet, so the filter tells us so...]
>>  git< status=delayed
>>  git< 0000

So is it only as replacement for "status=success" + contents or
"status=abort", that is upfront before sending any part of the file?

Or, as one can assume from the point of the paragraph with the
"status=delayed", it is about replacing null list for success or
"status=error" after sending some part (maybe empty) of a file,
that is:

    [filter driver says that it can process contents]
    git< status=success
    git< 0000
    git< PARTIAL_SMUDGED_CONTENT (maybe empty)
    [there was some delay, for example one of shards is slow]
    git< 0000
    git< status=delayed
    git< 0000

>>
>>  [Git may make other requests, that are either served or delayed]
>>  git> command=smudge
>>  git> pathname=foo
>>  git> index=1
>>  git> 0000
>>  git< status=success
>>  git< 0000
>>  git< CONTENT
>>  git< 0000
>>
>>  [Now Git has processed all of the items, and each one either has its
>>   final status, or has been marked as delayed. So we ask for a delayed
>>   item]
>>  git> command=wait
>>  git> 0000

In my proposal[2] I have called this "command=continue"... but at this
point it is bikeshedding.  I think "command=wait" (or "await" ;-))
might be better.

>>
>>  [Some time may pass if nothing is ready. But eventually we get...]
>>  git< status=success

Or

    git< status=resumed

If it would not be undue burden on the filter driver process, we might
require for it to say where to continue at (in bytes), e.g.

    git< from=16426

That should, of course, go below index/pathname line.

>>  git< index=0

Or a filter driver could have used pathname as an index, that is

    git< pathname=path/testfile.dat

>>  git< 0000
>>  git< CONTENT
>>  git< 0000
>>
>> From Git's side, the loop is something like:
>>
>>  while (delayed_items > 0) {
>> 	/* issue a wait, and get back the status/index pair */
>> 	status = send_wait(&index);
>> 	delayed_items--;

This looks like my 'event loop' proposal[1][2], see below.

>>
>> 	/*
>> 	 * use "index" to find the right item in our list of files;
>> 	 * the format can be opaque to the filter, so we could index
>> 	 * it however we like. But probably numeric indices in an array
>> 	 * are the simplest.
>> 	 */
>> 	assert(index > 0 && index < nr_items);
>> 	item[index].status = status;
>> 	if (status == SUCCESS)
>> 		read_content(&item[index]);
>>  }
>>
>> and the filter side just attaches the "index" string to whatever its
>> internal queue structure is, and feeds it back verbatim when processing
>> that item finishes.

I have wrote about this for v1 version of this patch series:

[1]: https://public-inbox.org/git/ec8078ef-8ff2-d26f-ef73-5ef612737eee@gmail.com/
[2]: https://public-inbox.org/git/17fa31a5-8689-2766-952b-704f433a5b3a@gmail.com/

> 
> That could work! I had something like that in mind:
> 
> I teach Git a new command "list_completed" or similar. The filter
> blocks this call until at least one item is ready for Git. 
> Then the filter responds with a list of paths that identify the
> "ready items". Then Git asks for these ready items just with the
> path and not with any content. Could that work? Wouldn't the path
> be "unique" to identify a blob per filter run?

Why in the "drain" phase it is still Git that needs to ask filter for
contents, one file after another?  Wouldn't it be easier and simpler
for filter to finish sending contents, and send signal that it has
finished continue'ing?

To summarize my earlier emails, current proposal looks for me as if
it were a "busy loop" solution, that is[2]:

          while there are delayed paths:
                  for each delayed path:
                          request path from filter [a]
                          fetch the thing (supporting "delayed") [b]
                          if path done
                                  do the caller's thing to use buf
                                  remove it from delayed paths list


Footnotes:
----------
a) We don't send the Git-side contents of blob again, isn't it?
   So we need some protocol extension / new understanding anyway.
   for example that we don't send contents if we request path again.
b) If path is not ready at all, filter protocol would send status=delayed
   with empty contents.  This means that we would immediately go to the
   next path, if there is one.

There are some cases where busy loop is preferable, but I don't think
it is the case here.


The "event loop" solution, which is I think what Peff proposed, would
be something like this (from the protocol point of view, rather than
from the Git code point of view)[1]:

        while there are delayed paths:
                get path that is ready from filter
                fetch the thing to buf (supporting "delayed")
                if path done
                        do the caller's thing to use buf 
                        (e.g. finish checkout path, eof convert, etc.)

We can either trust filter process to tell us when it finished sending
delayed paths, or keep list of paths that are being delayed in Git.


Also, one thing that we need to be solved, assuming that the proposed
extension allows to send partial data from filter to be delayed and
continued later, is that Git needs to keep this partial response in buf;
this is because of precedence of gitattributes applying:

  Interaction between checkin/checkout attributes
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  In the check-in codepath, the worktree file is first converted with
  `filter` driver (if specified and corresponding driver defined), then
  the result is processed with `ident` (if specified), and then finally
  with `text` (again, if specified and applicable).

  In the check-out codepath, the blob content is first converted with
  `text`, and then `ident` and fed to `filter`.

Or maybe I misunderstood something; I am a bit confused by check-in
vs check-out there...


Note that with support of "command=wait" / "command=continue" Git
could interrupt sending contents, and drain unfinished files, if
it needs it... then continue sending rest of files.  Well, if the
protocol extension is interpreted to allow for this.

Best regards,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-27 10:53     ` Jeff King
@ 2017-04-09 18:17       ` Lars Schneider
  0 siblings, 0 replies; 7+ messages in thread
From: Lars Schneider @ 2017-04-09 18:17 UTC (permalink / raw)
  To: Jeff King
  Cc: Git List, Junio C Hamano, Torsten Bögershausen, Eric Wong,
	Jakub Narębski, Taylor Blau


> On 27 Feb 2017, at 11:53, Jeff King <peff@peff.net> wrote:
> 
> On Mon, Feb 27, 2017 at 11:32:47AM +0100, Lars Schneider wrote:
> 
>> ...
> 
>>> From Git's side, the loop is something like:
>>> 
>>> while (delayed_items > 0) {
>>> 	/* issue a wait, and get back the status/index pair */
>>> 	status = send_wait(&index);
>>> 	delayed_items--;
>>> 
>>> 	/*
>>> 	 * use "index" to find the right item in our list of files;
>>> 	 * the format can be opaque to the filter, so we could index
>>> 	 * it however we like. But probably numeric indices in an array
>>> 	 * are the simplest.
>>> 	 */
>>> 	assert(index > 0 && index < nr_items);
>>> 	item[index].status = status;
>>> 	if (status == SUCCESS)
>>> 		read_content(&item[index]);
>>> }
>>> 
>>> and the filter side just attaches the "index" string to whatever its
>>> internal queue structure is, and feeds it back verbatim when processing
>>> that item finishes.
>> 
>> That could work! I had something like that in mind:
>> 
>> I teach Git a new command "list_completed" or similar. The filter
>> blocks this call until at least one item is ready for Git. 
>> Then the filter responds with a list of paths that identify the
>> "ready items". Then Git asks for these ready items just with the
>> path and not with any content. Could that work? Wouldn't the path
>> be "unique" to identify a blob per filter run?
> 
> I think that could work, though I think there are few minor downsides
> compared to what I wrote above:
> 
>  - if you respond with "these items are ready", and then make Git ask
>    for each again, it's an extra round-trip for each set of ready
>    items. You could just say "an item is ready; here it is" in a single
>    response. For a local pipe the latency is probably negligible,
>    though.

It is true that the extra round-trip is not strictly necessary but I think
it simplifies the protocol/the code as I can reuse the convert machinery 
as is.


>  - using paths as the index would probably work, but it means Git has
>    to use the path to find the "struct checkout_entry" again. Which
>    might mean a hashmap (though if you have them all in a sorted list,
>    I guess you could also do a binary search).

Agreed. I changed my implementation to use an index following your
suggestion.

>  - Using an explicit index communicates to the filter not only what the
>    index is, but also that Git is prepared to accept a delayed response
>    for the item. For backwards compatibility, the filter would probably
>    advertise "I have the 'delayed' capability", and then Git could
>    choose to use it or not on a per-item basis. Realistically it would
>    not change from item to item, but rather operation to operation. So
>    that means we can easily convert the call-sites in Git to the async
>    approach incrementally. As each one is converted, it turns on the
>    flag that causes the filter code to send the "index" tag.

Agreed. I change the implementation accordingly and I will send out the
patches shortly.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] convert: add "status=delayed" to filter process protocol
  2017-02-27 22:11     ` Jakub Narębski
@ 2017-04-09 18:41       ` Lars Schneider
  0 siblings, 0 replies; 7+ messages in thread
From: Lars Schneider @ 2017-04-09 18:41 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Jeff King, Git List, Junio C Hamano, Torsten Bögershausen,
	Eric Wong, Taylor Blau


> On 27 Feb 2017, at 23:11, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 27.02.2017 o 11:32, Lars Schneider pisze:
>> 
>>> On 27 Feb 2017, at 10:58, Jeff King <peff@peff.net> wrote:
>>> 
>>> On Sun, Feb 26, 2017 at 07:48:16PM +0100, Lars Schneider wrote:
>>> 
>>>> +If the request cannot be fulfilled within a reasonable amount of time
>>>> +then the filter can respond with a "delayed" status and a flush packet.
>>>> +Git will perform the same request at a later point in time, again. The
>>>> +filter can delay a response multiple times for a single request.
>>>> +------------------------
>>>> +packet:          git< status=delayed
>>>> +packet:          git< 0000
>>>> +------------------------
> 
> Is it something that happens instead of filter process sending the contents

Correct! I'll clarify this in v3!


>> 
>> I completely agree - I need to change that. However, the goal of the v2
>> iteration was to get the "convert" interface in an acceptable state.
>> That's what I intended to say in the patch comment section:
>> 
>>    "Please ignore all changes behind async_convert_to_working_tree() and 
>>     async_filter_finish() for now as I plan to change the implementation 
>>     as soon as the interface is in an acceptable state."
> 
> I think that it is more important to start with a good abstraction,
> and the proposal for protocol, rather than getting bogged down in
> implementation details that may change as the idea for protocol
> extension changes.

I'll send out v3 shortly as proposal for a complete solution.


>>> I think it would be much more efficient to do something like:
>>> 
>>> [Git issues a request and gives it an opaque index id]
>>> git> command=smudge
>>> git> pathname=foo
>>> git> index=0
>>> git> 0000
>>> git> CONTENT
>>> git> 0000
>>> 
>>> [The data isn't ready yet, so the filter tells us so...]
>>> git< status=delayed
>>> git< 0000
> 
> So is it only as replacement for "status=success" + contents or
> "status=abort", that is upfront before sending any part of the file?

Yes.


> Or, as one can assume from the point of the paragraph with the
> "status=delayed", it is about replacing null list for success or
> "status=error" after sending some part (maybe empty) of a file,
> that is:

No. As this would complicate things I don't want to support it. 
(and I clarified that in the docs in v3).


> If it would not be undue burden on the filter driver process, we might
> require for it to say where to continue at (in bytes), e.g.
> 
>    git< from=16426
> 
> That should, of course, go below index/pathname line.

This would make the protocol even more complicated. That's why I don't
want to support splitting the response.


>>> git< index=0
> 
> Or a filter driver could have used pathname as an index, that is
> 
>    git< pathname=path/testfile.dat

In v3 I've used an index to help Git finding the right cache entry
quickly.


> 
>>> git< 0000
>>> git< CONTENT
>>> git< 0000
>>> 
>>> From Git's side, the loop is something like:
>>> 
>>> while (delayed_items > 0) {
>>> 	/* issue a wait, and get back the status/index pair */
>>> 	status = send_wait(&index);
>>> 	delayed_items--;
> 
> This looks like my 'event loop' proposal[1][2], see below.

I implemented something similar in v3.


>> That could work! I had something like that in mind:
>> 
>> I teach Git a new command "list_completed" or similar. The filter
>> blocks this call until at least one item is ready for Git. 
>> Then the filter responds with a list of paths that identify the
>> "ready items". Then Git asks for these ready items just with the
>> path and not with any content. Could that work? Wouldn't the path
>> be "unique" to identify a blob per filter run?
> 
> Why in the "drain" phase it is still Git that needs to ask filter for
> contents, one file after another?  Wouldn't it be easier and simpler
> for filter to finish sending contents, and send signal that it has
> finished continue'ing?
> 
> To summarize my earlier emails, current proposal looks for me as if
> it were a "busy loop" solution, that is[2]:

In v3 the implementation still uses kind of a busy loop (I expect the
filter to block if there nothing ready, yet). An event loop would
complicate the protocol as the filter would need to initiate an action.
Right now only Git initiates actions.


> Footnotes:
> ----------
> a) We don't send the Git-side contents of blob again, isn't it?
>   So we need some protocol extension / new understanding anyway.
>   for example that we don't send contents if we request path again.
Correct - v3 doesn't send the content again.


> Also, one thing that we need to be solved, assuming that the proposed
> extension allows to send partial data from filter to be delayed and
> continued later, is that Git needs to keep this partial response in buf;
> this is because of precedence of gitattributes applying:

As mentioned above I don't want to support partial data as this
complicates things and is of no use for my Git LFS problem case.

Thanks,
Lars


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-04-09 18:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-26 18:48 [PATCH v2] convert: add "status=delayed" to filter process protocol Lars Schneider
2017-02-27  9:58 ` Jeff King
2017-02-27 10:32   ` Lars Schneider
2017-02-27 10:53     ` Jeff King
2017-04-09 18:17       ` Lars Schneider
2017-02-27 22:11     ` Jakub Narębski
2017-04-09 18:41       ` Lars Schneider

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).