git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v1 0/3] Git filter protocol
@ 2016-07-22 15:48 larsxschneider
  2016-07-22 15:48 ` [PATCH v1 1/3] convert: quote filter names in error messages larsxschneider
                   ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-22 15:48 UTC (permalink / raw)
  To: git; +Cc: peff, jnareb, tboegi, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time. This patch series
addresses this problem.

The first two patches are cleanup patches which are not really necessary
for the feature. The third patch is the relevant one :-)

You will notice that I do not check the exact number of "clean" filter
invocations in the tests. The reason is that Git calls "clean" multiple
times (up to 4 times for the same blob!). I posted a patch to demonstrate
the problem and I would prefer to tackle the problem in a seperate patch
series: http://thread.gmane.org/gmane.comp.version-control.git/300028/

The main reason for this Git core patch is to speed up Git large file
extensions such as git-annex or Git LFS. I proposed an according Git LFS
patch here: https://github.com/github/git-lfs/pull/1382

In addition to the Git core tests, all Git LFS integration tests run clean
on my machine.

Thanks,
Lars


Lars Schneider (3):
  convert: quote filter names in error messages
  convert: modernize tests
  convert: add filter.<driver>.useProtocol option

 Documentation/gitattributes.txt |  41 ++++++-
 convert.c                       | 222 +++++++++++++++++++++++++++++++++++---
 t/t0021-conversion.sh           | 232 ++++++++++++++++++++++++++++++++++------
 t/t0021/rot13.pl                |  80 ++++++++++++++
 4 files changed, 531 insertions(+), 44 deletions(-)
 create mode 100755 t/t0021/rot13.pl

--
2.9.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v1 1/3] convert: quote filter names in error messages
  2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
@ 2016-07-22 15:48 ` larsxschneider
  2016-07-22 15:48 ` [PATCH v1 2/3] convert: modernize tests larsxschneider
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-22 15:48 UTC (permalink / raw)
  To: git; +Cc: peff, jnareb, tboegi, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter with spaces (e.g. `filter.sh foo`) are hard to read in
error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b1614bf..522e2c5 100644
--- a/convert.c
+++ b/convert.c
@@ -397,7 +397,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -415,13 +415,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -462,15 +462,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v1 2/3] convert: modernize tests
  2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
  2016-07-22 15:48 ` [PATCH v1 1/3] convert: quote filter names in error messages larsxschneider
@ 2016-07-22 15:48 ` larsxschneider
  2016-07-26 15:18   ` Remi Galan Alfonso
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 77+ messages in thread
From: larsxschneider @ 2016-07-22 15:48 UTC (permalink / raw)
  To: git; +Cc: peff, jnareb, tboegi, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">".

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..a05a8d2 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -13,8 +13,8 @@ EOF
 chmod +x rot13.sh
 
 test_expect_success setup '
-	git config filter.rot13.smudge ./rot13.sh &&
-	git config filter.rot13.clean ./rot13.sh &&
+	test_config filter.rot13.smudge ./rot13.sh &&
+	test_config filter.rot13.clean ./rot13.sh &&
 
 	{
 	    echo "*.t filter=rot13"
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh < test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
  2016-07-22 15:48 ` [PATCH v1 1/3] convert: quote filter names in error messages larsxschneider
  2016-07-22 15:48 ` [PATCH v1 2/3] convert: modernize tests larsxschneider
@ 2016-07-22 15:49 ` larsxschneider
  2016-07-22 22:32   ` Torsten Bögershausen
                     ` (3 more replies)
  2016-07-22 21:39 ` [PATCH v1 0/3] Git filter protocol Junio C Hamano
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
  4 siblings, 4 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-22 15:49 UTC (permalink / raw)
  To: git; +Cc: peff, jnareb, tboegi, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

This patch adds the filter.<driver>.useProtocol option which, if enabled,
keeps the external filter process running and processes all blobs with
the following protocol over stdin/stdout.

1. Git starts the filter on first usage and expects a welcome message
with protocol version number:
	Git <-- Filter: "git-filter-protocol\n"
	Git <-- Filter: "version 1"

2. Git sends the command (either "smudge" or "clean"), the filename, the
content size in bytes, and the content separated by a newline character:
	Git --> Filter: "smudge\n"
	Git --> Filter: "testfile.dat\n"
	Git --> Filter: "7\n"
	Git --> Filter: "CONTENT"

3. The filter is expected to answer with the result content size in
bytes and the result content separated by a newline character:
	Git <-- Filter: "15\n"
	Git <-- Filter: "SMUDGED_CONTENT"

4. The filter is expected to wait for the next file in step 2, again.

Please note that the protocol filters do not support stream processing
with this implemenatation because the filter needs to know the length of
the result in advance. A protocol version 2 could address this in a
future patch.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Helped-by: Martin-Louis Bright <mlbright@gmail.com>
---
 Documentation/gitattributes.txt |  41 +++++++-
 convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
 t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
 t/t0021/rot13.pl                |  80 +++++++++++++++
 4 files changed, 494 insertions(+), 7 deletions(-)
 create mode 100755 t/t0021/rot13.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..7026d62 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate. If the setting filter.<driver>.useProtocol is
+enabled then Git can process all blobs with a single filter command
+invocation (see filter protocol below).
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +378,42 @@ substitution.  For example:
 ------------------------
 
 
+Filter Protocol
+^^^^^^^^^^^^^^^
+
+If the setting filter.<driver>.useProtocol is enabled then Git
+can process all blobs with a single filter command invocation
+by talking with the following protocol over stdin/stdout.
+
+Git starts the filter on first usage and expects a welcome
+message with protocol version number:
+------------------------
+Git < Filter: "git-filter-protocol\n"
+Git < Filter: "version 1"
+------------------------
+
+Afterwards Git sends a blob command (either "smudge" or "clean"),
+the filename, the content size in bytes, and the content separated
+by a newline character:
+------------------------
+Git > Filter: "smudge\n"
+Git > Filter: "testfile.dat\n"
+Git > Filter: "7\n"
+Git > Filter: "CONTENT"
+------------------------
+
+The filter is expected to answer with the result content size in
+bytes and the result content separated by a newline character:
+------------------------
+Git < Filter: "15\n"
+Git < Filter: "SMUDGED_CONTENT"
+------------------------
+
+Afterwards the filter is expected to wait for the next blob process
+command. A demo implementation can be found in `t/t0021/rot13.pl`
+located in the Git core repository.
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/convert.c b/convert.c
index 522e2c5..91ce86f 100644
--- a/convert.c
+++ b/convert.c
@@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return ret;
 }
 
+static int cmd_process_map_init = 0;
+static struct hashmap cmd_process_map;
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	const char *cmd;
+	long protocol;
+	struct child_process process;
+};
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+							const struct cmd2process *e2,
+							const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_protocol_filter_entry(const char *cmd)
+{
+	struct cmd2process k;
+	hashmap_entry_init(&k, strhash(cmd));
+	k.cmd = cmd;
+	return hashmap_get(&cmd_process_map, &k, NULL);
+}
+
+static void stop_protocol_filter(struct cmd2process *entry) {
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	child_process_clear(&entry->process);
+	hashmap_remove(&cmd_process_map, entry, NULL);
+	free(entry);
+}
+
+static struct cmd2process *start_protocol_filter(const char *cmd)
+{
+	int ret = 1;
+	struct cmd2process *entry = NULL;
+	struct child_process *process = NULL;
+	struct strbuf nbuf = STRBUF_INIT;
+	struct string_list split = STRING_LIST_INIT_NODUP;
+	const char *argv[] = { NULL, NULL };
+	const char *header = "git-filter-protocol\nversion";
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	process = &entry->process;
+
+	child_process_init(process);
+	argv[0] = cmd;
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+
+	if (start_command(process)) {
+		error("cannot fork to run external persistent filter '%s'", cmd);
+		return NULL;
+	}
+	strbuf_reset(&nbuf);
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
+	sigchain_pop(SIGPIPE);
+
+	strbuf_stripspace(&nbuf, 0);
+	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
+	ret &= split.nr > 1;
+	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
+	if (ret) {
+		entry->protocol = strtol(split.items[1].string, NULL, 10);
+		switch (entry->protocol) {
+			case 1:
+				break;
+			default:
+				ret = 0;
+				error("unsupported protocol version %s for external persistent filter '%s'",
+					nbuf.buf, cmd);
+		}
+	}
+	string_list_clear(&split, 0);
+	strbuf_release(&nbuf);
+
+	if (!ret) {
+		error("initialization for external persistent filter '%s' failed", cmd);
+		return NULL;
+	}
+
+	hashmap_add(&cmd_process_map, entry);
+	return entry;
+}
+
+static int apply_protocol_filter(const char *path, const char *src, size_t len,
+						int fd, struct strbuf *dst, const char *cmd,
+						const char *filter_type)
+{
+	int ret = 1;
+	struct cmd2process *entry = NULL;
+	struct child_process *process = NULL;
+	struct stat fileStat;
+	struct strbuf nbuf = STRBUF_INIT;
+	size_t nbuf_len;
+	char *strtol_end;
+	char c;
+
+	if (!cmd || !*cmd)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (!cmd_process_map_init) {
+		cmd_process_map_init = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+	} else {
+		entry = find_protocol_filter_entry(cmd);
+	}
+
+	if (!entry){
+		entry = start_protocol_filter(cmd);
+		if (!entry) {
+			stop_protocol_filter(entry);
+			return 0;
+		}
+	}
+	process = &entry->process;
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+	switch (entry->protocol) {
+		case 1:
+			if (fd >= 0 && !src) {
+				ret &= fstat(fd, &fileStat) != -1;
+				len = fileStat.st_size;
+			}
+			strbuf_reset(&nbuf);
+			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
+			ret &= write_str_in_full(process->in, nbuf.buf) > 1;
+			if (len > 0) {
+				if (src)
+					ret &= write_in_full(process->in, src, len) == len;
+				else if (fd >= 0)
+					ret &= copy_fd(fd, process->in) == 0;
+				else
+					ret &= 0;
+			}
+
+			strbuf_reset(&nbuf);
+			while (xread(process->out, &c, 1) == 1 && c != '\n')
+				strbuf_addchars(&nbuf, c, 1);
+			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
+			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
+			strbuf_reset(&nbuf);
+			if (nbuf_len > 0)
+				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;
+			break;
+		default:
+			ret = 0;
+	}
+	sigchain_pop(SIGPIPE);
+
+	if (ret) {
+		strbuf_swap(dst, &nbuf);
+	} else {
+		// Something went wrong with the protocol filter. Force shutdown!
+		stop_protocol_filter(entry);
+	}
+	strbuf_release(&nbuf);
+	return ret;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
 	int required;
+	int use_protocol;
 } *user_convert, **user_convert_tail;
 
 static int read_convert_config(const char *var, const char *value, void *cb)
@@ -526,6 +702,11 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("useprotocol", key)) {
+		drv->use_protocol = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -823,7 +1004,10 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	if (ca.drv->use_protocol)
+		return apply_protocol_filter(path, NULL, 0, -1, NULL, ca.drv->clean, "clean");
+	else
+		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -857,16 +1041,20 @@ int convert_to_git(const char *path, const char *src, size_t len,
 {
 	int ret = 0;
 	const char *filter = NULL;
-	int required = 0;
+	int required = 0, use_protocol = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
 		filter = ca.drv->clean;
 		required = ca.drv->required;
+		use_protocol = ca.drv->use_protocol;
 	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
+	if (use_protocol)
+		ret |= apply_protocol_filter(path, src, len, -1, dst, filter, "clean");
+	else
+		ret |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -885,13 +1073,19 @@ int convert_to_git(const char *path, const char *src, size_t len,
 void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 			      enum safe_crlf checksafe)
 {
+	int ret = 0;
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
 	assert(ca.drv->clean);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (ca.drv->use_protocol)
+		ret = apply_protocol_filter(path, NULL, 0, fd, dst, ca.drv->clean, "clean");
+	else
+		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
+
+	if (!ret)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -904,13 +1098,14 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 {
 	int ret = 0, ret_filter = 0;
 	const char *filter = NULL;
-	int required = 0;
+	int required = 0, use_protocol = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
 		filter = ca.drv->smudge;
 		required = ca.drv->required;
+		use_protocol = ca.drv->use_protocol;
 	}
 
 	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
@@ -930,7 +1125,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
+	if (use_protocol)
+		ret_filter = apply_protocol_filter(path, src, len, -1, dst, filter, "smudge");
+	else
+		ret_filter |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret_filter && required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index a05a8d2..d9077ea 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -268,4 +268,174 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+test_expect_success 'required protocol filter should filter data' '
+	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.useprotocol true &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		echo "test333" >test3.r &&
+
+		rm -f output.log &&
+		git add . &&
+		sort output.log | uniq -c | sed "s/^[ ]*//" >uniq_output.log &&
+		cat >expected_add.log <<-\EOF &&
+			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			1 IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
+			1 start
+			1 wrote version
+		EOF
+		test_cmp expected_add.log uniq_output.log &&
+
+		printf "" >output.log &&
+		git commit . -m "test commit" &&
+		sort output.log | uniq -c | sed "s/^[ ]*//" |
+			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq_output.log &&
+		cat >expected_commit.log <<-\EOF &&
+			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			x IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
+			1 start
+			1 wrote version
+		EOF
+		test_cmp expected_commit.log uniq_output.log &&
+
+		printf "" >output.log &&
+		rm -f test?.r &&
+		git checkout . &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout.log <<-\EOF &&
+			start
+			wrote version
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout.log smudge_output.log &&
+
+		git checkout empty &&
+
+		printf "" >output.log &&
+		git checkout master &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote version
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge_output.log
+	)
+'
+
+test_expect_success EXPENSIVE 'protocol filter large file' '
+	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "2GB filter=largefile" >.gitattributes &&
+		for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
+		git add 2GB 2>err &&
+		test_must_be_empty err &&
+		rm -f 2GB &&
+		git checkout -- 2GB 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'required protocol filter should fail with clean' '
+	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.useprotocol true &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "test333" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success 'protocol filter should restart after failure' '
+	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
+	test_config_global filter.protocol.useprotocol true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "1234567" >test2.o &&
+		cat test2.o >test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		printf "" >output.log &&
+		git checkout . &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote version
+			IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [FAIL]
+			start
+			wrote version
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge_output.log &&
+
+		test_cmp ../test.o test.r &&
+		./../rot13.sh <../test.o >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual
+
+		test_cmp test2.o test2.r &&
+		./../rot13.sh <test2.o >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual
+
+		test_cmp test2.o test2.r &&
+		./../rot13.sh <test2.o >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
new file mode 100755
index 0000000..f2d7a03
--- /dev/null
+++ b/t/t0021/rot13.pl
@@ -0,0 +1,80 @@
+#!/usr/bin/env perl
+#
+# Example implementation for the Git filter protocol version 1
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+
+use strict;
+use warnings;
+use autodie;
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+$| = 1; # autoflush STDOUT
+
+open my $debug, ">>", "output.log";
+$debug->autoflush(1);
+
+print $debug "start\n";
+
+print STDOUT "git-filter-protocol\nversion 1";
+print $debug "wrote version\n";
+
+while (1) {
+    my $command = <STDIN>;
+    unless (defined($command)) {
+        exit();
+    }
+    chomp $command;
+    print $debug "IN: $command";
+    my $filename = <STDIN>;
+    chomp $filename;
+    print $debug " $filename";
+    my $filelen  = <STDIN>;
+    chomp $filelen;
+    print $debug " $filelen";
+
+    $filelen = int($filelen);
+    my $output;
+
+    if ( $filelen > 0 ) {
+        my $input;
+        {
+            binmode(STDIN);
+            my $bytes_read = 0;
+            $bytes_read = read STDIN, $input, $filelen;
+            if ( $bytes_read != $filelen ) {
+                die "not enough to read";
+            }
+            print $debug " [OK] -- ";
+        }
+
+        if ( $command eq "clean") {
+            $output = rot13($input);
+        }
+        elsif ( $command eq "smudge" ) {
+            $output = rot13($input);
+        }
+        else {
+            die "bad command\n";
+        }
+    }
+
+    my $output_len = length($output);
+    print STDOUT "$output_len\n";
+    print $debug "OUT: $output_len";
+    if ( $output_len > 0 ) {
+        if ( ($command eq "clean" and $filename eq "clean-write-fail.r") or
+             ($command eq "smudge" and $filename eq "smudge-write-fail.r") ) {
+            print STDOUT "fail";
+            print $debug " [FAIL]\n"
+        } else {
+            print STDOUT $output;
+            print $debug " [OK]\n";
+        }
+    }
+}
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 0/3] Git filter protocol
  2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
                   ` (2 preceding siblings ...)
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
@ 2016-07-22 21:39 ` Junio C Hamano
  2016-07-24 11:24   ` Lars Schneider
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
  4 siblings, 1 reply; 77+ messages in thread
From: Junio C Hamano @ 2016-07-22 21:39 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, jnareb, tboegi

larsxschneider@gmail.com writes:

> The first two patches are cleanup patches which are not really necessary
> for the feature.

These two looked trivially good.

I think I can agree with what 3/3 wants to do in principle, but

 * "protocol" is not quite the right word.  The current way to
   interact with clean and smudge filters can be considered using a
   different "protocol", that conveys the data and the options via
   the command line and pipe.  The most distinguishing feature that
   differentiates the old way and the new style this change allows
   is that it allows you to have a single instance of the process
   running that can be reused?

 * I am not sure what's the pros-and-cons in forcing people writing
   a single program that can do both cleaning and smudging.  You
   cannot have only "smudge" side that uses the long-running process
   while "clean" side that runs single-shot invocation with this
   design, which I'd imagine would be a downside.  If you are going
   to use a long-running process interface for both sides, this
   design allows you to do it with fewer number of processes, which
   may be an upside.

 * The way the serialized access to these long-running processes
   work in 3/3 would make it harder or impossible to later
   parallelize conversion?  I am imagining a far future where we
   would run "git checkout ." using (say) two threads, one
   responsible for active_cache[0..active_nr/2] and the other
   responsible for the remainder.

> You will notice that I do not check the exact number of "clean" filter
> invocations in the tests.

That is a good thing to do.  You shouldn't really care for the
proper operation of the feature, reducing the number of them would
be an independent topic (see the work of Peff earlier today), and we
may even find a need to make _more_ calls for correctness (again,
see the work of Peff earlier today -- to a person who wants to keep
the number of requests to the attribute system low, the change may
look like a regression, but it is necessary for the overall system;
you may find a similar need to running "clean" more for some need of
the overall system that you do not anticipate right now).


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
@ 2016-07-22 22:32   ` Torsten Bögershausen
  2016-07-24 12:09     ` Lars Schneider
  2016-07-22 23:19   ` Ramsay Jones
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 77+ messages in thread
From: Torsten Bögershausen @ 2016-07-22 22:32 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, jnareb



On 07/22/2016 05:49 PM, larsxschneider@gmail.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.
>
> This patch adds the filter.<driver>.useProtocol option which, if enabled,
> keeps the external filter process running and processes all blobs with
> the following protocol over stdin/stdout.
>
> 1. Git starts the filter on first usage and expects a welcome message
> with protocol version number:
> 	Git <-- Filter: "git-filter-protocol\n"
> 	Git <-- Filter: "version 1"
Is there no terminator here ?
How long should the reading side wait without a '\n', or should it read
"version 1\n" ?

>
> 2. Git sends the command (either "smudge" or "clean"), the filename, the
> content size in bytes, and the content separated by a newline character:
> 	Git --> Filter: "smudge\n"
> 	Git --> Filter: "testfile.dat\n"
> 	Git --> Filter: "7\n"
> 	Git --> Filter: "CONTENT"
>
> 3. The filter is expected to answer with the result content size in
> bytes and the result content separated by a newline character:
> 	Git <-- Filter: "15\n"
> 	Git <-- Filter: "SMUDGED_CONTENT"
>
> 4. The filter is expected to wait for the next file in step 2, again.
>
> Please note that the protocol filters do not support stream processing
> with this implemenatation because the filter needs to know the length of
             ^^^^^^^^^^^^^^^^typo
> the result in advance. A protocol version 2 could address this in a
> future patch.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
> ---
>  Documentation/gitattributes.txt |  41 +++++++-
>  convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>  t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
>  t/t0021/rot13.pl                |  80 +++++++++++++++
>  4 files changed, 494 insertions(+), 7 deletions(-)
>  create mode 100755 t/t0021/rot13.pl
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 8882a3e..7026d62 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If the setting filter.<driver>.useProtocol is
> +enabled then Git can process all blobs with a single filter command
> +invocation (see filter protocol below).
>
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -375,6 +378,42 @@ substitution.  For example:
>  ------------------------
>
>
> +Filter Protocol
> +^^^^^^^^^^^^^^^
> +
> +If the setting filter.<driver>.useProtocol is enabled then Git
> +can process all blobs with a single filter command invocation
> +by talking with the following protocol over stdin/stdout.
> +
> +Git starts the filter on first usage and expects a welcome
> +message with protocol version number:
> +------------------------
> +Git < Filter: "git-filter-protocol\n"
> +Git < Filter: "version 1"
> +------------------------
> +
> +Afterwards Git sends a blob command (either "smudge" or "clean"),
> +the filename, the content size in bytes, and the content separated
> +by a newline character:
> +------------------------
> +Git > Filter: "smudge\n"
> +Git > Filter: "testfile.dat\n"
> +Git > Filter: "7\n"
> +Git > Filter: "CONTENT"
> +------------------------
> +
> +The filter is expected to answer with the result content size in
> +bytes and the result content separated by a newline character:
> +------------------------
> +Git < Filter: "15\n"
> +Git < Filter: "SMUDGED_CONTENT"
> +------------------------
> +
> +Afterwards the filter is expected to wait for the next blob process
> +command. A demo implementation can be found in `t/t0021/rot13.pl`
> +located in the Git core repository.
> +
> +
>  Interaction between checkin/checkout attributes
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> diff --git a/convert.c b/convert.c
> index 522e2c5..91ce86f 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return ret;
>  }
>
> +static int cmd_process_map_init = 0;
> +static struct hashmap cmd_process_map;
> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	const char *cmd;
> +	long protocol;
> +	struct child_process process;
> +};
> +
> +static int cmd2process_cmp(const struct cmd2process *e1,
> +							const struct cmd2process *e2,
> +							const void *unused)
> +{
> +	return strcmp(e1->cmd, e2->cmd);
> +}
> +
> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
> +{
> +	struct cmd2process k;
> +	hashmap_entry_init(&k, strhash(cmd));
> +	k.cmd = cmd;
> +	return hashmap_get(&cmd_process_map, &k, NULL);
> +}
> +
> +static void stop_protocol_filter(struct cmd2process *entry) {
> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);
> +	child_process_clear(&entry->process);
> +	hashmap_remove(&cmd_process_map, entry, NULL);
> +	free(entry);
> +}
> +
> +static struct cmd2process *start_protocol_filter(const char *cmd)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	struct string_list split = STRING_LIST_INIT_NODUP;
> +	const char *argv[] = { NULL, NULL };
> +	const char *header = "git-filter-protocol\nversion";
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	process = &entry->process;
> +
> +	child_process_init(process);
> +	argv[0] = cmd;
> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;
> +	process->out = -1;
> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external persistent filter '%s'", cmd);
> +		return NULL;
> +	}
> +	strbuf_reset(&nbuf);
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
> +	sigchain_pop(SIGPIPE);
> +
> +	strbuf_stripspace(&nbuf, 0);
> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
> +	ret &= split.nr > 1;
> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
> +	if (ret) {
> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
> +		switch (entry->protocol) {
> +			case 1:
> +				break;
> +			default:
> +				ret = 0;
> +				error("unsupported protocol version %s for external persistent filter '%s'",
> +					nbuf.buf, cmd);
> +		}
> +	}
> +	string_list_clear(&split, 0);
> +	strbuf_release(&nbuf);
> +
> +	if (!ret) {
> +		error("initialization for external persistent filter '%s' failed", cmd);
> +		return NULL;
> +	}
> +
> +	hashmap_add(&cmd_process_map, entry);
> +	return entry;
> +}
> +
> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const char *filter_type)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;
> +	struct stat fileStat;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	size_t nbuf_len;
> +	char *strtol_end;
> +	char c;
> +
> +	if (!cmd || !*cmd)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;
> +
> +	if (!cmd_process_map_init) {
> +		cmd_process_map_init = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +	} else {
> +		entry = find_protocol_filter_entry(cmd);
> +	}
> +
> +	if (!entry){
> +		entry = start_protocol_filter(cmd);
> +		if (!entry) {
> +			stop_protocol_filter(entry);
> +			return 0;
> +		}
> +	}
> +	process = &entry->process;
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	switch (entry->protocol) {
> +		case 1:
> +			if (fd >= 0 && !src) {
> +				ret &= fstat(fd, &fileStat) != -1;
> +				len = fileStat.st_size;
> +			}
> +			strbuf_reset(&nbuf);
> +			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
> +			ret &= write_str_in_full(process->in, nbuf.buf) > 1;
> +			if (len > 0) {
> +				if (src)
> +					ret &= write_in_full(process->in, src, len) == len;
> +				else if (fd >= 0)
> +					ret &= copy_fd(fd, process->in) == 0;
> +				else
> +					ret &= 0;
> +			}
> +
> +			strbuf_reset(&nbuf);
> +			while (xread(process->out, &c, 1) == 1 && c != '\n')
> +				strbuf_addchars(&nbuf, c, 1);
> +			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
> +			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
> +			strbuf_reset(&nbuf);
> +			if (nbuf_len > 0)
> +				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;
> +			break;
> +		default:
> +			ret = 0;
> +	}
> +	sigchain_pop(SIGPIPE);
> +
> +	if (ret) {
> +		strbuf_swap(dst, &nbuf);
> +	} else {
> +		// Something went wrong with the protocol filter. Force shutdown!
> +		stop_protocol_filter(entry);
> +	}
> +	strbuf_release(&nbuf);
> +	return ret;
> +}
> +
>  static struct convert_driver {
>  	const char *name;
>  	struct convert_driver *next;
>  	const char *smudge;
>  	const char *clean;
>  	int required;
> +	int use_protocol;
>  } *user_convert, **user_convert_tail;
>
>  static int read_convert_config(const char *var, const char *value, void *cb)
> @@ -526,6 +702,11 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>  	if (!strcmp("clean", key))
>  		return git_config_string(&drv->clean, var, value);
>
> +	if (!strcmp("useprotocol", key)) {
> +		drv->use_protocol = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	if (!strcmp("required", key)) {
>  		drv->required = git_config_bool(var, value);
>  		return 0;
> @@ -823,7 +1004,10 @@ int would_convert_to_git_filter_fd(const char *path)
>  	if (!ca.drv->required)
>  		return 0;
>
> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> +	if (ca.drv->use_protocol)
> +		return apply_protocol_filter(path, NULL, 0, -1, NULL, ca.drv->clean, "clean");
> +	else
> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
>  }
>
>  const char *get_convert_attr_ascii(const char *path)
> @@ -857,16 +1041,20 @@ int convert_to_git(const char *path, const char *src, size_t len,
>  {
>  	int ret = 0;
>  	const char *filter = NULL;
> -	int required = 0;
> +	int required = 0, use_protocol = 0;
>  	struct conv_attrs ca;
>
>  	convert_attrs(&ca, path);
>  	if (ca.drv) {
>  		filter = ca.drv->clean;
>  		required = ca.drv->required;
> +		use_protocol = ca.drv->use_protocol;
>  	}
>
> -	ret |= apply_filter(path, src, len, -1, dst, filter);
> +	if (use_protocol)
> +		ret |= apply_protocol_filter(path, src, len, -1, dst, filter, "clean");
> +	else
> +		ret |= apply_filter(path, src, len, -1, dst, filter);
>  	if (!ret && required)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>
> @@ -885,13 +1073,19 @@ int convert_to_git(const char *path, const char *src, size_t len,
>  void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
>  			      enum safe_crlf checksafe)
>  {
> +	int ret = 0;
>  	struct conv_attrs ca;
>  	convert_attrs(&ca, path);
>
>  	assert(ca.drv);
>  	assert(ca.drv->clean);
>
> -	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
> +	if (ca.drv->use_protocol)
> +		ret = apply_protocol_filter(path, NULL, 0, fd, dst, ca.drv->clean, "clean");
> +	else
> +		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
> +
> +	if (!ret)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>
>  	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
> @@ -904,13 +1098,14 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  {
>  	int ret = 0, ret_filter = 0;
>  	const char *filter = NULL;
> -	int required = 0;
> +	int required = 0, use_protocol = 0;
>  	struct conv_attrs ca;
>
>  	convert_attrs(&ca, path);
>  	if (ca.drv) {
>  		filter = ca.drv->smudge;
>  		required = ca.drv->required;
> +		use_protocol = ca.drv->use_protocol;
>  	}
>
>  	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
> @@ -930,7 +1125,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  		}
>  	}
>
> -	ret_filter = apply_filter(path, src, len, -1, dst, filter);
> +	if (use_protocol)
> +		ret_filter = apply_protocol_filter(path, src, len, -1, dst, filter, "smudge");
> +	else
> +		ret_filter |= apply_filter(path, src, len, -1, dst, filter);
>  	if (!ret_filter && required)
>  		die("%s: smudge filter %s failed", path, ca.drv->name);
>
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index a05a8d2..d9077ea 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -268,4 +268,174 @@ test_expect_success 'disable filter with empty override' '
>  	test_must_be_empty err
>  '
>
> +test_expect_success 'required protocol filter should filter data' '
> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.useprotocol true &&
> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		git branch empty &&
> +
> +		cat ../test.o >test.r &&
> +		echo "test22" >test2.r &&
> +		echo "test333" >test3.r &&
> +
> +		rm -f output.log &&
> +		git add . &&
> +		sort output.log | uniq -c | sed "s/^[ ]*//" >uniq_output.log &&
> +		cat >expected_add.log <<-\EOF &&
> +			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
> +			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
> +			1 IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
> +			1 start
> +			1 wrote version
> +		EOF
> +		test_cmp expected_add.log uniq_output.log &&
> +
> +		printf "" >output.log &&
> +		git commit . -m "test commit" &&
> +		sort output.log | uniq -c | sed "s/^[ ]*//" |
> +			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq_output.log &&
> +		cat >expected_commit.log <<-\EOF &&
> +			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
> +			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
> +			x IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
> +			1 start
> +			1 wrote version
> +		EOF
> +		test_cmp expected_commit.log uniq_output.log &&
> +
> +		printf "" >output.log &&
> +		rm -f test?.r &&
> +		git checkout . &&
> +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
> +		cat >expected_checkout.log <<-\EOF &&
> +			start
> +			wrote version
> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
> +			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
> +		EOF
> +		test_cmp expected_checkout.log smudge_output.log &&
> +
> +		git checkout empty &&
> +
> +		printf "" >output.log &&
> +		git checkout master &&
> +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
> +		cat >expected_checkout_master.log <<-\EOF &&
> +			start
> +			wrote version
> +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
> +			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
> +		EOF
> +		test_cmp expected_checkout_master.log smudge_output.log
> +	)
> +'
> +
> +test_expect_success EXPENSIVE 'protocol filter large file' '
> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "2GB filter=largefile" >.gitattributes &&
> +		for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
Side question:
Is there a way to "re-use" the 2GB test file through t0021?
It takes a long time to produce it, especially on my 32 Bit systems ;-)
But this may be a different patch.

> +		git add 2GB 2>err &&
> +		test_must_be_empty err &&
> +		rm -f 2GB &&
> +		git checkout -- 2GB 2>err &&
> +		test_must_be_empty err
> +	)
> +'
> +
> +test_expect_success 'required protocol filter should fail with clean' '
> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.useprotocol true &&
> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cat ../test.o >test.r &&
> +		echo "this is going to fail" >clean-write-fail.r &&
> +		echo "test333" >test3.r &&
> +
> +		# Note: There are three clean paths in convert.c we just test one here.
> +		test_must_fail git add .
> +	)
> +'
> +
> +test_expect_success 'protocol filter should restart after failure' '
> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.useprotocol true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cat ../test.o >test.r &&
> +		echo "1234567" >test2.o &&
> +		cat test2.o >test2.r &&
> +		echo "this is going to fail" >smudge-write-fail.o &&
> +		cat smudge-write-fail.o >smudge-write-fail.r &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		rm -f *.r &&
> +
> +		printf "" >output.log &&
Is this the same as
 >output.log
to produce an empty file ?

> +		git checkout . &&
> +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
> +		cat >expected_checkout_master.log <<-\EOF &&
> +			start
> +			wrote version
> +			IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [FAIL]
> +			start
> +			wrote version
> +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
> +			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
> +		EOF
> +		test_cmp expected_checkout_master.log smudge_output.log &&
> +
> +		test_cmp ../test.o test.r &&
> +		./../rot13.sh <../test.o >expected &&
> +		git cat-file blob :test.r >actual &&
> +		test_cmp expected actual
> +
> +		test_cmp test2.o test2.r &&
> +		./../rot13.sh <test2.o >expected &&
> +		git cat-file blob :test2.r >actual &&
> +		test_cmp expected actual
> +
> +		test_cmp test2.o test2.r &&
> +		./../rot13.sh <test2.o >expected &&
> +		git cat-file blob :test2.r >actual &&
> +		test_cmp expected actual
> +
> +		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
> +		./../rot13.sh <smudge-write-fail.o >expected &&
> +		git cat-file blob :smudge-write-fail.r >actual &&
> +		test_cmp expected actual							  # Clean worked!
> +	)
> +'
> +
>  test_done
> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
> new file mode 100755
> index 0000000..f2d7a03
> --- /dev/null
> +++ b/t/t0021/rot13.pl
> @@ -0,0 +1,80 @@
> +#!/usr/bin/env perl
Should this be
"$PERL_PATH" ?
And do we need to protect the TC with
test_have_prereq PERL or similar ?


An other solution could be to write a filter in C-language.
> +#
> +# Example implementation for the Git filter protocol version 1
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +
> +use strict;
> +use warnings;
> +use autodie;
> +
> +sub rot13 {
> +    my ($str) = @_;
> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
> +    return $str;
> +}
> +
> +$| = 1; # autoflush STDOUT
> +
> +open my $debug, ">>", "output.log";
> +$debug->autoflush(1);
> +
> +print $debug "start\n";
> +
> +print STDOUT "git-filter-protocol\nversion 1";
Again, I don't like the missing terminator here.
What if we step up to protocol "version 10" ?
Could it work to use one and only one line,
with one terminator, like this ?
print STDOUT "git-filter-protocol version 1\1";
> +print $debug "wrote version\n";
> +
> +while (1) {
> +    my $command = <STDIN>;
> +    unless (defined($command)) {
> +        exit();
> +    }
> +    chomp $command;
> +    print $debug "IN: $command";
> +    my $filename = <STDIN>;
> +    chomp $filename;
> +    print $debug " $filename";
> +    my $filelen  = <STDIN>;
> +    chomp $filelen;
> +    print $debug " $filelen";
> +
> +    $filelen = int($filelen);
> +    my $output;
> +
> +    if ( $filelen > 0 ) {
> +        my $input;
> +        {
> +            binmode(STDIN);
> +            my $bytes_read = 0;
> +            $bytes_read = read STDIN, $input, $filelen;
> +            if ( $bytes_read != $filelen ) {
> +                die "not enough to read";
> +            }
> +            print $debug " [OK] -- ";
> +        }
> +
> +        if ( $command eq "clean") {
> +            $output = rot13($input);
> +        }
> +        elsif ( $command eq "smudge" ) {
> +            $output = rot13($input);
> +        }
> +        else {
> +            die "bad command\n";
> +        }
> +    }
> +
> +    my $output_len = length($output);
> +    print STDOUT "$output_len\n";
> +    print $debug "OUT: $output_len";
> +    if ( $output_len > 0 ) {
> +        if ( ($command eq "clean" and $filename eq "clean-write-fail.r") or
> +             ($command eq "smudge" and $filename eq "smudge-write-fail.r") ) {
> +            print STDOUT "fail";
> +            print $debug " [FAIL]\n"
> +        } else {
> +            print STDOUT $output;
> +            print $debug " [OK]\n";
> +        }
> +    }
> +}
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
  2016-07-22 22:32   ` Torsten Bögershausen
@ 2016-07-22 23:19   ` Ramsay Jones
  2016-07-22 23:28     ` Ramsay Jones
  2016-07-24 17:16     ` Lars Schneider
  2016-07-23  0:11   ` Jakub Narębski
  2016-07-23  8:14   ` Eric Wong
  3 siblings, 2 replies; 77+ messages in thread
From: Ramsay Jones @ 2016-07-22 23:19 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, jnareb, tboegi



On 22/07/16 16:49, larsxschneider@gmail.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.
> 
> This patch adds the filter.<driver>.useProtocol option which, if enabled,
> keeps the external filter process running and processes all blobs with
> the following protocol over stdin/stdout.
> 
> 1. Git starts the filter on first usage and expects a welcome message
> with protocol version number:
> 	Git <-- Filter: "git-filter-protocol\n"
> 	Git <-- Filter: "version 1"

Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
interaction is fully defined, I guess it doesn't matter).

[If you wanted to check for a version, you could add a "version" command
instead, just like "clean" and "smudge".]

[snip]

> diff --git a/convert.c b/convert.c
> index 522e2c5..91ce86f 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return ret;
>  }
>  
> +static int cmd_process_map_init = 0;
> +static struct hashmap cmd_process_map;
> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	const char *cmd;
> +	long protocol;
> +	struct child_process process;
> +};
> +
> +static int cmd2process_cmp(const struct cmd2process *e1,
> +							const struct cmd2process *e2,
> +							const void *unused)
> +{
> +	return strcmp(e1->cmd, e2->cmd);
> +}
> +
> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
> +{
> +	struct cmd2process k;
> +	hashmap_entry_init(&k, strhash(cmd));
> +	k.cmd = cmd;
> +	return hashmap_get(&cmd_process_map, &k, NULL);
> +}
> +
> +static void stop_protocol_filter(struct cmd2process *entry) {
> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);
> +	child_process_clear(&entry->process);
> +	hashmap_remove(&cmd_process_map, entry, NULL);
> +	free(entry);
> +}
> +
> +static struct cmd2process *start_protocol_filter(const char *cmd)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	struct string_list split = STRING_LIST_INIT_NODUP;
> +	const char *argv[] = { NULL, NULL };
> +	const char *header = "git-filter-protocol\nversion";
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	process = &entry->process;
> +
> +	child_process_init(process);
> +	argv[0] = cmd;
> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;
> +	process->out = -1;
> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external persistent filter '%s'", cmd);
> +		return NULL;
> +	}
> +	strbuf_reset(&nbuf);
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;

Hmm, how much will be read into nbuf by this single call?
Since strbuf_read_once() makes a single call to xread(), with
a len argument that will probably be 8192, you can not really
tell how much it will read, in general. (xread() does not
guarantee how many bytes it will read.)

In particular, it could be less than strlen(header).

> +	sigchain_pop(SIGPIPE);
> +
> +	strbuf_stripspace(&nbuf, 0);
> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
> +	ret &= split.nr > 1;
> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
> +	if (ret) {
> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
> +		switch (entry->protocol) {
> +			case 1:
> +				break;
> +			default:
> +				ret = 0;
> +				error("unsupported protocol version %s for external persistent filter '%s'",
> +					nbuf.buf, cmd);
> +		}
> +	}
> +	string_list_clear(&split, 0);
> +	strbuf_release(&nbuf);
> +
> +	if (!ret) {
> +		error("initialization for external persistent filter '%s' failed", cmd);
> +		return NULL;
> +	}
> +
> +	hashmap_add(&cmd_process_map, entry);
> +	return entry;
> +}
> +
> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const char *filter_type)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;
> +	struct stat fileStat;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	size_t nbuf_len;
> +	char *strtol_end;
> +	char c;
> +
> +	if (!cmd || !*cmd)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;
> +
> +	if (!cmd_process_map_init) {
> +		cmd_process_map_init = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +	} else {
> +		entry = find_protocol_filter_entry(cmd);
> +	}
> +
> +	if (!entry){
> +		entry = start_protocol_filter(cmd);
> +		if (!entry) {
> +			stop_protocol_filter(entry);
> +			return 0;
> +		}
> +	}
> +	process = &entry->process;
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	switch (entry->protocol) {
> +		case 1:
> +			if (fd >= 0 && !src) {
> +				ret &= fstat(fd, &fileStat) != -1;
> +				len = fileStat.st_size;
> +			}
> +			strbuf_reset(&nbuf);
> +			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
> +			ret &= write_str_in_full(process->in, nbuf.buf) > 1;

why not write_in_full(process->in, nbuf.buf, nbuf.len) ?

> +			if (len > 0) {
> +				if (src)
> +					ret &= write_in_full(process->in, src, len) == len;
> +				else if (fd >= 0)
> +					ret &= copy_fd(fd, process->in) == 0;
> +				else
> +					ret &= 0;
> +			}
> +
> +			strbuf_reset(&nbuf);
> +			while (xread(process->out, &c, 1) == 1 && c != '\n')
> +				strbuf_addchars(&nbuf, c, 1);
> +			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
> +			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
> +			strbuf_reset(&nbuf);
> +			if (nbuf_len > 0)
> +				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;

Again, how many bytes will be read?
Note, that in the default configuration, a _maximum_ of
MAX_IO_SIZE (8MB or SSIZE_MAX, whichever is smaller) bytes
will be read.

ATB,
Ramsay Jones


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 23:19   ` Ramsay Jones
@ 2016-07-22 23:28     ` Ramsay Jones
  2016-07-24 17:16     ` Lars Schneider
  1 sibling, 0 replies; 77+ messages in thread
From: Ramsay Jones @ 2016-07-22 23:28 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, jnareb, tboegi


Hi Lars,

On 23/07/16 00:19, Ramsay Jones wrote:
> 
> 
> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>>
>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>> keeps the external filter process running and processes all blobs with
>> the following protocol over stdin/stdout.
>>
>> 1. Git starts the filter on first usage and expects a welcome message
>> with protocol version number:
>> 	Git <-- Filter: "git-filter-protocol\n"
>> 	Git <-- Filter: "version 1"
> 
> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
> interaction is fully defined, I guess it doesn't matter).
> 
> [If you wanted to check for a version, you could add a "version" command
> instead, just like "clean" and "smudge".]
> 
> [snip]
> 
>> diff --git a/convert.c b/convert.c
>> index 522e2c5..91ce86f 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>>  	return ret;
>>  }
>>  
>> +static int cmd_process_map_init = 0;
>> +static struct hashmap cmd_process_map;
>> +
>> +struct cmd2process {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	const char *cmd;
>> +	long protocol;
>> +	struct child_process process;
>> +};
>> +
>> +static int cmd2process_cmp(const struct cmd2process *e1,
>> +							const struct cmd2process *e2,
>> +							const void *unused)
>> +{
>> +	return strcmp(e1->cmd, e2->cmd);
>> +}
>> +
>> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
>> +{
>> +	struct cmd2process k;
>> +	hashmap_entry_init(&k, strhash(cmd));
>> +	k.cmd = cmd;
>> +	return hashmap_get(&cmd_process_map, &k, NULL);
>> +}
>> +
>> +static void stop_protocol_filter(struct cmd2process *entry) {
>> +	if (!entry)
>> +		return;
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	close(entry->process.in);
>> +	close(entry->process.out);
>> +	sigchain_pop(SIGPIPE);
>> +	finish_command(&entry->process);
>> +	child_process_clear(&entry->process);
>> +	hashmap_remove(&cmd_process_map, entry, NULL);
>> +	free(entry);
>> +}
>> +
>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry = NULL;
>> +	struct child_process *process = NULL;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>> +	const char *argv[] = { NULL, NULL };
>> +	const char *header = "git-filter-protocol\nversion";
>> +
>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	process = &entry->process;
>> +
>> +	child_process_init(process);
>> +	argv[0] = cmd;
>> +	process->argv = argv;
>> +	process->use_shell = 1;
>> +	process->in = -1;
>> +	process->out = -1;
>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external persistent filter '%s'", cmd);
>> +		return NULL;
>> +	}
>> +	strbuf_reset(&nbuf);
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
> 
> Hmm, how much will be read into nbuf by this single call?
> Since strbuf_read_once() makes a single call to xread(), with
> a len argument that will probably be 8192, you can not really
> tell how much it will read, in general. (xread() does not
> guarantee how many bytes it will read.)
> 
> In particular, it could be less than strlen(header).

Please ignore this email, it's late ... ;-)

Sorry for the noise.

[Off to bed now]

ATB,
Ramsay Jones


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
  2016-07-22 22:32   ` Torsten Bögershausen
  2016-07-22 23:19   ` Ramsay Jones
@ 2016-07-23  0:11   ` Jakub Narębski
  2016-07-23  7:27     ` Eric Wong
  2016-07-24 18:36     ` Lars Schneider
  2016-07-23  8:14   ` Eric Wong
  3 siblings, 2 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-23  0:11 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, tboegi

W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>

Nb. this line is only needed if you want author name and/or date
different from the email sender, or if you have sender line misconfigured
(e.g. lacking the human readable name).

> 
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.

Do I understand it correctly (from the commit description) that with
this new option you start one filter process for the whole life of
the Git command (e.g. `git add .`), which may perform more than one
cleanup, but do not create a daemon or daemon-like process which
would live for several commands invocations?

> 
> This patch adds the filter.<driver>.useProtocol option which, if enabled,
> keeps the external filter process running and processes all blobs with
> the following protocol over stdin/stdout.

I agree with Junio that the name "useProtocol" is bad, and not quite
right. Perhaps "persistent" would be better? Also, what is the value
of `filter.<driver>.useProtocol`: boolean? or a script name?

I also agree that we might wat to be able to keep clean and smudge
filters separate, but be able to run a single program if they are
both the same. I think there is a special case for filter unset,
and/or filter being "cat" -- we would want to keep that.

My proposal is to use `filter.<driver>.persistent` as an addition
to 'clean' and 'smudge' variables, with the following possible
values:

  * none (the default)
  * clean
  * smudge
  * both

I assume that either Git would have to start multiple filter
commands for multi-threaded operation, or the protocol would have
to be extended to make persistent filter fork itself.


BTW. what would happen in your original proposal if the user had
*both* filter.<driver>.useProtocol and filter.<driver>.smudge
(and/or filter.<driver>.clean) set?

> 
> 1. Git starts the filter on first usage and expects a welcome message
> with protocol version number:
> 	Git <-- Filter: "git-filter-protocol\n"
> 	Git <-- Filter: "version 1"

I was wondering how Git would know that filter executable was started,
but then I realized it was once-per-command invocation, not a daemon.

I agree with Torsten that there should be a terminator after the
version number.

Also, for future extendability this should be probably followed by
possibly empty list of script capabilities, that is:

 	Git <-- Filter: "git-filter-protocol\n"
 	Git <-- Filter: "version 1.1\n"
 	Git <-- Filter: "capabilities clean smudge\n"

Or we can add capabilities in later version...

BTW. why not follow e.g. HTTP protocol example, and use

 	Git <-- Filter: "git-filter-protocol/1\n"

> 2. Git sends the command (either "smudge" or "clean"), the filename, the
> content size in bytes, and the content separated by a newline character:
> 	Git --> Filter: "smudge\n"

Would it help (for some cases) to pass the name of filter that
is being invoked?

> 	Git --> Filter: "testfile.dat\n"

Unfortunately, while sane filenames should not contain newlines[1],
the unfortunate fact is that *filenames can include newlines*, and
you need to be able to handle that[2].  Therefore you need either to
choose a different separator (the only one that can be safely used
is "\0", i.e. the NUL character - but it is not something easy to
handle by shell scripts), or C-quote filenames as needed, or always
C-quote filenames.  C-quoting at minimum should include quoting newline
character, and the escape character itself.

BTW. is it the basename of a file, or a full pathname?

[1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[2]: http://www.dwheeler.com/essays/filenames-in-shell.html

> 	Git --> Filter: "7\n"

That's the content size in bytes written as an ASCII number.

> 	Git --> Filter: "CONTENT"

Can filter ignore the content size, and just read all what it was
sent, that is until eof or something?

> 
> 3. The filter is expected to answer with the result content size in
> bytes and the result content separated by a newline character:
> 	Git <-- Filter: "15\n"
> 	Git <-- Filter: "SMUDGED_CONTENT"

I wonder how hard would be to write filters for this protocol...

> 
> 4. The filter is expected to wait for the next file in step 2, again.
> 
> Please note that the protocol filters do not support stream processing
> with this implemenatation because the filter needs to know the length of
            ^^^^^^^^~^^^^^^

implementation

> the result in advance. A protocol version 2 could address this in a
> future patch.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
> ---
>  Documentation/gitattributes.txt |  41 +++++++-
>  convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>  t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++

Wouldn't it be better to name the test case something more
descriptive, for example

   t/t0021-filter-driver-useProtocol.sh

The name of test should be adjusted to final name of the feature,
of course.

>  t/t0021/rot13.pl                |  80 +++++++++++++++
>  4 files changed, 494 insertions(+), 7 deletions(-)
>  create mode 100755 t/t0021/rot13.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 8882a3e..7026d62 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If the setting filter.<driver>.useProtocol is
> +enabled then Git can process all blobs with a single filter command
> +invocation (see filter protocol below).

This does not tell the precedence between `smudge`, `clean` and
filter.<driver>.useProtocol, see above. Also, discrepancy in how
config variables are referenced.

>  
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -375,6 +378,42 @@ substitution.  For example:
>  ------------------------
>  
>  
> +Filter Protocol
> +^^^^^^^^^^^^^^^
> +
> +If the setting filter.<driver>.useProtocol is enabled then Git

This seems to tell that `useProtocol` is boolean-valued (?)

> +can process all blobs with a single filter command invocation
> +by talking with the following protocol over stdin/stdout.

Should we use stdin/stdout shortcut, or spell standard input
and standard output in full?

> diff --git a/convert.c b/convert.c
> index 522e2c5..91ce86f 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return ret;
>  }
>  
> +static int cmd_process_map_init = 0;
> +static struct hashmap cmd_process_map;
> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	const char *cmd;
> +	long protocol;
> +	struct child_process process;
> +};
[...]
> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
> +{
> +	struct cmd2process k;
> +	hashmap_entry_init(&k, strhash(cmd));
> +	k.cmd = cmd;
> +	return hashmap_get(&cmd_process_map, &k, NULL);

Should we use global variable cmd_process_map, or pass it as parameter?
The same question apply for other procedures and functions.

Note that I am not saying that it is a bad thing to use global
variable here.

[...]
> +static struct cmd2process *start_protocol_filter(const char *cmd)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	struct string_list split = STRING_LIST_INIT_NODUP;
> +	const char *argv[] = { NULL, NULL };
> +	const char *header = "git-filter-protocol\nversion";
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	process = &entry->process;
> +
> +	child_process_init(process);
> +	argv[0] = cmd;
> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;
> +	process->out = -1;
> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external persistent filter '%s'", cmd);
> +		return NULL;
> +	}
> +	strbuf_reset(&nbuf);

Is strbuf_reset needed here? We have not used nbuf variable yet.

> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
> +	sigchain_pop(SIGPIPE);
> +
> +	strbuf_stripspace(&nbuf, 0);
> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
> +	ret &= split.nr > 1;
> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
> +	if (ret) {
> +		entry->protocol = strtol(split.items[1].string, NULL, 10);

This does not handle at least some errors in version number parsing,
for example junk after version number. Don't we have some helper
functions for this?

Nb. this code makes it so that the version number must be integer.

> +		switch (entry->protocol) {
> +			case 1:
> +				break;
> +			default:
> +				ret = 0;
> +				error("unsupported protocol version %s for external persistent filter '%s'",
> +					nbuf.buf, cmd);
> +		}
> +	}
> +	string_list_clear(&split, 0);
> +	strbuf_release(&nbuf);
> +
> +	if (!ret) {
> +		error("initialization for external persistent filter '%s' failed", cmd);
> +		return NULL;
> +	}

Do we handle persistent filter command being killed before it finishes?
Or exiting with error? I don't know this Git API...

> +
> +	hashmap_add(&cmd_process_map, entry);
> +	return entry;
> +}
[...]

> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index a05a8d2..d9077ea 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -268,4 +268,174 @@ test_expect_success 'disable filter with empty override' '
>  	test_must_be_empty err
>  '
>  
> +test_expect_success 'required protocol filter should filter data' '
> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&

Perhaps align it?

  +	test_config_global filter.protocol.clean  \"$TEST_DIRECTORY/t0021/rot13.pl\" &&


> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl

That's bit more than rot13... but it might be O.K. for a filename here.

> new file mode 100755
> index 0000000..f2d7a03
> --- /dev/null
> +++ b/t/t0021/rot13.pl
> @@ -0,0 +1,80 @@
> +#!/usr/bin/env perl

Don't we use other way to specify perl path for Git, and for its
test suite?

> +#
> +# Example implementation for the Git filter protocol version 1
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +
> +use strict;
> +use warnings;
> +use autodie;

autodie?

> +
> +sub rot13 {
> +    my ($str) = @_;
> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
> +    return $str;
> +}
> +
> +$| = 1; # autoflush STDOUT

Perhaps *STDOUT->autoflush(1), if I remember my Perl correctly?
Should this matter? Why it is needed?

> +
> +open my $debug, ">>", "output.log";
> +$debug->autoflush(1);
> +
> +print $debug "start\n";
> +
> +print STDOUT "git-filter-protocol\nversion 1";
> +print $debug "wrote version\n";
> +
> +while (1) {
> +    my $command = <STDIN>;
> +    unless (defined($command)) {
> +        exit();
> +    }
> +    chomp $command;
> +    print $debug "IN: $command";
> +    my $filename = <STDIN>;
> +    chomp $filename;
> +    print $debug " $filename";
> +    my $filelen  = <STDIN>;
> +    chomp $filelen;
> +    print $debug " $filelen";
> +
> +    $filelen = int($filelen);
> +    my $output;
> +
> +    if ( $filelen > 0 ) {

Inconsistent style. You use

       unless (defined($command)) {

without extra whitespace after and before parentheses, but

       if ( $filelen > 0 ) {

instead of simply

       if ($filelen > 0) {

> +        my $input;
> +        {
> +            binmode(STDIN);
> +            my $bytes_read = 0;
> +            $bytes_read = read STDIN, $input, $filelen;
> +            if ( $bytes_read != $filelen ) {
> +                die "not enough to read";

I know it's only a test script (well, a part of one), but we would probably
want to have more information in the case of a real filter.

> +            }
> +            print $debug " [OK] -- ";
> +        }
> +
> +        if ( $command eq "clean") {
> +            $output = rot13($input);
> +        }
> +        elsif ( $command eq "smudge" ) {

Style; I think we use

  +        } elsif ( $command eq "smudge" ) {

> +            $output = rot13($input);
> +        }
> +        else {
> +            die "bad command\n";

Same here (both about style, and error message).

> +        }
> +    }

What happens if $filelen is zero, or negative? Ah, I see that $output
would be undef... which is bad, I think.

> +
> +    my $output_len = length($output);
> +    print STDOUT "$output_len\n";
> +    print $debug "OUT: $output_len";
> +    if ( $output_len > 0 ) {
> +        if ( ($command eq "clean" and $filename eq "clean-write-fail.r") or
> +             ($command eq "smudge" and $filename eq "smudge-write-fail.r") ) {

Hardcoded filenames, without it being described in the file header?

> +            print STDOUT "fail";

This is not defined in the protocol description!  Unless anything that
does not conform to the specification would work here, but at least it
is a recommended practice to be described in the documentation, don't
you think?

What would happen in $output_len is 4?

> +            print $debug " [FAIL]\n"
> +        } else {
> +            print STDOUT $output;
> +            print $debug " [OK]\n";
> +        }
> +    }
> +}

-- 
Jakub Narębski
 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-23  0:11   ` Jakub Narębski
@ 2016-07-23  7:27     ` Eric Wong
  2016-07-26 20:00       ` Jeff King
  2016-07-24 18:36     ` Lars Schneider
  1 sibling, 1 reply; 77+ messages in thread
From: Eric Wong @ 2016-07-23  7:27 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: larsxschneider, git, peff, tboegi

Jakub Narębski <jnareb@gmail.com> wrote:
> W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
> > +use strict;
> > +use warnings;
> > +use autodie;
> 
> autodie?

"set -e" for Perl (man autodie)

It's been a part of Perl for ages, but I've never used it
myself, either; I suppose it's fine for tests...

> > +$| = 1; # autoflush STDOUT
> 
> Perhaps *STDOUT->autoflush(1), if I remember my Perl correctly?
> Should this matter? Why it is needed?

It's better to always disable automatic output buffering when
writing to pipes or sockets for IPC.  Otherwise output may be
buffered indefinitely because the buffering mechanism doesn't
know a reader is stalled.

Same problem with using stdio.h functions for IPC in C.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
                     ` (2 preceding siblings ...)
  2016-07-23  0:11   ` Jakub Narębski
@ 2016-07-23  8:14   ` Eric Wong
  2016-07-24 19:11     ` Lars Schneider
  3 siblings, 1 reply; 77+ messages in thread
From: Eric Wong @ 2016-07-23  8:14 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, jnareb, tboegi

larsxschneider@gmail.com wrote:
> Please note that the protocol filters do not support stream processing
> with this implemenatation because the filter needs to know the length of
> the result in advance. A protocol version 2 could address this in a
> future patch.

Would it be prudent to reuse pkt-line for this?

> +static void stop_protocol_filter(struct cmd2process *entry) {
> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);
> +	child_process_clear(&entry->process);
> +	hashmap_remove(&cmd_process_map, entry, NULL);
> +	free(entry);
> +}
> +
> +static struct cmd2process *start_protocol_filter(const char *cmd)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;

These are unconditionally set below, so initializing to NULL
may hide future bugs.

> +	struct strbuf nbuf = STRBUF_INIT;
> +	struct string_list split = STRING_LIST_INIT_NODUP;
> +	const char *argv[] = { NULL, NULL };
> +	const char *header = "git-filter-protocol\nversion";

	static const char header[] = "git-filter-protocol\nversion";

...might be smaller by avoiding the extra pointer
(but compilers ought to be able to optimize it)

> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	process = &entry->process;

<snip>

> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;

starts_with() is probably more readable, here.

> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const char *filter_type)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry = NULL;
> +	struct child_process *process = NULL;

I would leave process initialized, here, since it should
always be set below:

> +	struct stat fileStat;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	size_t nbuf_len;
> +	char *strtol_end;
> +	char c;
> +
> +	if (!cmd || !*cmd)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;
> +
> +	if (!cmd_process_map_init) {
> +		cmd_process_map_init = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +	} else {
> +		entry = find_protocol_filter_entry(cmd);
> +	}
> +
> +	if (!entry){
> +		entry = start_protocol_filter(cmd);
> +		if (!entry) {
> +			stop_protocol_filter(entry);

stop_protocol_filter is a no-op, here, since entry is NULL

> +			return 0;
> +		}
> +	}
> +	process = &entry->process;
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	switch (entry->protocol) {
> +		case 1:
> +			if (fd >= 0 && !src) {
> +				ret &= fstat(fd, &fileStat) != -1;
> +				len = fileStat.st_size;

There's a truncation bug when sizeof(size_t) < sizeof(off_t)
(and mixedCase is inconsistent with our style)

> +    my $filelen  = <STDIN>;
> +    chomp $filelen;
> +    print $debug " $filelen";
> +
> +    $filelen = int($filelen);

Calling int() here is unnecessary and may hide bugs if you
forget to check $debug.   Perhaps a regexp check is safer:

	$filelen =~ /\A\d+\z/ or die "bad filelen: $filelen\n";

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 0/3] Git filter protocol
  2016-07-22 21:39 ` [PATCH v1 0/3] Git filter protocol Junio C Hamano
@ 2016-07-24 11:24   ` Lars Schneider
  2016-07-26 20:11     ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-24 11:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	peartben, mlbright


On 22 Jul 2016, at 23:39, Junio C Hamano <gitster@pobox.com> wrote:

> larsxschneider@gmail.com writes:
> 
>> The first two patches are cleanup patches which are not really necessary
>> for the feature.
> 
> These two looked trivially good.
Thanks!


> I think I can agree with what 3/3 wants to do in principle, but
> 
> * "protocol" is not quite the right word.  The current way to
>   interact with clean and smudge filters can be considered using a
>   different "protocol", that conveys the data and the options via
>   the command line and pipe.  The most distinguishing feature that
>   differentiates the old way and the new style this change allows
>   is that it allows you to have a single instance of the process
>   running that can be reused?
I agree that the name is not ideal. When I started working on the
featured I called it "streaming" but then I read your comment in
$gmane/299863 and realized that this would be a misleading name.
Afterwards I called it "persistent"/"long running" but then I thought 
this term could trick people into thinking that this is some kind of 
daemon. Somehow I want to convey that the filter is persistent for 
one Git invocation only.

What if we would keep the config option "protocol" and make it an "int"? 
Undefined or version "1" would describe the existing clean/smudge 
protocol via command line and pipe. Version "2" would be the new protocol?


> * I am not sure what's the pros-and-cons in forcing people writing
>   a single program that can do both cleaning and smudging.  You
>   cannot have only "smudge" side that uses the long-running process
>   while "clean" side that runs single-shot invocation with this
>   design, which I'd imagine would be a downside.  If you are going
>   to use a long-running process interface for both sides, this
>   design allows you to do it with fewer number of processes, which
>   may be an upside.
We could define the protocol for clean and smudge individually. However,
if you have implemented the more complicated long-running protocol already
for one filter, then you could reuse the code for the other filter, too, as
this protocol is, as far as I can see, always more efficient (assuming you 
have source code access to both filters). Another argument could be that we 
don't define the "required" flag for the filters individually either.


> * The way the serialized access to these long-running processes
>   work in 3/3 would make it harder or impossible to later
>   parallelize conversion?  I am imagining a far future where we
>   would run "git checkout ." using (say) two threads, one
>   responsible for active_cache[0..active_nr/2] and the other
>   responsible for the remainder.
I hope this future is not too far away :-) 
However, I don't think that would be a problem as we could start the
long-running process once for each checkout thread, no?


Thank you,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 22:32   ` Torsten Bögershausen
@ 2016-07-24 12:09     ` Lars Schneider
  0 siblings, 0 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-24 12:09 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Git Mailing List, Jeff King, jnareb, mlbright


On 23 Jul 2016, at 00:32, Torsten Bögershausen <tboegi@web.de> wrote:

> On 07/22/2016 05:49 PM, larsxschneider@gmail.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> [...]
>> 
>> 1. Git starts the filter on first usage and expects a welcome message
>> with protocol version number:
>> 	Git <-- Filter: "git-filter-protocol\n"
>> 	Git <-- Filter: "version 1"
> Is there no terminator here ?
> How long should the reading side wait without a '\n', or should it read
> "version 1\n" ?
I agree, I will add the "\n" terminator!


>> [...]
>> 
>> Please note that the protocol filters do not support stream processing
>> with this implemenatation because the filter needs to know the length of
>            ^^^^^^^^^^^^^^^^typo
Thanks!


>> [...]
>> 
>> +
>> +test_expect_success EXPENSIVE 'protocol filter large file' '
>> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
>> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
>> +	rm -rf repo &&
>> +	mkdir repo &&
>> +	(
>> +		cd repo &&
>> +		git init &&
>> +
>> +		echo "2GB filter=largefile" >.gitattributes &&
>> +		for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
> Side question:
> Is there a way to "re-use" the 2GB test file through t0021?
> It takes a long time to produce it, especially on my 32 Bit systems ;-)
> But this may be a different patch.
I would like to keep the tests as unentangled as possible and therefore a direct
reuse might not be ideal. However, I could add a new "EXPENSIVE setup test" that 
prepares the file for both tests.


>> [...]
>> +
>> +		printf "" >output.log &&
> Is this the same as
> >output.log
> to produce an empty file ?
Yes, thank you :-)


>> [...]
>> +++ b/t/t0021/rot13.pl
>> @@ -0,0 +1,80 @@
>> +#!/usr/bin/env perl
> Should this be
> "$PERL_PATH" ?
I think we can't use this variable directly in the script. I could create the script file 
for the test and set the shebang to this value. However, no other "Perl file test" does it
and therefore I wonder if it is necessary:
t/t0202/test.pl
t/t9000/test.pl
t/t9700/test.pl
According to the documentation this is useful to avoid trouble on Windows. I will check
this test on Windows.

I also just noticed that all other Perl tests use "#!/usr/bin/perl". Should I change mine
to match those?


> And do we need to protect the TC with
> test_have_prereq PERL or similar ?
Probably not as the documentation states "Even without the PERL prerequisite, tests can 
assume there is a usable perl interpreter". However, all other Perl file tests do the same
and therefore I think it is a good idea.


>> [...]
>> +
>> +print STDOUT "git-filter-protocol\nversion 1";
> Again, I don't like the missing terminator here.
> What if we step up to protocol "version 10" ?
> Could it work to use one and only one line,
> with one terminator, like this ?
> print STDOUT "git-filter-protocol version 1\1";
The missing terminator was a mistake. As mentioned above, 
I will add it!


Thanks for the review,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-22 23:19   ` Ramsay Jones
  2016-07-22 23:28     ` Ramsay Jones
@ 2016-07-24 17:16     ` Lars Schneider
  2016-07-24 22:36       ` Ramsay Jones
  1 sibling, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-24 17:16 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	mlbright


On 23 Jul 2016, at 01:19, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:

> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>> 
>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>> keeps the external filter process running and processes all blobs with
>> the following protocol over stdin/stdout.
>> 
>> 1. Git starts the filter on first usage and expects a welcome message
>> with protocol version number:
>> 	Git <-- Filter: "git-filter-protocol\n"
>> 	Git <-- Filter: "version 1"
> 
> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
> interaction is fully defined, I guess it doesn't matter).
> 
> [If you wanted to check for a version, you could add a "version" command
> instead, just like "clean" and "smudge".]

It was a conscious decision to have the `filter` talk first. My reasoning was:

(1) I want a reliable way to distinguish the existing filter protocol ("single-shot 
invocation") from the new one ("long running"). I don't think there would be a
situation where the existing protocol would talk first. Therefore the users would
not accidentally mix them with a possibly half working, undetermined, outcome.

(2) In the future we could extend the pipe protocol (see $gmane/297994, it's very
interesting). A filter could check Git's version and then pick the most appropriate
filter protocol on startup.


> [...]
>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry = NULL;
>> +	struct child_process *process = NULL;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>> +	const char *argv[] = { NULL, NULL };
>> +	const char *header = "git-filter-protocol\nversion";
>> +
>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	process = &entry->process;
>> +
>> +	child_process_init(process);
>> +	argv[0] = cmd;
>> +	process->argv = argv;
>> +	process->use_shell = 1;
>> +	process->in = -1;
>> +	process->out = -1;
>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external persistent filter '%s'", cmd);
>> +		return NULL;
>> +	}
>> +	strbuf_reset(&nbuf);
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
> 
> Hmm, how much will be read into nbuf by this single call?
> Since strbuf_read_once() makes a single call to xread(), with
> a len argument that will probably be 8192, you can not really
> tell how much it will read, in general. (xread() does not
> guarantee how many bytes it will read.)
> 
> In particular, it could be less than strlen(header).

As mentioned to Torsten in $gmane/300156, I will add a newline
and then read until I find the second newline. That should solve
the problem, right?

(You wrote in $gmane/300119 that I should ignore your email but
I think you have a valid point here ;-)


>> [...]
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	switch (entry->protocol) {
>> +		case 1:
>> +			if (fd >= 0 && !src) {
>> +				ret &= fstat(fd, &fileStat) != -1;
>> +				len = fileStat.st_size;
>> +			}
>> +			strbuf_reset(&nbuf);
>> +			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
>> +			ret &= write_str_in_full(process->in, nbuf.buf) > 1;
> 
> why not write_in_full(process->in, nbuf.buf, nbuf.len) ?
OK, this would save a "strlen" call. Do you think such a function could be of general
use? If yes, then I would add:

static inline ssize_t write_strbuf_in_full(int fd, struct strbuf *str)
{
	return write_in_full(fd, str->buf, str->len);
}


>> +			if (len > 0) {
>> +				if (src)
>> +					ret &= write_in_full(process->in, src, len) == len;
>> +				else if (fd >= 0)
>> +					ret &= copy_fd(fd, process->in) == 0;
>> +				else
>> +					ret &= 0;
>> +			}
>> +
>> +			strbuf_reset(&nbuf);
>> +			while (xread(process->out, &c, 1) == 1 && c != '\n')
>> +				strbuf_addchars(&nbuf, c, 1);
>> +			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
>> +			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
>> +			strbuf_reset(&nbuf);
>> +			if (nbuf_len > 0)
>> +				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;
> 
> Again, how many bytes will be read?
> Note, that in the default configuration, a _maximum_ of
> MAX_IO_SIZE (8MB or SSIZE_MAX, whichever is smaller) bytes
> will be read.
Would something like this be more appropriate?

strbuf_reset(&nbuf);
if (nbuf_len > 0) {
    strbuf_grow(&nbuf, nbuf_len);
    ret &= read_in_full(process->out, nbuf.buf, nbuf_len) == nbuf_len;
}


Thanks for the review,
Lars


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-23  0:11   ` Jakub Narębski
  2016-07-23  7:27     ` Eric Wong
@ 2016-07-24 18:36     ` Lars Schneider
  2016-07-24 20:14       ` Jakub Narębski
  1 sibling, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-24 18:36 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright


On 23 Jul 2016, at 02:11, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Nb. this line is only needed if you want author name and/or date
> different from the email sender, or if you have sender line misconfigured
> (e.g. lacking the human readable name).

I use "git format-patch" to generate these emails:

git format-patch --cover-letter --subject-prefix="PATCH ..." -M $BASE -o $OUTPUT

How would I disable this line? (I already checked the man page to no avail).
Plus, what does "Nb" stand for? :-)


>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
> 
> Do I understand it correctly (from the commit description) that with
> this new option you start one filter process for the whole life of
> the Git command (e.g. `git add .`), which may perform more than one
> cleanup, but do not create a daemon or daemon-like process which
> would live for several commands invocations?

Correct!


>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>> keeps the external filter process running and processes all blobs with
>> the following protocol over stdin/stdout.
> 
> I agree with Junio that the name "useProtocol" is bad, and not quite
> right. Perhaps "persistent" would be better? Also, what is the value
> of `filter.<driver>.useProtocol`: boolean? or a script name?

I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
I thought about "persistent" but this name wouldn't convey the scope of the 
persistency ("persistent for one Git operation" vs. "persistent for many Git 
operations"). What do you think about the protocol as int version idea
described in $gmane/300155 ?


> I also agree that we might wat to be able to keep clean and smudge
> filters separate, but be able to run a single program if they are
> both the same. I think there is a special case for filter unset,
> and/or filter being "cat" -- we would want to keep that.

Since 1a8630d there is a more efficient way to unset a filter ;-)
Can you think of other cases where the separation would be useful?


> My proposal is to use `filter.<driver>.persistent` as an addition
> to 'clean' and 'smudge' variables, with the following possible
> values:
> 
>  * none (the default)
>  * clean
>  * smudge
>  * both

That could work. However, I am not convinced, yet, that separate
filters are an actual use case.


> I assume that either Git would have to start multiple filter
> commands for multi-threaded operation, or the protocol would have
> to be extended to make persistent filter fork itself.

I think it would be better to have Git start multiple filter commands
to keep the protocol as simple and error free as possible.


> BTW. what would happen in your original proposal if the user had
> *both* filter.<driver>.useProtocol and filter.<driver>.smudge
> (and/or filter.<driver>.clean) set?

That wouldn't be an issue as "useProtocol" is just a boolean that
tells Git how to talk to "filter.<driver>.smudge" and "filter.<driver>.clean".
I need to make this more clear in the documentation.


>> 1. Git starts the filter on first usage and expects a welcome message
>> with protocol version number:
>> 	Git <-- Filter: "git-filter-protocol\n"
>> 	Git <-- Filter: "version 1"
> 
> I was wondering how Git would know that filter executable was started,
> but then I realized it was once-per-command invocation, not a daemon.
> 
> I agree with Torsten that there should be a terminator after the
> version number.

I agree, too :)


> Also, for future extendability this should be probably followed by
> possibly empty list of script capabilities, that is:
> 
> 	Git <-- Filter: "git-filter-protocol\n"
> 	Git <-- Filter: "version 1.1\n"
> 	Git <-- Filter: "capabilities clean smudge\n"
> 
> Or we can add capabilities in later version...

That is an interesting idea. My initial thought was to make the capabilities
of a certain version fix. If we want to add new capabilities then we would 
bump the version. I wonder what others think about your suggestion!


> BTW. why not follow e.g. HTTP protocol example, and use
> 
> 	Git <-- Filter: "git-filter-protocol/1\n"

I think my proposal is a bit more explicit as it states "version". If
you feel strongly about it, I could be convinced otherwise.


>> 2. Git sends the command (either "smudge" or "clean"), the filename, the
>> content size in bytes, and the content separated by a newline character:
>> 	Git --> Filter: "smudge\n"
> 
> Would it help (for some cases) to pass the name of filter that
> is being invoked?

Interesting thought! Can you imagine a case where this would be useful?


>> 	Git --> Filter: "testfile.dat\n"
> 
> Unfortunately, while sane filenames should not contain newlines[1],
> the unfortunate fact is that *filenames can include newlines*, and
> you need to be able to handle that[2].  Therefore you need either to
> choose a different separator (the only one that can be safely used
> is "\0", i.e. the NUL character - but it is not something easy to
> handle by shell scripts), or C-quote filenames as needed, or always
> C-quote filenames.  C-quoting at minimum should include quoting newline
> character, and the escape character itself.
> 
> BTW. is it the basename of a file, or a full pathname?
> 
> [1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
> [2]: http://www.dwheeler.com/essays/filenames-in-shell.html

Thanks for this explanation. A bash version of the protocol is not
trivial (I tried it but ended up using Perl). Therefore I think "\0"
would be a good choice?


>> 	Git --> Filter: "7\n"
> 
> That's the content size in bytes written as an ASCII number.

Correct.


>> 	Git --> Filter: "CONTENT"
> 
> Can filter ignore the content size, and just read all what it was
> sent, that is until eof or something?

What would that something be? Since CONTENT is binary it can contain
any character (even "\0")...


>> 3. The filter is expected to answer with the result content size in
>> bytes and the result content separated by a newline character:
>> 	Git <-- Filter: "15\n"
>> 	Git <-- Filter: "SMUDGED_CONTENT"
> 
> I wonder how hard would be to write filters for this protocol...

Easy :-) Plus you can look at a Perl (see t/t0021) and a golang implementation
already (https://github.com/github/git-lfs/pull/1382).


>> 4. The filter is expected to wait for the next file in step 2, again.
>> 
>> Please note that the protocol filters do not support stream processing
>> with this implemenatation because the filter needs to know the length of
>            ^^^^^^^^~^^^^^^
> 
> implementation
Thanks!


>> the result in advance. A protocol version 2 could address this in a
>> future patch.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
>> ---
>> Documentation/gitattributes.txt |  41 +++++++-
>> convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>> t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
> 
> Wouldn't it be better to name the test case something more
> descriptive, for example
> 
>   t/t0021-filter-driver-useProtocol.sh
> 
> The name of test should be adjusted to final name of the feature,
> of course.

I think the prefix numbers should be unique, no? And t0022 is already taken.


>> t/t0021/rot13.pl                |  80 +++++++++++++++
>> 4 files changed, 494 insertions(+), 7 deletions(-)
>> create mode 100755 t/t0021/rot13.pl
>> 
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index 8882a3e..7026d62 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate. If the setting filter.<driver>.useProtocol is
>> +enabled then Git can process all blobs with a single filter command
>> +invocation (see filter protocol below).
> 
> This does not tell the precedence between `smudge`, `clean` and
> filter.<driver>.useProtocol, see above. Also, discrepancy in how
> config variables are referenced.

As mentioned above "useProtocol" is a boolean. Therefore precedence shouldn't
be a problem. What do you mean by "discrepancy in how config variables are 
referenced"?


>> One use of the content filtering is to massage the content into a shape
>> that is more convenient for the platform, filesystem, and the user to use.
>> @@ -375,6 +378,42 @@ substitution.  For example:
>> ------------------------
>> 
>> 
>> +Filter Protocol
>> +^^^^^^^^^^^^^^^
>> +
>> +If the setting filter.<driver>.useProtocol is enabled then Git
> 
> This seems to tell that `useProtocol` is boolean-valued (?)

Correct.


>> +can process all blobs with a single filter command invocation
>> +by talking with the following protocol over stdin/stdout.
> 
> Should we use stdin/stdout shortcut, or spell standard input
> and standard output in full?

I think the documentation is not consistent here but spelled in full
seems to be used more often. I will change it!


>> diff --git a/convert.c b/convert.c
>> index 522e2c5..91ce86f 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>> 	return ret;
>> }
>> 
>> +static int cmd_process_map_init = 0;
>> +static struct hashmap cmd_process_map;
>> +
>> +struct cmd2process {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	const char *cmd;
>> +	long protocol;
>> +	struct child_process process;
>> +};
> [...]
>> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
>> +{
>> +	struct cmd2process k;
>> +	hashmap_entry_init(&k, strhash(cmd));
>> +	k.cmd = cmd;
>> +	return hashmap_get(&cmd_process_map, &k, NULL);
> 
> Should we use global variable cmd_process_map, or pass it as parameter?
> The same question apply for other procedures and functions.
> 
> Note that I am not saying that it is a bad thing to use global
> variable here.

Passing it would be nicer as this would make at least a few functions "pure".
I will change that!


> [...]
>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry = NULL;
>> +	struct child_process *process = NULL;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>> +	const char *argv[] = { NULL, NULL };
>> +	const char *header = "git-filter-protocol\nversion";
>> +
>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	process = &entry->process;
>> +
>> +	child_process_init(process);
>> +	argv[0] = cmd;
>> +	process->argv = argv;
>> +	process->use_shell = 1;
>> +	process->in = -1;
>> +	process->out = -1;
>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external persistent filter '%s'", cmd);
>> +		return NULL;
>> +	}
>> +	strbuf_reset(&nbuf);
> 
> Is strbuf_reset needed here? We have not used nbuf variable yet.

Agreed, not needed!


>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>> +	sigchain_pop(SIGPIPE);
>> +
>> +	strbuf_stripspace(&nbuf, 0);
>> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
>> +	ret &= split.nr > 1;
>> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
>> +	if (ret) {
>> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
> 
> This does not handle at least some errors in version number parsing,
> for example junk after version number. Don't we have some helper
> functions for this?

I am not sure. I haven't found one.

> Nb. this code makes it so that the version number must be integer.

Nb? :-)


>> +		switch (entry->protocol) {
>> +			case 1:
>> +				break;
>> +			default:
>> +				ret = 0;
>> +				error("unsupported protocol version %s for external persistent filter '%s'",
>> +					nbuf.buf, cmd);
>> +		}
>> +	}
>> +	string_list_clear(&split, 0);
>> +	strbuf_release(&nbuf);
>> +
>> +	if (!ret) {
>> +		error("initialization for external persistent filter '%s' failed", cmd);
>> +		return NULL;
>> +	}
> 
> Do we handle persistent filter command being killed before it finishes?
> Or exiting with error? I don't know this Git API...

If the "apply_filter" function fails then Git would proceed and just not
filter the content. If you define the "required" flag for the filter then
Git would error in that case.


>> +
>> +	hashmap_add(&cmd_process_map, entry);
>> +	return entry;
>> +}
> [...]
> 
>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
>> index a05a8d2..d9077ea 100755
>> --- a/t/t0021-conversion.sh
>> +++ b/t/t0021-conversion.sh
>> @@ -268,4 +268,174 @@ test_expect_success 'disable filter with empty override' '
>> 	test_must_be_empty err
>> '
>> 
>> +test_expect_success 'required protocol filter should filter data' '
>> +	test_config_global filter.protocol.smudge \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
>> +	test_config_global filter.protocol.clean \"$TEST_DIRECTORY/t0021/rot13.pl\" &&
> 
> Perhaps align it?
> 
>  +	test_config_global filter.protocol.clean  \"$TEST_DIRECTORY/t0021/rot13.pl\" &&

OK.


>> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
> 
> That's bit more than rot13... but it might be O.K. for a filename here.

"rot13-$FEATURE_NAME.pl" ?


>> new file mode 100755
>> index 0000000..f2d7a03
>> --- /dev/null
>> +++ b/t/t0021/rot13.pl
>> @@ -0,0 +1,80 @@
>> +#!/usr/bin/env perl
> 
> Don't we use other way to specify perl path for Git, and for its
> test suite?

Other tests use "#!/usr/bin/perl" - I will change that.
See $gmane/300156.


>> +#
>> +# Example implementation for the Git filter protocol version 1
>> +# See Documentation/gitattributes.txt, section "Filter Protocol"
>> +#
>> +
>> +use strict;
>> +use warnings;
>> +use autodie;
> 
> autodie?

See $gmane/300124.


>> +
>> +sub rot13 {
>> +    my ($str) = @_;
>> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
>> +    return $str;
>> +}
>> +
>> +$| = 1; # autoflush STDOUT
> 
> Perhaps *STDOUT->autoflush(1), if I remember my Perl correctly?
> Should this matter? Why it is needed?

As recommended by Eric in $gmane/300124 I will run flush explicitly.


>> +
>> +open my $debug, ">>", "output.log";
>> +$debug->autoflush(1);
>> +
>> +print $debug "start\n";
>> +
>> +print STDOUT "git-filter-protocol\nversion 1";
>> +print $debug "wrote version\n";
>> +
>> +while (1) {
>> +    my $command = <STDIN>;
>> +    unless (defined($command)) {
>> +        exit();
>> +    }
>> +    chomp $command;
>> +    print $debug "IN: $command";
>> +    my $filename = <STDIN>;
>> +    chomp $filename;
>> +    print $debug " $filename";
>> +    my $filelen  = <STDIN>;
>> +    chomp $filelen;
>> +    print $debug " $filelen";
>> +
>> +    $filelen = int($filelen);
>> +    my $output;
>> +
>> +    if ( $filelen > 0 ) {
> 
> Inconsistent style. You use
> 
>       unless (defined($command)) {
> 
> without extra whitespace after and before parentheses, but
> 
>       if ( $filelen > 0 ) {
> 
> instead of simply
> 
>       if ($filelen > 0) {

Agreed. I will fix it.


>> +        my $input;
>> +        {
>> +            binmode(STDIN);
>> +            my $bytes_read = 0;
>> +            $bytes_read = read STDIN, $input, $filelen;
>> +            if ( $bytes_read != $filelen ) {
>> +                die "not enough to read";
> 
> I know it's only a test script (well, a part of one), but we would probably
> want to have more information in the case of a real filter.

True. Do you think there is anything to change in the script, though?


> 
>> +            }
>> +            print $debug " [OK] -- ";
>> +        }
>> +
>> +        if ( $command eq "clean") {
>> +            $output = rot13($input);
>> +        }
>> +        elsif ( $command eq "smudge" ) {
> 
> Style; I think we use
> 
>  +        } elsif ( $command eq "smudge" ) {

OK.


> 
>> +            $output = rot13($input);
>> +        }
>> +        else {
>> +            die "bad command\n";
> 
> Same here (both about style, and error message).
> 
>> +        }
>> +    }
> 
> What happens if $filelen is zero, or negative? Ah, I see that $output
> would be undef... which is bad, I think.

Is this something we need to consider in the test script?


>> +
>> +    my $output_len = length($output);
>> +    print STDOUT "$output_len\n";
>> +    print $debug "OUT: $output_len";
>> +    if ( $output_len > 0 ) {
>> +        if ( ($command eq "clean" and $filename eq "clean-write-fail.r") or
>> +             ($command eq "smudge" and $filename eq "smudge-write-fail.r") ) {
> 
> Hardcoded filenames, without it being described in the file header?

Good point! I will add a comment!

> 
>> +            print STDOUT "fail";
> 
> This is not defined in the protocol description!  Unless anything that
> does not conform to the specification would work here, but at least it
> is a recommended practice to be described in the documentation, don't
> you think?
> 
> What would happen in $output_len is 4?

Then it would work :D
I understand your point. However, this is not a reference implementation.
It is a test script that is supposed to trigger bad behavior which we can test. 
Therefore, I would argue that such a return value is OK. I will document it in 
the header, though. 


Thanks a lot for your extensive review,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-23  8:14   ` Eric Wong
@ 2016-07-24 19:11     ` Lars Schneider
  2016-07-25  7:27       ` Eric Wong
  2016-07-25 15:48       ` Duy Nguyen
  0 siblings, 2 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-24 19:11 UTC (permalink / raw)
  To: Eric Wong
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	mlbright


On 23 Jul 2016, at 10:14, Eric Wong <e@80x24.org> wrote:

> larsxschneider@gmail.com wrote:
>> Please note that the protocol filters do not support stream processing
>> with this implemenatation because the filter needs to know the length of
>> the result in advance. A protocol version 2 could address this in a
>> future patch.
> 
> Would it be prudent to reuse pkt-line for this?

Peff suggested that, too, in $gmane/299902. However, this would make the
protocol a bit more complicated and it wouldn't buy us anything for Git
large file processing filters (my main motivation for this patch) as these 
filters can't leverage streaming anyways.


>> +static void stop_protocol_filter(struct cmd2process *entry) {
>> +	if (!entry)
>> +		return;
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	close(entry->process.in);
>> +	close(entry->process.out);
>> +	sigchain_pop(SIGPIPE);
>> +	finish_command(&entry->process);
>> +	child_process_clear(&entry->process);
>> +	hashmap_remove(&cmd_process_map, entry, NULL);
>> +	free(entry);
>> +}
>> +
>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry = NULL;
>> +	struct child_process *process = NULL;
> 
> These are unconditionally set below, so initializing to NULL
> may hide future bugs.

OK. I thought it is generally a good thing to initialize a pointer with 
NULL. Can you explain to me how this might hide future bugs?
I will remove the initialization.


>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>> +	const char *argv[] = { NULL, NULL };
>> +	const char *header = "git-filter-protocol\nversion";
> 
> 	static const char header[] = "git-filter-protocol\nversion";
> 
> ...might be smaller by avoiding the extra pointer
> (but compilers ought to be able to optimize it)

Agreed!


>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	process = &entry->process;
> 
> <snip>
> 
>> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
> 
> starts_with() is probably more readable, here.

OK, will fix.


>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
>> +						int fd, struct strbuf *dst, const char *cmd,
>> +						const char *filter_type)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry = NULL;
>> +	struct child_process *process = NULL;
> 
> I would leave process initialized, here, since it should
> always be set below:

OK, will fix.


>> +	struct stat fileStat;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	size_t nbuf_len;
>> +	char *strtol_end;
>> +	char c;
>> +
>> +	if (!cmd || !*cmd)
>> +		return 0;
>> +
>> +	if (!dst)
>> +		return 1;
>> +
>> +	if (!cmd_process_map_init) {
>> +		cmd_process_map_init = 1;
>> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
>> +	} else {
>> +		entry = find_protocol_filter_entry(cmd);
>> +	}
>> +
>> +	if (!entry){
>> +		entry = start_protocol_filter(cmd);
>> +		if (!entry) {
>> +			stop_protocol_filter(entry);
> 
> stop_protocol_filter is a no-op, here, since entry is NULL

Oops - a result of my own refactoring :-) Thank you!


>> +			return 0;
>> +		}
>> +	}
>> +	process = &entry->process;
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	switch (entry->protocol) {
>> +		case 1:
>> +			if (fd >= 0 && !src) {
>> +				ret &= fstat(fd, &fileStat) != -1;
>> +				len = fileStat.st_size;
> 
> There's a truncation bug when sizeof(size_t) < sizeof(off_t)

OK. What would you suggest to do in that case? Should we just let the
filter fail? Is there anything else we could do?


> (and mixedCase is inconsistent with our style)

OK, will fix.


>> +    my $filelen  = <STDIN>;
>> +    chomp $filelen;
>> +    print $debug " $filelen";
>> +
>> +    $filelen = int($filelen);
> 
> Calling int() here is unnecessary and may hide bugs if you
> forget to check $debug.   Perhaps a regexp check is safer:
> 
> 	$filelen =~ /\A\d+\z/ or die "bad filelen: $filelen\n";

OK, will fix!


Thanks for your review,
Lars



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 18:36     ` Lars Schneider
@ 2016-07-24 20:14       ` Jakub Narębski
  2016-07-24 21:30         ` Jakub Narębski
  2016-07-25 20:09         ` Lars Schneider
  0 siblings, 2 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-24 20:14 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright

W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
> On 23 Jul 2016, at 02:11, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Nb. this line is only needed if you want author name and/or date
>> different from the email sender, or if you have sender line misconfigured
>> (e.g. lacking the human readable name).
> 
> I use "git format-patch" to generate these emails:
> 
> git format-patch --cover-letter --subject-prefix="PATCH ..." -M $BASE -o $OUTPUT
> 
> How would I disable this line? (I already checked the man page to no avail).

If you are using `git send-email` or equivalent, I think it is
stripped automatically if it is not needed (in you case it was,
because Sender was lacking human readable name... at least I think
it was because of what my email reader inserted as reply line).
If you are using an ordinary email client, you need to remove it
yourself, if needed.

> Plus, what does "Nb" stand for? :-)

Nb. (or N.b.) stands for "nota bene", which I meant do denote
as a note on a certain side aspect; I'll switch to "note", or
"BTW" / "by the way".
 
>>> Git's clean/smudge mechanism invokes an external filter process for every
>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>> then the startup time of the external filter processes can become a
>>> significant part of the overall Git execution time.
>>
>> Do I understand it correctly (from the commit description) that with
>> this new option you start one filter process for the whole life of
>> the Git command (e.g. `git add .`), which may perform more than one
>> cleanup, but do not create a daemon or daemon-like process which
>> would live for several commands invocations?
> 
> Correct!

It would be nice to make it more obvious. 

>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>> keeps the external filter process running and processes all blobs with
>>> the following protocol over stdin/stdout.
>>
>> I agree with Junio that the name "useProtocol" is bad, and not quite
>> right. Perhaps "persistent" would be better? Also, what is the value
>> of `filter.<driver>.useProtocol`: boolean? or a script name?

As you see I was not sure if `useProtocol` was boolean or a script name,
which means that it should be stated more explicitly.  Of course this
would end to not matter if the way new protocol is used were changed.

> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
> I thought about "persistent" but this name wouldn't convey the scope of the 
> persistency ("persistent for one Git operation" vs. "persistent for many Git 
> operations"). What do you think about the protocol as int version idea
> described in $gmane/300155 ?

You mean the `protocol` as a config variable name (fully name being
`filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
`protocolVersion` be a more explicit?

>> I also agree that we might wat to be able to keep clean and smudge
>> filters separate, but be able to run a single program if they are
>> both the same. I think there is a special case for filter unset,
>> and/or filter being "cat" -- we would want to keep that.
> 
> Since 1a8630d there is a more efficient way to unset a filter ;-)
> Can you think of other cases where the separation would be useful?

I can't think of any, but it doesn't mean that it does not exist.
It also does not mean that you need to consider situation that may
not happen. Covering one-way filters, like "indent" filter for `clean`,
should be enough... they do work with your proposal, don't they?

>> My proposal is to use `filter.<driver>.persistent` as an addition
>> to 'clean' and 'smudge' variables, with the following possible
>> values:
>>
>>  * none (the default)
>>  * clean
>>  * smudge
>>  * both
> 
> That could work. However, I am not convinced, yet, that separate
> filters are an actual use case.

YAGNI (You Ain't Gonna Need It), right.

>> I assume that either Git would have to start multiple filter
>> commands for multi-threaded operation, or the protocol would have
>> to be extended to make persistent filter fork itself.
> 
> I think it would be better to have Git start multiple filter commands
> to keep the protocol as simple and error free as possible.

Right. Also, I am not sure if exec+fork would be much faster than
fork+exec (where fork is n-way fork, and n is number of threads
that Git command invoking filter is using).

>> BTW. what would happen in your original proposal if the user had
>> *both* filter.<driver>.useProtocol and filter.<driver>.smudge
>> (and/or filter.<driver>.clean) set?
> 
> That wouldn't be an issue as "useProtocol" is just a boolean that
> tells Git how to talk to "filter.<driver>.smudge" and "filter.<driver>.clean".
> I need to make this more clear in the documentation.
> 
> 
>>> 1. Git starts the filter on first usage and expects a welcome message
>>> with protocol version number:
>>> 	Git <-- Filter: "git-filter-protocol\n"
>>> 	Git <-- Filter: "version 1"
>>
>> I was wondering how Git would know that filter executable was started,
>> but then I realized it was once-per-command invocation, not a daemon.
>>
>> I agree with Torsten that there should be a terminator after the
>> version number.
> 
> I agree, too :)

Note that if we agree about switch to `protocol` / `protocolVersion`
as a way to specify this protocol, it would probably need to be "protocol 2"
(assuming that "protocol 1" is the original implementation, with one fork
per affected file).

>> Also, for future extendability this should be probably followed by
>> possibly empty list of script capabilities, that is:
>>
>> 	Git <-- Filter: "git-filter-protocol\n"
>> 	Git <-- Filter: "version 1.1\n"

Note that "version 1.1" would not work with current implementation;
it accepts only integer version numbers. Which might be a good idea,
anyway.

>> 	Git <-- Filter: "capabilities clean smudge\n"
>>
>> Or we can add capabilities in later version...
> 
> That is an interesting idea. My initial thought was to make the capabilities
> of a certain version fix. If we want to add new capabilities then we would 
> bump the version. I wonder what others think about your suggestion!

Using capabilities (like git-upload-pack / git-receive-pack, that is
smart Git transfer protocols do) is probably slightly more difficult on
the Git side (assuming no capabilities negotiation), but also much more
flexible than pure version numbers.

One possible idea for a capability is support for passing input
and output of a filter via filesystem, like cleanToFile and smudgeFromFile
proposal in 'jh/clean-smudge-annex' (in 'pu').

For example:

 	Git <-- Filter: "capabilities clean smudge cleanToFile smudgeFromFile\n"

>> BTW. why not follow e.g. HTTP protocol example, and use
>>
>> 	Git <-- Filter: "git-filter-protocol/1\n"
> 
> I think my proposal is a bit more explicit as it states "version". If
> you feel strongly about it, I could be convinced otherwise.

No, I don't feel strongly about this. I think SSH also uses a separate
"version"-like line.

>>> 2. Git sends the command (either "smudge" or "clean"), the filename, the
>>> content size in bytes, and the content separated by a newline character:
>>> 	Git --> Filter: "smudge\n"
>>
>> Would it help (for some cases) to pass the name of filter that
>> is being invoked?
> 
> Interesting thought! Can you imagine a case where this would be useful?

Actually... no, I don't think so. I don't think there is a situation
where we might want to use the same filter commands for different filters
and have it behave differently depending on the filter name.

>>> 	Git --> Filter: "testfile.dat\n"
>>
>> Unfortunately, while sane filenames should not contain newlines[1],
>> the unfortunate fact is that *filenames can include newlines*, and
>> you need to be able to handle that[2].  Therefore you need either to
>> choose a different separator (the only one that can be safely used
>> is "\0", i.e. the NUL character - but it is not something easy to
>> handle by shell scripts), or C-quote filenames as needed, or always
>> C-quote filenames.  C-quoting at minimum should include quoting newline
>> character, and the escape character itself.
>>
>> BTW. is it the basename of a file, or a full pathname?
>>
>> [1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
>> [2]: http://www.dwheeler.com/essays/filenames-in-shell.html
> 
> Thanks for this explanation. A bash version of the protocol is not
> trivial (I tried it but ended up using Perl). Therefore I think "\0"
> would be a good choice?

That, or use git convention of surrounding C-quoted filenames in
double quotes (which means that if it begins with quote, it is C-quoted).

For example:

  $ git commit -m 'Initial commit'
  [master (root-commit) 266dab0] Initial commit
   2 files changed, 2 insertions(+)
   create mode 100644 foo
   create mode 100644 "foo \" \\ ,"
  $ ls -1
  foo " \ ,
  foo

I'm not sure which solution would be easier for filter writers,
NUL termination, or C-quoting.

>>> 	Git --> Filter: "7\n"
>>
>> That's the content size in bytes written as an ASCII number.
> 
> Correct.

But not obvious from the description / documentation. 

>>> 	Git --> Filter: "CONTENT"
>>
>> Can filter ignore the content size, and just read all what it was
>> sent, that is until eof or something?
> 
> What would that something be? Since CONTENT is binary it can contain
> any character (even "\0")...
 
Here by "or something" I meant some other way of detecting that there
is nothing more to read. But providing the size upfront (or size of
chunk / packet in the streaming interface, if/when it gets implemented)
is probably a better idea. Git knows it anyway, cheaply.

>>> 3. The filter is expected to answer with the result content size in
>>> bytes and the result content separated by a newline character:
>>> 	Git <-- Filter: "15\n"
>>> 	Git <-- Filter: "SMUDGED_CONTENT"
>>
>> I wonder how hard would be to write filters for this protocol...
> 
> Easy :-) Plus you can look at a Perl (see t/t0021) and a golang implementation
> already (https://github.com/github/git-lfs/pull/1382).

Right. Any programming language that has a way to specify "read N bytes"
would work. I think even bash would work, with 'read -N $len -r'... I think.

>>> ---
>>> Documentation/gitattributes.txt |  41 +++++++-
>>> convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>>> t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
>>
>> Wouldn't it be better to name the test case something more
>> descriptive, for example
>>
>>   t/t0021-filter-driver-useProtocol.sh
>>
>> The name of test should be adjusted to final name of the feature,
>> of course.
> 
> I think the prefix numbers should be unique, no? And t0022 is already taken.

I meant here that the "conversion" part of "t/t0021-conversion.sh" test
filename is not descriptive enough.
 
>>> t/t0021/rot13.pl                |  80 +++++++++++++++

This is all right, because it is in t0021 context.

>>> 4 files changed, 494 insertions(+), 7 deletions(-)
>>> create mode 100755 t/t0021/rot13.pl
>>>
>>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>>> index 8882a3e..7026d62 100644
>>> --- a/Documentation/gitattributes.txt
>>> +++ b/Documentation/gitattributes.txt
>>> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>>> fed the blob object from its standard input, and its standard
>>> output is used to update the worktree file.  Similarly, the
>>> `clean` command is used to convert the contents of worktree file
>>> -upon checkin.
>>> +upon checkin. By default these commands process only a single
>>> +blob and terminate. If the setting filter.<driver>.useProtocol is
>>> +enabled then Git can process all blobs with a single filter command
>>> +invocation (see filter protocol below).
>>
>> This does not tell the precedence between `smudge`, `clean` and
>> filter.<driver>.useProtocol, see above. Also, discrepancy in how
>> config variables are referenced.
> 
> As mentioned above "useProtocol" is a boolean. Therefore precedence shouldn't
> be a problem.

Which was not obvious (but might not matter in the end).

>              What do you mean by "discrepancy in how config variables are 
> referenced"?
 
What I meant here that filter.<driver>.smudge and filter.<driver>.clean
were referenced as "`smudge` command" and "`clean` command" in the paragraph
you modified.

Perhaps filter.<driver>.useProtocol is all right (I have not looked further),
but it should be formatted as `filter.<driver>.useProtocol` IMVHO.
 
[...]
>>> diff --git a/convert.c b/convert.c
>>> index 522e2c5..91ce86f 100644
>>> --- a/convert.c
>>> +++ b/convert.c
>>> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>>> 	return ret;
>>> }
>>>
>>> +static int cmd_process_map_init = 0;
>>> +static struct hashmap cmd_process_map;
>>> +
>>> +struct cmd2process {
>>> +	struct hashmap_entry ent; /* must be the first member! */
>>> +	const char *cmd;
>>> +	long protocol;
>>> +	struct child_process process;
>>> +};
>> [...]
>>> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
>>> +{
>>> +	struct cmd2process k;
>>> +	hashmap_entry_init(&k, strhash(cmd));
>>> +	k.cmd = cmd;
>>> +	return hashmap_get(&cmd_process_map, &k, NULL);
>>
>> Should we use global variable cmd_process_map, or pass it as parameter?
>> The same question apply for other procedures and functions.
>>
>> Note that I am not saying that it is a bad thing to use global
>> variable here.
> 
> Passing it would be nicer as this would make at least a few functions "pure".
> I will change that!

You can always provide convenience functions that use global variable.
That's what Git code does with the_index, if I remember it correctly.

[...]
>>> +
>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>>> +	sigchain_pop(SIGPIPE);
>>> +
>>> +	strbuf_stripspace(&nbuf, 0);
>>> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
>>> +	ret &= split.nr > 1;
>>> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
>>> +	if (ret) {
>>> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
>>
>> This does not handle at least some errors in version number parsing,
>> for example junk after version number. Don't we have some helper
>> functions for this?
> 
> I am not sure. I haven't found one.

Hmmm... I remember there were some patches about this, but I don't know
if they were accepted.  We have strtol_i() in git-compat-util.h. 

And you can always check where the parsing ended (by not passing NULL,
of course).

[...]
>>> +		switch (entry->protocol) {
>>> +			case 1:
>>> +				break;
>>> +			default:
>>> +				ret = 0;
>>> +				error("unsupported protocol version %s for external persistent filter '%s'",
>>> +					nbuf.buf, cmd);
>>> +		}
>>> +	}
>>> +	string_list_clear(&split, 0);
>>> +	strbuf_release(&nbuf);
>>> +
>>> +	if (!ret) {
>>> +		error("initialization for external persistent filter '%s' failed", cmd);
>>> +		return NULL;
>>> +	}
>>
>> Do we handle persistent filter command being killed before it finishes?
>> Or exiting with error? I don't know this Git API...
> 
> If the "apply_filter" function fails then Git would proceed and just not
> filter the content. If you define the "required" flag for the filter then
> Git would error in that case.

Ah, right. 

[...]
>>> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
>>
>> That's bit more than rot13... but it might be O.K. for a filename here.
> 
> "rot13-$FEATURE_NAME.pl" ?

As I said, rot13.pl is all right; if change, then perhaps to rot13-filter.pl 

[...]
>>> +        my $input;
>>> +        {
>>> +            binmode(STDIN);
>>> +            my $bytes_read = 0;
>>> +            $bytes_read = read STDIN, $input, $filelen;
>>> +            if ( $bytes_read != $filelen ) {
>>> +                die "not enough to read";
>>
>> I know it's only a test script (well, a part of one), but we would probably
>> want to have more information in the case of a real filter.
> 
> True. Do you think there is anything to change in the script, though?

No, I don't think so. It is enough that the test script would crash
if fed incorrect data from Git. Better error messages would be nice,
but are not necessary.

    +                die "not enough to read: expected $filelen, got $bytes_read";

I have noticed that some 'die' have "\n" at the end, and some do
not. If I remember correctly it is for supressing error message from
Perl, with filename and line number, isn't it? Anyway, we probably
want to be consistent.


>>> +        }
>>> +    }
>>
>> What happens if $filelen is zero, or negative? Ah, I see that $output
>> would be undef... which is bad, I think.
> 
> Is this something we need to consider in the test script?

I think the test script should die if it gets incorrect content length
(e.g. negative), so that it catches bug on the Git side. It should
work for zero-length files, even if we don't test it -- we can in the
future.

[...]
>>> +            print STDOUT "fail";
>>
>> This is not defined in the protocol description!  Unless anything that
>> does not conform to the specification would work here, but at least it
>> is a recommended practice to be described in the documentation, don't
>> you think?
>>
>> What would happen in $output_len is 4?
> 
> Then it would work :D
> I understand your point. However, this is not a reference implementation.
> It is a test script that is supposed to trigger bad behavior which we can test. 
> Therefore, I would argue that such a return value is OK. I will document it in 
> the header, though. 

Why print "fail", and not die?

The problem is do the protocol need to have some way of communicating
errors from the filter to Git?  Perhaps using stderr would be enough
(but then Git would need to drain it, I think... unless it is not
redirected), perhaps some command is needed?

For example, instead of:

	Git <-- Filter: "15\n"
	Git <-- Filter: "SMUDGED_CONTENT"

perhaps filter should return

	Git <-- Filter: "error\n"
	Git <-- Filter: "ONE_LINE_OF_ERROR_DESCRIPTION\n"

on error? Or if printing expected output length upfront is easier,
use a signal (but that is supposedly not that reliable as message
passing mechanism)?

It might be the case that some files return errors, but some do not.

BTW. do we test the case where filter fails, or returns wrong output?

> Thanks a lot for your extensive review,
> Lars--

You are welcome.
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 20:14       ` Jakub Narębski
@ 2016-07-24 21:30         ` Jakub Narębski
  2016-07-25 20:16           ` Lars Schneider
  2016-07-25 20:09         ` Lars Schneider
  1 sibling, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-24 21:30 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright,
	Junio C Hamano, Eric Wong

W dniu 2016-07-24 o 22:14, Jakub Narębski pisze:
> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:

>> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
>> I thought about "persistent" but this name wouldn't convey the scope of the 
>> persistency ("persistent for one Git operation" vs. "persistent for many Git 
>> operations"). What do you think about the protocol as int version idea
>> described in $gmane/300155 ?
>
> You mean the `protocol` as a config variable name (fully name being
> `filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
> `protocolVersion` be a more explicit?

Just throwing out further ideas:

Perhaps make `persistent` string-valued variable, with the only value
supported for now, namely "per-process" / "operation"?

Perhaps require for `pidfile` to be present for it to be daemon,
that is persist for possibly many Git operations. Or allow "daemon"
or "server" value for `persistent`, then?

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 17:16     ` Lars Schneider
@ 2016-07-24 22:36       ` Ramsay Jones
  2016-07-24 23:22         ` Jakub Narębski
  2016-07-25 20:24         ` Lars Schneider
  0 siblings, 2 replies; 77+ messages in thread
From: Ramsay Jones @ 2016-07-24 22:36 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	mlbright



On 24/07/16 18:16, Lars Schneider wrote:
> 
> On 23 Jul 2016, at 01:19, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
> 
>> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>
>>> Git's clean/smudge mechanism invokes an external filter process for every
>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>> then the startup time of the external filter processes can become a
>>> significant part of the overall Git execution time.
>>>
>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>> keeps the external filter process running and processes all blobs with
>>> the following protocol over stdin/stdout.
>>>
>>> 1. Git starts the filter on first usage and expects a welcome message
>>> with protocol version number:
>>> 	Git <-- Filter: "git-filter-protocol\n"
>>> 	Git <-- Filter: "version 1"
>>
>> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
>> interaction is fully defined, I guess it doesn't matter).
>>
>> [If you wanted to check for a version, you could add a "version" command
>> instead, just like "clean" and "smudge".]
> 
> It was a conscious decision to have the `filter` talk first. My reasoning was:
> 
> (1) I want a reliable way to distinguish the existing filter protocol ("single-shot 
> invocation") from the new one ("long running"). I don't think there would be a
> situation where the existing protocol would talk first. Therefore the users would
> not accidentally mix them with a possibly half working, undetermined, outcome.

If an 'single-shot' filter were incorrectly configured, instead of a new one, then
the interaction could last a little while - since it would result in deadlock! ;-)

[If Git talks first instead, configuring a 'single-shot' filter _may_ still result
in a deadlock - depending on pipe size, etc.]

> 
> (2) In the future we could extend the pipe protocol (see $gmane/297994, it's very
> interesting). A filter could check Git's version and then pick the most appropriate
> filter protocol on startup.
> 
> 
>> [...]
>>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>>> +{
>>> +	int ret = 1;
>>> +	struct cmd2process *entry = NULL;
>>> +	struct child_process *process = NULL;
>>> +	struct strbuf nbuf = STRBUF_INIT;
>>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>>> +	const char *argv[] = { NULL, NULL };
>>> +	const char *header = "git-filter-protocol\nversion";
>>> +
>>> +	entry = xmalloc(sizeof(*entry));
>>> +	hashmap_entry_init(entry, strhash(cmd));
>>> +	entry->cmd = cmd;
>>> +	process = &entry->process;
>>> +
>>> +	child_process_init(process);
>>> +	argv[0] = cmd;
>>> +	process->argv = argv;
>>> +	process->use_shell = 1;
>>> +	process->in = -1;
>>> +	process->out = -1;
>>> +
>>> +	if (start_command(process)) {
>>> +		error("cannot fork to run external persistent filter '%s'", cmd);
>>> +		return NULL;
>>> +	}
>>> +	strbuf_reset(&nbuf);
>>> +
>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>>
>> Hmm, how much will be read into nbuf by this single call?
>> Since strbuf_read_once() makes a single call to xread(), with
>> a len argument that will probably be 8192, you can not really
>> tell how much it will read, in general. (xread() does not
>> guarantee how many bytes it will read.)
>>
>> In particular, it could be less than strlen(header).
> 
> As mentioned to Torsten in $gmane/300156, I will add a newline
> and then read until I find the second newline. That should solve
> the problem, right?
> 
> (You wrote in $gmane/300119 that I should ignore your email but
> I think you have a valid point here ;-)

Heh, as I said, it was late and I was trying to do several things
at once. (I am updating 3 installations of Linux Mint 17.3 to Linux
Mint 18 - I decided to do a complete re-install, since I needed to
change partition sizes anyway. I have only just got email back up ...)

I stopped commenting on the patch early but, after sending the first
email, I decided to scan the rest of your patch before going to bed
and noticed something which would invalidate my comments ...

> 
> 
>>> [...]
>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>> +	switch (entry->protocol) {
>>> +		case 1:
>>> +			if (fd >= 0 && !src) {
>>> +				ret &= fstat(fd, &fileStat) != -1;
>>> +				len = fileStat.st_size;
>>> +			}
>>> +			strbuf_reset(&nbuf);
>>> +			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
>>> +			ret &= write_str_in_full(process->in, nbuf.buf) > 1;
>>
>> why not write_in_full(process->in, nbuf.buf, nbuf.len) ?
> OK, this would save a "strlen" call. Do you think such a function could be of general
> use? If yes, then I would add:
> 
> static inline ssize_t write_strbuf_in_full(int fd, struct strbuf *str)
> {
> 	return write_in_full(fd, str->buf, str->len);
> }

[I don't have strong feelings either way (but I suspect it's not worth it).]

> 
> 
>>> +			if (len > 0) {
>>> +				if (src)
>>> +					ret &= write_in_full(process->in, src, len) == len;
>>> +				else if (fd >= 0)
>>> +					ret &= copy_fd(fd, process->in) == 0;
>>> +				else
>>> +					ret &= 0;
>>> +			}
>>> +
>>> +			strbuf_reset(&nbuf);
>>> +			while (xread(process->out, &c, 1) == 1 && c != '\n')
>>> +				strbuf_addchars(&nbuf, c, 1);
>>> +			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
>>> +			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
>>> +			strbuf_reset(&nbuf);
>>> +			if (nbuf_len > 0)
>>> +				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;
>>
>> Again, how many bytes will be read?
>> Note, that in the default configuration, a _maximum_ of
>> MAX_IO_SIZE (8MB or SSIZE_MAX, whichever is smaller) bytes
>> will be read.

... In particular, your 2GB test case should not have worked, so
I assumed that I had missed a loop somewhere ...

> Would something like this be more appropriate?
> 
> strbuf_reset(&nbuf);
> if (nbuf_len > 0) {
>     strbuf_grow(&nbuf, nbuf_len);
>     ret &= read_in_full(process->out, nbuf.buf, nbuf_len) == nbuf_len;
> }

... and this looks better. [Note: this comment would apply equally to the
version message.]

[Hmm, now can I remember which packages I need to install ...]

ATB,
Ramsay Jones

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 22:36       ` Ramsay Jones
@ 2016-07-24 23:22         ` Jakub Narębski
  2016-07-25 20:32           ` Lars Schneider
  2016-07-25 20:24         ` Lars Schneider
  1 sibling, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-24 23:22 UTC (permalink / raw)
  To: Ramsay Jones, Lars Schneider
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright

W dniu 2016-07-25 o 00:36, Ramsay Jones pisze:
> On 24/07/16 18:16, Lars Schneider wrote:
>> On 23 Jul 2016, at 01:19, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
>>> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
[...]
>>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>>> keeps the external filter process running and processes all blobs with
>>>> the following protocol over stdin/stdout.
>>>>
>>>> 1. Git starts the filter on first usage and expects a welcome message
>>>> with protocol version number:
>>>> 	Git <-- Filter: "git-filter-protocol\n"
>>>> 	Git <-- Filter: "version 1"
>>>
>>> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
>>> interaction is fully defined, I guess it doesn't matter).
>>
>> It was a conscious decision to have the `filter` talk first. My reasoning was:
>>
>> (1) I want a reliable way to distinguish the existing filter protocol ("single-shot 
>> invocation") from the new one ("long running"). I don't think there would be a
>> situation where the existing protocol would talk first. Therefore the users would
>> not accidentally mix them with a possibly half working, undetermined, outcome.
> 
> If an 'single-shot' filter were incorrectly configured, instead of a new one, then
> the interaction could last a little while - since it would result in deadlock! ;-)
> 
> [If Git talks first instead, configuring a 'single-shot' filter _may_ still result
> in a deadlock - depending on pipe size, etc.]

Would it be possible to do an equivalent of sending empty file to the filter?
If it is misconfigured old-style script, it would exit after possibly empty
output; if not, we would start new-style interaction.
 
This should be, if we agree that detecting misconfigured filters is a good
thing, tested.

>>
>> (2) In the future we could extend the pipe protocol (see $gmane/297994, it's very
>> interesting). A filter could check Git's version and then pick the most appropriate
>> filter protocol on startup.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 19:11     ` Lars Schneider
@ 2016-07-25  7:27       ` Eric Wong
  2016-07-25 15:48       ` Duy Nguyen
  1 sibling, 0 replies; 77+ messages in thread
From: Eric Wong @ 2016-07-25  7:27 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	mlbright

Lars Schneider <larsxschneider@gmail.com> wrote:
> On 23 Jul 2016, at 10:14, Eric Wong <e@80x24.org> wrote:
> > larsxschneider@gmail.com wrote:
> >> +static struct cmd2process *start_protocol_filter(const char *cmd)
> >> +{
> >> +	int ret = 1;
> >> +	struct cmd2process *entry = NULL;
> >> +	struct child_process *process = NULL;
> > 
> > These are unconditionally set below, so initializing to NULL
> > may hide future bugs.
> 
> OK. I thought it is generally a good thing to initialize a pointer with 
> NULL. Can you explain to me how this might hide future bugs?
> I will remove the initialization.

Compilers complain about uninitialized variables.  Blindly
setting them to NULL can allow them to be dereferenced;
triggering segfaults; especially if it's passed to a different
compilation unit the compiler can't see.

> >> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> >> +						int fd, struct strbuf *dst, const char *cmd,
> >> +						const char *filter_type)
> >> +{

<snip>

> >> +			if (fd >= 0 && !src) {
> >> +				ret &= fstat(fd, &fileStat) != -1;
> >> +				len = fileStat.st_size;
> > 
> > There's a truncation bug when sizeof(size_t) < sizeof(off_t)
> 
> OK. What would you suggest to do in that case? Should we just let the
> filter fail? Is there anything else we could do?

Anything which refers to something on disk (or will eventually
stored there, such as blobs) should evolve towards off_t rather
than size_t.  We just discovered a bunch of 32-bit truncation
bugs the other week:

https://public-inbox.org/git/1466807902.28869.8.camel@gmail.com/

If the protocol/ABI is frozen, it should probably fail;
and a 64-bit-off_t version for 32-bit systems should be defined.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 19:11     ` Lars Schneider
  2016-07-25  7:27       ` Eric Wong
@ 2016-07-25 15:48       ` Duy Nguyen
  1 sibling, 0 replies; 77+ messages in thread
From: Duy Nguyen @ 2016-07-25 15:48 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Eric Wong, Git Mailing List, Jeff King, Jakub Narębski,
	Torsten Bögershausen, mlbright

On Sun, Jul 24, 2016 at 9:11 PM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>
> On 23 Jul 2016, at 10:14, Eric Wong <e@80x24.org> wrote:
>
>> larsxschneider@gmail.com wrote:
>>> Please note that the protocol filters do not support stream processing
>>> with this implemenatation because the filter needs to know the length of
>>> the result in advance. A protocol version 2 could address this in a
>>> future patch.
>>
>> Would it be prudent to reuse pkt-line for this?
>
> Peff suggested that, too, in $gmane/299902.

And I was about to suggest the same too, until I saw his patch then
stopped. Having a common way to split a byte stream to a packet stream
could be a good thing.

> However, this would make the protocol a bit more complicated

For high level scripting languages, pkt-line is dead simple. If your
scripts are in sh then it could get a bit ugly, but I'm thinking of a
small utility to make shell scripting pkt-line easier anyway (it goes
back to the idea of rewriting index-helper as a script, which I might
do).

> and it wouldn't buy us anything for Git
> large file processing filters (my main motivation for this patch) as these
> filters can't leverage streaming anyways.

This is a good point. How are you planning to do it? Unless streaming
is done entirely in kernel (sendfie() and friends, which is not all
positive), I think you can still stream and wrap/unwrap pkt-line just
fine.
-- 
Duy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 20:14       ` Jakub Narębski
  2016-07-24 21:30         ` Jakub Narębski
@ 2016-07-25 20:09         ` Lars Schneider
  2016-07-26 14:18           ` Jakub Narębski
  1 sibling, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-25 20:09 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright


On 24 Jul 2016, at 22:14, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
>> On 23 Jul 2016, at 02:11, Jakub Narębski <jnareb@gmail.com> wrote:
>>> W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> Nb. this line is only needed if you want author name and/or date
>>> different from the email sender, or if you have sender line misconfigured
>>> (e.g. lacking the human readable name).
>> 
>> I use "git format-patch" to generate these emails:
>> 
>> git format-patch --cover-letter --subject-prefix="PATCH ..." -M $BASE -o $OUTPUT
>> 
>> How would I disable this line? (I already checked the man page to no avail).
> 
> If you are using `git send-email` or equivalent, I think it is
> stripped automatically if it is not needed (in you case it was,
> because Sender was lacking human readable name... at least I think
> it was because of what my email reader inserted as reply line).
> If you are using an ordinary email client, you need to remove it
> yourself, if needed.

Weird. I am sending the patches with this command:

git send-email mystuff/* --to=git@vger.kernel.org --cc=...

Maybe I need to try the "--suppress-from" switch?!


>> Plus, what does "Nb" stand for? :-)
> 
> Nb. (or N.b.) stands for "nota bene", which I meant do denote
> as a note on a certain side aspect; I'll switch to "note", or
> "BTW" / "by the way".

OK, thanks for the explanation :-)


>>>> Git's clean/smudge mechanism invokes an external filter process for every
>>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>>> then the startup time of the external filter processes can become a
>>>> significant part of the overall Git execution time.
>>> 
>>> Do I understand it correctly (from the commit description) that with
>>> this new option you start one filter process for the whole life of
>>> the Git command (e.g. `git add .`), which may perform more than one
>>> cleanup, but do not create a daemon or daemon-like process which
>>> would live for several commands invocations?
>> 
>> Correct!
> 
> It would be nice to make it more obvious.

OK, I will try in v2.

> 
>>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>>> keeps the external filter process running and processes all blobs with
>>>> the following protocol over stdin/stdout.
>>> 
>>> I agree with Junio that the name "useProtocol" is bad, and not quite
>>> right. Perhaps "persistent" would be better? Also, what is the value
>>> of `filter.<driver>.useProtocol`: boolean? or a script name?
> 
> As you see I was not sure if `useProtocol` was boolean or a script name,
> which means that it should be stated more explicitly.  Of course this
> would end to not matter if the way new protocol is used were changed.
> 
>> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
>> I thought about "persistent" but this name wouldn't convey the scope of the 
>> persistency ("persistent for one Git operation" vs. "persistent for many Git 
>> operations"). What do you think about the protocol as int version idea
>> described in $gmane/300155 ?
> 
> You mean the `protocol` as a config variable name (fully name being
> `filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
> `protocolVersion` be a more explicit?

Yes, but based on your other feedback I plan to use this variable differently
anyways.


>>> I also agree that we might wat to be able to keep clean and smudge
>>> filters separate, but be able to run a single program if they are
>>> both the same. I think there is a special case for filter unset,
>>> and/or filter being "cat" -- we would want to keep that.
>> 
>> Since 1a8630d there is a more efficient way to unset a filter ;-)
>> Can you think of other cases where the separation would be useful?
> 
> I can't think of any, but it doesn't mean that it does not exist.
> It also does not mean that you need to consider situation that may
> not happen. Covering one-way filters, like "indent" filter for `clean`,
> should be enough... they do work with your proposal, don't they?

This should work right now but it would be a bit inefficient (the filter
would just pass the data unchanged through the smudge command). I plan to
add a "capabilities" flag to the protocol. Then you can define only
the "clean" capability and nothing or the current filter mechanism 
would happen for smudge (I will make a test case to demonstrate that
behavior in v2).


>>> My proposal is to use `filter.<driver>.persistent` as an addition
>>> to 'clean' and 'smudge' variables, with the following possible
>>> values:
>>> 
>>> * none (the default)
>>> * clean
>>> * smudge
>>> * both
>> 
>> That could work. However, I am not convinced, yet, that separate
>> filters are an actual use case.
> 
> YAGNI (You Ain't Gonna Need It), right.

That will work in v2.


>>> I assume that either Git would have to start multiple filter
>>> commands for multi-threaded operation, or the protocol would have
>>> to be extended to make persistent filter fork itself.
>> 
>> I think it would be better to have Git start multiple filter commands
>> to keep the protocol as simple and error free as possible.
> 
> Right. Also, I am not sure if exec+fork would be much faster than
> fork+exec (where fork is n-way fork, and n is number of threads
> that Git command invoking filter is using).
> 
>>> BTW. what would happen in your original proposal if the user had
>>> *both* filter.<driver>.useProtocol and filter.<driver>.smudge
>>> (and/or filter.<driver>.clean) set?
>> 
>> That wouldn't be an issue as "useProtocol" is just a boolean that
>> tells Git how to talk to "filter.<driver>.smudge" and "filter.<driver>.clean".
>> I need to make this more clear in the documentation.
>> 
>> 
>>>> 1. Git starts the filter on first usage and expects a welcome message
>>>> with protocol version number:
>>>> 	Git <-- Filter: "git-filter-protocol\n"
>>>> 	Git <-- Filter: "version 1"
>>> 
>>> I was wondering how Git would know that filter executable was started,
>>> but then I realized it was once-per-command invocation, not a daemon.
>>> 
>>> I agree with Torsten that there should be a terminator after the
>>> version number.
>> 
>> I agree, too :)
> 
> Note that if we agree about switch to `protocol` / `protocolVersion`
> as a way to specify this protocol, it would probably need to be "protocol 2"
> (assuming that "protocol 1" is the original implementation, with one fork
> per affected file).

Agreed.


>>> Also, for future extendability this should be probably followed by
>>> possibly empty list of script capabilities, that is:
>>> 
>>> 	Git <-- Filter: "git-filter-protocol\n"
>>> 	Git <-- Filter: "version 1.1\n"
> 
> Note that "version 1.1" would not work with current implementation;
> it accepts only integer version numbers. Which might be a good idea,
> anyway.

Agreed.


>>> 	Git <-- Filter: "capabilities clean smudge\n"
>>> 
>>> Or we can add capabilities in later version...
>> 
>> That is an interesting idea. My initial thought was to make the capabilities
>> of a certain version fix. If we want to add new capabilities then we would 
>> bump the version. I wonder what others think about your suggestion!
> 
> Using capabilities (like git-upload-pack / git-receive-pack, that is
> smart Git transfer protocols do) is probably slightly more difficult on
> the Git side (assuming no capabilities negotiation), but also much more
> flexible than pure version numbers.
> 
> One possible idea for a capability is support for passing input
> and output of a filter via filesystem, like cleanToFile and smudgeFromFile
> proposal in 'jh/clean-smudge-annex' (in 'pu').
> 
> For example:
> 
> 	Git <-- Filter: "capabilities clean smudge cleanToFile smudgeFromFile\n"

Yes, I like that very much. As stated above, I will add that in v2.


>>> BTW. why not follow e.g. HTTP protocol example, and use
>>> 
>>> 	Git <-- Filter: "git-filter-protocol/1\n"
>> 
>> I think my proposal is a bit more explicit as it states "version". If
>> you feel strongly about it, I could be convinced otherwise.
> 
> No, I don't feel strongly about this. I think SSH also uses a separate
> "version"-like line.
> 
>>>> 2. Git sends the command (either "smudge" or "clean"), the filename, the
>>>> content size in bytes, and the content separated by a newline character:
>>>> 	Git --> Filter: "smudge\n"
>>> 
>>> Would it help (for some cases) to pass the name of filter that
>>> is being invoked?
>> 
>> Interesting thought! Can you imagine a case where this would be useful?
> 
> Actually... no, I don't think so. I don't think there is a situation
> where we might want to use the same filter commands for different filters
> and have it behave differently depending on the filter name.

OK


>>>> 	Git --> Filter: "testfile.dat\n"
>>> 
>>> Unfortunately, while sane filenames should not contain newlines[1],
>>> the unfortunate fact is that *filenames can include newlines*, and
>>> you need to be able to handle that[2].  Therefore you need either to
>>> choose a different separator (the only one that can be safely used
>>> is "\0", i.e. the NUL character - but it is not something easy to
>>> handle by shell scripts), or C-quote filenames as needed, or always
>>> C-quote filenames.  C-quoting at minimum should include quoting newline
>>> character, and the escape character itself.
>>> 
>>> BTW. is it the basename of a file, or a full pathname?
>>> 
>>> [1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
>>> [2]: http://www.dwheeler.com/essays/filenames-in-shell.html
>> 
>> Thanks for this explanation. A bash version of the protocol is not
>> trivial (I tried it but ended up using Perl). Therefore I think "\0"
>> would be a good choice?
> 
> That, or use git convention of surrounding C-quoted filenames in
> double quotes (which means that if it begins with quote, it is C-quoted).
> 
> For example:
> 
>  $ git commit -m 'Initial commit'
>  [master (root-commit) 266dab0] Initial commit
>   2 files changed, 2 insertions(+)
>   create mode 100644 foo
>   create mode 100644 "foo \" \\ ,"
>  $ ls -1
>  foo " \ ,
>  foo
> 
> I'm not sure which solution would be easier for filter writers,
> NUL termination, or C-quoting.

Unless someone has a convincing argument for one solution or the other
I will go with the \0 termination as it seems easier.


>>>> 	Git --> Filter: "7\n"
>>> 
>>> That's the content size in bytes written as an ASCII number.
>> 
>> Correct.
> 
> But not obvious from the description / documentation.

I will improve that in v2. Should I add the info that it is base10 or
would you consider that a given?


>>>> 	Git --> Filter: "CONTENT"
>>> 
>>> Can filter ignore the content size, and just read all what it was
>>> sent, that is until eof or something?
>> 
>> What would that something be? Since CONTENT is binary it can contain
>> any character (even "\0")...
> 
> Here by "or something" I meant some other way of detecting that there
> is nothing more to read. But providing the size upfront (or size of
> chunk / packet in the streaming interface, if/when it gets implemented)
> is probably a better idea. Git knows it anyway, cheaply.
> 
>>>> 3. The filter is expected to answer with the result content size in
>>>> bytes and the result content separated by a newline character:
>>>> 	Git <-- Filter: "15\n"
>>>> 	Git <-- Filter: "SMUDGED_CONTENT"
>>> 
>>> I wonder how hard would be to write filters for this protocol...
>> 
>> Easy :-) Plus you can look at a Perl (see t/t0021) and a golang implementation
>> already (https://github.com/github/git-lfs/pull/1382).
> 
> Right. Any programming language that has a way to specify "read N bytes"
> would work. I think even bash would work, with 'read -N $len -r'... I think.
> 
>>>> ---
>>>> Documentation/gitattributes.txt |  41 +++++++-
>>>> convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>>>> t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
>>> 
>>> Wouldn't it be better to name the test case something more
>>> descriptive, for example
>>> 
>>>  t/t0021-filter-driver-useProtocol.sh
>>> 
>>> The name of test should be adjusted to final name of the feature,
>>> of course.
>> 
>> I think the prefix numbers should be unique, no? And t0022 is already taken.
> 
> I meant here that the "conversion" part of "t/t0021-conversion.sh" test
> filename is not descriptive enough.

Ah, I see. You suggest to rename the test case? Would that be OK with the Git
community?


>>>> t/t0021/rot13.pl                |  80 +++++++++++++++
> 
> This is all right, because it is in t0021 context.
> 
>>>> 4 files changed, 494 insertions(+), 7 deletions(-)
>>>> create mode 100755 t/t0021/rot13.pl
>>>> 
>>>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>>>> index 8882a3e..7026d62 100644
>>>> --- a/Documentation/gitattributes.txt
>>>> +++ b/Documentation/gitattributes.txt
>>>> @@ -300,7 +300,10 @@ checkout, when the `smudge` command is specified, the command is
>>>> fed the blob object from its standard input, and its standard
>>>> output is used to update the worktree file.  Similarly, the
>>>> `clean` command is used to convert the contents of worktree file
>>>> -upon checkin.
>>>> +upon checkin. By default these commands process only a single
>>>> +blob and terminate. If the setting filter.<driver>.useProtocol is
>>>> +enabled then Git can process all blobs with a single filter command
>>>> +invocation (see filter protocol below).
>>> 
>>> This does not tell the precedence between `smudge`, `clean` and
>>> filter.<driver>.useProtocol, see above. Also, discrepancy in how
>>> config variables are referenced.
>> 
>> As mentioned above "useProtocol" is a boolean. Therefore precedence shouldn't
>> be a problem.
> 
> Which was not obvious (but might not matter in the end).
> 
>>             What do you mean by "discrepancy in how config variables are 
>> referenced"?
> 
> What I meant here that filter.<driver>.smudge and filter.<driver>.clean
> were referenced as "`smudge` command" and "`clean` command" in the paragraph
> you modified.
> 
> Perhaps filter.<driver>.useProtocol is all right (I have not looked further),
> but it should be formatted as `filter.<driver>.useProtocol` IMVHO.

Initially I thought so, too. But "filter.<driver>.required", which is already
mentioned in gitattributes.txt, does not use this style. Should I change that, too,
or use the existing style?


> [...]
>>>> diff --git a/convert.c b/convert.c
>>>> index 522e2c5..91ce86f 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -481,12 +481,188 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>>>> 	return ret;
>>>> }
>>>> 
>>>> +static int cmd_process_map_init = 0;
>>>> +static struct hashmap cmd_process_map;
>>>> +
>>>> +struct cmd2process {
>>>> +	struct hashmap_entry ent; /* must be the first member! */
>>>> +	const char *cmd;
>>>> +	long protocol;
>>>> +	struct child_process process;
>>>> +};
>>> [...]
>>>> +static struct cmd2process *find_protocol_filter_entry(const char *cmd)
>>>> +{
>>>> +	struct cmd2process k;
>>>> +	hashmap_entry_init(&k, strhash(cmd));
>>>> +	k.cmd = cmd;
>>>> +	return hashmap_get(&cmd_process_map, &k, NULL);
>>> 
>>> Should we use global variable cmd_process_map, or pass it as parameter?
>>> The same question apply for other procedures and functions.
>>> 
>>> Note that I am not saying that it is a bad thing to use global
>>> variable here.
>> 
>> Passing it would be nicer as this would make at least a few functions "pure".
>> I will change that!
> 
> You can always provide convenience functions that use global variable.
> That's what Git code does with the_index, if I remember it correctly.
> 
> [...]
>>>> +
>>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>>>> +	sigchain_pop(SIGPIPE);
>>>> +
>>>> +	strbuf_stripspace(&nbuf, 0);
>>>> +	string_list_split_in_place(&split, nbuf.buf, ' ', 2);
>>>> +	ret &= split.nr > 1;
>>>> +	ret &= strncmp(header, split.items[0].string, strlen(header)) == 0;
>>>> +	if (ret) {
>>>> +		entry->protocol = strtol(split.items[1].string, NULL, 10);
>>> 
>>> This does not handle at least some errors in version number parsing,
>>> for example junk after version number. Don't we have some helper
>>> functions for this?
>> 
>> I am not sure. I haven't found one.
> 
> Hmmm... I remember there were some patches about this, but I don't know
> if they were accepted.  We have strtol_i() in git-compat-util.h. 
> 
> And you can always check where the parsing ended (by not passing NULL,
> of course).

OK, try to make this nicer in v2.


> [...]
>>>> +		switch (entry->protocol) {
>>>> +			case 1:
>>>> +				break;
>>>> +			default:
>>>> +				ret = 0;
>>>> +				error("unsupported protocol version %s for external persistent filter '%s'",
>>>> +					nbuf.buf, cmd);
>>>> +		}
>>>> +	}
>>>> +	string_list_clear(&split, 0);
>>>> +	strbuf_release(&nbuf);
>>>> +
>>>> +	if (!ret) {
>>>> +		error("initialization for external persistent filter '%s' failed", cmd);
>>>> +		return NULL;
>>>> +	}
>>> 
>>> Do we handle persistent filter command being killed before it finishes?
>>> Or exiting with error? I don't know this Git API...
>> 
>> If the "apply_filter" function fails then Git would proceed and just not
>> filter the content. If you define the "required" flag for the filter then
>> Git would error in that case.
> 
> Ah, right. 
> 
> [...]
>>>> diff --git a/t/t0021/rot13.pl b/t/t0021/rot13.pl
>>> 
>>> That's bit more than rot13... but it might be O.K. for a filename here.
>> 
>> "rot13-$FEATURE_NAME.pl" ?
> 
> As I said, rot13.pl is all right; if change, then perhaps to rot13-filter.pl 

OK.


> [...]
>>>> +        my $input;
>>>> +        {
>>>> +            binmode(STDIN);
>>>> +            my $bytes_read = 0;
>>>> +            $bytes_read = read STDIN, $input, $filelen;
>>>> +            if ( $bytes_read != $filelen ) {
>>>> +                die "not enough to read";
>>> 
>>> I know it's only a test script (well, a part of one), but we would probably
>>> want to have more information in the case of a real filter.
>> 
>> True. Do you think there is anything to change in the script, though?
> 
> No, I don't think so. It is enough that the test script would crash
> if fed incorrect data from Git. Better error messages would be nice,
> but are not necessary.
> 
>    +                die "not enough to read: expected $filelen, got $bytes_read";
> 
> I have noticed that some 'die' have "\n" at the end, and some do
> not. If I remember correctly it is for supressing error message from
> Perl, with filename and line number, isn't it? Anyway, we probably
> want to be consistent.

OK.


>>>> +        }
>>>> +    }
>>> 
>>> What happens if $filelen is zero, or negative? Ah, I see that $output
>>> would be undef... which is bad, I think.
>> 
>> Is this something we need to consider in the test script?
> 
> I think the test script should die if it gets incorrect content length
> (e.g. negative), so that it catches bug on the Git side. It should
> work for zero-length files, even if we don't test it -- we can in the
> future.

OK.


> [...]
>>>> +            print STDOUT "fail";
>>> 
>>> This is not defined in the protocol description!  Unless anything that
>>> does not conform to the specification would work here, but at least it
>>> is a recommended practice to be described in the documentation, don't
>>> you think?
>>> 
>>> What would happen in $output_len is 4?
>> 
>> Then it would work :D
>> I understand your point. However, this is not a reference implementation.
>> It is a test script that is supposed to trigger bad behavior which we can test. 
>> Therefore, I would argue that such a return value is OK. I will document it in 
>> the header, though. 
> 
> Why print "fail", and not die?

Agree, "die" is better.


> The problem is do the protocol need to have some way of communicating
> errors from the filter to Git?  Perhaps using stderr would be enough
> (but then Git would need to drain it, I think... unless it is not
> redirected), perhaps some command is needed?
> 
> For example, instead of:
> 
> 	Git <-- Filter: "15\n"
> 	Git <-- Filter: "SMUDGED_CONTENT"
> 
> perhaps filter should return
> 
> 	Git <-- Filter: "error\n"
> 	Git <-- Filter: "ONE_LINE_OF_ERROR_DESCRIPTION\n"
> 
> on error? Or if printing expected output length upfront is easier,
> use a signal (but that is supposedly not that reliable as message
> passing mechanism)?
> 
> It might be the case that some files return errors, but some do not.

I would prefer it if the filter just dies in case of trouble and that
way communicates to Git that something went wrong. Everything else
just complicates the protocol.


> BTW. do we test the case where filter fails, or returns wrong output?

Yes, I added a failure test in v2.

Thanks a lot,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 21:30         ` Jakub Narębski
@ 2016-07-25 20:16           ` Lars Schneider
  2016-07-26 12:24             ` Jakub Narębski
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-25 20:16 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright,
	Junio C Hamano, Eric Wong


On 24 Jul 2016, at 23:30, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 2016-07-24 o 22:14, Jakub Narębski pisze:
>> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
> 
>>> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
>>> I thought about "persistent" but this name wouldn't convey the scope of the 
>>> persistency ("persistent for one Git operation" vs. "persistent for many Git 
>>> operations"). What do you think about the protocol as int version idea
>>> described in $gmane/300155 ?
>> 
>> You mean the `protocol` as a config variable name (fully name being
>> `filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
>> `protocolVersion` be a more explicit?
> 
> Just throwing out further ideas:
> 
> Perhaps make `persistent` string-valued variable, with the only value
> supported for now, namely "per-process" / "operation"?
> 
> Perhaps require for `pidfile` to be present for it to be daemon,
> that is persist for possibly many Git operations. Or allow "daemon"
> or "server" value for `persistent`, then?

I like the direction of this idea. What if we use a string-valued 
"filter.<driver>.protocol" with the following options:

"simple" / "invocation-per-file" / << empty >> --> current clean/smudge behavior
"invocation-per-process" --> new, proposed behavior

If necessary this could be enhanced in the future to support even a "daemon"
mode (with a pidfile config).

Thanks,
Lars


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 22:36       ` Ramsay Jones
  2016-07-24 23:22         ` Jakub Narębski
@ 2016-07-25 20:24         ` Lars Schneider
  1 sibling, 0 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-25 20:24 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Git Mailing List, Jeff King, jnareb, Torsten Bögershausen,
	mlbright


On 25 Jul 2016, at 00:36, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:

> On 24/07/16 18:16, Lars Schneider wrote:
>> 
>> On 23 Jul 2016, at 01:19, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
>> 
>>> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>>>> Git's clean/smudge mechanism invokes an external filter process for every
>>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>>> then the startup time of the external filter processes can become a
>>>> significant part of the overall Git execution time.
>>>> 
>>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>>> keeps the external filter process running and processes all blobs with
>>>> the following protocol over stdin/stdout.
>>>> 
>>>> 1. Git starts the filter on first usage and expects a welcome message
>>>> with protocol version number:
>>>> 	Git <-- Filter: "git-filter-protocol\n"
>>>> 	Git <-- Filter: "version 1"
>>> 
>>> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
>>> interaction is fully defined, I guess it doesn't matter).
>>> 
>>> [If you wanted to check for a version, you could add a "version" command
>>> instead, just like "clean" and "smudge".]
>> 
>> It was a conscious decision to have the `filter` talk first. My reasoning was:
>> 
>> (1) I want a reliable way to distinguish the existing filter protocol ("single-shot 
>> invocation") from the new one ("long running"). I don't think there would be a
>> situation where the existing protocol would talk first. Therefore the users would
>> not accidentally mix them with a possibly half working, undetermined, outcome.
> 
> If an 'single-shot' filter were incorrectly configured, instead of a new one, then
> the interaction could last a little while - since it would result in deadlock! ;-)
> 
> [If Git talks first instead, configuring a 'single-shot' filter _may_ still result
> in a deadlock - depending on pipe size, etc.]

Do you think this is an issue that needs to be addressed in the first version?
If yes, I would probably look into "select" to specify a timeout for the filter.
However, wouldn't the current "single-shot" clean/smudge filter block in the 
same way if they don't write anything?


>> (2) In the future we could extend the pipe protocol (see $gmane/297994, it's very
>> interesting). A filter could check Git's version and then pick the most appropriate
>> filter protocol on startup.
>> 
>> 
>>> [...]
>>>> +static struct cmd2process *start_protocol_filter(const char *cmd)
>>>> +{
>>>> +	int ret = 1;
>>>> +	struct cmd2process *entry = NULL;
>>>> +	struct child_process *process = NULL;
>>>> +	struct strbuf nbuf = STRBUF_INIT;
>>>> +	struct string_list split = STRING_LIST_INIT_NODUP;
>>>> +	const char *argv[] = { NULL, NULL };
>>>> +	const char *header = "git-filter-protocol\nversion";
>>>> +
>>>> +	entry = xmalloc(sizeof(*entry));
>>>> +	hashmap_entry_init(entry, strhash(cmd));
>>>> +	entry->cmd = cmd;
>>>> +	process = &entry->process;
>>>> +
>>>> +	child_process_init(process);
>>>> +	argv[0] = cmd;
>>>> +	process->argv = argv;
>>>> +	process->use_shell = 1;
>>>> +	process->in = -1;
>>>> +	process->out = -1;
>>>> +
>>>> +	if (start_command(process)) {
>>>> +		error("cannot fork to run external persistent filter '%s'", cmd);
>>>> +		return NULL;
>>>> +	}
>>>> +	strbuf_reset(&nbuf);
>>>> +
>>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>>> +	ret &= strbuf_read_once(&nbuf, process->out, 0) > 0;
>>> 
>>> Hmm, how much will be read into nbuf by this single call?
>>> Since strbuf_read_once() makes a single call to xread(), with
>>> a len argument that will probably be 8192, you can not really
>>> tell how much it will read, in general. (xread() does not
>>> guarantee how many bytes it will read.)
>>> 
>>> In particular, it could be less than strlen(header).
>> 
>> As mentioned to Torsten in $gmane/300156, I will add a newline
>> and then read until I find the second newline. That should solve
>> the problem, right?
>> 
>> (You wrote in $gmane/300119 that I should ignore your email but
>> I think you have a valid point here ;-)
> 
> Heh, as I said, it was late and I was trying to do several things
> at once. (I am updating 3 installations of Linux Mint 17.3 to Linux
> Mint 18 - I decided to do a complete re-install, since I needed to
> change partition sizes anyway. I have only just got email back up ...)
> 
> I stopped commenting on the patch early but, after sending the first
> email, I decided to scan the rest of your patch before going to bed
> and noticed something which would invalidate my comments ...
> 
>> 
>> 
>>>> [...]
>>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>>> +	switch (entry->protocol) {
>>>> +		case 1:
>>>> +			if (fd >= 0 && !src) {
>>>> +				ret &= fstat(fd, &fileStat) != -1;
>>>> +				len = fileStat.st_size;
>>>> +			}
>>>> +			strbuf_reset(&nbuf);
>>>> +			strbuf_addf(&nbuf, "%s\n%s\n%zu\n", filter_type, path, len);
>>>> +			ret &= write_str_in_full(process->in, nbuf.buf) > 1;
>>> 
>>> why not write_in_full(process->in, nbuf.buf, nbuf.len) ?
>> OK, this would save a "strlen" call. Do you think such a function could be of general
>> use? If yes, then I would add:
>> 
>> static inline ssize_t write_strbuf_in_full(int fd, struct strbuf *str)
>> {
>> 	return write_in_full(fd, str->buf, str->len);
>> }
> 
> [I don't have strong feelings either way (but I suspect it's not worth it).]

OK


>>>> +			if (len > 0) {
>>>> +				if (src)
>>>> +					ret &= write_in_full(process->in, src, len) == len;
>>>> +				else if (fd >= 0)
>>>> +					ret &= copy_fd(fd, process->in) == 0;
>>>> +				else
>>>> +					ret &= 0;
>>>> +			}
>>>> +
>>>> +			strbuf_reset(&nbuf);
>>>> +			while (xread(process->out, &c, 1) == 1 && c != '\n')
>>>> +				strbuf_addchars(&nbuf, c, 1);
>>>> +			nbuf_len = (size_t)strtol(nbuf.buf, &strtol_end, 10);
>>>> +			ret &= (strtol_end != nbuf.buf && errno != ERANGE);
>>>> +			strbuf_reset(&nbuf);
>>>> +			if (nbuf_len > 0)
>>>> +				ret &= strbuf_read_once(&nbuf, process->out, nbuf_len) == nbuf_len;
>>> 
>>> Again, how many bytes will be read?
>>> Note, that in the default configuration, a _maximum_ of
>>> MAX_IO_SIZE (8MB or SSIZE_MAX, whichever is smaller) bytes
>>> will be read.
> 
> ... In particular, your 2GB test case should not have worked, so
> I assumed that I had missed a loop somewhere ...

Thanks a lot for this comment. The 2GB test case was bogus... v2
will have a much improved version :-)


>> Would something like this be more appropriate?
>> 
>> strbuf_reset(&nbuf);
>> if (nbuf_len > 0) {
>>    strbuf_grow(&nbuf, nbuf_len);
>>    ret &= read_in_full(process->out, nbuf.buf, nbuf_len) == nbuf_len;
>> }
> 
> ... and this looks better. [Note: this comment would apply equally to the
> version message.]

And it works better with large files, too :D


> [Hmm, now can I remember which packages I need to install ...]

:-)


Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-24 23:22         ` Jakub Narębski
@ 2016-07-25 20:32           ` Lars Schneider
  2016-07-26 10:58             ` Jakub Narębski
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-25 20:32 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Ramsay Jones, Git Mailing List, Jeff King,
	Torsten Bögershausen, mlbright


On 25 Jul 2016, at 01:22, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 2016-07-25 o 00:36, Ramsay Jones pisze:
>> On 24/07/16 18:16, Lars Schneider wrote:
>>> On 23 Jul 2016, at 01:19, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
>>>> On 22/07/16 16:49, larsxschneider@gmail.com wrote:
> [...]
>>>>> This patch adds the filter.<driver>.useProtocol option which, if enabled,
>>>>> keeps the external filter process running and processes all blobs with
>>>>> the following protocol over stdin/stdout.
>>>>> 
>>>>> 1. Git starts the filter on first usage and expects a welcome message
>>>>> with protocol version number:
>>>>> 	Git <-- Filter: "git-filter-protocol\n"
>>>>> 	Git <-- Filter: "version 1"
>>>> 
>>>> Hmm, I was a bit surprised to see a 'filter' talk first (but so long as the
>>>> interaction is fully defined, I guess it doesn't matter).
>>> 
>>> It was a conscious decision to have the `filter` talk first. My reasoning was:
>>> 
>>> (1) I want a reliable way to distinguish the existing filter protocol ("single-shot 
>>> invocation") from the new one ("long running"). I don't think there would be a
>>> situation where the existing protocol would talk first. Therefore the users would
>>> not accidentally mix them with a possibly half working, undetermined, outcome.
>> 
>> If an 'single-shot' filter were incorrectly configured, instead of a new one, then
>> the interaction could last a little while - since it would result in deadlock! ;-)
>> 
>> [If Git talks first instead, configuring a 'single-shot' filter _may_ still result
>> in a deadlock - depending on pipe size, etc.]
> 
> Would it be possible to do an equivalent of sending empty file to the filter?
> If it is misconfigured old-style script, it would exit after possibly empty
> output; if not, we would start new-style interaction.

I think we would need to close the pipe to communicate "end" to the filter, no?
I would prefer to define the protocol explicitly as this is clearly easier.


> 
> This should be, if we agree that detecting misconfigured filters is a good
> thing, tested.
> 
>>> 
>>> (2) In the future we could extend the pipe protocol (see $gmane/297994, it's very
>>> interesting). A filter could check Git's version and then pick the most appropriate
>>> filter protocol on startup.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-25 20:32           ` Lars Schneider
@ 2016-07-26 10:58             ` Jakub Narębski
  0 siblings, 0 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-26 10:58 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Ramsay Jones, Git Mailing List, Jeff King,
	Torsten Bögershausen, mlbright

W dniu 2016-07-25 o 22:32, Lars Schneider pisze: 
> On 25 Jul 2016, at 01:22, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 2016-07-25 o 00:36, Ramsay Jones pisze:
>>> On 24/07/16 18:16, Lars Schneider wrote:

>>>> It was a conscious decision to have the `filter` talk first.
>>>> My reasoning was:
>>>> 
>>>> (1) I want a reliable way to distinguish the existing filter 
>>>> protocol ("single-shot invocation") from the new one ("long 
>>>> running"). I don't think there would be a situation where the 
>>>> existing protocol would talk first. Therefore the users would 
>>>> not accidentally mix them with a possibly half working, 
>>>> undetermined, outcome.
>>> 
>>> If an 'single-shot' filter were incorrectly configured, instead 
>>> of a new one, then the interaction could last a little while - 
>>> since it would result in deadlock! ;-)
>>> 
>>> [If Git talks first instead, configuring a 'single-shot' filter 
>>> _may_ still result in a deadlock - depending on pipe size, etc.]
>> 
>> Would it be possible to do an equivalent of sending empty file to 
>> the filter? If it is misconfigured old-style script, it would exit 
>> after possibly empty output; if not, we would start new-style 
>> interaction.
> 
> I think we would need to close the pipe to communicate "end" to the 
> filter, no? I would prefer to define the protocol explicitly as this 
> is clearly easier.

Well, we could always close stdin of a script, check if it quits,
then reopen. Or close stdin, and send commands via file descriptor 4.
Or send SIGPIPE. But I don't know if it is a good idea.

> On 25 Jul 2016, at 00:36, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:

>> If an 'single-shot' filter were incorrectly configured, instead of
>> a new one, then the interaction could last a little while - since
>> it would result in deadlock! ;-)
>> 
>> [If Git talks first instead, configuring a 'single-shot' filter
>> _may_ still result in a deadlock - depending on pipe size, etc.]
> 
> Do you think this is an issue that needs to be addressed in the first
> version? If yes, I would probably look into "select" to specify a
> timeout for the filter.

This might be a better idea.  Additionally, it would make it possible
to detect buggy v2 filter scripts.

>                         However, wouldn't the current "single-shot"
> clean/smudge filter block in the same way if they don't write
> anything?

Hmmm... so deadlocking (waiting for user to press ^C) might be
an acceptable solution. It would be good to tell him or her why
there was a deadlock (catch SIGINT), that Git was waiting for
specific command in a specific filter driver, for a specific file.


On the other hand v2 protocol has an additional problem: users
switching to v2, while using old one-shot filters (that worked
correctly).  So in my opinion you need to ensure two things:

(1) name things in such way that it is easy to see that you
need to write filter script specifically for the v2 protocol,

(2) if possible, do not hang but warn the user if he or she
wants to use v1 filter (per-file) with v2 protocol (per-command),
or at least help diagnose the issue.

>> This should be tested, if we agree that detecting misconfigured 
>> filters is a good thing.

[clarified]

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-25 20:16           ` Lars Schneider
@ 2016-07-26 12:24             ` Jakub Narębski
  0 siblings, 0 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-26 12:24 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright,
	Junio C Hamano, Eric Wong

W dniu 2016-07-25 o 22:16, Lars Schneider pisze:
> 
> On 24 Jul 2016, at 23:30, Jakub Narębski <jnareb@gmail.com> wrote:
> 
>> W dniu 2016-07-24 o 22:14, Jakub Narębski pisze:
>>> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
>>
>>>> I agree that the name is not ideal. "UseProtocol" as it is would be a boolean. 
>>>> I thought about "persistent" but this name wouldn't convey the scope of the 
>>>> persistency ("persistent for one Git operation" vs. "persistent for many Git 
>>>> operations"). What do you think about the protocol as int version idea
>>>> described in $gmane/300155 ?
>>>
>>> You mean the `protocol` as a config variable name (fully name being
>>> `filter.<driver>.protocol`), being integer-valued, isn't it? Wouldn't
>>> `protocolVersion` be a more explicit?
>>
>> Just throwing out further ideas:
>>
>> Perhaps make `persistent` string-valued variable, with the only value
>> supported for now, namely "per-process" / "operation"?
>>
>> Perhaps require for `pidfile` to be present for it to be daemon,
>> that is persist for possibly many Git operations. Or allow "daemon"
>> or "server" value for `persistent`, then?
> 
> I like the direction of this idea. What if we use a string-valued 
> "filter.<driver>.protocol" with the following options:
> 
> "simple" / "invocation-per-file" / << empty >> --> current clean/smudge behavior
> "invocation-per-process" --> new, proposed behavior
> 
> If necessary this could be enhanced in the future to support even a "daemon"
> mode (with a pidfile config).

Though, after thinking about it, this solution has the problem
that people might think that they can use their old per-file
filters, just flipping the `filter.<driver>.protocol`.

I dunno.
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-25 20:09         ` Lars Schneider
@ 2016-07-26 14:18           ` Jakub Narębski
  0 siblings, 0 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-26 14:18 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, Torsten Bögershausen, mlbright

W dniu 2016-07-25 o 22:09, Lars Schneider pisze:
> On 24 Jul 2016, at 22:14, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 2016-07-24 o 20:36, Lars Schneider pisze:
>>> On 23 Jul 2016, at 02:11, Jakub Narębski <jnareb@gmail.com> wrote:
>>>> W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
>>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>>
>>>> Nb. this line is only needed if you want author name and/or date
>>>> different from the email sender, or if you have sender line misconfigured
>>>> (e.g. lacking the human readable name).
>>>
>>> I use "git format-patch" to generate these emails:
>>>
>>> git format-patch --cover-letter --subject-prefix="PATCH ..." -M $BASE -o $OUTPUT
>>>
>>> How would I disable this line? (I already checked the man page to no avail).
>>
>> If you are using `git send-email` or equivalent, I think it is
>> stripped automatically if it is not needed (in you case it was,
>> because Sender was lacking human readable name... at least I think
>> it was because of what my email reader inserted as reply line).
>> If you are using an ordinary email client, you need to remove it
>> yourself, if needed.
> 
> Weird. I am sending the patches with this command:
> 
> git send-email mystuff/* --to=git@vger.kernel.org --cc=...
> 
> Maybe I need to try the "--suppress-from" switch?!

No, the "From:" line is needed, but only because your Sender seems
to be lacking human-readable name (for some strange reason).

But let's stop this here.
 

[...]
>>>> I also agree that we might want to be able to keep clean and smudge
>>>> filters separate, but be able to run a single program if they are
>>>> both the same. I think there is a special case for filter unset,
>>>> and/or filter being "cat" -- we would want to keep that.
>>>
>>> Since 1a8630d there is a more efficient way to unset a filter ;-)
>>> Can you think of other cases where the separation would be useful?
>>
>> I can't think of any, but it doesn't mean that it does not exist.
>> It also does not mean that you need to consider situation that may
>> not happen. Covering one-way filters, like "indent" filter for `clean`,
>> should be enough... they do work with your proposal, don't they?
> 
> This should work right now but it would be a bit inefficient (the filter
> would just pass the data unchanged through the smudge command). I plan to
> add a "capabilities" flag to the protocol. Then you can define only
> the "clean" capability and nothing or the current filter mechanism 
> would happen for smudge (I will make a test case to demonstrate that
> behavior in v2).

Isn't no-op filter (value not set, value set to empty string, "cat")
caught earlier in a common code?  We would certainly want to keep
one-way filter configuration mechanism without many changes.

Also, this should be of course tested.
 
>>>> 	Git <-- Filter: "capabilities clean smudge\n"
>>>>
>>>> Or we can add capabilities in later version...
>>>
>>> That is an interesting idea. My initial thought was to make the capabilities
>>> of a certain version fix. If we want to add new capabilities then we would 
>>> bump the version. I wonder what others think about your suggestion!
>>
>> Using capabilities (like git-upload-pack / git-receive-pack, that is
>> smart Git transfer protocols do) is probably slightly more difficult on
>> the Git side (assuming no capabilities negotiation), but also much more
>> flexible than pure version numbers.
>>
>> One possible idea for a capability is support for passing input
>> and output of a filter via filesystem, like cleanToFile and smudgeFromFile
>> proposal in 'jh/clean-smudge-annex' (in 'pu').
>>
>> For example:
>>
>> 	Git <-- Filter: "capabilities clean smudge cleanToFile smudgeFromFile\n"
> 
> Yes, I like that very much. As stated above, I will add that in v2.

I guess that you would add the idea of capabilities (though this
could be left for v3), not support for "cleanToFile" / "smudgeFromFile"
capabilities, and accompanying extension to the protocol, isn't it?
 
 
>>>>> 	Git --> Filter: "testfile.dat\n"
>>>>
>>>> Unfortunately, while sane filenames should not contain newlines[1],
>>>> the unfortunate fact is that *filenames can include newlines*, and
>>>> you need to be able to handle that[2].  Therefore you need either to
>>>> choose a different separator (the only one that can be safely used
>>>> is "\0", i.e. the NUL character - but it is not something easy to
>>>> handle by shell scripts), or C-quote filenames as needed, or always
>>>> C-quote filenames.  C-quoting at minimum should include quoting newline
>>>> character, and the escape character itself.
>>>>
>>>> BTW. is it the basename of a file, or a full pathname?
>>>>
>>>> [1]: http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
>>>> [2]: http://www.dwheeler.com/essays/filenames-in-shell.html
>>>
>>> Thanks for this explanation. A bash version of the protocol is not
>>> trivial (I tried it but ended up using Perl). Therefore I think "\0"
>>> would be a good choice?
>>
>> That, or use git convention of surrounding C-quoted filenames in
>> double quotes (which means that if it begins with quote, it is C-quoted).
>>
>> For example:
>>
>>  $ git commit -m 'Initial commit'
>>  [master (root-commit) 266dab0] Initial commit
>>   2 files changed, 2 insertions(+)
>>   create mode 100644 foo
>>   create mode 100644 "foo \" \\ ,"
>>  $ ls -1
>>  foo " \ ,
>>  foo
>>
>> I'm not sure which solution would be easier for filter writers,
>> NUL termination, or C-quoting.
> 
> Unless someone has a convincing argument for one solution or the other
> I will go with the \0 termination as it seems easier.

Well, I think both C-quoting when necessary, and NUL-terminating
is easy from the Git side, but I guess that handling NUL-terminated
filenames is easier from scripts.
 
>>>>> 	Git --> Filter: "7\n"
>>>>
>>>> That's the content size in bytes written as an ASCII number.
>>>
>>> Correct.
>>
>> But not obvious from the description / documentation.
> 
> I will improve that in v2. Should I add the info that it is base10 or
> would you consider that a given?

No, this is obvious, and used thorough Git code / formats. 

>>>>> ---
>>>>> Documentation/gitattributes.txt |  41 +++++++-
>>>>> convert.c                       | 210 ++++++++++++++++++++++++++++++++++++++--
>>>>> t/t0021-conversion.sh           | 170 ++++++++++++++++++++++++++++++++
>>>>
>>>> Wouldn't it be better to name the test case something more
>>>> descriptive, for example
>>>>
>>>>  t/t0021-filter-driver-useProtocol.sh
>>>>
>>>> The name of test should be adjusted to final name of the feature,
>>>> of course.
>>>
>>> I think the prefix numbers should be unique, no? And t0022 is already taken.
>>
>> I meant here that the "conversion" part of "t/t0021-conversion.sh" test
>> filename is not descriptive enough.
> 
> Ah, I see. You suggest to rename the test case? Would that be OK with the Git
> community?

Ah, I'm sorry.  I was in mistaken assumption that it is a new test,
not an extension to an existing test case.  In this case the filename
change can come as a separate patch, anyway.
 
 
>>>             What do you mean by "discrepancy in how config variables are 
>>> referenced"?
>>
>> What I meant here that filter.<driver>.smudge and filter.<driver>.clean
>> were referenced as "`smudge` command" and "`clean` command" in the paragraph
>> you modified.
>>
>> Perhaps filter.<driver>.useProtocol is all right (I have not looked further),
>> but it should be formatted as `filter.<driver>.useProtocol` IMVHO.
> 
> Initially I thought so, too. But "filter.<driver>.required", which is already
> mentioned in gitattributes.txt, does not use this style. Should I change that, too,
> or use the existing style?

All right, if there is a precedent for using this style, it would
be all right.
 
 
>> The problem is do the protocol need to have some way of communicating
>> errors from the filter to Git?  Perhaps using stderr would be enough
>> (but then Git would need to drain it, I think... unless it is not
>> redirected), perhaps some command is needed?
>>
>> For example, instead of:
>>
>> 	Git <-- Filter: "15\n"
>> 	Git <-- Filter: "SMUDGED_CONTENT"
>>
>> perhaps filter should return
>>
>> 	Git <-- Filter: "error\n"
>> 	Git <-- Filter: "ONE_LINE_OF_ERROR_DESCRIPTION\n"
>>
>> on error? Or if printing expected output length upfront is easier,
>> use a signal (but that is supposedly not that reliable as message
>> passing mechanism)?
>>
>> It might be the case that some files return errors, but some do not.
> 
> I would prefer it if the filter just dies in case of trouble and that
> way communicates to Git that something went wrong. Everything else
> just complicates the protocol.

All right, this makes protocol simpler.  One thing we might want
to do is to ensure that stderr from the filter driver (where error
messages should be sent IMVHO, and which of course needs to be
documented) goes to the user, possibly with prefixing (like for
errors from the remote hooks).

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 2/3] convert: modernize tests
  2016-07-22 15:48 ` [PATCH v1 2/3] convert: modernize tests larsxschneider
@ 2016-07-26 15:18   ` Remi Galan Alfonso
  2016-07-26 20:40     ` Junio C Hamano
  0 siblings, 1 reply; 77+ messages in thread
From: Remi Galan Alfonso @ 2016-07-26 15:18 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, jnareb, tboegi

Hi Lars,

Sorry, minor nit that I noticed a couple of days ago but didn't
comment on the moment and forgot until now.

Lars Schneider <larsxschneider@gmail.com> wrote:
> Use `test_config` to set the config, check that files are empty with
> `test_must_be_empty`, compare files with `test_cmp`, and remove spaces
> after ">".

Considering how close it is to your patch, you might also want to
remove spaces after '<'.

There is only one occurrence in this file and it's in a line you are
already modifying.

See below:

>  test_expect_success check '
>  
> -        cmp test.o test &&
> -        cmp test.o test.t &&
> +        test_cmp test.o test &&
> +        test_cmp test.o test.t &&
>  
>          # ident should be stripped in the repository
>          git diff --raw --exit-code :test :test.i &&
> @@ -47,10 +47,10 @@ test_expect_success check '
>          embedded=$(sed -ne "$script" test.i) &&
>          test "z$id" = "z$embedded" &&
>  
> -        git cat-file blob :test.t > test.r &&
> +        git cat-file blob :test.t >test.r &&
>  
> -        ./rot13.sh < test.o > test.t &&
> -        cmp test.r test.t
> +        ./rot13.sh < test.o >test.t &&

Here.

> +        test_cmp test.r test.t
>  '

Thanks,
Rémi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option
  2016-07-23  7:27     ` Eric Wong
@ 2016-07-26 20:00       ` Jeff King
  0 siblings, 0 replies; 77+ messages in thread
From: Jeff King @ 2016-07-26 20:00 UTC (permalink / raw)
  To: Eric Wong; +Cc: Jakub Narębski, larsxschneider, git, tboegi

On Sat, Jul 23, 2016 at 07:27:21AM +0000, Eric Wong wrote:

> Jakub Narębski <jnareb@gmail.com> wrote:
> > W dniu 2016-07-22 o 17:49, larsxschneider@gmail.com pisze:
> > > +use strict;
> > > +use warnings;
> > > +use autodie;
> > 
> > autodie?
> 
> "set -e" for Perl (man autodie)
> 
> It's been a part of Perl for ages, but I've never used it
> myself, either; I suppose it's fine for tests...

autodie has been around for a long time, but it only became part of the
perl core in v5.10.1 (according to Module::CoreList). I think the code
in perl/ requires only 5.8, but whenever we unconditionally use perl
without respect to NO_PERL (like in the test scripts), we usually shoot
for even antique versions of perl like 5.005.

So by those rules, we should avoid "autodie" here, though I wouldn't be
surprised if it takes a while for people to complain in practice (most
modern systems will have a recent enough perl, but it seems we go
through cycles where every few years somebody posts a bunch of patches
for ancient versions of IRIX or some other platform, cleaning up all of
these sorts of portability problems).

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 0/3] Git filter protocol
  2016-07-24 11:24   ` Lars Schneider
@ 2016-07-26 20:11     ` Jeff King
  0 siblings, 0 replies; 77+ messages in thread
From: Jeff King @ 2016-07-26 20:11 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Git Mailing List, jnareb,
	Torsten Bögershausen, peartben, mlbright

On Sun, Jul 24, 2016 at 01:24:29PM +0200, Lars Schneider wrote:

> What if we would keep the config option "protocol" and make it an "int"? 
> Undefined or version "1" would describe the existing clean/smudge 
> protocol via command line and pipe. Version "2" would be the new protocol?

FWIW, that is what I expected when I saw the word "protocol".

It's possible that we might never need a "v3" protocol specified here,
because your v2 protocol should be able to auto-upgrade. That is, if we
start a filter and it says "hi, I am speaking protocol 3", then Git
knows to speak the requested version from there on (or will barf if it
doesn't understand the version).

So you'd only need to say "filter.foo.protocol=v3" if there was some
protocol change that broke the initial conversation.

That does mean it is the filter which sets the maximum protocol level,
not git. So a filter which can speak v3 or v2 (to work with older
versions of git) does not know which to use. That could be solved by
specifying

  [filter "foo"]
  smudge = my-filter --version=3

or something.

I'm not sure it's worth thinking too hard about what-ifs here. We should
do the simplest thing that will work and avoid painting ourselves into a
corner for future upgrades.

> > * The way the serialized access to these long-running processes
> >   work in 3/3 would make it harder or impossible to later
> >   parallelize conversion?  I am imagining a far future where we
> >   would run "git checkout ." using (say) two threads, one
> >   responsible for active_cache[0..active_nr/2] and the other
> >   responsible for the remainder.
> I hope this future is not too far away :-) 
> However, I don't think that would be a problem as we could start the
> long-running process once for each checkout thread, no?

That's reasonable if we have a worker-thread model (which seems likely,
as that's what we use elsewhere in git), and if the main cost you want
to amortize is just process startup (so you pay the cost once per
worker, which is a constant factor and not too bad).

It's not a good model if the long-running process wants to amortize
other shared costs. For example, persistent https connections. Or even
user-interactive authentication steps, where you really would prefer to
do them once. The filter can implement its own ad-hoc sharing of
resources, but doing that portably is complicated.

Of course having an async protocol between git and the filter is also
complicated. Perhaps that's something that could wait for a v3 if
somebody really wants it.

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v1 2/3] convert: modernize tests
  2016-07-26 15:18   ` Remi Galan Alfonso
@ 2016-07-26 20:40     ` Junio C Hamano
  0 siblings, 0 replies; 77+ messages in thread
From: Junio C Hamano @ 2016-07-26 20:40 UTC (permalink / raw)
  To: Remi Galan Alfonso
  Cc: Lars Schneider, Git Mailing List, Jeff King, Jakub Narębski,
	Torsten Bögershausen

On Tue, Jul 26, 2016 at 8:18 AM, Remi Galan Alfonso
<remi.galan-alfonso@ensimag.grenoble-inp.fr> wrote:
> Considering how close it is to your patch, you might also want to
> remove spaces after '<'.
>
> There is only one occurrence in this file and it's in a line you are
> already modifying.

Good eyes.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v2 0/5] Git filter protocol
  2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
                   ` (3 preceding siblings ...)
  2016-07-22 21:39 ` [PATCH v1 0/3] Git filter protocol Junio C Hamano
@ 2016-07-27  0:06 ` larsxschneider
  2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
                     ` (5 more replies)
  4 siblings, 6 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

thanks a lot for the extensive reviews. I tried to address all mentioned
concerns and summarized them below. The most prominent changes since v1 are
the following:
* pipe communication uses a packet format (pkt-line) based protocol
* a long running filter application is defined with "filter.<driver>.process"
* Git offers a number of filter capabilties that a filter can request
  (right now only "smudge" and "clean" - in the future maybe "cleanFromFile",
  "smudgeToFile", and/or "stream")

Cheers,
Lars


## Torsten:
* add "\n" line terminator after version in init sequence
* prepare big file for EXPENSIVE tests once
* set "#!/usr/bin/perl" as shebang for rot13.pl to mimic other Perl test scripts
* add test_have_prereq PERL to t0021

## Ramsay:
* use write_in_full(process->in, nbuf.buf, nbuf.len) to avoid unneccsary strlen call
* use read_in_full to read data that exceeds MAX_IO_SIZE properly
* fix test case to check for large file filtering

## Jakub:
* use standard input/standard output instead of stdin/stdout
* replace global variable "cmd_process_map" with a function parameter where possible
* avoid "strbuf_reset" after STRBUF_INIT
* align test_config_global
* rename rot13.pl to rot13-filter.pl
* make Perl style consistent
* describe hard coded filenames in test filter header
* improve docs
* add filter capabilities field (enables cleanToFile, smudgeFromFile, and/or stream later)
* explain that content size in bytes is encoded in ASCII
* consistent line ending for die call in Perl (without "\n")
* make rot13 test filter die in case of failure (instead of returning "fail")

## Eric:
* flush explicitly in Perl test filter
* do not initialize variables to NULL if they are set unconditionally
* fix no-op stop_protocol_filter
* use off_t instead of size_t
* improve test filter int parsing ($filelen =~ /\A\d+\z/ or die "bad filelen: $filelen")

## Peff:
* use pkt-line protocol
* do not use Perl autodie

## Remi:
* remove spaces after '<'



Lars Schneider (5):
  convert: quote filter names in error messages
  convert: modernize tests
  pkt-line: extract and use `set_packet_header` function
  convert: generate large test files only once
  convert: add filter.<driver>.process option

 Documentation/gitattributes.txt |  54 +++++++-
 convert.c                       | 281 +++++++++++++++++++++++++++++++++++++---
 pkt-line.c                      |  15 ++-
 pkt-line.h                      |   1 +
 t/t0021-conversion.sh           | 272 ++++++++++++++++++++++++++++++++------
 t/t0021/rot13-filter.pl         | 146 +++++++++++++++++++++
 6 files changed, 704 insertions(+), 65 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

--
2.9.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH v2 1/5] convert: quote filter names in error messages
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
@ 2016-07-27  0:06   ` larsxschneider
  2016-07-27 20:01     ` Jakub Narębski
  2016-07-27  0:06   ` [PATCH v2 2/5] convert: modernize tests larsxschneider
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter with spaces (e.g. `filter.sh foo`) are hard to read in
error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b1614bf..522e2c5 100644
--- a/convert.c
+++ b/convert.c
@@ -397,7 +397,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -415,13 +415,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -462,15 +462,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 2/5] convert: modernize tests
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
  2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
@ 2016-07-27  0:06   ` larsxschneider
  2016-07-27  0:06   ` [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function larsxschneider
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">".

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..7b45136 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -13,8 +13,8 @@ EOF
 chmod +x rot13.sh
 
 test_expect_success setup '
-	git config filter.rot13.smudge ./rot13.sh &&
-	git config filter.rot13.clean ./rot13.sh &&
+	test_config filter.rot13.smudge ./rot13.sh &&
+	test_config filter.rot13.clean ./rot13.sh &&
 
 	{
 	    echo "*.t filter=rot13"
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh <test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
  2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
  2016-07-27  0:06   ` [PATCH v2 2/5] convert: modernize tests larsxschneider
@ 2016-07-27  0:06   ` larsxschneider
  2016-07-27  0:20     ` Junio C Hamano
  2016-07-27  0:06   ` [PATCH v2 4/5] convert: generate large test files only once larsxschneider
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

`set_packet_header` converts an integer to a 4 byte hex string. Make
this function publicly available so that other parts of Git can easily
generate a pkt-line.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 15 ++++++++++-----
 pkt-line.h |  1 +
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 62fdb37..6820224 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -98,9 +98,17 @@ void packet_buf_flush(struct strbuf *buf)
 }
 
 #define hex(a) (hexchar[(a) & 15])
-static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+void set_packet_header(char *buf, const int size)
 {
 	static char hexchar[] = "0123456789abcdef";
+	buf[0] = hex(size >> 12);
+	buf[1] = hex(size >> 8);
+	buf[2] = hex(size >> 4);
+	buf[3] = hex(size);
+}
+
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+{
 	size_t orig_len, n;
 
 	orig_len = out->len;
@@ -111,10 +119,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	if (n > LARGE_PACKET_MAX)
 		die("protocol error: impossibly long line");
 
-	out->buf[orig_len + 0] = hex(n >> 12);
-	out->buf[orig_len + 1] = hex(n >> 8);
-	out->buf[orig_len + 2] = hex(n >> 4);
-	out->buf[orig_len + 3] = hex(n);
+	set_packet_header(&out->buf[orig_len], n);
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
diff --git a/pkt-line.h b/pkt-line.h
index 3cb9d91..925c6d3 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+void set_packet_header(char *buf, const int size);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 4/5] convert: generate large test files only once
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
                     ` (2 preceding siblings ...)
  2016-07-27  0:06   ` [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function larsxschneider
@ 2016-07-27  0:06   ` larsxschneider
  2016-07-27  2:35     ` Torsten Bögershausen
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
  2016-07-27 19:08   ` [PATCH v2 0/5] Git filter protocol Jakub Narębski
  5 siblings, 1 reply; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Generate a more interesting large test file with random characters in
between and reuse this test file in multiple tests. Run tests formerly
marked as EXPENSIVE every time but with a smaller test file.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7b45136..b9911a4 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -4,6 +4,13 @@ test_description='blob conversion via gitattributes'
 
 . ./test-lib.sh
 
+if test_have_prereq EXPENSIVE
+then
+	T0021_LARGE_FILE_SIZE=2048
+else
+	T0021_LARGE_FILE_SIZE=30
+fi
+
 cat <<EOF >rot13.sh
 #!$SHELL_PATH
 tr \
@@ -31,7 +38,15 @@ test_expect_success setup '
 	cat test >test.i &&
 	git add test test.t test.i &&
 	rm -f test test.t test.i &&
-	git checkout -- test test.t test.i
+	git checkout -- test test.t test.i &&
+
+	mkdir -p generated-test-data &&
+	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
+	do
+		# Generate 1MB of empty data and 100 bytes of random characters
+		printf "%1048576d" 1
+		printf "$(LC_ALL=C tr -dc "A-Za-z0-9" </dev/urandom | dd bs=$((RANDOM>>8)) count=1 2>/dev/null)"
+	done >generated-test-data/large.file
 '
 
 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -199,9 +214,9 @@ test_expect_success 'required filter clean failure' '
 test_expect_success 'filtering large input to small output should use little memory' '
 	test_config filter.devnull.clean "cat >/dev/null" &&
 	test_config filter.devnull.required true &&
-	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
-	echo "30MB filter=devnull" >.gitattributes &&
-	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
+	cp generated-test-data/large.file large.file &&
+	echo "large.file filter=devnull" >.gitattributes &&
+	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add large.file
 '
 
 test_expect_success 'filter that does not read is fine' '
@@ -214,15 +229,15 @@ test_expect_success 'filter that does not read is fine' '
 	test_cmp expect actual
 '
 
-test_expect_success EXPENSIVE 'filter large file' '
+test_expect_success 'filter large file' '
 	test_config filter.largefile.smudge cat &&
 	test_config filter.largefile.clean cat &&
-	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
-	echo "2GB filter=largefile" >.gitattributes &&
-	git add 2GB 2>err &&
+	echo "large.file filter=largefile" >.gitattributes &&
+	cp generated-test-data/large.file large.file &&
+	git add large.file 2>err &&
 	test_must_be_empty err &&
-	rm -f 2GB &&
-	git checkout -- 2GB 2>err &&
+	rm -f large.file &&
+	git checkout -- large.file 2>err &&
 	test_must_be_empty err
 '
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
                     ` (3 preceding siblings ...)
  2016-07-27  0:06   ` [PATCH v2 4/5] convert: generate large test files only once larsxschneider
@ 2016-07-27  0:06   ` larsxschneider
  2016-07-27  1:32     ` Jeff King
                       ` (3 more replies)
  2016-07-27 19:08   ` [PATCH v2 0/5] Git filter protocol Jakub Narębski
  5 siblings, 4 replies; 77+ messages in thread
From: larsxschneider @ 2016-07-27  0:06 UTC (permalink / raw)
  To: git
  Cc: gitster, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

This patch adds the filter.<driver>.process string option which, if used,
keeps the external filter process running and processes all blobs with
the following packet format (pkt-line) based protocol over standard input
and standard output.

Git starts the filter on first usage and expects a welcome
message, protocol version number, and filter capabilities
seperated by spaces:
------------------------
packet:          git< git-filter-protocol
packet:          git< version 2
packet:          git< clean smudge
------------------------
Supported filter capabilities are "clean" and "smudge".

Afterwards Git sends a command (e.g. "smudge" or "clean" - based
on the supported capabilities), the filename, the content size as
ASCII number in bytes, and the content in packet format with a
flush packet at the end:
------------------------
packet:          git> smudge
packet:          git> testfile.dat
packet:          git> 7
packet:          git> CONTENT
packet:          git> 0000
------------------------

The filter is expected to respond with the result content size as
ASCII number in bytes and the result content in packet format with
a flush packet at the end:
------------------------
packet:          git< 57
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
------------------------
Please note: In a future version of Git the capability "stream"
might be supported. In that case the content size must not be
part of the filter response.

Afterwards the filter is expected to wait for the next command.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Helped-by: Martin-Louis Bright <mlbright@gmail.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt |  54 +++++++-
 convert.c                       | 269 ++++++++++++++++++++++++++++++++++++++--
 t/t0021-conversion.sh           | 175 ++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 146 ++++++++++++++++++++++
 4 files changed, 631 insertions(+), 13 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..8fb40d2 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate. If a long running filter process (see section
+below) is used then Git can process all blobs with a single filter
+invocation for the entire life of a single Git command (e.g.
+`git add .`).
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +379,54 @@ substitution.  For example:
 ------------------------
 
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (string value) is defined via
+filter.<driver>.process then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command by talking with the following packet format (pkt-line)
+based protocol over standard input and standard output.
+
+Git starts the filter on first usage and expects a welcome
+message, protocol version number, and filter capabilities
+seperated by spaces:
+------------------------
+packet:          git< git-filter-protocol
+packet:          git< version 2
+packet:          git< clean smudge
+------------------------
+Supported filter capabilities are "clean" and "smudge".
+
+Afterwards Git sends a command (e.g. "smudge" or "clean" - based
+on the supported capabilities), the filename, the content size as
+ASCII number in bytes, and the content in packet format with a
+flush packet at the end:
+------------------------
+packet:          git> smudge
+packet:          git> testfile.dat
+packet:          git> 7
+packet:          git> CONTENT
+packet:          git> 0000
+------------------------
+
+The filter is expected to respond with the result content size as
+ASCII number in bytes and the result content in packet format with
+a flush packet at the end:
+------------------------
+packet:          git< 57
+packet:          git< SMUDGED_CONTENT
+packet:          git< 0000
+------------------------
+Please note: In a future version of Git the capability "stream"
+might be supported. In that case the content size must not be
+part of the filter response.
+
+Afterwards the filter is expected to wait for the next command.
+A demo implementation can be found in `t/t0021/rot13-filter.pl`
+located in the Git core repository.
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/convert.c b/convert.c
index 522e2c5..5ff200b 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "pkt-line.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -481,11 +482,232 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return ret;
 }
 
+static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
+{
+	off_t bytes_read;
+	off_t total_bytes_read = 0;
+	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
+	do {
+		bytes_read = packet_read(
+			fd, NULL, NULL,
+			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
+			PACKET_READ_GENTLE_ON_EOF
+		);
+		total_bytes_read += bytes_read;
+	}
+	while (
+		bytes_read > 0 && 					// the last packet was no flush
+		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
+	);
+	strbuf_setlen(sb, total_bytes_read);
+	return total_bytes_read;
+}
+
+static int multi_packet_write(const char *src, size_t len, const int in, const int out)
+{
+	int ret = 1;
+	char header[4];
+	char buffer[8192];
+	off_t bytes_to_write;
+	while (ret) {
+		if (in >= 0) {
+			bytes_to_write = xread(in, buffer, sizeof(buffer));
+			if (bytes_to_write < 0)
+				ret &= 0;
+			src = buffer;
+		} else {
+			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
+			len -= bytes_to_write;
+		}
+		if (!bytes_to_write)
+			break;
+		set_packet_header(header, bytes_to_write + 4);
+		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
+		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
+	}
+	ret &= write_in_full(out, "0000", 4) == 4;
+	return ret;
+}
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	const char *cmd;
+	int clean;
+	int smudge;
+	struct child_process process;
+};
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+							const struct cmd2process *e2,
+							const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_protocol_filter_entry(struct hashmap *hashmap, const char *cmd)
+{
+	struct cmd2process k;
+	hashmap_entry_init(&k, strhash(cmd));
+	k.cmd = cmd;
+	return hashmap_get(hashmap, &k, NULL);
+}
+
+static void stop_protocol_filter(struct hashmap *hashmap, struct cmd2process *entry) {
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	child_process_clear(&entry->process);
+	hashmap_remove(hashmap, entry, NULL);
+	free(entry);
+}
+
+static struct cmd2process *start_protocol_filter(struct hashmap *hashmap, const char *cmd)
+{
+	int ret = 1;
+	struct cmd2process *entry;
+	struct child_process *process;
+	const char *argv[] = { NULL, NULL };
+	struct string_list capabilities = STRING_LIST_INIT_NODUP;
+	char *capabilities_buffer;
+	int i;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	entry->clean = 0;
+	entry->smudge = 0;
+	process = &entry->process;
+
+	child_process_init(process);
+	argv[0] = cmd;
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+
+	if (start_command(process)) {
+		error("cannot fork to run external persistent filter '%s'", cmd);
+		stop_protocol_filter(hashmap, entry);
+		return NULL;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+	ret &= strcmp(packet_read_line(process->out, NULL), "git-filter-protocol") == 0;
+	ret &= strcmp(packet_read_line(process->out, NULL), "version 2") == 0;
+	capabilities_buffer = packet_read_line(process->out, NULL);
+	sigchain_pop(SIGPIPE);
+
+	string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);
+	for (i = 0; i < capabilities.nr; i++) {
+		const char *requested = capabilities.items[i].string;
+		if (!strcmp(requested, "clean")) {
+			entry->clean = 1;
+		} else if (!strcmp(requested, "smudge")) {
+			entry->smudge = 1;
+		} else {
+			warning(
+				"filter process '%s' requested unsupported filter capability '%s'",
+				cmd, requested
+			);
+		}
+	}
+	string_list_clear(&capabilities, 0);
+
+	if (!ret) {
+		error("initialization for external persistent filter '%s' failed", cmd);
+		stop_protocol_filter(hashmap, entry);
+		return NULL;
+	}
+
+	hashmap_add(hashmap, entry);
+	return entry;
+}
+
+static int cmd_process_map_init = 0;
+static struct hashmap cmd_process_map;
+
+static int apply_protocol_filter(const char *path, const char *src, size_t len,
+						int fd, struct strbuf *dst, const char *cmd,
+						const char *filter_type)
+{
+	int ret = 1;
+	struct cmd2process *entry;
+	struct child_process *process;
+	struct stat file_stat;
+	struct strbuf nbuf = STRBUF_INIT;
+	off_t expected_bytes;
+	char *strtol_end;
+	char *strbuf;
+
+	if (!cmd || !*cmd)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (!cmd_process_map_init) {
+		cmd_process_map_init = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = find_protocol_filter_entry(&cmd_process_map, cmd);
+	}
+
+	if (!entry) {
+		entry = start_protocol_filter(&cmd_process_map, cmd);
+		if (!entry) {
+			return 0;
+		}
+	}
+	process = &entry->process;
+
+	if (!(!strcmp(filter_type, "clean") && entry->clean) &&
+		!(!strcmp(filter_type, "smudge") && entry->smudge)) {
+		return 0;
+	}
+
+	if (fd >= 0 && !src) {
+		ret &= fstat(fd, &file_stat) != -1;
+		len = file_stat.st_size;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	packet_write(process->in, "%s\n", filter_type);
+	packet_write(process->in, "%s\n", path);
+	packet_write(process->in, "%zu\n", len);
+	ret &= multi_packet_write(src, len, fd, process->in);
+
+	strbuf = packet_read_line(process->out, NULL);
+	expected_bytes = (off_t)strtol(strbuf, &strtol_end, 10);
+	ret &= (strtol_end != strbuf && errno != ERANGE);
+
+	if (expected_bytes > 0) {
+		ret &= multi_packet_read(&nbuf, process->out, expected_bytes) == expected_bytes;
+	}
+
+	sigchain_pop(SIGPIPE);
+
+	if (ret) {
+		strbuf_swap(dst, &nbuf);
+	} else {
+		// Something went wrong with the protocol filter. Force shutdown!
+		stop_protocol_filter(&cmd_process_map, entry);
+	}
+	strbuf_release(&nbuf);
+	return ret;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
+	const char *process;
 	int required;
 } *user_convert, **user_convert_tail;
 
@@ -526,6 +748,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("process", key)) {
+		return git_config_string(&drv->process, var, value);
+	}
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -823,7 +1049,10 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	if (!ca.drv->clean && ca.drv->process)
+		return apply_protocol_filter(path, NULL, 0, -1, NULL, ca.drv->process, "clean");
+	else
+		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -856,17 +1085,22 @@ int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
 	int ret = 0;
-	const char *filter = NULL;
+	const char *clean_filter = NULL;
+	const char *process_filter = NULL;
 	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
-		filter = ca.drv->clean;
+		clean_filter = ca.drv->clean;
+		process_filter = ca.drv->process;
 		required = ca.drv->required;
 	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
+	if (!clean_filter && process_filter)
+		ret |= apply_protocol_filter(path, src, len, -1, dst, process_filter, "clean");
+	else
+		ret |= apply_filter(path, src, len, -1, dst, clean_filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -885,13 +1119,19 @@ int convert_to_git(const char *path, const char *src, size_t len,
 void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 			      enum safe_crlf checksafe)
 {
+	int ret = 0;
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
-	assert(ca.drv->clean);
+	assert(ca.drv->clean || ca.drv->process);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (!ca.drv->clean && ca.drv->process)
+		ret = apply_protocol_filter(path, NULL, 0, fd, dst, ca.drv->process, "clean");
+	else
+		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
+
+	if (!ret)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -902,14 +1142,16 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 					    size_t len, struct strbuf *dst,
 					    int normalizing)
 {
-	int ret = 0, ret_filter = 0;
-	const char *filter = NULL;
+	int ret = 0, ret_filter;
+	const char *smudge_filter = NULL;
+	const char *process_filter = NULL;
 	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
-		filter = ca.drv->smudge;
+		process_filter = ca.drv->process;
+		smudge_filter = ca.drv->smudge;
 		required = ca.drv->required;
 	}
 
@@ -922,7 +1164,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 	 * CRLF conversion can be skipped if normalizing, unless there
 	 * is a smudge filter.  The filter might expect CRLFs.
 	 */
-	if (filter || !normalizing) {
+	if (smudge_filter || process_filter || !normalizing) {
 		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 		if (ret) {
 			src = dst->buf;
@@ -930,7 +1172,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
+	if (!smudge_filter && process_filter)
+		ret_filter = apply_protocol_filter(path, src, len, -1, dst, process_filter, "smudge");
+	else
+		ret_filter = apply_filter(path, src, len, -1, dst, smudge_filter);
 	if (!ret_filter && required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
@@ -1383,7 +1628,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	struct stream_filter *filter = NULL;
 
 	convert_attrs(&ca, path);
-	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
+	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index b9911a4..c4793ed 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -4,6 +4,11 @@ test_description='blob conversion via gitattributes'
 
 . ./test-lib.sh
 
+if ! test_have_prereq PERL; then
+	skip_all='skipping perl interface tests, perl not available'
+	test_done
+fi
+
 if test_have_prereq EXPENSIVE
 then
 	T0021_LARGE_FILE_SIZE=2048
@@ -283,4 +288,174 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+test_expect_success 'required protocol filter should filter data' '
+	test_config_global filter.protocol.process \"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		echo "test333" >test3.r &&
+
+		rm -f output.log &&
+		git add . &&
+		sort output.log | uniq -c | sed "s/^[ ]*//" >uniq_output.log &&
+		cat >expected_add.log <<-\EOF &&
+			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			1 IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_add.log uniq_output.log &&
+
+		>output.log &&
+		git commit . -m "test commit" &&
+		sort output.log | uniq -c | sed "s/^[ ]*//" |
+			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq_output.log &&
+		cat >expected_commit.log <<-\EOF &&
+			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			x IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_commit.log uniq_output.log &&
+
+		>output.log &&
+		rm -f test?.r &&
+		git checkout . &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout.log smudge_output.log &&
+
+		git checkout empty &&
+
+		>output.log &&
+		git checkout master &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge_output.log
+	)
+'
+
+test_expect_success 'protocol filter large file' '
+	test_config_global filter.protocol.process \"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.file filter=protocol" >.gitattributes &&
+		cp ../generated-test-data/large.file large.file &&
+		cp large.file large.original &&
+		./../rot13.sh <large.original >large.rot13 &&
+
+		git add large.file .gitattributes &&
+		git commit . -m "test commit" &&
+
+		rm -f large.file &&
+		git checkout -- large.file &&
+		git cat-file blob :large.file >actual &&
+		test_cmp large.rot13 actual
+	)
+'
+
+test_expect_success 'required protocol filter should fail with clean' '
+	test_config_global filter.protocol.process \"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "test333" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success 'protocol filter should restart after failure' '
+	test_config_global filter.protocol.process \"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "1234567" >test2.o &&
+		cat test2.o >test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		printf "" >output.log &&
+		git checkout . &&
+		cat output.log | grep -v "IN: clean" >smudge_output.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [FAIL]
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge_output.log &&
+
+		test_cmp ../test.o test.r &&
+		./../rot13.sh <../test.o >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual
+
+		test_cmp test2.o test2.r &&
+		./../rot13.sh <test2.o >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual
+
+		test_cmp test2.o test2.r &&
+		./../rot13.sh <test2.o >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
new file mode 100755
index 0000000..7176836
--- /dev/null
+++ b/t/t0021/rot13-filter.pl
@@ -0,0 +1,146 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+# This implementation supports two special test cases:
+# (1) If data with the filename "clean-write-fail.r" is processed with
+#     a "clean" operation then the write operation will die.
+# (2) If data with the filename "smudge-write-fail.r" is processed with
+#     a "smudge" operation then the write operation will die.
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+sub packet_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+        return;
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+open my $debug, ">>", "output.log";
+print $debug "start\n";
+$debug->flush();
+
+packet_write("git-filter-protocol\n");
+packet_write("version 2\n");
+packet_write("clean smudge\n");
+print $debug "wrote filter header\n";
+$debug->flush();
+
+while (1) {
+    my $command = packet_read();
+    unless ( defined($command) ) {
+        exit();
+    }
+    chomp $command;
+    print $debug "IN: $command";
+    $debug->flush();
+    my $filename = packet_read();
+    chomp $filename;
+    print $debug " $filename";
+    $debug->flush();
+    my $filelen = packet_read();
+    chomp $filelen;
+    print $debug " $filelen";
+    $debug->flush();
+
+    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
+    my $output;
+
+    if ( $filelen > 0 ) {
+        my $input = "";
+        {
+            binmode(STDIN);
+            my $buffer;
+            my $done = 0;
+            while ( !$done ) {
+                ( $done, $buffer ) = packet_read();
+                $input .= $buffer;
+            }
+            print $debug " [OK] -- ";
+            $debug->flush();
+        }
+
+        if ( $command eq "clean" ) {
+            $output = rot13($input);
+        }
+        elsif ( $command eq "smudge" ) {
+            $output = rot13($input);
+        }
+        else {
+            die "bad command";
+        }
+    }
+
+    my $output_len = length($output);
+    packet_write("$output_len\n");
+    print $debug "OUT: $output_len ";
+    $debug->flush();
+    if ( $output_len > 0 ) {
+        if (   ( $command eq "clean" and $filename eq "clean-write-fail.r" )
+            or
+            ( $command eq "smudge" and $filename eq "smudge-write-fail.r" ) )
+        {
+            print $debug " [FAIL]\n";
+            $debug->flush();
+            die "write error";
+        }
+        else {
+            while ( length($output) > 0 ) {
+                my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+                packet_write($packet);
+                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+                    $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+                }
+                else {
+                    $output = "";
+                }
+            }
+            packet_flush();
+            print $debug "[OK]\n";
+            $debug->flush();
+        }
+    }
+}
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function
  2016-07-27  0:06   ` [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function larsxschneider
@ 2016-07-27  0:20     ` Junio C Hamano
  2016-07-27  9:13       ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Junio C Hamano @ 2016-07-27  0:20 UTC (permalink / raw)
  To: larsxschneider
  Cc: git, jnareb, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> `set_packet_header` converts an integer to a 4 byte hex string. Make
> this function publicly available so that other parts of Git can easily
> generate a pkt-line.

I think that having to do this is a strong sign that the design of
this series is going in a wrong direction.

If you need a helper function that writes a pkt-line format that
behaves differently from what is already available (for example,
packet_write()), it would be much better to design that new function
so that it would be generally useful and add that to pkt-line.[ch],
instead of creating random helper functions that use write(2)
directly, bypassing pkt-line API, to write stuff.

In other words, do not _mimick_ pkt-line; enhance pkt-line as
necessary and use it.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
@ 2016-07-27  1:32     ` Jeff King
  2016-07-27 17:31       ` Lars Schneider
  2016-07-27  9:41     ` Eric Wong
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-27  1:32 UTC (permalink / raw)
  To: larsxschneider
  Cc: git, gitster, jnareb, tboegi, mlbright, remi.galan-alfonso,
	pclouds, e, ramsay

On Wed, Jul 27, 2016 at 02:06:05AM +0200, larsxschneider@gmail.com wrote:

> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
> +{
> +	off_t bytes_read;
> +	off_t total_bytes_read = 0;

I haven't looked carefully at the whole patch yet, but there seems to be
some type issues here. off_t is a good type for storing the whole size
of a file (which may be larger than the amount of memory we can
allocate). But size_t is the right size for an in-memory object.

This function takes a size_t size, which makes sense if it is meant to
read everything into a strbuf.

So I think our total_bytes_read would probably want to be a size_t here,
too, because it cannot possibly grow larger than that (and that is
enforced by the loop below). Otherwise you get weirdness like "sb->buf +
total_bytes_ref" possibly overflowing memory.

> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush

What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
32-bit system)?

> +	do {
> +		bytes_read = packet_read(
> +			fd, NULL, NULL,
> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
> +			PACKET_READ_GENTLE_ON_EOF
> +		);

packet_read() actually returns an int, and may return "-1" on EOF (and
int is fine because we know that we are constrained to 16-bit values
by the pkt-line definition). You read it into an "off_t". I _think_ that
is OK, because I believe POSIX says off_t must be signed. But probably
"int" is the more correct type here.

> +		total_bytes_read += bytes_read;

If you do get "-1", I think you need to detect it here before adjusting
total_bytes_read.

> +	while (
> +		bytes_read > 0 && 					// the last packet was no flush
> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
> +	);

And I'm not sure if you need to distinguish between "0" and "-1" when
checking byte_read here.

> +	strbuf_setlen(sb, total_bytes_read);

Passing an off_t to something expecting a size_t, which can involve
truncation (though I think in practice you really are limited to
size_t).

> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
> +{
> +	int ret = 1;
> +	char header[4];
> +	char buffer[8192];
> +	off_t bytes_to_write;
> +	while (ret) {
> +		if (in >= 0) {
> +			bytes_to_write = xread(in, buffer, sizeof(buffer));

Likewise here, xread() is returning ssize_t. Again, OK if we can assume
off_t is signed, but it probably makes sense to use the correct type (we
also know it cannot be larger than 8K, of course).

Why 8K? The pkt-line format naturally restricts us to just under 64K, so
why not take advantage of that and minimize the framing overhead for
large data?

> +			if (bytes_to_write < 0)
> +				ret &= 0;

I think "&= 0" is unusual for our codebase? Would just writing "= 0" be
more clear?

We do sometimes do "ret |= something()" but that is in cases where
"ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
your function's error-reporting is inverted from our usual style?

> +		set_packet_header(header, bytes_to_write + 4);
> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> +	}

If you look at format_packet(), it pulls a slight trick: we have a
buffer 4 bytes larger than we need, format into "buf + 4", and then
write the final size at the beginning. That lets us write() it all in
one go.

At first I thought this function was simply reinventing packet_write(),
but I guess you are trying to avoid the extra copy of the data (once
into the buffer from xread, and then again via format_packet just to add
the extra bytes at the beginning).

I agree with what Junio said elsewhere, that there may be a way to make
the pkt-line code handle this zero-copy situation better. Perhaps
something like:

  struct pktline {
	/* first 4 bytes are reserved for length header */
	char buf[LARGE_PACKET_MAX];
  };
  #define PKTLINE_DATA_START(pkt) ((pkt)->buf + 4)
  #define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - 4)

  ...
  struct pktline pkt;
  ssize_t len = xread(fd, PKTLINE_DATA_START(&pkt), PKTLINE_DATA_LEN);
  packet_send(&pkt, len);

Then packet_send() knows that the first 4 bytes are reserved for it. I
suspect that the strbuf used by format_packet() could get away with
using such a "struct pktline" too, though in practice I doubt there's
any real efficiency to be gained (we generally reuse the same strbuf
over and over, so it will grow once to 64K and get reused).

> +	ret &= write_in_full(out, "0000", 4) == 4;

packet_flush() ?

I know the packet functions are keen on write_or_die() versus
write_in_full().  That is perhaps something that should be fixed.

This was just supposed to be a short note about off_t before eating
dinner (oops), so I didn't read past here.

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 4/5] convert: generate large test files only once
  2016-07-27  0:06   ` [PATCH v2 4/5] convert: generate large test files only once larsxschneider
@ 2016-07-27  2:35     ` Torsten Bögershausen
  2016-07-27 13:32       ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Torsten Bögershausen @ 2016-07-27  2:35 UTC (permalink / raw)
  To: larsxschneider, git
  Cc: gitster, jnareb, mlbright, remi.galan-alfonso, pclouds, e, ramsay,
	peff



On 07/27/2016 02:06 AM, larsxschneider@gmail.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Generate a more interesting large test file with random characters in
> between and reuse this test file in multiple tests. Run tests formerly
> marked as EXPENSIVE every time but with a smaller test file.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  t/t0021-conversion.sh | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 7b45136..b9911a4 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -4,6 +4,13 @@ test_description='blob conversion via gitattributes'
>
>  . ./test-lib.sh
>
> +if test_have_prereq EXPENSIVE
> +then
> +	T0021_LARGE_FILE_SIZE=2048
> +else
> +	T0021_LARGE_FILE_SIZE=30
> +fi
> +
>  cat <<EOF >rot13.sh
>  #!$SHELL_PATH
>  tr \
> @@ -31,7 +38,15 @@ test_expect_success setup '
>  	cat test >test.i &&
>  	git add test test.t test.i &&
>  	rm -f test test.t test.i &&
> -	git checkout -- test test.t test.i
> +	git checkout -- test test.t test.i &&
> +
> +	mkdir -p generated-test-data &&
> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
> +	do
> +		# Generate 1MB of empty data and 100 bytes of random characters
> +		printf "%1048576d" 1
> +		printf "$(LC_ALL=C tr -dc "A-Za-z0-9" </dev/urandom | dd bs=$((RANDOM>>8)) count=1 2>/dev/null)"
I'm not sure how portable /dev/urandom is.
The other thing, that "really random" numbers are an overkill, and
it may be easier to use pre-defined numbers,

The rest of 1..4 looks good, I will look at 5/5 later.

> +	done >generated-test-data/large.file
>  '
>
>  script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
> @@ -199,9 +214,9 @@ test_expect_success 'required filter clean failure' '
>  test_expect_success 'filtering large input to small output should use little memory' '
>  	test_config filter.devnull.clean "cat >/dev/null" &&
>  	test_config filter.devnull.required true &&
> -	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
> -	echo "30MB filter=devnull" >.gitattributes &&
> -	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
> +	cp generated-test-data/large.file large.file &&
> +	echo "large.file filter=devnull" >.gitattributes &&
> +	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add large.file
>  '
>
>  test_expect_success 'filter that does not read is fine' '
> @@ -214,15 +229,15 @@ test_expect_success 'filter that does not read is fine' '
>  	test_cmp expect actual
>  '
>
> -test_expect_success EXPENSIVE 'filter large file' '
> +test_expect_success 'filter large file' '
>  	test_config filter.largefile.smudge cat &&
>  	test_config filter.largefile.clean cat &&
> -	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
> -	echo "2GB filter=largefile" >.gitattributes &&
> -	git add 2GB 2>err &&
> +	echo "large.file filter=largefile" >.gitattributes &&
> +	cp generated-test-data/large.file large.file &&
> +	git add large.file 2>err &&
>  	test_must_be_empty err &&
> -	rm -f 2GB &&
> -	git checkout -- 2GB 2>err &&
> +	rm -f large.file &&
> +	git checkout -- large.file 2>err &&
>  	test_must_be_empty err
>  '
>
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function
  2016-07-27  0:20     ` Junio C Hamano
@ 2016-07-27  9:13       ` Lars Schneider
  2016-07-27 16:31         ` Junio C Hamano
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-27  9:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jakub Narębski, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay, peff


> On 27 Jul 2016, at 02:20, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> `set_packet_header` converts an integer to a 4 byte hex string. Make
>> this function publicly available so that other parts of Git can easily
>> generate a pkt-line.
> 
> I think that having to do this is a strong sign that the design of
> this series is going in a wrong direction.

Thanks for the feedback. Do you think using "pkt-line" is a move into
the wrong direction in general or do you think only my usage of 
"pkt-line" is not ideal?


> If you need a helper function that writes a pkt-line format that
> behaves differently from what is already available (for example,
> packet_write()), it would be much better to design that new function
> so that it would be generally useful and add that to pkt-line.[ch],
> instead of creating random helper functions that use write(2)
> directly, bypassing pkt-line API, to write stuff.
> 
> In other words, do not _mimick_ pkt-line; enhance pkt-line as
> necessary and use it.

OK, I understand your argument. If we agree on the "pkt-line" usage
then I will address this issue.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
  2016-07-27  1:32     ` Jeff King
@ 2016-07-27  9:41     ` Eric Wong
  2016-07-29 10:38       ` Lars Schneider
  2016-07-27 23:31     ` Jakub Narębski
  2016-07-28 10:32     ` Torsten Bögershausen
  3 siblings, 1 reply; 77+ messages in thread
From: Eric Wong @ 2016-07-27  9:41 UTC (permalink / raw)
  To: larsxschneider
  Cc: git, gitster, jnareb, tboegi, mlbright, remi.galan-alfonso,
	pclouds, ramsay, peff

larsxschneider@gmail.com wrote:
> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)

I'm no expert in C, but this might be const-correctness taken
too far.  I think basing this on the read(2) prototype is less
surprising:

   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)

Also what Jeff said about off_t vs size_t, but my previous
emails may have confused you w.r.t. off_t usage...

> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)

Same comment about over const ints above.
len can probably be off_t based on what is below; but you need
to process the loop in ssize_t-friendly chunks.

> +{
> +	int ret = 1;
> +	char header[4];
> +	char buffer[8192];
> +	off_t bytes_to_write;

What Jeff said, this should be ssize_t to match read(2) and xread

> +	while (ret) {
> +		if (in >= 0) {
> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
> +			if (bytes_to_write < 0)
> +				ret &= 0;
> +			src = buffer;
> +		} else {
> +			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
> +			len -= bytes_to_write;
> +		}
> +		if (!bytes_to_write)
> +			break;

The whole ret &= .. style error handling is hard-to-follow and
here, a source of bugs.  I think the expected convention on
hitting errors is:

	1) stop whatever you're doing
	2) cleanup
	3) propagate the error to callers

"goto" is an acceptable way of accomplishing this.

For example, byte_to_write may still be negative at this point
(and interpreted as a really big number when cast to unsigned
size_t) and src/buffer could be stack garbage.

> +		set_packet_header(header, bytes_to_write + 4);
> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> +	}
> +	ret &= write_in_full(out, "0000", 4) == 4;
> +	return ret;
> +}
> +

> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const char *filter_type)
> +{

<snip>

> +	if (fd >= 0 && !src) {
> +		ret &= fstat(fd, &file_stat) != -1;
> +		len = file_stat.st_size;

Same truncation bug I noticed earlier; what I originally meant
is the `len' arg probably ought to be off_t, here, not size_t.
32-bit x86 Linux systems have 32-bit size_t (unsigned), but
large file support means off_t is 64-bits (signed).

Also, is it worth continuing this function if fstat fails?

> +	}
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +
> +	packet_write(process->in, "%s\n", filter_type);
> +	packet_write(process->in, "%s\n", path);
> +	packet_write(process->in, "%zu\n", len);

I'm not sure if "%zu" is portable since we don't do C99 (yet?)
For 64-bit signed off_t, you can probably do:

	packet_write(process->in, "%"PRIuMAX"\n", (uintmax_t)len);

Since we don't have PRIiMAX or intmax_t, here, and a negative
len would be a bug (probably from failed fstat) anyways.

> +	ret &= multi_packet_write(src, len, fd, process->in);

multi_packet_write will probably fail if fstat failed above...

> +	strbuf = packet_read_line(process->out, NULL);

And this may just block or timeout if multi_packet_write failed.


Naptime, I may look at the rest another day.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 4/5] convert: generate large test files only once
  2016-07-27  2:35     ` Torsten Bögershausen
@ 2016-07-27 13:32       ` Jeff King
  2016-07-27 16:50         ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-27 13:32 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: larsxschneider, git, gitster, jnareb, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay

On Wed, Jul 27, 2016 at 04:35:32AM +0200, Torsten Bögershausen wrote:

> > +	mkdir -p generated-test-data &&
> > +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
> > +	do
> > +		# Generate 1MB of empty data and 100 bytes of random characters
> > +		printf "%1048576d" 1
> > +		printf "$(LC_ALL=C tr -dc "A-Za-z0-9" </dev/urandom | dd bs=$((RANDOM>>8)) count=1 2>/dev/null)"
> I'm not sure how portable /dev/urandom is.
> The other thing, that "really random" numbers are an overkill, and
> it may be easier to use pre-defined numbers,

Right, there are a few reasons not to use /dev/urandom:

  - it's not portable

  - if we have to generate a lot of numbers, it drains the system's
    entropy pool, which is an unfriendly thing to do (and may also be
    slow)

  - it makes our tests random! This sounds like a good thing, but it
    means that if some input happens to cause failure, you are unlikely
    to be able to reproduce it.

Instead, use test-genrandom, which is an LCG that starts at a seed. So
you get a large amount of random-ish quickly and portably, and you get
the same data each time.

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function
  2016-07-27  9:13       ` Lars Schneider
@ 2016-07-27 16:31         ` Junio C Hamano
  0 siblings, 0 replies; 77+ messages in thread
From: Junio C Hamano @ 2016-07-27 16:31 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jakub Narębski, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay, peff

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 27 Jul 2016, at 02:20, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>> larsxschneider@gmail.com writes:
>> 
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> `set_packet_header` converts an integer to a 4 byte hex string. Make
>>> this function publicly available so that other parts of Git can easily
>>> generate a pkt-line.
>> 
>> I think that having to do this is a strong sign that the design of
>> this series is going in a wrong direction.
>
> Thanks for the feedback. Do you think using "pkt-line" is a move into
> the wrong direction in general or do you think only my usage of 
> "pkt-line" is not ideal?

I only meant this:

    If you try to produce packet-line data without using helper
    functions in pkt-line.[ch] that are designed to do so
    (presumably because the current set of helpers lack some
    capability you want to use), I am not enthused.

And I did not see a utility of a public set-packet-header helper
unless you are hand-rolling a function that produces packet-line
data outside pkt-line.[ch], hence the comment we are discussing here
was made before actually seeing how this new helper is used.

As to the use of packet-line as the data format, I do not have a
strong opinion either way.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 4/5] convert: generate large test files only once
  2016-07-27 13:32       ` Jeff King
@ 2016-07-27 16:50         ` Lars Schneider
  0 siblings, 0 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-27 16:50 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Git Mailing List, Junio C Hamano,
	Jakub Narębski, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay


> On 27 Jul 2016, at 15:32, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Jul 27, 2016 at 04:35:32AM +0200, Torsten Bögershausen wrote:
> 
>>> +	mkdir -p generated-test-data &&
>>> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
>>> +	do
>>> +		# Generate 1MB of empty data and 100 bytes of random characters
>>> +		printf "%1048576d" 1
>>> +		printf "$(LC_ALL=C tr -dc "A-Za-z0-9" </dev/urandom | dd bs=$((RANDOM>>8)) count=1 2>/dev/null)"
>> I'm not sure how portable /dev/urandom is.
>> The other thing, that "really random" numbers are an overkill, and
>> it may be easier to use pre-defined numbers,
> 
> Right, there are a few reasons not to use /dev/urandom:
> 
>  - it's not portable
> 
>  - if we have to generate a lot of numbers, it drains the system's
>    entropy pool, which is an unfriendly thing to do (and may also be
>    slow)
> 
>  - it makes our tests random! This sounds like a good thing, but it
>    means that if some input happens to cause failure, you are unlikely
>    to be able to reproduce it.
> 
> Instead, use test-genrandom, which is an LCG that starts at a seed. So
> you get a large amount of random-ish quickly and portably, and you get
> the same data each time.

Thank you! That's exactly what I need here :-)

- Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  1:32     ` Jeff King
@ 2016-07-27 17:31       ` Lars Schneider
  2016-07-27 18:11         ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-27 17:31 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay


> On 27 Jul 2016, at 03:32, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Jul 27, 2016 at 02:06:05AM +0200, larsxschneider@gmail.com wrote:
> 
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>> +{
>> +	off_t bytes_read;
>> +	off_t total_bytes_read = 0;
> 
> I haven't looked carefully at the whole patch yet, but there seems to be
> some type issues here. off_t is a good type for storing the whole size
> of a file (which may be larger than the amount of memory we can
> allocate). But size_t is the right size for an in-memory object.
> 
> This function takes a size_t size, which makes sense if it is meant to
> read everything into a strbuf.
> 
> So I think our total_bytes_read would probably want to be a size_t here,
> too, because it cannot possibly grow larger than that (and that is
> enforced by the loop below). Otherwise you get weirdness like "sb->buf +
> total_bytes_ref" possibly overflowing memory.

OK


>> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
> 
> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
> 32-bit system)?

Would that be an acceptable solution?

if (size + 1 > SIZE_MAX)
	return die("unrepresentable length for filter buffer");

Can you point me to an example in the Git source how this kind of thing should
be handled?


>> +	do {
>> +		bytes_read = packet_read(
>> +			fd, NULL, NULL,
>> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
>> +			PACKET_READ_GENTLE_ON_EOF
>> +		);
> 
> packet_read() actually returns an int, and may return "-1" on EOF (and
> int is fine because we know that we are constrained to 16-bit values
> by the pkt-line definition). You read it into an "off_t". I _think_ that
> is OK, because I believe POSIX says off_t must be signed. But probably
> "int" is the more correct type here.

OK


>> +		total_bytes_read += bytes_read;
> 
> If you do get "-1", I think you need to detect it here before adjusting
> total_bytes_read.

Correct!


>> +	while (
>> +		bytes_read > 0 && 					// the last packet was no flush
>> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
>> +	);
> 
> And I'm not sure if you need to distinguish between "0" and "-1" when
> checking byte_read here.

We want to finish reading in both cases, no?


> 
>> +	strbuf_setlen(sb, total_bytes_read);
> 
> Passing an off_t to something expecting a size_t, which can involve
> truncation (though I think in practice you really are limited to
> size_t).

OK


>> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
>> +{
>> +	int ret = 1;
>> +	char header[4];
>> +	char buffer[8192];
>> +	off_t bytes_to_write;
>> +	while (ret) {
>> +		if (in >= 0) {
>> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
> 
> Likewise here, xread() is returning ssize_t. Again, OK if we can assume
> off_t is signed, but it probably makes sense to use the correct type (we
> also know it cannot be larger than 8K, of course).

OK


> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
> why not take advantage of that and minimize the framing overhead for
> large data?

I took inspiration from here for 8K MAX_IO_SIZE:
https://github.com/git/git/blob/master/copy.c#L6

Is this read limit correct? Should I read 8 times to fill a pkt-line?


>> +			if (bytes_to_write < 0)
>> +				ret &= 0;
> 
> I think "&= 0" is unusual for our codebase? Would just writing "= 0" be
> more clear?

Yes!


> We do sometimes do "ret |= something()" but that is in cases where
> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
> your function's error-reporting is inverted from our usual style?

I thought it makes the code easier to read and the filter doesn't care
at what point the error happens anyways. The filter either succeeds
or fails. What style would you suggest?


>> +		set_packet_header(header, bytes_to_write + 4);
>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
>> +	}
> 
> If you look at format_packet(), it pulls a slight trick: we have a
> buffer 4 bytes larger than we need, format into "buf + 4", and then
> write the final size at the beginning. That lets us write() it all in
> one go.
> 
> At first I thought this function was simply reinventing packet_write(),
> but I guess you are trying to avoid the extra copy of the data (once
> into the buffer from xread, and then again via format_packet just to add
> the extra bytes at the beginning).

Yes, that was my intention.


> I agree with what Junio said elsewhere, that there may be a way to make
> the pkt-line code handle this zero-copy situation better. Perhaps
> something like:
> 
>  struct pktline {
> 	/* first 4 bytes are reserved for length header */
> 	char buf[LARGE_PACKET_MAX];
>  };
>  #define PKTLINE_DATA_START(pkt) ((pkt)->buf + 4)
>  #define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - 4)
> 
>  ...
>  struct pktline pkt;
>  ssize_t len = xread(fd, PKTLINE_DATA_START(&pkt), PKTLINE_DATA_LEN);
>  packet_send(&pkt, len);
> 
> Then packet_send() knows that the first 4 bytes are reserved for it. I
> suspect that the strbuf used by format_packet() could get away with
> using such a "struct pktline" too, though in practice I doubt there's
> any real efficiency to be gained (we generally reuse the same strbuf
> over and over, so it will grow once to 64K and get reused).

OK, I will try that.


>> +	ret &= write_in_full(out, "0000", 4) == 4;
> 
> packet_flush() ?
> 
> I know the packet functions are keen on write_or_die() versus
> write_in_full().  That is perhaps something that should be fixed.

Yes, the write_or_die calls were the reason for the manual packet flush. 
I will propose a change for these functions to accommodate non "required"
filters as it is OK when they fail. 


> This was just supposed to be a short note about off_t before eating
> dinner (oops), so I didn't read past here.

Thank you :-)

- Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27 17:31       ` Lars Schneider
@ 2016-07-27 18:11         ` Jeff King
  2016-07-28 12:10           ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-27 18:11 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay

On Wed, Jul 27, 2016 at 07:31:26PM +0200, Lars Schneider wrote:

> >> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
> > 
> > What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
> > 32-bit system)?
> 
> Would that be an acceptable solution?
> 
> if (size + 1 > SIZE_MAX)
> 	return die("unrepresentable length for filter buffer");

No, because by definition "size" will wrap to 0. :)

You have to do:

  if (size > SIZE_MAX - 1)
	die("whoops");

> Can you point me to an example in the Git source how this kind of thing should
> be handled?

The strbuf code itself checks for overflows. So you could do:

  strbuf_grow(sb, size);
  ... fill up with size bytes ...
  strbuf_addch(sb, ...); /* extra byte for whatever */

That does mean _possibly_ making a second allocation just to add the
extra byte, but in practice it's not likely (unless the input exactly
matches the strbuf's growth pattern).

If you want to do it yourself, I think:

  strbuf_grow(sb, st_add(size, 1));

would work.

> >> +	while (
> >> +		bytes_read > 0 && 					// the last packet was no flush
> >> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
> >> +	);
> > 
> > And I'm not sure if you need to distinguish between "0" and "-1" when
> > checking byte_read here.
> 
> We want to finish reading in both cases, no?

If we get "-1", that's from an unexpected EOF during the packet_read(),
because you set GENTLE_ON_EOF. So there's nothing left to read, and we
should break and return an error.

I guess "0" would come from a flush packet? Why would the filter send
back a flush packet (unless you were using them to signal end-of-input,
but then why does the filter have to send back the number of bytes ahead
of time?).

> > Why 8K? The pkt-line format naturally restricts us to just under 64K, so
> > why not take advantage of that and minimize the framing overhead for
> > large data?
> 
> I took inspiration from here for 8K MAX_IO_SIZE:
> https://github.com/git/git/blob/master/copy.c#L6
> 
> Is this read limit correct? Should I read 8 times to fill a pkt-line?

MAX_IO_SIZE is generally 8 _megabytes_, not 8K. The loop in copy.c just
haad to pick an arbitrary size for doing its read/write proxying.  I
think in practice you are not likely to get much benefit from going
beyond 8K or so there, just because operating systems tend to do things
in page-sizes anyway, which are usually 4K.

But since you are formatting the result into a form that has framing
overhead, anything up to LARGE_PACKET_MAX will see benefits (though
admittedly even 4 bytes per 8K is not much).

I don't think it's worth the complexity of reading 8 times, but just
using a buffer size of LARGE_PACKET_MAX-4 would be the most efficient.

I doubt it matters _that much_ in practice, but any time I see a magic
number I have to wonder at the "why". At least basing it on
LARGE_PACKET_MAX has some basis, whereas 8K is largely just made-up. :)

> > We do sometimes do "ret |= something()" but that is in cases where
> > "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
> > your function's error-reporting is inverted from our usual style?
> 
> I thought it makes the code easier to read and the filter doesn't care
> at what point the error happens anyways. The filter either succeeds
> or fails. What style would you suggest?

I think that's orthogonal. I just mean that using zero for success puts
you in our usual style, and then accumulating errors can be done with
"|=".

I didn't look carefully at whether the accumulating style you're using
makes sense or not. But something like:

> >> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
> >> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;

does mean that we call the second write() even if the first one failed.
That's a waste of time (albeit a minor one), but it also means you could
potentially cover up the value of "errno" from the first one (though in
practice I'd expect the second one to fail for the same reason).

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
                     ` (4 preceding siblings ...)
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
@ 2016-07-27 19:08   ` Jakub Narębski
  2016-07-28  7:16     ` Lars Schneider
  5 siblings, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-27 19:08 UTC (permalink / raw)
  To: larsxschneider, git
  Cc: gitster, tboegi, mlbright, remi.galan-alfonso, pclouds, e, ramsay,
	peff

W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Hi,
> 
> thanks a lot for the extensive reviews. I tried to address all mentioned
> concerns and summarized them below. The most prominent changes since v1 are
> the following:
> * Git offers a number of filter capabilities that a filter can request
>   (right now only "smudge" and "clean" - in the future maybe "cleanFromFile",
>   "smudgeToFile", and/or "stream")
> * pipe communication uses a packet format (pkt-line) based protocol

I wonder if it would make sense to support both whole-file pipe communication,
and packet format (pkt-line) pipe communication.

The problem with whole-file pipe communication (original proposal for
new filter protocol is that it needs file size upfront.  For some types
of filters it is not a problem:
 - if a filtered file has the same size as original, like for rot13
   example in the test for the feature
 - if you can calculate the resulting file size from original size,
   like for most if not all encryption formats (that includes GPG,
   uudecode, base64, quoted-printable, hex, etc.); same for decryption,
   and from converting between fixed-width encodings
 - if resulting file size is saved somewhere that is easy to get, like
   for LFS solutions (I think).

For other filters it is serious problem.  For example indent, keyword
expansion, rezipping with zero compression (well, maybe not this one,
but at least the reverse of it), converting between encodings where
at least one is variable width (like UTF-8),...

IMHO writing whole-file persistent filters is easier than using pkt-line.
On the other hand using pkt-line allow for more detailed progress report.

> * a long running filter application is defined with "filter.<driver>.process"

I hope that won't confuse Git users into trying to use single-shot
filters with a new protocol...

> ## Torsten:
> * add "\n" line terminator after version in init sequence
> * prepare big file for EXPENSIVE tests once
> * set "#!/usr/bin/perl" as shebang for rot13.pl to mimic other Perl test scripts
> * add test_have_prereq PERL to t0021
> 
> ## Ramsay:
> * use write_in_full(process->in, nbuf.buf, nbuf.len) to avoid unneccesary strlen call
> * use read_in_full to read data that exceeds MAX_IO_SIZE properly
> * fix test case to check for large file filtering
> 
> ## Jakub:
> * use standard input/standard output instead of stdin/stdout [in description/documentation]
> * replace global variable "cmd_process_map" with a function parameter where possible
> * avoid "strbuf_reset" after STRBUF_INIT
> * align test_config_global
> * rename rot13.pl to rot13-filter.pl
> * make Perl style consistent
> * describe hard coded filenames in test filter header
> * improve docs
> * add filter capabilities field (enables cleanToFile, smudgeFromFile, and/or stream later)
> * explain that content size in bytes is encoded in ASCII
> * consistent line ending for die call in Perl (without "\n")
> * make rot13 test filter die in case of failure (instead of returning "fail")
> 
> ## Eric:
> * flush explicitly in Perl test filter
> * do not initialize variables to NULL if they are set unconditionally
> * fix no-op stop_protocol_filter
> * use off_t instead of size_t
> * improve test filter int parsing ($filelen =~ /\A\d+\z/ or die "bad filelen: $filelen")
> 
> ## Peff:
> * use pkt-line protocol
> * do not use Perl autodie
> 
> ## Remi:
> * remove spaces after '<'



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 1/5] convert: quote filter names in error messages
  2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
@ 2016-07-27 20:01     ` Jakub Narębski
  2016-07-28  8:23       ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-27 20:01 UTC (permalink / raw)
  To: larsxschneider, git
  Cc: gitster, tboegi, mlbright, remi.galan-alfonso, pclouds, e, ramsay,
	peff

W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:

> Git filter with spaces (e.g. `filter.sh foo`) are hard to read in
> error messages. Quote them to improve the readability.

This is not something very important, but the above commit message
feels a bit clunky to me.  The change is easy to understand, though,
so the commit message style is not that important.

Perhaps "Git filter driver command with spaces"?

Well, nevermind; if you have a better idea, good.  If not, it is
good enough, IMHO.
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
  2016-07-27  1:32     ` Jeff King
  2016-07-27  9:41     ` Eric Wong
@ 2016-07-27 23:31     ` Jakub Narębski
  2016-07-29  8:04       ` Lars Schneider
  2016-07-28 10:32     ` Torsten Bögershausen
  3 siblings, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-27 23:31 UTC (permalink / raw)
  To: larsxschneider, git
  Cc: gitster, tboegi, mlbright, remi.galan-alfonso, pclouds, e, ramsay,
	peff

W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.

It is not strictly necessary... but do we have any benchmarks for this,
or is it just the feeling?  That is, in what situations Git may filter
a large number of files (initial checkout? initial add?, switching
to unrelated branch? getting large files from LFS solution?, and when
startup time might become significant part of execution time (MS Windows?
fast filters?)?

> 
> This patch adds the filter.<driver>.process string option which, if used,

String option... what are possible values?  What happens if you use
value that is not recognized by Git (it is "if used", isn't it)?  That's
not obvious from the commit message (though it might be from the docs).

What is missing is the description that it is set to a command, and
how it interacts with `clean` and `smudge` options.

> keeps the external filter process running and processes all blobs with
> the following packet format (pkt-line) based protocol over standard input
> and standard output.
> 
> Git starts the filter on first usage and expects a welcome
> message, protocol version number, and filter capabilities
> seperated by spaces:

s/seperated/separated/

Is there any handling of misconfigured one-shot filters, or would
they still hang the execution of a Git command?

> ------------------------
> packet:          git< git-filter-protocol
> packet:          git< version 2
> packet:          git< clean smudge

Wouldn't "capabilities clean smudge" be better?  Or is it the
"clean smudge" proposal easier to handle?

> ------------------------
> Supported filter capabilities are "clean" and "smudge".
> 
> Afterwards Git sends a command (e.g. "smudge" or "clean" - based
> on the supported capabilities), the filename, the content size as
> ASCII number in bytes, and the content in packet format with a
> flush packet at the end:
> ------------------------
> packet:          git> smudge
> packet:          git> testfile.dat

And here we don't have any problems with files containing embedded
newlines etc.  Also Git should not be sending invalid file names.
The question remains: is it absolute file path, or basename?

> packet:          git> 7
> packet:          git> CONTENT

Can Git send file contents using more than one packet?  I think
it should be stated upfront.

> packet:          git> 0000
> ------------------------

Why we need to send content size upfront?  Well, it is not a problem
for Git, but (as I wrote in reply to the cover letter for this
series) might be a problem for filter scripts.

> 
> The filter is expected to respond with the result content size as
> ASCII number in bytes and the result content in packet format with
> a flush packet at the end:
> ------------------------
> packet:          git< 57

This is not neccessary (and might be hard for scripts to do) if
pkt-line protocol is used.

In short: I think pkt-line is not worth the complication on
the Git side and on the filter size, unless it is used for
streaming, or at least filter not having to calculate output
size upfront.

> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> ------------------------
> Please note: In a future version of Git the capability "stream"
> might be supported. In that case the content size must not be
> part of the filter response.
> 
> Afterwards the filter is expected to wait for the next command.

When filter is supposed to exit, then?

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt |  54 +++++++-
>  convert.c                       | 269 ++++++++++++++++++++++++++++++++++++++--
>  t/t0021-conversion.sh           | 175 ++++++++++++++++++++++++++
>  t/t0021/rot13-filter.pl         | 146 ++++++++++++++++++++++
>  4 files changed, 631 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 8882a3e..8fb40d2 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If a long running filter process (see section
> +below) is used then Git can process all blobs with a single filter
> +invocation for the entire life of a single Git command (e.g.
> +`git add .`).

Ah, all right, here we give an example.

But, is "blob" term used in this document, or do we use "file"
and "file contents" only?

>  
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -375,6 +379,54 @@ substitution.  For example:
>  ------------------------
>  
>  
> +Long Running Filter Process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +If the filter command (string value) is defined via
> +filter.<driver>.process then Git can process all blobs with a
> +single filter invocation for the entire life of a single Git
> +command by talking with the following packet format (pkt-line)
> +based protocol over standard input and standard output.

Ah, so now it is the name of command, and I assume it is
exclusive with `clean` / `smudge`, or does it only takes
precedence based on capabilities of filter (that is if
for example "`process`" does not include 'clean' capability,
then `clean` filter is used, using per-file "protocol").
Or do something different happens (like preference for
old-style `clean` and `smudge` filters, and `process`
used if any unset)?

Anyway, Git command would never (I think) run both
"clean" and "smudge" filters.  But I might be wrong.

Yeah, I know this going back and forth seems like 
bike-shedding, but designing good user-facing API
is very, very important.

> +
> +Git starts the filter on first usage and expects a welcome
> +message, protocol version number, and filter capabilities
> +seperated by spaces:
> +------------------------
> +packet:          git< git-filter-protocol
> +packet:          git< version 2
> +packet:          git< clean smudge
> +------------------------

Neither of those is terminated by end of line character,
that is, "\n", isn't it?

> diff --git a/convert.c b/convert.c
> index 522e2c5..5ff200b 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  #include "quote.h"
>  #include "sigchain.h"
> +#include "pkt-line.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -481,11 +482,232 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return ret;
>  }
>  
> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)

What's the purpose of this function?  Is it to gather read whole
contents of file into strbuf?  Or is it to read at most 'size'
bytes of file / of pkt-line stream into strbuf?

We probably don't want to keep the whole file in memory,
if possible.

> +{
> +	off_t bytes_read;
> +	off_t total_bytes_read = 0;
> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush

Why we put packet flush into strbuf?  Or is it only temporarily,
and we adjust that at the end... I see that it is.

> +	do {
> +		bytes_read = packet_read(
> +			fd, NULL, NULL,
> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
> +			PACKET_READ_GENTLE_ON_EOF
> +		);
> +		total_bytes_read += bytes_read;
> +	}
> +	while (
> +		bytes_read > 0 && 					// the last packet was no flush
> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
> +	);
> +	strbuf_setlen(sb, total_bytes_read);
> +	return total_bytes_read;
> +}
> +
> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)

What's the purpose of this function?  What are those 'in' and 'out'
parameters?  Those names do not describe them well.  If they are
file descriptors, add fd_* prefix (or whatever Git code uses).
Edit: I see that's what existing code uses.

Edit: so we are reading from *src + len or from fd_in, depending on
whether fd_in is set to 0 or not?  I guess that follows existing
code, where it is even worse, because it is hidden...

> +{
> +	int ret = 1;
> +	char header[4];
> +	char buffer[8192];

Could those two be in one variable?  Also, 'header' or 'pkt_header'?

Why 8192, and not LARGE_PACKET_MAX - 4?

> +	off_t bytes_to_write;
> +	while (ret) {
> +		if (in >= 0) {
> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
> +			if (bytes_to_write < 0)
> +				ret &= 0;
> +			src = buffer;
> +		} else {
> +			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
> +			len -= bytes_to_write;
> +		}
> +		if (!bytes_to_write)
> +			break;
> +		set_packet_header(header, bytes_to_write + 4);
> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;

These three lines are equivalent to write_packet(), or however
it is named, isn't it?

> +	}
> +	ret &= write_in_full(out, "0000", 4) == 4;

This is equivalent to packet_flush(), or however it is named,
isn't it?

> +	return ret;
> +}
> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	const char *cmd;
> +	int clean;
> +	int smudge;

These two are 'int' used as 'bool', isn't it?

> +	struct child_process process;
> +};
[...]
> +static struct cmd2process *find_protocol_filter_entry(struct hashmap *hashmap, const char *cmd)

Wouldn't it be more descriptive to name the first parameter
to this function 'cmd_hashmap', or something like that, rather
than plain 'hashmap' (it might be the same that is used / was
used for a global variable)?

Edit: or 'cmd_process_map'.

> +{
> +	struct cmd2process k;
> +	hashmap_entry_init(&k, strhash(cmd));
> +	k.cmd = cmd;
> +	return hashmap_get(hashmap, &k, NULL);
> +}

[...]
> +static struct cmd2process *start_protocol_filter(struct hashmap *hashmap, const char *cmd)
> +{
> +	int ret = 1;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	const char *argv[] = { NULL, NULL };

Could we initialize it with  { cmd, NULL };?

Edit: Ah, I see that you follow filter_buffer_or_fd() example from
convert.c, isn't it?

> +	struct string_list capabilities = STRING_LIST_INIT_NODUP;
> +	char *capabilities_buffer;
> +	int i;
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	entry->clean = 0;
> +	entry->smudge = 0;

Wouldn't

   	entry->clean = entry->smudge = 0;

be more readable?

> +	process = &entry->process;
> +
> +	child_process_init(process);
> +	argv[0] = cmd;
> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;

Maybe

  +	process->in  = -1;

to align, but perhaps it is not worth it.

> +	process->out = -1;
> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external persistent filter '%s'", cmd);

Just a question: is "cannot fork" the only reason why start_command()
might have failed there?

Edit: Ah, I see that you follow filter_buffer_or_fd() example from
convert.c, again.

> +		stop_protocol_filter(hashmap, entry);
> +		return NULL;
> +	}
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	ret &= strcmp(packet_read_line(process->out, NULL), "git-filter-protocol") == 0;
> +	ret &= strcmp(packet_read_line(process->out, NULL), "version 2") == 0;

So that's why you need packet_read_line() to return string...

> +	capabilities_buffer = packet_read_line(process->out, NULL);
> +	sigchain_pop(SIGPIPE);
> +
> +	string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);

This does not modify capabilities_buffer, does it?

> +	for (i = 0; i < capabilities.nr; i++) {
> +		const char *requested = capabilities.items[i].string;
> +		if (!strcmp(requested, "clean")) {
> +			entry->clean = 1;
> +		} else if (!strcmp(requested, "smudge")) {
> +			entry->smudge = 1;
> +		} else {
> +			warning(
> +				"filter process '%s' requested unsupported filter capability '%s'",
> +				cmd, requested
> +			);

Nice.  This makes it (somewhat) forward- and backward-compatibile.

> +		}
> +	}
> +	string_list_clear(&capabilities, 0);
> +
> +	if (!ret) {
> +		error("initialization for external persistent filter '%s' failed", cmd);

Do we need more detailed information about the source of error?

> +		stop_protocol_filter(hashmap, entry);
> +		return NULL;
> +	}
> +
> +	hashmap_add(hashmap, entry);
> +	return entry;
> +}
> +
> +static int cmd_process_map_init = 0;
> +static struct hashmap cmd_process_map;
> +
> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const char *filter_type)

That is... quite a lot of parameters.  But I guess there is precedens.
But I think 'fd' belongs to previous line, as it is alternative to
src+len.

> +{
> +	int ret = 1;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	struct stat file_stat;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	off_t expected_bytes;
> +	char *strtol_end;
> +	char *strbuf;
> +
> +	if (!cmd || !*cmd)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;
> +
> +	if (!cmd_process_map_init) {
> +		cmd_process_map_init = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +		entry = NULL;

Is it better than having entry NULL-initialized?

> +	} else {
> +		entry = find_protocol_filter_entry(&cmd_process_map, cmd);
> +	}
> +
> +	if (!entry) {
> +		entry = start_protocol_filter(&cmd_process_map, cmd);

Hmmm... apply_filter() uses start_async() for some reason.  Why
it does not apply for this new kind of filter?

> +		if (!entry) {
> +			return 0;
> +		}
> +	}
> +	process = &entry->process;
> +
> +	if (!(!strcmp(filter_type, "clean") && entry->clean) &&
> +		!(!strcmp(filter_type, "smudge") && entry->smudge)) {

Would it be more readable as !(A || B) rather than (!A && !B)?

> +		return 0;
> +	}
> +
> +	if (fd >= 0 && !src) {
> +		ret &= fstat(fd, &file_stat) != -1;
> +		len = file_stat.st_size;
> +	}

All right, so we either use src+len,  or if we use fd we get
file size.

> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +
> +	packet_write(process->in, "%s\n", filter_type);
> +	packet_write(process->in, "%s\n", path);
> +	packet_write(process->in, "%zu\n", len);

So "\n" is included in protocol?

> +	ret &= multi_packet_write(src, len, fd, process->in);

How git-receive-pack etc. handle multi-packet write?

> +
> +	strbuf = packet_read_line(process->out, NULL);
> +	expected_bytes = (off_t)strtol(strbuf, &strtol_end, 10);
> +	ret &= (strtol_end != strbuf && errno != ERANGE);
> +
> +	if (expected_bytes > 0) {
> +		ret &= multi_packet_read(&nbuf, process->out, expected_bytes) == expected_bytes;
> +	}
> +
> +	sigchain_pop(SIGPIPE);
> +
> +	if (ret) {
> +		strbuf_swap(dst, &nbuf);
> +	} else {
> +		// Something went wrong with the protocol filter. Force shutdown!
> +		stop_protocol_filter(&cmd_process_map, entry);

Some error message would be nice... or do we print in down in stack?

> +	}
> +	strbuf_release(&nbuf);
> +	return ret;
> +}
> +


[...]
> @@ -823,7 +1049,10 @@ int would_convert_to_git_filter_fd(const char *path)
>  	if (!ca.drv->required)
>  		return 0;
>  
> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> +	if (!ca.drv->clean && ca.drv->process)
> +		return apply_protocol_filter(path, NULL, 0, -1, NULL, ca.drv->process, "clean");
> +	else
> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);

So the rule is: if `clean` is not set, and `process` is, try to use
process for cleaning.  It was not clear for me from the documentation.

>  }
>  
>  const char *get_convert_attr_ascii(const char *path)
> @@ -856,17 +1085,22 @@ int convert_to_git(const char *path, const char *src, size_t len,
>                     struct strbuf *dst, enum safe_crlf checksafe)
>  {
>  	int ret = 0;
> -	const char *filter = NULL;
> +	const char *clean_filter = NULL;
> +	const char *process_filter = NULL;
>  	int required = 0;
>  	struct conv_attrs ca;
>  
>  	convert_attrs(&ca, path);
>  	if (ca.drv) {
> -		filter = ca.drv->clean;
> +		clean_filter = ca.drv->clean;
> +		process_filter = ca.drv->process;
>  		required = ca.drv->required;
>  	}
>  
> -	ret |= apply_filter(path, src, len, -1, dst, filter);
> +	if (!clean_filter && process_filter)
> +		ret |= apply_protocol_filter(path, src, len, -1, dst, process_filter, "clean");
> +	else
> +		ret |= apply_filter(path, src, len, -1, dst, clean_filter);

And the rule is the same here, as it should.

>  	if (!ret && required)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);

Is it a correct error message for `process`?  I guess it is, as it prints
the name of driver, and not attempted command.  Well, we might be using
"process" filter in 'clean' mode,... but that is sophistry.

[...]
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index b9911a4..c4793ed 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -4,6 +4,11 @@ test_description='blob conversion via gitattributes'
>  
>  . ./test-lib.sh
>  
> +if ! test_have_prereq PERL; then
> +	skip_all='skipping perl interface tests, perl not available'
> +	test_done
> +fi

Do all tests require Perl?

> +test_expect_success 'required protocol filter should filter data' '
[...]
> +test_expect_success 'protocol filter large file' '
[...]
> +test_expect_success 'required protocol filter should fail with clean' '
[...]
> +test_expect_success 'protocol filter should restart after failure' '
[...]

> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
> new file mode 100755
> index 0000000..7176836
> --- /dev/null
> +++ b/t/t0021/rot13-filter.pl
> @@ -0,0 +1,146 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git filter protocol version 2
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +# This implementation supports two special test cases:
> +# (1) If data with the filename "clean-write-fail.r" is processed with
> +#     a "clean" operation then the write operation will die.
> +# (2) If data with the filename "smudge-write-fail.r" is processed with
> +#     a "smudge" operation then the write operation will die.

Nice.

> +#
> +
> +use strict;
> +use warnings;
> +
> +my $MAX_PACKET_CONTENT_SIZE = 65516;
> +
> +sub rot13 {
> +    my ($str) = @_;
> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
> +    return $str;
> +}
> +
> +sub packet_read {
> +    my $buffer;
> +    my $bytes_read = read STDIN, $buffer, 4;
> +    if ( $bytes_read == 0 ) {
> +        return;
> +    }
> +    elsif ( $bytes_read != 4 ) {

This is a bit untypical bracket style...

> +        die "invalid packet size '$bytes_read' field";
> +    }
> +    my $pkt_size = hex($buffer);
> +    if ( $pkt_size == 0 ) {
> +        return ( 1, "" );
> +    }
> +    elsif ( $pkt_size > 4 ) {
> +        my $content_size = $pkt_size - 4;
> +        $bytes_read = read STDIN, $buffer, $content_size;
> +        if ( $bytes_read != $content_size ) {
> +            die "invalid packet";
> +        }
> +        return ( 0, $buffer );
> +    }
> +    else {
> +        die "invalid packet size";
> +    }
> +}

So packet reading is not that difficult...

> +
> +sub packet_write {
> +    my ($packet) = @_;
> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
> +    print STDOUT $packet;
> +    STDOUT->flush();
> +}

...and packet write is easy.

> +
> +sub packet_flush {
> +    print STDOUT sprintf( "%04x", 0 );
> +    STDOUT->flush();
> +}
> +
> +open my $debug, ">>", "output.log";
> +print $debug "start\n";
> +$debug->flush();
> +
> +packet_write("git-filter-protocol\n");
> +packet_write("version 2\n");
> +packet_write("clean smudge\n");
> +print $debug "wrote filter header\n";
> +$debug->flush();

Isn't $debug flushed automatically?

> +
> +while (1) {
> +    my $command = packet_read();
> +    unless ( defined($command) ) {
> +        exit();
> +    }
> +    chomp $command;
> +    print $debug "IN: $command";
> +    $debug->flush();
> +    my $filename = packet_read();
> +    chomp $filename;
> +    print $debug " $filename";
> +    $debug->flush();
> +    my $filelen = packet_read();
> +    chomp $filelen;
> +    print $debug " $filelen";
> +    $debug->flush();
> +
> +    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
> +    my $output;
> +
> +    if ( $filelen > 0 ) {
> +        my $input = "";
> +        {
> +            binmode(STDIN);
> +            my $buffer;
> +            my $done = 0;
> +            while ( !$done ) {
> +                ( $done, $buffer ) = packet_read();
> +                $input .= $buffer;
> +            }
> +            print $debug " [OK] -- ";
> +            $debug->flush();
> +        }
> +
> +        if ( $command eq "clean" ) {
> +            $output = rot13($input);
> +        }
> +        elsif ( $command eq "smudge" ) {
> +            $output = rot13($input);
> +        }
> +        else {
> +            die "bad command";

Perhaps

               die "bad command $command";

> +        }
> +    }
> +
> +    my $output_len = length($output);
> +    packet_write("$output_len\n");
> +    print $debug "OUT: $output_len ";
> +    $debug->flush();
> +    if ( $output_len > 0 ) {
> +        if (   ( $command eq "clean" and $filename eq "clean-write-fail.r" )

What happened here with whitespace around parentheses?

> +            or
> +            ( $command eq "smudge" and $filename eq "smudge-write-fail.r" ) )
> +        {
> +            print $debug " [FAIL]\n";
> +            $debug->flush();
> +            die "write error";
> +        }
> +        else {
> +            while ( length($output) > 0 ) {
> +                my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
> +                packet_write($packet);
> +                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
> +                    $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
> +                }
> +                else {
> +                    $output = "";
> +                }
> +            }
> +            packet_flush();
> +            print $debug "[OK]\n";
> +            $debug->flush();
> +        }
> +    }
> +}
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-27 19:08   ` [PATCH v2 0/5] Git filter protocol Jakub Narębski
@ 2016-07-28  7:16     ` Lars Schneider
  2016-07-28 10:42       ` Jakub Narębski
  2016-07-28 13:29       ` Jeff King
  0 siblings, 2 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-28  7:16 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, gitster, tboegi, mlbright, remi.galan-alfonso,
	pclouds, e, ramsay, peff


> On 27 Jul 2016, at 21:08, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Hi,
>> 
>> thanks a lot for the extensive reviews. I tried to address all mentioned
>> concerns and summarized them below. The most prominent changes since v1 are
>> the following:
>> * Git offers a number of filter capabilities that a filter can request
>>  (right now only "smudge" and "clean" - in the future maybe "cleanFromFile",
>>  "smudgeToFile", and/or "stream")
>> * pipe communication uses a packet format (pkt-line) based protocol
> 
> I wonder if it would make sense to support both whole-file pipe communication,
> and packet format (pkt-line) pipe communication.
> 
> The problem with whole-file pipe communication (original proposal for
> new filter protocol is that it needs file size upfront.  For some types
> of filters it is not a problem:
> - if a filtered file has the same size as original, like for rot13
>   example in the test for the feature
> - if you can calculate the resulting file size from original size,
>   like for most if not all encryption formats (that includes GPG,
>   uudecode, base64, quoted-printable, hex, etc.); same for decryption,
>   and from converting between fixed-width encodings
> - if resulting file size is saved somewhere that is easy to get, like
>   for LFS solutions (I think).
> 
> For other filters it is serious problem.  For example indent, keyword
> expansion, rezipping with zero compression (well, maybe not this one,
> but at least the reverse of it), converting between encodings where
> at least one is variable width (like UTF-8),...
> 
> IMHO writing whole-file persistent filters is easier than using pkt-line.
> On the other hand using pkt-line allow for more detailed progress report.

I initially wanted to support only "while-file" pipe, too.

But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
solution (gmane is down - otherwise I would have given you the links).

After I have looked at it I think the pkt-line solution is indeed nicer
for the following reasons:

(1) A stream optimized version (read/write in separate threads) of the filter
    protocol can be implemented in the future without changing the protocol
(2) pkt-line is a simple and easy to implement format
(3) Reuse of existing Git communication infrastructure
    -> code and documentation are less surprising to people that know Git
    -> you can use GIT_TRACE_PACKET to easily inspect the
       communication between Git and the filter process
(4) The overheads is neglect able (4 byte header vs 65516 byte content)


>> * a long running filter application is defined with "filter.<driver>.process"
> 
> I hope that won't confuse Git users into trying to use single-shot
> filters with a new protocol...

Yes, that was my intention for this new config.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 1/5] convert: quote filter names in error messages
  2016-07-27 20:01     ` Jakub Narębski
@ 2016-07-28  8:23       ` Lars Schneider
  0 siblings, 0 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-28  8:23 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, gitster, tboegi, mlbright, remi.galan-alfonso, pclouds, e,
	ramsay, peff


> On 27 Jul 2016, at 22:01, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
> 
>> Git filter with spaces (e.g. `filter.sh foo`) are hard to read in
>> error messages. Quote them to improve the readability.
> 
> This is not something very important, but the above commit message
> feels a bit clunky to me.  The change is easy to understand, though,
> so the commit message style is not that important.
> 
> Perhaps "Git filter driver command with spaces"?
> 
> Well, nevermind; if you have a better idea, good.  If not, it is
> good enough, IMHO.

Agreed. I will fix this with the next roll.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
                       ` (2 preceding siblings ...)
  2016-07-27 23:31     ` Jakub Narębski
@ 2016-07-28 10:32     ` Torsten Bögershausen
  3 siblings, 0 replies; 77+ messages in thread
From: Torsten Bögershausen @ 2016-07-28 10:32 UTC (permalink / raw)
  To: larsxschneider, git
  Cc: gitster, jnareb, mlbright, remi.galan-alfonso, pclouds, e, ramsay,
	peff


On 07/27/2016 02:06 AM, larsxschneider@gmail.com wrote:
Some comments inline
[]
 > The filter is expected to respond with the result content size as
 > ASCII number in bytes and the result content in packet format with
 > a flush packet at the end:
 > ------------------------
 > packet:          git< 57
 > packet:          git< SMUDGED_CONTENT
 > packet:          git< 0000
how does the filter report possible errors here ?
Let's say I want to convert UTF-8 into ISO-8859-1,
but the conversion is impossible.

  > packet:          git< -1
  > packet:          git< Error message, (or empty), followed by a '\n'
  > packet:          git< 0000

Side note: a filter may return length 0.
Suggestion: add 2 test cases.


 > ------------------------
 > Please note: In a future version of Git the capability "stream"
 > might be supported. In that case the content size must not be
 > part of the filter response.
 >
 > Afterwards the filter is expected to wait for the next command.
 >
 > Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
 > Helped-by: Martin-Louis Bright <mlbright@gmail.com>
 > Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
 > ---
 >  Documentation/gitattributes.txt |  54 +++++++-
 >  convert.c                       | 269 
++++++++++++++++++++++++++++++++++++++--
 >  t/t0021-conversion.sh           | 175 ++++++++++++++++++++++++++
 >  t/t0021/rot13-filter.pl         | 146 ++++++++++++++++++++++
 >  4 files changed, 631 insertions(+), 13 deletions(-)
 >  create mode 100755 t/t0021/rot13-filter.pl
 >
 > diff --git a/Documentation/gitattributes.txt 
b/Documentation/gitattributes.txt
 > index 8882a3e..8fb40d2 100644
 > --- a/Documentation/gitattributes.txt
 > +++ b/Documentation/gitattributes.txt
 > @@ -300,7 +300,11 @@ checkout, when the `smudge` command is 
specified, the command is
 >  fed the blob object from its standard input, and its standard
 >  output is used to update the worktree file.  Similarly, the
 >  `clean` command is used to convert the contents of worktree file
 > -upon checkin.
 > +upon checkin. By default these commands process only a single
 > +blob and terminate. If a long running filter process (see section
 > +below) is used then Git can process all blobs with a single filter
 > +invocation for the entire life of a single Git command (e.g.
 > +`git add .`).
 >
 >  One use of the content filtering is to massage the content into a shape
 >  that is more convenient for the platform, filesystem, and the user 
to use.
 > @@ -375,6 +379,54 @@ substitution.  For example:
 >  ------------------------
 >
 >
 > +Long Running Filter Process
 > +^^^^^^^^^^^^^^^^^^^^^^^^^^^
 > +
 > +If the filter command (string value) is defined via
 > +filter.<driver>.process then Git can process all blobs with a
 > +single filter invocation for the entire life of a single Git
 > +command by talking with the following packet format (pkt-line)
 > +based protocol over standard input and standard output.
 > +
 > +Git starts the filter on first usage and expects a welcome
 > +message, protocol version number, and filter capabilities
 > +seperated by spaces:
 > +------------------------
 > +packet:          git< git-filter-protocol
 > +packet:          git< version 2
 > +packet:          git< clean smudge
 > +------------------------
 > +Supported filter capabilities are "clean" and "smudge".
 > +
 > +Afterwards Git sends a command (e.g. "smudge" or "clean" - based
 > +on the supported capabilities), the filename, the content size as
 > +ASCII number in bytes, and the content in packet format with a
 > +flush packet at the end:
 > +------------------------
 > +packet:          git> smudge
 > +packet:          git> testfile.dat
 > +packet:          git> 7
 > +packet:          git> CONTENT
 > +packet:          git> 0000
 > +------------------------
 > +
 > +The filter is expected to respond with the result content size as
 > +ASCII number in bytes and the result content in packet format with
 > +a flush packet at the end:
 > +------------------------
 > +packet:          git< 57
 > +packet:          git< SMUDGED_CONTENT
 > +packet:          git< 0000
 > +------------------------
 > +Please note: In a future version of Git the capability "stream"
 > +might be supported. In that case the content size must not be
 > +part of the filter response.
 > +
 > +Afterwards the filter is expected to wait for the next command.
 > +A demo implementation can be found in `t/t0021/rot13-filter.pl`
 > +located in the Git core repository.
 > +
 > +
 >  Interaction between checkin/checkout attributes
 >  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 >
 > diff --git a/convert.c b/convert.c
 > index 522e2c5..5ff200b 100644
 > --- a/convert.c
 > +++ b/convert.c
 > @@ -3,6 +3,7 @@
 >  #include "run-command.h"
 >  #include "quote.h"
 >  #include "sigchain.h"
 > +#include "pkt-line.h"
 >
 >  /*
 >   * convert.c - convert a file when checking it out and checking it in.
 > @@ -481,11 +482,232 @@ static int apply_filter(const char *path, 
const char *src, size_t len, int fd,
 >  	return ret;
 >  }
 >
 > +static off_t multi_packet_read(struct strbuf *sb, const int fd, 
const size_t size)
 > +{
 > +	off_t bytes_read;
 > +	off_t total_bytes_read = 0;
 > +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet 
flush
 > +	do {
 > +		bytes_read = packet_read(
 > +			fd, NULL, NULL,
 > +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
 > +			PACKET_READ_GENTLE_ON_EOF
 > +		);
 > +		total_bytes_read += bytes_read;
 > +	}
 > +	while (
 > +		bytes_read > 0 && 					// the last packet was no flush
 > +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in 
the buffer
 > +	);
 > +	strbuf_setlen(sb, total_bytes_read);
 > +	return total_bytes_read;
 > +}
 > +
 > +static int multi_packet_write(const char *src, size_t len, const int 
in, const int out)
 > +{
 > +	int ret = 1;
 > +	char header[4];
 > +	char buffer[8192];
 > +	off_t bytes_to_write;
 > +	while (ret) {
 > +		if (in >= 0) {
 > +			bytes_to_write = xread(in, buffer, sizeof(buffer));
 > +			if (bytes_to_write < 0)
 > +				ret &= 0;
 > +			src = buffer;
 > +		} else {
 > +			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 
4 : len;
 > +			len -= bytes_to_write;
 > +		}
 > +		if (!bytes_to_write)
 > +			break;
 > +		set_packet_header(header, bytes_to_write + 4);
 > +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
 > +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
 > +	}
 > +	ret &= write_in_full(out, "0000", 4) == 4;
 > +	return ret;
 > +}
 > +
 > +struct cmd2process {
 > +	struct hashmap_entry ent; /* must be the first member! */
 > +	const char *cmd;
 > +	int clean;
 > +	int smudge;
 > +	struct child_process process;
 > +};
 > +
 > +static int cmd2process_cmp(const struct cmd2process *e1,
 > +							const struct cmd2process *e2,
 > +							const void *unused)
 > +{
 > +	return strcmp(e1->cmd, e2->cmd);
 > +}
 > +
 > +static struct cmd2process *find_protocol_filter_entry(struct hashmap 
*hashmap, const char *cmd)
 > +{
 > +	struct cmd2process k;
 > +	hashmap_entry_init(&k, strhash(cmd));
 > +	k.cmd = cmd;
 > +	return hashmap_get(hashmap, &k, NULL);
 > +}
 > +
 > +static void stop_protocol_filter(struct hashmap *hashmap, struct 
cmd2process *entry) {
 > +	if (!entry)
 > +		return;
 > +	sigchain_push(SIGPIPE, SIG_IGN);
 > +	close(entry->process.in);
 > +	close(entry->process.out);
 > +	sigchain_pop(SIGPIPE);
 > +	finish_command(&entry->process);
 > +	child_process_clear(&entry->process);
 > +	hashmap_remove(hashmap, entry, NULL);
 > +	free(entry);
 > +}
 > +
 > +static struct cmd2process *start_protocol_filter(struct hashmap 
*hashmap, const char *cmd)
 > +{
 > +	int ret = 1;
 > +	struct cmd2process *entry;
 > +	struct child_process *process;
 > +	const char *argv[] = { NULL, NULL };
 > +	struct string_list capabilities = STRING_LIST_INIT_NODUP;
 > +	char *capabilities_buffer;
 > +	int i;
 > +
 > +	entry = xmalloc(sizeof(*entry));
 > +	hashmap_entry_init(entry, strhash(cmd));
 > +	entry->cmd = cmd;
 > +	entry->clean = 0;
 > +	entry->smudge = 0;
 > +	process = &entry->process;
 > +
 > +	child_process_init(process);
 > +	argv[0] = cmd;
 > +	process->argv = argv;
 > +	process->use_shell = 1;
 > +	process->in = -1;
 > +	process->out = -1;
 > +
 > +	if (start_command(process)) {
 > +		error("cannot fork to run external persistent filter '%s'", cmd);
 > +		stop_protocol_filter(hashmap, entry);
 > +		return NULL;
 > +	}
 > +
 > +	sigchain_push(SIGPIPE, SIG_IGN);
 > +	ret &= strcmp(packet_read_line(process->out, NULL), 
"git-filter-protocol") == 0;
 > +	ret &= strcmp(packet_read_line(process->out, NULL), "version 2") == 0;
 > +	capabilities_buffer = packet_read_line(process->out, NULL);
 > +	sigchain_pop(SIGPIPE);
 > +
 > +	string_list_split_in_place(&capabilities, capabilities_buffer, ' ', 
-1);
 > +	for (i = 0; i < capabilities.nr; i++) {
 > +		const char *requested = capabilities.items[i].string;
 > +		if (!strcmp(requested, "clean")) {
 > +			entry->clean = 1;
 > +		} else if (!strcmp(requested, "smudge")) {
 > +			entry->smudge = 1;
 > +		} else {
 > +			warning(
 > +				"filter process '%s' requested unsupported filter capability '%s'",
 > +				cmd, requested
 > +			);
 > +		}
 > +	}
 > +	string_list_clear(&capabilities, 0);
 > +
 > +	if (!ret) {
 > +		error("initialization for external persistent filter '%s' failed", 
cmd);
 > +		stop_protocol_filter(hashmap, entry);
 > +		return NULL;
 > +	}
 > +
 > +	hashmap_add(hashmap, entry);
 > +	return entry;
 > +}
 > +
 > +static int cmd_process_map_init = 0;
 > +static struct hashmap cmd_process_map;
 > +
 > +static int apply_protocol_filter(const char *path, const char *src, 
size_t len,
 > +						int fd, struct strbuf *dst, const char *cmd,
 > +						const char *filter_type)
 > +{
 > +	int ret = 1;
 > +	struct cmd2process *entry;
 > +	struct child_process *process;
 > +	struct stat file_stat;
 > +	struct strbuf nbuf = STRBUF_INIT;
 > +	off_t expected_bytes;
 > +	char *strtol_end;
 > +	char *strbuf;
 > +
 > +	if (!cmd || !*cmd)
 > +		return 0;
 > +
 > +	if (!dst)
 > +		return 1;
 > +
 > +	if (!cmd_process_map_init) {
 > +		cmd_process_map_init = 1;
 > +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
 > +		entry = NULL;
 > +	} else {
 > +		entry = find_protocol_filter_entry(&cmd_process_map, cmd);
 > +	}
 > +
 > +	if (!entry) {
 > +		entry = start_protocol_filter(&cmd_process_map, cmd);
 > +		if (!entry) {
 > +			return 0;
 > +		}
 > +	}
 > +	process = &entry->process;
 > +
 > +	if (!(!strcmp(filter_type, "clean") && entry->clean) &&
 > +		!(!strcmp(filter_type, "smudge") && entry->smudge)) {
 > +		return 0;
 > +	}
 > +
 > +	if (fd >= 0 && !src) {
 > +		ret &= fstat(fd, &file_stat) != -1;
 > +		len = file_stat.st_size;
 > +	}
 > +
 > +	sigchain_push(SIGPIPE, SIG_IGN);
 > +
 > +	packet_write(process->in, "%s\n", filter_type);
 > +	packet_write(process->in, "%s\n", path);
 > +	packet_write(process->in, "%zu\n", len);
%zu is not portable, e.g. Windows
It may be better to cast len to an unsigned int64
and use
packet_write(process->in, "%" PRId64 "\n", (uint64_t)len);

But this is not ideal either, it could be

packet_write(process->in, "%" PRIdMAX "\n", (uintmax_t)len);


 > +	ret &= multi_packet_write(src, len, fd, process->in);
 > +
 > +	strbuf = packet_read_line(process->out, NULL);
 > +	expected_bytes = (off_t)strtol(strbuf, &strtol_end, 10);

A long is 32 bit under Windows. It could be better to use:
   gitstrtodmax (const char *nptr, char **endptr, int base)

 > +	ret &= (strtol_end != strbuf && errno != ERANGE);
 > +
 > +	if (expected_bytes > 0) {
As pointed out earlier, it is possible that the result of
a filter gives a len of 0, which should be handled.


 > +		ret &= multi_packet_read(&nbuf, process->out, expected_bytes) == 
expected_bytes;
 > +	}
 > +
 > +	sigchain_pop(SIGPIPE);
 > +
 > +	if (ret) {
 > +		strbuf_swap(dst, &nbuf);
 > +	} else {
 > +		// Something went wrong with the protocol filter. Force shutdown!
 > +		stop_protocol_filter(&cmd_process_map, entry);

Is there a warning() missing ?
 > +	}
 > +	strbuf_release(&nbuf);
 > +	return ret;
 > +}
 > +
 >  static struct convert_driver {
 >  	const char *name;
 >  	struct convert_driver *next;
 >  	const char *smudge;
 >  	const char *clean;
 > +	const char *process;
 >  	int required;
 >  } *user_convert, **user_convert_tail;
 >
 > @@ -526,6 +748,10 @@ static int read_convert_config(const char *var, 
const char *value, void *cb)
 >  	if (!strcmp("clean", key))
 >  		return git_config_string(&drv->clean, var, value);
 >
 > +	if (!strcmp("process", key)) {
 > +		return git_config_string(&drv->process, var, value);
 > +	}
 > +
 >  	if (!strcmp("required", key)) {
 >  		drv->required = git_config_bool(var, value);
 >  		return 0;
 > @@ -823,7 +1049,10 @@ int would_convert_to_git_filter_fd(const char 
*path)
 >  	if (!ca.drv->required)
 >  		return 0;
 >
 > -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 > +	if (!ca.drv->clean && ca.drv->process)
 > +		return apply_protocol_filter(path, NULL, 0, -1, NULL, 
ca.drv->process, "clean");
 > +	else
 > +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 >  }
 >
 >  const char *get_convert_attr_ascii(const char *path)
 > @@ -856,17 +1085,22 @@ int convert_to_git(const char *path, const 
char *src, size_t len,
 >                     struct strbuf *dst, enum safe_crlf checksafe)
 >  {
 >  	int ret = 0;
 > -	const char *filter = NULL;
 > +	const char *clean_filter = NULL;
 > +	const char *process_filter = NULL;
 >  	int required = 0;
 >  	struct conv_attrs ca;
 >
 >  	convert_attrs(&ca, path);
 >  	if (ca.drv) {
 > -		filter = ca.drv->clean;
 > +		clean_filter = ca.drv->clean;
 > +		process_filter = ca.drv->process;
 >  		required = ca.drv->required;
 >  	}
 >
 > -	ret |= apply_filter(path, src, len, -1, dst, filter);
 > +	if (!clean_filter && process_filter)
 > +		ret |= apply_protocol_filter(path, src, len, -1, dst, 
process_filter, "clean");
 > +	else
 > +		ret |= apply_filter(path, src, len, -1, dst, clean_filter);
 >  	if (!ret && required)
 >  		die("%s: clean filter '%s' failed", path, ca.drv->name);
 >
 > @@ -885,13 +1119,19 @@ int convert_to_git(const char *path, const 
char *src, size_t len,
 >  void convert_to_git_filter_fd(const char *path, int fd, struct 
strbuf *dst,
 >  			      enum safe_crlf checksafe)
 >  {
 > +	int ret = 0;
 >  	struct conv_attrs ca;
 >  	convert_attrs(&ca, path);
 >
 >  	assert(ca.drv);
 > -	assert(ca.drv->clean);
 > +	assert(ca.drv->clean || ca.drv->process);
 >
 > -	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
 > +	if (!ca.drv->clean && ca.drv->process)
 > +		ret = apply_protocol_filter(path, NULL, 0, fd, dst, 
ca.drv->process, "clean");
 > +	else
 > +		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
 > +
 > +	if (!ret)
 >  		die("%s: clean filter '%s' failed", path, ca.drv->name);
 >
 >  	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 > @@ -902,14 +1142,16 @@ static int 
convert_to_working_tree_internal(const char *path, const char *src,
 >  					    size_t len, struct strbuf *dst,
 >  					    int normalizing)
 >  {
 > -	int ret = 0, ret_filter = 0;
 > -	const char *filter = NULL;
 > +	int ret = 0, ret_filter;
 > +	const char *smudge_filter = NULL;
 > +	const char *process_filter = NULL;
 >  	int required = 0;
 >  	struct conv_attrs ca;
 >
 >  	convert_attrs(&ca, path);
 >  	if (ca.drv) {
 > -		filter = ca.drv->smudge;
 > +		process_filter = ca.drv->process;
 > +		smudge_filter = ca.drv->smudge;
 >  		required = ca.drv->required;
 >  	}
 >
 > @@ -922,7 +1164,7 @@ static int 
convert_to_working_tree_internal(const char *path, const char *src,
 >  	 * CRLF conversion can be skipped if normalizing, unless there
 >  	 * is a smudge filter.  The filter might expect CRLFs.
 >  	 */
 > -	if (filter || !normalizing) {
 > +	if (smudge_filter || process_filter || !normalizing) {
 >  		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 >  		if (ret) {
 >  			src = dst->buf;
 > @@ -930,7 +1172,10 @@ static int 
convert_to_working_tree_internal(const char *path, const char *src,
 >  		}
 >  	}
 >
 > -	ret_filter = apply_filter(path, src, len, -1, dst, filter);
 > +	if (!smudge_filter && process_filter)
 > +		ret_filter = apply_protocol_filter(path, src, len, -1, dst, 
process_filter, "smudge");
 > +	else
 > +		ret_filter = apply_filter(path, src, len, -1, dst, smudge_filter);
 >  	if (!ret_filter && required)
 >  		die("%s: smudge filter %s failed", path, ca.drv->name);
 >
 > @@ -1383,7 +1628,7 @@ struct stream_filter *get_stream_filter(const 
char *path, const unsigned char *s
 >  	struct stream_filter *filter = NULL;
 >
 >  	convert_attrs(&ca, path);
 > -	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
 > +	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 >  		return NULL;
 >
 >  	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 > diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
 > index b9911a4..c4793ed 100755
 > --- a/t/t0021-conversion.sh
 > +++ b/t/t0021-conversion.sh
 > @@ -4,6 +4,11 @@ test_description='blob conversion via gitattributes'
 >
 >  . ./test-lib.sh
 >
 > +if ! test_have_prereq PERL; then
 > +	skip_all='skipping perl interface tests, perl not available'
 > +	test_done
 > +fi
 > +
 >  if test_have_prereq EXPENSIVE
 >  then
 >  	T0021_LARGE_FILE_SIZE=2048
 > @@ -283,4 +288,174 @@ test_expect_success 'disable filter with empty 
override' '
 >  	test_must_be_empty err
 >  '
 >
 > +test_expect_success 'required protocol filter should filter data' '
 > +	test_config_global filter.protocol.process 
\"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
 > +	test_config_global filter.protocol.required true &&
 > +	rm -rf repo &&
 > +	mkdir repo &&
 > +	(
 > +		cd repo &&
 > +		git init &&
 > +
 > +		echo "*.r filter=protocol" >.gitattributes &&
 > +		git add . &&
 > +		git commit . -m "test commit" &&
 > +		git branch empty &&
 > +
 > +		cat ../test.o >test.r &&
 > +		echo "test22" >test2.r &&
 > +		echo "test333" >test3.r &&
 > +
 > +		rm -f output.log &&
 > +		git add . &&
 > +		sort output.log | uniq -c | sed "s/^[ ]*//" >uniq_output.log &&
 > +		cat >expected_add.log <<-\EOF &&
 > +			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
 > +			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
 > +			1 IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
 > +			1 start
 > +			1 wrote filter header
 > +		EOF
 > +		test_cmp expected_add.log uniq_output.log &&
 > +
 > +		>output.log &&
 > +		git commit . -m "test commit" &&
 > +		sort output.log | uniq -c | sed "s/^[ ]*//" |
 > +			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq_output.log &&
 > +		cat >expected_commit.log <<-\EOF &&
 > +			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
 > +			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
 > +			x IN: clean test3.r 8 [OK] -- OUT: 8 [OK]
 > +			1 start
 > +			1 wrote filter header
 > +		EOF
 > +		test_cmp expected_commit.log uniq_output.log &&
 > +
 > +		>output.log &&
 > +		rm -f test?.r &&
 > +		git checkout . &&
 > +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
 > +		cat >expected_checkout.log <<-\EOF &&
 > +			start
 > +			wrote filter header
 > +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
 > +			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
 > +		EOF
 > +		test_cmp expected_checkout.log smudge_output.log &&
 > +
 > +		git checkout empty &&
 > +
 > +		>output.log &&
 > +		git checkout master &&
 > +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
 > +		cat >expected_checkout_master.log <<-\EOF &&
 > +			start
 > +			wrote filter header
 > +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
 > +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
 > +			IN: smudge test3.r 8 [OK] -- OUT: 8 [OK]
 > +		EOF
 > +		test_cmp expected_checkout_master.log smudge_output.log
 > +	)
 > +'
 > +
 > +test_expect_success 'protocol filter large file' '
 > +	test_config_global filter.protocol.process 
\"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
 > +	test_config_global filter.protocol.required true &&
 > +	rm -rf repo &&
 > +	mkdir repo &&
 > +	(
 > +		cd repo &&
 > +		git init &&
 > +
 > +		echo "*.file filter=protocol" >.gitattributes &&
 > +		cp ../generated-test-data/large.file large.file &&
 > +		cp large.file large.original &&
 > +		./../rot13.sh <large.original >large.rot13 &&
 > +
 > +		git add large.file .gitattributes &&
 > +		git commit . -m "test commit" &&
 > +
 > +		rm -f large.file &&
 > +		git checkout -- large.file &&
 > +		git cat-file blob :large.file >actual &&
 > +		test_cmp large.rot13 actual
 > +	)
 > +'
 > +
 > +test_expect_success 'required protocol filter should fail with clean' '
 > +	test_config_global filter.protocol.process 
\"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
 > +	test_config_global filter.protocol.required true &&
 > +	rm -rf repo &&
 > +	mkdir repo &&
 > +	(
 > +		cd repo &&
 > +		git init &&
 > +
 > +		echo "*.r filter=protocol" >.gitattributes &&
 > +
 > +		cat ../test.o >test.r &&
 > +		echo "this is going to fail" >clean-write-fail.r &&
 > +		echo "test333" >test3.r &&
 > +
 > +		# Note: There are three clean paths in convert.c we just test one 
here.
 > +		test_must_fail git add .
 > +	)
 > +'
 > +
 > +test_expect_success 'protocol filter should restart after failure' '
 > +	test_config_global filter.protocol.process 
\"$TEST_DIRECTORY/t0021/rot13-filter.pl\" &&
 > +	rm -rf repo &&
 > +	mkdir repo &&
 > +	(
 > +		cd repo &&
 > +		git init &&
 > +
 > +		echo "*.r filter=protocol" >.gitattributes &&
 > +
 > +		cat ../test.o >test.r &&
 > +		echo "1234567" >test2.o &&
 > +		cat test2.o >test2.r &&
 > +		echo "this is going to fail" >smudge-write-fail.o &&
 > +		cat smudge-write-fail.o >smudge-write-fail.r &&
 > +		git add . &&
 > +		git commit . -m "test commit" &&
 > +		rm -f *.r &&
 > +
 > +		printf "" >output.log &&
 > +		git checkout . &&
 > +		cat output.log | grep -v "IN: clean" >smudge_output.log &&
 > +		cat >expected_checkout_master.log <<-\EOF &&
 > +			start
 > +			wrote filter header
 > +			IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [FAIL]
 > +			start
 > +			wrote filter header
 > +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
 > +			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
 > +		EOF
 > +		test_cmp expected_checkout_master.log smudge_output.log &&
 > +
 > +		test_cmp ../test.o test.r &&
 > +		./../rot13.sh <../test.o >expected &&
 > +		git cat-file blob :test.r >actual &&
 > +		test_cmp expected actual
 > +
 > +		test_cmp test2.o test2.r &&
 > +		./../rot13.sh <test2.o >expected &&
 > +		git cat-file blob :test2.r >actual &&
 > +		test_cmp expected actual
 > +
 > +		test_cmp test2.o test2.r &&
 > +		./../rot13.sh <test2.o >expected &&
 > +		git cat-file blob :test2.r >actual &&
 > +		test_cmp expected actual
 > +
 > +		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
 > +		./../rot13.sh <smudge-write-fail.o >expected &&
 > +		git cat-file blob :smudge-write-fail.r >actual &&
 > +		test_cmp expected actual							  # Clean worked!
 > +	)
 > +'
 > +
 >  test_done
 > diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
 > new file mode 100755
 > index 0000000..7176836
 > --- /dev/null
 > +++ b/t/t0021/rot13-filter.pl
 > @@ -0,0 +1,146 @@
 > +#!/usr/bin/perl
 > +#
 > +# Example implementation for the Git filter protocol version 2
 > +# See Documentation/gitattributes.txt, section "Filter Protocol"
 > +#
 > +# This implementation supports two special test cases:
 > +# (1) If data with the filename "clean-write-fail.r" is processed with
 > +#     a "clean" operation then the write operation will die.
 > +# (2) If data with the filename "smudge-write-fail.r" is processed with
 > +#     a "smudge" operation then the write operation will die.
 > +#
 > +
 > +use strict;
 > +use warnings;
 > +
 > +my $MAX_PACKET_CONTENT_SIZE = 65516;
 > +
 > +sub rot13 {
 > +    my ($str) = @_;
 > +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
 > +    return $str;
 > +}
 > +
 > +sub packet_read {
 > +    my $buffer;
 > +    my $bytes_read = read STDIN, $buffer, 4;
 > +    if ( $bytes_read == 0 ) {
 > +        return;
 > +    }
 > +    elsif ( $bytes_read != 4 ) {
 > +        die "invalid packet size '$bytes_read' field";
 > +    }
 > +    my $pkt_size = hex($buffer);
 > +    if ( $pkt_size == 0 ) {
 > +        return ( 1, "" );
 > +    }
 > +    elsif ( $pkt_size > 4 ) {
 > +        my $content_size = $pkt_size - 4;
 > +        $bytes_read = read STDIN, $buffer, $content_size;
 > +        if ( $bytes_read != $content_size ) {
 > +            die "invalid packet";
 > +        }
 > +        return ( 0, $buffer );
 > +    }
 > +    else {
 > +        die "invalid packet size";
 > +    }
 > +}
 > +
 > +sub packet_write {
 > +    my ($packet) = @_;
 > +    print STDOUT sprintf( "%04x", length($packet) + 4 );
 > +    print STDOUT $packet;
 > +    STDOUT->flush();
 > +}
 > +
 > +sub packet_flush {
 > +    print STDOUT sprintf( "%04x", 0 );
 > +    STDOUT->flush();
 > +}
 > +
 > +open my $debug, ">>", "output.log";
 > +print $debug "start\n";
 > +$debug->flush();
 > +
 > +packet_write("git-filter-protocol\n");
 > +packet_write("version 2\n");
 > +packet_write("clean smudge\n");
 > +print $debug "wrote filter header\n";
 > +$debug->flush();
 > +
 > +while (1) {
 > +    my $command = packet_read();
 > +    unless ( defined($command) ) {
 > +        exit();
 > +    }
 > +    chomp $command;
 > +    print $debug "IN: $command";
 > +    $debug->flush();
 > +    my $filename = packet_read();
 > +    chomp $filename;
 > +    print $debug " $filename";
 > +    $debug->flush();
 > +    my $filelen = packet_read();
 > +    chomp $filelen;
 > +    print $debug " $filelen";
 > +    $debug->flush();
 > +
 > +    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
 > +    my $output;
 > +
 > +    if ( $filelen > 0 ) {
 > +        my $input = "";
 > +        {
 > +            binmode(STDIN);
 > +            my $buffer;
 > +            my $done = 0;
 > +            while ( !$done ) {
 > +                ( $done, $buffer ) = packet_read();
 > +                $input .= $buffer;
 > +            }
 > +            print $debug " [OK] -- ";
 > +            $debug->flush();
 > +        }
 > +
 > +        if ( $command eq "clean" ) {
 > +            $output = rot13($input);
 > +        }
 > +        elsif ( $command eq "smudge" ) {
 > +            $output = rot13($input);
 > +        }
 > +        else {
 > +            die "bad command";
 > +        }
 > +    }
 > +
 > +    my $output_len = length($output);
 > +    packet_write("$output_len\n");
 > +    print $debug "OUT: $output_len ";
 > +    $debug->flush();
 > +    if ( $output_len > 0 ) {
 > +        if (   ( $command eq "clean" and $filename eq 
"clean-write-fail.r" )
 > +            or
 > +            ( $command eq "smudge" and $filename eq 
"smudge-write-fail.r" ) )
 > +        {
 > +            print $debug " [FAIL]\n";
 > +            $debug->flush();
 > +            die "write error";
 > +        }
 > +        else {
 > +            while ( length($output) > 0 ) {
 > +                my $packet = substr( $output, 0, 
$MAX_PACKET_CONTENT_SIZE );
 > +                packet_write($packet);
 > +                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
 > +                    $output = substr( $output, 
$MAX_PACKET_CONTENT_SIZE );
 > +                }
 > +                else {
 > +                    $output = "";
 > +                }
 > +            }
 > +            packet_flush();
 > +            print $debug "[OK]\n";
 > +            $debug->flush();
 > +        }
 > +    }
 > +}
 >


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-28  7:16     ` Lars Schneider
@ 2016-07-28 10:42       ` Jakub Narębski
  2016-07-28 13:29       ` Jeff King
  1 sibling, 0 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-28 10:42 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	mlbright, remi.galan-alfonso, pclouds, Eric Wong, Ramsay Jones,
	Jeff King

W dniu 2016-07-28 o 09:16, Lars Schneider pisze:
>> On 27 Jul 2016, at 21:08, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
>>>
>>> thanks a lot for the extensive reviews. I tried to address all mentioned
>>> concerns and summarized them below. The most prominent changes since v1 are
>>> the following:
>>> * Git offers a number of filter capabilities that a filter can request
>>>  (right now only "smudge" and "clean" - in the future maybe "cleanFromFile",
>>>  "smudgeToFile", and/or "stream")
>>> * pipe communication uses a packet format (pkt-line) based protocol
>>
>> I wonder if it would make sense to support both whole-file pipe communication,
>> and packet format (pkt-line) pipe communication.
>>
>> The problem with whole-file pipe communication (original proposal for
>> new filter protocol is that it needs file size upfront.  For some types
>> of filters it is not a problem:
>> - if a filtered file has the same size as original, like for rot13
>>   example in the test for the feature
>> - if you can calculate the resulting file size from original size,
>>   like for most if not all encryption formats (that includes GPG,
>>   uudecode, base64, quoted-printable, hex, etc.); same for decryption,
>>   and from converting between fixed-width encodings
>> - if resulting file size is saved somewhere that is easy to get, like
>>   for LFS solutions (I think).
>>
>> For other filters it is serious problem.  For example indent, keyword
>> expansion, rezipping with zero compression (well, maybe not this one,
>> but at least the reverse of it), converting between encodings where
>> at least one is variable width (like UTF-8),...
>>
>> IMHO writing whole-file persistent filters is easier than using pkt-line.
>> On the other hand using pkt-line allow for more detailed progress report.
> 
> I initially wanted to support only "while-file" pipe, too.
> 
> But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
> solution (gmane is down - otherwise I would have given you the links).

As GMane is down (at least the web interface; NNTP seems to be running)
I cannot examine their arguments.  Could you summarize?

> 
> After I have looked at it I think the pkt-line solution is indeed nicer
> for the following reasons:
> 
> (1) A stream optimized version (read/write in separate threads) of the filter
>     protocol can be implemented in the future without changing the protocol

I think the more important thing is that with pkt-line the filter does
not need to know the size of the output upfront.  Separate threads are
independent of protocol used, I think; and anyway Git never writes to
filter and reads from filter in the same command, isn't it?  The lifetime
of filter driver command is one Git command for now.

Oh, you meant having separate threads for writing to filter, and
separate thread for receiving output, so you don't have to wait to
send whole file to filter before starting receiving?  Note that
I think original filter implementation does it; at least async_start()
used in it hints about that (I need to examine how it works to tell
more).

> (2) pkt-line is a simple and easy to implement format

But it is more complicated than whole-file based protocol.  You need
to loop over packets... well you need that tool with whole-file, but
it is covered by existing helper functions (read_in_full()).  It is
easy to redirect file descriptors (copy_fd()), while you need to
convert contents into packets on write side, and unpack and unsplit
on the receive side in Git.

You also need to take care documenting if trailing "\n\0", "\n", "\0"
is a part of packet.

> (3) Reuse of existing Git communication infrastructure
>     -> code and documentation are less surprising to people that know Git

Whole-file read is not that difficult...

>     -> you can use GIT_TRACE_PACKET to easily inspect the
>        communication between Git and the filter process

...but this is a nice advantage.

If deemed necessary, we could also reuse progress report meters from
fetch / push side (percent, bandwidth/throughput), I guess.

> (4) The overheads is neglect able (4 byte header vs 65516 byte content)

Right.

> 
> 
>>> * a long running filter application is defined with "filter.<driver>.process"
>>
>> I hope that won't confuse Git users into trying to use single-shot
>> filters with a new protocol...
> 
> Yes, that was my intention for this new config.

All right, but you need to document the precedence rules: that is 

(1) if, accordingly to the operation, `clean` or `smudge` per-file filter
    is present, it is used; 
(2) if `process` semi-persistent protocol based filter is present,
    and it offers, accordingly to the operation, `clean` or `smudge`
    capabilities, it is used;
(3) otherwise, no filtering is performed.  

`clean` or `smudge` set to empty string means identity filter; I don't know
about rule for the new `process` filter if it is set to empty string.

At least in the commit message you would need to describe why this particular
solution was chosen.  I guess it is to avoid starting `protocol` filter,
only to realize that it does not offer "smudge" capability, and that `smudge`
filter is to be used because it is set.

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27 18:11         ` Jeff King
@ 2016-07-28 12:10           ` Lars Schneider
  2016-07-28 13:35             ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-28 12:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay


> On 27 Jul 2016, at 20:11, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Jul 27, 2016 at 07:31:26PM +0200, Lars Schneider wrote:
> 
>>>> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
>>> 
>>> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
>>> 32-bit system)?
>> 
>> Would that be an acceptable solution?
>> 
>> if (size + 1 > SIZE_MAX)
>> 	return die("unrepresentable length for filter buffer");
> 
> No, because by definition "size" will wrap to 0. :)
> 
> You have to do:
> 
>  if (size > SIZE_MAX - 1)
> 	die("whoops");
> 
>> Can you point me to an example in the Git source how this kind of thing should
>> be handled?
> 
> The strbuf code itself checks for overflows. So you could do:
> 
>  strbuf_grow(sb, size);
>  ... fill up with size bytes ...
>  strbuf_addch(sb, ...); /* extra byte for whatever */
> 
> That does mean _possibly_ making a second allocation just to add the
> extra byte, but in practice it's not likely (unless the input exactly
> matches the strbuf's growth pattern).
> 
> If you want to do it yourself, I think:
> 
>  strbuf_grow(sb, st_add(size, 1));

I like that solution! Thanks!


> would work.
> 
>>>> +	while (
>>>> +		bytes_read > 0 && 					// the last packet was no flush
>>>> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
>>>> +	);
>>> 
>>> And I'm not sure if you need to distinguish between "0" and "-1" when
>>> checking byte_read here.
>> 
>> We want to finish reading in both cases, no?
> 
> If we get "-1", that's from an unexpected EOF during the packet_read(),
> because you set GENTLE_ON_EOF. So there's nothing left to read, and we
> should break and return an error.

Right.


> I guess "0" would come from a flush packet? Why would the filter send
> back a flush packet (unless you were using them to signal end-of-input,
> but then why does the filter have to send back the number of bytes ahead
> of time?).

Sending the bytes ahead of time (if available) might be nice for efficient
buffer allocation. I am changing the code so that both cases can be handled
(size ahead of time and no size ahead of time).


>>> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
>>> why not take advantage of that and minimize the framing overhead for
>>> large data?
>> 
>> I took inspiration from here for 8K MAX_IO_SIZE:
>> https://github.com/git/git/blob/master/copy.c#L6
>> 
>> Is this read limit correct? Should I read 8 times to fill a pkt-line?
> 
> MAX_IO_SIZE is generally 8 _megabytes_, not 8K. The loop in copy.c just
> haad to pick an arbitrary size for doing its read/write proxying.  I
> think in practice you are not likely to get much benefit from going
> beyond 8K or so there, just because operating systems tend to do things
> in page-sizes anyway, which are usually 4K.
> 
> But since you are formatting the result into a form that has framing
> overhead, anything up to LARGE_PACKET_MAX will see benefits (though
> admittedly even 4 bytes per 8K is not much).
> 
> I don't think it's worth the complexity of reading 8 times, but just
> using a buffer size of LARGE_PACKET_MAX-4 would be the most efficient.
> 
> I doubt it matters _that much_ in practice, but any time I see a magic
> number I have to wonder at the "why". At least basing it on
> LARGE_PACKET_MAX has some basis, whereas 8K is largely just made-up. :)

Sounds good. I will use LARGE_PACKET_MAX-4 !

> 
>>> We do sometimes do "ret |= something()" but that is in cases where
>>> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
>>> your function's error-reporting is inverted from our usual style?
>> 
>> I thought it makes the code easier to read and the filter doesn't care
>> at what point the error happens anyways. The filter either succeeds
>> or fails. What style would you suggest?
> 
> I think that's orthogonal. I just mean that using zero for success puts
> you in our usual style, and then accumulating errors can be done with
> "|=".

Ah. I guess I was misguided by the way errors are currently handled
in `apply_filter` (success = 1; failure = 0):
https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L437-L479

I wouldn't like if the different filter protocols would use different
error exit codes. Would it be OK to adjust the existing `apply_filter`
function in a cleanup patch?


> I didn't look carefully at whether the accumulating style you're using
> makes sense or not. But something like:
> 
>>>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>>>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> 
> does mean that we call the second write() even if the first one failed.
> That's a waste of time (albeit a minor one), but it also means you could
> potentially cover up the value of "errno" from the first one (though in
> practice I'd expect the second one to fail for the same reason).

Oh. You're right. For some reason I thought the second operator would
never be evaluated if the first operator is 0. Apparently that is not
the case for bit-wise & ... only for logical & ... thanks for the lesson!

- Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-28  7:16     ` Lars Schneider
  2016-07-28 10:42       ` Jakub Narębski
@ 2016-07-28 13:29       ` Jeff King
  2016-07-29  7:40         ` Jakub Narębski
  1 sibling, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-28 13:29 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Git Mailing List, gitster, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay

On Thu, Jul 28, 2016 at 09:16:18AM +0200, Lars Schneider wrote:

> But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
> solution (gmane is down - otherwise I would have given you the links).

FWIW, I think there are arguments for transmitting size + content
(namely, that it is simpler); the downside is that it doesn't allow
streaming.

So I think there are two viable alternatives:

  1. Total size of data in ASCII decimal, newline, then that many bytes
     of content.

  2. No size header, then a series of pkt-lines followed by a flush
     packet.

And you should choose between the two based on whether it's more
important to allow streaming, or more important to make the filter
implementations simple[1].

Any solution that is in between those (like sending a size header and
then using pktlines anyway) is sacrificing simplicity but not getting
the streaming benefits.

-Peff

[1] I haven't thought hard enough about it to have a real opinion. My
    gut says to go with the streaming, just because we've had to
    retrofit streaming in other areas when dealing with blobs, so I
    think we'll end up there eventually. So choosing a simpler protocol
    like (1) would probably mean eventually implementing a next-version
    protocol that does (2), and having to support both.

PS Jakub asked for links, but gmane is down. Here are the relevant threads:

   http://public-inbox.org/git/20160720134916.GB19359@sigill.intra.peff.net

   http://public-inbox.org/git/20160722154900.19477-1-larsxschneider%40gmail.com/t/#u

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-28 12:10           ` Lars Schneider
@ 2016-07-28 13:35             ` Jeff King
  0 siblings, 0 replies; 77+ messages in thread
From: Jeff King @ 2016-07-28 13:35 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, e, ramsay

On Thu, Jul 28, 2016 at 02:10:12PM +0200, Lars Schneider wrote:

> > I think that's orthogonal. I just mean that using zero for success puts
> > you in our usual style, and then accumulating errors can be done with
> > "|=".
> 
> Ah. I guess I was misguided by the way errors are currently handled
> in `apply_filter` (success = 1; failure = 0):
> https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L437-L479
> 
> I wouldn't like if the different filter protocols would use different
> error exit codes. Would it be OK to adjust the existing `apply_filter`
> function in a cleanup patch?

Ah, I see.

I think those return codes are a little different. They are not "success
or error", but "did convert or did not convert" (or "would convert" when
no buffer is given). And unless the filter is required, we quietly turn
errors into "did not convert" (and if it is, we die()).

So I'm not sure if changing them is a good idea. I agree with you that
it's probably inviting confusion to have the two sets of filter
functions have opposite return codes. So I think I retract my
suggestion. :)

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-28 13:29       ` Jeff King
@ 2016-07-29  7:40         ` Jakub Narębski
  2016-07-29  8:14           ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-29  7:40 UTC (permalink / raw)
  To: Jeff King, Lars Schneider
  Cc: Git Mailing List, gitster, tboegi, mlbright, remi.galan-alfonso,
	pclouds, e, ramsay

W dniu 2016-07-28 o 15:29, Jeff King pisze:
> On Thu, Jul 28, 2016 at 09:16:18AM +0200, Lars Schneider wrote:
> 
>> But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
>> solution (gmane is down - otherwise I would have given you the links).
> 
> FWIW, I think there are arguments for transmitting size + content
> (namely, that it is simpler); the downside is that it doesn't allow
> streaming.

And that it requires for the filter to know the size of its output
upfront (which, as I wrote, might be easy to do based on size of input
and data stored elsewhere, or might need generating whole output to
know).

I don't know how parallel Git is, but if it is parallel enough,
and other limits do not apply (limited amount of CPU cores, I/O limits),
without streaming new filter protocol might be slower, unless startup
time dominates (MS Windows?):

Current parallel:

   |   startup   | processing 1 |
    |  startup    | processing 2  |
   | startup |  processing 3 |
     |  startup  |  processing 4  |

Protocol v2:

   |  startup  | processing 1 | processing 2 | processing 3 | processing 4 |

> 
> So I think there are two viable alternatives:
> 
>   1. Total size of data in ASCII decimal, newline, then that many bytes
>      of content.
> 
>   2. No size header, then a series of pkt-lines followed by a flush
>      packet.

    3. Optional size header[2][3], then a series of pkt-lines followed
       by a flush packet[4].

[2] Git should always provide size, because it is easy to do, and
    I think quite cheap (stored with blob, stored in index, or stat()
    on file away).  Filter can provide size if it is easy to calculate,
    or approximation of size / size hint[5] - it helps to avoid
    reallocation.
[3] It is also a place where filter can pass error conditions that
    are known before starting processing a file.
[4] On one hand you need to catch cases where real size is larger than
    size sent upfront, or smaller than size sent upfront; on the
    other hand it might be a place where to send warnings and errors...
    unless we utilize stderr of a process (but then there is a problem
    of deadlocking, I think).
[5] I suggest

        <size as ascii decimal>
        "approx" SPC <size as ascii decimal>
        "unknown"
        "fail"

> And you should choose between the two based on whether it's more
> important to allow streaming, or more important to make the filter
> implementations simple[1].
> 
> Any solution that is in between those (like sending a size header and
> then using pktlines anyway) is sacrificing simplicity but not getting
> the streaming benefits.
> 
> -Peff
> 
> [1] I haven't thought hard enough about it to have a real opinion. My
>     gut says to go with the streaming, just because we've had to
>     retrofit streaming in other areas when dealing with blobs, so I
>     think we'll end up there eventually. So choosing a simpler protocol
>     like (1) would probably mean eventually implementing a next-version
>     protocol that does (2), and having to support both.
> 
> PS Jakub asked for links, but gmane is down. Here are the relevant threads:
> 
>    http://public-inbox.org/git/20160720134916.GB19359@sigill.intra.peff.net
> 
>    http://public-inbox.org/git/20160722154900.19477-1-larsxschneider%40gmail.com/t/#u
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27 23:31     ` Jakub Narębski
@ 2016-07-29  8:04       ` Lars Schneider
  2016-07-29 17:35         ` Junio C Hamano
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-29  8:04 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	mlbright, Remi Galan Alfonso, Nguyen Thai Ngoc Duy, Eric Wong,
	Ramsay Jones, Jeff King, Johannes Schindelin


> On 28 Jul 2016, at 01:31, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-27 o 02:06, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
> 
> It is not strictly necessary... but do we have any benchmarks for this,
> or is it just the feeling?  That is, in what situations Git may filter
> a large number of files (initial checkout? initial add?, switching
> to unrelated branch? getting large files from LFS solution?, and when
> startup time might become significant part of execution time (MS Windows?
> fast filters?)?

All the operations you mentioned are slow because they all cause filter
process invocations. I benchmarked the new filter protocol and it is 
almost 70x faster when switching branches on my local machine (i7, SSD, 
OS X) with a test repository containing 12,000 files that need to be 
filtered. 
Details here: https://github.com/github/git-lfs/pull/1382

Based on my previous experience with Git filter invocations I expect even
more dramatic results on Windows (I will benchmark this, too, as soon as
the list agrees on this approach).


>> This patch adds the filter.<driver>.process string option which, if used,
> 
> String option... what are possible values?  What happens if you use
> value that is not recognized by Git (it is "if used", isn't it)?  That's
> not obvious from the commit message (though it might be from the docs).

Then the process invocation will fail in the same way current filter
process invocations fail. If "filter.<driver>.required" is set then
Git will fail, otherwise not. 


> What is missing is the description that it is set to a command, and
> how it interacts with `clean` and `smudge` options.

Right, I'll add that! I implemented it in a way that the "filter.<driver>.clean" 
and "filter.<driver>.smudge" take presence over "filter.<driver>.process".

This allows the use of a `single-shot` clean filter and a `long-running` 
smudge as suggested by Junio in the v1 discussion (no ref, gmane down).


>> keeps the external filter process running and processes all blobs with
>> the following packet format (pkt-line) based protocol over standard input
>> and standard output.
>> 
>> Git starts the filter on first usage and expects a welcome
>> message, protocol version number, and filter capabilities
>> seperated by spaces:
> 
> s/seperated/separated/

Thanks!


> Is there any handling of misconfigured one-shot filters, or would
> they still hang the execution of a Git command?

They would still hang. Would it be sufficient to mention that in the
docs?


>> ------------------------
>> packet:          git< git-filter-protocol
>> packet:          git< version 2
>> packet:          git< clean smudge
> 
> Wouldn't "capabilities clean smudge" be better?  Or is it the
> "clean smudge" proposal easier to handle?

Good suggestion! This is easy to handle and I think it mimics
the pack-protocol a bit more closely.


>> ------------------------
>> Supported filter capabilities are "clean" and "smudge".
>> 
>> Afterwards Git sends a command (e.g. "smudge" or "clean" - based
>> on the supported capabilities), the filename, the content size as
>> ASCII number in bytes, and the content in packet format with a
>> flush packet at the end:
>> ------------------------
>> packet:          git> smudge
>> packet:          git> testfile.dat
> 
> And here we don't have any problems with files containing embedded
> newlines etc.  Also Git should not be sending invalid file names.
> The question remains: is it absolute file path, or basename?

It sends a path absolute in context of the Git repo (e.g. subdir/testfile.dat).
I'll add that to the docs and I'll add a test case to demonstrate it.

> 
>> packet:          git> 7
>> packet:          git> CONTENT
> 
> Can Git send file contents using more than one packet?  I think
> it should be stated upfront.

OK


>> packet:          git> 0000
>> ------------------------
> 
> Why we need to send content size upfront?  Well, it is not a problem
> for Git, but (as I wrote in reply to the cover letter for this
> series) might be a problem for filter scripts.

I think sending it upfront is nice for buffer allocations of big files
and it doesn't cost us anything to do it.


>> The filter is expected to respond with the result content size as
>> ASCII number in bytes and the result content in packet format with
>> a flush packet at the end:
>> ------------------------
>> packet:          git< 57
> 
> This is not neccessary (and might be hard for scripts to do) if
> pkt-line protocol is used.
> 
> In short: I think pkt-line is not worth the complication on
> the Git side and on the filter size, unless it is used for
> streaming, or at least filter not having to calculate output
> size upfront.

As I stated in my other response to you, I think there is value
in having a single protocol for streaming and non-streaming
content. I'll add a capability "stream" that doesn't require the 
"size upfront" answer from the filter.


> 
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> ------------------------
>> Please note: In a future version of Git the capability "stream"
>> might be supported. In that case the content size must not be
>> part of the filter response.
>> 
>> Afterwards the filter is expected to wait for the next command.
> 
> When filter is supposed to exit, then?

Never by its own. The filter is always shutdown by Git. Something
for the docs, I guess :-)


>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> Documentation/gitattributes.txt |  54 +++++++-
>> convert.c                       | 269 ++++++++++++++++++++++++++++++++++++++--
>> t/t0021-conversion.sh           | 175 ++++++++++++++++++++++++++
>> t/t0021/rot13-filter.pl         | 146 ++++++++++++++++++++++
>> 4 files changed, 631 insertions(+), 13 deletions(-)
>> create mode 100755 t/t0021/rot13-filter.pl
>> 
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index 8882a3e..8fb40d2 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate. If a long running filter process (see section
>> +below) is used then Git can process all blobs with a single filter
>> +invocation for the entire life of a single Git command (e.g.
>> +`git add .`).
> 
> Ah, all right, here we give an example.
> 
> But, is "blob" term used in this document, or do we use "file"
> and "file contents" only?

Both terms are used in the document. Blob is also already used in a
similar context above.


>> 
>> One use of the content filtering is to massage the content into a shape
>> that is more convenient for the platform, filesystem, and the user to use.
>> @@ -375,6 +379,54 @@ substitution.  For example:
>> ------------------------
>> 
>> 
>> +Long Running Filter Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +If the filter command (string value) is defined via
>> +filter.<driver>.process then Git can process all blobs with a
>> +single filter invocation for the entire life of a single Git
>> +command by talking with the following packet format (pkt-line)
>> +based protocol over standard input and standard output.
> 
> Ah, so now it is the name of command,

Correct!


> and I assume it is
> exclusive with `clean` / `smudge`, or does it only takes
> precedence based on capabilities of filter (that is if
> for example "`process`" does not include 'clean' capability,
> then `clean` filter is used, using per-file "protocol").
> Or do something different happens (like preference for
> old-style `clean` and `smudge` filters, and `process`
> used if any unset)?

The existing `single-shot` clean/smudge filter takes precedence
if they are available. The reason is that we would need to
start the `long-running` filter to figure out what capabilities
it has. On the upside I think that is the safest course of action
for existing Git installations out there.

I'll add test cases to demonstrate/ensure that behavior.


> Anyway, Git command would never (I think) run both
> "clean" and "smudge" filters.  But I might be wrong.

I think Git can invoke both filters on checkout.


> Yeah, I know this going back and forth seems like 
> bike-shedding, but designing good user-facing API
> is very, very important.

I agree and I really appreciate that you sacrifice your time to think
it through!


>> +
>> +Git starts the filter on first usage and expects a welcome
>> +message, protocol version number, and filter capabilities
>> +seperated by spaces:
>> +------------------------
>> +packet:          git< git-filter-protocol
>> +packet:          git< version 2
>> +packet:          git< clean smudge
>> +------------------------
> 
> Neither of those is terminated by end of line character,
> that is, "\n", isn't it?

Correct. Will add that.

> 
>> diff --git a/convert.c b/convert.c
>> index 522e2c5..5ff200b 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -3,6 +3,7 @@
>> #include "run-command.h"
>> #include "quote.h"
>> #include "sigchain.h"
>> +#include "pkt-line.h"
>> 
>> /*
>>  * convert.c - convert a file when checking it out and checking it in.
>> @@ -481,11 +482,232 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>> 	return ret;
>> }
>> 
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
> 
> What's the purpose of this function?  Is it to gather read whole
> contents of file into strbuf?  Or is it to read at most 'size'
> bytes of file / of pkt-line stream into strbuf?

This function reads pkt-lines until a flush packet into the strbuf.


> We probably don't want to keep the whole file in memory,
> if possible.

I understand your concern but this what the original implementation
does, too, and therefore I would like to keep the behavior:
https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L462


>> +{
>> +	off_t bytes_read;
>> +	off_t total_bytes_read = 0;
>> +	strbuf_grow(sb, size + 1);	// we need one extra byte for the packet flush
> 
> Why we put packet flush into strbuf?  Or is it only temporarily,
> and we adjust that at the end... I see that it is.

Correct.


>> +	do {
>> +		bytes_read = packet_read(
>> +			fd, NULL, NULL,
>> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
>> +			PACKET_READ_GENTLE_ON_EOF
>> +		);
>> +		total_bytes_read += bytes_read;
>> +	}
>> +	while (
>> +		bytes_read > 0 && 					// the last packet was no flush
>> +		sb->len - total_bytes_read - 1 > 0 	// we still have space left in the buffer
>> +	);
>> +	strbuf_setlen(sb, total_bytes_read);
>> +	return total_bytes_read;
>> +}
>> +
>> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
> 
> What's the purpose of this function?  What are those 'in' and 'out'
> parameters?  Those names do not describe them well.  If they are
> file descriptors, add fd_* prefix (or whatever Git code uses).
> Edit: I see that's what existing code uses.

OK


> Edit: so we are reading from *src + len or from fd_in, depending on
> whether fd_in is set to 0 or not?  I guess that follows existing
> code, where it is even worse, because it is hidden...

Right


>> +{
>> +	int ret = 1;
>> +	char header[4];
>> +	char buffer[8192];
> 
> Could those two be in one variable?  Also, 'header' or 'pkt_header'?
> 
> Why 8192, and not LARGE_PACKET_MAX - 4?

Agreed. I've changed that.


> 
>> +	off_t bytes_to_write;
>> +	while (ret) {
>> +		if (in >= 0) {
>> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
>> +			if (bytes_to_write < 0)
>> +				ret &= 0;
>> +			src = buffer;
>> +		} else {
>> +			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
>> +			len -= bytes_to_write;
>> +		}
>> +		if (!bytes_to_write)
>> +			break;
>> +		set_packet_header(header, bytes_to_write + 4);
>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> 
> These three lines are equivalent to write_packet(), or however
> it is named, isn't it?

A little different. As I discussed with Peff elsewhere the existing write_packet
functions doesn't write the content of src directly. It creates a buffer which 
would cause unnecessary copy operations. However, as Junio noted, creating
these pkt-lines here is wrong. I will add an function to pkt-line.h that can create
a pkt-line without buffer.


>> +	}
>> +	ret &= write_in_full(out, "0000", 4) == 4;
> 
> This is equivalent to packet_flush(), or however it is named,
> isn't it?

packet_flush uses "write_or_die" internally. If the a non required filter fails
then this is no reason to die. However, using this pkt-line knowledge here is
wrong. I will add a new function to pkt-line.h. What do you think about
"tolerant_packet_flush" as function name?


>> +	return ret;
>> +}
>> +
>> +struct cmd2process {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	const char *cmd;
>> +	int clean;
>> +	int smudge;
> 
> These two are 'int' used as 'bool', isn't it?

Yes, but I realized that this becomes cumbersome with more "bools". I create a bitmap
and a few macros to check and set the bitmap (seems to be the Git way for these kind
of problems).


>> +	struct child_process process;
>> +};
> [...]
>> +static struct cmd2process *find_protocol_filter_entry(struct hashmap *hashmap, const char *cmd)
> 
> Wouldn't it be more descriptive to name the first parameter
> to this function 'cmd_hashmap', or something like that, rather
> than plain 'hashmap' (it might be the same that is used / was
> used for a global variable)?
> 
> Edit: or 'cmd_process_map'.

I could, but wouldn't that shadow the global variable? Wouldn't that be even
more confusing?


>> +{
>> +	struct cmd2process k;
>> +	hashmap_entry_init(&k, strhash(cmd));
>> +	k.cmd = cmd;
>> +	return hashmap_get(hashmap, &k, NULL);
>> +}
> 
> [...]
>> +static struct cmd2process *start_protocol_filter(struct hashmap *hashmap, const char *cmd)
>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry;
>> +	struct child_process *process;
>> +	const char *argv[] = { NULL, NULL };
> 
> Could we initialize it with  { cmd, NULL };?
> 
> Edit: Ah, I see that you follow filter_buffer_or_fd() example from
> convert.c, isn't it?

Correct. But filter_buffer_or_fd() requires some extra work on cmd
that is not necessary here. I change it to your suggestion!


>> +	struct string_list capabilities = STRING_LIST_INIT_NODUP;
>> +	char *capabilities_buffer;
>> +	int i;
>> +
>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	entry->clean = 0;
>> +	entry->smudge = 0;
> 
> Wouldn't
> 
>   	entry->clean = entry->smudge = 0;
> 
> be more readable?

I changed it to a bitmap with a single int :)


>> +	process = &entry->process;
>> +
>> +	child_process_init(process);
>> +	argv[0] = cmd;
>> +	process->argv = argv;
>> +	process->use_shell = 1;
>> +	process->in = -1;
> 
> Maybe
> 
>  +	process->in  = -1;
> 
> to align, but perhaps it is not worth it.

Not aligned in the original code, therefore I wouldn't do it here.


>> +	process->out = -1;
>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external persistent filter '%s'", cmd);
> 
> Just a question: is "cannot fork" the only reason why start_command()
> might have failed there?
> 
> Edit: Ah, I see that you follow filter_buffer_or_fd() example from
> convert.c, again.

Correct.


>> +		stop_protocol_filter(hashmap, entry);
>> +		return NULL;
>> +	}
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	ret &= strcmp(packet_read_line(process->out, NULL), "git-filter-protocol") == 0;
>> +	ret &= strcmp(packet_read_line(process->out, NULL), "version 2") == 0;
> 
> So that's why you need packet_read_line() to return string...
> 
>> +	capabilities_buffer = packet_read_line(process->out, NULL);
>> +	sigchain_pop(SIGPIPE);
>> +
>> +	string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);
> 
> This does not modify capabilities_buffer, does it?

Maybe, but this wouldn't matter as we don't use this buffer afterwards.


>> +	for (i = 0; i < capabilities.nr; i++) {
>> +		const char *requested = capabilities.items[i].string;
>> +		if (!strcmp(requested, "clean")) {
>> +			entry->clean = 1;
>> +		} else if (!strcmp(requested, "smudge")) {
>> +			entry->smudge = 1;
>> +		} else {
>> +			warning(
>> +				"filter process '%s' requested unsupported filter capability '%s'",
>> +				cmd, requested
>> +			);
> 
> Nice.  This makes it (somewhat) forward- and backward-compatibile.

Thanks :)


>> +		}
>> +	}
>> +	string_list_clear(&capabilities, 0);
>> +
>> +	if (!ret) {
>> +		error("initialization for external persistent filter '%s' failed", cmd);
> 
> Do we need more detailed information about the source of error?

Not sure. The beauty of pkt-line is that someone can just set GIT_TRACE_PACKET=1 to
see what is going on. I feel that this is enough and we don't need to have fine grained
error reporting...


>> +		stop_protocol_filter(hashmap, entry);
>> +		return NULL;
>> +	}
>> +
>> +	hashmap_add(hashmap, entry);
>> +	return entry;
>> +}
>> +
>> +static int cmd_process_map_init = 0;
>> +static struct hashmap cmd_process_map;
>> +
>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
>> +						int fd, struct strbuf *dst, const char *cmd,
>> +						const char *filter_type)
> 
> That is... quite a lot of parameters.  But I guess there is precedens.

Agreed. But this mimics the original apply_protocol() interface and therefore 
I would like to keep it.


> But I think 'fd' belongs to previous line, as it is alternative to
> src+len.

You could also argue that fd is an alternative to src+len and belongs on a 
new line :-) In any way... if I would move fd up then we would exceed 80 chars...


>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry;
>> +	struct child_process *process;
>> +	struct stat file_stat;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	off_t expected_bytes;
>> +	char *strtol_end;
>> +	char *strbuf;
>> +
>> +	if (!cmd || !*cmd)
>> +		return 0;
>> +
>> +	if (!dst)
>> +		return 1;
>> +
>> +	if (!cmd_process_map_init) {
>> +		cmd_process_map_init = 1;
>> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
>> +		entry = NULL;
> 
> Is it better than having entry NULL-initialized?

I did this based on my understanding of Eric's feedback:

"Compilers complain about uninitialized variables.  Blindly
setting them to NULL can allow them to be dereferenced;
triggering segfaults; especially if it's passed to a different
compilation unit the compiler can't see."

See: http://public-inbox.org/git/20160725072745.GB11634%40starla/


>> +	} else {
>> +		entry = find_protocol_filter_entry(&cmd_process_map, cmd);
>> +	}
>> +
>> +	if (!entry) {
>> +		entry = start_protocol_filter(&cmd_process_map, cmd);
> 
> Hmmm... apply_filter() uses start_async() for some reason.  Why
> it does not apply for this new kind of filter?

Implementing the protocol in an async manner would be way harder
and is only necessary if you want to support true streaming (that
is reading and writing at the same time). This might be desired but
is out of scope for this patch series.


>> +		if (!entry) {
>> +			return 0;
>> +		}
>> +	}
>> +	process = &entry->process;
>> +
>> +	if (!(!strcmp(filter_type, "clean") && entry->clean) &&
>> +		!(!strcmp(filter_type, "smudge") && entry->smudge)) {
> 
> Would it be more readable as !(A || B) rather than (!A && !B)?

I solved it with a bitmask - no DeMorgan anymore :)


>> +		return 0;
>> +	}
>> +
>> +	if (fd >= 0 && !src) {
>> +		ret &= fstat(fd, &file_stat) != -1;
>> +		len = file_stat.st_size;
>> +	}
> 
> All right, so we either use src+len,  or if we use fd we get
> file size.

Right.


>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> +	packet_write(process->in, "%s\n", filter_type);
>> +	packet_write(process->in, "%s\n", path);
>> +	packet_write(process->in, "%zu\n", len);
> 
> So "\n" is included in protocol?

Yes. Pkt-line states "A non-binary line SHOULD BE terminated by an LF"
(see Documentation/technical/protocol-common.txt)


>> +	ret &= multi_packet_write(src, len, fd, process->in);
> 
> How git-receive-pack etc. handle multi-packet write?

This is my solution that avoids an unnecessary buffer. As mentioned
above I move this code to its proper place in pkt-line.h/c.


>> +
>> +	strbuf = packet_read_line(process->out, NULL);
>> +	expected_bytes = (off_t)strtol(strbuf, &strtol_end, 10);
>> +	ret &= (strtol_end != strbuf && errno != ERANGE);
>> +
>> +	if (expected_bytes > 0) {
>> +		ret &= multi_packet_read(&nbuf, process->out, expected_bytes) == expected_bytes;
>> +	}
>> +
>> +	sigchain_pop(SIGPIPE);
>> +
>> +	if (ret) {
>> +		strbuf_swap(dst, &nbuf);
>> +	} else {
>> +		// Something went wrong with the protocol filter. Force shutdown!
>> +		stop_protocol_filter(&cmd_process_map, entry);
> 
> Some error message would be nice... or do we print in down in stack?

OK, I will add an error(...).


>> +	}
>> +	strbuf_release(&nbuf);
>> +	return ret;
>> +}
>> +
> 
> 
> [...]
>> @@ -823,7 +1049,10 @@ int would_convert_to_git_filter_fd(const char *path)
>> 	if (!ca.drv->required)
>> 		return 0;
>> 
>> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
>> +	if (!ca.drv->clean && ca.drv->process)
>> +		return apply_protocol_filter(path, NULL, 0, -1, NULL, ca.drv->process, "clean");
>> +	else
>> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> 
> So the rule is: if `clean` is not set, and `process` is, try to use
> process for cleaning.  It was not clear for me from the documentation.

True. I will add that to the docs.


>> }
>> 
>> const char *get_convert_attr_ascii(const char *path)
>> @@ -856,17 +1085,22 @@ int convert_to_git(const char *path, const char *src, size_t len,
>>                    struct strbuf *dst, enum safe_crlf checksafe)
>> {
>> 	int ret = 0;
>> -	const char *filter = NULL;
>> +	const char *clean_filter = NULL;
>> +	const char *process_filter = NULL;
>> 	int required = 0;
>> 	struct conv_attrs ca;
>> 
>> 	convert_attrs(&ca, path);
>> 	if (ca.drv) {
>> -		filter = ca.drv->clean;
>> +		clean_filter = ca.drv->clean;
>> +		process_filter = ca.drv->process;
>> 		required = ca.drv->required;
>> 	}
>> 
>> -	ret |= apply_filter(path, src, len, -1, dst, filter);
>> +	if (!clean_filter && process_filter)
>> +		ret |= apply_protocol_filter(path, src, len, -1, dst, process_filter, "clean");
>> +	else
>> +		ret |= apply_filter(path, src, len, -1, dst, clean_filter);
> 
> And the rule is the same here, as it should.
> 
>> 	if (!ret && required)
>> 		die("%s: clean filter '%s' failed", path, ca.drv->name);
> 
> Is it a correct error message for `process`?  I guess it is, as it prints
> the name of driver, and not attempted command.  Well, we might be using
> "process" filter in 'clean' mode,... but that is sophistry.

I see where you are coming from. However, it tries to run the "clean" filter 
with the process and therefore I think the message is correct.


> [...]
>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
>> index b9911a4..c4793ed 100755
>> --- a/t/t0021-conversion.sh
>> +++ b/t/t0021-conversion.sh
>> @@ -4,6 +4,11 @@ test_description='blob conversion via gitattributes'
>> 
>> . ./test-lib.sh
>> 
>> +if ! test_have_prereq PERL; then
>> +	skip_all='skipping perl interface tests, perl not available'
>> +	test_done
>> +fi
> 
> Do all tests require Perl?

No. I will add the PERL requirement explicitly to the tests that need it.


>> +test_expect_success 'required protocol filter should filter data' '
> [...]
>> +test_expect_success 'protocol filter large file' '
> [...]
>> +test_expect_success 'required protocol filter should fail with clean' '
> [...]
>> +test_expect_success 'protocol filter should restart after failure' '
> [...]
> 
>> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
>> new file mode 100755
>> index 0000000..7176836
>> --- /dev/null
>> +++ b/t/t0021/rot13-filter.pl
>> @@ -0,0 +1,146 @@
>> +#!/usr/bin/perl
>> +#
>> +# Example implementation for the Git filter protocol version 2
>> +# See Documentation/gitattributes.txt, section "Filter Protocol"
>> +#
>> +# This implementation supports two special test cases:
>> +# (1) If data with the filename "clean-write-fail.r" is processed with
>> +#     a "clean" operation then the write operation will die.
>> +# (2) If data with the filename "smudge-write-fail.r" is processed with
>> +#     a "smudge" operation then the write operation will die.
> 
> Nice.
> 
>> +#
>> +
>> +use strict;
>> +use warnings;
>> +
>> +my $MAX_PACKET_CONTENT_SIZE = 65516;
>> +
>> +sub rot13 {
>> +    my ($str) = @_;
>> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
>> +    return $str;
>> +}
>> +
>> +sub packet_read {
>> +    my $buffer;
>> +    my $bytes_read = read STDIN, $buffer, 4;
>> +    if ( $bytes_read == 0 ) {
>> +        return;
>> +    }
>> +    elsif ( $bytes_read != 4 ) {
> 
> This is a bit untypical bracket style...

With bracket style you mean the spaces? I used PerlTidy in 
standard mode. I will adjust it to look more like perl/Git.pm ...

http://perltidy.sourceforge.net/

> 
>> +        die "invalid packet size '$bytes_read' field";
>> +    }
>> +    my $pkt_size = hex($buffer);
>> +    if ( $pkt_size == 0 ) {
>> +        return ( 1, "" );
>> +    }
>> +    elsif ( $pkt_size > 4 ) {
>> +        my $content_size = $pkt_size - 4;
>> +        $bytes_read = read STDIN, $buffer, $content_size;
>> +        if ( $bytes_read != $content_size ) {
>> +            die "invalid packet";
>> +        }
>> +        return ( 0, $buffer );
>> +    }
>> +    else {
>> +        die "invalid packet size";
>> +    }
>> +}
> 
> So packet reading is not that difficult...
> 
>> +
>> +sub packet_write {
>> +    my ($packet) = @_;
>> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
>> +    print STDOUT $packet;
>> +    STDOUT->flush();
>> +}
> 
> ...and packet write is easy.
> 
>> +
>> +sub packet_flush {
>> +    print STDOUT sprintf( "%04x", 0 );
>> +    STDOUT->flush();
>> +}
>> +
>> +open my $debug, ">>", "output.log";
>> +print $debug "start\n";
>> +$debug->flush();
>> +
>> +packet_write("git-filter-protocol\n");
>> +packet_write("version 2\n");
>> +packet_write("clean smudge\n");
>> +print $debug "wrote filter header\n";
>> +$debug->flush();
> 
> Isn't $debug flushed automatically?

Maybe, but autoflush is not explicitly enabled. I will
enable it again (I disabled it because of Eric's comment
but I re-read the comment and he is only talking about
pipes).

http://public-inbox.org/git/20160723072721.GA20875%40starla/


>> +
>> +while (1) {
>> +    my $command = packet_read();
>> +    unless ( defined($command) ) {
>> +        exit();
>> +    }
>> +    chomp $command;
>> +    print $debug "IN: $command";
>> +    $debug->flush();
>> +    my $filename = packet_read();
>> +    chomp $filename;
>> +    print $debug " $filename";
>> +    $debug->flush();
>> +    my $filelen = packet_read();
>> +    chomp $filelen;
>> +    print $debug " $filelen";
>> +    $debug->flush();
>> +
>> +    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
>> +    my $output;
>> +
>> +    if ( $filelen > 0 ) {
>> +        my $input = "";
>> +        {
>> +            binmode(STDIN);
>> +            my $buffer;
>> +            my $done = 0;
>> +            while ( !$done ) {
>> +                ( $done, $buffer ) = packet_read();
>> +                $input .= $buffer;
>> +            }
>> +            print $debug " [OK] -- ";
>> +            $debug->flush();
>> +        }
>> +
>> +        if ( $command eq "clean" ) {
>> +            $output = rot13($input);
>> +        }
>> +        elsif ( $command eq "smudge" ) {
>> +            $output = rot13($input);
>> +        }
>> +        else {
>> +            die "bad command";
> 
> Perhaps
> 
>               die "bad command $command";

Agreed.


>> +        }
>> +    }
>> +
>> +    my $output_len = length($output);
>> +    packet_write("$output_len\n");
>> +    print $debug "OUT: $output_len ";
>> +    $debug->flush();
>> +    if ( $output_len > 0 ) {
>> +        if (   ( $command eq "clean" and $filename eq "clean-write-fail.r" )
> 
> What happened here with whitespace around parentheses?

Perl tidy :-)
Will fix!

Thanks a lot for this extensive review, again!

Best,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29  7:40         ` Jakub Narębski
@ 2016-07-29  8:14           ` Lars Schneider
  2016-07-29 15:57             ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-29  8:14 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Jeff King, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay


> On 29 Jul 2016, at 09:40, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-28 o 15:29, Jeff King pisze:
>> On Thu, Jul 28, 2016 at 09:16:18AM +0200, Lars Schneider wrote:
>> 
>>> But Peff ($gmane/299902), Duy, and Eric, seemed to prefer the pkt-line
>>> solution (gmane is down - otherwise I would have given you the links).
>> 
>> FWIW, I think there are arguments for transmitting size + content
>> (namely, that it is simpler); the downside is that it doesn't allow
>> streaming.
> 
> And that it requires for the filter to know the size of its output
> upfront (which, as I wrote, might be easy to do based on size of input
> and data stored elsewhere, or might need generating whole output to
> know).
> 
> I don't know how parallel Git is, but if it is parallel enough,
> and other limits do not apply (limited amount of CPU cores, I/O limits),
> without streaming new filter protocol might be slower, unless startup
> time dominates (MS Windows?):
> 
> Current parallel:
> 
>   |   startup   | processing 1 |
>    |  startup    | processing 2  |
>   | startup |  processing 3 |
>     |  startup  |  processing 4  |
> 
> Protocol v2:
> 
>   |  startup  | processing 1 | processing 2 | processing 3 | processing 4 |

Based on the current filter design the "single-shot" invocations are
not executed in parallel.


>> So I think there are two viable alternatives:
>> 
>>  1. Total size of data in ASCII decimal, newline, then that many bytes
>>     of content.
>> 
>>  2. No size header, then a series of pkt-lines followed by a flush
>>     packet.
> 
>    3. Optional size header[2][3], then a series of pkt-lines followed
>       by a flush packet[4].
> 
> [2] Git should always provide size, because it is easy to do, and
>    I think quite cheap (stored with blob, stored in index, or stat()
>    on file away).  Filter can provide size if it is easy to calculate,
>    or approximation of size / size hint[5] - it helps to avoid
>    reallocation.

Agreed!


> [3] It is also a place where filter can pass error conditions that
>    are known before starting processing a file.

I am not sure I understand what you mean. Can you think of an example?


> [4] On one hand you need to catch cases where real size is larger than
>    size sent upfront, or smaller than size sent upfront; on the
>    other hand it might be a place where to send warnings and errors...
>    unless we utilize stderr of a process (but then there is a problem
>    of deadlocking, I think).
> [5] I suggest
> 
>        <size as ascii decimal>
>        "approx" SPC <size as ascii decimal>
>        "unknown"
>        "fail"

My current implementation supports only two cases. Either the filter
knows the size and sends it back. Or the filter doesn't know the size
and Git reads until the flush packet (your "unknown" case). "Approx" is 
probably hard to do and fail shouldn't be part of the size, no?

That being said a "fail" response is a very good idea! This allows
the filter to communicate to git that a non required filter process
failed. I will add that to the protocol. Thanks :) 

- Lars



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-27  9:41     ` Eric Wong
@ 2016-07-29 10:38       ` Lars Schneider
  2016-07-29 11:24         ` Jakub Narębski
  2016-08-05 18:55         ` Eric Wong
  0 siblings, 2 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-29 10:38 UTC (permalink / raw)
  To: Eric Wong
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, ramsay, peff


> On 27 Jul 2016, at 11:41, Eric Wong <e@80x24.org> wrote:
> 
> larsxschneider@gmail.com wrote:
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
> 
> I'm no expert in C, but this might be const-correctness taken
> too far.  I think basing this on the read(2) prototype is less
> surprising:
> 
>   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)

Hm... ok. I like `const` because I think it is usually easier to read/understand
functions that do not change their input variables. This way I can communicate
my intention to future people modifying this function!

If this is frowned upon in the Git community then I will add a comment to the
CodingGuidelines and remove the const :)

I agree with your reordering of the parameters, though!

Speaking of coding style... convert.c is already big and gets only bigger 
with this patch (1720 lines). Would it make sense to add a new file 
"convert-pipe-protocol.c"
or something for my additions?


> Also what Jeff said about off_t vs size_t, but my previous
> emails may have confused you w.r.t. off_t usage...
> 
>> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
> 
> Same comment about over const ints above.
> len can probably be off_t based on what is below; but you need
> to process the loop in ssize_t-friendly chunks.

I think I would prefer to keep it an size_t because that is the
type we get from Git initially. The code will be more clear in v3.


> 
>> +{
>> +	int ret = 1;
>> +	char header[4];
>> +	char buffer[8192];
>> +	off_t bytes_to_write;
> 
> What Jeff said, this should be ssize_t to match read(2) and xread

Agreed.


> 
>> +	while (ret) {
>> +		if (in >= 0) {
>> +			bytes_to_write = xread(in, buffer, sizeof(buffer));
>> +			if (bytes_to_write < 0)
>> +				ret &= 0;
>> +			src = buffer;
>> +		} else {
>> +			bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
>> +			len -= bytes_to_write;
>> +		}
>> +		if (!bytes_to_write)
>> +			break;
> 
> The whole ret &= .. style error handling is hard-to-follow and
> here, a source of bugs.  I think the expected convention on
> hitting errors is:
> 
> 	1) stop whatever you're doing
> 	2) cleanup
> 	3) propagate the error to callers
> 
> "goto" is an acceptable way of accomplishing this.
> 
> For example, byte_to_write may still be negative at this point
> (and interpreted as a really big number when cast to unsigned
> size_t) and src/buffer could be stack garbage.

I changed the implementation here so that the &= style
is not necessary anymore. However, I will look into "goto"
for the other areas!


>> +		set_packet_header(header, bytes_to_write + 4);
>> +		ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>> +		ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
>> +	}
>> +	ret &= write_in_full(out, "0000", 4) == 4;
>> +	return ret;
>> +}
>> +
> 
>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
>> +						int fd, struct strbuf *dst, const char *cmd,
>> +						const char *filter_type)
>> +{
> 
> <snip>
> 
>> +	if (fd >= 0 && !src) {
>> +		ret &= fstat(fd, &file_stat) != -1;
>> +		len = file_stat.st_size;
> 
> Same truncation bug I noticed earlier; what I originally meant
> is the `len' arg probably ought to be off_t, here, not size_t.
> 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
> large file support means off_t is 64-bits (signed).

OK. Would it be OK to keep size_t for this patch series?


> Also, is it worth continuing this function if fstat fails?

No :-)


>> +	}
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> +	packet_write(process->in, "%s\n", filter_type);
>> +	packet_write(process->in, "%s\n", path);
>> +	packet_write(process->in, "%zu\n", len);
> 
> I'm not sure if "%zu" is portable since we don't do C99 (yet?)
> For 64-bit signed off_t, you can probably do:
> 
> 	packet_write(process->in, "%"PRIuMAX"\n", (uintmax_t)len);
> 
> Since we don't have PRIiMAX or intmax_t, here, and a negative
> len would be a bug (probably from failed fstat) anyways.

OK. "%zu" is not used in the entire code base. I will go with
your suggestion!


>> +	ret &= multi_packet_write(src, len, fd, process->in);
> 
> multi_packet_write will probably fail if fstat failed above...

Yes. The error handling is bogus... I thought bitwise "and" would
act the same way as logical "and" (a bit embarrassing to admit that...).


> 
>> +	strbuf = packet_read_line(process->out, NULL);
> 
> And this may just block or timeout if multi_packet_write failed.

True, but unless there is anything easy to do I would leave that.
Or do you think it is really necessary to introduce "select" and
friends?


Thanks a lot for your review,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 10:38       ` Lars Schneider
@ 2016-07-29 11:24         ` Jakub Narębski
  2016-07-29 11:31           ` Lars Schneider
  2016-08-05 18:55         ` Eric Wong
  1 sibling, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-29 11:24 UTC (permalink / raw)
  To: Lars Schneider, Eric Wong
  Cc: Git Mailing List, gitster, tboegi, mlbright, remi.galan-alfonso,
	pclouds, ramsay, peff

W dniu 2016-07-29 o 12:38, Lars Schneider pisze:
> On 27 Jul 2016, at 11:41, Eric Wong <e@80x24.org> wrote:
>> larsxschneider@gmail.com wrote:

>>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>> 
>> I'm no expert in C, but this might be const-correctness taken
>> too far.  I think basing this on the read(2) prototype is less
>> surprising:
>> 
>>   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)
>
> Hm... ok. I like `const` because I think it is usually easier to read/understand
> functions that do not change their input variables. This way I can communicate
> my intention to future people modifying this function!

Well, scalar types like `size_t` are always passed by value, so here `const`
doesn't matter, and it makes line longer.  I think library functions do not
use `const` for `size_t` parameters.

You are reading from the file descriptor `fd`, so it state would change.
Using `const` feels a bit like lying.  Also, it is scalar type.
 
[...] 
> I agree with your reordering of the parameters, though!
> 
> Speaking of coding style... convert.c is already big and gets only bigger 
> with this patch (1720 lines). Would it make sense to add a new file 
> "convert-pipe-protocol.c"
> or something for my additions?

I wonder if it would be possible to enhance existing functions, instead
of redoing them (at least in part) for per-command filter driver protocol.

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 11:24         ` Jakub Narębski
@ 2016-07-29 11:31           ` Lars Schneider
  0 siblings, 0 replies; 77+ messages in thread
From: Lars Schneider @ 2016-07-29 11:31 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Eric Wong, Git Mailing List, gitster, tboegi, mlbright,
	remi.galan-alfonso, pclouds, ramsay, peff


> On 29 Jul 2016, at 13:24, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-29 o 12:38, Lars Schneider pisze:
>> On 27 Jul 2016, at 11:41, Eric Wong <e@80x24.org> wrote:
>>> larsxschneider@gmail.com wrote:
> 
>>>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>>> 
>>> I'm no expert in C, but this might be const-correctness taken
>>> too far.  I think basing this on the read(2) prototype is less
>>> surprising:
>>> 
>>>  static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)
>> 
>> Hm... ok. I like `const` because I think it is usually easier to read/understand
>> functions that do not change their input variables. This way I can communicate
>> my intention to future people modifying this function!
> 
> Well, scalar types like `size_t` are always passed by value, so here `const`
> doesn't matter, and it makes line longer.  I think library functions do not
> use `const` for `size_t` parameters.
> 
> You are reading from the file descriptor `fd`, so it state would change.
> Using `const` feels a bit like lying.  Also, it is scalar type.

OK, since you are the second reviewer arguing against `const` I will remove it.


> 
> [...] 
>> I agree with your reordering of the parameters, though!
>> 
>> Speaking of coding style... convert.c is already big and gets only bigger 
>> with this patch (1720 lines). Would it make sense to add a new file 
>> "convert-pipe-protocol.c"
>> or something for my additions?
> 
> I wonder if it would be possible to enhance existing functions, instead
> of redoing them (at least in part) for per-command filter driver protocol.

I think I reused as much as possible.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29  8:14           ` Lars Schneider
@ 2016-07-29 15:57             ` Jeff King
  2016-07-29 16:20               ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-29 15:57 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay

On Fri, Jul 29, 2016 at 10:14:17AM +0200, Lars Schneider wrote:

> My current implementation supports only two cases. Either the filter
> knows the size and sends it back. Or the filter doesn't know the size
> and Git reads until the flush packet (your "unknown" case). "Approx" is 
> probably hard to do and fail shouldn't be part of the size, no?

Ah, OK, I missed that you could handle both cases. I think that is a
reasonable approach. It means the filter has to bother with pkt-lines,
but beyond that, it can choose the simple or streaming approach as
appropriate.

> That being said a "fail" response is a very good idea! This allows
> the filter to communicate to git that a non required filter process
> failed. I will add that to the protocol. Thanks :) 

Maybe just send "ok <size>", "ok -1" (for streaming), or "fail <reason>"
followed by the content? That is similar to other Git protocols, though
I am not sure they are good models for sanity or extensibility. :)

I don't know if you would want to leave room for other "headers" in the
response, but you could also do something more HTTP-like, with a status
code, and arbitrary headers. And presumably git would just ignore
headers it doesn't know about. I think that's what Jakub's example was
leaning towards. I'm just not sure what other headers are really useful,
but it does leave room for extensibility.

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29 15:57             ` Jeff King
@ 2016-07-29 16:20               ` Lars Schneider
  2016-07-29 16:50                 ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-29 16:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narębski, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay


> On 29 Jul 2016, at 17:57, Jeff King <peff@peff.net> wrote:
> 
> On Fri, Jul 29, 2016 at 10:14:17AM +0200, Lars Schneider wrote:
> 
>> My current implementation supports only two cases. Either the filter
>> knows the size and sends it back. Or the filter doesn't know the size
>> and Git reads until the flush packet (your "unknown" case). "Approx" is 
>> probably hard to do and fail shouldn't be part of the size, no?
> 
> Ah, OK, I missed that you could handle both cases. I think that is a
> reasonable approach. It means the filter has to bother with pkt-lines,
> but beyond that, it can choose the simple or streaming approach as
> appropriate.

Right.


>> That being said a "fail" response is a very good idea! This allows
>> the filter to communicate to git that a non required filter process
>> failed. I will add that to the protocol. Thanks :) 
> 
> Maybe just send "ok <size>", "ok -1" (for streaming), or "fail <reason>"
> followed by the content? That is similar to other Git protocols, though
> I am not sure they are good models for sanity or extensibility. :)
> 
> I don't know if you would want to leave room for other "headers" in the
> response, but you could also do something more HTTP-like, with a status
> code, and arbitrary headers. And presumably git would just ignore
> headers it doesn't know about. I think that's what Jakub's example was
> leaning towards. I'm just not sure what other headers are really useful,
> but it does leave room for extensibility.

Well, "ok <size>" wouldn't make much sense as we already transmitted
the size upfront I think. Right now I have implemented the following options:

"success\n" --> everything was alright
"reject\n" --> the filter rejected the operation but this is no error 
               if "filter.<driver>.required = false"
<anything else> --> failure that stops/restarts the filter process

I don't think sending any failure reason makes sense because if a failure
happens then we are likely in a bad state already (that's why I restart the
filter process. I think the filter can report trouble on its own via stdout,
no? I think this is what Git-LFS already does.

I am working on the docs right now and afterwards I will send a v3 :-)

- Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29 16:20               ` Lars Schneider
@ 2016-07-29 16:50                 ` Jeff King
  2016-07-29 17:43                   ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jeff King @ 2016-07-29 16:50 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay

On Fri, Jul 29, 2016 at 06:20:51PM +0200, Lars Schneider wrote:

> >> That being said a "fail" response is a very good idea! This allows
> >> the filter to communicate to git that a non required filter process
> >> failed. I will add that to the protocol. Thanks :) 
> > 
> > Maybe just send "ok <size>", "ok -1" (for streaming), or "fail <reason>"
> > followed by the content? That is similar to other Git protocols, though
> > I am not sure they are good models for sanity or extensibility. :)
> > 
> > I don't know if you would want to leave room for other "headers" in the
> > response, but you could also do something more HTTP-like, with a status
> > code, and arbitrary headers. And presumably git would just ignore
> > headers it doesn't know about. I think that's what Jakub's example was
> > leaning towards. I'm just not sure what other headers are really useful,
> > but it does leave room for extensibility.
> 
> Well, "ok <size>" wouldn't make much sense as we already transmitted
> the size upfront I think. Right now I have implemented the following options:

Maybe I'm confused about where in the protocol we are. I was imagining:

  git> smudge
  git> <filename>
  git> <size>
  git> ...pkt-lines...
  git> pktline-flush

  git< ok <size>
  git< ...pkt-lines...
  git< flush

That is, we should say "I have something for you" or "I do not" before
sending a size, because in the "I do not" case we have no size to send.

A more extensible protocol might look like:

  git> smudge
  git> filename=<filename>
  git> size=<size>
  git> pktline-flush
  git> ...pkt-lines of data...
  git> pktline-flush

  git< ok (or success, or whatever status code you like)
  git< size=<size>
  git< pkt-line-flush
  git< ...pkt-lines of data...
  git< pktline-flush

That leaves room for new "keys" to be added before the first pkt-flush,
without having to change the parsing at all.

> "success\n" --> everything was alright
> "reject\n" --> the filter rejected the operation but this is no error 
>                if "filter.<driver>.required = false"
> <anything else> --> failure that stops/restarts the filter process
> 
> I don't think sending any failure reason makes sense because if a failure
> happens then we are likely in a bad state already (that's why I restart the
> filter process. I think the filter can report trouble on its own via stdout,
> no? I think this is what Git-LFS already does.

Git-LFS sends to stderr because there's no other option. I wonder if it
would be nicer to make it Git's responsibility to talk to the user,
because then it could respect things like "--quiet". I guess error
messages are generally printed regardless of verbosity, though, so
printing them unconditionally is OK.

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29  8:04       ` Lars Schneider
@ 2016-07-29 17:35         ` Junio C Hamano
  2016-07-29 23:11           ` Jakub Narębski
  0 siblings, 1 reply; 77+ messages in thread
From: Junio C Hamano @ 2016-07-29 17:35 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Git Mailing List, Torsten Bögershausen,
	mlbright, Remi Galan Alfonso, Nguyen Thai Ngoc Duy, Eric Wong,
	Ramsay Jones, Jeff King, Johannes Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> I think sending it upfront is nice for buffer allocations of big files
> and it doesn't cost us anything to do it.

While I do NOT think "total size upfront" MUST BE avoided at all costs,
I do not think the above statement to justify it makes ANY sense.

Big files are by definition something you cannot afford to hold its
entirety in core, so you do not want to be told that you'd be fed 40GB
and ask xmalloc to allocate that much.

It allows the reader to be lazy for buffer allocations as long as
you know the file fits in-core, at the cost of forcing the writer to
somehow come up with the total number of bytes even before sending a
single byte (in other words, if the writer cannot produce and hold
the data in-core, it may even have to spool the data in a temporary
file only to count, and then play it back after showing the total
size).

It is good that you allow both mode of operations and the size of
the data can either be given upfront (which allows a single fixed
allocation upfront without realloc, as long as the data fits in
core), or be left "(atend)".

I just don't want to see it oversold as a "feature" that the size
has to come before data.  That is a limitation, not a feature.

Thanks.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29 16:50                 ` Jeff King
@ 2016-07-29 17:43                   ` Lars Schneider
  2016-07-29 18:27                     ` Jeff King
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-29 17:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narębski, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay


> On 29 Jul 2016, at 18:50, Jeff King <peff@peff.net> wrote:
> 
> On Fri, Jul 29, 2016 at 06:20:51PM +0200, Lars Schneider wrote:
> 
>>>> That being said a "fail" response is a very good idea! This allows
>>>> the filter to communicate to git that a non required filter process
>>>> failed. I will add that to the protocol. Thanks :) 
>>> 
>>> Maybe just send "ok <size>", "ok -1" (for streaming), or "fail <reason>"
>>> followed by the content? That is similar to other Git protocols, though
>>> I am not sure they are good models for sanity or extensibility. :)
>>> 
>>> I don't know if you would want to leave room for other "headers" in the
>>> response, but you could also do something more HTTP-like, with a status
>>> code, and arbitrary headers. And presumably git would just ignore
>>> headers it doesn't know about. I think that's what Jakub's example was
>>> leaning towards. I'm just not sure what other headers are really useful,
>>> but it does leave room for extensibility.
>> 
>> Well, "ok <size>" wouldn't make much sense as we already transmitted
>> the size upfront I think. Right now I have implemented the following options:
> 
> Maybe I'm confused about where in the protocol we are. I was imagining:
> 
>  git> smudge
>  git> <filename>
>  git> <size>
>  git> ...pkt-lines...
>  git> pktline-flush
> 
>  git< ok <size>
>  git< ...pkt-lines...
>  git< flush
> 
> That is, we should say "I have something for you" or "I do not" before
> sending a size, because in the "I do not" case we have no size to send.


Right now the protocol is like that in the happy case (non-streaming):

git> smudge
git> <filename>
git> <size>
git> ...pkt-lines...
git> pktline-flush

git< <size>
git< ...pkt-lines...
git< flush
git< success

(diff to your version: no "ok" in front of size answer ... plus the
size answer is not present in the streaming case)


Here is the reject case (non-streaming):

git> smudge
git> <filename>
git> <size>
git> ...pkt-lines...
git> pktline-flush

git< 0
git< reject


Do you see a problem with this approach?


> A more extensible protocol might look like:
> 
>  git> smudge
>  git> filename=<filename>
>  git> size=<size>
>  git> pktline-flush
>  git> ...pkt-lines of data...
>  git> pktline-flush
> 
>  git< ok (or success, or whatever status code you like)
>  git< size=<size>
>  git< pkt-line-flush
>  git< ...pkt-lines of data...
>  git< pktline-flush
> 
> That leaves room for new "keys" to be added before the first pkt-flush,
> without having to change the parsing at all.

Alright. Will be in v3!


>> "success\n" --> everything was alright
>> "reject\n" --> the filter rejected the operation but this is no error 
>>               if "filter.<driver>.required = false"
>> <anything else> --> failure that stops/restarts the filter process
>> 
>> I don't think sending any failure reason makes sense because if a failure
>> happens then we are likely in a bad state already (that's why I restart the
>> filter process. I think the filter can report trouble on its own via stdout,
>> no? I think this is what Git-LFS already does.
> 
> Git-LFS sends to stderr because there's no other option. I wonder if it
> would be nicer to make it Git's responsibility to talk to the user,
> because then it could respect things like "--quiet". I guess error
> messages are generally printed regardless of verbosity, though, so
> printing them unconditionally is OK.

OK!

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 0/5] Git filter protocol
  2016-07-29 17:43                   ` Lars Schneider
@ 2016-07-29 18:27                     ` Jeff King
  0 siblings, 0 replies; 77+ messages in thread
From: Jeff King @ 2016-07-29 18:27 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, mlbright, Remi Galan Alfonso,
	Nguyen Thai Ngoc Duy, e, ramsay

On Fri, Jul 29, 2016 at 07:43:49PM +0200, Lars Schneider wrote:

> Here is the reject case (non-streaming):
> 
> git> smudge
> git> <filename>
> git> <size>
> git> ...pkt-lines...
> git> pktline-flush
> 
> git< 0
> git< reject
> 
> 
> Do you see a problem with this approach?

Only that it seemed a little weird to me to have to write a meaningless
"0" when "reject" covers the situation entirely. I don't think it's
wrong, though (and even in some ways right, because it decouples the
meaning of "reject" from the syntax of parsing, but I think it's OK for
the protocol parser to understand the difference between success and
failure codes).

-Peff

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 17:35         ` Junio C Hamano
@ 2016-07-29 23:11           ` Jakub Narębski
  2016-07-29 23:44             ` Lars Schneider
  0 siblings, 1 reply; 77+ messages in thread
From: Jakub Narębski @ 2016-07-29 23:11 UTC (permalink / raw)
  To: git
  Cc: Git Mailing List, Torsten Bögershausen, mlbright,
	Remi Galan Alfonso, Nguyen Thai Ngoc Duy, Eric Wong, Ramsay Jones,
	Jeff King, Johannes Schindelin

W dniu 2016-07-29 o 19:35, Junio C Hamano pisze:
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> I think sending it upfront is nice for buffer allocations of big files
>> and it doesn't cost us anything to do it.
> 
> While I do NOT think "total size upfront" MUST BE avoided at all costs,
> I do not think the above statement to justify it makes ANY sense.
> 
> Big files are by definition something you cannot afford to hold its
> entirety in core, so you do not want to be told that you'd be fed 40GB
> and ask xmalloc to allocate that much.

I don't know much how filter driver work internally, but in some cases
Git reads or writes from file (file descriptor), in other cases it reads
or writes from str+len pair (it probably predates strbuf) - I think in
those cases file needs to fit in memory (in size_t).  So in some cases
Git reads file into memory.  Whether it uses xmalloc or mmap, I don't
know.

> 
> It allows the reader to be lazy for buffer allocations as long as
> you know the file fits in-core, at the cost of forcing the writer to
> somehow come up with the total number of bytes even before sending a
> single byte (in other words, if the writer cannot produce and hold
> the data in-core, it may even have to spool the data in a temporary
> file only to count, and then play it back after showing the total
> size).

For some types of filters you can know the size upfront:
 - for filters such as rot13, with 1-to-1 transformation, you know
   that the output size is the same as the input size
 - for block encodings, and for constant-width to constant-width
   encoding conversion, filter can calculate output size from the
   input size (e.g. <output size> = 2*<input size>)
 - filter may have get size from somewhere, for example LFS filter
   stub is constant size, and files are stored in artifactory with
   their length 

> 
> It is good that you allow both mode of operations and the size of
> the data can either be given upfront (which allows a single fixed
> allocation upfront without realloc, as long as the data fits in
> core), or be left "(atend)".

I think the protocol should be either: <size> + <contents>, or
<size unknown> + <contents> + <flush>, that is do not use flush
packet if size is known upfront -- it would be a second point
of truth (SPOT principle).
 
> I just don't want to see it oversold as a "feature" that the size
> has to come before data.  That is a limitation, not a feature.
> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 23:11           ` Jakub Narębski
@ 2016-07-29 23:44             ` Lars Schneider
  2016-07-30  9:32               ` Jakub Narębski
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-07-29 23:44 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Git Mailing List, Torsten Bögershausen,
	mlbright, Remi Galan Alfonso, Nguyen Thai Ngoc Duy, Eric Wong,
	Ramsay Jones, Jeff King, Johannes Schindelin


> On 30 Jul 2016, at 01:11, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 2016-07-29 o 19:35, Junio C Hamano pisze:
>> Lars Schneider <larsxschneider@gmail.com> writes:
>> 
>>> I think sending it upfront is nice for buffer allocations of big files
>>> and it doesn't cost us anything to do it.
>> 
>> While I do NOT think "total size upfront" MUST BE avoided at all costs,
>> I do not think the above statement to justify it makes ANY sense.
>> 
>> Big files are by definition something you cannot afford to hold its
>> entirety in core, so you do not want to be told that you'd be fed 40GB
>> and ask xmalloc to allocate that much.
> 
> I don't know much how filter driver work internally, but in some cases
> Git reads or writes from file (file descriptor), in other cases it reads
> or writes from str+len pair (it probably predates strbuf) - I think in
> those cases file needs to fit in memory (in size_t).  So in some cases
> Git reads file into memory.  Whether it uses xmalloc or mmap, I don't
> know.
> 
>> 
>> It allows the reader to be lazy for buffer allocations as long as
>> you know the file fits in-core, at the cost of forcing the writer to
>> somehow come up with the total number of bytes even before sending a
>> single byte (in other words, if the writer cannot produce and hold
>> the data in-core, it may even have to spool the data in a temporary
>> file only to count, and then play it back after showing the total
>> size).
> 
> For some types of filters you can know the size upfront:
> - for filters such as rot13, with 1-to-1 transformation, you know
>   that the output size is the same as the input size
> - for block encodings, and for constant-width to constant-width
>   encoding conversion, filter can calculate output size from the
>   input size (e.g. <output size> = 2*<input size>)
> - filter may have get size from somewhere, for example LFS filter
>   stub is constant size, and files are stored in artifactory with
>   their length 
> 
>> 
>> It is good that you allow both mode of operations and the size of
>> the data can either be given upfront (which allows a single fixed
>> allocation upfront without realloc, as long as the data fits in
>> core), or be left "(atend)".
> 
> I think the protocol should be either: <size> + <contents>, or
> <size unknown> + <contents> + <flush>, that is do not use flush
> packet if size is known upfront -- it would be a second point
> of truth (SPOT principle).

As I mentioned elsewhere a <flush> packet is always send right now.
I have no strong opinion if this is good or bad. The implementation
was a little bit simpler and that's why I did it. I will implement 
whatever option the majority prefers :-)

Cheers,
Lars

> 
>> I just don't want to see it oversold as a "feature" that the size
>> has to come before data.  That is a limitation, not a feature.
>> 
>> Thanks.
>> 
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 23:44             ` Lars Schneider
@ 2016-07-30  9:32               ` Jakub Narębski
  0 siblings, 0 replies; 77+ messages in thread
From: Jakub Narębski @ 2016-07-30  9:32 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Git Mailing List, Torsten Bögershausen,
	mlbright, Remi Galan Alfonso, Nguyen Thai Ngoc Duy, Eric Wong,
	Ramsay Jones, Jeff King, Johannes Schindelin

W dniu 2016-07-30 o 01:44, Lars Schneider pisze:
> On 30 Jul 2016, at 01:11, Jakub Narębski <jnareb@gmail.com> wrote:

>> I think the protocol should be either: <size> + <contents>, or
>> <size unknown> + <contents> + <flush>, that is do not use flush
>> packet if size is known upfront -- it would be a second point
>> of truth (SPOT principle).
>
> As I mentioned elsewhere a <flush> packet is always send right now.
> I have no strong opinion if this is good or bad. The implementation
> was a little bit simpler and that's why I did it. I will implement 
> whatever option the majority prefers :-)

Well, if we treat it as a size hint, then it is all right; as you
say it makes for a simpler implementation: read till flush.  Git
should not error out if there is mismatch between specified and
actual size of return from the filter; filter can do whatever
it wants.

I see there is v3 series sent, so I'll move the discussion there.
One thing: we probably would want for the size / size-hint
packet to be extensible, either

  size=<size> [(SPC <key>=<value>)...] "\n"

or

  <size> [(SPC <sth>)...] "\n"

that is, space separated list starting with size / size hint.

Upfront error could be signalled by putting for example "error"
in place of size, e.g.

  error <error description> "\n"

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-07-29 10:38       ` Lars Schneider
  2016-07-29 11:24         ` Jakub Narębski
@ 2016-08-05 18:55         ` Eric Wong
  2016-08-05 23:26           ` Lars Schneider
  1 sibling, 1 reply; 77+ messages in thread
From: Eric Wong @ 2016-08-05 18:55 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, ramsay, peff

Lars Schneider <larsxschneider@gmail.com> wrote:
> > On 27 Jul 2016, at 11:41, Eric Wong <e@80x24.org> wrote:
> > larsxschneider@gmail.com wrote:
> >> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> >> +						int fd, struct strbuf *dst, const char *cmd,
> >> +						const char *filter_type)
> >> +{
> > 
> > <snip>
> > 
> >> +	if (fd >= 0 && !src) {
> >> +		ret &= fstat(fd, &file_stat) != -1;
> >> +		len = file_stat.st_size;
> > 
> > Same truncation bug I noticed earlier; what I originally meant
> > is the `len' arg probably ought to be off_t, here, not size_t.
> > 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
> > large file support means off_t is 64-bits (signed).
> 
> OK. Would it be OK to keep size_t for this patch series?

I think there should at least be a truncation warning (or die)
for larger-than-4GB files on 32-bit.  I don't know how common
they are for git-lfs users.

Perhaps using xsize_t in git-compat-util.h works for now:

	len = xsize_t(file_stat.st_size);

(sorry, I haven't had much time to look at your other updates)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-08-05 18:55         ` Eric Wong
@ 2016-08-05 23:26           ` Lars Schneider
  2016-08-05 23:38             ` Eric Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Lars Schneider @ 2016-08-05 23:26 UTC (permalink / raw)
  To: Eric Wong
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, ramsay, peff


> On 05 Aug 2016, at 20:55, Eric Wong <e@80x24.org> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> wrote:
>>> On 27 Jul 2016, at 11:41, Eric Wong <e@80x24.org> wrote:
>>> larsxschneider@gmail.com wrote:
>>>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
>>>> +						int fd, struct strbuf *dst, const char *cmd,
>>>> +						const char *filter_type)
>>>> +{
>>> 
>>> <snip>
>>> 
>>>> +	if (fd >= 0 && !src) {
>>>> +		ret &= fstat(fd, &file_stat) != -1;
>>>> +		len = file_stat.st_size;
>>> 
>>> Same truncation bug I noticed earlier; what I originally meant
>>> is the `len' arg probably ought to be off_t, here, not size_t.
>>> 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
>>> large file support means off_t is 64-bits (signed).
>> 
>> OK. Would it be OK to keep size_t for this patch series?
> 
> I think there should at least be a truncation warning (or die)
> for larger-than-4GB files on 32-bit.  I don't know how common
> they are for git-lfs users.
> 
> Perhaps using xsize_t in git-compat-util.h works for now:
> 
> 	len = xsize_t(file_stat.st_size);

Thanks for the hint! Should I add the same check to sha1_file's use
of fstat in line 1002 or is it not needed there?

https://github.com/git/git/blob/c6b0597e9ac7277e148e2fd4d7615ac6e0bfb661/sha1_file.c#L1002

Thanks,
Lars

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 5/5] convert: add filter.<driver>.process option
  2016-08-05 23:26           ` Lars Schneider
@ 2016-08-05 23:38             ` Eric Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Eric Wong @ 2016-08-05 23:38 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright,
	remi.galan-alfonso, pclouds, ramsay, peff

> >>> larsxschneider@gmail.com wrote:
> >>>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,

Lars Schneider <larsxschneider@gmail.com> wrote:
> > On 05 Aug 2016, at 20:55, Eric Wong <e@80x24.org> wrote:
> > Perhaps using xsize_t in git-compat-util.h works for now:
> > 
> > 	len = xsize_t(file_stat.st_size);
> 
> Thanks for the hint! Should I add the same check to sha1_file's use
> of fstat in line 1002 or is it not needed there?
> 
> https://github.com/git/git/blob/c6b0597e9ac7277e148e2fd4d7615ac6e0bfb661/sha1_file.c#L1002

Not needed, if you look at the definition of "struct packed_git"
in cache.h, you'll see pack_size is already off_t, not size_t
like `len` is.

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2016-08-07  0:13 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-22 15:48 [PATCH v1 0/3] Git filter protocol larsxschneider
2016-07-22 15:48 ` [PATCH v1 1/3] convert: quote filter names in error messages larsxschneider
2016-07-22 15:48 ` [PATCH v1 2/3] convert: modernize tests larsxschneider
2016-07-26 15:18   ` Remi Galan Alfonso
2016-07-26 20:40     ` Junio C Hamano
2016-07-22 15:49 ` [PATCH v1 3/3] convert: add filter.<driver>.useProtocol option larsxschneider
2016-07-22 22:32   ` Torsten Bögershausen
2016-07-24 12:09     ` Lars Schneider
2016-07-22 23:19   ` Ramsay Jones
2016-07-22 23:28     ` Ramsay Jones
2016-07-24 17:16     ` Lars Schneider
2016-07-24 22:36       ` Ramsay Jones
2016-07-24 23:22         ` Jakub Narębski
2016-07-25 20:32           ` Lars Schneider
2016-07-26 10:58             ` Jakub Narębski
2016-07-25 20:24         ` Lars Schneider
2016-07-23  0:11   ` Jakub Narębski
2016-07-23  7:27     ` Eric Wong
2016-07-26 20:00       ` Jeff King
2016-07-24 18:36     ` Lars Schneider
2016-07-24 20:14       ` Jakub Narębski
2016-07-24 21:30         ` Jakub Narębski
2016-07-25 20:16           ` Lars Schneider
2016-07-26 12:24             ` Jakub Narębski
2016-07-25 20:09         ` Lars Schneider
2016-07-26 14:18           ` Jakub Narębski
2016-07-23  8:14   ` Eric Wong
2016-07-24 19:11     ` Lars Schneider
2016-07-25  7:27       ` Eric Wong
2016-07-25 15:48       ` Duy Nguyen
2016-07-22 21:39 ` [PATCH v1 0/3] Git filter protocol Junio C Hamano
2016-07-24 11:24   ` Lars Schneider
2016-07-26 20:11     ` Jeff King
2016-07-27  0:06 ` [PATCH v2 0/5] " larsxschneider
2016-07-27  0:06   ` [PATCH v2 1/5] convert: quote filter names in error messages larsxschneider
2016-07-27 20:01     ` Jakub Narębski
2016-07-28  8:23       ` Lars Schneider
2016-07-27  0:06   ` [PATCH v2 2/5] convert: modernize tests larsxschneider
2016-07-27  0:06   ` [PATCH v2 3/5] pkt-line: extract and use `set_packet_header` function larsxschneider
2016-07-27  0:20     ` Junio C Hamano
2016-07-27  9:13       ` Lars Schneider
2016-07-27 16:31         ` Junio C Hamano
2016-07-27  0:06   ` [PATCH v2 4/5] convert: generate large test files only once larsxschneider
2016-07-27  2:35     ` Torsten Bögershausen
2016-07-27 13:32       ` Jeff King
2016-07-27 16:50         ` Lars Schneider
2016-07-27  0:06   ` [PATCH v2 5/5] convert: add filter.<driver>.process option larsxschneider
2016-07-27  1:32     ` Jeff King
2016-07-27 17:31       ` Lars Schneider
2016-07-27 18:11         ` Jeff King
2016-07-28 12:10           ` Lars Schneider
2016-07-28 13:35             ` Jeff King
2016-07-27  9:41     ` Eric Wong
2016-07-29 10:38       ` Lars Schneider
2016-07-29 11:24         ` Jakub Narębski
2016-07-29 11:31           ` Lars Schneider
2016-08-05 18:55         ` Eric Wong
2016-08-05 23:26           ` Lars Schneider
2016-08-05 23:38             ` Eric Wong
2016-07-27 23:31     ` Jakub Narębski
2016-07-29  8:04       ` Lars Schneider
2016-07-29 17:35         ` Junio C Hamano
2016-07-29 23:11           ` Jakub Narębski
2016-07-29 23:44             ` Lars Schneider
2016-07-30  9:32               ` Jakub Narębski
2016-07-28 10:32     ` Torsten Bögershausen
2016-07-27 19:08   ` [PATCH v2 0/5] Git filter protocol Jakub Narębski
2016-07-28  7:16     ` Lars Schneider
2016-07-28 10:42       ` Jakub Narębski
2016-07-28 13:29       ` Jeff King
2016-07-29  7:40         ` Jakub Narębski
2016-07-29  8:14           ` Lars Schneider
2016-07-29 15:57             ` Jeff King
2016-07-29 16:20               ` Lars Schneider
2016-07-29 16:50                 ` Jeff King
2016-07-29 17:43                   ` Lars Schneider
2016-07-29 18:27                     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).