git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v3 00/10] Git filter protocol
       [not found] <20160727000605.49982-1-larsxschneider%40gmail.com/>
@ 2016-07-29 23:37 ` larsxschneider
  2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
                     ` (10 more replies)
  0 siblings, 11 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

thanks a lot to Jakub, Peff, Torsten, Eric, and Junio for comments
and reviews.

Here is what has changed since v2:

* replace `/dev/urandom` with `test-genrandom` (Torsten, Peff)
* improve commit message "Git filter driver command with spaces" (Jakub)
* use proper types for memory and disk (Peff)
* create packet read buffer with overflow check (Peff)
* change capabilities format: "capabilities clean smudge" (Jakub)
* replace "%zu" (Eric)
* remove &= error handling (Eric, Peff)
* initialize *argv[] with  { cmd, NULL } (Jakub)
* reorder multi_packet_read() parameters to match read(2) (Eric)
* do not continue if fstat fails (Eric)
* filter: add reject response
* add functions to pkt-line.h/c that: (Jakub, Peff)
    - can write a packet without creating a new buffer
    - do not die in case of a failure
* add function to pkt-line.h/c that writes a pkt-line flush and does not die on error
* add filter stream capability
* add filter shutdown capability
* docs: fix LARGE_PACKET_MAX documentaion
    see http://public-inbox.org/git/20160726134257.GB19277%40sigill.intra.peff.net/
* docs: fix s/seperated/separated/ (Jakub)
* docs: "mis-configured one-shot filters would hang" (Jakub)
* docs: filter protocol filename absolute (Jakub)
* docs: state that Git can use more than one packet (Jabub)
* docs: add "\n" to lines (Jakub)
* docs: filter precedence (Jakub)

Cheers,
Lars

PS: If you prefer checkout the code from a Git repo instead then you can find
it here: https://github.com/larsxschneider/git/tree/protocol-filter/v3


Lars Schneider (10):
  pkt-line: extract set_packet_header()
  pkt-line: add direct_packet_write() and direct_packet_write_data()
  pkt-line: add packet_flush_gentle()
  pkt-line: call packet_trace() only if a packet is actually send
  pack-protocol: fix maximum pkt-line size
  run-command: add clean_on_exit_handler
  convert: quote filter names in error messages
  convert: modernize tests
  convert: generate large test files only once
  convert: add filter.<driver>.process option

 Documentation/gitattributes.txt             |  84 ++++-
 Documentation/technical/protocol-common.txt |   6 +-
 convert.c                                   | 412 +++++++++++++++++++++-
 pkt-line.c                                  |  53 ++-
 pkt-line.h                                  |   6 +
 run-command.c                               |  12 +-
 run-command.h                               |   1 +
 t/t0021-conversion.sh                       | 515 +++++++++++++++++++++++++---
 t/t0021/rot13-filter.pl                     | 177 ++++++++++
 9 files changed, 1193 insertions(+), 73 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

--
2.9.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v3 01/10] pkt-line: extract set_packet_header()
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30 10:30     ` Jakub Narębski
  2016-07-29 23:37   ` [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

set_packet_header() converts an integer to a 4 byte hex string. Make
this function locally available so that other pkt-line functions can
use it.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 62fdb37..445b8e1 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -98,9 +98,17 @@ void packet_buf_flush(struct strbuf *buf)
 }
 
 #define hex(a) (hexchar[(a) & 15])
-static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+static void set_packet_header(char *buf, const int size)
 {
 	static char hexchar[] = "0123456789abcdef";
+	buf[0] = hex(size >> 12);
+	buf[1] = hex(size >> 8);
+	buf[2] = hex(size >> 4);
+	buf[3] = hex(size);
+}
+
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+{
 	size_t orig_len, n;
 
 	orig_len = out->len;
@@ -111,10 +119,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	if (n > LARGE_PACKET_MAX)
 		die("protocol error: impossibly long line");
 
-	out->buf[orig_len + 0] = hex(n >> 12);
-	out->buf[orig_len + 1] = hex(n >> 8);
-	out->buf[orig_len + 2] = hex(n >> 4);
-	out->buf[orig_len + 3] = hex(n);
+	set_packet_header(&out->buf[orig_len], n);
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
  2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30 10:49     ` Jakub Narębski
  2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Sometimes pkt-line data is already available in a buffer and it would
be a waste of resources to write the packet using packet_write() which
would copy the existing buffer into a strbuf before writing it.

If the caller has control over the buffer creation then the
PKTLINE_DATA_START macro can be used to skip the header and write
directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
would be the maximum). direct_packet_write() would take this buffer,
adjust the pkt-line header and write it.

If the caller has no control over the buffer creation then
direct_packet_write_data() can be used. This function creates a pkt-line
header. Afterwards the header and the data buffer are written using two
consecutive write calls.

Both functions have a gentle parameter that indicates if Git should die
in case of a write error (gentle set to 0) or return with a error (gentle
set to 1).

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 30 ++++++++++++++++++++++++++++++
 pkt-line.h |  5 +++++
 2 files changed, 35 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 445b8e1..6fae508 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -135,6 +135,36 @@ void packet_write(int fd, const char *fmt, ...)
 	write_or_die(fd, buf.buf, buf.len);
 }
 
+int direct_packet_write(int fd, char *buf, size_t size, int gentle)
+{
+	int ret = 0;
+	packet_trace(buf + 4, size - 4, 1);
+	set_packet_header(buf, size);
+	if (gentle)
+		ret = !write_or_whine_pipe(fd, buf, size, "pkt-line");
+	else
+		write_or_die(fd, buf, size);
+	return ret;
+}
+
+int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle)
+{
+	int ret = 0;
+	char hdr[4];
+	set_packet_header(hdr, sizeof(hdr) + size);
+	packet_trace(buf, size, 1);
+	if (gentle) {
+		ret = (
+			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
+			!write_or_whine_pipe(fd, buf, size, "pkt-line data")
+		);
+	} else {
+		write_or_die(fd, hdr, sizeof(hdr));
+		write_or_die(fd, buf, size);
+	}
+	return ret;
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
 	va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 3cb9d91..02dcced 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,8 @@ void packet_flush(int fd);
 void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int direct_packet_write(int fd, char *buf, size_t size, int gentle);
+int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
@@ -77,6 +79,9 @@ char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
 
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
+#define PKTLINE_HEADER_LEN 4
+#define PKTLINE_DATA_START(pkt) ((pkt) + PKTLINE_HEADER_LEN)
+#define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - PKTLINE_HEADER_LEN)
 extern char packet_buffer[LARGE_PACKET_MAX];
 
 #endif
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
  2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
  2016-07-29 23:37   ` [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30 12:04     ` Jakub Narębski
  2016-07-31 20:36     ` Torstem Bögershausen
  2016-07-29 23:37   ` [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
                     ` (7 subsequent siblings)
  10 siblings, 2 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_flush() would die in case of a write error even though for some callers
an error would be acceptable. Add packet_flush_gentle() which writes a pkt-line
flush packet and returns `0` for success and `1` for failure.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 6 ++++++
 pkt-line.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 6fae508..1728690 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -91,6 +91,12 @@ void packet_flush(int fd)
 	write_or_die(fd, "0000", 4);
 }
 
+int packet_flush_gentle(int fd)
+{
+	packet_trace("0000", 4, 1);
+	return !write_or_whine_pipe(fd, "0000", 4, "flush packet");
+}
+
 void packet_buf_flush(struct strbuf *buf)
 {
 	packet_trace("0000", 4, 1);
diff --git a/pkt-line.h b/pkt-line.h
index 02dcced..3953c98 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_flush_gentle(int fd);
 int direct_packet_write(int fd, char *buf, size_t size, int gentle);
 int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (2 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30 12:29     ` Jakub Narębski
  2016-07-29 23:37   ` [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size larsxschneider
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

The packet_trace() call is not ideal in format_packet() as we would print
a trace when a packet is formatted and (potentially) when the packet is
actually send. This was no problem up until now because format_packet()
was only used by one function. Fix it by moving the trace call into the
function that actally sends the packet.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pkt-line.c b/pkt-line.c
index 1728690..32c0a34 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -126,7 +126,6 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 		die("protocol error: impossibly long line");
 
 	set_packet_header(&out->buf[orig_len], n);
-	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
 void packet_write(int fd, const char *fmt, ...)
@@ -138,6 +137,7 @@ void packet_write(int fd, const char *fmt, ...)
 	va_start(args, fmt);
 	format_packet(&buf, fmt, args);
 	va_end(args);
+	packet_trace(buf.buf + 4, buf.len - 4, 1);
 	write_or_die(fd, buf.buf, buf.len);
 }
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (3 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30 13:58     ` Jakub Narębski
  2016-07-29 23:37   ` [PATCH v3 06/10] run-command: add clean_on_exit_handler larsxschneider
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

According to LARGE_PACKET_MAX in pkt-line.h the maximal lenght of a
pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
therefore the pkt-line data component must not exceed 65516 bytes.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/technical/protocol-common.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
index bf30167..ecedb34 100644
--- a/Documentation/technical/protocol-common.txt
+++ b/Documentation/technical/protocol-common.txt
@@ -67,9 +67,9 @@ with non-binary data the same whether or not they contain the trailing
 LF (stripping the LF if present, and not complaining when it is
 missing).
 
-The maximum length of a pkt-line's data component is 65520 bytes.
-Implementations MUST NOT send pkt-line whose length exceeds 65524
-(65520 bytes of payload + 4 bytes of length data).
+The maximum length of a pkt-line's data component is 65516 bytes.
+Implementations MUST NOT send pkt-line whose length exceeds 65520
+(65516 bytes of payload + 4 bytes of length data).
 
 Implementations SHOULD NOT send an empty pkt-line ("0004").
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 06/10] run-command: add clean_on_exit_handler
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (4 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-30  9:50     ` Johannes Sixt
  2016-07-29 23:37   ` [PATCH v3 07/10] convert: quote filter names in error messages larsxschneider
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Some commands might need to perform cleanup tasks on exit. Let's give
them an interface for doing this.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 run-command.c | 12 ++++++++----
 run-command.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/run-command.c b/run-command.c
index 33bc63a..197b534 100644
--- a/run-command.c
+++ b/run-command.c
@@ -21,6 +21,7 @@ void child_process_clear(struct child_process *child)
 
 struct child_to_clean {
 	pid_t pid;
+	void (*clean_on_exit_handler)(pid_t);
 	struct child_to_clean *next;
 };
 static struct child_to_clean *children_to_clean;
@@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
 {
 	while (children_to_clean) {
 		struct child_to_clean *p = children_to_clean;
+		if (p->clean_on_exit_handler)
+			p->clean_on_exit_handler(p->pid);
 		children_to_clean = p->next;
 		kill(p->pid, sig);
 		if (!in_signal)
@@ -49,10 +52,11 @@ static void cleanup_children_on_exit(void)
 	cleanup_children(SIGTERM, 0);
 }
 
-static void mark_child_for_cleanup(pid_t pid)
+static void mark_child_for_cleanup(pid_t pid, void (*clean_on_exit_handler)(pid_t))
 {
 	struct child_to_clean *p = xmalloc(sizeof(*p));
 	p->pid = pid;
+	p->clean_on_exit_handler = clean_on_exit_handler;
 	p->next = children_to_clean;
 	children_to_clean = p;
 
@@ -422,7 +426,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0)
 		error_errno("cannot fork() for %s", cmd->argv[0]);
 	else if (cmd->clean_on_exit)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->clean_on_exit_handler);
 
 	/*
 	 * Wait for child's execvp. If the execvp succeeds (or if fork()
@@ -483,7 +487,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
 		error_errno("cannot spawn %s", cmd->argv[0]);
 	if (cmd->clean_on_exit && cmd->pid >= 0)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->clean_on_exit_handler);
 
 	argv_array_clear(&nargv);
 	cmd->argv = sargv;
@@ -752,7 +756,7 @@ int start_async(struct async *async)
 		exit(!!async->proc(proc_in, proc_out, async->data));
 	}
 
-	mark_child_for_cleanup(async->pid);
+	mark_child_for_cleanup(async->pid, NULL);
 
 	if (need_in)
 		close(fdin[0]);
diff --git a/run-command.h b/run-command.h
index 5066649..59d21ea 100644
--- a/run-command.h
+++ b/run-command.h
@@ -43,6 +43,7 @@ struct child_process {
 	unsigned stdout_to_stderr:1;
 	unsigned use_shell:1;
 	unsigned clean_on_exit:1;
+	void (*clean_on_exit_handler)(pid_t);
 };
 
 #define CHILD_PROCESS_INIT { NULL, ARGV_ARRAY_INIT, ARGV_ARRAY_INIT }
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 07/10] convert: quote filter names in error messages
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (5 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 06/10] run-command: add clean_on_exit_handler larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-29 23:37   ` [PATCH v3 08/10] convert: modernize tests larsxschneider
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to
read in error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b1614bf..522e2c5 100644
--- a/convert.c
+++ b/convert.c
@@ -397,7 +397,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -415,13 +415,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -462,15 +462,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 08/10] convert: modernize tests
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (6 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 07/10] convert: quote filter names in error messages larsxschneider
@ 2016-07-29 23:37   ` larsxschneider
  2016-07-29 23:38   ` [PATCH v3 09/10] convert: generate large test files only once larsxschneider
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">" and "<".

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..7b45136 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -13,8 +13,8 @@ EOF
 chmod +x rot13.sh
 
 test_expect_success setup '
-	git config filter.rot13.smudge ./rot13.sh &&
-	git config filter.rot13.clean ./rot13.sh &&
+	test_config filter.rot13.smudge ./rot13.sh &&
+	test_config filter.rot13.clean ./rot13.sh &&
 
 	{
 	    echo "*.t filter=rot13"
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh <test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 09/10] convert: generate large test files only once
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (7 preceding siblings ...)
  2016-07-29 23:37   ` [PATCH v3 08/10] convert: modernize tests larsxschneider
@ 2016-07-29 23:38   ` larsxschneider
  2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
  10 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:38 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Generate more interesting large test files with pseudo random characters
in between and reuse these test files in multiple tests. Run tests formerly
marked as EXPENSIVE every time but with a smaller data set.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7b45136..34c8eb9 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -4,6 +4,15 @@ test_description='blob conversion via gitattributes'
 
 . ./test-lib.sh
 
+if test_have_prereq EXPENSIVE
+then
+	T0021_LARGE_FILE_SIZE=2048
+	T0021_LARGISH_FILE_SIZE=100
+else
+	T0021_LARGE_FILE_SIZE=30
+	T0021_LARGISH_FILE_SIZE=2
+fi
+
 cat <<EOF >rot13.sh
 #!$SHELL_PATH
 tr \
@@ -31,7 +40,26 @@ test_expect_success setup '
 	cat test >test.i &&
 	git add test test.t test.i &&
 	rm -f test test.t test.i &&
-	git checkout -- test test.t test.i
+	git checkout -- test test.t test.i &&
+
+	mkdir generated-test-data &&
+	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
+	do
+		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
+		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
+		# Generate 1MB of empty data and 100 bytes of random characters
+		# printf "$(test-genrandom start $i)"
+		printf "%1048576d" 1 >>generated-test-data/large.file &&
+		printf "$RANDOM_STRING" >>generated-test-data/large.file &&
+		printf "%1048576d" 1 >>generated-test-data/large.file.rot13 &&
+		printf "$ROT_RANDOM_STRING" >>generated-test-data/large.file.rot13 &&
+
+		if test $i = $T0021_LARGISH_FILE_SIZE
+		then
+			cat generated-test-data/large.file >generated-test-data/largish.file &&
+			cat generated-test-data/large.file.rot13 >generated-test-data/largish.file.rot13
+		fi
+	done
 '
 
 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -199,9 +227,9 @@ test_expect_success 'required filter clean failure' '
 test_expect_success 'filtering large input to small output should use little memory' '
 	test_config filter.devnull.clean "cat >/dev/null" &&
 	test_config filter.devnull.required true &&
-	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
-	echo "30MB filter=devnull" >.gitattributes &&
-	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
+	cp generated-test-data/large.file large.file &&
+	echo "large.file filter=devnull" >.gitattributes &&
+	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add large.file
 '
 
 test_expect_success 'filter that does not read is fine' '
@@ -214,15 +242,15 @@ test_expect_success 'filter that does not read is fine' '
 	test_cmp expect actual
 '
 
-test_expect_success EXPENSIVE 'filter large file' '
+test_expect_success 'filter large file' '
 	test_config filter.largefile.smudge cat &&
 	test_config filter.largefile.clean cat &&
-	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
-	echo "2GB filter=largefile" >.gitattributes &&
-	git add 2GB 2>err &&
+	echo "large.file filter=largefile" >.gitattributes &&
+	cp generated-test-data/large.file large.file &&
+	git add large.file 2>err &&
 	test_must_be_empty err &&
-	rm -f 2GB &&
-	git checkout -- 2GB 2>err &&
+	rm -f large.file &&
+	git checkout -- large.file 2>err &&
 	test_must_be_empty err
 '
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (8 preceding siblings ...)
  2016-07-29 23:38   ` [PATCH v3 09/10] convert: generate large test files only once larsxschneider
@ 2016-07-29 23:38   ` larsxschneider
  2016-07-30 22:05     ` Jakub Narębski
  2016-07-31 22:19     ` Jakub Narębski
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
  10 siblings, 2 replies; 120+ messages in thread
From: larsxschneider @ 2016-07-29 23:38 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

This patch adds the filter.<driver>.process string option which, if used,
keeps the external filter process running and processes all blobs with
the following packet format (pkt-line) based protocol over standard input
and standard output.

Git starts the filter on first usage and expects a welcome
message, protocol version number, and filter capabilities
separated by spaces:
------------------------
packet:          git< git-filter-protocol\n
packet:          git< version 2\n
packet:          git< capabilities clean smudge\n
------------------------
Supported filter capabilities are "clean", "smudge", "stream",
and "shutdown".

Afterwards Git sends a command (based on the supported
capabilities), the filename including its path
relative to the repository root, the content size as ASCII number
in bytes, the content split in zero or many pkt-line packets,
and a flush packet at the end:
------------------------
packet:          git> smudge\n
packet:          git> filename=path/testfile.dat\n
packet:          git> size=7\n
packet:          git> CONTENT
packet:          git> 0000
------------------------

The filter is expected to respond with the result content size as
ASCII number in bytes. If the capability "stream" is defined then
the filter must not send the content size. Afterwards the result
content in send in zero or many pkt-line packets and a flush packet
at the end. Finally a "success" packet is send to indicate that
everything went well.
------------------------
packet:          git< size=57\n   (omitted with capability "stream")
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< success\n
------------------------

In case the filter cannot process the content, it is expected
to respond with the result content size 0 (only if "stream" is
not defined) and a "reject" packet.
------------------------
packet:          git< size=0\n    (omitted with capability "stream")
packet:          git< reject\n
------------------------

After the filter has processed a blob it is expected to wait for
the next command. A demo implementation can be found in
`t/t0021/rot13-filter.pl` located in the Git core repository.

If the filter supports the "shutdown" capability then Git will
send the "shutdown" command and wait until the filter answers
with "done". This gives the filter the opportunity to perform
cleanup tasks. Afterwards the filter is expected to exit.
------------------------
packet:          git> shutdown\n
packet:          git< done\n
------------------------

If a filter.<driver>.clean or filter.<driver>.smudge command
is configured then these commands always take precedence over
a configured filter.<driver>.process command.

Please note that you cannot use an existing filter.<driver>.clean
or filter.<driver>.smudge command as filter.<driver>.process
command. As soon as Git would detect a file that needs to be
processed by this filter, it would stop responding.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Helped-by: Martin-Louis Bright <mlbright@gmail.com>
---
 Documentation/gitattributes.txt |  84 ++++++++-
 convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
 t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
 4 files changed, 1053 insertions(+), 13 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..e3fbcc2 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate. If a long running filter process (see section
+below) is used then Git can process all blobs with a single filter
+invocation for the entire life of a single Git command (e.g.
+`git add .`).
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +379,84 @@ substitution.  For example:
 ------------------------
 
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (string value) is defined via
+filter.<driver>.process then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using the following packet
+format (pkt-line, see protocol-common.txt) based protocol over
+standard input and standard output.
+
+Git starts the filter on first usage and expects a welcome
+message, protocol version number, and filter capabilities
+separated by spaces:
+------------------------
+packet:          git< git-filter-protocol\n
+packet:          git< version 2\n
+packet:          git< capabilities clean smudge\n
+------------------------
+Supported filter capabilities are "clean", "smudge", "stream",
+and "shutdown".
+
+Afterwards Git sends a command (based on the supported
+capabilities), the filename including its path
+relative to the repository root, the content size as ASCII number
+in bytes, the content split in zero or many pkt-line packets,
+and a flush packet at the end:
+------------------------
+packet:          git> smudge\n
+packet:          git> filename=path/testfile.dat\n
+packet:          git> size=7\n
+packet:          git> CONTENT
+packet:          git> 0000
+------------------------
+
+The filter is expected to respond with the result content size as
+ASCII number in bytes. If the capability "stream" is defined then
+the filter must not send the content size. Afterwards the result
+content in send in zero or many pkt-line packets and a flush packet
+at the end. Finally a "success" packet is send to indicate that
+everything went well.
+------------------------
+packet:          git< size=57\n   (omitted with capability "stream")
+packet:          git< SMUDGED_CONTENT
+packet:          git< 0000
+packet:          git< success\n
+------------------------
+
+In case the filter cannot process the content, it is expected
+to respond with the result content size 0 (only if "stream" is
+not defined) and a "reject" packet.
+------------------------
+packet:          git< size=0\n    (omitted with capability "stream")
+packet:          git< reject\n
+------------------------
+
+After the filter has processed a blob it is expected to wait for
+the next command. A demo implementation can be found in
+`t/t0021/rot13-filter.pl` located in the Git core repository.
+
+If the filter supports the "shutdown" capability then Git will
+send the "shutdown" command and wait until the filter answers
+with "done". This gives the filter the opportunity to perform
+cleanup tasks. Afterwards the filter is expected to exit.
+------------------------
+packet:          git> shutdown\n
+packet:          git< done\n
+------------------------
+
+If a filter.<driver>.clean or filter.<driver>.smudge command
+is configured then these commands always take precedence over
+a configured filter.<driver>.process command.
+
+Please note that you cannot use an existing filter.<driver>.clean
+or filter.<driver>.smudge command as filter.<driver>.process
+command. As soon as Git would detect a file that needs to be
+processed by this filter, it would stop responding.
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/convert.c b/convert.c
index 522e2c5..be6405c 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "pkt-line.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -481,11 +482,355 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return ret;
 }
 
+static int multi_packet_read(int fd_in, struct strbuf *sb, size_t expected_bytes, int is_stream)
+{
+	int bytes_read;
+	size_t total_bytes_read = 0;
+	if (expected_bytes == 0 && !is_stream)
+		return 0;
+
+	if (is_stream)
+		strbuf_grow(sb, LARGE_PACKET_MAX);           // allocate space for at least one packet
+	else
+		strbuf_grow(sb, st_add(expected_bytes, 1));  // add one extra byte for the packet flush
+
+	do {
+		bytes_read = packet_read(
+			fd_in, NULL, NULL,
+			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
+			PACKET_READ_GENTLE_ON_EOF
+		);
+		if (bytes_read < 0)
+			return 1;  // unexpected EOF
+
+		if (is_stream &&
+			bytes_read > 0 &&
+			sb->len - total_bytes_read - 1 <= 0)
+			strbuf_grow(sb, st_add(sb->len, LARGE_PACKET_MAX));
+		total_bytes_read += bytes_read;
+	}
+	while (
+		bytes_read > 0 &&                   // the last packet was no flush
+		sb->len - total_bytes_read - 1 > 0  // we still have space left in the buffer
+	);
+	strbuf_setlen(sb, total_bytes_read);
+	return (is_stream ? 0 : expected_bytes != total_bytes_read);
+}
+
+static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
+{
+	int did_fail = 0;
+	ssize_t bytes_to_write;
+	while (!did_fail) {
+		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_LEN);
+		if (bytes_to_write < 0)
+			return 1;
+		if (bytes_to_write == 0)
+			break;
+		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
+	}
+	if (!did_fail)
+		did_fail = packet_flush_gentle(fd_out);
+	return did_fail;
+}
+
+static int multi_packet_write_from_buf(const char *src, size_t len, int fd_out)
+{
+	int did_fail = 0;
+	size_t bytes_written = 0;
+	size_t bytes_to_write;
+	while (!did_fail) {
+		if ((len - bytes_written) > PKTLINE_DATA_LEN)
+			bytes_to_write = PKTLINE_DATA_LEN;
+		else
+			bytes_to_write = len - bytes_written;
+		if (bytes_to_write == 0)
+			break;
+		did_fail |= direct_packet_write_data(fd_out, src + bytes_written, bytes_to_write, 1);
+		bytes_written += bytes_to_write;
+	}
+	if (!did_fail)
+		did_fail = packet_flush_gentle(fd_out);
+	return did_fail;
+}
+
+#define FILTER_CAPABILITIES_STREAM   0x1
+#define FILTER_CAPABILITIES_CLEAN    0x2
+#define FILTER_CAPABILITIES_SMUDGE   0x4
+#define FILTER_CAPABILITIES_SHUTDOWN 0x8
+#define FILTER_SUPPORTS_STREAM(type) ((type) & FILTER_CAPABILITIES_STREAM)
+#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
+#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
+#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) & FILTER_CAPABILITIES_SHUTDOWN)
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	const char *cmd;
+	int supported_capabilities;
+	struct child_process process;
+};
+
+static int cmd_process_map_initialized = 0;
+static struct hashmap cmd_process_map;
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+							const struct cmd2process *e2,
+							const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_protocol2_filter_entry(struct hashmap *hashmap, const char *cmd)
+{
+	struct cmd2process k;
+	hashmap_entry_init(&k, strhash(cmd));
+	k.cmd = cmd;
+	return hashmap_get(hashmap, &k, NULL);
+}
+
+static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry) {
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	child_process_clear(&entry->process);
+	hashmap_remove(hashmap, entry, NULL);
+	free(entry);
+}
+
+void shutdown_protocol2_filter(pid_t pid)
+{
+	int did_fail;
+	struct cmd2process *entry;
+	struct hashmap_iter iter;
+	static const char shutdown[] = "shutdown\n";
+	char *result = NULL;
+
+	if (!cmd_process_map_initialized)
+		return;
+
+    hashmap_iter_init(&cmd_process_map, &iter);
+	while ((entry = hashmap_iter_next(&iter))) {
+		if (entry->process.pid == pid &&
+			FILTER_SUPPORTS_SHUTDOWN(entry->supported_capabilities)
+		) {
+			sigchain_push(SIGPIPE, SIG_IGN);
+			did_fail = direct_packet_write_data(
+				entry->process.in, shutdown, strlen(shutdown), 1);
+			if (!did_fail)
+				result = packet_read_line(entry->process.out, NULL);
+			close(entry->process.in);
+			close(entry->process.out);
+			sigchain_pop(SIGPIPE);
+
+			if (did_fail || !result || strcmp(result, "done"))
+				error("shutdown of external filter '%s' failed", entry->cmd);
+		}
+	}
+}
+
+static struct cmd2process *start_protocol2_filter(struct hashmap *hashmap, const char *cmd)
+{
+	int did_fail;
+	struct cmd2process *entry;
+	struct child_process *process;
+	const char *argv[] = { cmd, NULL };
+	struct string_list capabilities = STRING_LIST_INIT_NODUP;
+	char *capabilities_buffer;
+	int i;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	entry->supported_capabilities = 0;
+	process = &entry->process;
+
+	child_process_init(process);
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+	process->clean_on_exit = 1;
+	process->clean_on_exit_handler = shutdown_protocol2_filter;
+
+	if (start_command(process)) {
+		error("cannot fork to run external filter '%s'", cmd);
+		kill_protocol2_filter(hashmap, entry);
+		return NULL;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+	did_fail = strcmp(packet_read_line(process->out, NULL), "git-filter-protocol");
+	if (!did_fail)
+		did_fail |= strcmp(packet_read_line(process->out, NULL), "version 2");
+	if (!did_fail)
+		capabilities_buffer = packet_read_line(process->out, NULL);
+	else
+		capabilities_buffer = NULL;
+	sigchain_pop(SIGPIPE);
+
+	if (!did_fail && capabilities_buffer) {
+		string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);
+		if (capabilities.nr > 1 &&
+			!strcmp(capabilities.items[0].string, "capabilities")) {
+			for (i = 1; i < capabilities.nr; i++) {
+				const char *requested = capabilities.items[i].string;
+				if (!strcmp(requested, "stream")) {
+					entry->supported_capabilities |= FILTER_CAPABILITIES_STREAM;
+				} else if (!strcmp(requested, "clean")) {
+					entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN;
+				} else if (!strcmp(requested, "smudge")) {
+					entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE;
+				} else if (!strcmp(requested, "shutdown")) {
+					entry->supported_capabilities |= FILTER_CAPABILITIES_SHUTDOWN;
+				} else {
+					warning(
+						"external filter '%s' requested unsupported filter capability '%s'",
+						cmd, requested
+					);
+				}
+			}
+		} else {
+			error("filter capabilities not found");
+			did_fail = 1;
+		}
+		string_list_clear(&capabilities, 0);
+	}
+
+	if (did_fail) {
+		error("initialization for external filter '%s' failed", cmd);
+		kill_protocol2_filter(hashmap, entry);
+		return NULL;
+	}
+
+	hashmap_add(hashmap, entry);
+	return entry;
+}
+
+static int apply_protocol2_filter(const char *path, const char *src, size_t len,
+						int fd, struct strbuf *dst, const char *cmd,
+						const int wanted_capability)
+{
+	int ret = 1;
+	struct cmd2process *entry;
+	struct child_process *process;
+	struct stat file_stat;
+	struct strbuf nbuf = STRBUF_INIT;
+	size_t expected_bytes = 0;
+	char *strtol_end;
+	char *strbuf;
+	char *filter_type;
+	char *filter_result = NULL;
+
+	if (!cmd || !*cmd)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (!cmd_process_map_initialized) {
+		cmd_process_map_initialized = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = find_protocol2_filter_entry(&cmd_process_map, cmd);
+	}
+
+	fflush(NULL);
+
+	if (!entry) {
+		entry = start_protocol2_filter(&cmd_process_map, cmd);
+		if (!entry) {
+			return 0;
+		}
+	}
+	process = &entry->process;
+
+	if (!(wanted_capability & entry->supported_capabilities))
+		return 1;  // it is OK if the wanted capability is not supported
+
+	if FILTER_SUPPORTS_CLEAN(wanted_capability)
+		filter_type = "clean";
+	else if FILTER_SUPPORTS_SMUDGE(wanted_capability)
+		filter_type = "smudge";
+	else
+		die("unexpected filter type");
+
+	if (fd >= 0 && !src) {
+		if (fstat(fd, &file_stat) == -1)
+			return 0;
+		len = file_stat.st_size;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	packet_buf_write(&nbuf, "%s\n", filter_type);
+	ret &= !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
+
+	if (ret) {
+		strbuf_reset(&nbuf);
+		packet_buf_write(&nbuf, "filename=%s\n", path);
+		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
+	}
+
+	if (ret) {
+		strbuf_reset(&nbuf);
+		packet_buf_write(&nbuf, "size=%"PRIuMAX"\n", (uintmax_t)len);
+		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
+	}
+
+	if (ret) {
+		if (fd >= 0)
+			ret = !multi_packet_write_from_fd(fd, process->in);
+		else
+			ret = !multi_packet_write_from_buf(src, len, process->in);
+	}
+
+	if (ret && !FILTER_SUPPORTS_STREAM(entry->supported_capabilities)) {
+		strbuf = packet_read_line(process->out, NULL);
+		if (strlen(strbuf) > 5 && !strncmp("size=", strbuf, 5)) {
+			expected_bytes = (off_t)strtol(strbuf + 5, &strtol_end, 10);
+			ret = (strtol_end != strbuf && errno != ERANGE);
+		} else {
+			ret = 0;
+		}
+	}
+
+	if (ret) {
+		strbuf_reset(&nbuf);
+		ret = !multi_packet_read(process->out, &nbuf, expected_bytes,
+			FILTER_SUPPORTS_STREAM(entry->supported_capabilities));
+	}
+
+	if (ret) {
+		filter_result = packet_read_line(process->out, NULL);
+		ret = !strcmp(filter_result, "success");
+	}
+
+	sigchain_pop(SIGPIPE);
+
+	if (ret) {
+		strbuf_swap(dst, &nbuf);
+	} else {
+		if (!filter_result || strcmp(filter_result, "reject")) {
+			// Something went wrong with the protocol filter. Force shutdown!
+			error("external filter '%s' failed", cmd);
+			kill_protocol2_filter(&cmd_process_map, entry);
+		}
+	}
+	strbuf_release(&nbuf);
+	return ret;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
+	const char *process;
 	int required;
 } *user_convert, **user_convert_tail;
 
@@ -526,6 +871,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("process", key)) {
+		return git_config_string(&drv->process, var, value);
+	}
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -823,7 +1172,12 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	if (!ca.drv->clean && ca.drv->process)
+		return apply_protocol2_filter(
+			path, NULL, 0, -1, NULL, ca.drv->process, FILTER_CAPABILITIES_CLEAN
+		);
+	else
+		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -856,17 +1210,24 @@ int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
 	int ret = 0;
-	const char *filter = NULL;
+	const char *clean_filter = NULL;
+	const char *process_filter = NULL;
 	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
-		filter = ca.drv->clean;
+		clean_filter = ca.drv->clean;
+		process_filter = ca.drv->process;
 		required = ca.drv->required;
 	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
+	if (!clean_filter && process_filter)
+		ret |= apply_protocol2_filter(
+			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_CLEAN
+		);
+	else
+		ret |= apply_filter(path, src, len, -1, dst, clean_filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -885,13 +1246,21 @@ int convert_to_git(const char *path, const char *src, size_t len,
 void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 			      enum safe_crlf checksafe)
 {
+	int ret = 0;
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
-	assert(ca.drv->clean);
+	assert(ca.drv->clean || ca.drv->process);
+
+	if (!ca.drv->clean && ca.drv->process)
+		ret = apply_protocol2_filter(
+			path, NULL, 0, fd, dst, ca.drv->process, FILTER_CAPABILITIES_CLEAN
+		);
+	else
+		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (!ret)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -902,14 +1271,16 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 					    size_t len, struct strbuf *dst,
 					    int normalizing)
 {
-	int ret = 0, ret_filter = 0;
-	const char *filter = NULL;
+	int ret = 0, ret_filter;
+	const char *smudge_filter = NULL;
+	const char *process_filter = NULL;
 	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
 	if (ca.drv) {
-		filter = ca.drv->smudge;
+		process_filter = ca.drv->process;
+		smudge_filter = ca.drv->smudge;
 		required = ca.drv->required;
 	}
 
@@ -922,7 +1293,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 	 * CRLF conversion can be skipped if normalizing, unless there
 	 * is a smudge filter.  The filter might expect CRLFs.
 	 */
-	if (filter || !normalizing) {
+	if (smudge_filter || process_filter || !normalizing) {
 		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 		if (ret) {
 			src = dst->buf;
@@ -930,7 +1301,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
+	if (!smudge_filter && process_filter)
+		ret_filter = apply_protocol2_filter(
+			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_SMUDGE
+		);
+	else
+		ret_filter = apply_filter(path, src, len, -1, dst, smudge_filter);
 	if (!ret_filter && required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
@@ -1383,7 +1759,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	struct stream_filter *filter = NULL;
 
 	convert_attrs(&ca, path);
-	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
+	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 34c8eb9..e8a7703 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -296,4 +296,409 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+test_expect_success PERL 'required process filter should filter data' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge shutdown" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		mkdir testsubdir &&
+		echo "test333" >testsubdir/test3.r &&
+
+		rm -f rot13-filter.log &&
+		git add . &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >uniq-rot13-filter.log &&
+		cat >expected_add.log <<-\EOF &&
+			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			1 IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+			1 IN: shutdown -- [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_add.log uniq-rot13-filter.log &&
+
+		>rot13-filter.log &&
+		git commit . -m "test commit" &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq-rot13-filter.log &&
+		cat >expected_commit.log <<-\EOF &&
+			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
+			x IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+			1 IN: shutdown -- [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_commit.log uniq-rot13-filter.log &&
+
+		>rot13-filter.log &&
+		rm -f test?.r testsubdir/test3.r &&
+		git checkout . &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+			IN: shutdown -- [OK]
+		EOF
+		test_cmp expected_checkout.log smudge-rot13-filter.log &&
+
+		git checkout empty &&
+
+		>rot13-filter.log &&
+		git checkout master &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+			IN: shutdown -- [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
+
+		./../rot13.sh <test.r >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <test2.r >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <testsubdir/test3.r >expected &&
+		git cat-file blob :testsubdir/test3.r >actual &&
+		test_cmp expected actual
+	)
+'
+
+test_expect_success PERL 'required process filter should filter data stream' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl stream clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		mkdir testsubdir &&
+		echo "test333" >testsubdir/test3.r &&
+
+		rm -f rot13-filter.log &&
+		git add . &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >uniq-rot13-filter.log &&
+		cat >expected_add.log <<-\EOF &&
+			1 IN: clean test.r 57 [OK] -- OUT: STREAM [OK]
+			1 IN: clean test2.r 7 [OK] -- OUT: STREAM [OK]
+			1 IN: clean testsubdir/test3.r 8 [OK] -- OUT: STREAM [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_add.log uniq-rot13-filter.log &&
+
+		>rot13-filter.log &&
+		git commit . -m "test commit" &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq-rot13-filter.log &&
+		cat >expected_commit.log <<-\EOF &&
+			x IN: clean test.r 57 [OK] -- OUT: STREAM [OK]
+			x IN: clean test2.r 7 [OK] -- OUT: STREAM [OK]
+			x IN: clean testsubdir/test3.r 8 [OK] -- OUT: STREAM [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_commit.log uniq-rot13-filter.log &&
+
+		>rot13-filter.log &&
+		rm -f test?.r testsubdir/test3.r &&
+		git checkout . &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test2.r 7 [OK] -- OUT: STREAM [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: STREAM [OK]
+		EOF
+		test_cmp expected_checkout.log smudge-rot13-filter.log &&
+
+		git checkout empty &&
+
+		>rot13-filter.log &&
+		git checkout master &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: STREAM [OK]
+			IN: smudge test2.r 7 [OK] -- OUT: STREAM [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: STREAM [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
+
+		./../rot13.sh <test.r >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <test2.r >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <testsubdir/test3.r >expected &&
+		git cat-file blob :testsubdir/test3.r >actual &&
+		test_cmp expected actual
+	)
+'
+
+test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '
+	test_config_global filter.protocol.clean ./../rot13.sh &&
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		mkdir testsubdir &&
+		echo "test333" >testsubdir/test3.r &&
+
+		rm -f rot13-filter.log &&
+		git add . &&
+		test_must_be_empty rot13-filter.log &&
+
+		>rot13-filter.log &&
+		git commit . -m "test commit" &&
+		test_must_be_empty rot13-filter.log &&
+
+		>rot13-filter.log &&
+		rm -f test?.r testsubdir/test3.r &&
+		git checkout . &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout.log smudge-rot13-filter.log &&
+
+		git checkout empty &&
+
+		>rot13-filter.log &&
+		git checkout master &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
+			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
+
+		./../rot13.sh <test.r >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <test2.r >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual &&
+
+		./../rot13.sh <testsubdir/test3.r >expected &&
+		git cat-file blob :testsubdir/test3.r >actual &&
+		test_cmp expected actual
+	)
+'
+
+test_expect_success PERL 'required process filter should clean only' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+
+		rm -f rot13-filter.log &&
+		git add . &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >uniq-rot13-filter.log &&
+		cat >expected_add.log <<-\EOF &&
+			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_add.log uniq-rot13-filter.log &&
+
+		>rot13-filter.log &&
+		git commit . -m "test commit" &&
+		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq-rot13-filter.log &&
+		cat >expected_commit.log <<-\EOF &&
+			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+			1 start
+			1 wrote filter header
+		EOF
+		test_cmp expected_commit.log uniq-rot13-filter.log
+	)
+'
+
+test_expect_success PERL 'required process filter should process files larger LARGE_PACKET_MAX' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.file filter=protocol" >.gitattributes &&
+		cat ../generated-test-data/largish.file.rot13 >large.rot13 &&
+		cat ../generated-test-data/largish.file >large.file &&
+		cat large.file >large.original &&
+
+		git add large.file .gitattributes &&
+		git commit . -m "test commit" &&
+
+		rm -f large.file &&
+		git checkout -- large.file &&
+		git cat-file blob :large.file >actual &&
+		test_cmp large.rot13 actual
+	)
+'
+
+test_expect_success PERL 'required process filter should with clean error should fail' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "test333" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success PERL 'process filter should restart after unexpected write failure' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "1234567" >test2.o &&
+		cat test2.o >test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		printf "" >rot13-filter.log &&
+		git checkout . &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [WRITE FAIL]
+			start
+			wrote filter header
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
+
+		test_cmp ../test.o test.r &&
+		./../rot13.sh <../test.o >expected &&
+		git cat-file blob :test.r >actual &&
+		test_cmp expected actual &&
+
+		test_cmp test2.o test2.r &&
+		./../rot13.sh <test2.o >expected &&
+		git cat-file blob :test2.r >actual &&
+		test_cmp expected actual &&
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
+test_expect_success PERL 'process filter should not restart after intentionally rejected file' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "1234567" >test2.o &&
+		cat test2.o >test2.r &&
+		echo "this is going to fail" >reject.o &&
+		cat reject.o >reject.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		printf "" >rot13-filter.log &&
+		git checkout . &&
+		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
+		cat >expected_checkout_master.log <<-\EOF &&
+			start
+			wrote filter header
+			IN: smudge reject.r 22 [OK] -- OUT: 0 [REJECT]
+			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+			IN: smudge test2.r 8 [OK] -- OUT: 8 [OK]
+		EOF
+		test_cmp expected_checkout_master.log smudge-rot13-filter.log
+	)
+'
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
new file mode 100755
index 0000000..cb0925d
--- /dev/null
+++ b/t/t0021/rot13-filter.pl
@@ -0,0 +1,177 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+# The script takes the list of supported protocol capabilities as
+# arguments ("stream", "clean", and "smudge" are supported).
+#
+# This implementation supports three special test cases:
+# (1) If data with the filename "clean-write-fail.r" is processed with
+#     a "clean" operation then the write operation will die.
+# (2) If data with the filename "smudge-write-fail.r" is processed with
+#     a "smudge" operation then the write operation will die.
+# (3) If data with the filename "failure.r" is processed with any
+#     operation then the filter signals that the operation was not
+#     successful.
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+my @capabilities            = @ARGV;
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+sub packet_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+        return;
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+open my $debug, ">>", "rot13-filter.log";
+print $debug "start\n";
+$debug->flush();
+
+packet_write("git-filter-protocol\n");
+packet_write("version 2\n");
+packet_write( "capabilities " . join( ' ', @capabilities ) . "\n" );
+print $debug "wrote filter header\n";
+$debug->flush();
+
+while (1) {
+    my $command = packet_read();
+    unless ( defined($command) ) {
+        exit();
+    }
+    chomp $command;
+    print $debug "IN: $command";
+    $debug->flush();
+
+    if ( $command eq "shutdown" ) {
+        print $debug " -- [OK]";
+        $debug->flush();
+        packet_write("done\n");
+        exit();
+    }
+
+    my ($filename) = packet_read() =~ /filename=([^=]+)\n/;
+    print $debug " $filename";
+    $debug->flush();
+    my ($filelen) = packet_read() =~ /size=([^=]+)\n/;
+    chomp $filelen;
+    print $debug " $filelen";
+    $debug->flush();
+
+    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
+    my $output;
+
+    if ( $filelen > 0 ) {
+        my $input = "";
+        {
+            binmode(STDIN);
+            my $buffer;
+            my $done = 0;
+            while ( !$done ) {
+                ( $done, $buffer ) = packet_read();
+                $input .= $buffer;
+            }
+            print $debug " [OK] -- ";
+            $debug->flush();
+        }
+
+        if ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
+            $output = rot13($input);
+        }
+        elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
+            $output = rot13($input);
+        }
+        else {
+            die "bad command $command";
+        }
+    }
+
+    my $output_len = length($output);
+    if ( $filename eq "reject.r" ) {
+        $output_len = 0;
+    }
+
+    if ( grep( /^stream$/, @capabilities ) ) {
+        print $debug "OUT: STREAM ";
+    }
+    else {
+        packet_write("size=$output_len\n");
+        print $debug "OUT: $output_len ";
+    }
+    $debug->flush();
+
+    if ( $filename eq "reject.r" ) {
+        packet_write("reject\n");
+        print $debug "[REJECT]\n";    # Could also be an error
+        $debug->flush();
+    }
+
+    if ( $output_len > 0 ) {
+        if (( $command eq "clean" and $filename eq "clean-write-fail.r" )
+            or
+            ( $command eq "smudge" and $filename eq "smudge-write-fail.r" ))
+        {
+            print $debug "[WRITE FAIL]\n";
+            $debug->flush();
+            die "write error";
+        }
+        else {
+            while ( length($output) > 0 ) {
+                my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+                packet_write($packet);
+                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+                    $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+                }
+                else {
+                    $output = "";
+                }
+            }
+            packet_flush();
+            packet_write("success\n");
+            print $debug "[OK]\n";
+            $debug->flush();
+        }
+    }
+}
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/10] run-command: add clean_on_exit_handler
  2016-07-29 23:37   ` [PATCH v3 06/10] run-command: add clean_on_exit_handler larsxschneider
@ 2016-07-30  9:50     ` Johannes Sixt
  2016-08-01 11:14       ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Johannes Sixt @ 2016-07-30  9:50 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, gitster, jnareb, tboegi, mlbright, e, peff

Am 30.07.2016 um 01:37 schrieb larsxschneider@gmail.com:
> Some commands might need to perform cleanup tasks on exit. Let's give
> them an interface for doing this.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  run-command.c | 12 ++++++++----
>  run-command.h |  1 +
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/run-command.c b/run-command.c
> index 33bc63a..197b534 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -21,6 +21,7 @@ void child_process_clear(struct child_process *child)
>
>  struct child_to_clean {
>  	pid_t pid;
> +	void (*clean_on_exit_handler)(pid_t);
>  	struct child_to_clean *next;
>  };
>  static struct child_to_clean *children_to_clean;
> @@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
>  {
>  	while (children_to_clean) {
>  		struct child_to_clean *p = children_to_clean;
> +		if (p->clean_on_exit_handler)
> +			p->clean_on_exit_handler(p->pid);

This summons demons. cleanup_children() is invoked from a signal 
handler. In this case, it can call only async-signal-safe functions. It 
does not look like the handler that you are going to install later will 
take note of this caveat!

>  		children_to_clean = p->next;
>  		kill(p->pid, sig);
>  		if (!in_signal)

The condition that we see here in the context protects free(p) (which is 
not async-signal-safe). Perhaps the invocation of the new callback 
should be skipped in the same manner when this is called from a signal 
handler? 507d7804 (pager: don't use unsafe functions in signal handlers) 
may be worth a look.

-- Hannes


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 01/10] pkt-line: extract set_packet_header()
  2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
@ 2016-07-30 10:30     ` Jakub Narębski
  2016-08-01 11:33       ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 10:30 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: gitster, tboegi, mlbright, e, peff

W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> set_packet_header() converts an integer to a 4 byte hex string. Make
> this function locally available so that other pkt-line functions can
> use it.

This description is not that clear that set_packet_header() is a new
function.  Perhaps something like the following

  Extract the part of format_packet() that converts an integer to a 4 byte
  hex string into set_packet_header().  Make this new function ...

I also wonder if the part "Make this [new] function locally available..."
is needed; we need to justify exports, but I think we don't need to
justify limiting it to a module.  If you want to justify that it is
"static", perhaps it would be better to say why not to export it.

Anyway, I think it is worthy refactoring (and compiler should be
able to inline it, so there are no nano-performance considerations).

Good work!

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 62fdb37..445b8e1 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -98,9 +98,17 @@ void packet_buf_flush(struct strbuf *buf)
>  }
>  
>  #define hex(a) (hexchar[(a) & 15])

I guess that this is inherited from the original, but this preprocessor
macro is local to the format_header() / set_packet_header() function,
and would not work outside it.  Therefore I think we should #undef it
after set_packet_header(), just in case somebody mistakes it for
a generic hex() function.  Perhaps even put it inside set_packet_header(),
together with #undef.

But I might be mistaken... let's check... no, it isn't used outside it.

> -static void format_packet(struct strbuf *out, const char *fmt, va_list args)
> +static void set_packet_header(char *buf, const int size)
>  {
>  	static char hexchar[] = "0123456789abcdef";
> +	buf[0] = hex(size >> 12);
> +	buf[1] = hex(size >> 8);
> +	buf[2] = hex(size >> 4);
> +	buf[3] = hex(size);
> +}
> +
> +static void format_packet(struct strbuf *out, const char *fmt, va_list args)

It is strange how 'git diff' chosen to represent this patch...

> +{
>  	size_t orig_len, n;
>  
>  	orig_len = out->len;
> @@ -111,10 +119,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
>  	if (n > LARGE_PACKET_MAX)
>  		die("protocol error: impossibly long line");
>  
> -	out->buf[orig_len + 0] = hex(n >> 12);
> -	out->buf[orig_len + 1] = hex(n >> 8);
> -	out->buf[orig_len + 2] = hex(n >> 4);
> -	out->buf[orig_len + 3] = hex(n);
> +	set_packet_header(&out->buf[orig_len], n);
>  	packet_trace(out->buf + orig_len + 4, n - 4, 1);
>  }
>  
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-07-29 23:37   ` [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
@ 2016-07-30 10:49     ` Jakub Narębski
  2016-08-01 12:00       ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 10:49 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Junio C Hamano, tboegi, mlbright, Eric Wong, Jeff King

W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Sometimes pkt-line data is already available in a buffer and it would
> be a waste of resources to write the packet using packet_write() which
> would copy the existing buffer into a strbuf before writing it.
> 
> If the caller has control over the buffer creation then the
> PKTLINE_DATA_START macro can be used to skip the header and write
> directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
> would be the maximum). direct_packet_write() would take this buffer,
> adjust the pkt-line header and write it.
> 
> If the caller has no control over the buffer creation then
> direct_packet_write_data() can be used. This function creates a pkt-line
> header. Afterwards the header and the data buffer are written using two
> consecutive write calls.

I don't quite understand what do you mean by "caller has control
over the buffer creation".  Do you mean that caller either can write
over the buffer, or cannot overwrite the buffer?  Or do you mean that
caller either can allocate buffer to hold header, or is getting
only the data?

> 
> Both functions have a gentle parameter that indicates if Git should die
> in case of a write error (gentle set to 0) or return with a error (gentle
> set to 1).

So they are *_maybe_gently(), isn't it ;-)?  Are there any existing
functions in Git codebase that take 'gently' / 'strict' / 'die_on_error'
parameter?

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 30 ++++++++++++++++++++++++++++++
>  pkt-line.h |  5 +++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 445b8e1..6fae508 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -135,6 +135,36 @@ void packet_write(int fd, const char *fmt, ...)
>  	write_or_die(fd, buf.buf, buf.len);
>  }
>  
> +int direct_packet_write(int fd, char *buf, size_t size, int gentle)
> +{
> +	int ret = 0;
> +	packet_trace(buf + 4, size - 4, 1);
> +	set_packet_header(buf, size);
> +	if (gentle)
> +		ret = !write_or_whine_pipe(fd, buf, size, "pkt-line");
> +	else
> +		write_or_die(fd, buf, size);

Hmmm... in gently case we get the information in the warning that
it is about "pkt-line", which is missing from !gently case.  But
it is probably not important.

> +	return ret;
> +}

Nice clean function, thanks to extracting set_packet_header().

> +
> +int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle)

I would name the parameter 'data', rather than 'buf'; IMVHO it
better describes it.

> +{
> +	int ret = 0;
> +	char hdr[4];
> +	set_packet_header(hdr, sizeof(hdr) + size);
> +	packet_trace(buf, size, 1);
> +	if (gentle) {
> +		ret = (
> +			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||

You can write '4' here, no need for sizeof(hdr)... though compiler would
optimize it away.

> +			!write_or_whine_pipe(fd, buf, size, "pkt-line data")
> +		);

Do we want to try to write "pkt-line data" if "pkt-line header" failed?
If not, perhaps De Morgan-ize it

  +		ret = !(
  +			write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") &&
  +			write_or_whine_pipe(fd, buf, size, "pkt-line data")
  +		);


> +	} else {
> +		write_or_die(fd, hdr, sizeof(hdr));
> +		write_or_die(fd, buf, size);

I guess these two writes (here and in 'gently' case) are unavoidable...

> +	}
> +	return ret;
> +}
> +
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>  {
>  	va_list args;
> diff --git a/pkt-line.h b/pkt-line.h
> index 3cb9d91..02dcced 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -23,6 +23,8 @@ void packet_flush(int fd);
>  void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  void packet_buf_flush(struct strbuf *buf);
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int direct_packet_write(int fd, char *buf, size_t size, int gentle);
> +int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
>  
>  /*
>   * Read a packetized line into the buffer, which must be at least size bytes
> @@ -77,6 +79,9 @@ char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
>  
>  #define DEFAULT_PACKET_MAX 1000
>  #define LARGE_PACKET_MAX 65520
> +#define PKTLINE_HEADER_LEN 4
> +#define PKTLINE_DATA_START(pkt) ((pkt) + PKTLINE_HEADER_LEN)
> +#define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - PKTLINE_HEADER_LEN)

Those are not used in direct_packet_write() and direct_packet_write_data();
but they would make them more verbose and less readable.

>  extern char packet_buffer[LARGE_PACKET_MAX];
>  
>  #endif
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
@ 2016-07-30 12:04     ` Jakub Narębski
  2016-08-01 12:28       ` Lars Schneider
  2016-07-31 20:36     ` Torstem Bögershausen
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 12:04 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: gitster, tboegi, mlbright, e, peff

W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_flush() would die in case of a write error even though for some callers
> an error would be acceptable. Add packet_flush_gentle() which writes a pkt-line
> flush packet and returns `0` for success and `1` for failure.

I think it should be packet_flush_gently(), as in "to flush gently",
but this is only my opinion; I have not checked the naming rules and
practices for the rest of Git codebase.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send
  2016-07-29 23:37   ` [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
@ 2016-07-30 12:29     ` Jakub Narębski
  2016-08-01 12:18       ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 12:29 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: gitster, tboegi, mlbright, e, peff

W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> The packet_trace() call is not ideal in format_packet() as we would print

Style; I think the following is more readable:

  The packet_trace() call in format_packet() is not ideal, as we would...

> a trace when a packet is formatted and (potentially) when the packet is
> actually send. This was no problem up until now because format_packet()
> was only used by one function. Fix it by moving the trace call into the
> function that actally sends the packet.

s/actally/actually/

I don't buy this explanation.  If you want to trace packets, you might
do it on input (when formatting packet), or on output (when writing
packet).  It's when there are more than one formatting function, but
one writing function, then placing trace call in write function means
less code duplication; and of course the reverse.

Another issue is that something may happen between formatting packet
and sending it, and we probably want to packet_trace() when packet
is actually send.

Neither of those is visible in commit message.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 1728690..32c0a34 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -126,7 +126,6 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
>  		die("protocol error: impossibly long line");
>  
>  	set_packet_header(&out->buf[orig_len], n);
> -	packet_trace(out->buf + orig_len + 4, n - 4, 1);
>  }
>  
>  void packet_write(int fd, const char *fmt, ...)
> @@ -138,6 +137,7 @@ void packet_write(int fd, const char *fmt, ...)
>  	va_start(args, fmt);
>  	format_packet(&buf, fmt, args);
>  	va_end(args);
> +	packet_trace(buf.buf + 4, buf.len - 4, 1);
>  	write_or_die(fd, buf.buf, buf.len);
>  }
>  
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size
  2016-07-29 23:37   ` [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size larsxschneider
@ 2016-07-30 13:58     ` Jakub Narębski
  2016-08-01 12:23       ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 13:58 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: gitster, tboegi, mlbright, e, peff

W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> According to LARGE_PACKET_MAX in pkt-line.h the maximal lenght of a
> pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
> therefore the pkt-line data component must not exceed 65516 bytes.

s/lenght/length/

Is it maximum length of pkt-line packet, or maximum length of data
that can be send in a packet?

With 4 hex digits, maximal length if pkt-line packet (together
with length) is ffff_16, that is 2^16-1 = 65535.  Where does the
number 65520 comes from?

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/technical/protocol-common.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
> index bf30167..ecedb34 100644
> --- a/Documentation/technical/protocol-common.txt
> +++ b/Documentation/technical/protocol-common.txt
> @@ -67,9 +67,9 @@ with non-binary data the same whether or not they contain the trailing
>  LF (stripping the LF if present, and not complaining when it is
>  missing).
>  
> -The maximum length of a pkt-line's data component is 65520 bytes.
> -Implementations MUST NOT send pkt-line whose length exceeds 65524
> -(65520 bytes of payload + 4 bytes of length data).
> +The maximum length of a pkt-line's data component is 65516 bytes.
> +Implementations MUST NOT send pkt-line whose length exceeds 65520
> +(65516 bytes of payload + 4 bytes of length data).
>  
>  Implementations SHOULD NOT send an empty pkt-line ("0004").
>  
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
@ 2016-07-30 22:05     ` Jakub Narębski
  2016-07-31  9:42       ` Jakub Narębski
  2016-08-01 13:32       ` Lars Schneider
  2016-07-31 22:19     ` Jakub Narębski
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-07-30 22:05 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Junio C Hamano, Torsten Bögershausen, Martin-Louis Bright,
	Eric Wong, Jeff King

W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.
> 
> This patch adds the filter.<driver>.process string option which, if used,
> keeps the external filter process running and processes all blobs with
> the following packet format (pkt-line) based protocol over standard input
> and standard output.

I think it would be nice to have here at least summary of the benchmarks
you did in https://github.com/github/git-lfs/pull/1382

> 
> Git starts the filter on first usage and expects a welcome
> message, protocol version number, and filter capabilities
> separated by spaces:
> ------------------------
> packet:          git< git-filter-protocol\n
> packet:          git< version 2\n
> packet:          git< capabilities clean smudge\n

Sorry for going back and forth, but now I think that 'capabilities' are
not really needed here, though they are in line with "version" in
the second packet / line, namely "version 2".  If it does not make
parsing more difficult...

> ------------------------
> Supported filter capabilities are "clean", "smudge", "stream",
> and "shutdown".

I'd rather put "stream" and "shutdown" capabilities into separate
patches, for easier review.

> 
> Afterwards Git sends a command (based on the supported
> capabilities), the filename including its path
> relative to the repository root, the content size as ASCII number
> in bytes, the content split in zero or many pkt-line packets,
> and a flush packet at the end:

I guess the following is the most basic example, with mode detailed
description left for the documentation.

> ------------------------
> packet:          git> smudge\n
> packet:          git> filename=path/testfile.dat\n
> packet:          git> size=7\n

So I see you went with "<variable>=<value>" idea, rather than "<value>"
(with <variable> defined by position in a sequence of 'header' packets),
or "<variable> <value>..." that introductory header uses.

> packet:          git> CONTENT
> packet:          git> 0000
> ------------------------
> 
> The filter is expected to respond with the result content size as
> ASCII number in bytes. If the capability "stream" is defined then
> the filter must not send the content size. Afterwards the result
> content in send in zero or many pkt-line packets and a flush packet
> at the end. 

If it does not cost filter anything, it could send size upfront
(based on size of original, or based on external data), even if
it is prepared for streaming.

In the opposite case, where filter cannot stream because it requires
whole contents upfront (e.g. to calculate hash of the contents, or
to do operation that needs whole file like sorting or reversing lines),
it should always be able to calculate the size... or not.  For
example 'sort | uniq' filter needs whole input upfront for sort,
but it does not know how many lines will be in output without doing
the 'uniq' part.

So I think the ability of filter to provide size (or size hint) of
its output should be decoupled from streaming support.

>             Finally a "success" packet is send to indicate that
> everything went well.

That's a nice addition, and probably a necessary one, to the stream
protocol.  Git must know and consume it - we wouldn't be able to
retrofit it later.

> ------------------------
> packet:          git< size=57\n   (omitted with capability "stream")

I was thinking about having possible responses to receiving file
contents (or starting receiving in the streaming case) to be:

  packet:          git< ok size=7\n    (or "ok 7\n", if size is known)

or

  packet:          git< ok\n           (if filter does not know size upfront)

or

  packet:          git< fail <msg>\n   (or just "fail" + packet with msg)

The last would be when filter knows upfront that it cannot perform
the operation.  Though sending an empty file with non-"success" final
would work as well.

For example LFS filter (that is configured as not required) may refuse
to store files which are smaller than some pre-defined constant threshold.

> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< success\n
> ------------------------
> 
> In case the filter cannot process the content, it is expected
> to respond with the result content size 0 (only if "stream" is
> not defined) and a "reject" packet.
> ------------------------
> packet:          git< size=0\n    (omitted with capability "stream")
> packet:          git< reject\n
> ------------------------

This is *wrong* idea!  Empty file, with size=0, can be a perfectly
legitimate response.  

For example rot13 filter should respond to an empty file on input
with an empty file on output.  LFS-like filters and encryption
mechanism should return empty file on fetch / decryption
if such empty file was stored / encrypted.

A strange LFS could even use filenames (with files being empty
themselves) as a lookup key for artifactory.  For example a kind
of CDN for common libraries, with version embedded in filename,
like 'libs/jquery-1.9.0.min.js', etc.

> 
> After the filter has processed a blob it is expected to wait for
> the next command. A demo implementation can be found in
> `t/t0021/rot13-filter.pl` located in the Git core repository.

If filter does not support "shutdown" capability (or if said
capability is postponed for later patch), it should behave sanely
when Git command reaps it (SIGTERM + wait + SIGKILL?, SIGCHLD?).

> 
> If the filter supports the "shutdown" capability then Git will
> send the "shutdown" command and wait until the filter answers
> with "done". This gives the filter the opportunity to perform
> cleanup tasks. Afterwards the filter is expected to exit.
> ------------------------
> packet:          git> shutdown\n
> packet:          git< done\n
> ------------------------

I guess there is no timeout mechanism: if filter hangs on shutdown,
then git command would also hang waiting for signal to exit.

> 
> If a filter.<driver>.clean or filter.<driver>.smudge command
> is configured then these commands always take precedence over
> a configured filter.<driver>.process command.

Note: the value of `clean`, `smudge` and `process` is a command,
not just a string.

I wonder if it would be worth it to explain the reasoning behind
this solution and show alternate ones.

 * Using a separate variable to signal that filters are invoked
   per-command rather than per-file, and use pkt-line interface,
   like boolean-valued `useProtocol`, or `protocolVersion` set
   to '2' or 'v2', or `persistence` set to 'per-command', there
   is high risk of user's trying to use exiting one-shot per-file
   filters... and Git hanging.

 * Using new variables for each capability, e.g. `processSmudge`
   and `processClean` would lead to explosion of variable names;
   I think.

 * Current solution of using `process` in addition to `clean`
   and `smudge` clearly says that you need to use different
   command for per-file (`clean` and `smudge`), and per-command
   filter, while allowing to use them together.

   The possible disadvantage is Git command starting `process`
   filter, only to see that it doesn't offer required capability,
   for example offering only "clean" but not "smudge".  There
   is simple workaround - set `smudge` variable (same as not
   present capability) to empty string.

> 
> Please note that you cannot use an existing filter.<driver>.clean
> or filter.<driver>.smudge command as filter.<driver>.process
> command. As soon as Git would detect a file that needs to be
> processed by this filter, it would stop responding.

I think this needs to be in the documentation (I have not checked
yet if it is), but is not needed in the already long commit message.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
> ---
>  Documentation/gitattributes.txt |  84 ++++++++-
>  convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
>  t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
>  t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
>  4 files changed, 1053 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 8882a3e..e3fbcc2 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If a long running filter process (see section
> +below) is used then Git can process all blobs with a single filter
> +invocation for the entire life of a single Git command (e.g.
> +`git add .`).

Proposed improvement:

                       If a long running `process` filter is used
   in place of `clean` and/or `smudge` filters, then Git can process
   all blobs with a single filter command invocation for the entire
   life of a single Git command, for example `git add --all`.  See
   section below for the description of the protocol used to
   communicate with a `process` filter.

>  
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -375,6 +379,84 @@ substitution.  For example:
>  ------------------------
>  
>  
> +Long Running Filter Process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +If the filter command (string value) is defined via

This is no mere string value, this is command invocation (with its
own rules, e.g. splitting parameters on whitespace, etc.).  Though
I'm not sure how to say it succintly.  Maybe skip "(string value)"?
But it is there for a reason...

> +filter.<driver>.process then Git can process all blobs with a

Shouldn't it be `filter.<driver>.process`?

> +single filter invocation for the entire life of a single Git
> +command. This is achieved by using the following packet
> +format (pkt-line, see protocol-common.txt) based protocol over

Can we linkgit-it (to technical documentation)?

> +standard input and standard output.
> +
> +Git starts the filter on first usage and expects a welcome

Is "usage" here correct?  Perhaps it would be more readable
to say that Git starts filter when encountering first file
that needs cleaning or smudgeing.

> +message, protocol version number, and filter capabilities
> +separated by spaces:
> +------------------------
> +packet:          git< git-filter-protocol\n
> +packet:          git< version 2\n
> +packet:          git< capabilities clean smudge\n
> +------------------------
> +Supported filter capabilities are "clean", "smudge", "stream",
> +and "shutdown".

Filter should include at least one of "clean" and "smudge"
capabilities (currently), otherwise it wouldn't do anything.

I don't know if it is a good place to say that because of pkt-line
recommendations about text-content packets, each of those should
terminate in endline, with "\n" included in pkt-line length.

> +
> +Afterwards Git sends a command (based on the supported
> +capabilities),

I think it should be something like the following:

   If among filter `process` capabilities there is capability
   that corresponds to the operation performed by a Git command
   (that is, either "clean" or "smudge"), then Git would send,
   in separate packets, a command (based on supported capabilites),

though it feels too "chatty" (and the sentence gets quite long).

>                the filename including its path
> +relative to the repository root, 

Errr... "the filename including its path"? Wouldn't be it simpler
to just say:

  the pathname of a file relative to the repository root,

Also, isn't it now "filename=<pathname>\n"?

>                                   the content size as ASCII number
> +in bytes, 

Could Git not give the size, for example if fstat() fails? Do
we reserve space for other information here?

Also, isn't it now "size=<bytes>\n"?

>             the content split in zero or many pkt-line packets,

s/zero or many/zero or more/

> +and a flush packet at the end:

I wonder if instead of long sentence, it would be more readable
to use enumeration (ordered list) or itemize (unordered list).

> +------------------------
> +packet:          git> smudge\n
> +packet:          git> filename=path/testfile.dat\n
> +packet:          git> size=7\n
> +packet:          git> CONTENT
> +packet:          git> 0000
> +------------------------
> +
> +The filter is expected to respond with the result content size as
> +ASCII number in bytes. If the capability "stream" is defined then
> +the filter must not send the content size.

As I wrote earlier, I think sending or not the size of the output
should be decoupled from the "stream" capability.

Streaming is IMVHO rather a capability of starting to send parts
of response before the whole contents of input arrives.  I think
per-file filters support that and that's what start_async() there
is about.

>                                             Afterwards the result
> +content in send in zero or many pkt-line packets and a flush packet
> +at the end. Finally a "success" packet is send to indicate that
> +everything went well.

I guess it is "success" packet if everything went well, and place
for informing about errors in the future - filter is assumed to die
if there are errors in filtering, isn't it?

That is, not "send to indicate", but "send if".

> +------------------------
> +packet:          git< size=57\n   (omitted with capability "stream")
> +packet:          git< SMUDGED_CONTENT
> +packet:          git< 0000
> +packet:          git< success\n
> +------------------------
> +
> +In case the filter cannot process the content, it is expected
> +to respond with the result content size 0 (only if "stream" is
> +not defined) and a "reject" packet.
> +------------------------
> +packet:          git< size=0\n    (omitted with capability "stream")
> +packet:          git< reject\n
> +------------------------

I would assume that we have two error conditions.  

First situation is when the filter knows upfront (after receiving name
and size of file, and after receiving contents for not-streaming filters)
that it cannot process the file (like e.g. LFS filter with artifactory
replica/shard being a bit behind master, and not including contents of
the file being filtered).

My proposal is to reply with "fail" _in place of_ size of reply:

   packet:         git< fail\n       (any case: size known or not, stream or not)

It could be "reject", or "error" instead of "fail".


Another situation is if filter encounters error during output,
either with streaming filter (or non-stream, but not storing whole
input upfront) realizing in the middle of output that there is something
wrong with input (e.g. converting between encoding, and encountering
character that cannot be represented in output encoding), or e.g. filter
process being killed, or network connection dropping with LFS filter, etc.
The filter has send some packets with output already.  In this case
filter should flush, and send "reject" or "error" packet.

   <error condition>
   packet:         git< "0000"       (flush packet)
   packet:         git< reject\n

Should there be a place for an error message, or would standard error
(stderr) be used for this?

> +
> +After the filter has processed a blob it is expected to wait for
> +the next command. A demo implementation can be found in
> +`t/t0021/rot13-filter.pl` located in the Git core repository.

It is actually in Git sources.  Is it the best way to refer to
such files?

> +
> +If the filter supports the "shutdown" capability then Git will
> +send the "shutdown" command and wait until the filter answers
> +with "done". This gives the filter the opportunity to perform
> +cleanup tasks. Afterwards the filter is expected to exit.
> +------------------------
> +packet:          git> shutdown\n
> +packet:          git< done\n
> +------------------------
> +
> +If a filter.<driver>.clean or filter.<driver>.smudge command
> +is configured then these commands always take precedence over
> +a configured filter.<driver>.process command.

All right; this is quite clear.

> +
> +Please note that you cannot use an existing filter.<driver>.clean
> +or filter.<driver>.smudge command as filter.<driver>.process
> +command. As soon as Git would detect a file that needs to be
> +processed by this filter, it would stop responding.

This isn't.


P.S. I will comment about the implementation part in the next email.
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-30 22:05     ` Jakub Narębski
@ 2016-07-31  9:42       ` Jakub Narębski
  2016-07-31 19:49         ` Lars Schneider
  2016-08-01 13:32       ` Lars Schneider
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-07-31  9:42 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Junio C Hamano, Torsten Bögershausen, Martin-Louis Bright,
	Eric Wong, Jeff King

[Excuse me replying to myself, but there are a few things I forgot,
 or realized only later]

W dniu 31.07.2016 o 00:05, Jakub Narębski pisze:
> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>>
>> This patch adds the filter.<driver>.process string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the following packet format (pkt-line) based protocol over standard input
>> and standard output.
> 
> I think it would be nice to have here at least summary of the benchmarks
> you did in https://github.com/github/git-lfs/pull/1382

Note that this feature is especially useful if startup time is long,
that is if you are using an operating system with costly fork / new process
startup time like MS Windows (which you have mentioned), or writing
filter in a programming language with large startup time like Java
or Python (the latter may have changed since).

  https://gnustavo.wordpress.com/2012/06/28/programming-languages-start-up-times/

[...]
> I was thinking about having possible responses to receiving file
> contents (or starting receiving in the streaming case) to be:
> 
>   packet:          git< ok size=7\n    (or "ok 7\n", if size is known)
> 
> or
> 
>   packet:          git< ok\n           (if filter does not know size upfront)
> 
> or
> 
>   packet:          git< fail <msg>\n   (or just "fail" + packet with msg)
> 
> The last would be when filter knows upfront that it cannot perform
> the operation.  Though sending an empty file with non-"success" final
> would work as well.

[...]

>> In case the filter cannot process the content, it is expected
>> to respond with the result content size 0 (only if "stream" is
>> not defined) and a "reject" packet.
>> ------------------------
>> packet:          git< size=0\n    (omitted with capability "stream")
>> packet:          git< reject\n
>> ------------------------
> 
> This is *wrong* idea!  Empty file, with size=0, can be a perfectly
> legitimate response.  

Actually, I think I have misunderstood your intent.  If you want to have
simpler protocol, with only one place to signal errors, that is after
sending a response, then proper way of signaling the error condition
would be to send empty file and then "reject" instead of "success":

   packet:          git< size=0\n    (omitted with capability "stream")
   packet:          git< 0000        (we need this flush packet)
   packet:          git< reject\n

Otherwise in the case without size upfront (capability "stream")
file with contents "reject" would be mistaken for the "reject" packet.

See below for proposal with two places to signal errors: before sending
first byte, and after.


NOTE: there is a bit of mixed and possibly confusing notation, that
is 0000 is flush packet, not packet with 0000 as content.  Perhaps
write pkt-line in full?


[...]
>> ---
>>  Documentation/gitattributes.txt |  84 ++++++++-
>>  convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
>>  t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
>>  t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
>>  4 files changed, 1053 insertions(+), 13 deletions(-)
>>  create mode 100755 t/t0021/rot13-filter.pl

Wouldn't it be better for easier review to split it into separate patches?
Perhaps at least the new test...

[...]
> I would assume that we have two error conditions.  
> 
> First situation is when the filter knows upfront (after receiving name
> and size of file, and after receiving contents for not-streaming filters)
> that it cannot process the file (like e.g. LFS filter with artifactory
> replica/shard being a bit behind master, and not including contents of
> the file being filtered).
> 
> My proposal is to reply with "fail" _in place of_ size of reply:
> 
>    packet:         git< fail\n       (any case: size known or not, stream or not)
> 
> It could be "reject", or "error" instead of "fail".
> 
> 
> Another situation is if filter encounters error during output,
> either with streaming filter (or non-stream, but not storing whole
> input upfront) realizing in the middle of output that there is something
> wrong with input (e.g. converting between encoding, and encountering
> character that cannot be represented in output encoding), or e.g. filter
> process being killed, or network connection dropping with LFS filter, etc.
> The filter has send some packets with output already.  In this case
> filter should flush, and send "reject" or "error" packet.
> 
>    <error condition>
>    packet:         git< "0000"       (flush packet)
>    packet:         git< reject\n
> 
> Should there be a place for an error message, or would standard error
> (stderr) be used for this?


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-31  9:42       ` Jakub Narębski
@ 2016-07-31 19:49         ` Lars Schneider
  2016-07-31 22:59           ` Jakub Narębski
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-07-31 19:49 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 31 Jul 2016, at 11:42, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> [Excuse me replying to myself, but there are a few things I forgot,
> or realized only later]

No worries :)

> 
> W dniu 31.07.2016 o 00:05, Jakub Narębski pisze:
>> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> Git's clean/smudge mechanism invokes an external filter process for every
>>> single blob that is affected by a filter. If Git filters a lot of blobs
>>> then the startup time of the external filter processes can become a
>>> significant part of the overall Git execution time.
>>> 
>>> This patch adds the filter.<driver>.process string option which, if used,
>>> keeps the external filter process running and processes all blobs with
>>> the following packet format (pkt-line) based protocol over standard input
>>> and standard output.
>> 
>> I think it would be nice to have here at least summary of the benchmarks
>> you did in https://github.com/github/git-lfs/pull/1382
> 
> Note that this feature is especially useful if startup time is long,
> that is if you are using an operating system with costly fork / new process
> startup time like MS Windows (which you have mentioned), or writing
> filter in a programming language with large startup time like Java
> or Python (the latter may have changed since).
> 
>  https://gnustavo.wordpress.com/2012/06/28/programming-languages-start-up-times/

OK, I will add this. Is it OK to add the link to the commit message?
(since I don't know how long the link will be available).


> [...]
>> I was thinking about having possible responses to receiving file
>> contents (or starting receiving in the streaming case) to be:
>> 
>>  packet:          git< ok size=7\n    (or "ok 7\n", if size is known)
>> 
>> or
>> 
>>  packet:          git< ok\n           (if filter does not know size upfront)
>> 
>> or
>> 
>>  packet:          git< fail <msg>\n   (or just "fail" + packet with msg)
>> 
>> The last would be when filter knows upfront that it cannot perform
>> the operation.  Though sending an empty file with non-"success" final
>> would work as well.
> 
> [...]
> 
>>> In case the filter cannot process the content, it is expected
>>> to respond with the result content size 0 (only if "stream" is
>>> not defined) and a "reject" packet.
>>> ------------------------
>>> packet:          git< size=0\n    (omitted with capability "stream")
>>> packet:          git< reject\n
>>> ------------------------
>> 
>> This is *wrong* idea!  Empty file, with size=0, can be a perfectly
>> legitimate response.  
> 
> Actually, I think I have misunderstood your intent.  If you want to have
> simpler protocol, with only one place to signal errors, that is after
> sending a response, then proper way of signaling the error condition
> would be to send empty file and then "reject" instead of "success":
> 
>   packet:          git< size=0\n    (omitted with capability "stream")
>   packet:          git< 0000        (we need this flush packet)
>   packet:          git< reject\n
> 
> Otherwise in the case without size upfront (capability "stream")
> file with contents "reject" would be mistaken for the "reject" packet.
> 
> See below for proposal with two places to signal errors: before sending
> first byte, and after.

Right now the protocol is implemented covering the following cases:

## CASE 1 - no stream success

packet:          git< size=57\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< success\n


## CASE 2 - no stream success but 0 byte response

packet:          git< size=0\n
packet:          git< success\n


## CASE 3 - no stream filter; filter doesn't want to process the file

packet:          git< size=0\n
packet:          git< reject\n


## CASE 4 - no stream filter; filter error

packet:          git< size=57\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< error\n

CASE 4 is not explicitly checked. If a final message is neither
"success" nor "reject" then it is interpreted as error. If that
happens then Git will shutdown and restart the filter process
if there is another file to filter. 

Alternatively a filter process can shutdown itself, too, to signal
an error.

The corresponding stream filter look like this:

## CASE 1 - stream success

packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< success\n


## CASE 2 - stream success but 0 byte response

packet:          git< 0000
packet:          git< success\n


## CASE 3 - stream filter; filter doesn't want to process the file

packet:          git< 0000
packet:          git< reject\n


## CASE 4 - stream filter; filter error

packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< error\n

--

I just realized that the size 0 case is a bit inconsistent
in the no stream case as it has no flush packet. Maybe I 
should indeed remove the flush packet in the no stream case
completely?!

Do the cases above make sense to you?

Regarding error handling. I would prefer it if the filter prints
all errors to STDERR by itself. I think that is the safest
option to communicate errors to the users because if the communication
got into a bad state then Git might not be able to read the errors
properly.

See Peff's response on the topic, too:
http://public-inbox.org/git/20160729165018.GA6553%40sigill.intra.peff.net/


> NOTE: there is a bit of mixed and possibly confusing notation, that
> is 0000 is flush packet, not packet with 0000 as content.  Perhaps
> write pkt-line in full?

I am not sure I understand what you mean (maybe it's too late for me...).
Can you try to rephrase or give an example?

Thank you,
Lars



> 
> 
> [...]
>>> ---
>>> Documentation/gitattributes.txt |  84 ++++++++-
>>> convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
>>> t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
>>> t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
>>> 4 files changed, 1053 insertions(+), 13 deletions(-)
>>> create mode 100755 t/t0021/rot13-filter.pl
> 
> Wouldn't it be better for easier review to split it into separate patches?
> Perhaps at least the new test...
> 
> [...]
>> I would assume that we have two error conditions.  
>> 
>> First situation is when the filter knows upfront (after receiving name
>> and size of file, and after receiving contents for not-streaming filters)
>> that it cannot process the file (like e.g. LFS filter with artifactory
>> replica/shard being a bit behind master, and not including contents of
>> the file being filtered).
>> 
>> My proposal is to reply with "fail" _in place of_ size of reply:
>> 
>>   packet:         git< fail\n       (any case: size known or not, stream or not)
>> 
>> It could be "reject", or "error" instead of "fail".
>> 
>> 
>> Another situation is if filter encounters error during output,
>> either with streaming filter (or non-stream, but not storing whole
>> input upfront) realizing in the middle of output that there is something
>> wrong with input (e.g. converting between encoding, and encountering
>> character that cannot be represented in output encoding), or e.g. filter
>> process being killed, or network connection dropping with LFS filter, etc.
>> The filter has send some packets with output already.  In this case
>> filter should flush, and send "reject" or "error" packet.
>> 
>>   <error condition>
>>   packet:         git< "0000"       (flush packet)
>>   packet:         git< reject\n
>> 
>> Should there be a place for an error message, or would standard error
>> (stderr) be used for this?
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
  2016-07-30 12:04     ` Jakub Narębski
@ 2016-07-31 20:36     ` Torstem Bögershausen
  2016-07-31 21:45       ` Lars Schneider
  1 sibling, 1 reply; 120+ messages in thread
From: Torstem Bögershausen @ 2016-07-31 20:36 UTC (permalink / raw)
  To: larsxschneider@gmail.com
  Cc: git@vger.kernel.org, gitster@pobox.com, jnareb@gmail.com,
	mlbright@gmail.com, e@80x24.org, peff@peff.net



> Am 29.07.2016 um 20:37 schrieb larsxschneider@gmail.com:
> 
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_flush() would die in case of a write error even though for some callers
> an error would be acceptable.
What happens if there is a write error ?
Basically the protocol is out of synch.
Lenght information is mixed up with payload, or the other way
around.
It may be, that the consequences of a write error are acceptable,
because a filter is allowed to fail.
What is not acceptable is a "broken" protocol.
The consequence schould be to close the fd and tear down all
resources. connected to it.
In our case to terminate the external filter daemon in some way,
and to never use this instance again.


> Add packet_flush_gentle() which writes a pkt-line
> flush packet and returns `0` for success and `1` for failure.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> pkt-line.c | 6 ++++++
> pkt-line.h | 1 +
> 2 files changed, 7 insertions(+)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 6fae508..1728690 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -91,6 +91,12 @@ void packet_flush(int fd)
>  write_or_die(fd, "0000", 4);
> }
> 
> +int packet_flush_gentle(int fd)
> +{
> +    packet_trace("0000", 4, 1);
> +    return !write_or_whine_pipe(fd, "0000", 4, "flush packet");
> +}
> +
> void packet_buf_flush(struct strbuf *buf)
> {
>  packet_trace("0000", 4, 1);
> diff --git a/pkt-line.h b/pkt-line.h
> index 02dcced..3953c98 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -23,6 +23,7 @@ void packet_flush(int fd);
> void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> void packet_buf_flush(struct strbuf *buf);
> void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int packet_flush_gentle(int fd);
> int direct_packet_write(int fd, char *buf, size_t size, int gentle);
> int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
> 
> -- 
> 2.9.0
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-31 20:36     ` Torstem Bögershausen
@ 2016-07-31 21:45       ` Lars Schneider
  2016-08-02 19:56         ` Torsten Bögershausen
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-07-31 21:45 UTC (permalink / raw)
  To: Torstem Bögershausen
  Cc: git@vger.kernel.org, gitster@pobox.com, jnareb@gmail.com,
	mlbright@gmail.com, e@80x24.org, peff@peff.net


> On 31 Jul 2016, at 22:36, Torstem Bögershausen <tboegi@web.de> wrote:
> 
> 
> 
>> Am 29.07.2016 um 20:37 schrieb larsxschneider@gmail.com:
>> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_flush() would die in case of a write error even though for some callers
>> an error would be acceptable.
> What happens if there is a write error ?
> Basically the protocol is out of synch.
> Lenght information is mixed up with payload, or the other way
> around.
> It may be, that the consequences of a write error are acceptable,
> because a filter is allowed to fail.
> What is not acceptable is a "broken" protocol.
> The consequence schould be to close the fd and tear down all
> resources. connected to it.
> In our case to terminate the external filter daemon in some way,
> and to never use this instance again.

Correct! That is exactly what is happening in kill_protocol2_filter()
here:


+static int apply_protocol2_filter(const char *path, const char *src, size_t len,
+						int fd, struct strbuf *dst, const char *cmd,
+						const int wanted_capability)
+{
...
+	if (ret) {
+		strbuf_swap(dst, &nbuf);
+	} else {
+		if (!filter_result || strcmp(filter_result, "reject")) {
+			// Something went wrong with the protocol filter. Force shutdown!
+			error("external filter '%s' failed", cmd);
+			kill_protocol2_filter(&cmd_process_map, entry);
+		}
+	}
+	strbuf_release(&nbuf);
+	return ret;
+}

More context:
https://github.com/larsxschneider/git/blob/e128326070847ac596e8bb21adebc8abab2003fc/convert.c#L821

- Lars


> 
> 
>> Add packet_flush_gentle() which writes a pkt-line
>> flush packet and returns `0` for success and `1` for failure.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 6 ++++++
>> pkt-line.h | 1 +
>> 2 files changed, 7 insertions(+)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index 6fae508..1728690 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -91,6 +91,12 @@ void packet_flush(int fd)
>> write_or_die(fd, "0000", 4);
>> }
>> 
>> +int packet_flush_gentle(int fd)
>> +{
>> +    packet_trace("0000", 4, 1);
>> +    return !write_or_whine_pipe(fd, "0000", 4, "flush packet");
>> +}
>> +
>> void packet_buf_flush(struct strbuf *buf)
>> {
>> packet_trace("0000", 4, 1);
>> diff --git a/pkt-line.h b/pkt-line.h
>> index 02dcced..3953c98 100644
>> --- a/pkt-line.h
>> +++ b/pkt-line.h
>> @@ -23,6 +23,7 @@ void packet_flush(int fd);
>> void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>> void packet_buf_flush(struct strbuf *buf);
>> void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>> +int packet_flush_gentle(int fd);
>> int direct_packet_write(int fd, char *buf, size_t size, int gentle);
>> int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
>> 
>> -- 
>> 2.9.0
>> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
  2016-07-30 22:05     ` Jakub Narębski
@ 2016-07-31 22:19     ` Jakub Narębski
  2016-08-01 17:55       ` Lars Schneider
  2016-08-03 13:10       ` Lars Schneider
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-07-31 22:19 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Junio C Hamano, Torsten Bögershausen, Martin-Louis Bright,
	Eric Wong, Jeff King

W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
[...]
> +Please note that you cannot use an existing filter.<driver>.clean
> +or filter.<driver>.smudge command as filter.<driver>.process
> +command.

I think it would be more readable and easier to understand to write:

  ... you cannot use an existing ... command with
  filter.<driver>.process

About the style: wouldn't `filter.<driver>.process` be better?

>              As soon as Git would detect a file that needs to be
> +processed by this filter, it would stop responding.

This is quite convoluted, and hard to understand.  I would say
that because `clean` and `smudge` filters are expected to read
first, while Git expects `process` filter to say first, using
`clean` or `smudge` filter without changes as `process` filter
would lead to git command deadlocking / hanging / stopping
responding.

> +
> +
>  Interaction between checkin/checkout attributes
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
> diff --git a/convert.c b/convert.c
> index 522e2c5..be6405c 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  #include "quote.h"
>  #include "sigchain.h"
> +#include "pkt-line.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -481,11 +482,355 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return ret;
>  }
>  
> +static int multi_packet_read(int fd_in, struct strbuf *sb, size_t expected_bytes, int is_stream)

About name of this function: `multi_packet_read` is fine, though I wonder
if `packet_read_in_full` with nearly the same parameters as `packet_read`,
or `packet_read_till_flush`, or `read_in_full_packetized` would be better.

Also, the problem is that while we know that what packet_read() stores
would fit in memory (in size_t), it is not true for reading whole file,
which might be very large - for example huge graphical assets like raw
images or raw videos, or virtual machine images.  Isn't that the goal
of git-LFS solutions, which need this feature?  Shouldn't we have then
both `multi_packet_read_to_fd` and `multi_packet_read_to_buf`,
or whatever?

Also, if we have `fd_in`, then perhaps `sb_out`?

I am also unsure if `expected_bytes` (or `expected_size`) should not be
just a size hint, leaving handing mismatch between expected size and
real size of output to the caller; then the `is_stream` would be not
needed.

> +{
> +	int bytes_read;
> +	size_t total_bytes_read = 0;

Why `bytes_read` is int, while `total_bytes_read` is size_t? Ah, I see
that packet_read() returns an int.  It should be ssize_t, just like
read(), isn't it?  But we know that packet size is limited, and would
fit in an int (or would it?).

Also, total_bytes_read could overflow size_t, but then we would have
problems storing the result in strbuf.

> +	if (expected_bytes == 0 && !is_stream)
> +		return 0;

So in all cases *except* size = 0 we expect flush packet after the
contents, but size = 0 is a corner case without flush packet?

> +
> +	if (is_stream)
> +		strbuf_grow(sb, LARGE_PACKET_MAX);           // allocate space for at least one packet
> +	else
> +		strbuf_grow(sb, st_add(expected_bytes, 1));  // add one extra byte for the packet flush
> +
> +	do {
> +		bytes_read = packet_read(
> +			fd_in, NULL, NULL,
> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
> +			PACKET_READ_GENTLE_ON_EOF
> +		);
> +		if (bytes_read < 0)
> +			return 1;  // unexpected EOF

Don't we usually return negative numbers on error?  Ah, I see that the
return is a bool, which allows to use boolean expression with 'return'.
But I am still unsure if it is good API, this return value.

If we move handling of size mismatch to the caller, then the function
can simply return the size of data read (probably off_t or uint64_t).
Then the caller can check if it is what it expected, and react accordingly.

> +
> +		if (is_stream &&
> +			bytes_read > 0 &&
> +			sb->len - total_bytes_read - 1 <= 0)
> +			strbuf_grow(sb, st_add(sb->len, LARGE_PACKET_MAX));
> +		total_bytes_read += bytes_read;
> +	}
> +	while (
> +		bytes_read > 0 &&                   // the last packet was no flush
> +		sb->len - total_bytes_read - 1 > 0  // we still have space left in the buffer

Ah, so buffer is resized only in the 'is_stream' case.  Perhaps then
use an "int options" instead of 'is_stream', and have one of flags
tell if we should resize or not, that is if size parameter is hint
or a strict limit.

> +	);
> +	strbuf_setlen(sb, total_bytes_read);
> +	return (is_stream ? 0 : expected_bytes != total_bytes_read);
> +}
> +
> +static int multi_packet_write_from_fd(const int fd_in, const int fd_out)

Is it equivalent of copy_fd() function, but where destination uses pkt-line
and we need to pack data into pkt-lines?

> +{
> +	int did_fail = 0;
> +	ssize_t bytes_to_write;
> +	while (!did_fail) {
> +		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_LEN);

Using global variable packet_buffer makes this code thread-unsafe, isn't it?
But perhaps that is not a problem, because other functions are also
using this global variable.

It is more of PKTLINE_DATA_MAXLEN, isn't it?

> +		if (bytes_to_write < 0)
> +			return 1;
> +		if (bytes_to_write == 0)
> +			break;
> +		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
> +	}
> +	if (!did_fail)
> +		did_fail = packet_flush_gentle(fd_out);

Shouldn't we try to flush even if there was an error?  Or is it
that if there is an error writing, then there is some problem
such that we know that flush would not work?

> +	return did_fail;

Return true on fail?  Shouldn't we follow example of copy_fd()
from copy.c, and return COPY_READ_ERROR, or COPY_WRITE_ERROR,
or PKTLINE_WRITE_ERROR?


> +}
> +
> +static int multi_packet_write_from_buf(const char *src, size_t len, int fd_out)

It is equivalent of write_in_full(), with different order of parameters,
but where destination file descriptor expects pkt-line and we need to pack
data into pkt-lines?

NOTE: function description comments?

> +{
> +	int did_fail = 0;
> +	size_t bytes_written = 0;
> +	size_t bytes_to_write;

Note to self: bytes_to_write should fit in size_t, as it is limited to
PKTLINE_DATA_LEN.  bytes_written should fit in size_t, as it is at most
len, which is of type size_t.

> +	while (!did_fail) {
> +		if ((len - bytes_written) > PKTLINE_DATA_LEN)
> +			bytes_to_write = PKTLINE_DATA_LEN;
> +		else
> +			bytes_to_write = len - bytes_written;
> +		if (bytes_to_write == 0)
> +			break;
> +		did_fail |= direct_packet_write_data(fd_out, src + bytes_written, bytes_to_write, 1);
> +		bytes_written += bytes_to_write;

Ah, I see now why we need both direct_packet_write() and
direct_packet_write_data().  Nice abstraction, makes for
clear code.

The last parameter of '1' means 'gently', isn't it?

> +	}
> +	if (!did_fail)
> +		did_fail = packet_flush_gentle(fd_out);
> +	return did_fail;
> +}

I think all three/four of those functions should be added in a separate
commit, separate patch in patch series.  Namely:

 - for git -> filter:
    * read from fd,      write pkt-line to fd  (off_t)
    * read from str+len, write pkt-line to fd  (size_t, ssize_t)
 - for filter -> git:
    * read pkt-line from fd, write to fd       (off_t)
    * read pkt-line from fd, write to str+len  (size_t, ssize_t)

Perhaps some of those can be in one overloaded function, perhaps it would
be easier to keep them separate.

Also, I do wonder how the fetch / push code spools pack file received
over pkt-lines to disk.  Can we reuse that code?  Or maybe that code
could use those new functions?


> +
> +#define FILTER_CAPABILITIES_STREAM   0x1
> +#define FILTER_CAPABILITIES_CLEAN    0x2
> +#define FILTER_CAPABILITIES_SMUDGE   0x4
> +#define FILTER_CAPABILITIES_SHUTDOWN 0x8
> +#define FILTER_SUPPORTS_STREAM(type) ((type) & FILTER_CAPABILITIES_STREAM)
> +#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
> +#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) & FILTER_CAPABILITIES_SHUTDOWN)
> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	const char *cmd;
> +	int supported_capabilities;

I wonder if switching from int (perhaps with field width of 1 to denote
that it is boolean-like flag) to mask makes it more readable, or less.
But I think it is.


Reading Documentation/technical/api-hashmap.txt I found the following
recommendation:

  `struct hashmap_entry`::

        An opaque structure representing an entry in the hash table, which must
        be used as first member of user data structures. Ideally it should be
        followed by an int-sized member to prevent unused memory on 64-bit
        systems due to alignment.

Therefore it "int supported_capabilities" should precede
"const char *cmd", I think.  Though it is not strictly necessary; it
is not as if this hash table were large (maximum size is limited by
the number of filter drivers configured), so we don't waste much space
due to internal padding / due to alignment.

> +	struct child_process process;
> +};
> +
> +static int cmd_process_map_initialized = 0;
> +static struct hashmap cmd_process_map;

Reading Documentation/technical/api-hashmap.txt I see that:

  `tablesize` is the allocated size of the hash table. A non-0 value indicates
  that the hashmap is initialized.

So cmd_process_map_initialized is not really needed, is it?

> +
> +static int cmd2process_cmp(const struct cmd2process *e1,
> +							const struct cmd2process *e2,
> +							const void *unused)
> +{
> +	return strcmp(e1->cmd, e2->cmd);
> +}

Well, to be exact (which is decidely not needed!) two commands might
be equivalent not being identical as strings (e.g. extra space between
parameters).  But it is something the user should care about, not Git.

> +
> +static struct cmd2process *find_protocol2_filter_entry(struct hashmap *hashmap, const char *cmd)

I'm not sure if *_protocol2_* is needed; those functions are static,
local to convert.c.

> +{
> +	struct cmd2process k;

Does this name of variable 'k' follow established convention?
'key' would be more descriptive, but it's not as if this function
was long; so 'k' is all right, I think.

> +	hashmap_entry_init(&k, strhash(cmd));
> +	k.cmd = cmd;
> +	return hashmap_get(hashmap, &k, NULL);
> +}
> +
> +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry) {

Programming style: the opening brace should be on separate line,
that is:

  +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry)
  +{

> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);
> +	child_process_clear(&entry->process);
> +	hashmap_remove(hashmap, entry, NULL);
> +	free(entry);
> +}

All those, from #define FILTER_CAPABILITIES_ to here could be put
in a separate patch, to reduce size of this one.  But I am less
sure that it is worth it for this case.

> +
> +void shutdown_protocol2_filter(pid_t pid)
> +{
[...]

In my opinion this should be postponed to a separate commit.

> +}
> +
> +static struct cmd2process *start_protocol2_filter(struct hashmap *hashmap, const char *cmd)

This has some parts in common with existing filter_buffer_or_fd().
I wonder if it would be worth to extract those common parts.

But perhaps it would be better to leave such refactoring for later.

> +{
> +	int did_fail;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	const char *argv[] = { cmd, NULL };
> +	struct string_list capabilities = STRING_LIST_INIT_NODUP;
> +	char *capabilities_buffer;
> +	int i;
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	entry->supported_capabilities = 0;
> +	process = &entry->process;
> +
> +	child_process_init(process);

filter_buffer_or_fd() uses instead

  struct child_process child_process = CHILD_PROCESS_INIT;

But I see that you need to access &entry->process anyway, so you
need to have it here, and in this case child_process_init() is
equivalent.

I wonder if it would be worth it to use strbuf for cmd.

> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;
> +	process->out = -1;
> +	process->clean_on_exit = 1;
> +	process->clean_on_exit_handler = shutdown_protocol2_filter;

These two lines are new, and related to the "shutdown" capability, isn't it?

> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external filter '%s'", cmd);
> +		kill_protocol2_filter(hashmap, entry);

I guess the alternative solution of adding filter to the hashmap only
after starting the process would be racy?

Ah, disregard that. I see that this pattern is a common way to error
out in this function (for process-related errors).

> +		return NULL;
> +	}
> +
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	did_fail = strcmp(packet_read_line(process->out, NULL), "git-filter-protocol");
> +	if (!did_fail)
> +		did_fail |= strcmp(packet_read_line(process->out, NULL), "version 2");
> +	if (!did_fail)
> +		capabilities_buffer = packet_read_line(process->out, NULL);
> +	else
> +		capabilities_buffer = NULL;
> +	sigchain_pop(SIGPIPE);
> +
> +	if (!did_fail && capabilities_buffer) {
> +		string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);
> +		if (capabilities.nr > 1 &&
> +			!strcmp(capabilities.items[0].string, "capabilities")) {
> +			for (i = 1; i < capabilities.nr; i++) {
> +				const char *requested = capabilities.items[i].string;
> +				if (!strcmp(requested, "stream")) {
> +					entry->supported_capabilities |= FILTER_CAPABILITIES_STREAM;
> +				} else if (!strcmp(requested, "clean")) {
> +					entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN;
> +				} else if (!strcmp(requested, "smudge")) {
> +					entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE;
> +				} else if (!strcmp(requested, "shutdown")) {
> +					entry->supported_capabilities |= FILTER_CAPABILITIES_SHUTDOWN;
> +				} else {
> +					warning(
> +						"external filter '%s' requested unsupported filter capability '%s'",
> +						cmd, requested
> +					);
> +				}
> +			}
> +		} else {
> +			error("filter capabilities not found");
> +			did_fail = 1;
> +		}
> +		string_list_clear(&capabilities, 0);
> +	}

I wonder if the above conditional wouldn't be better to be put in
a separate function, parse_filter_capabilities(capabilities_buffer),
returning a mask, or having mask as an out parameter, and returning
an error condition.

> +
> +	if (did_fail) {
> +		error("initialization for external filter '%s' failed", cmd);

More detailed information not needed, because one can use GIT_PACKET_TRACE.
Would it be worth add this information as a kind of advice, or put it
in the documentation of the `process` option?

> +		kill_protocol2_filter(hashmap, entry);
> +		return NULL;
> +	}
> +
> +	hashmap_add(hashmap, entry);
> +	return entry;
> +}
> +
> +static int apply_protocol2_filter(const char *path, const char *src, size_t len,
> +						int fd, struct strbuf *dst, const char *cmd,
> +						const int wanted_capability)

apply_protocol2_filter, or apply_process_filter?  Or rather,
s/_protocol2_/_process_/g ?

This is equivalent to

   static int apply_filter(const char *path, const char *src, size_t len, int fd,
                           struct strbuf *dst, const char *cmd)

Could we have extended that one instead?

> +{
> +	int ret = 1;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	struct stat file_stat;
> +	struct strbuf nbuf = STRBUF_INIT;
> +	size_t expected_bytes = 0;
> +	char *strtol_end;
> +	char *strbuf;
> +	char *filter_type;
> +	char *filter_result = NULL;
> +

> +	if (!cmd || !*cmd)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;

This is the same as in apply_filter().

> +
> +	if (!cmd_process_map_initialized) {
> +		cmd_process_map_initialized = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +		entry = NULL;
> +	} else {
> +		entry = find_protocol2_filter_entry(&cmd_process_map, cmd);
> +	}

Here we try to find existing process, rather than starting new
as in apply_filter()

> +
> +	fflush(NULL);

This is the same as in apply_filter(), but I wonder what it is for.

> +
> +	if (!entry) {
> +		entry = start_protocol2_filter(&cmd_process_map, cmd);
> +		if (!entry) {
> +			return 0;
> +		}

Style; we prefer:

  +		if (!entry)
  +			return 0;

This is very similar to apply_filter(), but the latter uses start_async()
from "run-command.h", with filter_buffer_or_fd() as asynchronous process,
which gets passed command to run in struct filter_params.  In this
function start_protocol2_filter() runs start_command(), synchronous API.

Why the difference?

> +	}
> +	process = &entry->process;
> +
> +	if (!(wanted_capability & entry->supported_capabilities))
> +		return 1;  // it is OK if the wanted capability is not supported
> +
> +	if FILTER_SUPPORTS_CLEAN(wanted_capability)
> +		filter_type = "clean";
> +	else if FILTER_SUPPORTS_SMUDGE(wanted_capability)
> +		filter_type = "smudge";
> +	else
> +		die("unexpected filter type");

Style: it should be

  +	if (FILTER_SUPPORTS_CLEAN(wanted_capability))
  +		filter_type = "clean";
  +	else if (FILTER_SUPPORTS_SMUDGE(wanted_capability))
  +		filter_type = "smudge";
  +	else
  +		die("unexpected filter type");

even though by accident the macro provides the parentheses to "if".

Can we make an error/die message more detailed?  Maybe it is
not possible...

> +
> +	if (fd >= 0 && !src) {
> +		if (fstat(fd, &file_stat) == -1)
> +			return 0;
> +		len = file_stat.st_size;
> +	}

All right, when fstat() can fail?  Could we then send contents without
size upfront, or is it better to require size to make it more consistent
for filter drivers scripts?

Could this whole "send single file" be put in a separate function?
Or is it not worth it?

> +
> +	sigchain_push(SIGPIPE, SIG_IGN);

Hmmm... ignoring SIGPIPE was good for one-shot filters.  Is it still
O.K. for per-command persistent ones?

> +
> +	packet_buf_write(&nbuf, "%s\n", filter_type);
> +	ret &= !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
> +
> +	if (ret) {
> +		strbuf_reset(&nbuf);
> +		packet_buf_write(&nbuf, "filename=%s\n", path);
> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
> +	}

Perhaps a better solution would be

        if (err)
        	goto fin_error;

rather than this.

> +
> +	if (ret) {
> +		strbuf_reset(&nbuf);
> +		packet_buf_write(&nbuf, "size=%"PRIuMAX"\n", (uintmax_t)len);
> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
> +	}

Or maybe extract writing the header for a file into a separate function?
This one gets a bit long...

> +
> +	if (ret) {
> +		if (fd >= 0)
> +			ret = !multi_packet_write_from_fd(fd, process->in);
> +		else
> +			ret = !multi_packet_write_from_buf(src, len, process->in);
> +	}

This is not streaming.  The above sends whole file, or whole string to
the filter process, without draining filter output.  If the filter were
to read some, then write some, it might deadlock on full buffers, isn't it?
Or am I mistaken?

> +
> +	if (ret && !FILTER_SUPPORTS_STREAM(entry->supported_capabilities)) {
> +		strbuf = packet_read_line(process->out, NULL);
> +		if (strlen(strbuf) > 5 && !strncmp("size=", strbuf, 5)) {
> +			expected_bytes = (off_t)strtol(strbuf + 5, &strtol_end, 10);
> +			ret = (strtol_end != strbuf && errno != ERANGE);
> +		} else {
> +			ret = 0;
> +		}
> +	}
> +
> +	if (ret) {
> +		strbuf_reset(&nbuf);
> +		ret = !multi_packet_read(process->out, &nbuf, expected_bytes,
> +			FILTER_SUPPORTS_STREAM(entry->supported_capabilities));
> +	}

What happens if the output of filter does not fit in size_t?  I see that
(I think) this problem is inherited from the original implementation.

> +
> +	if (ret) {
> +		filter_result = packet_read_line(process->out, NULL);
> +		ret = !strcmp(filter_result, "success");
> +	}
> +
> +	sigchain_pop(SIGPIPE);
> +
> +	if (ret) {
> +		strbuf_swap(dst, &nbuf);
> +	} else {
> +		if (!filter_result || strcmp(filter_result, "reject")) {
> +			// Something went wrong with the protocol filter. Force shutdown!
> +			error("external filter '%s' failed", cmd);
> +			kill_protocol2_filter(&cmd_process_map, entry);
> +		}
> +	}

So if Git gets finish signal "success" from filter, it accepts the output.
If Git gets finish signal "reject" from filter, it restarts filter (and
reject the output - user can retry the command himself / herself).
If Git gets any other finish signal, for example "error" (but this is not
standarized), then it rejects the output, keeping the unfiltered result,
but keeps filtering.

I think it is not described in this detail in the documentation of the
new protocol.

> +	strbuf_release(&nbuf);
> +	return ret;
> +}

I wonder if this point might be start of the new patch... but then you
would have no way to test what you wrote.

> +
>  static struct convert_driver {
>  	const char *name;
>  	struct convert_driver *next;
>  	const char *smudge;
>  	const char *clean;
> +	const char *process;
>  	int required;
>  } *user_convert, **user_convert_tail;

All right.

>  
> @@ -526,6 +871,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>  	if (!strcmp("clean", key))
>  		return git_config_string(&drv->clean, var, value);
>  
> +	if (!strcmp("process", key)) {
> +		return git_config_string(&drv->process, var, value);
> +	}
> +

All right.

>  	if (!strcmp("required", key)) {
>  		drv->required = git_config_bool(var, value);
>  		return 0;
> @@ -823,7 +1172,12 @@ int would_convert_to_git_filter_fd(const char *path)
>  	if (!ca.drv->required)
>  		return 0;
>  
> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> +	if (!ca.drv->clean && ca.drv->process)
> +		return apply_protocol2_filter(
> +			path, NULL, 0, -1, NULL, ca.drv->process, FILTER_CAPABILITIES_CLEAN
> +		);
> +	else
> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);

Could we augment apply_filter() instead, so that the invocation is

        return apply_filter(path, NULL, 0, -1, NULL, ca.drv, FILTER_CLEAN);

Though I am not sure if moving this conditional to apply_filter would
be a good idea; maybe wrapper around augmented apply_filter_do()?

>  }
>  
>  const char *get_convert_attr_ascii(const char *path)
> @@ -856,17 +1210,24 @@ int convert_to_git(const char *path, const char *src, size_t len,
>                     struct strbuf *dst, enum safe_crlf checksafe)
>  {
>  	int ret = 0;
> -	const char *filter = NULL;
> +	const char *clean_filter = NULL;
> +	const char *process_filter = NULL;
>  	int required = 0;
>  	struct conv_attrs ca;
>  
>  	convert_attrs(&ca, path);
>  	if (ca.drv) {
> -		filter = ca.drv->clean;
> +		clean_filter = ca.drv->clean;
> +		process_filter = ca.drv->process;
>  		required = ca.drv->required;
>  	}

All right (assuming un-augmented apply_filter()).

>  
> -	ret |= apply_filter(path, src, len, -1, dst, filter);
> +	if (!clean_filter && process_filter)
> +		ret |= apply_protocol2_filter(
> +			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_CLEAN
> +		);
> +	else
> +		ret |= apply_filter(path, src, len, -1, dst, clean_filter);

I wonder if it would be more readable to write it like this
(and of course elsewhere too):

  +	if (!clean_filter && process_filter)
  +		ret |= apply_protocol2_filter(
  +			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_CLEAN
  +		);
  +	else
  +		ret |= apply_filter(
  +			path, src, len, -1, dst, clean_filter);
  +		);


Though it would screw up "git blame -C -C -w"

>  	if (!ret && required)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>  
> @@ -885,13 +1246,21 @@ int convert_to_git(const char *path, const char *src, size_t len,
>  void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
>  			      enum safe_crlf checksafe)
>  {
> +	int ret = 0;

Right, 'ret' is needed because we now have two possibilities:
`clean` filter and `process` filter.

>  	struct conv_attrs ca;
>  	convert_attrs(&ca, path);
>  
>  	assert(ca.drv);
> -	assert(ca.drv->clean);
> +	assert(ca.drv->clean || ca.drv->process);
> +
> +	if (!ca.drv->clean && ca.drv->process)
> +		ret = apply_protocol2_filter(
> +			path, NULL, 0, fd, dst, ca.drv->process, FILTER_CAPABILITIES_CLEAN
> +		);
> +	else
> +		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
>  
> -	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
> +	if (!ret)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>  
>  	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
> @@ -902,14 +1271,16 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  					    size_t len, struct strbuf *dst,
>  					    int normalizing)
>  {
> -	int ret = 0, ret_filter = 0;
> -	const char *filter = NULL;
> +	int ret = 0, ret_filter;

Why the change:

  -	int ret = 0, ret_filter = 0;
  +	int ret = 0, ret_filter;

> +	const char *smudge_filter = NULL;
> +	const char *process_filter = NULL;
>  	int required = 0;
>  	struct conv_attrs ca;
>  
>  	convert_attrs(&ca, path);
>  	if (ca.drv) {
> -		filter = ca.drv->smudge;
> +		process_filter = ca.drv->process;
> +		smudge_filter = ca.drv->smudge;
>  		required = ca.drv->required;
>  	}

All right, the same.

[...]
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 34c8eb9..e8a7703 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -296,4 +296,409 @@ test_expect_success 'disable filter with empty override' '
>  	test_must_be_empty err
>  '
>  
> +test_expect_success PERL 'required process filter should filter data' '
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge shutdown" &&
> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +		git add . &&
> +		git commit . -m "test commit" &&

This is more of "Initial commit", not that it matters

> +		git branch empty &&
> +
> +		cat ../test.o >test.r &&

Err, the above is just copying file, isn't it?
Maybe it was copied from other tests, I have not checked.

> +		echo "test22" >test2.r &&
> +		mkdir testsubdir &&
> +		echo "test333" >testsubdir/test3.r &&

All right, we test text file, we test binary file (I assume), we test
file in a subdirectory.  What about testing empty file?  Or large file
which would not fit in the stdin/stdout buffer (as EXPENSIVE test)?

> +
> +		rm -f rot13-filter.log &&
> +		git add . &&

So this runs "clean" filter, storing cleaned contents in the index.

> +		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >uniq-rot13-filter.log &&
> +		cat >expected_add.log <<-\EOF &&
> +			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
> +			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
> +			1 IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]

And we check the "know size upfront" case (mistakenly called non-"stream").

> +			1 IN: shutdown -- [OK]

And test "shutdown" capability (not as separate test).

> +			1 start
> +			1 wrote filter header
> +		EOF

And we are required to keep the expected_add.log file sorted by hand???

> +		test_cmp expected_add.log uniq-rot13-filter.log &&
> +
> +		>rot13-filter.log &&

Truncate log. Still in the same test.

> +		git commit . -m "test commit" &&

This is test commit with files undergoing "clean" part of filter.

> +		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
> +			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq-rot13-filter.log &&

There is known performance regression, in that filter is run more
than once on given file.

Actually... why it does not use cleaned-up contents from the index?

> +		cat >expected_commit.log <<-\EOF &&
> +			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
> +			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
> +			x IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
> +			1 IN: shutdown -- [OK]
> +			1 start
> +			1 wrote filter header

Right, this is the goal of the patch series: for filter to be started
only once per git command invocation.

> +		EOF
> +		test_cmp expected_commit.log uniq-rot13-filter.log &&
> +

Still in the same test, even though we would be testing "smudge"
capability now.  

It's a pity that t/test-lib.sh does not support subtests from
the TAP specification (Test Anything Protocol that Git testsuite
uses).

> +		>rot13-filter.log &&
> +		rm -f test?.r testsubdir/test3.r &&
> +		git checkout . &&

All right, we removed some files so that "git checkout ." could
restore them to life.

> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&

Useless use of cat

  +		grep -v "IN: clean"  rot13-filter.log  >smudge-rot13-filter.log &&

Also: why 'git checkout <path>' would run "clean" filter?
Is it existing strange behaviour?

> +		cat >expected_checkout.log <<-\EOF &&
> +			start
> +			wrote filter header
> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
> +			IN: shutdown -- [OK]
> +		EOF

This time without 'sort | uniq -c'.  Is it really needed for the
"good" case, or is it there for two cases to look similar?

> +		test_cmp expected_checkout.log smudge-rot13-filter.log &&
> +
> +		git checkout empty &&

Shouldn't we check that switching to branch 'empty' does not run
filters, or is it covered by other tests?  Or perhaps this simply
does not matter here, is it?

> +
> +		>rot13-filter.log &&
> +		git checkout master &&

Does it test different callpath than 'git checkout .'?  Well, the
set of files is different...

> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
> +		cat >expected_checkout_master.log <<-\EOF &&
> +			start
> +			wrote filter header
> +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
> +			IN: shutdown -- [OK]
> +		EOF
> +		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
> +

And here we start checking that the filter did filter,
that is the content in the repository is "clean"ed-up.
Still the same test.

> +		./../rot13.sh <test.r >expected &&
> +		git cat-file blob :test.r >actual &&
> +		test_cmp expected actual &&
> +
> +		./../rot13.sh <test2.r >expected &&
> +		git cat-file blob :test2.r >actual &&
> +		test_cmp expected actual &&
> +
> +		./../rot13.sh <testsubdir/test3.r >expected &&
> +		git cat-file blob :testsubdir/test3.r >actual &&
> +		test_cmp expected actual
> +	)
> +'
> +
> +test_expect_success PERL 'required process filter should filter data stream' '
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl stream clean smudge" &&
> +	test_config_global filter.protocol.required true &&

Errr... I don't see how it is different from the previous test.
[...]

> +
> +test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '

All right, so this tests the precedence... well, it doesn't.

It tests that `process` filter with "smudge" capability only works well
with one-shot `clean` filter.

> +	test_config_global filter.protocol.clean ./../rot13.sh &&
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&

Why the difference in pathnames (the directory part) between those two?

> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		git branch empty &&
> +
> +		cat ../test.o >test.r &&
> +		echo "test22" >test2.r &&
> +		mkdir testsubdir &&
> +		echo "test333" >testsubdir/test3.r &&
> +
> +		rm -f rot13-filter.log &&
> +		git add . &&
> +		test_must_be_empty rot13-filter.log &&
> +
> +		>rot13-filter.log &&
> +		git commit . -m "test commit" &&
> +		test_must_be_empty rot13-filter.log &&

All right, these tests that `process` filter is not ran.  But we don't
know if it is because it lacks capability, or because it is overriden
by one-shot filter (well, that comes later).

> +
> +		>rot13-filter.log &&
> +		rm -f test?.r testsubdir/test3.r &&
> +		git checkout . &&
> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
> +		cat >expected_checkout.log <<-\EOF &&
> +			start
> +			wrote filter header
> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
> +		EOF
> +		test_cmp expected_checkout.log smudge-rot13-filter.log &&

This part is repeated many, many times.  Maybe add some helper
shell function for this?

[...]
> +		./../rot13.sh <test.r >expected &&
> +		git cat-file blob :test.r >actual &&
> +		test_cmp expected actual &&
> +
> +		./../rot13.sh <test2.r >expected &&
> +		git cat-file blob :test2.r >actual &&
> +		test_cmp expected actual &&
> +
> +		./../rot13.sh <testsubdir/test3.r >expected &&
> +		git cat-file blob :testsubdir/test3.r >actual &&
> +		test_cmp expected actual

Here we test that equivalent one-shot cleanup filter was run.
Here also we have repeated contents; maybe some helper function
would make it shorter?

> +	)
> +'

Here I am stopping examining tests in detail.

> +test_expect_success PERL 'required process filter should clean only' '
> +test_expect_success PERL 'required process filter should process files larger LARGE_PACKET_MAX' '

Those two tests do not depend on being required or not; it is only
that without required they would fail softly in case of latter test
(which we can detect too).

> +test_expect_success PERL 'required process filter should with clean error should fail' '
> +test_expect_success PERL 'process filter should restart after unexpected write failure' '

So these two are sort of complimentary.  When `process` is required,
then it should fail if it cannot filter some file.  If it is not,
it should keep processing other files.

> +test_expect_success PERL 'process filter should not restart after intentionally rejected file' '

Uh... all right, so "reject" means that filter cannot continue?
Strange meaning for 'reject', though ;-)

>  test_done
> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
> new file mode 100755
> index 0000000..cb0925d
> --- /dev/null
> +++ b/t/t0021/rot13-filter.pl
> @@ -0,0 +1,177 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git filter protocol version 2
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +# The script takes the list of supported protocol capabilities as
> +# arguments ("stream", "clean", and "smudge" are supported).

What about "shutdown"?

> +#
> +# This implementation supports three special test cases:
> +# (1) If data with the filename "clean-write-fail.r" is processed with
> +#     a "clean" operation then the write operation will die.
> +# (2) If data with the filename "smudge-write-fail.r" is processed with
> +#     a "smudge" operation then the write operation will die.

All right, so it is hard failure with filter script dying.

> +# (3) If data with the filename "failure.r" is processed with any
> +#     operation then the filter signals that the operation was not
> +#     successful.

All right, so it is failure detected by filter script and signalled to Git.

> +#
> +
> +use strict;
> +use warnings;

So no more "use autodie", because of compatibility with old Perls.

> +
> +my $MAX_PACKET_CONTENT_SIZE = 65516;
> +my @capabilities            = @ARGV;

No autoflush this time?

> +
> +sub rot13 {
> +    my ($str) = @_;
> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
> +    return $str;
> +}
> +
> +sub packet_read {
> +    my $buffer;
> +    my $bytes_read = read STDIN, $buffer, 4;
> +    if ( $bytes_read == 0 ) {
> +        return;
> +    }
> +    elsif ( $bytes_read != 4 ) {
> +        die "invalid packet size '$bytes_read' field";
> +    }
> +    my $pkt_size = hex($buffer);
> +    if ( $pkt_size == 0 ) {
> +        return ( 1, "" );

Unusual return convention.  Though it is a test script, so
it doesn't matter much.

> +    }
> +    elsif ( $pkt_size > 4 ) {
> +        my $content_size = $pkt_size - 4;
> +        $bytes_read = read STDIN, $buffer, $content_size;
> +        if ( $bytes_read != $content_size ) {
> +            die "invalid packet";

More detailed error message, maybe?

> +        }
> +        return ( 0, $buffer );
> +    }
> +    else {
> +        die "invalid packet size";
> +    }
> +}
> +
> +sub packet_write {
> +    my ($packet) = @_;
> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
> +    print STDOUT $packet;
> +    STDOUT->flush();
> +}
> +
> +sub packet_flush {
> +    print STDOUT sprintf( "%04x", 0 );
> +    STDOUT->flush();
> +}
> +
> +open my $debug, ">>", "rot13-filter.log";
> +print $debug "start\n";
> +$debug->flush();
> +
> +packet_write("git-filter-protocol\n");
> +packet_write("version 2\n");
> +packet_write( "capabilities " . join( ' ', @capabilities ) . "\n" );
> +print $debug "wrote filter header\n";
> +$debug->flush();
> +
> +while (1) {
> +    my $command = packet_read();
> +    unless ( defined($command) ) {
> +        exit();
> +    }
> +    chomp $command;
> +    print $debug "IN: $command";
> +    $debug->flush();
> +
> +    if ( $command eq "shutdown" ) {
> +        print $debug " -- [OK]";
> +        $debug->flush();
> +        packet_write("done\n");
> +        exit();
> +    }
> +
> +    my ($filename) = packet_read() =~ /filename=([^=]+)\n/;
> +    print $debug " $filename";
> +    $debug->flush();
> +    my ($filelen) = packet_read() =~ /size=([^=]+)\n/;
> +    chomp $filelen;

I think this chomp is not needed, as "\n" is not included.
Though the regexp should probably be anchored.

> +    print $debug " $filelen";
> +    $debug->flush();
> +
> +    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
> +    my $output;
> +
> +    if ( $filelen > 0 ) {

So here is a special case for $filelen = 0.
Negative $filelen is not allowed, via regexp.

> +        my $input = "";
> +        {
> +            binmode(STDIN);
> +            my $buffer;
> +            my $done = 0;
> +            while ( !$done ) {
> +                ( $done, $buffer ) = packet_read();
> +                $input .= $buffer;
> +            }
> +            print $debug " [OK] -- ";
> +            $debug->flush();
> +        }
> +
> +        if ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
> +            $output = rot13($input);
> +        }
> +        elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
> +            $output = rot13($input);
> +        }

These two conditionals could be shortened, but then they would be less
readable.  Or not:

           if ( grep { $_ eq $command } @capabilities ) {
           	$output = rot13($input);
           }

> +        else {
> +            die "bad command $command";
> +        }
> +    }
> +
> +    my $output_len = length($output);
> +    if ( $filename eq "reject.r" ) {
> +        $output_len = 0;
> +    }
> +
> +    if ( grep( /^stream$/, @capabilities ) ) {
> +        print $debug "OUT: STREAM ";
> +    }
> +    else {
> +        packet_write("size=$output_len\n");
> +        print $debug "OUT: $output_len ";
> +    }
> +    $debug->flush();
> +
> +    if ( $filename eq "reject.r" ) {
> +        packet_write("reject\n");
> +        print $debug "[REJECT]\n";    # Could also be an error

How if could be an error?

> +        $debug->flush();
> +    }
> +
> +    if ( $output_len > 0 ) {
> +        if (( $command eq "clean" and $filename eq "clean-write-fail.r" )
> +            or
> +            ( $command eq "smudge" and $filename eq "smudge-write-fail.r" ))

Perhaps simply:

  +        if ( $filename eq "${command}-write-fail.r" ) {

> +        {
> +            print $debug "[WRITE FAIL]\n";
> +            $debug->flush();
> +            die "write error";
> +        }
> +        else {
> +            while ( length($output) > 0 ) {
> +                my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
> +                packet_write($packet);
> +                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
> +                    $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
> +                }
> +                else {
> +                    $output = "";
> +                }
> +            }
> +            packet_flush();
> +            packet_write("success\n");
> +            print $debug "[OK]\n";
> +            $debug->flush();
> +        }
> +    }
> +}
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-31 19:49         ` Lars Schneider
@ 2016-07-31 22:59           ` Jakub Narębski
  0 siblings, 0 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-07-31 22:59 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King

W dniu 31.07.2016 o 21:49, Lars Schneider pisze: 
> On 31 Jul 2016, at 11:42, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 31.07.2016 o 00:05, Jakub Narębski pisze:
>>> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
[...]
>>> I think it would be nice to have here at least summary of the benchmarks
>>> you did in https://github.com/github/git-lfs/pull/1382

This would be nice to have in the commit message: real benchmarks.

>>
>> Note that this feature is especially useful if startup time is long,
>> that is if you are using an operating system with costly fork / new process
>> startup time like MS Windows (which you have mentioned), or writing
>> filter in a programming language with large startup time like Java
>> or Python (the latter may have changed since).
>>
>>  https://gnustavo.wordpress.com/2012/06/28/programming-languages-start-up-times/
> 
> OK, I will add this. Is it OK to add the link to the commit message?
> (since I don't know how long the link will be available).

I don't think it is needed.  Perhaps only a sentence or half to notify
where you could get most from this feature, but even then it is not
necessary.

I'm sorry for the confusion.

>> See below for proposal with two places to signal errors: before sending
>> first byte, and after.
> 
> Right now the protocol is implemented covering the following cases:
> 
> ## CASE 1 - no stream success

It is less "stream", more "size unknown".  Real streaming is interleaving
reading and writing, which is currently not supported due to lack of
start_async() - I think.

> 
> packet:          git< size=57\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< success\n

Right.  What happens if either length(SMUDGED_CONTENT) < size,
or length(SMUDGED_CONTENT) > size?  It could conceivably happen,
e.g. due to an error in size calculation.

NOTE that without using flush packet to signal end of contents,
we would be not able to signal a situation when filter encounters
an error (per-file, or long temporary) when it have written some
content already.  For example this may happen for git-LFS filter,
if the server hosting artifactory (or even whole network) gets
down during cleanup / smudging.

Well, unless we would use other special packets:
 - empty packet, that is "0004" pkt-line
 - invalid packet, that is "0001", "0002", "0003" pkt-line
to signal premature end of SMUDGED_CONTENT.

> 
> 
> ## CASE 2 - no stream success but 0 byte response
> 
> packet:          git< size=0\n
> packet:          git< success\n

Why there is need to special case 0 byte (empty file) response?

  packet:          git< size=0\n
  packet:          git< 0000
  packet:          git< success\n

is perfectly fine.
  
> ## CASE 3 - no stream filter; filter doesn't want to process the file
> 
> packet:          git< size=0\n
> packet:          git< reject\n

Why not simply
 
  packet:          git< reject\n

Or, if we are going success/reject/whatever route

  packet:          git< size=0\n
  packet:          git< 0000
  packet:          git< reject\n

> ## CASE 4 - no stream filter; filter error
> 
> packet:          git< size=57\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< error\n
> 
> CASE 4 is not explicitly checked. If a final message is neither
> "success" nor "reject" then it is interpreted as error. If that
> happens then Git will shutdown and restart the filter process
> if there is another file to filter. 

This should be documented.

> 
> Alternatively a filter process can shutdown itself, too, to signal
> an error.
> 
> The corresponding stream filter look like this:
> 
> ## CASE 1 - stream success
> 
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< success\n
> 
> 
> ## CASE 2 - stream success but 0 byte response
> 
> packet:          git< 0000
> packet:          git< success\n
> 
> 
> ## CASE 3 - stream filter; filter doesn't want to process the file
> 
> packet:          git< 0000
> packet:          git< reject\n
> 
> 
> ## CASE 4 - stream filter; filter error
> 
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< error\n
> 
> --
> 
> I just realized that the size 0 case is a bit inconsistent
> in the no stream case as it has no flush packet. Maybe I 
> should indeed remove the flush packet in the no stream case
> completely?!

That's what I wrote about SPOT (single point of truth), of using
either size or flush packet, but not both.  But...

As I wrote, you need some mechanism to signal premature end
of contents, and start of an error description.

> 
> Do the cases above make sense to you?

Except for the inconsistency of the size 0 case.  This what
I meant to say.

> 
> Regarding error handling. I would prefer it if the filter prints
> all errors to STDERR by itself. I think that is the safest
> option to communicate errors to the users because if the communication
> got into a bad state then Git might not be able to read the errors
> properly.
> 
> See Peff's response on the topic, too:
> http://public-inbox.org/git/20160729165018.GA6553%40sigill.intra.peff.net/

Actually it looks like Peff is slightly against using stderr.

JK> Git-LFS sends to stderr because there's no other option. I wonder if it
JK> would be nicer to make it Git's responsibility to talk to the user,
JK> because then it could respect things like "--quiet". I guess error
JK> messages are generally printed regardless of verbosity, though, so
JK> printing them unconditionally is OK.

I think it should be O.K., and it makes writing filter drivers
simpler if we don't have multiplex channels.

>> NOTE: there is a bit of mixed and possibly confusing notation, that
>> is 0000 is flush packet, not packet with 0000 as content.  Perhaps
>> write pkt-line in full?
> 
> I am not sure I understand what you mean (maybe it's too late for me...).
> Can you try to rephrase or give an example?

Compare

  packet:          git< 0000

with

  packet:          git< success\n

The former as pkt-line is

  git< 0000

the latter is

  git< 000csuccess\n
       ^^^^
           \-- packet header

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/10] run-command: add clean_on_exit_handler
  2016-07-30  9:50     ` Johannes Sixt
@ 2016-08-01 11:14       ` Lars Schneider
  2016-08-02  5:53         ` Johannes Sixt
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 11:14 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright, e, peff


> On 30 Jul 2016, at 11:50, Johannes Sixt <j6t@kdbg.org> wrote:
> 
> Am 30.07.2016 um 01:37 schrieb larsxschneider@gmail.com:
>> Some commands might need to perform cleanup tasks on exit. Let's give
>> them an interface for doing this.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> run-command.c | 12 ++++++++----
>> run-command.h |  1 +
>> 2 files changed, 9 insertions(+), 4 deletions(-)
>> 
>> diff --git a/run-command.c b/run-command.c
>> index 33bc63a..197b534 100644
>> --- a/run-command.c
>> +++ b/run-command.c
>> @@ -21,6 +21,7 @@ void child_process_clear(struct child_process *child)
>> 
>> struct child_to_clean {
>> 	pid_t pid;
>> +	void (*clean_on_exit_handler)(pid_t);
>> 	struct child_to_clean *next;
>> };
>> static struct child_to_clean *children_to_clean;
>> @@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
>> {
>> 	while (children_to_clean) {
>> 		struct child_to_clean *p = children_to_clean;
>> +		if (p->clean_on_exit_handler)
>> +			p->clean_on_exit_handler(p->pid);
> 
> This summons demons. cleanup_children() is invoked from a signal handler. In this case, it can call only async-signal-safe functions. It does not look like the handler that you are going to install later will take note of this caveat!
> 
>> 		children_to_clean = p->next;
>> 		kill(p->pid, sig);
>> 		if (!in_signal)
> 
> The condition that we see here in the context protects free(p) (which is not async-signal-safe). Perhaps the invocation of the new callback should be skipped in the same manner when this is called from a signal handler? 507d7804 (pager: don't use unsafe functions in signal handlers) may be worth a look.

Thanks a lot of pointing this out to me!

Do I get it right that after the signal "SIGTERM" I can do a cleanup and don't 
need to worry about any function calls but if I get any other signal then I can 
only perform async-signal-safe calls?

If this is correct, then the following solution would work great:

		if (!in_signal && p->clean_on_exit_handler)
			p->clean_on_exit_handler(p->pid);

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 01/10] pkt-line: extract set_packet_header()
  2016-07-30 10:30     ` Jakub Narębski
@ 2016-08-01 11:33       ` Lars Schneider
  2016-08-03 20:05         ` Jakub Narębski
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 11:33 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git, gitster, tboegi, mlbright, e, peff


> On 30 Jul 2016, at 12:30, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> set_packet_header() converts an integer to a 4 byte hex string. Make
>> this function locally available so that other pkt-line functions can
>> use it.
> 
> This description is not that clear that set_packet_header() is a new
> function.  Perhaps something like the following
> 
>  Extract the part of format_packet() that converts an integer to a 4 byte
>  hex string into set_packet_header().  Make this new function ...
> 
> I also wonder if the part "Make this [new] function locally available..."
> is needed; we need to justify exports, but I think we don't need to
> justify limiting it to a module.  If you want to justify that it is
> "static", perhaps it would be better to say why not to export it.
> 
> Anyway, I think it is worthy refactoring (and compiler should be
> able to inline it, so there are no nano-performance considerations).
> 
> Good work!

Thank you! I would go with this then:

Extract the part of format_packet() that converts an integer to a 4 byte
hex string into set_packet_header().

OK?


>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 15 ++++++++++-----
>> 1 file changed, 10 insertions(+), 5 deletions(-)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index 62fdb37..445b8e1 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -98,9 +98,17 @@ void packet_buf_flush(struct strbuf *buf)
>> }
>> 
>> #define hex(a) (hexchar[(a) & 15])
> 
> I guess that this is inherited from the original, but this preprocessor
> macro is local to the format_header() / set_packet_header() function,
> and would not work outside it.  Therefore I think we should #undef it
> after set_packet_header(), just in case somebody mistakes it for
> a generic hex() function.  Perhaps even put it inside set_packet_header(),
> together with #undef.
> 
> But I might be mistaken... let's check... no, it isn't used outside it.

Agreed. Would that be OK?

static void set_packet_header(char *buf, const int size)
{
	static char hexchar[] = "0123456789abcdef";
	#define hex(a) (hexchar[(a) & 15])
	buf[0] = hex(size >> 12);
	buf[1] = hex(size >> 8);
	buf[2] = hex(size >> 4);
	buf[3] = hex(size);
	#undef hex
}

- Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-07-30 10:49     ` Jakub Narębski
@ 2016-08-01 12:00       ` Lars Schneider
  2016-08-03 20:12         ` Jakub Narębski
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 12:00 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, tboegi, mlbright, Eric Wong,
	Jeff King


> On 30 Jul 2016, at 12:49, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Sometimes pkt-line data is already available in a buffer and it would
>> be a waste of resources to write the packet using packet_write() which
>> would copy the existing buffer into a strbuf before writing it.
>> 
>> If the caller has control over the buffer creation then the
>> PKTLINE_DATA_START macro can be used to skip the header and write
>> directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
>> would be the maximum). direct_packet_write() would take this buffer,
>> adjust the pkt-line header and write it.
>> 
>> If the caller has no control over the buffer creation then
>> direct_packet_write_data() can be used. This function creates a pkt-line
>> header. Afterwards the header and the data buffer are written using two
>> consecutive write calls.
> 
> I don't quite understand what do you mean by "caller has control
> over the buffer creation".  Do you mean that caller either can write
> over the buffer, or cannot overwrite the buffer?  Or do you mean that
> caller either can allocate buffer to hold header, or is getting
> only the data?

How about this:

[...]

If the caller creates the buffer then a proper pkt-line buffer with header
and data section can be created. The PKTLINE_DATA_START macro can be used 
to skip the header section and write directly to the data section (PKTLINE_DATA_LEN 
bytes would be the maximum). direct_packet_write() would take this buffer, 
fill the pkt-line header section with the appropriate data length value and 
write the entire buffer.

If the caller does not create the buffer, and consequently cannot leave room
for the pkt-line header, then direct_packet_write_data() can be used. This 
function creates an extra buffer for the pkt-line header and afterwards writes
the header buffer and the data buffer with two consecutive write calls.

---
Is that more clear?

> 
>> 
>> Both functions have a gentle parameter that indicates if Git should die
>> in case of a write error (gentle set to 0) or return with a error (gentle
>> set to 1).
> 
> So they are *_maybe_gently(), isn't it ;-)?  Are there any existing
> functions in Git codebase that take 'gently' / 'strict' / 'die_on_error'
> parameter?

Yes, git grep "gentle" reveals:

wrapper.c:static int memory_limit_check(size_t size, int gentle)
object.c:int type_from_string_gently(const char *str, ssize_t len, int gentle)


>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 30 ++++++++++++++++++++++++++++++
>> pkt-line.h |  5 +++++
>> 2 files changed, 35 insertions(+)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index 445b8e1..6fae508 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -135,6 +135,36 @@ void packet_write(int fd, const char *fmt, ...)
>> 	write_or_die(fd, buf.buf, buf.len);
>> }
>> 
>> +int direct_packet_write(int fd, char *buf, size_t size, int gentle)
>> +{
>> +	int ret = 0;
>> +	packet_trace(buf + 4, size - 4, 1);
>> +	set_packet_header(buf, size);
>> +	if (gentle)
>> +		ret = !write_or_whine_pipe(fd, buf, size, "pkt-line");
>> +	else
>> +		write_or_die(fd, buf, size);
> 
> Hmmm... in gently case we get the information in the warning that
> it is about "pkt-line", which is missing from !gently case.  But
> it is probably not important.
> 
>> +	return ret;
>> +}
> 
> Nice clean function, thanks to extracting set_packet_header().
> 
>> +
>> +int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle)
> 
> I would name the parameter 'data', rather than 'buf'; IMVHO it
> better describes it.

Agreed!

> 
>> +{
>> +	int ret = 0;
>> +	char hdr[4];
>> +	set_packet_header(hdr, sizeof(hdr) + size);
>> +	packet_trace(buf, size, 1);
>> +	if (gentle) {
>> +		ret = (
>> +			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
> 
> You can write '4' here, no need for sizeof(hdr)... though compiler would
> optimize it away.

Right, it would be optimized. However, I don't like the 4 there either. OK to use a macro
instead? PKTLINE_HEADER_LEN ?


>> +			!write_or_whine_pipe(fd, buf, size, "pkt-line data")
>> +		);
> 
> Do we want to try to write "pkt-line data" if "pkt-line header" failed?
> If not, perhaps De Morgan-ize it
> 
>  +		ret = !(
>  +			write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") &&
>  +			write_or_whine_pipe(fd, buf, size, "pkt-line data")
>  +		);


Original:
		ret = (
			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
			!write_or_whine_pipe(fd, data, size, "pkt-line data")
		);

Well, if the first write call fails (return == 0), then it is negated and evaluates to true.
I would think the second call is not evaluated, then?!

CPP reference:
"For the built-in logical OR operator, the result is true if either the first or the second 
operand (or both) is true. If the firstoperand is true, the second operand is not evaluated."
http://en.cppreference.com/w/cpp/language/operator_logical

Should I make this more explicit with a if clause?
 

>> +	} else {
>> +		write_or_die(fd, hdr, sizeof(hdr));
>> +		write_or_die(fd, buf, size);
> 
> I guess these two writes (here and in 'gently' case) are unavoidable...

I think so, too.


> 
>> +	}
>> +	return ret;
>> +}
>> +
>> void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>> {
>> 	va_list args;
>> diff --git a/pkt-line.h b/pkt-line.h
>> index 3cb9d91..02dcced 100644
>> --- a/pkt-line.h
>> +++ b/pkt-line.h
>> @@ -23,6 +23,8 @@ void packet_flush(int fd);
>> void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>> void packet_buf_flush(struct strbuf *buf);
>> void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>> +int direct_packet_write(int fd, char *buf, size_t size, int gentle);
>> +int direct_packet_write_data(int fd, const char *buf, size_t size, int gentle);
>> 
>> /*
>>  * Read a packetized line into the buffer, which must be at least size bytes
>> @@ -77,6 +79,9 @@ char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
>> 
>> #define DEFAULT_PACKET_MAX 1000
>> #define LARGE_PACKET_MAX 65520
>> +#define PKTLINE_HEADER_LEN 4
>> +#define PKTLINE_DATA_START(pkt) ((pkt) + PKTLINE_HEADER_LEN)
>> +#define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - PKTLINE_HEADER_LEN)
> 
> Those are not used in direct_packet_write() and direct_packet_write_data();
> but they would make them more verbose and less readable.

Good point, I should use them to check for the maximal packet length!

Thanks,
Lars

> 
>> extern char packet_buffer[LARGE_PACKET_MAX];
>> 
>> #endif
>> 
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send
  2016-07-30 12:29     ` Jakub Narębski
@ 2016-08-01 12:18       ` Lars Schneider
  2016-08-03 20:15         ` Jakub Narębski
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 12:18 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git, gitster, tboegi, mlbright, e, peff


> On 30 Jul 2016, at 14:29, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> The packet_trace() call is not ideal in format_packet() as we would print
> 
> Style; I think the following is more readable:
> 
>  The packet_trace() call in format_packet() is not ideal, as we would...

Agreed!


>> a trace when a packet is formatted and (potentially) when the packet is
>> actually send. This was no problem up until now because format_packet()
>> was only used by one function. Fix it by moving the trace call into the
>> function that actally sends the packet.
> 
> s/actally/actually/

Thanks!


> I don't buy this explanation.  If you want to trace packets, you might
> do it on input (when formatting packet), or on output (when writing
> packet).  It's when there are more than one formatting function, but
> one writing function, then placing trace call in write function means
> less code duplication; and of course the reverse.
> 
> Another issue is that something may happen between formatting packet
> and sending it, and we probably want to packet_trace() when packet
> is actually send.
> 
> Neither of those is visible in commit message.

The packet_trace() call in format_packet() is not ideal, as we would print
a trace when a packet is formatted and (potentially) when the same packet is
actually written. This was no problem up until now because packet_write(),
the function that uses format_packet() and writes the formatted packet,
did not trace the packet.

This developer believes that trace calls should only happen when a packet
is actually written as the packet could be modified between formatting
and writing. Therefore the trace call was moved from format_packet() to 
packet_write().

--

Better?

> 
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index 1728690..32c0a34 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -126,7 +126,6 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
>> 		die("protocol error: impossibly long line");
>> 
>> 	set_packet_header(&out->buf[orig_len], n);
>> -	packet_trace(out->buf + orig_len + 4, n - 4, 1);
>> }
>> 
>> void packet_write(int fd, const char *fmt, ...)
>> @@ -138,6 +137,7 @@ void packet_write(int fd, const char *fmt, ...)
>> 	va_start(args, fmt);
>> 	format_packet(&buf, fmt, args);
>> 	va_end(args);
>> +	packet_trace(buf.buf + 4, buf.len - 4, 1);
>> 	write_or_die(fd, buf.buf, buf.len);
>> }
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size
  2016-07-30 13:58     ` Jakub Narębski
@ 2016-08-01 12:23       ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 12:23 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git, gitster, tboegi, mlbright, e, peff


> On 30 Jul 2016, at 15:58, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> According to LARGE_PACKET_MAX in pkt-line.h the maximal lenght of a
>> pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
>> therefore the pkt-line data component must not exceed 65516 bytes.
> 
> s/lenght/length/

Thanks!


> Is it maximum length of pkt-line packet, or maximum length of data
> that can be send in a packet?

65520 is the maximum length of a pkt-line.


> With 4 hex digits, maximal length if pkt-line packet (together
> with length) is ffff_16, that is 2^16-1 = 65535.  Where does the
> number 65520 comes from?

Historic reasons, I guess? However, it won't be changed. See response
from Peff here:
http://public-inbox.org/git/20160726134257.GB19277%40sigill.intra.peff.net/


> 
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> Documentation/technical/protocol-common.txt | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
>> index bf30167..ecedb34 100644
>> --- a/Documentation/technical/protocol-common.txt
>> +++ b/Documentation/technical/protocol-common.txt
>> @@ -67,9 +67,9 @@ with non-binary data the same whether or not they contain the trailing
>> LF (stripping the LF if present, and not complaining when it is
>> missing).
>> 
>> -The maximum length of a pkt-line's data component is 65520 bytes.
>> -Implementations MUST NOT send pkt-line whose length exceeds 65524
>> -(65520 bytes of payload + 4 bytes of length data).
>> +The maximum length of a pkt-line's data component is 65516 bytes.
>> +Implementations MUST NOT send pkt-line whose length exceeds 65520
>> +(65516 bytes of payload + 4 bytes of length data).
>> 
>> Implementations SHOULD NOT send an empty pkt-line ("0004").
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-30 12:04     ` Jakub Narębski
@ 2016-08-01 12:28       ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 12:28 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git Mailing List, gitster, tboegi, mlbright, e, peff


> On 30 Jul 2016, at 14:04, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_flush() would die in case of a write error even though for some callers
>> an error would be acceptable. Add packet_flush_gentle() which writes a pkt-line
>> flush packet and returns `0` for success and `1` for failure.
> 
> I think it should be packet_flush_gently(), as in "to flush gently",
> but this is only my opinion; I have not checked the naming rules and
> practices for the rest of Git codebase.

Agreed. This would match:

object.c:int type_from_string_gently(const char *str, ssize_t len, int gentle)

Thanks,
Lars

> 
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-30 22:05     ` Jakub Narębski
  2016-07-31  9:42       ` Jakub Narębski
@ 2016-08-01 13:32       ` Lars Schneider
  2016-08-03 18:30         ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Jakub Narębski
  2016-08-03 22:47         ` [PATCH v3 10/10] convert: add filter.<driver>.process option Jakub Narębski
  1 sibling, 2 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 13:32 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 31 Jul 2016, at 00:05, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>> 
>> This patch adds the filter.<driver>.process string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the following packet format (pkt-line) based protocol over standard input
>> and standard output.
> 
> I think it would be nice to have here at least summary of the benchmarks
> you did in https://github.com/github/git-lfs/pull/1382

OK.


>> Git starts the filter on first usage and expects a welcome
>> message, protocol version number, and filter capabilities
>> separated by spaces:
>> ------------------------
>> packet:          git< git-filter-protocol\n
>> packet:          git< version 2\n
>> packet:          git< capabilities clean smudge\n
> 
> Sorry for going back and forth, but now I think that 'capabilities' are
> not really needed here, though they are in line with "version" in
> the second packet / line, namely "version 2".  If it does not make
> parsing more difficult...

I don't understand what you mean with "they are not really needed"?
The field is necessary to understand the protocol, no?

In the last roll I added the "key=value" format to the protocol upon
yours and Peff's suggestion. Would it be OK to change the startup
sequence accordingly?

packet:          git< version=2\n
packet:          git< capabilities=clean smudge\n


>> ------------------------
>> Supported filter capabilities are "clean", "smudge", "stream",
>> and "shutdown".
> 
> I'd rather put "stream" and "shutdown" capabilities into separate
> patches, for easier review.

I agree with "shutdown". I think I would like to remove the "stream"
option and make it the default for the following reasons:

(1) As you mentioned elsewhere, "stream" is not really streaming at this
point because we don't read/write in parallel.

(2) Junio and you pointed out that if we transmit size and flush packet
then we have redundancy in the protocol.

(3) With the newly introduced "success"/"reject"/"failure" packet at the 
end of a filter operation, a filter process has a way to signal Git that
something went wrong. Initially I had the idea that a filter process just
stops writing and Git would detect the mismatch between expected bytes
and received bytes. But the final status packet is a much clearer solution.

(4) Maintaining two slightly different protocols is a waste of resources 
and only increases the size of this (already large) patch.

My only argument for the size packet was that this allows efficient buffer
allocation. However, in non of my benchmarks this was actually a problem.
Therefore this is probably a epsilon optimization and should be removed.

OK with everyone?


>> Afterwards Git sends a command (based on the supported
>> capabilities), the filename including its path
>> relative to the repository root, the content size as ASCII number
>> in bytes, the content split in zero or many pkt-line packets,
>> and a flush packet at the end:
> 
> I guess the following is the most basic example, with mode detailed
> description left for the documentation.
> 
>> ------------------------
>> packet:          git> smudge\n
>> packet:          git> filename=path/testfile.dat\n
>> packet:          git> size=7\n
> 
> So I see you went with "<variable>=<value>" idea, rather than "<value>"
> (with <variable> defined by position in a sequence of 'header' packets),
> or "<variable> <value>..." that introductory header uses.

The implementation still requires the exact sequence of the packets.
However, we could make this more flexible in a later patch with the
"key=value" formatting.

> 
>> packet:          git> CONTENT
>> packet:          git> 0000
>> ------------------------
>> 
>> The filter is expected to respond with the result content size as
>> ASCII number in bytes. If the capability "stream" is defined then
>> the filter must not send the content size. Afterwards the result
>> content in send in zero or many pkt-line packets and a flush packet
>> at the end. 
> 
> If it does not cost filter anything, it could send size upfront
> (based on size of original, or based on external data), even if
> it is prepared for streaming.
> 
> In the opposite case, where filter cannot stream because it requires
> whole contents upfront (e.g. to calculate hash of the contents, or
> to do operation that needs whole file like sorting or reversing lines),
> it should always be able to calculate the size... or not.  For
> example 'sort | uniq' filter needs whole input upfront for sort,
> but it does not know how many lines will be in output without doing
> the 'uniq' part.
> 
> So I think the ability of filter to provide size (or size hint) of
> its output should be decoupled from streaming support.

AS mentioned above, I would like to remove the size packet completely
to simplify this patch. If there is really a need for such a packet
then we could add it later (given the flexible "key=value" format of the
protocol).


>>            Finally a "success" packet is send to indicate that
>> everything went well.
> 
> That's a nice addition, and probably a necessary one, to the stream
> protocol.  Git must know and consume it - we wouldn't be able to
> retrofit it later.
> 
>> ------------------------
>> packet:          git< size=57\n   (omitted with capability "stream")
> 
> I was thinking about having possible responses to receiving file
> contents (or starting receiving in the streaming case) to be:
> 
>  packet:          git< ok size=7\n    (or "ok 7\n", if size is known)
> 
> or
> 
>  packet:          git< ok\n           (if filter does not know size upfront)
> 
> or
> 
>  packet:          git< fail <msg>\n   (or just "fail" + packet with msg)
> 
> The last would be when filter knows upfront that it cannot perform
> the operation.  Though sending an empty file with non-"success" final
> would work as well.
> 
> For example LFS filter (that is configured as not required) may refuse
> to store files which are smaller than some pre-defined constant threshold.

Discussed in http://public-inbox.org/git/7255ef06-a9a0-91b7-b6da-a90322de926b%40gmail.com/

> 
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< success\n
>> ------------------------
>> 
>> In case the filter cannot process the content, it is expected
>> to respond with the result content size 0 (only if "stream" is
>> not defined) and a "reject" packet.
>> ------------------------
>> packet:          git< size=0\n    (omitted with capability "stream")
>> packet:          git< reject\n
>> ------------------------
> 
> This is *wrong* idea!  Empty file, with size=0, can be a perfectly
> legitimate response.

Discussed in http://public-inbox.org/git/7255ef06-a9a0-91b7-b6da-a90322de926b%40gmail.com/


> For example rot13 filter should respond to an empty file on input
> with an empty file on output.  LFS-like filters and encryption
> mechanism should return empty file on fetch / decryption
> if such empty file was stored / encrypted.
> 
> A strange LFS could even use filenames (with files being empty
> themselves) as a lookup key for artifactory.  For example a kind
> of CDN for common libraries, with version embedded in filename,
> like 'libs/jquery-1.9.0.min.js', etc.

Right, that would be possible with the current implementation. I will
add an empty file to the test case to prove it.


>> After the filter has processed a blob it is expected to wait for
>> the next command. A demo implementation can be found in
>> `t/t0021/rot13-filter.pl` located in the Git core repository.
> 
> If filter does not support "shutdown" capability (or if said
> capability is postponed for later patch), it should behave sanely
> when Git command reaps it (SIGTERM + wait + SIGKILL?, SIGCHLD?).

How would you do this? Don't you think the current solution is
good enough for processes that don't need a proper shutdown?


>> 
>> If the filter supports the "shutdown" capability then Git will
>> send the "shutdown" command and wait until the filter answers
>> with "done". This gives the filter the opportunity to perform
>> cleanup tasks. Afterwards the filter is expected to exit.
>> ------------------------
>> packet:          git> shutdown\n
>> packet:          git< done\n
>> ------------------------
> 
> I guess there is no timeout mechanism: if filter hangs on shutdown,
> then git command would also hang waiting for signal to exit.

Correct. Even if we implement a timeout - what time would we wait?
I think this is still the best option for now.


>> If a filter.<driver>.clean or filter.<driver>.smudge command
>> is configured then these commands always take precedence over
>> a configured filter.<driver>.process command.
> 
> Note: the value of `clean`, `smudge` and `process` is a command,
> not just a string.

OK

> I wonder if it would be worth it to explain the reasoning behind
> this solution and show alternate ones.
> 
> * Using a separate variable to signal that filters are invoked
>   per-command rather than per-file, and use pkt-line interface,
>   like boolean-valued `useProtocol`, or `protocolVersion` set
>   to '2' or 'v2', or `persistence` set to 'per-command', there
>   is high risk of user's trying to use exiting one-shot per-file
>   filters... and Git hanging.
> 
> * Using new variables for each capability, e.g. `processSmudge`
>   and `processClean` would lead to explosion of variable names;
>   I think.
> 
> * Current solution of using `process` in addition to `clean`
>   and `smudge` clearly says that you need to use different
>   command for per-file (`clean` and `smudge`), and per-command
>   filter, while allowing to use them together.
> 
>   The possible disadvantage is Git command starting `process`
>   filter, only to see that it doesn't offer required capability,
>   for example offering only "clean" but not "smudge".  There
>   is simple workaround - set `smudge` variable (same as not
>   present capability) to empty string.

If you think it is necessary to have this discussion in the
commit message, then I will add it.


>> Please note that you cannot use an existing filter.<driver>.clean
>> or filter.<driver>.smudge command as filter.<driver>.process
>> command. As soon as Git would detect a file that needs to be
>> processed by this filter, it would stop responding.
> 
> I think this needs to be in the documentation (I have not checked
> yet if it is), but is not needed in the already long commit message.

OK


>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
>> ---
>> Documentation/gitattributes.txt |  84 ++++++++-
>> convert.c                       | 400 +++++++++++++++++++++++++++++++++++++--
>> t/t0021-conversion.sh           | 405 ++++++++++++++++++++++++++++++++++++++++
>> t/t0021/rot13-filter.pl         | 177 ++++++++++++++++++
>> 4 files changed, 1053 insertions(+), 13 deletions(-)
>> create mode 100755 t/t0021/rot13-filter.pl
>> 
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index 8882a3e..e3fbcc2 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate. If a long running filter process (see section
>> +below) is used then Git can process all blobs with a single filter
>> +invocation for the entire life of a single Git command (e.g.
>> +`git add .`).
> 
> Proposed improvement:
> 
>                       If a long running `process` filter is used
>   in place of `clean` and/or `smudge` filters, then Git can process
>   all blobs with a single filter command invocation for the entire
>   life of a single Git command, for example `git add --all`.  See
>   section below for the description of the protocol used to
>   communicate with a `process` filter.

Sounds good. I will use this!


>> One use of the content filtering is to massage the content into a shape
>> that is more convenient for the platform, filesystem, and the user to use.
>> @@ -375,6 +379,84 @@ substitution.  For example:
>> ------------------------
>> 
>> 
>> +Long Running Filter Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +If the filter command (string value) is defined via
> 
> This is no mere string value, this is command invocation (with its
> own rules, e.g. splitting parameters on whitespace, etc.).  Though
> I'm not sure how to say it succintly.  Maybe skip "(string value)"?
> But it is there for a reason...

How about: "If the filter command as string value is defined via ..."?


>> +filter.<driver>.process then Git can process all blobs with a
> 
> Shouldn't it be `filter.<driver>.process`?

OK, I will change it.


>> +single filter invocation for the entire life of a single Git
>> +command. This is achieved by using the following packet
>> +format (pkt-line, see protocol-common.txt) based protocol over
> 
> Can we linkgit-it (to technical documentation)?

I don't think that is possible because it was never done. See:
git grep "linkgit:tech"


>> +standard input and standard output.
>> +
>> +Git starts the filter on first usage and expects a welcome
> 
> Is "usage" here correct?  Perhaps it would be more readable
> to say that Git starts filter when encountering first file
> that needs cleaning or smudgeing.

OK. How about this:

Git starts the filter when it encounters the first file
that needs to be cleaned or smudged. After the filter started
Git expects a welcome message, protocol version number, and 
filter capabilities separated by spaces:


>> +message, protocol version number, and filter capabilities
>> +separated by spaces:
>> +------------------------
>> +packet:          git< git-filter-protocol\n
>> +packet:          git< version 2\n
>> +packet:          git< capabilities clean smudge\n
>> +------------------------
>> +Supported filter capabilities are "clean", "smudge", "stream",
>> +and "shutdown".
> 
> Filter should include at least one of "clean" and "smudge"
> capabilities (currently), otherwise it wouldn't do anything.

Well, I think that should be clear to the reader, no?


> I don't know if it is a good place to say that because of pkt-line
> recommendations about text-content packets, each of those should
> terminate in endline, with "\n" included in pkt-line length.
> 
>> +
>> +Afterwards Git sends a command (based on the supported
>> +capabilities),
> 
> I think it should be something like the following:
> 
>   If among filter `process` capabilities there is capability
>   that corresponds to the operation performed by a Git command
>   (that is, either "clean" or "smudge"), then Git would send,
>   in separate packets, a command (based on supported capabilites),
> 
> though it feels too "chatty" (and the sentence gets quite long).
> 
>>               the filename including its path
>> +relative to the repository root, 
> 
> Errr... "the filename including its path"? Wouldn't be it simpler
> to just say:
> 
>  the pathname of a file relative to the repository root,

Agreed


> Also, isn't it now "filename=<pathname>\n"?

You mean I should change filename to pathname? Agreed!


>>                                  the content size as ASCII number
>> +in bytes, 
> 
> Could Git not give the size, for example if fstat() fails? Do
> we reserve space for other information here?
> 
> Also, isn't it now "size=<bytes>\n"?

Size will go away as discussed in the beginning of this email.

> 
>>            the content split in zero or many pkt-line packets,
> 
> s/zero or many/zero or more/

Agreed!


>> +and a flush packet at the end:
> 
> I wonder if instead of long sentence, it would be more readable
> to use enumeration (ordered list) or itemize (unordered list).

Maybe, but I think that is good enough for now.


>> +------------------------
>> +packet:          git> smudge\n
>> +packet:          git> filename=path/testfile.dat\n
>> +packet:          git> size=7\n
>> +packet:          git> CONTENT
>> +packet:          git> 0000
>> +------------------------
>> +
>> +The filter is expected to respond with the result content size as
>> +ASCII number in bytes. If the capability "stream" is defined then
>> +the filter must not send the content size.
> 
> As I wrote earlier, I think sending or not the size of the output
> should be decoupled from the "stream" capability.
> 
> Streaming is IMVHO rather a capability of starting to send parts
> of response before the whole contents of input arrives.  I think
> per-file filters support that and that's what start_async() there
> is about.

Correct. However, size will go away as discussed in the beginning of 
this email. 


>>                                            Afterwards the result
>> +content in send in zero or many pkt-line packets and a flush packet
>> +at the end. Finally a "success" packet is send to indicate that
>> +everything went well.
> 
> I guess it is "success" packet if everything went well, and place
> for informing about errors in the future - filter is assumed to die
> if there are errors in filtering, isn't it?

Correct.


> That is, not "send to indicate", but "send if".

Agreed!


> 
>> +------------------------
>> +packet:          git< size=57\n   (omitted with capability "stream")
>> +packet:          git< SMUDGED_CONTENT
>> +packet:          git< 0000
>> +packet:          git< success\n
>> +------------------------
>> +
>> +In case the filter cannot process the content, it is expected
>> +to respond with the result content size 0 (only if "stream" is
>> +not defined) and a "reject" packet.
>> +------------------------
>> +packet:          git< size=0\n    (omitted with capability "stream")
>> +packet:          git< reject\n
>> +------------------------
> 
> I would assume that we have two error conditions.  
> 
> First situation is when the filter knows upfront (after receiving name
> and size of file, and after receiving contents for not-streaming filters)
> that it cannot process the file (like e.g. LFS filter with artifactory
> replica/shard being a bit behind master, and not including contents of
> the file being filtered).
> 
> My proposal is to reply with "fail" _in place of_ size of reply:
> 
>   packet:         git< fail\n       (any case: size known or not, stream or not)
> 
> It could be "reject", or "error" instead of "fail".
> 
> 
> Another situation is if filter encounters error during output,
> either with streaming filter (or non-stream, but not storing whole
> input upfront) realizing in the middle of output that there is something
> wrong with input (e.g. converting between encoding, and encountering
> character that cannot be represented in output encoding), or e.g. filter
> process being killed, or network connection dropping with LFS filter, etc.
> The filter has send some packets with output already.  In this case
> filter should flush, and send "reject" or "error" packet.
> 
>   <error condition>
>   packet:         git< "0000"       (flush packet)
>   packet:         git< reject\n
> 
> Should there be a place for an error message, or would standard error
> (stderr) be used for this?

Already discussed in http://public-inbox.org/git/6765D972-876A-4F94-A170-468002498296%40gmail.com/

I will add an example for the error case, too.


>> +
>> +After the filter has processed a blob it is expected to wait for
>> +the next command. A demo implementation can be found in
>> +`t/t0021/rot13-filter.pl` located in the Git core repository.
> 
> It is actually in Git sources.  Is it the best way to refer to
> such files?

Well, I could add a github.com link but I don't think everyone
would like that. What would you suggest?


>> +
>> +If the filter supports the "shutdown" capability then Git will
>> +send the "shutdown" command and wait until the filter answers
>> +with "done". This gives the filter the opportunity to perform
>> +cleanup tasks. Afterwards the filter is expected to exit.
>> +------------------------
>> +packet:          git> shutdown\n
>> +packet:          git< done\n
>> +------------------------
>> +
>> +If a filter.<driver>.clean or filter.<driver>.smudge command
>> +is configured then these commands always take precedence over
>> +a configured filter.<driver>.process command.
> 
> All right; this is quite clear.
> 
>> +
>> +Please note that you cannot use an existing filter.<driver>.clean
>> +or filter.<driver>.smudge command as filter.<driver>.process
>> +command. As soon as Git would detect a file that needs to be
>> +processed by this filter, it would stop responding.
> 
> This isn't.

Would that be better?


Please note that you cannot use an existing `filter.<driver>.clean`
or `filter.<driver>.smudge` command as `filter.<driver>.process`
command because the former two use a different inter process 
communication protocol than the latter one. As soon as Git would detect 
a file that needs to be processed by such an invalid "process" filter, 
it would wait for a proper protocol handshake and appear "hanging".


> 
> P.S. I will comment about the implementation part in the next email.

Sure! Thanks again for the extensive review,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-31 22:19     ` Jakub Narębski
@ 2016-08-01 17:55       ` Lars Schneider
  2016-08-04  0:42         ` Jakub Narębski
  2016-08-03 13:10       ` Lars Schneider
  1 sibling, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-01 17:55 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 01 Aug 2016, at 00:19, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
> [...]
>> +Please note that you cannot use an existing filter.<driver>.clean
>> +or filter.<driver>.smudge command as filter.<driver>.process
>> +command.
> 
> I think it would be more readable and easier to understand to write:
> 
>  ... you cannot use an existing ... command with
>  filter.<driver>.process
> 
> About the style: wouldn't `filter.<driver>.process` be better?

OK, changed it!


>>             As soon as Git would detect a file that needs to be
>> +processed by this filter, it would stop responding.
> 
> This is quite convoluted, and hard to understand.  I would say
> that because `clean` and `smudge` filters are expected to read
> first, while Git expects `process` filter to say first, using
> `clean` or `smudge` filter without changes as `process` filter
> would lead to git command deadlocking / hanging / stopping
> responding.

How about this:

Please note that you cannot use an existing `filter.<driver>.clean`
or `filter.<driver>.smudge` command with `filter.<driver>.process`
because the former two use a different inter process communication
protocol than the latter one. As soon as Git would detect a file
that needs to be processed by such an invalid "process" filter, 
it would wait for a proper protocol handshake and appear "hanging".


>> +
>> +
>> Interaction between checkin/checkout attributes
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 
>> diff --git a/convert.c b/convert.c
>> index 522e2c5..be6405c 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -3,6 +3,7 @@
>> #include "run-command.h"
>> #include "quote.h"
>> #include "sigchain.h"
>> +#include "pkt-line.h"
>> 
>> /*
>>  * convert.c - convert a file when checking it out and checking it in.
>> @@ -481,11 +482,355 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>> 	return ret;
>> }
>> 
>> +static int multi_packet_read(int fd_in, struct strbuf *sb, size_t expected_bytes, int is_stream)
> 
> About name of this function: `multi_packet_read` is fine, though I wonder
> if `packet_read_in_full` with nearly the same parameters as `packet_read`,
> or `packet_read_till_flush`, or `read_in_full_packetized` would be better.

I like `multi_packet_read` and will rename!


> Also, the problem is that while we know that what packet_read() stores
> would fit in memory (in size_t), it is not true for reading whole file,
> which might be very large - for example huge graphical assets like raw
> images or raw videos, or virtual machine images.  Isn't that the goal
> of git-LFS solutions, which need this feature?  Shouldn't we have then
> both `multi_packet_read_to_fd` and `multi_packet_read_to_buf`,
> or whatever?

Git LFS works well with the current clean/smudge mechanism that uses the
same on in memory buffers. I understand your concern but I think this
improvement is out of scope for this patch series.


> Also, if we have `fd_in`, then perhaps `sb_out`?

Agreed!


> I am also unsure if `expected_bytes` (or `expected_size`) should not be
> just a size hint, leaving handing mismatch between expected size and
> real size of output to the caller; then the `is_stream` would be not
> needed.

As mentioned in a previous email... I will drop the "size" support in
this patch series as it is not really needed.


>> +{
>> +	int bytes_read;
>> +	size_t total_bytes_read = 0;
> 
> Why `bytes_read` is int, while `total_bytes_read` is size_t? Ah, I see
> that packet_read() returns an int.  It should be ssize_t, just like
> read(), isn't it?  But we know that packet size is limited, and would
> fit in an int (or would it?).

Yes, it is limited but I agree on ssize_t!


> Also, total_bytes_read could overflow size_t, but then we would have
> problems storing the result in strbuf.

Would that check be ok?

		if (total_bytes_read > SIZE_MAX - bytes_read)
			return 1;  // `total_bytes_read` would overflow and is not representable


>> +	if (expected_bytes == 0 && !is_stream)
>> +		return 0;
> 
> So in all cases *except* size = 0 we expect flush packet after the
> contents, but size = 0 is a corner case without flush packet?

I agree that is inconsistent... I will change it!


>> +
>> +	if (is_stream)
>> +		strbuf_grow(sb, LARGE_PACKET_MAX);           // allocate space for at least one packet
>> +	else
>> +		strbuf_grow(sb, st_add(expected_bytes, 1));  // add one extra byte for the packet flush
>> +
>> +	do {
>> +		bytes_read = packet_read(
>> +			fd_in, NULL, NULL,
>> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
>> +			PACKET_READ_GENTLE_ON_EOF
>> +		);
>> +		if (bytes_read < 0)
>> +			return 1;  // unexpected EOF
> 
> Don't we usually return negative numbers on error?  Ah, I see that the
> return is a bool, which allows to use boolean expression with 'return'.
> But I am still unsure if it is good API, this return value.

According to Peff zero for success is the usual style:
http://public-inbox.org/git/20160728133523.GB21311%40sigill.intra.peff.net/


> If we move handling of size mismatch to the caller, then the function
> can simply return the size of data read (probably off_t or uint64_t).
> Then the caller can check if it is what it expected, and react accordingly.

True, but as discussed previously I will remove the size.


>> +
>> +		if (is_stream &&
>> +			bytes_read > 0 &&
>> +			sb->len - total_bytes_read - 1 <= 0)
>> +			strbuf_grow(sb, st_add(sb->len, LARGE_PACKET_MAX));
>> +		total_bytes_read += bytes_read;
>> +	}
>> +	while (
>> +		bytes_read > 0 &&                   // the last packet was no flush
>> +		sb->len - total_bytes_read - 1 > 0  // we still have space left in the buffer
> 
> Ah, so buffer is resized only in the 'is_stream' case.  Perhaps then
> use an "int options" instead of 'is_stream', and have one of flags
> tell if we should resize or not, that is if size parameter is hint
> or a strict limit.

Obsolete


>> +	);
>> +	strbuf_setlen(sb, total_bytes_read);
>> +	return (is_stream ? 0 : expected_bytes != total_bytes_read);
>> +}
>> +
>> +static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
> 
> Is it equivalent of copy_fd() function, but where destination uses pkt-line
> and we need to pack data into pkt-lines?

Correct!


>> +{
>> +	int did_fail = 0;
>> +	ssize_t bytes_to_write;
>> +	while (!did_fail) {
>> +		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_LEN);
> 
> Using global variable packet_buffer makes this code thread-unsafe, isn't it?
> But perhaps that is not a problem, because other functions are also
> using this global variable.

Correct!


> It is more of PKTLINE_DATA_MAXLEN, isn't it?

Agreed, will change!


> 
>> +		if (bytes_to_write < 0)
>> +			return 1;
>> +		if (bytes_to_write == 0)
>> +			break;
>> +		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
>> +	}
>> +	if (!did_fail)
>> +		did_fail = packet_flush_gentle(fd_out);
> 
> Shouldn't we try to flush even if there was an error?  Or is it
> that if there is an error writing, then there is some problem
> such that we know that flush would not work?

Right, that's what I though.


>> +	return did_fail;
> 
> Return true on fail?  Shouldn't we follow example of copy_fd()
> from copy.c, and return COPY_READ_ERROR, or COPY_WRITE_ERROR,
> or PKTLINE_WRITE_ERROR?

OK. How about this?

static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
{
	int did_fail = 0;
	ssize_t bytes_to_write;
	while (!did_fail) {
		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_MAXLEN);
		if (bytes_to_write < 0)
			return COPY_READ_ERROR;
		if (bytes_to_write == 0)
			break;
		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
	}
	if (!did_fail)
		did_fail = packet_flush_gently(fd_out);
	return (did_fail ? COPY_WRITE_ERROR : 0);
}


>> +}
>> +
>> +static int multi_packet_write_from_buf(const char *src, size_t len, int fd_out)
> 
> It is equivalent of write_in_full(), with different order of parameters,
> but where destination file descriptor expects pkt-line and we need to pack
> data into pkt-lines?

True. Do you suggest to reorder parameters? I also would like to rename `src` to `src_in`, OK?

> 
> NOTE: function description comments?

What do you mean here?


>> +{
>> +	int did_fail = 0;
>> +	size_t bytes_written = 0;
>> +	size_t bytes_to_write;
> 
> Note to self: bytes_to_write should fit in size_t, as it is limited to
> PKTLINE_DATA_LEN.  bytes_written should fit in size_t, as it is at most
> len, which is of type size_t.
> 
>> +	while (!did_fail) {
>> +		if ((len - bytes_written) > PKTLINE_DATA_LEN)
>> +			bytes_to_write = PKTLINE_DATA_LEN;
>> +		else
>> +			bytes_to_write = len - bytes_written;
>> +		if (bytes_to_write == 0)
>> +			break;
>> +		did_fail |= direct_packet_write_data(fd_out, src + bytes_written, bytes_to_write, 1);
>> +		bytes_written += bytes_to_write;
> 
> Ah, I see now why we need both direct_packet_write() and
> direct_packet_write_data().  Nice abstraction, makes for
> clear code.
> 
> The last parameter of '1' means 'gently', isn't it?

Correct. Thanks :)


>> +	}
>> +	if (!did_fail)
>> +		did_fail = packet_flush_gentle(fd_out);
>> +	return did_fail;
>> +}
> 
> I think all three/four of those functions should be added in a separate
> commit, separate patch in patch series.

OK

>  Namely:
> 
> - for git -> filter:
>    * read from fd,      write pkt-line to fd  (off_t)
>    * read from str+len, write pkt-line to fd  (size_t, ssize_t)
> - for filter -> git:
>    * read pkt-line from fd, write to fd       (off_t)

This one does not exist.


>    * read pkt-line from fd, write to str+len  (size_t, ssize_t)
> 
> Perhaps some of those can be in one overloaded function, perhaps it would
> be easier to keep them separate.

I would like to keep them separate as it is easier to comprehend.

> 
> Also, I do wonder how the fetch / push code spools pack file received
> over pkt-lines to disk.  Can we reuse that code?

I haven't found any.


>  Or maybe that code
> could use those new functions?

I think so, but this would be out of scope for this series :)


>> +
>> +#define FILTER_CAPABILITIES_STREAM   0x1
>> +#define FILTER_CAPABILITIES_CLEAN    0x2
>> +#define FILTER_CAPABILITIES_SMUDGE   0x4
>> +#define FILTER_CAPABILITIES_SHUTDOWN 0x8
>> +#define FILTER_SUPPORTS_STREAM(type) ((type) & FILTER_CAPABILITIES_STREAM)
>> +#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
>> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
>> +#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) & FILTER_CAPABILITIES_SHUTDOWN)
>> +
>> +struct cmd2process {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	const char *cmd;
>> +	int supported_capabilities;
> 
> I wonder if switching from int (perhaps with field width of 1 to denote
> that it is boolean-like flag) to mask makes it more readable, or less.
> But I think it is.
> 
> 
> Reading Documentation/technical/api-hashmap.txt I found the following
> recommendation:
> 
>  `struct hashmap_entry`::
> 
>        An opaque structure representing an entry in the hash table, which must
>        be used as first member of user data structures. Ideally it should be
>        followed by an int-sized member to prevent unused memory on 64-bit
>        systems due to alignment.
> 
> Therefore it "int supported_capabilities" should precede
> "const char *cmd", I think.  Though it is not strictly necessary; it
> is not as if this hash table were large (maximum size is limited by
> the number of filter drivers configured), so we don't waste much space
> due to internal padding / due to alignment.

Thanks! I will change it to your suggestion anyway!


> 
>> +	struct child_process process;
>> +};
>> +
>> +static int cmd_process_map_initialized = 0;
>> +static struct hashmap cmd_process_map;
> 
> Reading Documentation/technical/api-hashmap.txt I see that:
> 
>  `tablesize` is the allocated size of the hash table. A non-0 value indicates
>  that the hashmap is initialized.
> 
> So cmd_process_map_initialized is not really needed, is it?

I copied that from config.c:
https://github.com/git/git/blob/f8f7adce9fc50a11a764d57815602dcb818d1816/config.c#L1425-L1428

`git grep "tablesize"` reveals that the check for `tablesize` is only used
in hashmap.c ... so what approach should we use?


>> +
>> +static int cmd2process_cmp(const struct cmd2process *e1,
>> +							const struct cmd2process *e2,
>> +							const void *unused)
>> +{
>> +	return strcmp(e1->cmd, e2->cmd);
>> +}
> 
> Well, to be exact (which is decidely not needed!) two commands might
> be equivalent not being identical as strings (e.g. extra space between
> parameters).  But it is something the user should care about, not Git.
> 
>> +
>> +static struct cmd2process *find_protocol2_filter_entry(struct hashmap *hashmap, const char *cmd)
> 
> I'm not sure if *_protocol2_* is needed; those functions are static,
> local to convert.c.

I want to make sure that the reader understands that these functions are
related to the filter protocol version 2. Not OK?


>> +{
>> +	struct cmd2process k;
> 
> Does this name of variable 'k' follow established convention?
> 'key' would be more descriptive, but it's not as if this function
> was long; so 'k' is all right, I think.

I agree on "key".


> 
>> +	hashmap_entry_init(&k, strhash(cmd));
>> +	k.cmd = cmd;
>> +	return hashmap_get(hashmap, &k, NULL);
>> +}
>> +
>> +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry) {
> 
> Programming style: the opening brace should be on separate line,
> that is:
> 
>  +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry)
>  +{

Agreed!


>> +	if (!entry)
>> +		return;
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	close(entry->process.in);
>> +	close(entry->process.out);
>> +	sigchain_pop(SIGPIPE);
>> +	finish_command(&entry->process);
>> +	child_process_clear(&entry->process);
>> +	hashmap_remove(hashmap, entry, NULL);
>> +	free(entry);
>> +}
> 
> All those, from #define FILTER_CAPABILITIES_ to here could be put
> in a separate patch, to reduce size of this one.  But I am less
> sure that it is worth it for this case.
> 
>> +
>> +void shutdown_protocol2_filter(pid_t pid)
>> +{
> [...]
> 
> In my opinion this should be postponed to a separate commit.

Agreed!

> 
>> +}
>> +
>> +static struct cmd2process *start_protocol2_filter(struct hashmap *hashmap, const char *cmd)
> 
> This has some parts in common with existing filter_buffer_or_fd().
> I wonder if it would be worth to extract those common parts.
> 
> But perhaps it would be better to leave such refactoring for later.
> 
>> +{
>> +	int did_fail;
>> +	struct cmd2process *entry;
>> +	struct child_process *process;
>> +	const char *argv[] = { cmd, NULL };
>> +	struct string_list capabilities = STRING_LIST_INIT_NODUP;
>> +	char *capabilities_buffer;
>> +	int i;
>> +
>> +	entry = xmalloc(sizeof(*entry));
>> +	hashmap_entry_init(entry, strhash(cmd));
>> +	entry->cmd = cmd;
>> +	entry->supported_capabilities = 0;
>> +	process = &entry->process;
>> +
>> +	child_process_init(process);
> 
> filter_buffer_or_fd() uses instead
> 
>  struct child_process child_process = CHILD_PROCESS_INIT;
> 
> But I see that you need to access &entry->process anyway, so you
> need to have it here, and in this case child_process_init() is
> equivalent.
> 
> I wonder if it would be worth it to use strbuf for cmd.

What do you mean by "worth it to use strbuf for cmd"? Why would
we need a strbuf?


>> +	process->argv = argv;
>> +	process->use_shell = 1;
>> +	process->in = -1;
>> +	process->out = -1;
>> +	process->clean_on_exit = 1;
>> +	process->clean_on_exit_handler = shutdown_protocol2_filter;
> 
> These two lines are new, and related to the "shutdown" capability, isn't it?

Yes.


> 
>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external filter '%s'", cmd);
>> +		kill_protocol2_filter(hashmap, entry);
> 
> I guess the alternative solution of adding filter to the hashmap only
> after starting the process would be racy?
> 
> Ah, disregard that. I see that this pattern is a common way to error
> out in this function (for process-related errors).
> 
>> +		return NULL;
>> +	}
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	did_fail = strcmp(packet_read_line(process->out, NULL), "git-filter-protocol");
>> +	if (!did_fail)
>> +		did_fail |= strcmp(packet_read_line(process->out, NULL), "version 2");
>> +	if (!did_fail)
>> +		capabilities_buffer = packet_read_line(process->out, NULL);
>> +	else
>> +		capabilities_buffer = NULL;
>> +	sigchain_pop(SIGPIPE);
>> +
>> +	if (!did_fail && capabilities_buffer) {
>> +		string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1);
>> +		if (capabilities.nr > 1 &&
>> +			!strcmp(capabilities.items[0].string, "capabilities")) {
>> +			for (i = 1; i < capabilities.nr; i++) {
>> +				const char *requested = capabilities.items[i].string;
>> +				if (!strcmp(requested, "stream")) {
>> +					entry->supported_capabilities |= FILTER_CAPABILITIES_STREAM;
>> +				} else if (!strcmp(requested, "clean")) {
>> +					entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN;
>> +				} else if (!strcmp(requested, "smudge")) {
>> +					entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE;
>> +				} else if (!strcmp(requested, "shutdown")) {
>> +					entry->supported_capabilities |= FILTER_CAPABILITIES_SHUTDOWN;
>> +				} else {
>> +					warning(
>> +						"external filter '%s' requested unsupported filter capability '%s'",
>> +						cmd, requested
>> +					);
>> +				}
>> +			}
>> +		} else {
>> +			error("filter capabilities not found");
>> +			did_fail = 1;
>> +		}
>> +		string_list_clear(&capabilities, 0);
>> +	}
> 
> I wonder if the above conditional wouldn't be better to be put in
> a separate function, parse_filter_capabilities(capabilities_buffer),
> returning a mask, or having mask as an out parameter, and returning
> an error condition.

Agreed.


>> +
>> +	if (did_fail) {
>> +		error("initialization for external filter '%s' failed", cmd);
> 
> More detailed information not needed, because one can use GIT_PACKET_TRACE.
> Would it be worth add this information as a kind of advice, or put it
> in the documentation of the `process` option?

I will put it into the docs.


> 
>> +		kill_protocol2_filter(hashmap, entry);
>> +		return NULL;
>> +	}
>> +
>> +	hashmap_add(hashmap, entry);
>> +	return entry;
>> +}
>> +
>> +static int apply_protocol2_filter(const char *path, const char *src, size_t len,
>> +						int fd, struct strbuf *dst, const char *cmd,
>> +						const int wanted_capability)
> 
> apply_protocol2_filter, or apply_process_filter?  Or rather,
> s/_protocol2_/_process_/g ?

Mh. I wanted to convey that this functions is protocol V2 related...

> 
> This is equivalent to
> 
>   static int apply_filter(const char *path, const char *src, size_t len, int fd,
>                           struct strbuf *dst, const char *cmd)
> 
> Could we have extended that one instead?

Initially I had one function but that got kind of long ... I prefer two for now.


>> +{
>> +	int ret = 1;
>> +	struct cmd2process *entry;
>> +	struct child_process *process;
>> +	struct stat file_stat;
>> +	struct strbuf nbuf = STRBUF_INIT;
>> +	size_t expected_bytes = 0;
>> +	char *strtol_end;
>> +	char *strbuf;
>> +	char *filter_type;
>> +	char *filter_result = NULL;
>> +
> 
>> +	if (!cmd || !*cmd)
>> +		return 0;
>> +
>> +	if (!dst)
>> +		return 1;
> 
> This is the same as in apply_filter().
> 
>> +
>> +	if (!cmd_process_map_initialized) {
>> +		cmd_process_map_initialized = 1;
>> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
>> +		entry = NULL;
>> +	} else {
>> +		entry = find_protocol2_filter_entry(&cmd_process_map, cmd);
>> +	}
> 
> Here we try to find existing process, rather than starting new
> as in apply_filter()
> 
>> +
>> +	fflush(NULL);
> 
> This is the same as in apply_filter(), but I wonder what it is for.

"If the stream argument is NULL, fflush() flushes all
 open output streams."

http://man7.org/linux/man-pages/man3/fflush.3.html

> 
>> +
>> +	if (!entry) {
>> +		entry = start_protocol2_filter(&cmd_process_map, cmd);
>> +		if (!entry) {
>> +			return 0;
>> +		}
> 
> Style; we prefer:
> 
>  +		if (!entry)
>  +			return 0;

Agreed.


> This is very similar to apply_filter(), but the latter uses start_async()
> from "run-command.h", with filter_buffer_or_fd() as asynchronous process,
> which gets passed command to run in struct filter_params.  In this
> function start_protocol2_filter() runs start_command(), synchronous API.
> 
> Why the difference?

The protocol V2 requires a sequential processing of the packets. See
discussion with Junio here:
http://public-inbox.org/git/xmqqbn1th5qn.fsf%40gitster.mtv.corp.google.com/

[LONG SNIP]

I will answer the second half in a separate email.

Thanks for the review,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/10] run-command: add clean_on_exit_handler
  2016-08-01 11:14       ` Lars Schneider
@ 2016-08-02  5:53         ` Johannes Sixt
  2016-08-02  7:41           ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Johannes Sixt @ 2016-08-02  5:53 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright, e, peff

Am 01.08.2016 um 13:14 schrieb Lars Schneider:
 >> On 30 Jul 2016, at 11:50, Johannes Sixt <j6t@kdbg.org> wrote:
 >> Am 30.07.2016 um 01:37 schrieb larsxschneider@gmail.com:
 >>> static struct child_to_clean *children_to_clean;
 >>> @@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
 >>> {
 >>> 	while (children_to_clean) {
 >>> 		struct child_to_clean *p = children_to_clean;
 >>> +		if (p->clean_on_exit_handler)
 >>> +			p->clean_on_exit_handler(p->pid);
 >>
 >> This summons demons. cleanup_children() is invoked from a signal
 >> handler. In this case, it can call only async-signal-safe functions.
 >> It does not look like the handler that you are going to install
 >> later will take note of this caveat!
 >>
 >>> 		children_to_clean = p->next;
 >>> 		kill(p->pid, sig);
 >>> 		if (!in_signal)
 >>
 >> The condition that we see here in the context protects free(p)
 >> (which is not async-signal-safe). Perhaps the invocation of the new
 >> callback should be skipped in the same manner when this is called
 >> from a signal handler? 507d7804 (pager: don't use unsafe functions
 >> in signal handlers) may be worth a look.
 >
 > Thanks a lot of pointing this out to me!
 >
 > Do I get it right that after the signal "SIGTERM" I can do a cleanup
 > and don't need to worry about any function calls but if I get any
 > other signal then I can only perform async-signal-safe calls?

No. SIGTERM is not special.

Perhaps you were misled by the SIGTERM mentioned in 
cleanup_children_on_exit()? This function is invoked on regular exit 
(not from a signal). SIGTERM is used in this case to terminate children 
that are still lingering around.

 > If this is correct, then the following solution would work great:
 >
 > 		if (!in_signal && p->clean_on_exit_handler)
 > 			p->clean_on_exit_handler(p->pid);

This should work nevertheless because in_signal is set when the function 
is invoked from a signal handler (of any signal that is caught) via 
cleanup_children_on_signal().

-- Hannes


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 06/10] run-command: add clean_on_exit_handler
  2016-08-02  5:53         ` Johannes Sixt
@ 2016-08-02  7:41           ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-02  7:41 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Git Mailing List, gitster, jnareb, tboegi, mlbright, e, peff


> On 02 Aug 2016, at 07:53, Johannes Sixt <j6t@kdbg.org> wrote:
> 
> Am 01.08.2016 um 13:14 schrieb Lars Schneider:
> >> On 30 Jul 2016, at 11:50, Johannes Sixt <j6t@kdbg.org> wrote:
> >> Am 30.07.2016 um 01:37 schrieb larsxschneider@gmail.com:
> >>> static struct child_to_clean *children_to_clean;
> >>> @@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
> >>> {
> >>> 	while (children_to_clean) {
> >>> 		struct child_to_clean *p = children_to_clean;
> >>> +		if (p->clean_on_exit_handler)
> >>> +			p->clean_on_exit_handler(p->pid);
> >>
> >> This summons demons. cleanup_children() is invoked from a signal
> >> handler. In this case, it can call only async-signal-safe functions.
> >> It does not look like the handler that you are going to install
> >> later will take note of this caveat!
> >>
> >>> 		children_to_clean = p->next;
> >>> 		kill(p->pid, sig);
> >>> 		if (!in_signal)
> >>
> >> The condition that we see here in the context protects free(p)
> >> (which is not async-signal-safe). Perhaps the invocation of the new
> >> callback should be skipped in the same manner when this is called
> >> from a signal handler? 507d7804 (pager: don't use unsafe functions
> >> in signal handlers) may be worth a look.
> >
> > Thanks a lot of pointing this out to me!
> >
> > Do I get it right that after the signal "SIGTERM" I can do a cleanup
> > and don't need to worry about any function calls but if I get any
> > other signal then I can only perform async-signal-safe calls?
> 
> No. SIGTERM is not special.
> 
> Perhaps you were misled by the SIGTERM mentioned in cleanup_children_on_exit()? This function is invoked on regular exit (not from a signal). SIGTERM is used in this case to terminate children that are still lingering around.

Yes, that was my source of confusion. Thanks for the clarification!

> 
> > If this is correct, then the following solution would work great:
> >
> > 		if (!in_signal && p->clean_on_exit_handler)
> > 			p->clean_on_exit_handler(p->pid);
> 
> This should work nevertheless because in_signal is set when the function is invoked from a signal handler (of any signal that is caught) via cleanup_children_on_signal().

Right. Thank you!

- Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-07-31 21:45       ` Lars Schneider
@ 2016-08-02 19:56         ` Torsten Bögershausen
  2016-08-05  9:59           ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Torsten Bögershausen @ 2016-08-02 19:56 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git@vger.kernel.org, gitster@pobox.com, jnareb@gmail.com,
	mlbright@gmail.com, e@80x24.org, peff@peff.net

On Sun, Jul 31, 2016 at 11:45:08PM +0200, Lars Schneider wrote:
> 
> > On 31 Jul 2016, at 22:36, Torstem Bögershausen <tboegi@web.de> wrote:
> > 
> > 
> > 
> >> Am 29.07.2016 um 20:37 schrieb larsxschneider@gmail.com:
> >> 
> >> From: Lars Schneider <larsxschneider@gmail.com>
> >> 
> >> packet_flush() would die in case of a write error even though for some callers
> >> an error would be acceptable.
> > What happens if there is a write error ?
> > Basically the protocol is out of synch.
> > Lenght information is mixed up with payload, or the other way
> > around.
> > It may be, that the consequences of a write error are acceptable,
> > because a filter is allowed to fail.
> > What is not acceptable is a "broken" protocol.
> > The consequence schould be to close the fd and tear down all
> > resources. connected to it.
> > In our case to terminate the external filter daemon in some way,
> > and to never use this instance again.
> 
> Correct! That is exactly what is happening in kill_protocol2_filter()
> here:

Wait a second.
Is kill the same as shutdown ?
I would expect that
The process terminates itself as soon as it detects EOF.
As there is nothing more read.

Then the next question: The combination of kill & protocol in kill_protocol(),
what does it mean ?
Is it more like a graceful shutdown_protocol() ?


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-07-31 22:19     ` Jakub Narębski
  2016-08-01 17:55       ` Lars Schneider
@ 2016-08-03 13:10       ` Lars Schneider
  2016-08-04 10:18         ` Jakub Narębski
  1 sibling, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 13:10 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 01 Aug 2016, at 00:19, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
> 

[LONG SNIP]

First part answered here:
http://public-inbox.org/git/5180D54D-92C4-4875-AEB3-801663D70A8B%40gmail.com/

> 
>> +	}
>> +	process = &entry->process;
>> +
>> +	if (!(wanted_capability & entry->supported_capabilities))
>> +		return 1;  // it is OK if the wanted capability is not supported
>> +
>> +	if FILTER_SUPPORTS_CLEAN(wanted_capability)
>> +		filter_type = "clean";
>> +	else if FILTER_SUPPORTS_SMUDGE(wanted_capability)
>> +		filter_type = "smudge";
>> +	else
>> +		die("unexpected filter type");
> 
> Style: it should be
> 
>  +	if (FILTER_SUPPORTS_CLEAN(wanted_capability))
>  +		filter_type = "clean";
>  +	else if (FILTER_SUPPORTS_SMUDGE(wanted_capability))
>  +		filter_type = "smudge";
>  +	else
>  +		die("unexpected filter type");
> 
> even though by accident the macro provides the parentheses to "if".

Agreed.


> Can we make an error/die message more detailed?  Maybe it is
> not possible...

Yeah, I don't see an easy way...

> 
>> +
>> +	if (fd >= 0 && !src) {
>> +		if (fstat(fd, &file_stat) == -1)
>> +			return 0;
>> +		len = file_stat.st_size;
>> +	}
> 
> All right, when fstat() can fail?  Could we then send contents without
> size upfront, or is it better to require size to make it more consistent
> for filter drivers scripts?

If fstat() fails then there is clearly something wrong and the filter
should fail.


> Could this whole "send single file" be put in a separate function?
> Or is it not worth it?

This function would have almost the same signature as apply_protocol2_filter
and therefore I would say it's not worth it since the function is not
crazy long.


> 
>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
> 
> Hmmm... ignoring SIGPIPE was good for one-shot filters.  Is it still
> O.K. for per-command persistent ones?

Very good question. You are right... we don't want to ignore any errors
during the protocol... I will remove it.


> 
>> +
>> +	packet_buf_write(&nbuf, "%s\n", filter_type);
>> +	ret &= !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>> +
>> +	if (ret) {
>> +		strbuf_reset(&nbuf);
>> +		packet_buf_write(&nbuf, "filename=%s\n", path);
>> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>> +	}
> 
> Perhaps a better solution would be
> 
>        if (err)
>        	goto fin_error;
> 
> rather than this.

OK, I change it to goto error handling style.

> 
>> +
>> +	if (ret) {
>> +		strbuf_reset(&nbuf);
>> +		packet_buf_write(&nbuf, "size=%"PRIuMAX"\n", (uintmax_t)len);
>> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>> +	}
> 
> Or maybe extract writing the header for a file into a separate function?
> This one gets a bit long...

Maybe... but I think that would make it harder to understand the protocol. I
think I would prefer to have all the communication in one function layer.


>> +
>> +	if (ret) {
>> +		if (fd >= 0)
>> +			ret = !multi_packet_write_from_fd(fd, process->in);
>> +		else
>> +			ret = !multi_packet_write_from_buf(src, len, process->in);
>> +	}
> 
> This is not streaming.  The above sends whole file, or whole string to
> the filter process, without draining filter output.  If the filter were
> to read some, then write some, it might deadlock on full buffers, isn't it?
> Or am I mistaken?

Correct.


>> +
>> +	if (ret && !FILTER_SUPPORTS_STREAM(entry->supported_capabilities)) {
>> +		strbuf = packet_read_line(process->out, NULL);
>> +		if (strlen(strbuf) > 5 && !strncmp("size=", strbuf, 5)) {
>> +			expected_bytes = (off_t)strtol(strbuf + 5, &strtol_end, 10);
>> +			ret = (strtol_end != strbuf && errno != ERANGE);
>> +		} else {
>> +			ret = 0;
>> +		}
>> +	}
>> +
>> +	if (ret) {
>> +		strbuf_reset(&nbuf);
>> +		ret = !multi_packet_read(process->out, &nbuf, expected_bytes,
>> +			FILTER_SUPPORTS_STREAM(entry->supported_capabilities));
>> +	}
> 
> What happens if the output of filter does not fit in size_t?  I see that
> (I think) this problem is inherited from the original implementation.

Correct. And therefore I would prefer not to change this in this series.


>> +
>> +	if (ret) {
>> +		filter_result = packet_read_line(process->out, NULL);
>> +		ret = !strcmp(filter_result, "success");
>> +	}
>> +
>> +	sigchain_pop(SIGPIPE);
>> +
>> +	if (ret) {
>> +		strbuf_swap(dst, &nbuf);
>> +	} else {
>> +		if (!filter_result || strcmp(filter_result, "reject")) {
>> +			// Something went wrong with the protocol filter. Force shutdown!
>> +			error("external filter '%s' failed", cmd);
>> +			kill_protocol2_filter(&cmd_process_map, entry);
>> +		}
>> +	}
> 
> So if Git gets finish signal "success" from filter, it accepts the output.
> If Git gets finish signal "reject" from filter, it restarts filter (and
> reject the output - user can retry the command himself / herself).
> If Git gets any other finish signal, for example "error" (but this is not
> standarized), then it rejects the output, keeping the unfiltered result,
> but keeps filtering.
> 
> I think it is not described in this detail in the documentation of the
> new protocol.

Agreed, will add!

> 
>> +	strbuf_release(&nbuf);
>> +	return ret;
>> +}
> 
> I wonder if this point might be start of the new patch... but then you
> would have no way to test what you wrote.
> 
>> +
>> static struct convert_driver {
>> 	const char *name;
>> 	struct convert_driver *next;
>> 	const char *smudge;
>> 	const char *clean;
>> +	const char *process;
>> 	int required;
>> } *user_convert, **user_convert_tail;
> 
> All right.
> 
>> 
>> @@ -526,6 +871,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>> 	if (!strcmp("clean", key))
>> 		return git_config_string(&drv->clean, var, value);
>> 
>> +	if (!strcmp("process", key)) {
>> +		return git_config_string(&drv->process, var, value);
>> +	}
>> +
> 
> All right.
> 
>> 	if (!strcmp("required", key)) {
>> 		drv->required = git_config_bool(var, value);
>> 		return 0;
>> @@ -823,7 +1172,12 @@ int would_convert_to_git_filter_fd(const char *path)
>> 	if (!ca.drv->required)
>> 		return 0;
>> 
>> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
>> +	if (!ca.drv->clean && ca.drv->process)
>> +		return apply_protocol2_filter(
>> +			path, NULL, 0, -1, NULL, ca.drv->process, FILTER_CAPABILITIES_CLEAN
>> +		);
>> +	else
>> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> 
> Could we augment apply_filter() instead, so that the invocation is
> 
>        return apply_filter(path, NULL, 0, -1, NULL, ca.drv, FILTER_CLEAN);
> 
> Though I am not sure if moving this conditional to apply_filter would
> be a good idea; maybe wrapper around augmented apply_filter_do()?

Yes, a wrapper makes it way cleaner!


>> }
>> 
>> const char *get_convert_attr_ascii(const char *path)
>> @@ -856,17 +1210,24 @@ int convert_to_git(const char *path, const char *src, size_t len,
>>                    struct strbuf *dst, enum safe_crlf checksafe)
>> {
>> 	int ret = 0;
>> -	const char *filter = NULL;
>> +	const char *clean_filter = NULL;
>> +	const char *process_filter = NULL;
>> 	int required = 0;
>> 	struct conv_attrs ca;
>> 
>> 	convert_attrs(&ca, path);
>> 	if (ca.drv) {
>> -		filter = ca.drv->clean;
>> +		clean_filter = ca.drv->clean;
>> +		process_filter = ca.drv->process;
>> 		required = ca.drv->required;
>> 	}
> 
> All right (assuming un-augmented apply_filter()).
> 
>> 
>> -	ret |= apply_filter(path, src, len, -1, dst, filter);
>> +	if (!clean_filter && process_filter)
>> +		ret |= apply_protocol2_filter(
>> +			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_CLEAN
>> +		);
>> +	else
>> +		ret |= apply_filter(path, src, len, -1, dst, clean_filter);
> 
> I wonder if it would be more readable to write it like this
> (and of course elsewhere too):
> 
>  +	if (!clean_filter && process_filter)
>  +		ret |= apply_protocol2_filter(
>  +			path, src, len, -1, dst, process_filter, FILTER_CAPABILITIES_CLEAN
>  +		);
>  +	else
>  +		ret |= apply_filter(
>  +			path, src, len, -1, dst, clean_filter);
>  +		);
> 
> 
> Though it would screw up "git blame -C -C -w"

Obsolete with the wrapper mentioned above.


>> 	if (!ret && required)
>> 		die("%s: clean filter '%s' failed", path, ca.drv->name);
>> 
>> @@ -885,13 +1246,21 @@ int convert_to_git(const char *path, const char *src, size_t len,
>> void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
>> 			      enum safe_crlf checksafe)
>> {
>> +	int ret = 0;
> 
> Right, 'ret' is needed because we now have two possibilities:
> `clean` filter and `process` filter.
> 
>> 	struct conv_attrs ca;
>> 	convert_attrs(&ca, path);
>> 
>> 	assert(ca.drv);
>> -	assert(ca.drv->clean);
>> +	assert(ca.drv->clean || ca.drv->process);
>> +
>> +	if (!ca.drv->clean && ca.drv->process)
>> +		ret = apply_protocol2_filter(
>> +			path, NULL, 0, fd, dst, ca.drv->process, FILTER_CAPABILITIES_CLEAN
>> +		);
>> +	else
>> +		ret = apply_filter(path, NULL, 0, fd, dst, ca.drv->clean);
>> 
>> -	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
>> +	if (!ret)
>> 		die("%s: clean filter '%s' failed", path, ca.drv->name);
>> 
>> 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
>> @@ -902,14 +1271,16 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>> 					    size_t len, struct strbuf *dst,
>> 					    int normalizing)
>> {
>> -	int ret = 0, ret_filter = 0;
>> -	const char *filter = NULL;
>> +	int ret = 0, ret_filter;
> 
> Why the change:
> 
>  -	int ret = 0, ret_filter = 0;
>  +	int ret = 0, ret_filter;

Reverted with the wrapper.


>> +	const char *smudge_filter = NULL;
>> +	const char *process_filter = NULL;
>> 	int required = 0;
>> 	struct conv_attrs ca;
>> 
>> 	convert_attrs(&ca, path);
>> 	if (ca.drv) {
>> -		filter = ca.drv->smudge;
>> +		process_filter = ca.drv->process;
>> +		smudge_filter = ca.drv->smudge;
>> 		required = ca.drv->required;
>> 	}
> 
> All right, the same.
> 
> [...]
>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
>> index 34c8eb9..e8a7703 100755
>> --- a/t/t0021-conversion.sh
>> +++ b/t/t0021-conversion.sh
>> @@ -296,4 +296,409 @@ test_expect_success 'disable filter with empty override' '
>> 	test_must_be_empty err
>> '
>> 
>> +test_expect_success PERL 'required process filter should filter data' '
>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge shutdown" &&
>> +	test_config_global filter.protocol.required true &&
>> +	rm -rf repo &&
>> +	mkdir repo &&
>> +	(
>> +		cd repo &&
>> +		git init &&
>> +
>> +		echo "*.r filter=protocol" >.gitattributes &&
>> +		git add . &&
>> +		git commit . -m "test commit" &&
> 
> This is more of "Initial commit", not that it matters
> 
>> +		git branch empty &&
>> +
>> +		cat ../test.o >test.r &&
> 
> Err, the above is just copying file, isn't it?
> Maybe it was copied from other tests, I have not checked.

It was created in the "setup" test.


>> +		echo "test22" >test2.r &&
>> +		mkdir testsubdir &&
>> +		echo "test333" >testsubdir/test3.r &&
> 
> All right, we test text file, we test binary file (I assume), we test
> file in a subdirectory.  What about testing empty file?  Or large file
> which would not fit in the stdin/stdout buffer (as EXPENSIVE test)?

No binary file. The main reason for this test is to check multiple files.
I'll add a empty file. A large file is tested in the next test.

> 
>> +
>> +		rm -f rot13-filter.log &&
>> +		git add . &&
> 
> So this runs "clean" filter, storing cleaned contents in the index.

Correct.


>> +		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >uniq-rot13-filter.log &&
>> +		cat >expected_add.log <<-\EOF &&
>> +			1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
>> +			1 IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
>> +			1 IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
> 
> And we check the "know size upfront" case (mistakenly called non-"stream").

Correct - however, I removed non-stream


>> +			1 IN: shutdown -- [OK]
> 
> And test "shutdown" capability (not as separate test).

Fixed.


>> +			1 start
>> +			1 wrote filter header
>> +		EOF
> 
> And we are required to keep the expected_add.log file sorted by hand???

Well, the clean invocations (and therefore their order of appearance)
are not deterministic. See my discussion with Junio here:
http://public-inbox.org/git/xmqqshv18i8i.fsf%40gitster.mtv.corp.google.com/

> 
>> +		test_cmp expected_add.log uniq-rot13-filter.log &&
>> +
>> +		>rot13-filter.log &&
> 
> Truncate log. Still in the same test.
> 
>> +		git commit . -m "test commit" &&
> 
> This is test commit with files undergoing "clean" part of filter.
> 
>> +		sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
>> +			sed "s/^\([0-9]\) IN: clean/x IN: clean/" >uniq-rot13-filter.log &&
> 
> There is known performance regression, in that filter is run more
> than once on given file.
> 
> Actually... why it does not use cleaned-up contents from the index?

See discussion here: http://public-inbox.org/git/20160722152753.GA6859%40sigill.intra.peff.net/


>> +		cat >expected_commit.log <<-\EOF &&
>> +			x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
>> +			x IN: clean test2.r 7 [OK] -- OUT: 7 [OK]
>> +			x IN: clean testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
>> +			1 IN: shutdown -- [OK]
>> +			1 start
>> +			1 wrote filter header
> 
> Right, this is the goal of the patch series: for filter to be started
> only once per git command invocation.
> 
>> +		EOF
>> +		test_cmp expected_commit.log uniq-rot13-filter.log &&
>> +
> 
> Still in the same test, even though we would be testing "smudge"
> capability now.  
> 
> It's a pity that t/test-lib.sh does not support subtests from
> the TAP specification (Test Anything Protocol that Git testsuite
> uses).
> 
>> +		>rot13-filter.log &&
>> +		rm -f test?.r testsubdir/test3.r &&
>> +		git checkout . &&
> 
> All right, we removed some files so that "git checkout ." could
> restore them to life.
> 
>> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
> 
> Useless use of cat
> 
>  +		grep -v "IN: clean"  rot13-filter.log  >smudge-rot13-filter.log &&

Fixed, thanks!


> Also: why 'git checkout <path>' would run "clean" filter?
> Is it existing strange behaviour?

AFAIK, that's existing behavior.


>> +		cat >expected_checkout.log <<-\EOF &&
>> +			start
>> +			wrote filter header
>> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
>> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
>> +			IN: shutdown -- [OK]
>> +		EOF
> 
> This time without 'sort | uniq -c'.

Yes, because the smudge calls are deterministic!


>  Is it really needed for the
> "good" case, or is it there for two cases to look similar?

I am not sure what you mean?!


>> +		test_cmp expected_checkout.log smudge-rot13-filter.log &&
>> +
>> +		git checkout empty &&
> 
> Shouldn't we check that switching to branch 'empty' does not run
> filters, or is it covered by other tests?  Or perhaps this simply
> does not matter here, is it?

Easy enough to check. I will add this.

> 
>> +
>> +		>rot13-filter.log &&
>> +		git checkout master &&
> 
> Does it test different callpath than 'git checkout .'?  Well, the
> set of files is different...
> 
>> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
>> +		cat >expected_checkout_master.log <<-\EOF &&
>> +			start
>> +			wrote filter header
>> +			IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
>> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
>> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
>> +			IN: shutdown -- [OK]
>> +		EOF
>> +		test_cmp expected_checkout_master.log smudge-rot13-filter.log &&
>> +
> 
> And here we start checking that the filter did filter,
> that is the content in the repository is "clean"ed-up.
> Still the same test.
> 
>> +		./../rot13.sh <test.r >expected &&
>> +		git cat-file blob :test.r >actual &&
>> +		test_cmp expected actual &&
>> +
>> +		./../rot13.sh <test2.r >expected &&
>> +		git cat-file blob :test2.r >actual &&
>> +		test_cmp expected actual &&
>> +
>> +		./../rot13.sh <testsubdir/test3.r >expected &&
>> +		git cat-file blob :testsubdir/test3.r >actual &&
>> +		test_cmp expected actual
>> +	)
>> +'
>> +
>> +test_expect_success PERL 'required process filter should filter data stream' '
>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl stream clean smudge" &&
>> +	test_config_global filter.protocol.required true &&
> 
> Errr... I don't see how it is different from the previous test.
> [...]

stream/non-stream ... but this is obsolete in the next roll. fixed!

> 
>> +
>> +test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '
> 
> All right, so this tests the precedence... well, it doesn't.
> 
> It tests that `process` filter with "smudge" capability only works well
> with one-shot `clean` filter.

True. Isn't that what the test description indicates?


>> +	test_config_global filter.protocol.clean ./../rot13.sh &&
>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&
> 
> Why the difference in pathnames (the directory part) between those two?

rot13.sh is generated in the header of the file.
rot13-filter.pl is part of the test suite


>> +	test_config_global filter.protocol.required true &&
>> +	rm -rf repo &&
>> +	mkdir repo &&
>> +	(
>> +		cd repo &&
>> +		git init &&
>> +
>> +		echo "*.r filter=protocol" >.gitattributes &&
>> +		git add . &&
>> +		git commit . -m "test commit" &&
>> +		git branch empty &&
>> +
>> +		cat ../test.o >test.r &&
>> +		echo "test22" >test2.r &&
>> +		mkdir testsubdir &&
>> +		echo "test333" >testsubdir/test3.r &&
>> +
>> +		rm -f rot13-filter.log &&
>> +		git add . &&
>> +		test_must_be_empty rot13-filter.log &&
>> +
>> +		>rot13-filter.log &&
>> +		git commit . -m "test commit" &&
>> +		test_must_be_empty rot13-filter.log &&
> 
> All right, these tests that `process` filter is not ran.  But we don't
> know if it is because it lacks capability, or because it is overriden
> by one-shot filter (well, that comes later).

Only the clean one shot filter is configured. Therefore that shouldn't be a
problem, right?


>> +
>> +		>rot13-filter.log &&
>> +		rm -f test?.r testsubdir/test3.r &&
>> +		git checkout . &&
>> +		cat rot13-filter.log | grep -v "IN: clean" >smudge-rot13-filter.log &&
>> +		cat >expected_checkout.log <<-\EOF &&
>> +			start
>> +			wrote filter header
>> +			IN: smudge test2.r 7 [OK] -- OUT: 7 [OK]
>> +			IN: smudge testsubdir/test3.r 8 [OK] -- OUT: 8 [OK]
>> +		EOF
>> +		test_cmp expected_checkout.log smudge-rot13-filter.log &&
> 
> This part is repeated many, many times.  Maybe add some helper
> shell function for this?

Good idea! Will add!


> [...]
>> +		./../rot13.sh <test.r >expected &&
>> +		git cat-file blob :test.r >actual &&
>> +		test_cmp expected actual &&
>> +
>> +		./../rot13.sh <test2.r >expected &&
>> +		git cat-file blob :test2.r >actual &&
>> +		test_cmp expected actual &&
>> +
>> +		./../rot13.sh <testsubdir/test3.r >expected &&
>> +		git cat-file blob :testsubdir/test3.r >actual &&
>> +		test_cmp expected actual
> 
> Here we test that equivalent one-shot cleanup filter was run.
> Here also we have repeated contents; maybe some helper function
> would make it shorter?

Agreed!


>> +	)
>> +'
> 
> Here I am stopping examining tests in detail.
> 
>> +test_expect_success PERL 'required process filter should clean only' '
>> +test_expect_success PERL 'required process filter should process files larger LARGE_PACKET_MAX' '
> 
> Those two tests do not depend on being required or not; it is only
> that without required they would fail softly in case of latter test
> (which we can detect too).

True, but since they fail hard it is easier to check.


>> +test_expect_success PERL 'required process filter should with clean error should fail' '
>> +test_expect_success PERL 'process filter should restart after unexpected write failure' '
> 
> So these two are sort of complimentary.  When `process` is required,
> then it should fail if it cannot filter some file.  If it is not,
> it should keep processing other files.

True.


>> +test_expect_success PERL 'process filter should not restart after intentionally rejected file' '
> 
> Uh... all right, so "reject" means that filter cannot continue?
> Strange meaning for 'reject', though ;-)

No, with reject a filter can say "I don't want to process that file". This is a legitimate
response and I don't Git to restart the filter in that case.


>> test_done
>> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
>> new file mode 100755
>> index 0000000..cb0925d
>> --- /dev/null
>> +++ b/t/t0021/rot13-filter.pl
>> @@ -0,0 +1,177 @@
>> +#!/usr/bin/perl
>> +#
>> +# Example implementation for the Git filter protocol version 2
>> +# See Documentation/gitattributes.txt, section "Filter Protocol"
>> +#
>> +# The script takes the list of supported protocol capabilities as
>> +# arguments ("stream", "clean", and "smudge" are supported).
> 
> What about "shutdown"?

Will fix.


>> +#
>> +# This implementation supports three special test cases:
>> +# (1) If data with the filename "clean-write-fail.r" is processed with
>> +#     a "clean" operation then the write operation will die.
>> +# (2) If data with the filename "smudge-write-fail.r" is processed with
>> +#     a "smudge" operation then the write operation will die.
> 
> All right, so it is hard failure with filter script dying.

Correct.

> 
>> +# (3) If data with the filename "failure.r" is processed with any
>> +#     operation then the filter signals that the operation was not
>> +#     successful.
> 
> All right, so it is failure detected by filter script and signalled to Git.
> 
>> +#
>> +
>> +use strict;
>> +use warnings;
> 
> So no more "use autodie", because of compatibility with old Perls.
> 
>> +
>> +my $MAX_PACKET_CONTENT_SIZE = 65516;
>> +my @capabilities            = @ARGV;
> 
> No autoflush this time?

Eric recommended to disable it:
http://public-inbox.org/git/20160723072721.GA20875%40starla/


>> +
>> +sub rot13 {
>> +    my ($str) = @_;
>> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
>> +    return $str;
>> +}
>> +
>> +sub packet_read {
>> +    my $buffer;
>> +    my $bytes_read = read STDIN, $buffer, 4;
>> +    if ( $bytes_read == 0 ) {
>> +        return;
>> +    }
>> +    elsif ( $bytes_read != 4 ) {
>> +        die "invalid packet size '$bytes_read' field";
>> +    }
>> +    my $pkt_size = hex($buffer);
>> +    if ( $pkt_size == 0 ) {
>> +        return ( 1, "" );
> 
> Unusual return convention.  Though it is a test script, so
> it doesn't matter much.
> 
>> +    }
>> +    elsif ( $pkt_size > 4 ) {
>> +        my $content_size = $pkt_size - 4;
>> +        $bytes_read = read STDIN, $buffer, $content_size;
>> +        if ( $bytes_read != $content_size ) {
>> +            die "invalid packet";
> 
> More detailed error message, maybe?

OK


>> +        }
>> +        return ( 0, $buffer );
>> +    }
>> +    else {
>> +        die "invalid packet size";
>> +    }
>> +}
>> +
>> +sub packet_write {
>> +    my ($packet) = @_;
>> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
>> +    print STDOUT $packet;
>> +    STDOUT->flush();
>> +}
>> +
>> +sub packet_flush {
>> +    print STDOUT sprintf( "%04x", 0 );
>> +    STDOUT->flush();
>> +}
>> +
>> +open my $debug, ">>", "rot13-filter.log";
>> +print $debug "start\n";
>> +$debug->flush();
>> +
>> +packet_write("git-filter-protocol\n");
>> +packet_write("version 2\n");
>> +packet_write( "capabilities " . join( ' ', @capabilities ) . "\n" );
>> +print $debug "wrote filter header\n";
>> +$debug->flush();
>> +
>> +while (1) {
>> +    my $command = packet_read();
>> +    unless ( defined($command) ) {
>> +        exit();
>> +    }
>> +    chomp $command;
>> +    print $debug "IN: $command";
>> +    $debug->flush();
>> +
>> +    if ( $command eq "shutdown" ) {
>> +        print $debug " -- [OK]";
>> +        $debug->flush();
>> +        packet_write("done\n");
>> +        exit();
>> +    }
>> +
>> +    my ($filename) = packet_read() =~ /filename=([^=]+)\n/;
>> +    print $debug " $filename";
>> +    $debug->flush();
>> +    my ($filelen) = packet_read() =~ /size=([^=]+)\n/;
>> +    chomp $filelen;
> 
> I think this chomp is not needed, as "\n" is not included.
> Though the regexp should probably be anchored.

Agreed.


>> +    print $debug " $filelen";
>> +    $debug->flush();
>> +
>> +    $filelen =~ /\A\d+\z/ or die "bad filelen: $filelen";
>> +    my $output;
>> +
>> +    if ( $filelen > 0 ) {
> 
> So here is a special case for $filelen = 0.
> Negative $filelen is not allowed, via regexp.

Obsolete in v4.


>> +        my $input = "";
>> +        {
>> +            binmode(STDIN);
>> +            my $buffer;
>> +            my $done = 0;
>> +            while ( !$done ) {
>> +                ( $done, $buffer ) = packet_read();
>> +                $input .= $buffer;
>> +            }
>> +            print $debug " [OK] -- ";
>> +            $debug->flush();
>> +        }
>> +
>> +        if ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
>> +            $output = rot13($input);
>> +        }
>> +        elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
>> +            $output = rot13($input);
>> +        }
> 
> These two conditionals could be shortened, but then they would be less
> readable.  Or not:
> 
>           if ( grep { $_ eq $command } @capabilities ) {
>           	$output = rot13($input);
>           }

I would like to keep it that way for readability since
the test script also serves as example implementation.


>> +        else {
>> +            die "bad command $command";
>> +        }
>> +    }
>> +
>> +    my $output_len = length($output);
>> +    if ( $filename eq "reject.r" ) {
>> +        $output_len = 0;
>> +    }
>> +
>> +    if ( grep( /^stream$/, @capabilities ) ) {
>> +        print $debug "OUT: STREAM ";
>> +    }
>> +    else {
>> +        packet_write("size=$output_len\n");
>> +        print $debug "OUT: $output_len ";
>> +    }
>> +    $debug->flush();
>> +
>> +    if ( $filename eq "reject.r" ) {
>> +        packet_write("reject\n");
>> +        print $debug "[REJECT]\n";    # Could also be an error
> 
> How if could be an error?

Removed.


> 
>> +        $debug->flush();
>> +    }
>> +
>> +    if ( $output_len > 0 ) {
>> +        if (( $command eq "clean" and $filename eq "clean-write-fail.r" )
>> +            or
>> +            ( $command eq "smudge" and $filename eq "smudge-write-fail.r" ))
> 
> Perhaps simply:
> 
>  +        if ( $filename eq "${command}-write-fail.r" ) {

Nice! Will fix!


>> +        {
>> +            print $debug "[WRITE FAIL]\n";
>> +            $debug->flush();
>> +            die "write error";
>> +        }
>> +        else {
>> +            while ( length($output) > 0 ) {
>> +                my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
>> +                packet_write($packet);
>> +                if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
>> +                    $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
>> +                }
>> +                else {
>> +                    $output = "";
>> +                }
>> +            }
>> +            packet_flush();
>> +            packet_write("success\n");
>> +            print $debug "[OK]\n";
>> +            $debug->flush();
>> +        }
>> +    }
>> +}
>> 
> 


Thank you very much (again!) for your extensive review,
Lars



^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v4 00/12] Git filter protocol
  2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
                     ` (9 preceding siblings ...)
  2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
@ 2016-08-03 16:42   ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
                       ` (11 more replies)
  10 siblings, 12 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

thanks a lot for the very helpful reviews!

Patch 1-10 are preparation. Patch 11 and 12 the real feature.

Diff to v3:
* simplify protocol, remove size information
* run clean_on_exit_handler() only on SIGTERM (Hannes)
* move hex() macro inside set_packet_header(), undef it after use (Jakub)
* rename buf to data in direct_packet_write_data() (Jakub)
* add benchmark summary (Jakub)
* add empty file test case (Jakub)
* rename multi_packet_read() to packet_read_till_flush()
* expect a flush packet even after 0 content
* move packet stream helper functions to pkt-line.c/h (Jakub)
* add GIT_PACKET_TRACE hint to docs
* remove SIGPIPE ignore (Jakub)
* change to goto error handling style (Jakub)
* cleanup test cases with helper functions (Jakub)
* move shutdown implementation to dedicated patch

Thanks,
Lars


Lars Schneider (12):
  pkt-line: extract set_packet_header()
  pkt-line: add direct_packet_write() and direct_packet_write_data()
  pkt-line: add packet_flush_gentle()
  pkt-line: call packet_trace() only if a packet is actually send
  pkt-line: add functions to read/write flush terminated packet streams
  pack-protocol: fix maximum pkt-line size
  run-command: add clean_on_exit_handler
  convert: quote filter names in error messages
  convert: modernize tests
  convert: generate large test files only once
  convert: add filter.<driver>.process option
  convert: add filter.<driver>.process shutdown command option

 Documentation/gitattributes.txt             | 108 +++++-
 Documentation/technical/protocol-common.txt |   6 +-
 convert.c                                   | 324 ++++++++++++++++--
 pkt-line.c                                  | 156 ++++++++-
 pkt-line.h                                  |  13 +
 run-command.c                               |  12 +-
 run-command.h                               |   1 +
 t/t0021-conversion.sh                       | 503 +++++++++++++++++++++++++---
 t/t0021/rot13-filter.pl                     | 155 +++++++++
 9 files changed, 1187 insertions(+), 91 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

--
2.9.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 20:18       ` Junio C Hamano
  2016-08-03 16:42     ` [PATCH v4 02/12] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
                       ` (10 subsequent siblings)
  11 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

set_packet_header() converts an integer to a 4 byte hex string. Make
this function locally available so that other pkt-line functions can
use it.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 62fdb37..177dc73 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -97,10 +97,19 @@ void packet_buf_flush(struct strbuf *buf)
 	strbuf_add(buf, "0000", 4);
 }
 
-#define hex(a) (hexchar[(a) & 15])
-static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+static void set_packet_header(char *buf, const int size)
 {
 	static char hexchar[] = "0123456789abcdef";
+	#define hex(a) (hexchar[(a) & 15])
+	buf[0] = hex(size >> 12);
+	buf[1] = hex(size >> 8);
+	buf[2] = hex(size >> 4);
+	buf[3] = hex(size);
+	#undef hex
+}
+
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+{
 	size_t orig_len, n;
 
 	orig_len = out->len;
@@ -111,10 +120,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	if (n > LARGE_PACKET_MAX)
 		die("protocol error: impossibly long line");
 
-	out->buf[orig_len + 0] = hex(n >> 12);
-	out->buf[orig_len + 1] = hex(n >> 8);
-	out->buf[orig_len + 2] = hex(n >> 4);
-	out->buf[orig_len + 3] = hex(n);
+	set_packet_header(&out->buf[orig_len], n);
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 02/12] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
  2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() larsxschneider
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Sometimes pkt-line data is already available in a buffer and it would
be a waste of resources to write the packet using packet_write() which
would copy the existing buffer into a strbuf before writing it.

If the caller has control over the buffer creation then the
PKTLINE_DATA_START macro can be used to skip the header and write
directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
would be the maximum). direct_packet_write() would take this buffer,
adjust the pkt-line header and write it.

If the caller has no control over the buffer creation then
direct_packet_write_data() can be used. This function creates a pkt-line
header. Afterwards the header and the data buffer are written using two
consecutive write calls.

Both functions have a gentle parameter that indicates if Git should die
in case of a write error (gentle set to 0) or return with a error (gentle
set to 1).

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 pkt-line.h |  5 +++++
 2 files changed, 47 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 177dc73..aa158ba 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -136,6 +136,48 @@ void packet_write(int fd, const char *fmt, ...)
 	write_or_die(fd, buf.buf, buf.len);
 }
 
+int direct_packet_write(int fd, char *buf, size_t size, int gentle)
+{
+	int ret = 0;
+	if (size > LARGE_PACKET_MAX) {
+		if (gentle)
+			return 0;
+		else
+			die("protocol error: impossibly long line");
+	}
+	packet_trace(buf + 4, size - 4, 1);
+	set_packet_header(buf, size);
+	if (gentle)
+		ret = !write_or_whine_pipe(fd, buf, size, "pkt-line");
+	else
+		write_or_die(fd, buf, size);
+	return ret;
+}
+
+int direct_packet_write_data(int fd, const char *data, size_t size, int gentle)
+{
+	int ret = 0;
+	char hdr[PKTLINE_HEADER_LEN];
+	if (size > PKTLINE_DATA_MAXLEN) {
+		if (gentle)
+			return 0;
+		else
+			die("protocol error: impossibly long line");
+	}
+	set_packet_header(hdr, PKTLINE_HEADER_LEN + size);
+	packet_trace(data, size, 1);
+	if (gentle) {
+		ret = (
+			!write_or_whine_pipe(fd, hdr, PKTLINE_HEADER_LEN, "pkt-line header") ||
+			!write_or_whine_pipe(fd, data, size, "pkt-line data")
+		);
+	} else {
+		write_or_die(fd, hdr, PKTLINE_HEADER_LEN);
+		write_or_die(fd, data, size);
+	}
+	return ret;
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
 	va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 3cb9d91..ed64511 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,8 @@ void packet_flush(int fd);
 void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int direct_packet_write(int fd, char *buf, size_t size, int gentle);
+int direct_packet_write_data(int fd, const char *data, size_t size, int gentle);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
@@ -77,6 +79,9 @@ char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
 
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
+#define PKTLINE_HEADER_LEN 4
+#define PKTLINE_DATA_START(pkt) ((pkt) + PKTLINE_HEADER_LEN)
+#define PKTLINE_DATA_MAXLEN (LARGE_PACKET_MAX - PKTLINE_HEADER_LEN)
 extern char packet_buffer[LARGE_PACKET_MAX];
 
 #endif
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 03/12] pkt-line: add packet_flush_gentle()
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
  2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
  2016-08-03 16:42     ` [PATCH v4 02/12] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 21:39       ` Jeff King
  2016-08-03 16:42     ` [PATCH v4 04/12] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
                       ` (8 subsequent siblings)
  11 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_flush() would die in case of a write error even though for some callers
an error would be acceptable. Add packet_flush_gentle() which writes a pkt-line
flush packet and returns `0` for success and `1` for failure.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 6 ++++++
 pkt-line.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index aa158ba..c8a052a 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -91,6 +91,12 @@ void packet_flush(int fd)
 	write_or_die(fd, "0000", 4);
 }
 
+int packet_flush_gently(int fd)
+{
+	packet_trace("0000", 4, 1);
+	return !write_or_whine_pipe(fd, "0000", 4, "flush packet");
+}
+
 void packet_buf_flush(struct strbuf *buf)
 {
 	packet_trace("0000", 4, 1);
diff --git a/pkt-line.h b/pkt-line.h
index ed64511..2fbaee9 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_flush_gently(int fd);
 int direct_packet_write(int fd, char *buf, size_t size, int gentle);
 int direct_packet_write_data(int fd, const char *data, size_t size, int gentle);
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 04/12] pkt-line: call packet_trace() only if a packet is actually send
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (2 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 05/12] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

The packet_trace() call is not ideal in format_packet() as we would print
a trace when a packet is formatted and (potentially) when the packet is
actually send. This was no problem up until now because format_packet()
was only used by one function. Fix it by moving the trace call into the
function that actally sends the packet.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pkt-line.c b/pkt-line.c
index c8a052a..d1368e6 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -127,7 +127,6 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 		die("protocol error: impossibly long line");
 
 	set_packet_header(&out->buf[orig_len], n);
-	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
 void packet_write(int fd, const char *fmt, ...)
@@ -139,6 +138,7 @@ void packet_write(int fd, const char *fmt, ...)
 	va_start(args, fmt);
 	format_packet(&buf, fmt, args);
 	va_end(args);
+	packet_trace(buf.buf + 4, buf.len - 4, 1);
 	write_or_die(fd, buf.buf, buf.len);
 }
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 05/12] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (3 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 04/12] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 06/12] pack-protocol: fix maximum pkt-line size larsxschneider
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_stream_with_flush_from_fd() and
packet_write_stream_with_flush_from_buf() write a stream of packets. All
content packets use the maximal packet size except for the last one.
After the last content packet a `flush` control packet is written.

packet_read_till_flush() reads arbitary sized packets until it detects
a `flush` packet.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 pkt-line.h |  7 +++++
 2 files changed, 95 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index d1368e6..f115537 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -193,6 +193,44 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 	va_end(args);
 }
 
+int packet_write_stream_with_flush_from_fd(const int fd_in, const int fd_out)
+{
+	int did_fail = 0;
+	ssize_t bytes_to_write;
+	while (!did_fail) {
+		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_MAXLEN);
+		if (bytes_to_write < 0)
+			return COPY_READ_ERROR;
+		if (bytes_to_write == 0)
+			break;
+		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
+	}
+	if (!did_fail)
+		did_fail = packet_flush_gently(fd_out);
+	return (did_fail ? COPY_WRITE_ERROR : 0);
+}
+
+int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out)
+{
+	int did_fail = 0;
+	size_t bytes_written = 0;
+	size_t bytes_to_write;
+	while (!did_fail) {
+		if ((len - bytes_written) > PKTLINE_DATA_MAXLEN)
+			bytes_to_write = PKTLINE_DATA_MAXLEN;
+		else
+			bytes_to_write = len - bytes_written;
+		if (bytes_to_write == 0)
+			break;
+		did_fail |= direct_packet_write_data(fd_out, src_in + bytes_written, bytes_to_write, 1);
+		bytes_written += bytes_to_write;
+	}
+	if (!did_fail)
+		did_fail = packet_flush_gently(fd_out);
+	return did_fail;
+}
+
+
 static int get_packet_data(int fd, char **src_buf, size_t *src_size,
 			   void *dst, unsigned size, int options)
 {
@@ -302,3 +340,53 @@ char *packet_read_line_buf(char **src, size_t *src_len, int *dst_len)
 {
 	return packet_read_line_generic(-1, src, src_len, dst_len);
 }
+
+ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out)
+{
+	int len, ret;
+	int options = PACKET_READ_GENTLE_ON_EOF;
+	char linelen[4];
+
+	size_t oldlen = sb_out->len;
+	size_t oldalloc = sb_out->alloc;
+
+	for (;;) {
+		// Read packet header
+		ret = get_packet_data(fd_in, NULL, NULL, linelen, 4, options);
+		if (ret < 0)
+			goto done;
+		len = packet_length(linelen);
+		if (len < 0)
+			die("protocol error: bad line length character: %.4s", linelen);
+		if (!len) {
+			// Found a flush packet - Done!
+			packet_trace("0000", 4, 0);
+			break;
+		}
+		len -= 4;
+
+		// Read packet content
+		strbuf_grow(sb_out, len);
+		ret = get_packet_data(fd_in, NULL, NULL, sb_out->buf + sb_out->len, len, options);
+		if (ret < 0)
+			goto done;
+
+		if (ret != len) {
+			error("protocol error: incomplete read (expected %d, got %d)", len, ret);
+			goto done;
+		}
+
+		packet_trace(sb_out->buf + sb_out->len, len, 0);
+		sb_out->len += len;
+	}
+
+done:
+	if (ret < 0) {
+		if (oldalloc == 0)
+			strbuf_release(sb_out);
+		else
+			strbuf_setlen(sb_out, oldlen);
+		return ret;  // unexpected EOF
+	}
+	return sb_out->len - oldlen;
+}
diff --git a/pkt-line.h b/pkt-line.h
index 2fbaee9..3c0821f 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -26,6 +26,8 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((f
 int packet_flush_gently(int fd);
 int direct_packet_write(int fd, char *buf, size_t size, int gentle);
 int direct_packet_write_data(int fd, const char *data, size_t size, int gentle);
+int packet_write_stream_with_flush_from_fd(const int fd_in, const int fd_out);
+int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
@@ -78,6 +80,11 @@ char *packet_read_line(int fd, int *size);
  */
 char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
 
+/*
+ * Reads a stream of variable sized packets until a flush packet is detected.
+ */
+ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out);
+
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
 #define PKTLINE_HEADER_LEN 4
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 06/12] pack-protocol: fix maximum pkt-line size
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (4 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 05/12] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 07/12] run-command: add clean_on_exit_handler larsxschneider
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

According to LARGE_PACKET_MAX in pkt-line.h the maximal length of a
pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
therefore the pkt-line data component must not exceed 65516 bytes.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/technical/protocol-common.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
index bf30167..ecedb34 100644
--- a/Documentation/technical/protocol-common.txt
+++ b/Documentation/technical/protocol-common.txt
@@ -67,9 +67,9 @@ with non-binary data the same whether or not they contain the trailing
 LF (stripping the LF if present, and not complaining when it is
 missing).
 
-The maximum length of a pkt-line's data component is 65520 bytes.
-Implementations MUST NOT send pkt-line whose length exceeds 65524
-(65520 bytes of payload + 4 bytes of length data).
+The maximum length of a pkt-line's data component is 65516 bytes.
+Implementations MUST NOT send pkt-line whose length exceeds 65520
+(65516 bytes of payload + 4 bytes of length data).
 
 Implementations SHOULD NOT send an empty pkt-line ("0004").
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (5 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 06/12] pack-protocol: fix maximum pkt-line size larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 21:24       ` Jeff King
  2016-08-03 16:42     ` [PATCH v4 08/12] convert: quote filter names in error messages larsxschneider
                       ` (4 subsequent siblings)
  11 siblings, 1 reply; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Some commands might need to perform cleanup tasks on exit. Let's give
them an interface for doing this.

Please note, that the cleanup callback is not executed if Git dies of a
signal. The reason is that only "async-signal-safe" functions would be
allowed to be call in that case. Since we cannot control what functions
the callback will use, we will not support the case. See 507d7804 for
more details.

Helped-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 run-command.c | 12 ++++++++----
 run-command.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/run-command.c b/run-command.c
index 33bc63a..6ca75f3 100644
--- a/run-command.c
+++ b/run-command.c
@@ -21,6 +21,7 @@ void child_process_clear(struct child_process *child)
 
 struct child_to_clean {
 	pid_t pid;
+	void (*clean_on_exit_handler)(pid_t);
 	struct child_to_clean *next;
 };
 static struct child_to_clean *children_to_clean;
@@ -30,6 +31,8 @@ static void cleanup_children(int sig, int in_signal)
 {
 	while (children_to_clean) {
 		struct child_to_clean *p = children_to_clean;
+		if (!in_signal && p->clean_on_exit_handler)
+			p->clean_on_exit_handler(p->pid);
 		children_to_clean = p->next;
 		kill(p->pid, sig);
 		if (!in_signal)
@@ -49,10 +52,11 @@ static void cleanup_children_on_exit(void)
 	cleanup_children(SIGTERM, 0);
 }
 
-static void mark_child_for_cleanup(pid_t pid)
+static void mark_child_for_cleanup(pid_t pid, void (*clean_on_exit_handler)(pid_t))
 {
 	struct child_to_clean *p = xmalloc(sizeof(*p));
 	p->pid = pid;
+	p->clean_on_exit_handler = clean_on_exit_handler;
 	p->next = children_to_clean;
 	children_to_clean = p;
 
@@ -422,7 +426,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0)
 		error_errno("cannot fork() for %s", cmd->argv[0]);
 	else if (cmd->clean_on_exit)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->clean_on_exit_handler);
 
 	/*
 	 * Wait for child's execvp. If the execvp succeeds (or if fork()
@@ -483,7 +487,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
 		error_errno("cannot spawn %s", cmd->argv[0]);
 	if (cmd->clean_on_exit && cmd->pid >= 0)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->clean_on_exit_handler);
 
 	argv_array_clear(&nargv);
 	cmd->argv = sargv;
@@ -752,7 +756,7 @@ int start_async(struct async *async)
 		exit(!!async->proc(proc_in, proc_out, async->data));
 	}
 
-	mark_child_for_cleanup(async->pid);
+	mark_child_for_cleanup(async->pid, NULL);
 
 	if (need_in)
 		close(fdin[0]);
diff --git a/run-command.h b/run-command.h
index 5066649..59d21ea 100644
--- a/run-command.h
+++ b/run-command.h
@@ -43,6 +43,7 @@ struct child_process {
 	unsigned stdout_to_stderr:1;
 	unsigned use_shell:1;
 	unsigned clean_on_exit:1;
+	void (*clean_on_exit_handler)(pid_t);
 };
 
 #define CHILD_PROCESS_INIT { NULL, ARGV_ARRAY_INIT, ARGV_ARRAY_INIT }
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 08/12] convert: quote filter names in error messages
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (6 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 07/12] run-command: add clean_on_exit_handler larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 09/12] convert: modernize tests larsxschneider
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to
read in error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b1614bf..522e2c5 100644
--- a/convert.c
+++ b/convert.c
@@ -397,7 +397,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -415,13 +415,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -462,15 +462,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 09/12] convert: modernize tests
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (7 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 08/12] convert: quote filter names in error messages larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 10/12] convert: generate large test files only once larsxschneider
                       ` (2 subsequent siblings)
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">" and "<".

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..7b45136 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -13,8 +13,8 @@ EOF
 chmod +x rot13.sh
 
 test_expect_success setup '
-	git config filter.rot13.smudge ./rot13.sh &&
-	git config filter.rot13.clean ./rot13.sh &&
+	test_config filter.rot13.smudge ./rot13.sh &&
+	test_config filter.rot13.clean ./rot13.sh &&
 
 	{
 	    echo "*.t filter=rot13"
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh <test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 10/12] convert: generate large test files only once
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (8 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 09/12] convert: modernize tests larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
  2016-08-03 16:42     ` [PATCH v4 12/12] convert: add filter.<driver>.process shutdown command option larsxschneider
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Generate more interesting large test files with pseudo random characters
in between and reuse these test files in multiple tests. Run tests formerly
marked as EXPENSIVE every time but with a smaller data set.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7b45136..34c8eb9 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -4,6 +4,15 @@ test_description='blob conversion via gitattributes'
 
 . ./test-lib.sh
 
+if test_have_prereq EXPENSIVE
+then
+	T0021_LARGE_FILE_SIZE=2048
+	T0021_LARGISH_FILE_SIZE=100
+else
+	T0021_LARGE_FILE_SIZE=30
+	T0021_LARGISH_FILE_SIZE=2
+fi
+
 cat <<EOF >rot13.sh
 #!$SHELL_PATH
 tr \
@@ -31,7 +40,26 @@ test_expect_success setup '
 	cat test >test.i &&
 	git add test test.t test.i &&
 	rm -f test test.t test.i &&
-	git checkout -- test test.t test.i
+	git checkout -- test test.t test.i &&
+
+	mkdir generated-test-data &&
+	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
+	do
+		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
+		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
+		# Generate 1MB of empty data and 100 bytes of random characters
+		# printf "$(test-genrandom start $i)"
+		printf "%1048576d" 1 >>generated-test-data/large.file &&
+		printf "$RANDOM_STRING" >>generated-test-data/large.file &&
+		printf "%1048576d" 1 >>generated-test-data/large.file.rot13 &&
+		printf "$ROT_RANDOM_STRING" >>generated-test-data/large.file.rot13 &&
+
+		if test $i = $T0021_LARGISH_FILE_SIZE
+		then
+			cat generated-test-data/large.file >generated-test-data/largish.file &&
+			cat generated-test-data/large.file.rot13 >generated-test-data/largish.file.rot13
+		fi
+	done
 '
 
 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -199,9 +227,9 @@ test_expect_success 'required filter clean failure' '
 test_expect_success 'filtering large input to small output should use little memory' '
 	test_config filter.devnull.clean "cat >/dev/null" &&
 	test_config filter.devnull.required true &&
-	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
-	echo "30MB filter=devnull" >.gitattributes &&
-	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
+	cp generated-test-data/large.file large.file &&
+	echo "large.file filter=devnull" >.gitattributes &&
+	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add large.file
 '
 
 test_expect_success 'filter that does not read is fine' '
@@ -214,15 +242,15 @@ test_expect_success 'filter that does not read is fine' '
 	test_cmp expect actual
 '
 
-test_expect_success EXPENSIVE 'filter large file' '
+test_expect_success 'filter large file' '
 	test_config filter.largefile.smudge cat &&
 	test_config filter.largefile.clean cat &&
-	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
-	echo "2GB filter=largefile" >.gitattributes &&
-	git add 2GB 2>err &&
+	echo "large.file filter=largefile" >.gitattributes &&
+	cp generated-test-data/large.file large.file &&
+	git add large.file 2>err &&
 	test_must_be_empty err &&
-	rm -f 2GB &&
-	git checkout -- 2GB 2>err &&
+	rm -f large.file &&
+	git checkout -- large.file 2>err &&
 	test_must_be_empty err
 '
 
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (9 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 10/12] convert: generate large test files only once larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  2016-08-03 17:45       ` Junio C Hamano
                         ` (2 more replies)
  2016-08-03 16:42     ` [PATCH v4 12/12] convert: add filter.<driver>.process shutdown command option larsxschneider
  11 siblings, 3 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge filter
written in golang to filter 12,000 files. This process took 364s with the
existing filter mechanism and 5s with the new mechanism. See details here:
https://github.com/github/git-lfs/pull/1382

This patch adds the `filter.<driver>.process` string option which, if used,
keeps the external filter process running and processes all blobs with
the packet format (pkt-line) based protocol over standard input and standard
output described below.

Git starts the filter when it encounters the first file
that needs to be cleaned or smudged. After the filter started
Git expects a welcome message, protocol version number, and
filter capabilities separated by spaces:
------------------------
packet:          git< git-filter-protocol\n
packet:          git< version=2\n
packet:          git< capabilities=clean smudge\n
------------------------
Supported filter capabilities are "clean" and "smudge".

Afterwards Git sends a command (based on the supported
capabilities), the pathname of a file relative to the
repository root, the content split in zero or more pkt-line
packets, and a flush packet at the end:
------------------------
packet:          git> command=smudge\n
packet:          git> pathname=path/testfile.dat\n
packet:          git> CONTENT
packet:          git> 0000
------------------------

The filter is expected to respond with the result content in zero
or more pkt-line packets and a flush packet at the end. Finally, a
"result=success" packet is expected if everything went well.
------------------------
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< result=success\n
------------------------

If the result content is empty then the filter is expected to respond
only with a flush packet and a "result=success" packet.
------------------------
packet:          git< 0000
packet:          git< result=success\n
------------------------

In case the filter cannot or does not want to process the content,
it is expected to respond with a flush packet and a "result=reject"
packet. Depending on the `filter.<driver>.required` flag Git will
interpret that as error but it will not stop or restart the filter
process.
------------------------
packet:          git< 0000
packet:          git< result=reject\n
------------------------

If the filter experiences an error during processing, then it can
either die or send a flush packet and a "result=error" packet. If
Git receives such an error then it will stop and restart the filter
with the next file that needs to be processed.
------------------------
packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
packet:          git< 0000
packet:          git< result=error\n
------------------------

After the filter has processed a blob it is expected to wait for
the next command. When the Git process terminates, it will send
a kill signal to the filter in that stage.

If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
is configured then these commands always take precedence over
a configured `filter.<driver>.process` command.

Helped-by: Martin-Louis Bright <mlbright@gmail.com>
Reviewed-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt |  98 +++++++++++-
 convert.c                       | 277 ++++++++++++++++++++++++++++++----
 t/t0021-conversion.sh           | 322 ++++++++++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 148 ++++++++++++++++++
 4 files changed, 815 insertions(+), 30 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..49514ab 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,13 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +381,96 @@ substitution.  For example:
 ------------------------
 
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (a string value) is defined via
+`filter.<driver>.process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using the following packet format
+(pkt-line, see technical/protocol-common.txt) based protocol over
+standard input and standard output.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git expects a welcome message, protocol version number, and
+filter capabilities separated by spaces:
+------------------------
+packet:          git< git-filter-protocol\n
+packet:          git< version=2\n
+packet:          git< capabilities=clean smudge\n
+------------------------
+Supported filter capabilities are "clean" and "smudge".
+
+Afterwards Git sends a command (based on the supported
+capabilities), the pathname of a file relative to the
+repository root, the content split in zero or more pkt-line
+packets, and a flush packet at the end:
+------------------------
+packet:          git> command=smudge\n
+packet:          git> pathname=path/testfile.dat\n
+packet:          git> CONTENT
+packet:          git> 0000
+------------------------
+
+The filter is expected to respond with the result content in zero
+or more pkt-line packets and a flush packet at the end. Finally, a
+"result=success" packet is expected if everything went well.
+------------------------
+packet:          git< SMUDGED_CONTENT
+packet:          git< 0000
+packet:          git< result=success\n
+------------------------
+
+If the result content is empty then the filter is expected to respond
+only with a flush packet and a "result=success" packet.
+------------------------
+packet:          git< 0000
+packet:          git< result=success\n
+------------------------
+
+In case the filter cannot or does not want to process the content,
+it is expected to respond with a flush packet and a "result=reject"
+packet. Depending on the `filter.<driver>.required` flag Git will
+interpret that as error but it will not stop or restart the filter
+process.
+------------------------
+packet:          git< 0000
+packet:          git< result=reject\n
+------------------------
+
+If the filter experiences an error during processing, then it can
+either die or send a flush packet and a "result=error" packet. If
+Git receives such an error then it will stop and restart the filter
+with the next file that needs to be processed.
+------------------------
+packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
+packet:          git< 0000
+packet:          git< result=error\n
+------------------------
+
+After the filter has processed a blob it is expected to wait for
+the next command. When the Git process terminates, it will send
+a kill signal to the filter in that stage.
+
+A long running filter demo implementation can be found in
+`t/t0021/rot13-filter.pl` located in the Git core repository.
+If you develop your own long running filter process then the
+`GIT_TRACE_PACKET` environment variables can be very helpful
+for debugging (see linkgit:git[1]).
+
+If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
+is configured then these commands always take precedence over
+a configured `filter.<driver>.process` command.
+
+Please note that you cannot use an existing `filter.<driver>.clean`
+or `filter.<driver>.smudge` command with `filter.<driver>.process`
+because the former two use a different inter process communication
+protocol than the latter one. As soon as Git would detect a file
+that needs to be processed by such an invalid "process" filter,
+it would wait for a proper protocol handshake and appear "hanging".
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/convert.c b/convert.c
index 522e2c5..130430a 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "pkt-line.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -427,7 +428,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len, int fd,
+static int apply_single_file_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
 	/*
@@ -441,12 +442,6 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	struct async async;
 	struct filter_params params;
 
-	if (!cmd || !*cmd)
-		return 0;
-
-	if (!dst)
-		return 1;
-
 	memset(&async, 0, sizeof(async));
 	async.proc = filter_buffer_or_fd;
 	async.data = &params;
@@ -481,14 +476,245 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return ret;
 }
 
+#define FILTER_CAPABILITIES_CLEAN    (1u<<0)
+#define FILTER_CAPABILITIES_SMUDGE   (1u<<1)
+#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
+#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	int supported_capabilities;
+	const char *cmd;
+	struct child_process process;
+};
+
+static int cmd_process_map_initialized = 0;
+static struct hashmap cmd_process_map;
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+                           const struct cmd2process *e2,
+                           const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
+{
+	struct cmd2process key;
+	hashmap_entry_init(&key, strhash(cmd));
+	key.cmd = cmd;
+	return hashmap_get(hashmap, &key, NULL);
+}
+
+static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
+{
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	child_process_clear(&entry->process);
+	hashmap_remove(hashmap, entry, NULL);
+	free(entry);
+}
+
+static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, const char *cmd)
+{
+	int did_fail;
+	struct cmd2process *entry;
+	struct child_process *process;
+	const char *argv[] = { cmd, NULL };
+	static const char cap_key[] = "capabilities=";
+	int cap_key_len = strlen(cap_key);
+	struct string_list cap_list = STRING_LIST_INIT_NODUP;
+	char *cap_buf;
+	int i;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	entry->supported_capabilities = 0;
+	process = &entry->process;
+
+	child_process_init(process);
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+
+	if (start_command(process)) {
+		error("cannot fork to run external filter '%s'", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	did_fail = strcmp(packet_read_line(process->out, NULL), "git-filter-protocol");
+	if (did_fail)
+		goto done;
+
+	did_fail = strcmp(packet_read_line(process->out, NULL), "version=2");
+	if (did_fail)
+		goto done;
+
+	cap_buf = packet_read_line(process->out, NULL);
+	if (!cap_buf ||
+		strlen(cap_buf) <= cap_key_len ||
+		strncmp(cap_buf, cap_key, cap_key_len)) {
+		error("filter capabilities not found");
+		did_fail = 1;
+		goto done;
+	}
+
+	string_list_split_in_place(&cap_list, &cap_buf[cap_key_len], ' ', -1);
+	if (cap_list.nr > 0) {
+		for (i = 0; i < cap_list.nr; i++) {
+			const char *requested = cap_list.items[i].string;
+			if (!strcmp(requested, "clean")) {
+				entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN;
+			} else if (!strcmp(requested, "smudge")) {
+				entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE;
+			} else {
+				warning(
+					"external filter '%s' requested unsupported filter capability '%s'",
+					cmd, requested
+				);
+			}
+		}
+	}
+	string_list_clear(&cap_list, 0);
+
+done:
+	if (did_fail) {
+		error("initialization for external filter '%s' failed", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	hashmap_add(hashmap, entry);
+	return entry;
+}
+
+static int apply_multi_file_filter(const char *path, const char *src, size_t len,
+                                   int fd, struct strbuf *dst, const char *cmd,
+                                   const int wanted_capability)
+{
+	int ret = 1;
+	struct cmd2process *entry;
+	struct child_process *process;
+	struct stat file_stat;
+	struct strbuf nbuf = STRBUF_INIT;
+	char *filter_type;
+	char *filter_result = NULL;
+
+	if (!cmd_process_map_initialized) {
+		cmd_process_map_initialized = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = find_multi_file_filter_entry(&cmd_process_map, cmd);
+	}
+
+	fflush(NULL);
+
+	if (!entry) {
+		entry = start_multi_file_filter(&cmd_process_map, cmd);
+		if (!entry)
+			return 0;
+	}
+	process = &entry->process;
+
+	if (!(wanted_capability & entry->supported_capabilities))
+		return 1;  // it is OK if the wanted capability is not supported
+
+	if (FILTER_SUPPORTS_CLEAN(wanted_capability))
+		filter_type = "clean";
+	else if (FILTER_SUPPORTS_SMUDGE(wanted_capability))
+		filter_type = "smudge";
+	else
+		die("unexpected filter type");
+
+	if (fd >= 0 && !src) {
+		if (fstat(fd, &file_stat) == -1)
+			return 0;
+		len = file_stat.st_size;
+	}
+
+	packet_buf_write(&nbuf, "command=%s\n", filter_type);
+	ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
+	if (!ret)
+		goto done;
+
+	strbuf_reset(&nbuf);
+	packet_buf_write(&nbuf, "pathname=%s\n", path);
+	ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
+	if (!ret)
+		goto done;
+
+	if (fd >= 0)
+		ret = !packet_write_stream_with_flush_from_fd(fd, process->in);
+	else
+		ret = !packet_write_stream_with_flush_from_buf(src, len, process->in);
+	if (!ret)
+		goto done;
+
+	strbuf_reset(&nbuf);
+	ret = packet_read_till_flush(process->out, &nbuf) >= 0;
+	if (!ret)
+		goto done;
+
+	filter_result = packet_read_line(process->out, NULL);
+	ret = !strcmp(filter_result, "result=success");
+
+done:
+	if (ret) {
+		strbuf_swap(dst, &nbuf);
+	} else {
+		if (!filter_result || strcmp(filter_result, "result=reject")) {
+			// Something went wrong with the protocol filter. Force shutdown!
+			error("external filter '%s' failed", cmd);
+			kill_multi_file_filter(&cmd_process_map, entry);
+		}
+	}
+	strbuf_release(&nbuf);
+	return ret;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
+	const char *process;
 	int required;
 } *user_convert, **user_convert_tail;
 
+static int apply_filter(const char *path, const char *src, size_t len,
+                        int fd, struct strbuf *dst, struct convert_driver *drv,
+                        const int wanted_capability)
+{
+	const char* cmd = NULL;
+
+	if (!drv)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (FILTER_SUPPORTS_CLEAN(wanted_capability) && drv->clean)
+		cmd = drv->clean;
+	else if (FILTER_SUPPORTS_SMUDGE(wanted_capability) && drv->smudge)
+		cmd = drv->smudge;
+
+	if (cmd && *cmd)
+		return apply_single_file_filter(path, src, len, fd, dst, cmd);
+	else if (drv->process && *drv->process)
+		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
+
+	return 0;
+}
+
 static int read_convert_config(const char *var, const char *value, void *cb)
 {
 	const char *key, *name;
@@ -526,6 +752,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("process", key)) {
+		return git_config_string(&drv->process, var, value);
+	}
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -823,7 +1053,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, FILTER_CAPABILITIES_CLEAN);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -856,18 +1086,12 @@ int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
 	int ret = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->clean;
-		required = ca.drv->required;
-	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
-	if (!ret && required)
+	ret |= apply_filter(path, src, len, -1, dst, ca.drv, FILTER_CAPABILITIES_CLEAN);
+	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	if (ret && dst) {
@@ -889,9 +1113,9 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
-	assert(ca.drv->clean);
+	assert(ca.drv->clean || ca.drv->process);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, FILTER_CAPABILITIES_CLEAN))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -903,15 +1127,9 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 					    int normalizing)
 {
 	int ret = 0, ret_filter = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->smudge;
-		required = ca.drv->required;
-	}
 
 	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
 	if (ret) {
@@ -920,9 +1138,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 	}
 	/*
 	 * CRLF conversion can be skipped if normalizing, unless there
-	 * is a smudge filter.  The filter might expect CRLFs.
+	 * is a smudge or process filter (even if the process filter doesn't
+	 * support smudge).  The filters might expect CRLFs.
 	 */
-	if (filter || !normalizing) {
+	if ((ca.drv && (ca.drv->smudge || ca.drv->process)) || !normalizing) {
 		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 		if (ret) {
 			src = dst->buf;
@@ -930,8 +1149,8 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
-	if (!ret_filter && required)
+	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, FILTER_CAPABILITIES_SMUDGE);
+	if (!ret_filter && ca.drv && ca.drv->required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
 	return ret | ret_filter;
@@ -1383,7 +1602,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	struct stream_filter *filter = NULL;
 
 	convert_attrs(&ca, path);
-	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
+	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 34c8eb9..c1a22f4 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -42,6 +42,9 @@ test_expect_success setup '
 	rm -f test test.t test.i &&
 	git checkout -- test test.t test.i &&
 
+	echo "content-test2" >test2.o &&
+	echo "content-test3-subdir" >test3-subdir.o &&
+
 	mkdir generated-test-data &&
 	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
 	do
@@ -296,4 +299,323 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+check_filter () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_count_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+		sed "s/^\([0-9]\) IN: clean/x IN: clean/" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_ignore_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" &&
+	cat >expected.log &&
+	grep -v "IN: clean" rot13-filter.log >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_no_call () {
+	rm -f rot13-filter.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	test_must_be_empty rot13-filter.log
+}
+
+check_rot13 () {
+	test_cmp $1 $2 &&
+	./../rot13.sh <$1 >expected &&
+	git cat-file blob :$2 >actual &&
+	test_cmp expected actual
+}
+
+test_expect_success PERL 'required process filter should filter data' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		cat ../test2.o >test2.r &&
+		mkdir testsubdir &&
+		cat ../test3-subdir.o >testsubdir/test3-subdir.r &&
+		>test4-empty.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					1 IN: clean test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					1 IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF &&
+					x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					x IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					x IN: clean test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					x IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		rm -f test?.r testsubdir/test3-subdir.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout empty \
+				<<-\EOF &&
+					start
+					wrote filter header
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout master \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: smudge test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+		check_rot13 ../test3-subdir.o testsubdir/test3-subdir.r
+	)
+'
+
+test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '
+	test_config_global filter.protocol.clean ./../rot13.sh &&
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		cat ../test2.o >test2.r &&
+
+		check_filter_no_call \
+			git add . &&
+
+		check_filter_no_call \
+			git commit . -m "test commit" &&
+
+		rm -f test?.r testsubdir/test3-subdir.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		git checkout empty &&
+
+		check_filter_ignore_clean \
+			git checkout master\
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r
+	)
+'
+
+test_expect_success PERL 'required process filter should clean only' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF
+					x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+	)
+'
+
+test_expect_success PERL 'required process filter should process files larger LARGE_PACKET_MAX' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.file filter=protocol" >.gitattributes &&
+		cat ../generated-test-data/largish.file.rot13 >large.rot13 &&
+		cat ../generated-test-data/largish.file >large.file &&
+		cat large.file >large.original &&
+
+		git add large.file .gitattributes &&
+		git commit . -m "test commit" &&
+
+		rm -f large.file &&
+		git checkout -- large.file &&
+		git cat-file blob :large.file >actual &&
+		test_cmp large.rot13 actual
+	)
+'
+
+test_expect_success PERL 'required process filter should with clean error should fail' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "content-test3-subdir" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success PERL 'process filter should restart after unexpected write failure' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		cat ../test2.o >test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [WRITE FAIL]
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
+test_expect_success PERL 'process filter should not restart after intentionally rejected file' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		cat ../test2.o >test2.r &&
+		echo "this is going to be rejected" >reject.o &&
+		cat reject.o >reject.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge reject.r 29 [OK] -- OUT: 0 [REJECT]
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
new file mode 100755
index 0000000..ca6d5e4
--- /dev/null
+++ b/t/t0021/rot13-filter.pl
@@ -0,0 +1,148 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+# The script takes the list of supported protocol capabilities as
+# arguments ("clean", "smudge", etc).
+#
+# This implementation supports three special test cases:
+# (1) If data with the pathname "clean-write-fail.r" is processed with
+#     a "clean" operation then the write operation will die.
+# (2) If data with the pathname "smudge-write-fail.r" is processed with
+#     a "smudge" operation then the write operation will die.
+# (3) If data with the pathname "reject.r" is processed with any
+#     operation then the filter signals that it does not want to process
+#     the file.
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+my @capabilities            = @ARGV;
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+sub packet_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+        return;
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet ($content_size expected; $bytes_read read)";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+open my $debug, ">>", "rot13-filter.log";
+print $debug "start\n";
+$debug->flush();
+
+packet_write("git-filter-protocol\n");
+packet_write("version=2\n");
+packet_write( "capabilities=" . join( ' ', @capabilities ) . "\n" );
+print $debug "wrote filter header\n";
+$debug->flush();
+
+while (1) {
+    my ($command) = packet_read() =~ /^command=([^=]+)\n$/;
+    unless ( defined($command) ) {
+        exit();
+    }
+    print $debug "IN: $command";
+    $debug->flush();
+
+    my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
+    print $debug " $pathname";
+    $debug->flush();
+
+    my $input = "";
+    {
+        binmode(STDIN);
+        my $buffer;
+        my $done = 0;
+        while ( !$done ) {
+            ( $done, $buffer ) = packet_read();
+            $input .= $buffer;
+        }
+        print $debug " " . length($input) . " [OK] -- ";
+        $debug->flush();
+    }
+
+    my $output;
+    if ( $pathname eq "reject.r" ) {
+        $output = "";
+    }
+    elsif ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    else {
+        die "bad command $command";
+    }
+
+    print $debug "OUT: " . length($output) . " ";
+    $debug->flush();
+
+    if ( $pathname eq "${command}-write-fail.r" ) {
+        print $debug "[WRITE FAIL]\n";
+        $debug->flush();
+        die "write error";
+    }
+    elsif ( $pathname eq "reject.r" ) {
+        packet_flush();
+        print $debug "[REJECT]\n";
+        $debug->flush();
+        packet_write("result=reject\n");
+    }
+    else {
+        while ( length($output) > 0 ) {
+            my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+            packet_write($packet);
+            if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+                $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+            }
+            else {
+                $output = "";
+            }
+        }
+        packet_flush();
+        print $debug "[OK]\n";
+        $debug->flush();
+        packet_write("result=success\n");
+    }
+}
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v4 12/12] convert: add filter.<driver>.process shutdown command option
  2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
                       ` (10 preceding siblings ...)
  2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
@ 2016-08-03 16:42     ` larsxschneider
  11 siblings, 0 replies; 120+ messages in thread
From: larsxschneider @ 2016-08-03 16:42 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, tboegi, mlbright, e, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the "shutdown" capability to the `filter.<driver>.process` filter
protocol. If a filter supports this capability then Git will send the
"shutdown" command and wait until the filter answers. This gives the
filter the opportunity to perform cleanup tasks. Afterwards the filter
is expected to exit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt | 12 ++++++-
 convert.c                       | 35 ++++++++++++++++++++
 t/t0021-conversion.sh           | 71 +++++++++++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         |  7 ++++
 4 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 49514ab..5556cc0 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -400,7 +400,8 @@ packet:          git< git-filter-protocol\n
 packet:          git< version=2\n
 packet:          git< capabilities=clean smudge\n
 ------------------------
-Supported filter capabilities are "clean" and "smudge".
+Supported filter capabilities are "clean", "smudge", and
+"shutdown".
 
 Afterwards Git sends a command (based on the supported
 capabilities), the pathname of a file relative to the
@@ -453,6 +454,15 @@ After the filter has processed a blob it is expected to wait for
 the next command. When the Git process terminates, it will send
 a kill signal to the filter in that stage.
 
+If the filter supports the "shutdown" capability then Git will
+send the "shutdown" command and wait until the filter answers
+with "done". This gives the filter the opportunity to perform
+cleanup tasks. Afterwards the filter is expected to exit.
+------------------------
+packet:          git> command=shutdown\n
+packet:          git< result=success\n
+------------------------
+
 A long running filter demo implementation can be found in
 `t/t0021/rot13-filter.pl` located in the Git core repository.
 If you develop your own long running filter process then the
diff --git a/convert.c b/convert.c
index 130430a..41e3229 100644
--- a/convert.c
+++ b/convert.c
@@ -478,8 +478,10 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
 
 #define FILTER_CAPABILITIES_CLEAN    (1u<<0)
 #define FILTER_CAPABILITIES_SMUDGE   (1u<<1)
+#define FILTER_CAPABILITIES_SHUTDOWN (1u<<2)
 #define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
 #define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
+#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) & FILTER_CAPABILITIES_SHUTDOWN)
 
 struct cmd2process {
 	struct hashmap_entry ent; /* must be the first member! */
@@ -520,6 +522,35 @@ static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *
 	free(entry);
 }
 
+void shutdown_multi_file_filter(pid_t pid)
+{
+	int did_fail;
+	struct cmd2process *entry;
+	struct hashmap_iter iter;
+	static const char shutdown[] = "command=shutdown\n";
+	char *result = NULL;
+
+	if (!cmd_process_map_initialized)
+		return;
+
+	hashmap_iter_init(&cmd_process_map, &iter);
+	while ((entry = hashmap_iter_next(&iter))) {
+		if (entry->process.pid == pid &&
+			FILTER_SUPPORTS_SHUTDOWN(entry->supported_capabilities)
+		) {
+			did_fail = direct_packet_write_data(
+				entry->process.in, shutdown, strlen(shutdown), 1);
+			if (!did_fail)
+				result = packet_read_line(entry->process.out, NULL);
+			close(entry->process.in);
+			close(entry->process.out);
+
+			if (did_fail || !result || strcmp(result, "result=success"))
+				error("shutdown of external filter '%s' failed", entry->cmd);
+		}
+	}
+}
+
 static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, const char *cmd)
 {
 	int did_fail;
@@ -543,6 +574,8 @@ static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, cons
 	process->use_shell = 1;
 	process->in = -1;
 	process->out = -1;
+	process->clean_on_exit = 1;
+	process->clean_on_exit_handler = shutdown_multi_file_filter;
 
 	if (start_command(process)) {
 		error("cannot fork to run external filter '%s'", cmd);
@@ -575,6 +608,8 @@ static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, cons
 				entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN;
 			} else if (!strcmp(requested, "smudge")) {
 				entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE;
+			} else if (!strcmp(requested, "shutdown")) {
+				entry->supported_capabilities |= FILTER_CAPABILITIES_SHUTDOWN;
 			} else {
 				warning(
 					"external filter '%s' requested unsupported filter capability '%s'",
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index c1a22f4..613e370 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -417,6 +417,77 @@ test_expect_success PERL 'required process filter should filter data' '
 	)
 '
 
+test_expect_success PERL 'required process filter should filter data with shutdown' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge shutdown" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cat ../test.o >test.r &&
+		cat ../test2.o >test2.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					1 IN: shutdown -- [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF &&
+					x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					x IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					1 IN: shutdown -- [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		rm -f test?.r testsubdir/test3-subdir.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: shutdown -- [OK]
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout empty \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: shutdown -- [OK]
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout master \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: shutdown -- [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r
+	)
+'
+
 test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '
 	test_config_global filter.protocol.clean ./../rot13.sh &&
 	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
index ca6d5e4..654741b 100755
--- a/t/t0021/rot13-filter.pl
+++ b/t/t0021/rot13-filter.pl
@@ -84,6 +84,13 @@ while (1) {
     print $debug "IN: $command";
     $debug->flush();
 
+    if ( $command eq "shutdown" ) {
+        print $debug " -- [OK]";
+        $debug->flush();
+        packet_write("result=success\n");
+        exit();
+    }
+
     my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
     print $debug " $pathname";
     $debug->flush();
-- 
2.9.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
@ 2016-08-03 17:45       ` Junio C Hamano
  2016-08-03 21:48         ` Lars Schneider
  2016-08-03 20:29       ` Junio C Hamano
  2016-08-05 21:34       ` Torsten Bögershausen
  2 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-03 17:45 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, jnareb, tboegi, mlbright, e, peff

larsxschneider@gmail.com writes:

> packet:          git< git-filter-protocol\n
> packet:          git< version=2\n
> packet:          git< capabilities=clean smudge\n

During the discussion on the future of pack-protocol, it was pointed
out that having to shove all capabilities on a single line/packet
was one of the things we would want to fix in the current protocol
when we revamp to v2.  As this exhange between the convert machinery
and an external process is a brand new one, I do not think you want
to mimic the limitation in the current pack protocol like this; the
limitation mostly came from the constraint that we cannot break
existing pack protocol clients and servers before we extended the
protocol to add capabilities.

You may not foresee that the caps won't grow very long beyond
clean/smudge right now, just like we did not foresee that we would
wish to be able to convey a lot longer capability values to the
other side when we added the capability exchange to the pack
protocol, so "but but but we will never have that many" is not a
good counter-argument.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option)
  2016-08-01 13:32       ` Lars Schneider
@ 2016-08-03 18:30         ` Jakub Narębski
  2016-08-05 10:32           ` Lars Schneider
  2016-08-06 18:24           ` Lars Schneider
  2016-08-03 22:47         ` [PATCH v3 10/10] convert: add filter.<driver>.process option Jakub Narębski
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-08-03 18:30 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King

[I'm sorry for taking so long in writing this, as I see there is v4 already]

Greetings,


I'll answer to individual emails in more detail later, but I'd like to
go back to the drawing board, and attempt to summarize the discussion and
the proposal so far.

The ultimate goal is to be able to run filter drivers faster for both `clean`
and `smudge` operations.  This is done by starting filter driver once per
git command invocation, instead of once per file being processed.  Git needs
to pass actual contents of files to filter driver, and get its output.

We want the protocol between Git and filter driver process to be extensible,
so that new features can be added without modifying protocol.


1. CONFIGURATION

As I wrote, there are different ways of configuring new-type filter driver:

 * Using a separate variable to mark filter as using new protocol
   (the original approach):

   	[filter "protocol"]
		protocolVersion = v2
		clean  = rot13-clean-filter.pl
		smudge = rot13-smudge-filter.pl

   PROS: allows to have separate clean and smudge filters
   CONS: does not allow using old-style per-file filter together with new;
         easy to make mistake and use old-style filter, leading to hang

 * Creating new variables for new filter type, separate for each phase,
   for example `cleanProcess` and `smudgeProcess` (or `processClean` and
   `processSmudge`).

   	[filter "protocol"]
   		cleanProcess  = rot13-clean-filter.pl
   		smudgeProcess = rot13-smudge-filter.pl

   PROS: allows to have separate clean and smudge filters;
         makes possible to use per-file and per-command filters together
   CONS: proliferation of additional variables, (esp. when extending it);
   NOTE: need to decide precedence between `clean` and `cleanProcess`, etc.

 # Using a single variable for new filter type, and decide on which phase
   (which operation) is supported by filter driver during the handshake
   *(current approach)*

   	[filter "protocol"]
   		process = rot13-filtes.pl

   PROS: per-file and per-command filters possible with precedence rule;
         extensible to other types of drivers: textconv, diff, etc.
         only one invocation for commands which use both clean and smudge
   CONS: need single driver to be responsible for both clean and smudge;
         need to run driver to know that it does not support given
           operation (workaround exists)


2. HANDSHAKE (INITIALIZATION)

Next, there is deciding on and designing the handshake between Git (between
Git command) and the filter driver process.  With the `filter.<driver>.process`
solution the driver needs to tell which operations among (for now) "clean"
and "smudge" it does support.  Plus it provides a way to extend protocol,
adding new features, like support for streaming, cleaning from file or
smudging to file, providing size upfront, perhaps even progress report.

Current handshake consist of filter driver printing a signature, version
number and capabilities, in that order.  Git checks that it is well formed
and matches expectations, and notes which of "clean" and "smudge" operations
are supported by the filter.

There is no interaction from the Git side in the handshake, for example to
set options and expectations common to all files being filtered.  Take
one possible extension of protocol: supporting streaming.  The filter
driver needs to know whether it needs to read all the input, or whether
it can start printing output while input is incoming (e.g. to reduce
memory consumption)... though we may simply decide it to be next version
of the protocol.

On the other hand if the handshake began with Git sending some initializer
info to the filter driver, we probably could detect one-shot filter
misconfigured as process-filter.

Note that we need some way of deciding where handshake ends, either by
specifying number of entries (currently: three lines / pkt-line packets),
or providing some terminator ("smart" transport protocol uses flush packet
for this).

Current handshake (in symbolic form):

    git< [signature]    git-filter-protocol
    git< [version]      version 2
    git< [capabilites]  clean smudge

It is expected that the handshake is limited to this information, and
that they are in this order; so naming them doesn't buy us much

    git< [capabilites]  capabilities clean smudge

or

    git< [capabilites]  capabilities=clean smudge

or

    git< [capabilites]  capabilities: clean smudge

If capabilities are to be third item, adding "capabilities", as if Git would
look at the name and select what to do based on this name, doesn't buy us
anything.  Well, beside self-documenting of the protocol.  The "smart" protocol
do not use "capabilities" as prefix/name either.

We would probably do not want to move from strict-order of information, that
is "positional parameters".  It would require to implement a parser, both for
the Git side and for the filter driver process side.

On the other hand requiring flush packet to end the handshake doesn't bring
much overhead (it is 4 bytes, it is not over the network), and improves
extendability.  Well, so does using names, be it "<var> <value>", 
"<var>=<value>", "<var>: <value>...", "<var>=[<value>, <value>...]", etc.


Let's take a look how other parts of Git communicate with external process
(a "helper").

The git-credential(1) protocol uses <variable>=<value> syntax.  But capabilities
form a list; "<var>=<val1> <val2>" doesn't look that well.  Credential helper
only uses scalar (single) values.

The gitremote-helpers(1) protocol is command / response; for example helper
responds to "capabilities" command with the list of capabilities.  Here commands
and parameters are space separated, e.g. "option <name> <value>".

The "smart" transport protocol (send-pack and receive-pack) had to (ab)use
a quirk of implementation to extend protocol with capabilities negotiation.
Here the capabilities list is sent without any prefix; some capabilities
are parametrized, and use <capability>=<value> syntax (for example
"symref=HEAD:refs/heads/master").  The handshake is closed with flush
packet, but as it consist of variable-length ref advertisement, it needs
to have explicit terminator of the each part of the "handshake".


3. SENDING CONTENTS (FILE TO BE FILTERED AND FILTER OUTPUT)

Next thing to design is decision how to send contents to be filtered
to the filter driver process, and how to get filtered output from the
filter driver process.

One thing I think we can agree on early, is sending data to filter
process on its standard input, and receiving filtered result from its
standard output.

Because Git is sending (and receiving) multiple files, it needs some
way to distinguish where one file ends and the next begins, in both
directions, to and from filter.  Also, the `clean` and `smudge`
filters support expansion of the '%f' placeholder, so at least 
some filter drivers need name of the file being filtered.  So the
protocol must send it somehow to the filter driver.

There are different approaches possible; here are ones that were used,
and ones I thought about.

 * Send whole data to filter at once, and receiver all data at once,
   for example using something akin to the 'tar' archive, or 
   uncompressed 'zip' archive (both are implemented in Git for the
   `git archive` command).  Or just list of sizes and pathnames,
   empty entry as terminator, and then contents of all files
   concatenated.

   PROS:
   - can use the one-shot infrastructure implemented already
   CONS:
   - complicates Git code and filter driver code unnecessarily
   - difficult to implement error handling, esp. soft errors
     on filter driver side (error for single file, perhaps during
     output)
   - in synchronous version (non-streaming) requires absurd amout
     of memory / storage for the filter driver process

 * Send/receive data file by file, using <size> + <content>,
   that is, send size (plus other data like the filename), then
   file contents.

   This was the protocol used in the first iteration of series.

   PROS:
   - simple to implement on Git and on filter driver side
   NOTE:
   - you need to loop over read / user read_in_full anyway
   CONS:
   - no way to signal an error encountered during output, e.g. LFS
     network/server failure for after some contents were actually
     sent
   - impossible to implement streaming for filters that do not
     know size of output without examining full input

 # Send/receive data file by file, using some kind of chunking,
   with a end-of-file marker.  The solution used by Git is
   pkt-line, with flush packet used to signal end of file.

   This is protocol used by the current implementation.

   PROS:
   - no need to know size upfront, so easier streaming support
   - you can signal error that happened during output, after
     some data were sent, as well as error known upfront
   - tracing support for free (GIT_TRACE_PACKET)
   CONS:
   - filter driver program slightly more difficult to implement
   - some negligible amount of overhead

If we want in the end to implement streaming, then the last solution
is the way to go.


4. PER-FILE HANDSHAKE - SENDING FILE TO FILTER

Let's assume that for simplicity we want to implement (for now) only
the synchronous (non-streaming) case, where we send whole contents
of a file to filter driver process, and *then* read filter driver
output.  This is enough for git-LFS solutions, which were the reason
for this patch series.  But we want to keep the protocol flexible
enough so that streaming and other features could be added easily.

First, if we choose the solution where one process is responsible
for both "clean" and "smudge" operations (and in the future possibly
also "cleanFromFile" and "smudgeToFile"), Git needs to tell the
driver which operation to perform.

Together with operation Git can send additional information
(sub-capabilities)... or we can use a separate line / packet to
send it.

If we are using pkt-line, then the convention is that text lines
are terminated using LF ("\n") character.  This needs to be stated
explicitly in the documentation for filter.<driver>.process writers.

    git> packet:  [operation] clean size=67\n

We could denote that it is operation name, but it is obvious from
position in the stream, thus not really needed.

Then we need to provide the filename; some filters supposedly need
this ('%f' in per-file `clean` / `smudge`).  Note that filename can
contain internal space characters, and could contain newlines, equal
signs; anything that is not NUL ("\0") character.

    git> packet:  [pathname] subdir/sample-file.r\n

In most cases filename would be text, so perhaps we should use "\n"
terminator (which filter driver would have to strip).  We could use
"filename=" prefix, but it is not necessary.  We know where / when
to expect the pathname (relative to project root).

If we would want to be able to add variable number of packets to
the handshake, then Git should send flush packet to signal the
end of the handshake.  But IMVHO it is unnecessary complication
of the protocol; there is enough flexibility in it.  We know
that handshake consists of two packets.

The Git would sent contents of the file to be filtered, using
as many pack lines as needed (note: large file support needs
to be tested, at least as expensive test).  Flush packet is
used to signal the end of the file.

    git> packets:  <file contents>
    git> flush packet


5. FILTER DRIVER PROCESS RESPONSE

First filter should, in my opinion, reply that it received the
request (or the command, in the case of streaming supported).
Also, in this response it can provide further information to
Git process.

    git< packet: [received]  ok size=67\n

This response could be used to refuse to filter specific file
upfront (for example if the file is not present in the artifactory
for git-LFS solutions).

   git< packet: [rejected]  reject\n

We can even provide the reasoning to Git (maybe in the future
extension)... or filter driver can print the explanation to the
standard error (but then, no --quiet / --verbose support).

   git< packet: [rejected]  reject with-message\n
   git< packet: [message]   File not found on server\n
   git< flush packet

Another response, which I think should be standarized, or at
least described in the documentation, is filter driver refusing
to filter further (e.g. git-LFS and network is down), to be not
restarted by Git.

   git< packet: [quit]      quit msg=Server error\n

or

   git< packet: [quit]      quit Server error\n

or

   git< packet: [quit]      quit with-message\n
   git< packet: [message]   Server error\n
   git< flush packet

Maybe this is over-engineering, but I don't think so.

Next comes the output from the filter driver (filtered contents),
using possibly multiple pkt-lines, ending with a flush packet:

    git< packets:  <filtered contents>
    git< flush packet

Note that empty file would consist of zero pack lines of contents,
and one flush packet.

Finally, to allow handling of [resumable] errors that occurred
during sending file contents, especially for the future streaming
filters case, we want to confirm that we send whole file
successfully.

    git< packet: [status]   success\n

If there was an error during process, making data receives so far
invalid, filter driver should tell about it

    git< packet: [status]   fail\n

or

    git< packet: [status]   reject\n

This may happen for example for UCS-2 <-> UTF-8 filter when invalid
byte sequence is encountered.  This may happen for git-LFS if the
server fails during fetch, and spare / slave server doesn't have
a file.

We may want to quit filtering at this point, and not to send another
file.

   git< packet: [status]    quit\n

There is place for extra information after the status, and in the
future we can allow variable length information too.


Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 01/10] pkt-line: extract set_packet_header()
  2016-08-01 11:33       ` Lars Schneider
@ 2016-08-03 20:05         ` Jakub Narębski
  2016-08-05 11:52           ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-08-03 20:05 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, gitster, tboegi, mlbright, e, peff

[This response might have been invalidated by v4]

W dniu 01.08.2016 o 13:33, Lars Schneider pisze: 
>> On 30 Jul 2016, at 12:30, Jakub Narębski <jnareb@gmail.com> wrote:

>>> #define hex(a) (hexchar[(a) & 15])
>>
>> I guess that this is inherited from the original, but this preprocessor
>> macro is local to the format_header() / set_packet_header() function,
>> and would not work outside it.  Therefore I think we should #undef it
>> after set_packet_header(), just in case somebody mistakes it for
>> a generic hex() function.  Perhaps even put it inside set_packet_header(),
>> together with #undef.
>>
>> But I might be mistaken... let's check... no, it isn't used outside it.
> 
> Agreed. Would that be OK?
> 
> static void set_packet_header(char *buf, const int size)
> {
> 	static char hexchar[] = "0123456789abcdef";
> 	#define hex(a) (hexchar[(a) & 15])
> 	buf[0] = hex(size >> 12);
> 	buf[1] = hex(size >> 8);
> 	buf[2] = hex(size >> 4);
> 	buf[3] = hex(size);
> 	#undef hex
> }

That's better, though I wonder if we need to start #defines at begining
of line.  But I think current proposal is O.K.


Either this (which has unnecessary larger scope)

  #define hex(a) (hexchar[(a) & 15])
  static void set_packet_header(char *buf, const int size)
  {
  	static char hexchar[] = "0123456789abcdef";

  	buf[0] = hex(size >> 12);
  	buf[1] = hex(size >> 8);
  	buf[2] = hex(size >> 4);
  	buf[3] = hex(size);
  }
  #undef hex

or this (which looks worse)

  static void set_packet_header(char *buf, const int size)
  {
  	static char hexchar[] = "0123456789abcdef";
  #define hex(a) (hexchar[(a) & 15])
  	buf[0] = hex(size >> 12);
  	buf[1] = hex(size >> 8);
  	buf[2] = hex(size >> 4);
  	buf[3] = hex(size);
  #undef hex
  }


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-08-01 12:00       ` Lars Schneider
@ 2016-08-03 20:12         ` Jakub Narębski
  2016-08-05 12:02           ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-08-03 20:12 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Junio C Hamano, tboegi, mlbright, Eric Wong,
	Jeff King

[This response might have been invalidated by v4]

W dniu 01.08.2016 o 14:00, Lars Schneider pisze:
>> On 30 Jul 2016, at 12:49, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>>>
>>> Sometimes pkt-line data is already available in a buffer and it would
>>> be a waste of resources to write the packet using packet_write() which
>>> would copy the existing buffer into a strbuf before writing it.
>>>
>>> If the caller has control over the buffer creation then the
>>> PKTLINE_DATA_START macro can be used to skip the header and write
>>> directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
>>> would be the maximum). direct_packet_write() would take this buffer,
>>> adjust the pkt-line header and write it.
>>>
>>> If the caller has no control over the buffer creation then
>>> direct_packet_write_data() can be used. This function creates a pkt-line
>>> header. Afterwards the header and the data buffer are written using two
>>> consecutive write calls.
>>
>> I don't quite understand what do you mean by "caller has control
>> over the buffer creation".  Do you mean that caller either can write
>> over the buffer, or cannot overwrite the buffer?  Or do you mean that
>> caller either can allocate buffer to hold header, or is getting
>> only the data?
> 
> How about this:
> 
> [...]
> 
> If the caller creates the buffer then a proper pkt-line buffer with header
> and data section can be created. The PKTLINE_DATA_START macro can be used 
> to skip the header section and write directly to the data section (PKTLINE_DATA_LEN 
> bytes would be the maximum). direct_packet_write() would take this buffer, 
> fill the pkt-line header section with the appropriate data length value and 
> write the entire buffer.
> 
> If the caller does not create the buffer, and consequently cannot leave room
> for the pkt-line header, then direct_packet_write_data() can be used. This 
> function creates an extra buffer for the pkt-line header and afterwards writes
> the header buffer and the data buffer with two consecutive write calls.
> 
> ---
> Is that more clear?

Yes, I think it is more clear.  

The only thing that could be improved is to perhaps instead of using

  "then a proper pkt-line buffer with header and data section can be created"

it might be more clear to write

  "then a proper pkt-line buffer with data section and a place for pkt-line header"
 

>>> +{
>>> +	int ret = 0;
>>> +	char hdr[4];
>>> +	set_packet_header(hdr, sizeof(hdr) + size);
>>> +	packet_trace(buf, size, 1);
>>> +	if (gentle) {
>>> +		ret = (
>>> +			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
>>
>> You can write '4' here, no need for sizeof(hdr)... though compiler would
>> optimize it away.
> 
> Right, it would be optimized. However, I don't like the 4 there either. OK to use a macro
> instead? PKTLINE_HEADER_LEN ?

Did you mean 

    +	char hdr[PKTLINE_HEADER_LEN];
    +	set_packet_header(hdr, sizeof(hdr) + size);

 
>>> +			!write_or_whine_pipe(fd, buf, size, "pkt-line data")
>>> +		);
>>
>> Do we want to try to write "pkt-line data" if "pkt-line header" failed?
>> If not, perhaps De Morgan-ize it
>>
>>  +		ret = !(
>>  +			write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") &&
>>  +			write_or_whine_pipe(fd, buf, size, "pkt-line data")
>>  +		);
> 
> 
> Original:
> 		ret = (
> 			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
> 			!write_or_whine_pipe(fd, data, size, "pkt-line data")
> 		);
> 
> Well, if the first write call fails (return == 0), then it is negated and evaluates to true.
> I would think the second call is not evaluated, then?!

This is true both for || and for &&, as in C logical boolean operators
short-circuit.

> Should I make this more explicit with a if clause?

No need.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send
  2016-08-01 12:18       ` Lars Schneider
@ 2016-08-03 20:15         ` Jakub Narębski
  0 siblings, 0 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-08-03 20:15 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, gitster, tboegi, mlbright, e, peff

[This response might have been invalidated by v4]

W dniu 01.08.2016 o 14:18, Lars Schneider pisze:
>> On 30 Jul 2016, at 14:29, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:

>> I don't buy this explanation.  If you want to trace packets, you might
>> do it on input (when formatting packet), or on output (when writing
>> packet).  It's when there are more than one formatting function, but
>> one writing function, then placing trace call in write function means
>> less code duplication; and of course the reverse.
>>
>> Another issue is that something may happen between formatting packet
>> and sending it, and we probably want to packet_trace() when packet
>> is actually send.
>>
>> Neither of those is visible in commit message.
> 
> The packet_trace() call in format_packet() is not ideal, as we would print
> a trace when a packet is formatted and (potentially) when the same packet is
> actually written. This was no problem up until now because packet_write(),
> the function that uses format_packet() and writes the formatted packet,
> did not trace the packet.
> 
> This developer believes that trace calls should only happen when a packet
> is actually written as the packet could be modified between formatting
> and writing. Therefore the trace call was moved from format_packet() to 
> packet_write().
> 
> --
> 
> Better?

Yes, that's much better.

P.S. Yes, this is one of those changes where commit message is much longer
     than the change itself...

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
@ 2016-08-03 20:18       ` Junio C Hamano
  2016-08-03 21:12         ` Jeff King
  2016-08-03 21:56         ` Lars Schneider
  0 siblings, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-03 20:18 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, jnareb, tboegi, mlbright, e, peff

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> set_packet_header() converts an integer to a 4 byte hex string. Make
> this function locally available so that other pkt-line functions can
> use it.

Didn't I say that this is a bad idea already in an earlier review?

The only reason why you want it, together with direct_packet_write()
(which I think is another bad idea), is because you use
packet_buf_write() to create a "<header><payload>" in a buf in the
usercode in step 11/12 like this:

+	packet_buf_write(&nbuf, "command=%s\n", filter_type);
+	ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);

which would be totally unnecessary if you just did strbuf_addf()
into nbuf and used packet_write() like everybody else does.

Puzzled.  Why are steps 01/12 and 02/12 an improvement?


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
  2016-08-03 17:45       ` Junio C Hamano
@ 2016-08-03 20:29       ` Junio C Hamano
  2016-08-03 21:37         ` Lars Schneider
  2016-08-05 21:34       ` Torsten Bögershausen
  2 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-03 20:29 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, jnareb, tboegi, mlbright, e, peff

larsxschneider@gmail.com writes:

> +#define FILTER_CAPABILITIES_CLEAN    (1u<<0)
> +#define FILTER_CAPABILITIES_SMUDGE   (1u<<1)
> +#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)

I would expect a lot shorter names as these are file-local;
CAP_CLEAN and CAP_SMUDGE, perhaps, _WITHOUT_ "supports BLAH" macros?

	if (FILTER_SUPPORTS_CLEAN(type))

is not all that more readable than

	if (CAP_CLEAN & type)



> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	int supported_capabilities;
> +	const char *cmd;
> +	struct child_process process;
> +};
> +
> +static int cmd_process_map_initialized = 0;
> +static struct hashmap cmd_process_map;

Don't initialize statics to 0 or NULL.

> +static int cmd2process_cmp(const struct cmd2process *e1,
> +                           const struct cmd2process *e2,
> +                           const void *unused)
> +{
> +	return strcmp(e1->cmd, e2->cmd);
> +}
> +
> +static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
> +{
> +	struct cmd2process key;
> +	hashmap_entry_init(&key, strhash(cmd));
> +	key.cmd = cmd;
> +	return hashmap_get(hashmap, &key, NULL);
> +}
> +
> +static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
> +{
> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);

I wonder if we want to diagnose failures from close(), which is a
lot more interesting than usual because these are connected to
pipes.

> +static int apply_multi_file_filter(const char *path, const char *src, size_t len,
> +                                   int fd, struct strbuf *dst, const char *cmd,
> +                                   const int wanted_capability)
> +{
> +	int ret = 1;
> + ...
> +	if (!(wanted_capability & entry->supported_capabilities))
> +		return 1;  // it is OK if the wanted capability is not supported

No // comment please.

> +	filter_result = packet_read_line(process->out, NULL);
> +	ret = !strcmp(filter_result, "result=success");
> +
> +done:
> +	if (ret) {
> +		strbuf_swap(dst, &nbuf);
> +	} else {
> +		if (!filter_result || strcmp(filter_result, "result=reject")) {
> +			// Something went wrong with the protocol filter. Force shutdown!
> +			error("external filter '%s' failed", cmd);
> +			kill_multi_file_filter(&cmd_process_map, entry);
> +		}
> +	}
> +	strbuf_release(&nbuf);
> +	return ret;
> +}

I think this was already pointed out in the previous review by Peff,
but a variable "ret" that says "0 is bad" somehow makes it hard to
follow the code.  Perhaps rename it to "int error", flip the meaning,
and if the caller wants this function to return non-zero on success
flip the polarity in the return statement itself, i.e. "return !errors",
may make it easier to follow?

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 20:18       ` Junio C Hamano
@ 2016-08-03 21:12         ` Jeff King
  2016-08-03 21:27           ` Jeff King
  2016-08-04 16:14           ` Junio C Hamano
  2016-08-03 21:56         ` Lars Schneider
  1 sibling, 2 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 21:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git, jnareb, tboegi, mlbright, e

On Wed, Aug 03, 2016 at 01:18:55PM -0700, Junio C Hamano wrote:

> larsxschneider@gmail.com writes:
> 
> > From: Lars Schneider <larsxschneider@gmail.com>
> >
> > set_packet_header() converts an integer to a 4 byte hex string. Make
> > this function locally available so that other pkt-line functions can
> > use it.
> 
> Didn't I say that this is a bad idea already in an earlier review?
> 
> The only reason why you want it, together with direct_packet_write()
> (which I think is another bad idea), is because you use
> packet_buf_write() to create a "<header><payload>" in a buf in the
> usercode in step 11/12 like this:
> 
> +	packet_buf_write(&nbuf, "command=%s\n", filter_type);
> +	ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
> 
> which would be totally unnecessary if you just did strbuf_addf()
> into nbuf and used packet_write() like everybody else does.
> 
> Puzzled.  Why are steps 01/12 and 02/12 an improvement?

I think it is an attempt to avoid the extra memcpy() of the bytes into
another packet buffer.

I notice that the solution does still end up a using a double-write() in
some cases, though.  I was curious if this made any difference, though,
so I wrote a short test program:

-- >8 --
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
        int type;

        if (argv[1] && !strcmp(argv[1], "prepend"))
                type = 0; /* size prepended to buffer */
        else if (argv[1] && !strcmp(argv[1], "write"))
                type = 1;
        else if (argv[1] && !strcmp(argv[1], "memcpy"))
                type = 2;
        else
                return 1;

        while (1) {
                char buf[65520];
                int r = read(0, buf + 4, sizeof(buf));
                if (r <= 0)
                        break;
                if (!type) {
                        memcpy(buf, "1234", 4);
                        write(1, buf, r + 4);
                } else if (type == 1) {
                        write(1, "1234", 4);
                        write(1, buf + 4, r);
                } else if (type == 2) {
                        char packet[sizeof(buf) + 4];
                        memcpy(packet, "1234", 4);
                        memcpy(packet + 4, buf + 4, r);
                        write(1, packet, r + 4);
                }
        }
        return 0;
}
-- >8 --

We'd expect "prepend" to be the fastest, as it does a single write and
zero-copy. And then it is a question of whether the double-write is
worse than the extra memcpy.

On Linux, feeding 100MB of zeroes into stdin, I got (best-of-five):

  - prepend: 11ms
  - write: 11ms
  - memcpy: 15ms

So it _does_ make a difference to avoid the memcpy, though 4ms per 100MB
does not seem like it is probably worth caring about. The double-write
also gets worse if you use a smaller buffer size (e.g., if you drop to
4K, that adds back in about 4ms of overhead because you're calling
write() a lot more times).

The cost of write() may vary on other platforms, but the cost of memcpy
generally shouldn't. So I'm inclined to say that it is not really worth
micro-optimizing the interface.

I think the other issue is that format_packet() only lets you send
string data via "%s", so it cannot be used for arbitrary data that may
contain NULs. So we do need _some_ other interface to let you send a raw
data packet, and it's going to look similar to the direct_packet_write()
thing.

The alternative is to hand-code it, which is what send_sideband() does
(it uses xsnprintf("%04x") to do the hex formatting, though).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 16:42     ` [PATCH v4 07/12] run-command: add clean_on_exit_handler larsxschneider
@ 2016-08-03 21:24       ` Jeff King
  2016-08-03 22:15         ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 21:24 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, gitster, jnareb, tboegi, mlbright, e

On Wed, Aug 03, 2016 at 06:42:20PM +0200, larsxschneider@gmail.com wrote:

> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Some commands might need to perform cleanup tasks on exit. Let's give
> them an interface for doing this.
> 
> Please note, that the cleanup callback is not executed if Git dies of a
> signal. The reason is that only "async-signal-safe" functions would be
> allowed to be call in that case. Since we cannot control what functions
> the callback will use, we will not support the case. See 507d7804 for
> more details.

I'm not clear on why we want this cleanup filter. It looks like you use
it in the final patch to send an explicit shutdown to any filters we
start. But I see two issues with that:

  1. This shutdown may come at any time, and you have no idea what state
     the protocol conversation with the filter is in. You could be in
     the middle of sending another pkt-line, or in a sequence of non-command
     pkt-lines where "shutdown" is not recognized.

  2. If your protocol does bad things when it is cut off in the middle
     without an explicit shutdown, then it's a bad protocol. As you
     note, this patch doesn't cover signal death, nor could it ever
     cover something like "kill -9", or a bug which prevented git from
     saying "shutdown".

     You're much better off to design the protocol so that a premature
     EOF is detected as an error.  For example, if we're feeding file
     data to the filter, and we're worried it might be writing it to
     a data store (like LFS), we would not want it to see EOF and say
     "well, I guess I got all the data; time to store this!". Instead,
     it should know how many bytes are coming, or should have some kind
     of framing so that the sender says "and now you have seen all the
     bytes" (like a pkt-line flush).

     AFAIK, your protocol _does_ do those things sensibly, so this
     explicit shutdown isn't really accomplishing anything.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 21:12         ` Jeff King
@ 2016-08-03 21:27           ` Jeff King
  2016-08-04 16:14           ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 21:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git, jnareb, tboegi, mlbright, e

On Wed, Aug 03, 2016 at 05:12:21PM -0400, Jeff King wrote:

> The alternative is to hand-code it, which is what send_sideband() does
> (it uses xsnprintf("%04x") to do the hex formatting, though).

After seeing that, I wondered why we need set_packet_header() at all.
But we do for the case when we are filling in the size at the start of a
buffer, because xsnprintf() will write an extra NUL that we do not care
about. send_sideband() is happy to then overwrite it with data, but
code (like format_packet) that computes the buffer, then fills in the
size, must avoid overwriting the first byte of the buffer.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 20:29       ` Junio C Hamano
@ 2016-08-03 21:37         ` Lars Schneider
  2016-08-03 21:43           ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 21:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jnareb, tboegi, mlbright, e, peff


> On 03 Aug 2016, at 22:29, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> +#define FILTER_CAPABILITIES_CLEAN    (1u<<0)
>> +#define FILTER_CAPABILITIES_SMUDGE   (1u<<1)
>> +#define FILTER_SUPPORTS_CLEAN(type)  ((type) & FILTER_CAPABILITIES_CLEAN)
>> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
> 
> I would expect a lot shorter names as these are file-local;
> CAP_CLEAN and CAP_SMUDGE, perhaps, _WITHOUT_ "supports BLAH" macros?
> 
> 	if (FILTER_SUPPORTS_CLEAN(type))
> 
> is not all that more readable than
> 
> 	if (CAP_CLEAN & type)

OK. I will change that.


>> +struct cmd2process {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	int supported_capabilities;
>> +	const char *cmd;
>> +	struct child_process process;
>> +};
>> +
>> +static int cmd_process_map_initialized = 0;
>> +static struct hashmap cmd_process_map;
> 
> Don't initialize statics to 0 or NULL.

OK, statics are initialized implicitly to 0.
I will fix it.


>> +static int cmd2process_cmp(const struct cmd2process *e1,
>> +                           const struct cmd2process *e2,
>> +                           const void *unused)
>> +{
>> +	return strcmp(e1->cmd, e2->cmd);
>> +}
>> +
>> +static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
>> +{
>> +	struct cmd2process key;
>> +	hashmap_entry_init(&key, strhash(cmd));
>> +	key.cmd = cmd;
>> +	return hashmap_get(hashmap, &key, NULL);
>> +}
>> +
>> +static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
>> +{
>> +	if (!entry)
>> +		return;
>> +	sigchain_push(SIGPIPE, SIG_IGN);
>> +	close(entry->process.in);
>> +	close(entry->process.out);
>> +	sigchain_pop(SIGPIPE);
>> +	finish_command(&entry->process);
> 
> I wonder if we want to diagnose failures from close(), which is a
> lot more interesting than usual because these are connected to
> pipes.

In this particular case we kill the filter. That means some error 
already happened, therefore the result wouldn't be of interest
anymore, I think. Wrong?

The other case is the proper shutdown (see 12/12). However, in
that case Git is already exiting and therefore I wonder what
we would do with a "close" error?


>> +static int apply_multi_file_filter(const char *path, const char *src, size_t len,
>> +                                   int fd, struct strbuf *dst, const char *cmd,
>> +                                   const int wanted_capability)
>> +{
>> +	int ret = 1;
>> + ...
>> +	if (!(wanted_capability & entry->supported_capabilities))
>> +		return 1;  // it is OK if the wanted capability is not supported
> 
> No // comment please.

OK!


>> +	filter_result = packet_read_line(process->out, NULL);
>> +	ret = !strcmp(filter_result, "result=success");
>> +
>> +done:
>> +	if (ret) {
>> +		strbuf_swap(dst, &nbuf);
>> +	} else {
>> +		if (!filter_result || strcmp(filter_result, "result=reject")) {
>> +			// Something went wrong with the protocol filter. Force shutdown!
>> +			error("external filter '%s' failed", cmd);
>> +			kill_multi_file_filter(&cmd_process_map, entry);
>> +		}
>> +	}
>> +	strbuf_release(&nbuf);
>> +	return ret;
>> +}
> 
> I think this was already pointed out in the previous review by Peff,
> but a variable "ret" that says "0 is bad" somehow makes it hard to
> follow the code.  Perhaps rename it to "int error", flip the meaning,
> and if the caller wants this function to return non-zero on success
> flip the polarity in the return statement itself, i.e. "return !errors",
> may make it easier to follow?

This follows the existing filter function. Please see Peff's later
reply here:

"So I'm not sure if changing them is a good idea. I agree with you that
it's probably inviting confusion to have the two sets of filter
functions have opposite return codes. So I think I retract my
suggestion. :)"

http://public-inbox.org/git/20160728133523.GB21311%40sigill.intra.peff.net/

That's why I kept it the way it is. If you prefer the "!errors" approach
then I will change that.


Thanks for looking at the patch,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/12] pkt-line: add packet_flush_gentle()
  2016-08-03 16:42     ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() larsxschneider
@ 2016-08-03 21:39       ` Jeff King
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
  2016-08-04 16:16         ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() Junio C Hamano
  0 siblings, 2 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 21:39 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, gitster, jnareb, tboegi, mlbright, e

On Wed, Aug 03, 2016 at 06:42:16PM +0200, larsxschneider@gmail.com wrote:

> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_flush() would die in case of a write error even though for some callers
> an error would be acceptable. Add packet_flush_gentle() which writes a pkt-line
> flush packet and returns `0` for success and `1` for failure.

Our normal convention would be "0" for success, "-1" for failure.

I see write_or_whine_pipe(), which you use here, has a bizarre "0 for
failure, 1 for success", but that nobody actually checks it.

I actually think you probably don't want to use write_or_whine_pipe()
here. It does two things:

  1. It writes to stderr unconditionally. But if you are doing a
     "gently" form, then you probably don't want unconditional errors.
     Since the point of not dying is that you could presumably recover
     in some way, or do some other more intelligent action.

     The existing callers of write_or_whine_pipe() are all in the trace
     code. Their use is not "let's handle an error", but "we _would_ die
     except that this is low-priority debugging code that should not
     interrupt the normal flow". So there it at least makes sense to
     unconditionally complain to stderr, but not to die().

     For your series, I don't think that is true (and especially for
     most potential callers of a generic "gently flush the packet"
     function).

  2. It calls check_pipe(), which will turn EPIPE into death-by-SIGPIPE
     (in case you had for some reason ignored SIGPIPE).

     But I think that's the opposite of what you want. You know you're
     writing to a pipe, and I would think EPIPE is the most common
     reason that your writes would fail (i.e., the helper unexpectedly
     died while you were writing to it).

     So you would want to explicitly ignore SIGPIPE while talking to the
     helper, and then handle EPIPE just as any other error.

Thinking about (2), I'd go so far as to say that the trace actually
should just be using:

  if (write_in_full(...) < 0)
	warning("unable to write trace to ...: %s", strerror(errno));

and we should get rid of write_or_whine_pipe entirely.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 21:37         ` Lars Schneider
@ 2016-08-03 21:43           ` Junio C Hamano
  2016-08-03 22:01             ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-03 21:43 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jakub Narębski, Torsten Bögershausen,
	mlbright, Eric Wong, Jeff King

On Wed, Aug 3, 2016 at 2:37 PM, Lars Schneider <larsxschneider@gmail.com> wrote:
>>
>> I think this was already pointed out in the previous review by Peff,
>> but a variable "ret" that says "0 is bad" somehow makes it hard to
>> follow the code.  Perhaps rename it to "int error", flip the meaning,
>> and if the caller wants this function to return non-zero on success
>> flip the polarity in the return statement itself, i.e. "return !errors",
>> may make it easier to follow?
>
> This follows the existing filter function. Please see Peff's later
> reply here:

Which I did before mentioning "pointed out in his review".

> That's why I kept it the way it is. If you prefer the "!errors" approach
> then I will change that.

I am not suggesting to change the RETURN VALUE from this function.
That is why I mentioned "return !errors" to flip the polarity at the end.
Inside the function, "ret" variable _forces_ the readers to think "this
function unlike the others signal an error with 0" constantly while
reading it, and one possible approach to reduce the mental burden
is to replace "ret" variable with "errors" variable, which is clear to
anybody that it would be non-zero when we saw error(s).

Oh, I am not suggesting to _count_ the number of errors by
mentioning a possible variable name "errors"; the only reason
why I mentioned that name is because "error" is already
taken, and "seen_error" is a bit too long.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 17:45       ` Junio C Hamano
@ 2016-08-03 21:48         ` Lars Schneider
  2016-08-03 22:46           ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 21:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jnareb, tboegi, mlbright, e, peff


> On 03 Aug 2016, at 19:45, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> packet:          git< git-filter-protocol\n
>> packet:          git< version=2\n
>> packet:          git< capabilities=clean smudge\n
> 
> During the discussion on the future of pack-protocol, it was pointed
> out that having to shove all capabilities on a single line/packet
> was one of the things we would want to fix in the current protocol
> when we revamp to v2.  As this exhange between the convert machinery
> and an external process is a brand new one, I do not think you want
> to mimic the limitation in the current pack protocol like this; the
> limitation mostly came from the constraint that we cannot break
> existing pack protocol clients and servers before we extended the
> protocol to add capabilities.
> 
> You may not foresee that the caps won't grow very long beyond
> clean/smudge right now, just like we did not foresee that we would
> wish to be able to convey a lot longer capability values to the
> other side when we added the capability exchange to the pack
> protocol, so "but but but we will never have that many" is not a
> good counter-argument.

OK. Is this the v2 discussion you are referring to?
http://public-inbox.org/git/1461972887-22100-1-git-send-email-sbeller%40google.com/

What format do you suggest?

packet:          git< git-filter-protocol\n
packet:          git< version=2\n
packet:          git< capability=clean\n
packet:          git< capability=smudge\n
packet:          git< 0000

or

packet:          git< git-filter-protocol\n
packet:          git< version=2\n
packet:          git< capability\n
packet:          git< clean\n
packet:          git< smudge\n
packet:          git< 0000

or  ... ?

I would prefer the first one, I think.

- Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 20:18       ` Junio C Hamano
  2016-08-03 21:12         ` Jeff King
@ 2016-08-03 21:56         ` Lars Schneider
  1 sibling, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 21:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jnareb, tboegi, mlbright, e, peff


> On 03 Aug 2016, at 22:18, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> set_packet_header() converts an integer to a 4 byte hex string. Make
>> this function locally available so that other pkt-line functions can
>> use it.
> 
> Didn't I say that this is a bad idea already in an earlier review?

Yes, but in that earlier version I made this function *publicly*
available. In this patch the function is only available and used
within pkt-line.c.


> The only reason why you want it, together with direct_packet_write()
> (which I think is another bad idea), is because you use
> packet_buf_write() to create a "<header><payload>" in a buf in the
> usercode in step 11/12 like this:
> 
> +	packet_buf_write(&nbuf, "command=%s\n", filter_type);
> +	ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
> 
> which would be totally unnecessary if you just did strbuf_addf()
> into nbuf and used packet_write() like everybody else does.

The usercode in step 11/12 could use packet_buf_write(). I am not
worried about performance here. What I am worried about is that
packet_buf_write() dies on error. Since direct_packet_write()
has a "gentle" parameter in can handle these cases. This is important
because a filter might be configured as "required=false" and then
errors are OK.

Would you prefer to see a packet_buf_write_gently() instead?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 21:43           ` Junio C Hamano
@ 2016-08-03 22:01             ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 22:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jakub Narębski, Torsten Bögershausen,
	mlbright, Eric Wong, Jeff King


> On 03 Aug 2016, at 23:43, Junio C Hamano <gitster@pobox.com> wrote:
> 
> On Wed, Aug 3, 2016 at 2:37 PM, Lars Schneider <larsxschneider@gmail.com> wrote:
>>> 
>>> I think this was already pointed out in the previous review by Peff,
>>> but a variable "ret" that says "0 is bad" somehow makes it hard to
>>> follow the code.  Perhaps rename it to "int error", flip the meaning,
>>> and if the caller wants this function to return non-zero on success
>>> flip the polarity in the return statement itself, i.e. "return !errors",
>>> may make it easier to follow?
>> 
>> This follows the existing filter function. Please see Peff's later
>> reply here:
> 
> Which I did before mentioning "pointed out in his review".
> 
>> That's why I kept it the way it is. If you prefer the "!errors" approach
>> then I will change that.
> 
> I am not suggesting to change the RETURN VALUE from this function.
> That is why I mentioned "return !errors" to flip the polarity at the end.
> Inside the function, "ret" variable _forces_ the readers to think "this
> function unlike the others signal an error with 0" constantly while
> reading it, and one possible approach to reduce the mental burden
> is to replace "ret" variable with "errors" variable, which is clear to
> anybody that it would be non-zero when we saw error(s).
> 
> Oh, I am not suggesting to _count_ the number of errors by
> mentioning a possible variable name "errors"; the only reason
> why I mentioned that name is because "error" is already
> taken, and "seen_error" is a bit too long.

Agreed. I got that you didn't suggest to change the return value :-)
In order to be consistent I would also adjust the error handling in
the existing apply_filter() function that I renamed to 
apply_single_file_filter() in 11/12. OK?

Thanks,
Lars 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 21:24       ` Jeff King
@ 2016-08-03 22:15         ` Lars Schneider
  2016-08-03 22:53           ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 22:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, jnareb, tboegi, mlbright, e


> On 03 Aug 2016, at 23:24, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Aug 03, 2016 at 06:42:20PM +0200, larsxschneider@gmail.com wrote:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Some commands might need to perform cleanup tasks on exit. Let's give
>> them an interface for doing this.
>> 
>> Please note, that the cleanup callback is not executed if Git dies of a
>> signal. The reason is that only "async-signal-safe" functions would be
>> allowed to be call in that case. Since we cannot control what functions
>> the callback will use, we will not support the case. See 507d7804 for
>> more details.
> 
> I'm not clear on why we want this cleanup filter. It looks like you use
> it in the final patch to send an explicit shutdown to any filters we
> start. But I see two issues with that:
> 
>  1. This shutdown may come at any time, and you have no idea what state
>     the protocol conversation with the filter is in. You could be in
>     the middle of sending another pkt-line, or in a sequence of non-command
>     pkt-lines where "shutdown" is not recognized.

Maybe I am missing something, but I don't think that can happen because 
the cleanup callback is *only* executed if Git exits normally without error. 
In that case we would be in a sane protocol state, no?


>  2. If your protocol does bad things when it is cut off in the middle
>     without an explicit shutdown, then it's a bad protocol. As you
>     note, this patch doesn't cover signal death, nor could it ever
>     cover something like "kill -9", or a bug which prevented git from
>     saying "shutdown".
> 
>     You're much better off to design the protocol so that a premature
>     EOF is detected as an error.  For example, if we're feeding file
>     data to the filter, and we're worried it might be writing it to
>     a data store (like LFS), we would not want it to see EOF and say
>     "well, I guess I got all the data; time to store this!". Instead,
>     it should know how many bytes are coming, or should have some kind
>     of framing so that the sender says "and now you have seen all the
>     bytes" (like a pkt-line flush).
> 
>     AFAIK, your protocol _does_ do those things sensibly, so this
>     explicit shutdown isn't really accomplishing anything.

Thanks. The shutdown command is not intended to be a mechanism to tell
the filter that everything went well. At this point - as you mentioned -
the filter already received all data in the right way. The shutdown
command is intended to give the filter some time to perform some post
processing before Git returns.

See here for some brainstorming how this feature could be useful
in filters similar to Git LFS:
https://github.com/github/git-lfs/issues/1401#issuecomment-236133991

- Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 21:48         ` Lars Schneider
@ 2016-08-03 22:46           ` Jeff King
  2016-08-05 12:53             ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:46 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, git, jnareb, tboegi, mlbright, e

On Wed, Aug 03, 2016 at 11:48:00PM +0200, Lars Schneider wrote:

> OK. Is this the v2 discussion you are referring to?
> http://public-inbox.org/git/1461972887-22100-1-git-send-email-sbeller%40google.com/
> 
> What format do you suggest?
> 
> packet:          git< git-filter-protocol\n
> packet:          git< version=2\n
> packet:          git< capability=clean\n
> packet:          git< capability=smudge\n
> packet:          git< 0000
> 
> or
> 
> packet:          git< git-filter-protocol\n
> packet:          git< version=2\n
> packet:          git< capability\n
> packet:          git< clean\n
> packet:          git< smudge\n
> packet:          git< 0000
> 
> or  ... ?
> 
> I would prefer the first one, I think.

How about:

  version=2
  clean=true
  smudge=true
  0000

? Then we do not have to care about multiple "capability" keys (so
something naively parsing this could just store them in a string list,
for example).

You could also make "clean" a synonym for "clean=true" or something, and
have:

  version=2
  clean
  smudge
  0000

but it's probably better to have the protocol err on the side of
verbose-but-unambiguous. It's not like people are typing this routinely.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-08-01 13:32       ` Lars Schneider
  2016-08-03 18:30         ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Jakub Narębski
@ 2016-08-03 22:47         ` Jakub Narębski
  1 sibling, 0 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-08-03 22:47 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King

[Note that some of this might have been invalidated by v4]

W dniu 01.08.2016 o 15:32, Lars Schneider pisze:
>> On 31 Jul 2016, at 00:05, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:

>>> Git starts the filter on first usage and expects a welcome
>>> message, protocol version number, and filter capabilities
>>> separated by spaces:
>>> ------------------------
>>> packet:          git< git-filter-protocol\n
>>> packet:          git< version 2\n
>>> packet:          git< capabilities clean smudge\n
>>
>> Sorry for going back and forth, but now I think that 'capabilities' are
>> not really needed here, though they are in line with "version" in
>> the second packet / line, namely "version 2".  If it does not make
>> parsing more difficult...
> 
> I don't understand what you mean with "they are not really needed"?
> The field is necessary to understand the protocol, no?
> 
> In the last roll I added the "key=value" format to the protocol upon
> yours and Peff's suggestion. Would it be OK to change the startup
> sequence accordingly?
> 
> packet:          git< version=2\n
> packet:          git< capabilities=clean smudge\n
 
With current implementation, Git checks second packet of the handshake
for version, and third packet for capabilities.  The "capabilities" or
"capabilities=" is entirely redundant; it is the position of packet
(it is the packet number) that matters.  At least for now.

The only thing that "version" in "version 2" and "capabilities"
in "capabilities: clean smudge" helps is self-describing of the protocol.

To really make use of them you would have to end handshake with flush
packet, and do a parsing: loop over every packet, and match known
patterns.  Well, perhaps with exception of known header: it doesn't
makes sense to have "version N" anywhere else than second packet,
and it doesn't makes sense to repeat it.

We also don't want to proliferate packets unnecessarily.  Each packet
is a bit (a tiny bit) of a performance hit.
 
>>> ------------------------
>>> Supported filter capabilities are "clean", "smudge", "stream",
>>> and "shutdown".
>>
>> I'd rather put "stream" and "shutdown" capabilities into separate
>> patches, for easier review.
> 
> I agree with "shutdown". I think I would like to remove the "stream"
> option and make it the default for the following reasons:
> 
> (1) As you mentioned elsewhere, "stream" is not really streaming at this
> point because we don't read/write in parallel.

We could, following the example of original per-file filter drivers.
It is as simple as starting writer using start_async(), as if we did
writing from Git in a child process.

Though that might be left for later (assuming that protocol is flexible
enough), as synchronous protocol (write, then read) is a bit simpler to
implement.

> (2) Junio and you pointed out that if we transmit size and flush packet
> then we have redundancy in the protocol.

Providing size upfront can be a hint for filter or Git.  For example
HTTP provides Content-Length: header, though it is not strictly necessary.

> (3) With the newly introduced "success"/"reject"/"failure" packet at the 
> end of a filter operation, a filter process has a way to signal Git that
> something went wrong. Initially I had the idea that a filter process just
> stops writing and Git would detect the mismatch between expected bytes
> and received bytes. But the final status packet is a much clearer solution.

The solution with stopping writing wouldn't work, I don't think.

> (4) Maintaining two slightly different protocols is a waste of resources 
> and only increases the size of this (already large) patch.

Right, better to design and implement basic protocol, taking care that
it is extensible, and only then add to it.

> My only argument for the size packet was that this allows efficient buffer
> allocation. However, in non of my benchmarks this was actually a problem.
> Therefore this is probably a epsilon optimization and should be removed.
> 
> OK with everyone?

All right.

>>> After the filter has processed a blob it is expected to wait for
>>> the next command. A demo implementation can be found in
>>> `t/t0021/rot13-filter.pl` located in the Git core repository.
>>
>> If filter does not support "shutdown" capability (or if said
>> capability is postponed for later patch), it should behave sanely
>> when Git command reaps it (SIGTERM + wait + SIGKILL?, SIGCHLD?).
> 
> How would you do this? Don't you think the current solution is
> good enough for processes that don't need a proper shutdown?

Actually... couldn't filter driver register atexit() / signal handler
to do a clean exit, if it is needed?
 
 
>> I wonder if it would be worth it to explain the reasoning behind
>> this solution and show alternate ones.
>>
>> * Using a separate variable to signal that filters are invoked
>>   per-command rather than per-file, and use pkt-line interface,
>>   like boolean-valued `useProtocol`, or `protocolVersion` set
>>   to '2' or 'v2', or `persistence` set to 'per-command', there
>>   is high risk of user's trying to use exiting one-shot per-file
>>   filters... and Git hanging.
>>
>> * Using new variables for each capability, e.g. `processSmudge`
>>   and `processClean` would lead to explosion of variable names;
>>   I think.
>>
>> * Current solution of using `process` in addition to `clean`
>>   and `smudge` clearly says that you need to use different
>>   command for per-file (`clean` and `smudge`), and per-command
>>   filter, while allowing to use them together.
>>
>>   The possible disadvantage is Git command starting `process`
>>   filter, only to see that it doesn't offer required capability,
>>   for example offering only "clean" but not "smudge".  There
>>   is simple workaround - set `smudge` variable (same as not
>>   present capability) to empty string.
> 
> If you think it is necessary to have this discussion in the
> commit message, then I will add it.

I think it would be good idea (not necessary, but helpful), though
possibly not in such exacting detail.  Just why this one, one sentence
or more.
 
 >>> +single filter invocation for the entire life of a single Git
>>> +command. This is achieved by using the following packet
>>> +format (pkt-line, see protocol-common.txt) based protocol over
>>
>> Can we linkgit-it (to technical documentation)?
> 
> I don't think that is possible because it was never done. See:
> git grep "linkgit:tech"

A pity.  Well, not your problem, anyway.

>>> +Git starts the filter on first usage and expects a welcome
>>
>> Is "usage" here correct?  Perhaps it would be more readable
>> to say that Git starts filter when encountering first file
>> that needs cleaning or smudgeing.
> 
> OK. How about this:
> 
> Git starts the filter when it encounters the first file
> that needs to be cleaned or smudged. After the filter started
> Git expects a welcome message, protocol version number, and 
> filter capabilities separated by spaces:

Better. 

>>> +
>>> +After the filter has processed a blob it is expected to wait for
>>> +the next command. A demo implementation can be found in
>>> +`t/t0021/rot13-filter.pl` located in the Git core repository.
>>
>> It is actually in Git sources.  Is it the best way to refer to
>> such files?
> 
> Well, I could add a github.com link but I don't think everyone
> would like that. What would you suggest?

Sorry, I wasn't clear.  What I meant is if "<file> located in the
Git core repository" is the best way to refer to such files, and
if we could do better.

But I think it is all right as it is.

Later we might want to provide some example filter.<driver>.process
filters e.g. in contrib/.  But that's for the future.
 
>>> +
>>> +Please note that you cannot use an existing filter.<driver>.clean
>>> +or filter.<driver>.smudge command as filter.<driver>.process
>>> +command. As soon as Git would detect a file that needs to be
>>> +processed by this filter, it would stop responding.
>>
>> This isn't.
> 
> Would that be better?
> 
> 
> Please note that you cannot use an existing `filter.<driver>.clean`
> or `filter.<driver>.smudge` command as `filter.<driver>.process`
> command because the former two use a different inter process 
> communication protocol than the latter one. As soon as Git would detect 
> a file that needs to be processed by such an invalid "process" filter, 
> it would wait for a proper protocol handshake and appear "hanging".

This is better.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 22:15         ` Lars Schneider
@ 2016-08-03 22:53           ` Jeff King
  2016-08-03 23:09             ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, gitster, jnareb, tboegi, mlbright, e

On Thu, Aug 04, 2016 at 12:15:46AM +0200, Lars Schneider wrote:

> > I'm not clear on why we want this cleanup filter. It looks like you use
> > it in the final patch to send an explicit shutdown to any filters we
> > start. But I see two issues with that:
> > 
> >  1. This shutdown may come at any time, and you have no idea what state
> >     the protocol conversation with the filter is in. You could be in
> >     the middle of sending another pkt-line, or in a sequence of non-command
> >     pkt-lines where "shutdown" is not recognized.
> 
> Maybe I am missing something, but I don't think that can happen because 
> the cleanup callback is *only* executed if Git exits normally without error. 
> In that case we would be in a sane protocol state, no?

OK, then maybe I am doubly missing the point. I thought this cleanup was
here to hit the case where we call die() and git exits unexpectedly.

If you only want to cover the "we are done, no errors, goodbye" case,
then why don't you just write shutdown when we're done?

I realize you may have multiple filters, but I don't think it should be
run-command's job to iterate over them. You are presumably keeping a
list of active filters, and should have a function to iterate over that.

Or better yet, do not require a shutdown at all. The filter sees EOF and
knows there is nothing more to do. If we are in the middle of an
operation, then it knows git died. If not, then presumably git had
nothing else to say (and really, it is not the filter's business if git
saw an error or not).

Though...

> Thanks. The shutdown command is not intended to be a mechanism to tell
> the filter that everything went well. At this point - as you mentioned -
> the filter already received all data in the right way. The shutdown
> command is intended to give the filter some time to perform some post
> processing before Git returns.
> 
> See here for some brainstorming how this feature could be useful
> in filters similar to Git LFS:
> https://github.com/github/git-lfs/issues/1401#issuecomment-236133991

OK, so it is not really "tell the filter to shutdown" but "I am done
with you, filter, but I will wait for you to tell me you are all done,
so that I can tell the user".

I'm not sure if calling that "shutdown" makes sense, though. It's almost
more of a checkpoint (and I wonder if git would ever want to
"checkpoint" without hanging up the connection).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH 0/7] minor trace fixes and cosmetic improvements
  2016-08-03 21:39       ` Jeff King
@ 2016-08-03 22:56         ` Jeff King
  2016-08-03 22:56           ` [PATCH 1/7] trace: handle NULL argument in trace_disable() Jeff King
                             ` (7 more replies)
  2016-08-04 16:16         ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() Junio C Hamano
  1 sibling, 8 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:56 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

On Wed, Aug 03, 2016 at 05:39:20PM -0400, Jeff King wrote:

> Thinking about (2), I'd go so far as to say that the trace actually
> should just be using:
> 
>   if (write_in_full(...) < 0)
> 	warning("unable to write trace to ...: %s", strerror(errno));
> 
> and we should get rid of write_or_whine_pipe entirely.

I started to write a patch to do that, but it turns out the trace code
is full of bugs (and opportunities for cosmetic improvements).

Here's what I came up with.

  [1/7]: trace: handle NULL argument in trace_disable()
  [2/7]: trace: stop using write_or_whine_pipe()
  [3/7]: trace: use warning() for printing trace errors
  [4/7]: trace: cosmetic fixes for error messages
  [5/7]: trace: correct variable name in write() error message
  [6/7]: trace: disable key after write error
  [7/7]: write_or_die: drop write_or_whine_pipe()

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH 1/7] trace: handle NULL argument in trace_disable()
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
@ 2016-08-03 22:56           ` Jeff King
  2016-08-03 22:58           ` [PATCH 2/7] trace: stop using write_or_whine_pipe() Jeff King
                             ` (6 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:56 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

All of the trace functions treat a NULL key as a synonym for
the default GIT_TRACE key. Except for trace_disable(), which
will segfault.

Fortunately, this can't cause any bugs, as the function has
no callers. But rather than drop it, let's fix the bug, as I
plan to add a caller.

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/trace.c b/trace.c
index 4aeea60..f204a7d 100644
--- a/trace.c
+++ b/trace.c
@@ -25,15 +25,25 @@
 #include "cache.h"
 #include "quote.h"
 
+/*
+ * "Normalize" a key argument by converting NULL to our trace_default,
+ * and otherwise passing through the value. All caller-facing functions
+ * should normalize their inputs in this way, though most get it
+ * for free by calling get_trace_fd() (directly or indirectly).
+ */
+static void normalize_trace_key(struct trace_key **key)
+{
+	static struct trace_key trace_default = { "GIT_TRACE" };
+	if (!*key)
+		*key = &trace_default;
+}
+
 /* Get a trace file descriptor from "key" env variable. */
 static int get_trace_fd(struct trace_key *key)
 {
-	static struct trace_key trace_default = { "GIT_TRACE" };
 	const char *trace;
 
-	/* use default "GIT_TRACE" if NULL */
-	if (!key)
-		key = &trace_default;
+	normalize_trace_key(&key);
 
 	/* don't open twice */
 	if (key->initialized)
@@ -75,6 +85,8 @@ static int get_trace_fd(struct trace_key *key)
 
 void trace_disable(struct trace_key *key)
 {
+	normalize_trace_key(&key);
+
 	if (key->need_close)
 		close(key->fd);
 	key->fd = 0;
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 2/7] trace: stop using write_or_whine_pipe()
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
  2016-08-03 22:56           ` [PATCH 1/7] trace: handle NULL argument in trace_disable() Jeff King
@ 2016-08-03 22:58           ` Jeff King
  2016-08-03 22:58           ` [PATCH 3/7] trace: use warning() for printing trace errors Jeff King
                             ` (5 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:58 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

The write_or_whine_pipe function does two things:

  1. it checks for EPIPE and converts it into a signal death

  2. it prints a message to stderr on error

The first thing does not help us, and actively hurts.
Generally we would simply die from SIGPIPE in this case,
unless somebody has taken the time to ignore SIGPIPE for the
whole process.  And if they _did_ do that, it seems rather
silly for the trace code, which otherwise takes pains to
continue even in the face of errors (e.g., by not using
write_or_die!), to take down the whole process for one
specific type of error.

Nor does the second thing help us; it just makes it harder
to write our error message, because we have to feed bits of
it as an argument to write_or_whine_pipe(). Translators
never get to see the full message, and it's hard for us to
customize it.

Let's switch to just using write_in_full() and writing our
own error string. For now, the error is identical to what
write_or_whine_pipe() would say, but now that it's more
under our control, we can improve it in future patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/trace.c b/trace.c
index f204a7d..bdbe149 100644
--- a/trace.c
+++ b/trace.c
@@ -132,18 +132,23 @@ static int prepare_trace_line(const char *file, int line,
 	return 1;
 }
 
+static void trace_write(struct trace_key *key, const void *buf, unsigned len)
+{
+	if (write_in_full(get_trace_fd(key), buf, len) < 0)
+		fprintf(stderr, "%s: write error (%s)\n", err_msg, strerror(errno));
+}
+
 void trace_verbatim(struct trace_key *key, const void *buf, unsigned len)
 {
 	if (!trace_want(key))
 		return;
-	write_or_whine_pipe(get_trace_fd(key), buf, len, err_msg);
+	trace_write(key, buf, len);
 }
 
 static void print_trace_line(struct trace_key *key, struct strbuf *buf)
 {
 	strbuf_complete_line(buf);
-
-	write_or_whine_pipe(get_trace_fd(key), buf->buf, buf->len, err_msg);
+	trace_write(key, buf->buf, buf->len);
 	strbuf_release(buf);
 }
 
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
  2016-08-03 22:56           ` [PATCH 1/7] trace: handle NULL argument in trace_disable() Jeff King
  2016-08-03 22:58           ` [PATCH 2/7] trace: stop using write_or_whine_pipe() Jeff King
@ 2016-08-03 22:58           ` Jeff King
  2016-08-04 20:41             ` Junio C Hamano
  2016-08-03 23:00           ` [PATCH 4/7] trace: cosmetic fixes for error messages Jeff King
                             ` (4 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 22:58 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

Right now we just fprintf() straight to stderr, which can
make the output hard to distinguish. It would be helpful to
give it one of our usual prefixes like "error:", "warning:",
etc.

It doesn't make sense to use error() here, as the trace code
is "optional" debugging code. If something goes wrong, we
should warn the user, but saying "error" implies the actual
git operation had a problem. So warning() is the only sane
choice.

Note that this does end up calling warn_routine() to do the
formatting. So in theory, somebody who tries to trace from
their warn_routine() could cause a loop. But nobody does
this, and in fact nobody in the history of git has ever
replaced the default warn_builtin (there isn't even a
set_warn_routine function!).

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/trace.c b/trace.c
index bdbe149..6a77e4d 100644
--- a/trace.c
+++ b/trace.c
@@ -61,9 +61,8 @@ static int get_trace_fd(struct trace_key *key)
 	else if (is_absolute_path(trace)) {
 		int fd = open(trace, O_WRONLY | O_APPEND | O_CREAT, 0666);
 		if (fd == -1) {
-			fprintf(stderr,
-				"Could not open '%s' for tracing: %s\n"
-				"Defaulting to tracing on stderr...\n",
+			warning("Could not open '%s' for tracing: %s\n"
+				"Defaulting to tracing on stderr...",
 				trace, strerror(errno));
 			key->fd = STDERR_FILENO;
 		} else {
@@ -71,10 +70,10 @@ static int get_trace_fd(struct trace_key *key)
 			key->need_close = 1;
 		}
 	} else {
-		fprintf(stderr, "What does '%s' for %s mean?\n"
+		warning("What does '%s' for %s mean?\n"
 			"If you want to trace into a file, then please set "
 			"%s to an absolute pathname (starting with /).\n"
-			"Defaulting to tracing on stderr...\n",
+			"Defaulting to tracing on stderr...",
 			trace, key->key, key->key);
 		key->fd = STDERR_FILENO;
 	}
@@ -135,7 +134,7 @@ static int prepare_trace_line(const char *file, int line,
 static void trace_write(struct trace_key *key, const void *buf, unsigned len)
 {
 	if (write_in_full(get_trace_fd(key), buf, len) < 0)
-		fprintf(stderr, "%s: write error (%s)\n", err_msg, strerror(errno));
+		warning("%s: write error (%s)", err_msg, strerror(errno));
 }
 
 void trace_verbatim(struct trace_key *key, const void *buf, unsigned len)
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 4/7] trace: cosmetic fixes for error messages
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
                             ` (2 preceding siblings ...)
  2016-08-03 22:58           ` [PATCH 3/7] trace: use warning() for printing trace errors Jeff King
@ 2016-08-03 23:00           ` Jeff King
  2016-08-04 20:42             ` Junio C Hamano
  2016-08-03 23:00           ` [PATCH 5/7] trace: correct variable name in write() error message Jeff King
                             ` (3 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:00 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

The error messages for the trace code are often multi-line;
the first line gets a nice "warning:", but the rest are
left-aligned. Let's give them an indentation to make sure
they stand out as a unit.

While we're here, let's also downcase the first letter of
each error (our usual style), and break up a long line of
advice (since we're already using multiple lines, one more
doesn't hurt).

We also replace "What does 'foo' for GIT_TRACE mean?". While
cute, it's probably a good idea to give more context, and
follow our usual styles. So it's now "unknown trace value
for 'GIT_TRACE': foo".

Signed-off-by: Jeff King <peff@peff.net>
---
I went with an indent the size of "warning: ", because that string does
not actually get translated (it seems like it probably should, though).

I think it would be nicer to still to print:

 warning: first line
 warning: second line

etc. We do that for "advice:", but not the rest of the vreportf
functions. It might be nice to do that, but we'd have to go back to
printing into a buffer (since we can't break up the incoming format
string that we feed to fprintf).

 trace.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/trace.c b/trace.c
index 6a77e4d..4d68eb5 100644
--- a/trace.c
+++ b/trace.c
@@ -61,8 +61,8 @@ static int get_trace_fd(struct trace_key *key)
 	else if (is_absolute_path(trace)) {
 		int fd = open(trace, O_WRONLY | O_APPEND | O_CREAT, 0666);
 		if (fd == -1) {
-			warning("Could not open '%s' for tracing: %s\n"
-				"Defaulting to tracing on stderr...",
+			warning("could not open '%s' for tracing: %s\n"
+				"         Defaulting to tracing on stderr...",
 				trace, strerror(errno));
 			key->fd = STDERR_FILENO;
 		} else {
@@ -70,11 +70,11 @@ static int get_trace_fd(struct trace_key *key)
 			key->need_close = 1;
 		}
 	} else {
-		warning("What does '%s' for %s mean?\n"
-			"If you want to trace into a file, then please set "
-			"%s to an absolute pathname (starting with /).\n"
-			"Defaulting to tracing on stderr...",
-			trace, key->key, key->key);
+		warning("unknown trace value for '%s': %s\n"
+			"         If you want to trace into a file, then please set %s\n"
+			"         to an absolute pathname (starting with /)\n"
+			"         Defaulting to tracing on stderr...",
+			key->key, trace, key->key);
 		key->fd = STDERR_FILENO;
 	}
 
@@ -93,7 +93,7 @@ void trace_disable(struct trace_key *key)
 	key->need_close = 0;
 }
 
-static const char err_msg[] = "Could not trace into fd given by "
+static const char err_msg[] = "could not trace into fd given by "
 	"GIT_TRACE environment variable";
 
 static int prepare_trace_line(const char *file, int line,
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 5/7] trace: correct variable name in write() error message
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
                             ` (3 preceding siblings ...)
  2016-08-03 23:00           ` [PATCH 4/7] trace: cosmetic fixes for error messages Jeff King
@ 2016-08-03 23:00           ` Jeff King
  2016-08-03 23:01           ` [PATCH 6/7] trace: disable key after write error Jeff King
                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:00 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

Our error message for write() always mentions GIT_TRACE,
even though we may be writing for a different variable
entirely. It's also not quite accurate to say "fd given by
GIT_TRACE environment variable", as we may hit this error
based on a filename the user put in the variable (we do
complain and switch to stderr if the file cannot be opened,
but it's still possible to hit a write() error on the
descriptor later).

So let's fix those things, and switch to our more usual
"unable to do X: Y" format for the error.

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/trace.c b/trace.c
index 4d68eb5..4efb256 100644
--- a/trace.c
+++ b/trace.c
@@ -93,9 +93,6 @@ void trace_disable(struct trace_key *key)
 	key->need_close = 0;
 }
 
-static const char err_msg[] = "could not trace into fd given by "
-	"GIT_TRACE environment variable";
-
 static int prepare_trace_line(const char *file, int line,
 			      struct trace_key *key, struct strbuf *buf)
 {
@@ -133,8 +130,11 @@ static int prepare_trace_line(const char *file, int line,
 
 static void trace_write(struct trace_key *key, const void *buf, unsigned len)
 {
-	if (write_in_full(get_trace_fd(key), buf, len) < 0)
-		warning("%s: write error (%s)", err_msg, strerror(errno));
+	if (write_in_full(get_trace_fd(key), buf, len) < 0) {
+		normalize_trace_key(&key);
+		warning("unable to write trace for %s: %s",
+			key->key, strerror(errno));
+	}
 }
 
 void trace_verbatim(struct trace_key *key, const void *buf, unsigned len)
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 6/7] trace: disable key after write error
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
                             ` (4 preceding siblings ...)
  2016-08-03 23:00           ` [PATCH 5/7] trace: correct variable name in write() error message Jeff King
@ 2016-08-03 23:01           ` Jeff King
  2016-08-04 20:45             ` Junio C Hamano
  2016-08-03 23:01           ` [PATCH 7/7] write_or_die: drop write_or_whine_pipe() Jeff King
  2016-08-03 23:04           ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
  7 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:01 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

If we get a write error writing to a trace descriptor, the
error isn't likely to go away if we keep writing. Instead,
you'll just get the same error over and over. E.g., try:

  GIT_TRACE_PACKET=42 git ls-remote >/dev/null

You don't really need to see:

  warning: unable to write trace for GIT_TRACE_PACKET: Bad file descriptor

hundreds of times. We could fallback to tracing to stderr,
as we do in the error code-path for open(), but there's not
much point. If the user fed us a bogus descriptor, they're
probably better off fixing their invocation. And if they
didn't, and we saw a transient error (e.g., ENOSPC writing
to a file), it probably doesn't help anybody to have half of
the trace in a file, and half on stderr.

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/trace.c b/trace.c
index 4efb256..083eb98 100644
--- a/trace.c
+++ b/trace.c
@@ -134,6 +134,7 @@ static void trace_write(struct trace_key *key, const void *buf, unsigned len)
 		normalize_trace_key(&key);
 		warning("unable to write trace for %s: %s",
 			key->key, strerror(errno));
+		trace_disable(key);
 	}
 }
 
-- 
2.9.2.670.g42e63de


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 7/7] write_or_die: drop write_or_whine_pipe()
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
                             ` (5 preceding siblings ...)
  2016-08-03 23:01           ` [PATCH 6/7] trace: disable key after write error Jeff King
@ 2016-08-03 23:01           ` Jeff King
  2016-08-03 23:04           ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
  7 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:01 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

This function has no callers, and is not likely to gain any
because it's confusing to use.

It unconditionally complains to stderr, but _doesn't_ die.
Yet any caller which wants a "gentle" write would generally
want to suppress the error message, because presumably
they're going to write a better one, and/or try the
operation again.

And the check_pipe() call leads to confusing behaviors. It
means we die for EPIPE, but not for other errors, which is
confusing and pointless.

On top of all that, it has unusual error return semantics,
which makes it easy for callers to get it wrong.

Let's drop the function, and if somebody ever needs to
resurrect something like it, they can fix these warts.

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h        |  1 -
 write_or_die.c | 12 ------------
 2 files changed, 13 deletions(-)

diff --git a/cache.h b/cache.h
index b5f76a4..3885911 100644
--- a/cache.h
+++ b/cache.h
@@ -1740,7 +1740,6 @@ extern int copy_file(const char *dst, const char *src, int mode);
 extern int copy_file_with_time(const char *dst, const char *src, int mode);
 
 extern void write_or_die(int fd, const void *buf, size_t count);
-extern int write_or_whine_pipe(int fd, const void *buf, size_t count, const char *msg);
 extern void fsync_or_die(int fd, const char *);
 
 extern ssize_t read_in_full(int fd, void *buf, size_t count);
diff --git a/write_or_die.c b/write_or_die.c
index 9816879..0734432 100644
--- a/write_or_die.c
+++ b/write_or_die.c
@@ -82,15 +82,3 @@ void write_or_die(int fd, const void *buf, size_t count)
 		die_errno("write error");
 	}
 }
-
-int write_or_whine_pipe(int fd, const void *buf, size_t count, const char *msg)
-{
-	if (write_in_full(fd, buf, count) < 0) {
-		check_pipe(errno);
-		fprintf(stderr, "%s: write error (%s)\n",
-			msg, strerror(errno));
-		return 0;
-	}
-
-	return 1;
-}
-- 
2.9.2.670.g42e63de

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/7] minor trace fixes and cosmetic improvements
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
                             ` (6 preceding siblings ...)
  2016-08-03 23:01           ` [PATCH 7/7] write_or_die: drop write_or_whine_pipe() Jeff King
@ 2016-08-03 23:04           ` Jeff King
  7 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:04 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

On Wed, Aug 03, 2016 at 06:56:00PM -0400, Jeff King wrote:

> On Wed, Aug 03, 2016 at 05:39:20PM -0400, Jeff King wrote:
> 
> > Thinking about (2), I'd go so far as to say that the trace actually
> > should just be using:
> > 
> >   if (write_in_full(...) < 0)
> > 	warning("unable to write trace to ...: %s", strerror(errno));
> > 
> > and we should get rid of write_or_whine_pipe entirely.
> 
> I started to write a patch to do that, but it turns out the trace code
> is full of bugs (and opportunities for cosmetic improvements).
> 
> Here's what I came up with.
> 
>   [1/7]: trace: handle NULL argument in trace_disable()
>   [2/7]: trace: stop using write_or_whine_pipe()
>   [3/7]: trace: use warning() for printing trace errors
>   [4/7]: trace: cosmetic fixes for error messages
>   [5/7]: trace: correct variable name in write() error message
>   [6/7]: trace: disable key after write error
>   [7/7]: write_or_die: drop write_or_whine_pipe()

Oops, I meant to detach this from the parent thread, but apparently I am
incompetent at editing email headers.

This really is totally orthogonal to your series (except that you should
obviously not use write_or_whine_pipe(), but that is the case whether I
rip it out or not :) ).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 22:53           ` Jeff King
@ 2016-08-03 23:09             ` Lars Schneider
  2016-08-03 23:15               ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-03 23:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, jnareb, tboegi, mlbright, e


> On 04 Aug 2016, at 00:53, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Aug 04, 2016 at 12:15:46AM +0200, Lars Schneider wrote:
> 
>>> I'm not clear on why we want this cleanup filter. It looks like you use
>>> it in the final patch to send an explicit shutdown to any filters we
>>> start. But I see two issues with that:
>>> 
>>> 1. This shutdown may come at any time, and you have no idea what state
>>>    the protocol conversation with the filter is in. You could be in
>>>    the middle of sending another pkt-line, or in a sequence of non-command
>>>    pkt-lines where "shutdown" is not recognized.
>> 
>> Maybe I am missing something, but I don't think that can happen because 
>> the cleanup callback is *only* executed if Git exits normally without error. 
>> In that case we would be in a sane protocol state, no?
> 
> OK, then maybe I am doubly missing the point. I thought this cleanup was
> here to hit the case where we call die() and git exits unexpectedly.
> 
> If you only want to cover the "we are done, no errors, goodbye" case,
> then why don't you just write shutdown when we're done?

I think I tried that at some point but the filter code is called from
multiple places and therefore I looked into atexit() (via run-command)
and it seemed easier. Do you have a place in mind where you would call 
the shutdown after all blobs are processed explicitly?


> I realize you may have multiple filters, but I don't think it should be
> run-command's job to iterate over them. You are presumably keeping a
> list of active filters, and should have a function to iterate over that.

Yes, that would be easy.


> Or better yet, do not require a shutdown at all. The filter sees EOF and
> knows there is nothing more to do. If we are in the middle of an
> operation, then it knows git died. If not, then presumably git had
> nothing else to say (and really, it is not the filter's business if git
> saw an error or not).

EOF? The filter is supposed to process multiple files. How would one EOF
indicate that we are done?


> Though...
> 
>> Thanks. The shutdown command is not intended to be a mechanism to tell
>> the filter that everything went well. At this point - as you mentioned -
>> the filter already received all data in the right way. The shutdown
>> command is intended to give the filter some time to perform some post
>> processing before Git returns.
>> 
>> See here for some brainstorming how this feature could be useful
>> in filters similar to Git LFS:
>> https://github.com/github/git-lfs/issues/1401#issuecomment-236133991
> 
> OK, so it is not really "tell the filter to shutdown" but "I am done
> with you, filter, but I will wait for you to tell me you are all done,
> so that I can tell the user".

Correct!


> I'm not sure if calling that "shutdown" makes sense, though. It's almost
> more of a checkpoint (and I wonder if git would ever want to
> "checkpoint" without hanging up the connection).

OK, I agree that the naming might not be ideal. But "checkpoint" does not
convey that it is only executed once after all blobs are filtered?!

I understand that Git might not want to wait for the filter...

- Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 23:09             ` Lars Schneider
@ 2016-08-03 23:15               ` Jeff King
  2016-08-05 13:08                 ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-03 23:15 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, gitster, jnareb, tboegi, mlbright, e

On Thu, Aug 04, 2016 at 01:09:57AM +0200, Lars Schneider wrote:

> > Or better yet, do not require a shutdown at all. The filter sees EOF and
> > knows there is nothing more to do. If we are in the middle of an
> > operation, then it knows git died. If not, then presumably git had
> > nothing else to say (and really, it is not the filter's business if git
> > saw an error or not).
> 
> EOF? The filter is supposed to process multiple files. How would one EOF
> indicate that we are done?

I think we may be talking about two different EOFs.

Git sends a file in pkt-line format, and the flush marks EOF for that
file. But the filter keeps running, waiting for more input. This can
happen multiple times.

Eventually git calls close() on the descriptor, and the filter sees the
"real" EOF (i.e., read() returns 0). That is the signal that git is
done.

> > I'm not sure if calling that "shutdown" makes sense, though. It's almost
> > more of a checkpoint (and I wonder if git would ever want to
> > "checkpoint" without hanging up the connection).
> 
> OK, I agree that the naming might not be ideal. But "checkpoint" does not
> convey that it is only executed once after all blobs are filtered?!

Does the filter need to care? It's told to do any deferred work, and to
report back when it's done. The fact that git is calling it before it
decides to exit is not the filter's business (and you can imagine for
something like fast-import, it might want to feed files to something
like LFS, too; it already checkpoints occasionally to avoid lost work,
and would presumably want to ask LFS to checkpoint, too).

> I understand that Git might not want to wait for the filter...

If git _doesn't_ want to wait for the filter, I don't think you need a
checkpoint at all. The filter just does its deferred work when it sees
git hang up the connection (i.e., the "real" EOF from above).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-08-01 17:55       ` Lars Schneider
@ 2016-08-04  0:42         ` Jakub Narębski
  0 siblings, 0 replies; 120+ messages in thread
From: Jakub Narębski @ 2016-08-04  0:42 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King

[Some of those answers might have been invalidated by v4]

W dniu 01.08.2016 o 19:55, Lars Schneider pisze:
>> On 01 Aug 2016, at 00:19, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
>> [...]

>>> +static int multi_packet_read(int fd_in, struct strbuf *sb, size_t expected_bytes, int is_stream)
>>
>> About name of this function: `multi_packet_read` is fine, though I wonder
>> if `packet_read_in_full` with nearly the same parameters as `packet_read`,
>> or `packet_read_till_flush`, or `read_in_full_packetized` would be better.
> 
> I like `multi_packet_read` and will rename!
 
Errr... what? multi_packet_read() is the current name...
 
>> Also, the problem is that while we know that what packet_read() stores
>> would fit in memory (in size_t), it is not true for reading whole file,
>> which might be very large - for example huge graphical assets like raw
>> images or raw videos, or virtual machine images.  Isn't that the goal
>> of git-LFS solutions, which need this feature?  Shouldn't we have then
>> both `multi_packet_read_to_fd` and `multi_packet_read_to_buf`,
>> or whatever?
> 
> Git LFS works well with the current clean/smudge mechanism that uses the
> same on in memory buffers. I understand your concern but I think this
> improvement is out of scope for this patch series.

True.  

BTW. this means that it cannot share code with fetch / push codebase,
where Git spools from pkt-line to packfile on disk.


>> Also, total_bytes_read could overflow size_t, but then we would have
>> problems storing the result in strbuf.
> 
> Would that check be ok?
> 
> 		if (total_bytes_read > SIZE_MAX - bytes_read)
> 			return 1;  // `total_bytes_read` would overflow and is not representable

Well, if current code doesn't have such check, then I think it would
be all right to not have it either.

Note that we do not use C++ comments.
 
 

>>> +
>>> +	if (is_stream)
>>> +		strbuf_grow(sb, LARGE_PACKET_MAX);           // allocate space for at least one packet
>>> +	else
>>> +		strbuf_grow(sb, st_add(expected_bytes, 1));  // add one extra byte for the packet flush
>>> +
>>> +	do {
>>> +		bytes_read = packet_read(
>>> +			fd_in, NULL, NULL,
>>> +			sb->buf + total_bytes_read, sb->len - total_bytes_read - 1,
>>> +			PACKET_READ_GENTLE_ON_EOF
>>> +		);
>>> +		if (bytes_read < 0)
>>> +			return 1;  // unexpected EOF
>>
>> Don't we usually return negative numbers on error?  Ah, I see that the
>> return is a bool, which allows to use boolean expression with 'return'.
>> But I am still unsure if it is good API, this return value.
> 
> According to Peff zero for success is the usual style:
> http://public-inbox.org/git/20160728133523.GB21311%40sigill.intra.peff.net/

The usual case is 0 for success, but -1 (and not 1) for error.
But I agree with Peff that keeping existing API is better. 

>>> +	);
>>> +	strbuf_setlen(sb, total_bytes_read);
>>> +	return (is_stream ? 0 : expected_bytes != total_bytes_read);
>>> +}
>>> +
>>> +static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
>>
>> Is it equivalent of copy_fd() function, but where destination uses pkt-line
>> and we need to pack data into pkt-lines?
> 
> Correct!

Yes, and we cannot keep the naming convention.  Though maybe mentioning
the equivalence in the comment above function would be good idea...

>>> +	return did_fail;
>>
>> Return true on fail?  Shouldn't we follow example of copy_fd()
>> from copy.c, and return COPY_READ_ERROR, or COPY_WRITE_ERROR,
>> or PKTLINE_WRITE_ERROR?
> 
> OK. How about this?
> 
> static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
> {
> 	int did_fail = 0;
> 	ssize_t bytes_to_write;
> 	while (!did_fail) {
> 		bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_MAXLEN);
> 		if (bytes_to_write < 0)
> 			return COPY_READ_ERROR;
> 		if (bytes_to_write == 0)
> 			break;
> 		did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1);
> 	}
> 	if (!did_fail)
> 		did_fail = packet_flush_gently(fd_out);
> 	return (did_fail ? COPY_WRITE_ERROR : 0);
> }

That's better, I think. 
 
>>> +}
>>> +
>>> +static int multi_packet_write_from_buf(const char *src, size_t len, int fd_out)
>>
>> It is equivalent of write_in_full(), with different order of parameters,
>> but where destination file descriptor expects pkt-line and we need to pack
>> data into pkt-lines?
> 
> True. Do you suggest to reorder parameters? I also would like to rename `src` to `src_in`, OK?

Well, no need to reorder parameters.  Better keep it the same as for
other function.  'src' is input ('source'), 'src_in' is tautologic.

>> NOTE: function description comments?
> 
> What do you mean here?

Sorry for being so cryptic.  What I meant is to think about adding comments
describing new functions just above them.
 
>>  Namely:
>>
>> - for git -> filter:
>>    * read from fd,      write pkt-line to fd  (off_t)
>>    * read from str+len, write pkt-line to fd  (size_t, ssize_t)
>> - for filter -> git:
>>    * read pkt-line from fd, write to fd       (off_t)
> 
> This one does not exist.

Right, because filter output goes to Git via strbuf.
 
>>    * read pkt-line from fd, write to str+len  (size_t, ssize_t)
[...]

>>> +	struct child_process process;
>>> +};
>>> +
>>> +static int cmd_process_map_initialized = 0;
>>> +static struct hashmap cmd_process_map;
>>
>> Reading Documentation/technical/api-hashmap.txt I see that:
>>
>>  `tablesize` is the allocated size of the hash table. A non-0 value indicates
>>  that the hashmap is initialized.
>>
>> So cmd_process_map_initialized is not really needed, is it?
> 
> I copied that from config.c:
> https://github.com/git/git/blob/f8f7adce9fc50a11a764d57815602dcb818d1816/config.c#L1425-L1428
> 
> `git grep "tablesize"` reveals that the check for `tablesize` is only used
> in hashmap.c ... so what approach should we use?

Well, git code is not always the best example... 

>>> +static int apply_protocol2_filter(const char *path, const char *src, size_t len,
>>> +						int fd, struct strbuf *dst, const char *cmd,
>>> +						const int wanted_capability)
>>
[...]

>> This is equivalent to
>>
>>   static int apply_filter(const char *path, const char *src, size_t len, int fd,
>>                           struct strbuf *dst, const char *cmd)
>>
>> Could we have extended that one instead?
> 
> Initially I had one function but that got kind of long ... I prefer two for now.

All right, we could always refactor to avoid code duplication later. 
 

>>> +
>>> +	fflush(NULL);
>>
>> This is the same as in apply_filter(), but I wonder what it is for.
> 
> "If the stream argument is NULL, fflush() flushes all
>  open output streams."
> 
> http://man7.org/linux/man-pages/man3/fflush.3.html

What I wanted to ask was not "what it does?",
but "why we need to flush here?".
 
>> This is very similar to apply_filter(), but the latter uses start_async()
>> from "run-command.h", with filter_buffer_or_fd() as asynchronous process,
>> which gets passed command to run in struct filter_params.  In this
>> function start_protocol2_filter() runs start_command(), synchronous API.
>>
>> Why the difference?
> 
> The protocol V2 requires a sequential processing of the packets. See
> discussion with Junio here:
> http://public-inbox.org/git/xmqqbn1th5qn.fsf%40gitster.mtv.corp.google.com/

I don't know what you want to refer to.  The linked email explains
why we fork/start_async() Git process, and the answer was to support
streaming.

There isn't anything there about why protocol v2 requires sequential /
synchronous processing of file output, that is write file contents in
full, then read, instead of having child write, and Git read and ready
to read (so filter driver can start writing immediately, and do not need
to wait for the other ed to stop writing / finish file).

Best regards,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-08-03 13:10       ` Lars Schneider
@ 2016-08-04 10:18         ` Jakub Narębski
  2016-08-05 13:20           ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Narębski @ 2016-08-04 10:18 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King

[Some of this answer might have been invalidated by v4;
 I might be away from computer for a few days, so I won't be reviewing]

W dniu 03.08.2016 o 15:10, Lars Schneider pisze:
> On 01 Aug 2016, at 00:19, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 30.07.2016 o 01:38, larsxschneider@gmail.com pisze:
[...]
 
>> Could this whole "send single file" be put in a separate function?
>> Or is it not worth it?
> 
> This function would have almost the same signature as apply_protocol2_filter
> and therefore I would say it's not worth it since the function is not
> crazy long.
 
All right.  Though I would say that if it makes the function more
readable, then it might be worth it.

[...]
>>> +
>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>
>> Hmmm... ignoring SIGPIPE was good for one-shot filters.  Is it still
>> O.K. for per-command persistent ones?
> 
> Very good question. You are right... we don't want to ignore any errors
> during the protocol... I will remove it.

I was actually just wondering.

Actually the default behavior if SIGPIPE is not ignored (or if the
SIGPIPE signal is not blocked / masked out) is to *terminate* the
writing program, which we do not want.

The correct solution is to check for error during write, and check
if errno is set to EPIPE.  This means that reader (filter driver
process) has closed pipe, usually due to crash, and we need to handle
that sanely, either restarting or quitting while providing sane
information about error to the user.

Well, we might want to set a signal handler for SIGPIPE, not just
simply ignore it (especially for streaming case; stop streaming
if filter driver crashed); though signal handlers are quite limited
about what might be done in them.  But that's for the future.


Read from closed pipe returns EOF; write to closed pipe results in
SIGPIPE and returns -1 (setting errno to EPIPE).
 
>>
>>> +
>>> +	packet_buf_write(&nbuf, "%s\n", filter_type);
>>> +	ret &= !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>>> +
>>> +	if (ret) {
>>> +		strbuf_reset(&nbuf);
>>> +		packet_buf_write(&nbuf, "filename=%s\n", path);
>>> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>>> +	}
>>
>> Perhaps a better solution would be
>>
>>        if (err)
>>        	goto fin_error;
>>
>> rather than this.
> 
> OK, I change it to goto error handling style.

Well, at least try it and check if it makes code more readable.
 
>>> +	if (ret) {
>>> +		strbuf_reset(&nbuf);
>>> +		packet_buf_write(&nbuf, "size=%"PRIuMAX"\n", (uintmax_t)len);
>>> +		ret = !direct_packet_write(process->in, nbuf.buf, nbuf.len, 1);
>>> +	}
>>
>> Or maybe extract writing the header for a file into a separate function?
>> This one gets a bit long...
> 
> Maybe... but I think that would make it harder to understand the protocol. I
> think I would prefer to have all the communication in one function layer.

I don't understand your reasoning here ("make it harder to understand the
protocol").  If you choose good names for function writing header, then
the main function would be the high-level view of protocol, e.g.

   git> <command>
   git> <header>
   git> <contents>
   git> <flush>

   git< <command accepted>
   git< <contents>
   git< <flush>
   git< <sent status>
 
[...]
>>> +
>>> +	if (ret) {
>>> +		filter_result = packet_read_line(process->out, NULL);
>>> +		ret = !strcmp(filter_result, "success");
>>> +	}
>>> +
>>> +	sigchain_pop(SIGPIPE);
>>> +
>>> +	if (ret) {
>>> +		strbuf_swap(dst, &nbuf);
>>> +	} else {
>>> +		if (!filter_result || strcmp(filter_result, "reject")) {
>>> +			// Something went wrong with the protocol filter. Force shutdown!

Don't use C++ one-line comments (that's C99-ism).

>>> +			error("external filter '%s' failed", cmd);
>>> +			kill_protocol2_filter(&cmd_process_map, entry);
>>> +		}
>>> +	}
>>
>> So if Git gets finish signal "success" from filter, it accepts the output.
>> If Git gets finish signal "reject" from filter, it restarts filter (and
>> reject the output - user can retry the command himself / herself).
>> If Git gets any other finish signal, for example "error" (but this is not
>> standarized), then it rejects the output, keeping the unfiltered result,
>> but keeps filtering.
>>
>> I think it is not described in this detail in the documentation of the
>> new protocol.
> 
> Agreed, will add!

That would be nice.

>>> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
>>> +	if (!ca.drv->clean && ca.drv->process)
>>> +		return apply_protocol2_filter(
>>> +			path, NULL, 0, -1, NULL, ca.drv->process, FILTER_CAPABILITIES_CLEAN
>>> +		);
>>> +	else
>>> +		return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
>>
>> Could we augment apply_filter() instead, so that the invocation is
>>
>>        return apply_filter(path, NULL, 0, -1, NULL, ca.drv, FILTER_CLEAN);
>>
>> Though I am not sure if moving this conditional to apply_filter would
>> be a good idea; maybe wrapper around augmented apply_filter_do()?
> 
> Yes, a wrapper makes it way cleaner!

That's good, because we have quite a few of those constructs. 
And I think the compiler would inline it, so there is no penalty.

>>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
[...]
>>> +		git branch empty &&
>>> +
>>> +		cat ../test.o >test.r &&
>>
>> Err, the above is just copying file, isn't it?
>> Maybe it was copied from other tests, I have not checked.
> 
> It was created in the "setup" test.
 
What I meant here (among other things) is that you uselessly use
'cat' to copy files:

    +		cp ../test.o test.r &&
 
>>> +		echo "test22" >test2.r &&
>>> +		mkdir testsubdir &&
>>> +		echo "test333" >testsubdir/test3.r &&
>>
>> All right, we test text file, we test binary file (I assume), we test
>> file in a subdirectory.  What about testing empty file?  Or large file
>> which would not fit in the stdin/stdout buffer (as EXPENSIVE test)?
> 
> No binary file. The main reason for this test is to check multiple files.
> I'll add a empty file. A large file is tested in the next test.

I assume that this large file is binary file; what matters is that it
includes NUL character ("\0"), i.e. zero byte, checking that there is
no error that would terminate it at NUL.

I'll end here for now.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-03 21:12         ` Jeff King
  2016-08-03 21:27           ` Jeff King
@ 2016-08-04 16:14           ` Junio C Hamano
  2016-08-05 14:55             ` Lars Schneider
  2016-08-05 17:31             ` Lars Schneider
  1 sibling, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 16:14 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git, jnareb, tboegi, mlbright, e

Jeff King <peff@peff.net> writes:

> The cost of write() may vary on other platforms, but the cost of memcpy
> generally shouldn't. So I'm inclined to say that it is not really worth
> micro-optimizing the interface.
>
> I think the other issue is that format_packet() only lets you send
> string data via "%s", so it cannot be used for arbitrary data that may
> contain NULs. So we do need _some_ other interface to let you send a raw
> data packet, and it's going to look similar to the direct_packet_write()
> thing.

OK.  That is a much better argument than "I already stuff the length
bytes in my buffer" (which will invite "How about stop doing that?")
to justify a new "I have N bytes of data, send it out", whose
signature would look more like write(2) and deserve to be called
packet_write() but unfortunately the name is taken by what should
have called packet_fmt() or something, but that squats on a good
name packet_write().  Sigh.




	

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 03/12] pkt-line: add packet_flush_gentle()
  2016-08-03 21:39       ` Jeff King
  2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
@ 2016-08-04 16:16         ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 16:16 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git, jnareb, tboegi, mlbright, e

Jeff King <peff@peff.net> writes:

>   2. It calls check_pipe(), which will turn EPIPE into death-by-SIGPIPE
>      (in case you had for some reason ignored SIGPIPE).
> ...
>
> Thinking about (2), I'd go so far as to say that the trace actually
> should just be using:
>
>   if (write_in_full(...) < 0)
> 	warning("unable to write trace to ...: %s", strerror(errno));
>
> and we should get rid of write_or_whine_pipe entirely.

I like the simplicity the above suggestion gives us.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-03 22:58           ` [PATCH 3/7] trace: use warning() for printing trace errors Jeff King
@ 2016-08-04 20:41             ` Junio C Hamano
  2016-08-04 21:21               ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 20:41 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git

Jeff King <peff@peff.net> writes:

> Right now we just fprintf() straight to stderr, which can
> make the output hard to distinguish. It would be helpful to
> give it one of our usual prefixes like "error:", "warning:",
> etc.
>
> It doesn't make sense to use error() here, as the trace code
> is "optional" debugging code. If something goes wrong, we
> should warn the user, but saying "error" implies the actual
> git operation had a problem. So warning() is the only sane
> choice.
>
> Note that this does end up calling warn_routine() to do the
> formatting. So in theory, somebody who tries to trace from
> their warn_routine() could cause a loop. But nobody does
> this, and in fact nobody in the history of git has ever
> replaced the default warn_builtin (there isn't even a
> set_warn_routine function!).

I think the last bit is about to change; cf. 545f13c0 (usage: add
set_warn_routine(), 2016-07-30) on cc/apply-am topic.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 4/7] trace: cosmetic fixes for error messages
  2016-08-03 23:00           ` [PATCH 4/7] trace: cosmetic fixes for error messages Jeff King
@ 2016-08-04 20:42             ` Junio C Hamano
  2016-08-05  8:00               ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 20:42 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git

Jeff King <peff@peff.net> writes:

> I think it would be nicer to still to print:
>
>  warning: first line
>  warning: second line
>
> etc. We do that for "advice:", but not the rest of the vreportf
> functions. It might be nice to do that, but we'd have to go back to
> printing into a buffer (since we can't break up the incoming format
> string that we feed to fprintf).

Yes, yes.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 6/7] trace: disable key after write error
  2016-08-03 23:01           ` [PATCH 6/7] trace: disable key after write error Jeff King
@ 2016-08-04 20:45             ` Junio C Hamano
  2016-08-04 21:22               ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 20:45 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git

Jeff King <peff@peff.net> writes:

> If we get a write error writing to a trace descriptor, the
> error isn't likely to go away if we keep writing. Instead,
> you'll just get the same error over and over. E.g., try:
>
>   GIT_TRACE_PACKET=42 git ls-remote >/dev/null
>
> You don't really need to see:
>
>   warning: unable to write trace for GIT_TRACE_PACKET: Bad file descriptor
>
> hundreds of times. We could fallback to tracing to stderr,
> as we do in the error code-path for open(), but there's not
> much point. If the user fed us a bogus descriptor, they're
> probably better off fixing their invocation. And if they
> didn't, and we saw a transient error (e.g., ENOSPC writing
> to a file), it probably doesn't help anybody to have half of
> the trace in a file, and half on stderr.

Yes, I think I like this better than "we cannot open the named file,
so let's trace into standard error stream" that is done in the code
in the context of [3/7].  We should do the same over there.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-04 20:41             ` Junio C Hamano
@ 2016-08-04 21:21               ` Jeff King
  2016-08-04 21:28                 ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-04 21:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, larsxschneider, git

On Thu, Aug 04, 2016 at 01:41:02PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Right now we just fprintf() straight to stderr, which can
> > make the output hard to distinguish. It would be helpful to
> > give it one of our usual prefixes like "error:", "warning:",
> > etc.
> >
> > It doesn't make sense to use error() here, as the trace code
> > is "optional" debugging code. If something goes wrong, we
> > should warn the user, but saying "error" implies the actual
> > git operation had a problem. So warning() is the only sane
> > choice.
> >
> > Note that this does end up calling warn_routine() to do the
> > formatting. So in theory, somebody who tries to trace from
> > their warn_routine() could cause a loop. But nobody does
> > this, and in fact nobody in the history of git has ever
> > replaced the default warn_builtin (there isn't even a
> > set_warn_routine function!).
> 
> I think the last bit is about to change; cf. 545f13c0 (usage: add
> set_warn_routine(), 2016-07-30) on cc/apply-am topic.

Thanks, I meant to check this series against "pu" to make sure there are
no new callers for write_or_whine_pipe(), but forgot to.

It looks like that same topic does add one new caller, and switches the
"fprintf" inside it to use warning().

IMHO the call added by 19a73ac (builtin/apply: make try_create_file()
return -1 on error, 2016-07-30) should just be a regular:

  if (write_in_full(...) < 0)
        error(...);

We don't care about the weird pipe handling there (we know we're writing
to a file we just created), and the way the error message is passed in
just makes things weird. Plus it seems more like an error() than a
warning (e.g., we call error() immediately below when close() fails!).
But 8fab3c6 (write_or_die: use warning() instead of fprintf(stderr,
...), 2016-07-30)  makes it an unconditional warning (that commit, btw,
has a bug in that it retains the trailing newline of the message, even
though warning() will add one itself).

So I'd suggest that series drop the call write_or_whine_pipe() and drop
8fab3c6 entirely.

I wondered if that would then let us drop set_warn_routine(), but it
looks like there are other warning() calls it cares about. So that would
invalidate the last paragraph here, though I still think converting the
trace errors to warning() is a reasonable thing to do.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 6/7] trace: disable key after write error
  2016-08-04 20:45             ` Junio C Hamano
@ 2016-08-04 21:22               ` Jeff King
  2016-08-05  7:58                 ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-04 21:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git

On Thu, Aug 04, 2016 at 01:45:11PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > If we get a write error writing to a trace descriptor, the
> > error isn't likely to go away if we keep writing. Instead,
> > you'll just get the same error over and over. E.g., try:
> >
> >   GIT_TRACE_PACKET=42 git ls-remote >/dev/null
> >
> > You don't really need to see:
> >
> >   warning: unable to write trace for GIT_TRACE_PACKET: Bad file descriptor
> >
> > hundreds of times. We could fallback to tracing to stderr,
> > as we do in the error code-path for open(), but there's not
> > much point. If the user fed us a bogus descriptor, they're
> > probably better off fixing their invocation. And if they
> > didn't, and we saw a transient error (e.g., ENOSPC writing
> > to a file), it probably doesn't help anybody to have half of
> > the trace in a file, and half on stderr.
> 
> Yes, I think I like this better than "we cannot open the named file,
> so let's trace into standard error stream" that is done in the code
> in the context of [3/7].  We should do the same over there.

Yeah, I was tempted to strip that out, too. I'll look into preparing a
patch on top.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-04 21:21               ` Jeff King
@ 2016-08-04 21:28                 ` Junio C Hamano
  2016-08-05  7:56                   ` Jeff King
  2016-08-05  7:59                   ` Christian Couder
  0 siblings, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-04 21:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, larsxschneider, git

Jeff King <peff@peff.net> writes:

> I wondered if that would then let us drop set_warn_routine(), but it
> looks like there are other warning() calls it cares about. So that would
> invalidate the last paragraph here, though I still think converting the
> trace errors to warning() is a reasonable thing to do.

Yes.  That is why tonight's pushout will have this series in 'jch'
(that is a point on a linear history between 'master' and 'pu') and
tentatively ejects cc/apply-am topic out of 'pu', expecting it to be
rerolled.

Thanks.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-04 21:28                 ` Junio C Hamano
@ 2016-08-05  7:56                   ` Jeff King
  2016-08-05  7:59                   ` Christian Couder
  1 sibling, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-05  7:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, larsxschneider, git

On Thu, Aug 04, 2016 at 02:28:09PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I wondered if that would then let us drop set_warn_routine(), but it
> > looks like there are other warning() calls it cares about. So that would
> > invalidate the last paragraph here, though I still think converting the
> > trace errors to warning() is a reasonable thing to do.
> 
> Yes.  That is why tonight's pushout will have this series in 'jch'
> (that is a point on a linear history between 'master' and 'pu') and
> tentatively ejects cc/apply-am topic out of 'pu', expecting it to be
> rerolled.

Here's a replacement patch 3. Same code, but it clarifies the
warn_routine situation in the commit message.

-- >8 --
Subject: [PATCH] trace: use warning() for printing trace errors

Right now we just fprintf() straight to stderr, which can
make the output hard to distinguish. It would be helpful to
give it one of our usual prefixes like "error:", "warning:",
etc.

It doesn't make sense to use error() here, as the trace code
is "optional" debugging code. If something goes wrong, we
should warn the user, but saying "error" implies the actual
git operation had a problem. So warning() is the only sane
choice.

Note that this does end up calling warn_routine() to do the
formatting. This is probably a good thing, since they are
clearly trying to hook messages before they make it to
stderr. However, it also means that in theory somebody who
tries to trace from their warn_routine() could cause a loop.
This seems rather unlikely in practice (we've never even
overridden the default warn_builtin routine before, and
recent discussions to do so would just install a noop
routine).

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/trace.c b/trace.c
index bdbe149..6a77e4d 100644
--- a/trace.c
+++ b/trace.c
@@ -61,9 +61,8 @@ static int get_trace_fd(struct trace_key *key)
 	else if (is_absolute_path(trace)) {
 		int fd = open(trace, O_WRONLY | O_APPEND | O_CREAT, 0666);
 		if (fd == -1) {
-			fprintf(stderr,
-				"Could not open '%s' for tracing: %s\n"
-				"Defaulting to tracing on stderr...\n",
+			warning("Could not open '%s' for tracing: %s\n"
+				"Defaulting to tracing on stderr...",
 				trace, strerror(errno));
 			key->fd = STDERR_FILENO;
 		} else {
@@ -71,10 +70,10 @@ static int get_trace_fd(struct trace_key *key)
 			key->need_close = 1;
 		}
 	} else {
-		fprintf(stderr, "What does '%s' for %s mean?\n"
+		warning("What does '%s' for %s mean?\n"
 			"If you want to trace into a file, then please set "
 			"%s to an absolute pathname (starting with /).\n"
-			"Defaulting to tracing on stderr...\n",
+			"Defaulting to tracing on stderr...",
 			trace, key->key, key->key);
 		key->fd = STDERR_FILENO;
 	}
@@ -135,7 +134,7 @@ static int prepare_trace_line(const char *file, int line,
 static void trace_write(struct trace_key *key, const void *buf, unsigned len)
 {
 	if (write_in_full(get_trace_fd(key), buf, len) < 0)
-		fprintf(stderr, "%s: write error (%s)\n", err_msg, strerror(errno));
+		warning("%s: write error (%s)", err_msg, strerror(errno));
 }
 
 void trace_verbatim(struct trace_key *key, const void *buf, unsigned len)
-- 
2.9.2.707.g48ee8b7


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH 6/7] trace: disable key after write error
  2016-08-04 21:22               ` Jeff King
@ 2016-08-05  7:58                 ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-05  7:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git

On Thu, Aug 04, 2016 at 05:22:44PM -0400, Jeff King wrote:

> On Thu, Aug 04, 2016 at 01:45:11PM -0700, Junio C Hamano wrote:
> 
> > Jeff King <peff@peff.net> writes:
> > 
> > > If we get a write error writing to a trace descriptor, the
> > > error isn't likely to go away if we keep writing. Instead,
> > > you'll just get the same error over and over. E.g., try:
> > >
> > >   GIT_TRACE_PACKET=42 git ls-remote >/dev/null
> > >
> > > You don't really need to see:
> > >
> > >   warning: unable to write trace for GIT_TRACE_PACKET: Bad file descriptor
> > >
> > > hundreds of times. We could fallback to tracing to stderr,
> > > as we do in the error code-path for open(), but there's not
> > > much point. If the user fed us a bogus descriptor, they're
> > > probably better off fixing their invocation. And if they
> > > didn't, and we saw a transient error (e.g., ENOSPC writing
> > > to a file), it probably doesn't help anybody to have half of
> > > the trace in a file, and half on stderr.
> > 
> > Yes, I think I like this better than "we cannot open the named file,
> > so let's trace into standard error stream" that is done in the code
> > in the context of [3/7].  We should do the same over there.
> 
> Yeah, I was tempted to strip that out, too. I'll look into preparing a
> patch on top.

Here's a patch that can go on the tip of jk/trace-fixup.

-- >8 --
Subject: [PATCH] trace: do not fall back to stderr

If the trace code cannot open a specified file, or does not
understand the contents of the GIT_TRACE variable, it falls
back to printing trace output to stderr.

This is an attempt to be helpful, but in practice it just
ends up annoying. The user was trying to get the output to
go somewhere else, so spewing it to stderr does not really
accomplish that. And as it's intended for debugging, they
can presumably re-run the command with their error
corrected.

So instead of falling back, this patch disables bogus trace
keys for the rest of the program, just as we do for write
errors. We can drop the "Defaulting to..." part of the error
message entirely; after seeing "cannot open '/foo'", the
user can assume that tracing is skipped.

Signed-off-by: Jeff King <peff@peff.net>
---
 trace.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/trace.c b/trace.c
index 083eb98..7508aea 100644
--- a/trace.c
+++ b/trace.c
@@ -61,10 +61,9 @@ static int get_trace_fd(struct trace_key *key)
 	else if (is_absolute_path(trace)) {
 		int fd = open(trace, O_WRONLY | O_APPEND | O_CREAT, 0666);
 		if (fd == -1) {
-			warning("could not open '%s' for tracing: %s\n"
-				"         Defaulting to tracing on stderr...",
+			warning("could not open '%s' for tracing: %s",
 				trace, strerror(errno));
-			key->fd = STDERR_FILENO;
+			trace_disable(key);
 		} else {
 			key->fd = fd;
 			key->need_close = 1;
@@ -72,10 +71,9 @@ static int get_trace_fd(struct trace_key *key)
 	} else {
 		warning("unknown trace value for '%s': %s\n"
 			"         If you want to trace into a file, then please set %s\n"
-			"         to an absolute pathname (starting with /)\n"
-			"         Defaulting to tracing on stderr...",
+			"         to an absolute pathname (starting with /)",
 			key->key, trace, key->key);
-		key->fd = STDERR_FILENO;
+		trace_disable(key);
 	}
 
 	key->initialized = 1;
-- 
2.9.2.707.g48ee8b7


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-04 21:28                 ` Junio C Hamano
  2016-08-05  7:56                   ` Jeff King
@ 2016-08-05  7:59                   ` Christian Couder
  2016-08-05 18:41                     ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Christian Couder @ 2016-08-05  7:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Lars Schneider, git

On Thu, Aug 4, 2016 at 11:28 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> I wondered if that would then let us drop set_warn_routine(), but it
>> looks like there are other warning() calls it cares about. So that would
>> invalidate the last paragraph here, though I still think converting the
>> trace errors to warning() is a reasonable thing to do.
>
> Yes.  That is why tonight's pushout will have this series in 'jch'
> (that is a point on a linear history between 'master' and 'pu') and
> tentatively ejects cc/apply-am topic out of 'pu', expecting it to be
> rerolled.

Ok, I will reroll soon with Peff's suggested changes.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 4/7] trace: cosmetic fixes for error messages
  2016-08-04 20:42             ` Junio C Hamano
@ 2016-08-05  8:00               ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-05  8:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git

On Thu, Aug 04, 2016 at 01:42:12PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I think it would be nicer to still to print:
> >
> >  warning: first line
> >  warning: second line
> >
> > etc. We do that for "advice:", but not the rest of the vreportf
> > functions. It might be nice to do that, but we'd have to go back to
> > printing into a buffer (since we can't break up the incoming format
> > string that we feed to fprintf).
> 
> Yes, yes.

Actually, I guess in this case we could easily do:

   warning("something");
   warning("something else");

etc (the lines are fairly stand-alone, so I don't think it runs afoul of
the usual translator-lego problem; not to mention that these aren't
actually translated). I don't really care that much between that and the
indented output, but if there's a preference, I'm happy to re-roll with
that.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 03/10] pkt-line: add packet_flush_gentle()
  2016-08-02 19:56         ` Torsten Bögershausen
@ 2016-08-05  9:59           ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05  9:59 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: git@vger.kernel.org, gitster@pobox.com, jnareb@gmail.com,
	mlbright@gmail.com, e@80x24.org, peff@peff.net


> On 02 Aug 2016, at 21:56, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Sun, Jul 31, 2016 at 11:45:08PM +0200, Lars Schneider wrote:
>> 
>>> On 31 Jul 2016, at 22:36, Torstem Bögershausen <tboegi@web.de> wrote:
>>> 
>>> 
>>> 
>>>> Am 29.07.2016 um 20:37 schrieb larsxschneider@gmail.com:
>>>> 
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>>>> packet_flush() would die in case of a write error even though for some callers
>>>> an error would be acceptable.
>>> What happens if there is a write error ?
>>> Basically the protocol is out of synch.
>>> Lenght information is mixed up with payload, or the other way
>>> around.
>>> It may be, that the consequences of a write error are acceptable,
>>> because a filter is allowed to fail.
>>> What is not acceptable is a "broken" protocol.
>>> The consequence schould be to close the fd and tear down all
>>> resources. connected to it.
>>> In our case to terminate the external filter daemon in some way,
>>> and to never use this instance again.
>> 
>> Correct! That is exactly what is happening in kill_protocol2_filter()
>> here:
> 
> Wait a second.
> Is kill the same as shutdown ?
> I would expect that

No, kill is used if the filter behaved strangely or signaled an error.
"Shutdown" is a graceful shutdown. However, that might not be an ideal
name. See the bottom of my discussion with Peff here:
http://public-inbox.org/git/74C2CEA6-EAAB-406F-8B37-969654955413%40gmail.com/


> The process terminates itself as soon as it detects EOF.
> As there is nothing more read.
> 
> Then the next question: The combination of kill & protocol in kill_protocol(),
> what does it mean ?

I renamed that function to "kill_multi_file_filter". Initially I called
the multi file filter "protocol" (bad decision I know) and named the
functions accordingly.


> Is it more like a graceful shutdown_protocol() ?

Yes.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option)
  2016-08-03 18:30         ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Jakub Narębski
@ 2016-08-05 10:32           ` Lars Schneider
  2016-08-06 18:24           ` Lars Schneider
  1 sibling, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 10:32 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 03 Aug 2016, at 20:30, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> The ultimate goal is to be able to run filter drivers faster for both `clean`
> and `smudge` operations.  This is done by starting filter driver once per
> git command invocation, instead of once per file being processed.  Git needs
> to pass actual contents of files to filter driver, and get its output.
> 
> We want the protocol between Git and filter driver process to be extensible,
> so that new features can be added without modifying protocol.
> 
> 
> 1. CONFIGURATION
> 
> As I wrote, there are different ways of configuring new-type filter driver:
> 
> ...
> 
> # Using a single variable for new filter type, and decide on which phase
>   (which operation) is supported by filter driver during the handshake
>   *(current approach)*
> 
>   	[filter "protocol"]
>   		process = rot13-filtes.pl
> 
>   PROS: per-file and per-command filters possible with precedence rule;
>         extensible to other types of drivers: textconv, diff, etc.
>         only one invocation for commands which use both clean and smudge
>   CONS: need single driver to be responsible for both clean and smudge;
>         need to run driver to know that it does not support given
>           operation (workaround exists)
> 
> 
> 2. HANDSHAKE (INITIALIZATION)
> 
> Next, there is deciding on and designing the handshake between Git (between
> Git command) and the filter driver process.  With the `filter.<driver>.process`
> solution the driver needs to tell which operations among (for now) "clean"
> and "smudge" it does support.  Plus it provides a way to extend protocol,
> adding new features, like support for streaming, cleaning from file or
> smudging to file, providing size upfront, perhaps even progress report.
> 
> Current handshake consist of filter driver printing a signature, version
> number and capabilities, in that order.  Git checks that it is well formed
> and matches expectations, and notes which of "clean" and "smudge" operations
> are supported by the filter.
> 
> There is no interaction from the Git side in the handshake, for example to
> set options and expectations common to all files being filtered.  Take
> one possible extension of protocol: supporting streaming.  The filter
> driver needs to know whether it needs to read all the input, or whether
> it can start printing output while input is incoming (e.g. to reduce
> memory consumption)... though we may simply decide it to be next version
> of the protocol.
> 
> On the other hand if the handshake began with Git sending some initializer
> info to the filter driver, we probably could detect one-shot filter
> misconfigured as process-filter.

OK, I'll look into this.


> Note that we need some way of deciding where handshake ends, either by
> specifying number of entries (currently: three lines / pkt-line packets),
> or providing some terminator ("smart" transport protocol uses flush packet
> for this).
> 
> ...

Would you be OK with Peff's suggestion?

  version=2
  clean=true
  smudge=true
  0000

http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi%40sigill.intra.peff.net/



> 3. SENDING CONTENTS (FILE TO BE FILTERED AND FILTER OUTPUT)
> 
> Next thing to design is decision how to send contents to be filtered
> to the filter driver process, and how to get filtered output from the
> filter driver process.
> 
> ...
> 
> # Send/receive data file by file, using some kind of chunking,
>   with a end-of-file marker.  The solution used by Git is
>   pkt-line, with flush packet used to signal end of file.
> 
>   This is protocol used by the current implementation.
> 
>   PROS:
>   - no need to know size upfront, so easier streaming support
>   - you can signal error that happened during output, after
>     some data were sent, as well as error known upfront
>   - tracing support for free (GIT_TRACE_PACKET)
>   CONS:
>   - filter driver program slightly more difficult to implement
>   - some negligible amount of overhead
> 
> If we want in the end to implement streaming, then the last solution
> is the way to go.
> 
> 
> 4. PER-FILE HANDSHAKE - SENDING FILE TO FILTER
> 
> Let's assume that for simplicity we want to implement (for now) only
> the synchronous (non-streaming) case, where we send whole contents
> of a file to filter driver process, and *then* read filter driver
> output.
> ...
> 
> If we are using pkt-line, then the convention is that text lines
> are terminated using LF ("\n") character.  This needs to be stated
> explicitly in the documentation for filter.<driver>.process writers.
> 
>    git> packet:  [operation] clean size=67\n
> 
> We could denote that it is operation name, but it is obvious from
> position in the stream, thus not really needed.

I would prefer not to mix command and size in one packet as it
makes parsing a little more difficult.


> ...
> 
> The Git would sent contents of the file to be filtered, using
> as many pack lines as needed (note: large file support needs
> to be tested, at least as expensive test).  Flush packet is
> used to signal the end of the file.
> 
>    git> packets:  <file contents>
>    git> flush packet

If expensive tests are enabled the test suite will process data
larger then max pkt size.


> 5. FILTER DRIVER PROCESS RESPONSE
> 
> First filter should, in my opinion, reply that it received the
> request (or the command, in the case of streaming supported).
> Also, in this response it can provide further information to
> Git process.
> 
>    git< packet: [received]  ok size=67\n

I think this would be different for real streaming and the current
non-streaming... therefore it would complicate the protocol?!
I wonder if it is truly necessary.


> This response could be used to refuse to filter specific file
> upfront (for example if the file is not present in the artifactory
> for git-LFS solutions).
> 
>   git< packet: [rejected]  reject\n

Reject is already supported in v4.


> We can even provide the reasoning to Git (maybe in the future
> extension)... or filter driver can print the explanation to the
> standard error (but then, no --quiet / --verbose support).
> 
>   git< packet: [rejected]  reject with-message\n
>   git< packet: [message]   File not found on server\n
>   git< flush packet

I think Git shouldn't care about these details. If the filter
needs to tell something then it should use stderr. 


> Another response, which I think should be standarized, or at
> least described in the documentation, is filter driver refusing
> to filter further (e.g. git-LFS and network is down), to be not
> restarted by Git.
> 
>   git< packet: [quit]      quit msg=Server error\n
> 
> or
> 
>   git< packet: [quit]      quit Server error\n
> 
> or
> 
>   git< packet: [quit]      quit with-message\n
>   git< packet: [message]   Server error\n
>   git< flush packet
> 
> Maybe this is over-engineering, but I don't think so.

Interesting idea! I will look into this for v5!


> Next comes the output from the filter driver (filtered contents),
> using possibly multiple pkt-lines, ending with a flush packet:
> 
>    git< packets:  <filtered contents>
>    git< flush packet
> 
> Note that empty file would consist of zero pack lines of contents,
> and one flush packet.
> 
> Finally, to allow handling of [resumable] errors that occurred
> during sending file contents, especially for the future streaming
> filters case, we want to confirm that we send whole file
> successfully.
> 
>    git< packet: [status]   success\n
> 
> If there was an error during process, making data receives so far
> invalid, filter driver should tell about it
> 
>    git< packet: [status]   fail\n
> 
> or
> 
>    git< packet: [status]   reject\n
> 
> This may happen for example for UCS-2 <-> UTF-8 filter when invalid
> byte sequence is encountered.  This may happen for git-LFS if the
> server fails during fetch, and spare / slave server doesn't have
> a file.

Correct!


> We may want to quit filtering at this point, and not to send another
> file.
> 
>   git< packet: [status]    quit\n

I don't get this one. Git would restart the filter as soon as it finds
another file that needs to be filtered, right?


Thanks a lot for this write up!

- Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 01/10] pkt-line: extract set_packet_header()
  2016-08-03 20:05         ` Jakub Narębski
@ 2016-08-05 11:52           ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 11:52 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: git, gitster, tboegi, mlbright, e, peff


> On 03 Aug 2016, at 22:05, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> [This response might have been invalidated by v4]
> 
> W dniu 01.08.2016 o 13:33, Lars Schneider pisze: 
>>> On 30 Jul 2016, at 12:30, Jakub Narębski <jnareb@gmail.com> wrote:
> 
>>>> #define hex(a) (hexchar[(a) & 15])
>>> 
>>> I guess that this is inherited from the original, but this preprocessor
>>> macro is local to the format_header() / set_packet_header() function,
>>> and would not work outside it.  Therefore I think we should #undef it
>>> after set_packet_header(), just in case somebody mistakes it for
>>> a generic hex() function.  Perhaps even put it inside set_packet_header(),
>>> together with #undef.
>>> 
>>> But I might be mistaken... let's check... no, it isn't used outside it.
>> 
>> Agreed. Would that be OK?
>> 
>> static void set_packet_header(char *buf, const int size)
>> {
>> 	static char hexchar[] = "0123456789abcdef";
>> 	#define hex(a) (hexchar[(a) & 15])
>> 	buf[0] = hex(size >> 12);
>> 	buf[1] = hex(size >> 8);
>> 	buf[2] = hex(size >> 4);
>> 	buf[3] = hex(size);
>> 	#undef hex
>> }
> 
> That's better, though I wonder if we need to start #defines at begining
> of line.  But I think current proposal is O.K.
> 
> 
> Either this (which has unnecessary larger scope)
> 
>  #define hex(a) (hexchar[(a) & 15])
>  static void set_packet_header(char *buf, const int size)
>  {
>  	static char hexchar[] = "0123456789abcdef";
> 
>  	buf[0] = hex(size >> 12);
>  	buf[1] = hex(size >> 8);
>  	buf[2] = hex(size >> 4);
>  	buf[3] = hex(size);
>  }
>  #undef hex
> 
> or this (which looks worse)
> 
>  static void set_packet_header(char *buf, const int size)
>  {
>  	static char hexchar[] = "0123456789abcdef";
>  #define hex(a) (hexchar[(a) & 15])
>  	buf[0] = hex(size >> 12);
>  	buf[1] = hex(size >> 8);
>  	buf[2] = hex(size >> 4);
>  	buf[3] = hex(size);
>  #undef hex
>  }
> 

I probably will drop this patch as Junio is not convinced that it
is a good idea:
http://public-inbox.org/git/xmqqd1lp8v2o.fsf%40gitster.mtv.corp.google.com/

- Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data()
  2016-08-03 20:12         ` Jakub Narębski
@ 2016-08-05 12:02           ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 12:02 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, tboegi, mlbright, Eric Wong,
	Jeff King


> On 03 Aug 2016, at 22:12, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> [This response might have been invalidated by v4]
> 
> W dniu 01.08.2016 o 14:00, Lars Schneider pisze:
>>> On 30 Jul 2016, at 12:49, Jakub Narębski <jnareb@gmail.com> wrote:
>>> W dniu 30.07.2016 o 01:37, larsxschneider@gmail.com pisze:
>>>> 
>>>> Sometimes pkt-line data is already available in a buffer and it would
>>>> be a waste of resources to write the packet using packet_write() which
>>>> would copy the existing buffer into a strbuf before writing it.
>>>> 
>>>> If the caller has control over the buffer creation then the
>>>> PKTLINE_DATA_START macro can be used to skip the header and write
>>>> directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
>>>> would be the maximum). direct_packet_write() would take this buffer,
>>>> adjust the pkt-line header and write it.
>>>> 
>>>> If the caller has no control over the buffer creation then
>>>> direct_packet_write_data() can be used. This function creates a pkt-line
>>>> header. Afterwards the header and the data buffer are written using two
>>>> consecutive write calls.
>>> 
>>> I don't quite understand what do you mean by "caller has control
>>> over the buffer creation".  Do you mean that caller either can write
>>> over the buffer, or cannot overwrite the buffer?  Or do you mean that
>>> caller either can allocate buffer to hold header, or is getting
>>> only the data?
>> 
>> How about this:
>> 
>> [...]
>> 
>> If the caller creates the buffer then a proper pkt-line buffer with header
>> and data section can be created. The PKTLINE_DATA_START macro can be used 
>> to skip the header section and write directly to the data section (PKTLINE_DATA_LEN 
>> bytes would be the maximum). direct_packet_write() would take this buffer, 
>> fill the pkt-line header section with the appropriate data length value and 
>> write the entire buffer.
>> 
>> If the caller does not create the buffer, and consequently cannot leave room
>> for the pkt-line header, then direct_packet_write_data() can be used. This 
>> function creates an extra buffer for the pkt-line header and afterwards writes
>> the header buffer and the data buffer with two consecutive write calls.
>> 
>> ---
>> Is that more clear?
> 
> Yes, I think it is more clear.  
> 
> The only thing that could be improved is to perhaps instead of using
> 
>  "then a proper pkt-line buffer with header and data section can be created"
> 
> it might be more clear to write
> 
>  "then a proper pkt-line buffer with data section and a place for pkt-line header"

OK. I changed it to

"If the caller has control over the buffer creation then a proper pkt-line
buffer with header and data section can be allocated. The 
PKTLINE_DATA_START macro can be used to skip the header and write
directly into the data section of a pkt-line (PKTLINE_DATA_LEN bytes
would be the maximum)..."

However, I am not yet sure if I can/will keep this patch:
http://public-inbox.org/git/xmqqeg645x6b.fsf%40gitster.mtv.corp.google.com/


> 
>>>> +{
>>>> +	int ret = 0;
>>>> +	char hdr[4];
>>>> +	set_packet_header(hdr, sizeof(hdr) + size);
>>>> +	packet_trace(buf, size, 1);
>>>> +	if (gentle) {
>>>> +		ret = (
>>>> +			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
>>> 
>>> You can write '4' here, no need for sizeof(hdr)... though compiler would
>>> optimize it away.
>> 
>> Right, it would be optimized. However, I don't like the 4 there either. OK to use a macro
>> instead? PKTLINE_HEADER_LEN ?
> 
> Did you mean 
> 
>    +	char hdr[PKTLINE_HEADER_LEN];
>    +	set_packet_header(hdr, sizeof(hdr) + size);

yes!


>>>> +			!write_or_whine_pipe(fd, buf, size, "pkt-line data")
>>>> +		);
>>> 
>>> Do we want to try to write "pkt-line data" if "pkt-line header" failed?
>>> If not, perhaps De Morgan-ize it
>>> 
>>> +		ret = !(
>>> +			write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") &&
>>> +			write_or_whine_pipe(fd, buf, size, "pkt-line data")
>>> +		);
>> 
>> 
>> Original:
>> 		ret = (
>> 			!write_or_whine_pipe(fd, hdr, sizeof(hdr), "pkt-line header") ||
>> 			!write_or_whine_pipe(fd, data, size, "pkt-line data")
>> 		);
>> 
>> Well, if the first write call fails (return == 0), then it is negated and evaluates to true.
>> I would think the second call is not evaluated, then?!
> 
> This is true both for || and for &&, as in C logical boolean operators
> short-circuit.

True. That's why I did not get your "de morganize" it comment... what would de morgan change?

> 
>> Should I make this more explicit with a if clause?
> 
> No need.

OK


Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 22:46           ` Jeff King
@ 2016-08-05 12:53             ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 12:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, jnareb, tboegi, mlbright, e


> On 04 Aug 2016, at 00:46, Jeff King <peff@peff.net> wrote:
> 
> On Wed, Aug 03, 2016 at 11:48:00PM +0200, Lars Schneider wrote:
> 
>> OK. Is this the v2 discussion you are referring to?
>> http://public-inbox.org/git/1461972887-22100-1-git-send-email-sbeller%40google.com/
>> 
>> What format do you suggest?
>> 
>> packet:          git< git-filter-protocol\n
>> packet:          git< version=2\n
>> packet:          git< capability=clean\n
>> packet:          git< capability=smudge\n
>> packet:          git< 0000
>> 
>> or
>> 
>> packet:          git< git-filter-protocol\n
>> packet:          git< version=2\n
>> packet:          git< capability\n
>> packet:          git< clean\n
>> packet:          git< smudge\n
>> packet:          git< 0000
>> 
>> or  ... ?
>> 
>> I would prefer the first one, I think.
> 
> How about:
> 
>  version=2
>  clean=true
>  smudge=true
>  0000
> 
> ? Then we do not have to care about multiple "capability" keys (so
> something naively parsing this could just store them in a string list,
> for example).

Alright. I will go with this solution.

Thanks,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-03 23:15               ` Jeff King
@ 2016-08-05 13:08                 ` Lars Schneider
  2016-08-05 21:19                   ` Torsten Bögershausen
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 13:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, jnareb, tboegi, mlbright, e


> On 04 Aug 2016, at 01:15, Jeff King <peff@peff.net> wrote:
> 
> On Thu, Aug 04, 2016 at 01:09:57AM +0200, Lars Schneider wrote:
> 
>>> Or better yet, do not require a shutdown at all. The filter sees EOF and
>>> knows there is nothing more to do. If we are in the middle of an
>>> operation, then it knows git died. If not, then presumably git had
>>> nothing else to say (and really, it is not the filter's business if git
>>> saw an error or not).
>> 
>> EOF? The filter is supposed to process multiple files. How would one EOF
>> indicate that we are done?
> 
> I think we may be talking about two different EOFs.
> 
> Git sends a file in pkt-line format, and the flush marks EOF for that
> file. But the filter keeps running, waiting for more input. This can
> happen multiple times.

Correct.

> Eventually git calls close() on the descriptor, and the filter sees the
> "real" EOF (i.e., read() returns 0). That is the signal that git is
> done.

Right.

> 
>>> I'm not sure if calling that "shutdown" makes sense, though. It's almost
>>> more of a checkpoint (and I wonder if git would ever want to
>>> "checkpoint" without hanging up the connection).
>> 
>> OK, I agree that the naming might not be ideal. But "checkpoint" does not
>> convey that it is only executed once after all blobs are filtered?!
> 
> Does the filter need to care? It's told to do any deferred work, and to
> report back when it's done. The fact that git is calling it before it
> decides to exit is not the filter's business (and you can imagine for
> something like fast-import, it might want to feed files to something
> like LFS, too; it already checkpoints occasionally to avoid lost work,
> and would presumably want to ask LFS to checkpoint, too).
> 
>> I understand that Git might not want to wait for the filter...
> 
> If git _doesn't_ want to wait for the filter, I don't think you need a
> checkpoint at all.

True. However, I wonder if it could be useful if the filter is allowed
to do some finishing work *before* Git returns to the user.


> The filter just does its deferred work when it sees
> git hang up the connection (i.e., the "real" EOF from above).

Yeah it could do that. But then the filter cannot do things like
modifying the index after the fact... however, that might be considered
nasty by the Git community anyways... I am thinking about dropping
this patch in the next roll as it is not strictly necessary for my
current use case.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 10/10] convert: add filter.<driver>.process option
  2016-08-04 10:18         ` Jakub Narębski
@ 2016-08-05 13:20           ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 13:20 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 04 Aug 2016, at 12:18, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> ...
>>>> +
>>>> +	sigchain_push(SIGPIPE, SIG_IGN);
>>> 
>>> Hmmm... ignoring SIGPIPE was good for one-shot filters.  Is it still
>>> O.K. for per-command persistent ones?
>> 
>> Very good question. You are right... we don't want to ignore any errors
>> during the protocol... I will remove it.
> 
> I was actually just wondering.
> 
> Actually the default behavior if SIGPIPE is not ignored (or if the
> SIGPIPE signal is not blocked / masked out) is to *terminate* the
> writing program, which we do not want.
> 
> The correct solution is to check for error during write, and check
> if errno is set to EPIPE.  This means that reader (filter driver
> process) has closed pipe, usually due to crash, and we need to handle
> that sanely, either restarting or quitting while providing sane
> information about error to the user.
> 
> Well, we might want to set a signal handler for SIGPIPE, not just
> simply ignore it (especially for streaming case; stop streaming
> if filter driver crashed); though signal handlers are quite limited
> about what might be done in them.  But that's for the future.
> 
> 
> Read from closed pipe returns EOF; write to closed pipe results in
> SIGPIPE and returns -1 (setting errno to EPIPE).

OK, I think I understand. I will address that in the next round.


>>> ...
>>> Or maybe extract writing the header for a file into a separate function?
>>> This one gets a bit long...
>> 
>> Maybe... but I think that would make it harder to understand the protocol. I
>> think I would prefer to have all the communication in one function layer.
> 
> I don't understand your reasoning here ("make it harder to understand the
> protocol").  If you choose good names for function writing header, then
> the main function would be the high-level view of protocol, e.g.
> 
>   git> <command>
>   git> <header>
>   git> <contents>
>   git> <flush>
> 
>   git< <command accepted>
>   git< <contents>
>   git< <flush>
>   git< <sent status>
> 

OK, I will move the header into a separate function.


>>>> ...
>>>> +		cat ../test.o >test.r &&
>>> 
>>> Err, the above is just copying file, isn't it?
>>> Maybe it was copied from other tests, I have not checked.
>> 
>> It was created in the "setup" test.
> 
> What I meant here (among other things) is that you uselessly use
> 'cat' to copy files:
> 
>    +		cp ../test.o test.r &&

Ah right. No idea why I did that. I'll use cp, of course :-)


>>>> +		echo "test22" >test2.r &&
>>>> +		mkdir testsubdir &&
>>>> +		echo "test333" >testsubdir/test3.r &&
>>> 
>>> All right, we test text file, we test binary file (I assume), we test
>>> file in a subdirectory.  What about testing empty file?  Or large file
>>> which would not fit in the stdin/stdout buffer (as EXPENSIVE test)?
>> 
>> No binary file. The main reason for this test is to check multiple files.
>> I'll add a empty file. A large file is tested in the next test.
> 
> I assume that this large file is binary file; what matters is that it
> includes NUL character ("\0"), i.e. zero byte, checking that there is
> no error that would terminate it at NUL.

Good idea! I will add a small test file with \0 bytes in between to test binaries.


Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-04 16:14           ` Junio C Hamano
@ 2016-08-05 14:55             ` Lars Schneider
  2016-08-05 16:31               ` Junio C Hamano
  2016-08-05 17:31             ` Lars Schneider
  1 sibling, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 14:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, jnareb, tboegi, mlbright, e


> On 04 Aug 2016, at 18:14, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Jeff King <peff@peff.net> writes:
> 
>> The cost of write() may vary on other platforms, but the cost of memcpy
>> generally shouldn't. So I'm inclined to say that it is not really worth
>> micro-optimizing the interface.
>> 
>> I think the other issue is that format_packet() only lets you send
>> string data via "%s", so it cannot be used for arbitrary data that may
>> contain NULs. So we do need _some_ other interface to let you send a raw
>> data packet, and it's going to look similar to the direct_packet_write()
>> thing.
> 
> OK.  That is a much better argument than "I already stuff the length
> bytes in my buffer" (which will invite "How about stop doing that?")
> to justify a new "I have N bytes of data, send it out", whose
> signature would look more like write(2) and deserve to be called
> packet_write() but unfortunately the name is taken by what should
> have called packet_fmt() or something, but that squats on a good
> name packet_write().  Sigh.

Well, my argument wasn't meant to be offensive. It was just an idea that
I published the to get feedback. Now I understand that it wasn't a particular
good idea (thanks Peff for the performance test!).

However, besides the bogus performance argument I introduced that function
to allow packet writs to fail using the `gentle` parameter:
http://public-inbox.org/git/D116610C-F33A-43DA-A49D-0B33958822E5%40gmail.com/

Would you be OK if I introduce packet_write_gently() that returns `0` if the
write was OK and `-1` if it failed?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-05 14:55             ` Lars Schneider
@ 2016-08-05 16:31               ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-05 16:31 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Jeff King, git, jnareb, tboegi, mlbright, e

Lars Schneider <larsxschneider@gmail.com> writes:

> However, besides the bogus performance argument I introduced that function
> to allow packet writs to fail using the `gentle` parameter:
> http://public-inbox.org/git/D116610C-F33A-43DA-A49D-0B33958822E5%40gmail.com/
>
> Would you be OK if I introduce packet_write_gently() that returns `0` if the
> write was OK and `-1` if it failed?

Yes, I agree with you that it would be a good thing to have a
_gently() variant that lets the caller deal with possible error
conditions itself instead of dying.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-04 16:14           ` Junio C Hamano
  2016-08-05 14:55             ` Lars Schneider
@ 2016-08-05 17:31             ` Lars Schneider
  2016-08-05 17:41               ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 17:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, jnareb, tboegi, mlbright, e


> On 04 Aug 2016, at 18:14, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Jeff King <peff@peff.net> writes:
> 
>> The cost of write() may vary on other platforms, but the cost of memcpy
>> generally shouldn't. So I'm inclined to say that it is not really worth
>> micro-optimizing the interface.
>> 
>> I think the other issue is that format_packet() only lets you send
>> string data via "%s", so it cannot be used for arbitrary data that may
>> contain NULs. So we do need _some_ other interface to let you send a raw
>> data packet, and it's going to look similar to the direct_packet_write()
>> thing.
> 
> OK.  That is a much better argument than "I already stuff the length
> bytes in my buffer" (which will invite "How about stop doing that?")
> to justify a new "I have N bytes of data, send it out", whose
> signature would look more like write(2) and deserve to be called
> packet_write() but unfortunately the name is taken by what should
> have called packet_fmt() or something, but that squats on a good
> name packet_write().  Sigh.

"Sigh" means, a series preparation patch that renames "packet_write()" 
to "paket_write_fmt()" would not be a good idea? It is used 59 times 
currently...

- Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 01/12] pkt-line: extract set_packet_header()
  2016-08-05 17:31             ` Lars Schneider
@ 2016-08-05 17:41               ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-05 17:41 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jeff King, Git Mailing List, Jakub Narębski,
	Torsten Bögershausen, Martin-Louis Bright, Eric Wong

On Fri, Aug 5, 2016 at 10:31 AM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>
>> On 04 Aug 2016, at 18:14, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> signature would look more like write(2) and deserve to be called
>> packet_write() but unfortunately the name is taken by what should
>> have called packet_fmt() or something, but that squats on a good
>> name packet_write().  Sigh.
>
> "Sigh" means, a series preparation patch that renames "packet_write()"
> to "paket_write_fmt()" would not be a good idea? It is used 59 times
> currently...

It would be a good idea in the longer term, I would think. I just wasn't
sure if you are willing to volunteer, and in-flight topics will tolerate, such
a change right now. I have a feeling that all the current callsites are
fairly stable and no in-flight topic touches them, so if you feel like doing
so, please go ahead ;-)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/7] trace: use warning() for printing trace errors
  2016-08-05  7:59                   ` Christian Couder
@ 2016-08-05 18:41                     ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-05 18:41 UTC (permalink / raw)
  To: Christian Couder; +Cc: Jeff King, Lars Schneider, git

Christian Couder <christian.couder@gmail.com> writes:

> On Thu, Aug 4, 2016 at 11:28 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Jeff King <peff@peff.net> writes:
>>
>>> I wondered if that would then let us drop set_warn_routine(), but it
>>> looks like there are other warning() calls it cares about. So that would
>>> invalidate the last paragraph here, though I still think converting the
>>> trace errors to warning() is a reasonable thing to do.
>>
>> Yes.  That is why tonight's pushout will have this series in 'jch'
>> (that is a point on a linear history between 'master' and 'pu') and
>> tentatively ejects cc/apply-am topic out of 'pu', expecting it to be
>> rerolled.
>
> Ok, I will reroll soon with Peff's suggested changes.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-05 13:08                 ` Lars Schneider
@ 2016-08-05 21:19                   ` Torsten Bögershausen
  2016-08-05 21:50                     ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Torsten Bögershausen @ 2016-08-05 21:19 UTC (permalink / raw)
  To: Lars Schneider, Jeff King; +Cc: git, gitster, jnareb, mlbright, e

On 2016-08-05 15.08, Lars Schneider wrote:

[]
> Yeah it could do that. But then the filter cannot do things like
> modifying the index after the fact... however, that might be considered
> nasty by the Git community anyways... I am thinking about dropping
> this patch in the next roll as it is not strictly necessary for my
> current use case.
(Thanks Peff for helping me out with the EOF explanation)

I would say that a filter is a filter, and should do nothing else than filtering
one file,
(or a stream).
When you want to modify the index, a hook may be your friend.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
  2016-08-03 17:45       ` Junio C Hamano
  2016-08-03 20:29       ` Junio C Hamano
@ 2016-08-05 21:34       ` Torsten Bögershausen
  2016-08-05 21:49         ` Lars Schneider
  2016-08-05 22:06         ` Junio C Hamano
  2 siblings, 2 replies; 120+ messages in thread
From: Torsten Bögershausen @ 2016-08-05 21:34 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: gitster, jnareb, mlbright, e, peff

On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
> The filter is expected to respond with the result content in zero
> or more pkt-line packets and a flush packet at the end. Finally, a
> "result=success" packet is expected if everything went well.
> ------------------------
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< result=success\n
> ------------------------
I would really send the diagnostics/return codes before the content.

> If the result content is empty then the filter is expected to respond
> only with a flush packet and a "result=success" packet.
> ------------------------
> packet:          git< 0000
> packet:          git< result=success\n
> ------------------------

Which may be:

packet:          git< result=success\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000

or for an empty file:

packet:          git< result=success\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000


or in case of an error:
packet:          git< result=reject\n
# And this will not send the "0000" packet

Does this makes sense ?


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-05 21:34       ` Torsten Bögershausen
@ 2016-08-05 21:49         ` Lars Schneider
  2016-08-05 22:06         ` Junio C Hamano
  1 sibling, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 21:49 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, gitster, jnareb, mlbright, e, peff


> On 05 Aug 2016, at 23:34, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
>> The filter is expected to respond with the result content in zero
>> or more pkt-line packets and a flush packet at the end. Finally, a
>> "result=success" packet is expected if everything went well.
>> ------------------------
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< result=success\n
>> ------------------------
> I would really send the diagnostics/return codes before the content.
> 
>> If the result content is empty then the filter is expected to respond
>> only with a flush packet and a "result=success" packet.
>> ------------------------
>> packet:          git< 0000
>> packet:          git< result=success\n
>> ------------------------
> 
> Which may be:
> 
> packet:          git< result=success\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> 
> or for an empty file:
> 
> packet:          git< result=success\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000

I think you meant:
packet:          git< result=success\n
packet:          git< 0000

Right?

> 
> or in case of an error:
> packet:          git< result=reject\n
> # And this will not send the "0000" packet
> 
> Does this makes sense ?

I see your point. However, I think your suggestion would not work in the
true streaming case as the filter wouldn't know upfront if the operation 
will succeed, right?

Thanks for the review,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 07/12] run-command: add clean_on_exit_handler
  2016-08-05 21:19                   ` Torsten Bögershausen
@ 2016-08-05 21:50                     ` Lars Schneider
  0 siblings, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-05 21:50 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Jeff King, git, gitster, jnareb, mlbright, e


> On 05 Aug 2016, at 23:19, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On 2016-08-05 15.08, Lars Schneider wrote:
> 
> []
>> Yeah it could do that. But then the filter cannot do things like
>> modifying the index after the fact... however, that might be considered
>> nasty by the Git community anyways... I am thinking about dropping
>> this patch in the next roll as it is not strictly necessary for my
>> current use case.
> (Thanks Peff for helping me out with the EOF explanation)
> 
> I would say that a filter is a filter, and should do nothing else than filtering
> one file,
> (or a stream).
> When you want to modify the index, a hook may be your friend.

Agreed. I will remove that feature.

Thanks,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-05 21:34       ` Torsten Bögershausen
  2016-08-05 21:49         ` Lars Schneider
@ 2016-08-05 22:06         ` Junio C Hamano
  2016-08-05 22:27           ` Jeff King
  2016-08-06 20:40           ` Torsten Bögershausen
  1 sibling, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2016-08-05 22:06 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: larsxschneider, git, jnareb, mlbright, e, peff

Torsten Bögershausen <tboegi@web.de> writes:

> On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
>> The filter is expected to respond with the result content in zero
>> or more pkt-line packets and a flush packet at the end. Finally, a
>> "result=success" packet is expected if everything went well.
>> ------------------------
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< result=success\n
>> ------------------------
> I would really send the diagnostics/return codes before the content.

I smell the assumption "by the time the filter starts output, it
must have finished everything and knows both size and the status".

I'd prefer to have a protocol that allows us to do streaming I/O on
both ends when possible, even if the initial version of the filters
(and the code that sits on the Git side) hold everything in-core
before starting to talk.

>> If the result content is empty then the filter is expected to respond
>> only with a flush packet and a "result=success" packet.
> ...
> Which may be:
>
> packet:          git< result=success\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
>
> or for an empty file:
>
> packet:          git< result=success\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000

The above two look the same to me.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-05 22:06         ` Junio C Hamano
@ 2016-08-05 22:27           ` Jeff King
  2016-08-06 11:55             ` Lars Schneider
  2016-08-06 20:40           ` Torsten Bögershausen
  1 sibling, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-05 22:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, larsxschneider, git, jnareb, mlbright,
	e

On Fri, Aug 05, 2016 at 03:06:28PM -0700, Junio C Hamano wrote:

> Torsten Bögershausen <tboegi@web.de> writes:
> 
> > On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
> >> The filter is expected to respond with the result content in zero
> >> or more pkt-line packets and a flush packet at the end. Finally, a
> >> "result=success" packet is expected if everything went well.
> >> ------------------------
> >> packet:          git< SMUDGED_CONTENT
> >> packet:          git< 0000
> >> packet:          git< result=success\n
> >> ------------------------
> > I would really send the diagnostics/return codes before the content.
> 
> I smell the assumption "by the time the filter starts output, it
> must have finished everything and knows both size and the status".
> 
> I'd prefer to have a protocol that allows us to do streaming I/O on
> both ends when possible, even if the initial version of the filters
> (and the code that sits on the Git side) hold everything in-core
> before starting to talk.

I think you really want to handle both cases:

  - the server says "no, I can't fulfill your request" (e.g., HTTP 404)

  - the server can abort an in-progress response to indicate that it
    could not be fulfilled completely (in HTTP chunked encoding, this
    requires hanging up before sending the final EOF chunk)

If we expect the second case to be rare, then hanging up before sending
the flush packet is probably OK. But we could also have a trailing error
code after the data to say "ignore that, we saw an error, but I can
still handle more requests".

It is true that you don't need the up-front status code in that case
(you can send an empty body and say "ignore that, we saw an error") but
that feels a little weird. And I expect it makes the lives of the client
easier to get a code up front, before it starts taking steps to handle
what it _thinks_ is probably a valid response.

-Peff

PS I haven't followed HTTP/2 development much, but I think it solves the
   "hangup" issue by putting each request/response in its own framed
   stream. I actually wonder if that is a direction we will want to go
   eventually, too, or the same reason that HTTP/2 did: multiple async
   requests across a single connection.

   We already have some precedent in the sideband protocol. So imagine,
   for example, that we could ask the filter to work on several files
   simultaneously, by sending

     git> \1[file1 content]
     git> \2[file2 content]
     git> \1[file1 content]

   and so on. I don't think this is something that needs to happen in
   the initial protocol (it's not like git can do parallel checkout
   right now anyway). If there's a capability negotiation at the front
   of the protocol, then an async feature can be worked out later. Just
   food for thought at this point.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-05 22:27           ` Jeff King
@ 2016-08-06 11:55             ` Lars Schneider
  2016-08-06 12:14               ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-06 11:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e


> On 06 Aug 2016, at 00:27, Jeff King <peff@peff.net> wrote:
> 
> On Fri, Aug 05, 2016 at 03:06:28PM -0700, Junio C Hamano wrote:
> 
>> Torsten Bögershausen <tboegi@web.de> writes:
>> 
>>> On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
>>>> The filter is expected to respond with the result content in zero
>>>> or more pkt-line packets and a flush packet at the end. Finally, a
>>>> "result=success" packet is expected if everything went well.
>>>> ------------------------
>>>> packet:          git< SMUDGED_CONTENT
>>>> packet:          git< 0000
>>>> packet:          git< result=success\n
>>>> ------------------------
>>> I would really send the diagnostics/return codes before the content.
>> 
>> I smell the assumption "by the time the filter starts output, it
>> must have finished everything and knows both size and the status".
>> 
>> I'd prefer to have a protocol that allows us to do streaming I/O on
>> both ends when possible, even if the initial version of the filters
>> (and the code that sits on the Git side) hold everything in-core
>> before starting to talk.
> 
> I think you really want to handle both cases:
> 
>  - the server says "no, I can't fulfill your request" (e.g., HTTP 404)

You can do this with the current protocol:

packet:          git< 0000
packet:          git< result=reject\n

Admittedly the flush packet could be consider overhead but I think
that is neglectable.


>  - the server can abort an in-progress response to indicate that it
>    could not be fulfilled completely (in HTTP chunked encoding, this
>    requires hanging up before sending the final EOF chunk)

Also already supported with the following sequence:

packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
packet:          git< 0000
packet:          git< result=error\n


> If we expect the second case to be rare, then hanging up before sending
> the flush packet is probably OK. But we could also have a trailing error
> code after the data to say "ignore that, we saw an error, but I can
> still handle more requests".
> 
> It is true that you don't need the up-front status code in that case
> (you can send an empty body and say "ignore that, we saw an error") but
> that feels a little weird.

I understand your argument. However, I think "0000" indicates 
"I have nothing for you" and therefore I think it would be OK in the
reject case.


> And I expect it makes the lives of the client
> easier to get a code up front, before it starts taking steps to handle
> what it _thinks_ is probably a valid response.

I am not sure I can follow you here. Which actor are you referring to when
you write "client" -- Git, right? If the response is rejected right away
then Git just needs to read a single flush. If the response experiences
an error only later, then the filter wouldn't know about the error when
it starts sending. Therefore I don't see how an error code up front could
make it easier for Git.

- Lars



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-06 11:55             ` Lars Schneider
@ 2016-08-06 12:14               ` Jeff King
  2016-08-06 18:19                 ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-06 12:14 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e

On Sat, Aug 06, 2016 at 01:55:23PM +0200, Lars Schneider wrote:

> > And I expect it makes the lives of the client
> > easier to get a code up front, before it starts taking steps to handle
> > what it _thinks_ is probably a valid response.
> 
> I am not sure I can follow you here. Which actor are you referring to when
> you write "client" -- Git, right? If the response is rejected right away
> then Git just needs to read a single flush. If the response experiences
> an error only later, then the filter wouldn't know about the error when
> it starts sending. Therefore I don't see how an error code up front could
> make it easier for Git.

Yes, I mean git (I see it as the "client" side of the connection in that
it is making requests of the filter, which will then provide responses).

What I mean is that the git code could look something like:

  status == send_filter_request();
  if (status == OK) {
	prepare_storage();
	read_response_into_storage();
  } else {
	complain();
  }

But if there's no status up front, then you probably have:

  send_filter_request();
  prepare_storage();
  status = read_response_into_storage();
  if (status != OK) {
	rollback_storage();
	complain();
  }

In the first case, we could easily avoid preparing the storage if our
request wasn't going to be filled, whereas in the second we have to do
it unconditionally. That's not a big deal if preparing the storage is
initializing a strbuf. It's more so if you're opening a temporary object
file to stream into.

You _do_ still have to deal with rollback in the first one (for the case
that the stream ends prematurely for whatever reason). So it's really a
question of where and how often we expect the failures to come, and
whether it is worth git knowing up front that the request is not going
to be fulfilled.

I dunno. It's not _that_ big a deal to code around. I was just surprised
not to see an up-front status when responding to a request. It seems
like the normal thing in just about every protocol I've ever used.

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-06 12:14               ` Jeff King
@ 2016-08-06 18:19                 ` Lars Schneider
  2016-08-08 15:02                   ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-06 18:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e


> On 06 Aug 2016, at 14:14, Jeff King <peff@peff.net> wrote:
> 
> On Sat, Aug 06, 2016 at 01:55:23PM +0200, Lars Schneider wrote:
> 
>>> And I expect it makes the lives of the client
>>> easier to get a code up front, before it starts taking steps to handle
>>> what it _thinks_ is probably a valid response.
>> 
>> I am not sure I can follow you here. Which actor are you referring to when
>> you write "client" -- Git, right? If the response is rejected right away
>> then Git just needs to read a single flush. If the response experiences
>> an error only later, then the filter wouldn't know about the error when
>> it starts sending. Therefore I don't see how an error code up front could
>> make it easier for Git.
> 
> Yes, I mean git (I see it as the "client" side of the connection in that
> it is making requests of the filter, which will then provide responses).
> 
> What I mean is that the git code could look something like:
> 
>  status == send_filter_request();
>  if (status == OK) {
> 	prepare_storage();
> 	read_response_into_storage();
>  } else {
> 	complain();
>  }
> 
> But if there's no status up front, then you probably have:
> 
>  send_filter_request();
>  prepare_storage();
>  status = read_response_into_storage();
>  if (status != OK) {
> 	rollback_storage();
> 	complain();
>  }
> 
> In the first case, we could easily avoid preparing the storage if our
> request wasn't going to be filled, whereas in the second we have to do
> it unconditionally. That's not a big deal if preparing the storage is
> initializing a strbuf. It's more so if you're opening a temporary object
> file to stream into.
> 
> You _do_ still have to deal with rollback in the first one (for the case
> that the stream ends prematurely for whatever reason). So it's really a
> question of where and how often we expect the failures to come, and
> whether it is worth git knowing up front that the request is not going
> to be fulfilled.
> 
> I dunno. It's not _that_ big a deal to code around. I was just surprised
> not to see an up-front status when responding to a request. It seems
> like the normal thing in just about every protocol I've ever used.

Alright. The fact that it "surprised" you is a bad sign. 
How about this:

Happy answer:
------------------------
packet:          git< status=accept\n
packet:          git< SMUDGED_CONTENT
packet:          git< 0000
packet:          git< status=success\n
------------------------

Happy answer with no content:
------------------------
packet:          git< status=success\n
------------------------

Rejected content:
------------------------
packet:          git< status=reject\n
------------------------

Error during content response:
------------------------
packet:          git< status=accept\n
packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
packet:          git< 0000
packet:          git< status=error\n
------------------------

Cheers,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option)
  2016-08-03 18:30         ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Jakub Narębski
  2016-08-05 10:32           ` Lars Schneider
@ 2016-08-06 18:24           ` Lars Schneider
  1 sibling, 0 replies; 120+ messages in thread
From: Lars Schneider @ 2016-08-06 18:24 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Martin-Louis Bright, Eric Wong, Jeff King


> On 03 Aug 2016, at 20:30, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> ...
> 
> 
> 2. HANDSHAKE (INITIALIZATION)
> 
> Next, there is deciding on and designing the handshake between Git (between
> Git command) and the filter driver process.  With the `filter.<driver>.process`
> solution the driver needs to tell which operations among (for now) "clean"
> and "smudge" it does support.  Plus it provides a way to extend protocol,
> adding new features, like support for streaming, cleaning from file or
> smudging to file, providing size upfront, perhaps even progress report.
> 
> Current handshake consist of filter driver printing a signature, version
> number and capabilities, in that order.  Git checks that it is well formed
> and matches expectations, and notes which of "clean" and "smudge" operations
> are supported by the filter.
> 
> There is no interaction from the Git side in the handshake, for example to
> set options and expectations common to all files being filtered.  Take
> one possible extension of protocol: supporting streaming.  The filter
> driver needs to know whether it needs to read all the input, or whether
> it can start printing output while input is incoming (e.g. to reduce
> memory consumption)... though we may simply decide it to be next version
> of the protocol.

I would like to change the startup sequence to this:

Git starts the filter when it encounters the first file
that needs to be cleaned or smudged. After the filter started
Git sends a welcome message, a list of supported protocol
version numbers, and a flush packet. Git expects to read the
welcome message and one protocol version number from the
previously sent list. Afterwards Git sends a list of supported
capabilities and a flush packet. Git expects to read a list of
desired capabilities, which must be a subset of the supported
capabilities list, and a flush packet as response:
------------------------
packet:          git> git-filter-client
packet:          git> version=2
packet:          git> version=42
packet:          git> 0000
packet:          git< git-filter-server
packet:          git< version=2
packet:          git> clean=true
packet:          git> smudge=true
packet:          git> not-yet-invented=true
packet:          git> 0000
packet:          git< clean=true
packet:          git< smudge=true
packet:          git< 0000
------------------------

This would allow us to detect the case if a user configures an
existing clean/smudge filter as `filter.<driver>.process`.
Since Git is talking first, it would not "hang" in that case.

Would that be ok with you?

Thanks,
Lars


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-05 22:06         ` Junio C Hamano
  2016-08-05 22:27           ` Jeff King
@ 2016-08-06 20:40           ` Torsten Bögershausen
  1 sibling, 0 replies; 120+ messages in thread
From: Torsten Bögershausen @ 2016-08-06 20:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git, jnareb, mlbright, e, peff

On 2016-08-06 00.06, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
>> On 2016-08-03 18.42, larsxschneider@gmail.com wrote:
>>> The filter is expected to respond with the result content in zero
>>> or more pkt-line packets and a flush packet at the end. Finally, a
>>> "result=success" packet is expected if everything went well.
>>> ------------------------
>>> packet:          git< SMUDGED_CONTENT
>>> packet:          git< 0000
>>> packet:          git< result=success\n
>>> ------------------------
>> I would really send the diagnostics/return codes before the content.
> I smell the assumption "by the time the filter starts output, it
> must have finished everything and knows both size and the status".
>
> I'd prefer to have a protocol that allows us to do streaming I/O on
> both ends when possible, even if the initial version of the filters
> (and the code that sits on the Git side) hold everything in-core
> before starting to talk.
>
>>> If the result content is empty then the filter is expected to respond
>>> only with a flush packet and a "result=success" packet.
>> ...
>> Which may be:
>>
>> packet:          git< result=success\n
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>>
>> or for an empty file:
>>
>> packet:          git< result=success\n
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
> The above two look the same to me.
Copy-paste error.
i see that we need a status after the complete transfer,
and after some thinking I would like to take back my comment.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-06 18:19                 ` Lars Schneider
@ 2016-08-08 15:02                   ` Jeff King
  2016-08-08 16:21                     ` Lars Schneider
  0 siblings, 1 reply; 120+ messages in thread
From: Jeff King @ 2016-08-08 15:02 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e

On Sat, Aug 06, 2016 at 08:19:28PM +0200, Lars Schneider wrote:

> > I dunno. It's not _that_ big a deal to code around. I was just surprised
> > not to see an up-front status when responding to a request. It seems
> > like the normal thing in just about every protocol I've ever used.
> 
> Alright. The fact that it "surprised" you is a bad sign. 
> How about this:
> 
> Happy answer:
> ------------------------
> packet:          git< status=accept\n
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< status=success\n
> ------------------------

I notice that the status pkt-lines are by themselves. I had assumed we'd
be sending other data, too (presumably before, but I guess possibly
after, too). Something like:

  git< status=accept
  git< 0000
  git< SMUDGED_CONTENT
  git< 0000
  git< status=success
  git< 0000

I don't have any particular meta-information in mind, but I thought
stuff like the tentative "size" field would be here.

I had imagined it at the front, but I guess it could go in either place.
I wonder if keys at the end could simply replace ones from the beginning
(so if you say "foo=bar" at the front, that is tentative, but if you
then say "foo=revised" at the end, that takes precedence).

And so the happy answer is really:

  git< status=success
  git< 0000
  git< SMUDGED_CONTENT
  git< 0000
  git< 0000  # empty list!

i.e., no second status. The original "success" still holds.

And then:

> Happy answer with no content:
> ------------------------
> packet:          git< status=success\n
> ------------------------

This can just be spelled:

  git< status=success
  git< 0000
  git< 0000   # empty content!
  git< 0000   # empty list!

> Rejected content:
> ------------------------
> packet:          git< status=reject\n
> ------------------------

I'd assume that an error status would end the output for that file
immediately, no empty lists necessary (so what you have here). I'd
probably just call this "error" (see below).

> Error during content response:
> ------------------------
> packet:          git< status=accept\n
> packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
> packet:          git< 0000
> packet:          git< status=error\n
> ------------------------

And then this would be:

  git< status=success
  git< 0000
  git< HALF_OF_CONTENT
  git< 0000
  git< status=error
  git< 0000

And then you have only two status codes: success and error. Which keeps
things simple.

There's one other case, which is when the filter dies halfway through
the conversation, like:

  git< status=success
  git< 0000
  git< CONTENT
  git< 0000
  ... EOF on pipe ...

Any time git does not get the conversation all the way to the final
flush after the trailers, it should be considered an error (because we
can never know if the filter was about to say "whoops, status=error").

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-08 15:02                   ` Jeff King
@ 2016-08-08 16:21                     ` Lars Schneider
  2016-08-08 16:26                       ` Jeff King
  0 siblings, 1 reply; 120+ messages in thread
From: Lars Schneider @ 2016-08-08 16:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e


> On 08 Aug 2016, at 17:02, Jeff King <peff@peff.net> wrote:
> 
> On Sat, Aug 06, 2016 at 08:19:28PM +0200, Lars Schneider wrote:
> 
>>> I dunno. It's not _that_ big a deal to code around. I was just surprised
>>> not to see an up-front status when responding to a request. It seems
>>> like the normal thing in just about every protocol I've ever used.
>> 
>> Alright. The fact that it "surprised" you is a bad sign. 
>> How about this:
>> 
>> Happy answer:
>> ------------------------
>> packet:          git< status=accept\n
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< status=success\n
>> ------------------------
> 
> I notice that the status pkt-lines are by themselves. I had assumed we'd
> be sending other data, too (presumably before, but I guess possibly
> after, too). Something like:
> 
>  git< status=accept
>  git< 0000
>  git< SMUDGED_CONTENT
>  git< 0000
>  git< status=success
>  git< 0000
> 
> I don't have any particular meta-information in mind, but I thought
> stuff like the tentative "size" field would be here.
> 
> I had imagined it at the front, but I guess it could go in either place.
> I wonder if keys at the end could simply replace ones from the beginning
> (so if you say "foo=bar" at the front, that is tentative, but if you
> then say "foo=revised" at the end, that takes precedence).
> 
> And so the happy answer is really:
> 
>  git< status=success
>  git< 0000
>  git< SMUDGED_CONTENT
>  git< 0000
>  git< 0000  # empty list!
> 
> i.e., no second status. The original "success" still holds.

OK, that sounds sensible to me.


> And then:
> 
>> Happy answer with no content:
>> ------------------------
>> packet:          git< status=success\n
>> ------------------------
> 
> This can just be spelled:
> 
>  git< status=success
>  git< 0000
>  git< 0000   # empty content!
>  git< 0000   # empty list!

Is the first flush packet one too many?
If there is nothing then I think we shouldn't
send any packets?!

I agree with the remaining two flush packets.


>> Rejected content:
>> ------------------------
>> packet:          git< status=reject\n
>> ------------------------
> 
> I'd assume that an error status would end the output for that file
> immediately, no empty lists necessary (so what you have here). I'd
> probably just call this "error" (see below).

OK!


>> Error during content response:
>> ------------------------
>> packet:          git< status=accept\n
>> packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> packet:          git< 0000
>> packet:          git< status=error\n
>> ------------------------
> 
> And then this would be:
> 
>  git< status=success
>  git< 0000
>  git< HALF_OF_CONTENT
>  git< 0000
>  git< status=error
>  git< 0000
> 
> And then you have only two status codes: success and error. Which keeps
> things simple.
> 
> There's one other case, which is when the filter dies halfway through
> the conversation, like:
> 
>  git< status=success
>  git< 0000
>  git< CONTENT
>  git< 0000
>  ... EOF on pipe ...
> 
> Any time git does not get the conversation all the way to the final
> flush after the trailers, it should be considered an error (because we
> can never know if the filter was about to say "whoops, status=error").

Right. I agree with the protocol above and I will implement it
that way.

There is one more thing: I introduced a return value "status=error-all".
Using this the filter can signal Git that it does not want to process
any other file using the particular command.

Jakub came up with this idea here:

"Another response, which I think should be standarized, or at
least described in the documentation, is filter driver refusing
to filter further (e.g. git-LFS and network is down), to be not
restarted by Git."

http://public-inbox.org/git/607c07fe-5b6f-fd67-13e1-705020c267ee%40gmail.com/

I think it is a good idea. Do you see arguments against it?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v4 11/12] convert: add filter.<driver>.process option
  2016-08-08 16:21                     ` Lars Schneider
@ 2016-08-08 16:26                       ` Jeff King
  0 siblings, 0 replies; 120+ messages in thread
From: Jeff King @ 2016-08-08 16:26 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Torsten Bögershausen, Git Mailing List,
	Jakub Narębski, mlbright, e

On Mon, Aug 08, 2016 at 06:21:18PM +0200, Lars Schneider wrote:

> >> Happy answer with no content:
> >> ------------------------
> >> packet:          git< status=success\n
> >> ------------------------
> > 
> > This can just be spelled:
> > 
> >  git< status=success
> >  git< 0000
> >  git< 0000   # empty content!
> >  git< 0000   # empty list!
> 
> Is the first flush packet one too many?
> If there is nothing then I think we shouldn't
> send any packets?!
> 
> I agree with the remaining two flush packets.

There isn't nothing, there is a "status" field (though I think that
should probably be required, so I guess you could imagine it as a
stand-alone pkt, separate from the list terminated by the flush). But
regardless, you need the first flush to say "I am done telling you
up-front keys, now I am starting the content".

Otherwise, what would:

  git< status=success
  git< foo=bar
  git< 0000

be parsed as? Is "foo=bar" the first line of content, or the rest of the
pre-content header? (You could guess if you could see the total
conversation, but you can't; you have to parse it as it comes).

> There is one more thing: I introduced a return value "status=error-all".
> Using this the filter can signal Git that it does not want to process
> any other file using the particular command.
> 
> Jakub came up with this idea here:
> 
> "Another response, which I think should be standarized, or at
> least described in the documentation, is filter driver refusing
> to filter further (e.g. git-LFS and network is down), to be not
> restarted by Git."
> 
> http://public-inbox.org/git/607c07fe-5b6f-fd67-13e1-705020c267ee%40gmail.com/
> 
> I think it is a good idea. Do you see arguments against it?

No, that seems reasonable (I would have just implemented that by hanging
up the connection, but explicitly communicating is more robust).

-Peff

^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2016-08-08 16:26 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20160727000605.49982-1-larsxschneider%40gmail.com/>
2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
2016-07-30 10:30     ` Jakub Narębski
2016-08-01 11:33       ` Lars Schneider
2016-08-03 20:05         ` Jakub Narębski
2016-08-05 11:52           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
2016-07-30 10:49     ` Jakub Narębski
2016-08-01 12:00       ` Lars Schneider
2016-08-03 20:12         ` Jakub Narębski
2016-08-05 12:02           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
2016-07-30 12:04     ` Jakub Narębski
2016-08-01 12:28       ` Lars Schneider
2016-07-31 20:36     ` Torstem Bögershausen
2016-07-31 21:45       ` Lars Schneider
2016-08-02 19:56         ` Torsten Bögershausen
2016-08-05  9:59           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
2016-07-30 12:29     ` Jakub Narębski
2016-08-01 12:18       ` Lars Schneider
2016-08-03 20:15         ` Jakub Narębski
2016-07-29 23:37   ` [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size larsxschneider
2016-07-30 13:58     ` Jakub Narębski
2016-08-01 12:23       ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 06/10] run-command: add clean_on_exit_handler larsxschneider
2016-07-30  9:50     ` Johannes Sixt
2016-08-01 11:14       ` Lars Schneider
2016-08-02  5:53         ` Johannes Sixt
2016-08-02  7:41           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 07/10] convert: quote filter names in error messages larsxschneider
2016-07-29 23:37   ` [PATCH v3 08/10] convert: modernize tests larsxschneider
2016-07-29 23:38   ` [PATCH v3 09/10] convert: generate large test files only once larsxschneider
2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
2016-07-30 22:05     ` Jakub Narębski
2016-07-31  9:42       ` Jakub Narębski
2016-07-31 19:49         ` Lars Schneider
2016-07-31 22:59           ` Jakub Narębski
2016-08-01 13:32       ` Lars Schneider
2016-08-03 18:30         ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Jakub Narębski
2016-08-05 10:32           ` Lars Schneider
2016-08-06 18:24           ` Lars Schneider
2016-08-03 22:47         ` [PATCH v3 10/10] convert: add filter.<driver>.process option Jakub Narębski
2016-07-31 22:19     ` Jakub Narębski
2016-08-01 17:55       ` Lars Schneider
2016-08-04  0:42         ` Jakub Narębski
2016-08-03 13:10       ` Lars Schneider
2016-08-04 10:18         ` Jakub Narębski
2016-08-05 13:20           ` Lars Schneider
2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
2016-08-03 20:18       ` Junio C Hamano
2016-08-03 21:12         ` Jeff King
2016-08-03 21:27           ` Jeff King
2016-08-04 16:14           ` Junio C Hamano
2016-08-05 14:55             ` Lars Schneider
2016-08-05 16:31               ` Junio C Hamano
2016-08-05 17:31             ` Lars Schneider
2016-08-05 17:41               ` Junio C Hamano
2016-08-03 21:56         ` Lars Schneider
2016-08-03 16:42     ` [PATCH v4 02/12] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
2016-08-03 16:42     ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() larsxschneider
2016-08-03 21:39       ` Jeff King
2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
2016-08-03 22:56           ` [PATCH 1/7] trace: handle NULL argument in trace_disable() Jeff King
2016-08-03 22:58           ` [PATCH 2/7] trace: stop using write_or_whine_pipe() Jeff King
2016-08-03 22:58           ` [PATCH 3/7] trace: use warning() for printing trace errors Jeff King
2016-08-04 20:41             ` Junio C Hamano
2016-08-04 21:21               ` Jeff King
2016-08-04 21:28                 ` Junio C Hamano
2016-08-05  7:56                   ` Jeff King
2016-08-05  7:59                   ` Christian Couder
2016-08-05 18:41                     ` Junio C Hamano
2016-08-03 23:00           ` [PATCH 4/7] trace: cosmetic fixes for error messages Jeff King
2016-08-04 20:42             ` Junio C Hamano
2016-08-05  8:00               ` Jeff King
2016-08-03 23:00           ` [PATCH 5/7] trace: correct variable name in write() error message Jeff King
2016-08-03 23:01           ` [PATCH 6/7] trace: disable key after write error Jeff King
2016-08-04 20:45             ` Junio C Hamano
2016-08-04 21:22               ` Jeff King
2016-08-05  7:58                 ` Jeff King
2016-08-03 23:01           ` [PATCH 7/7] write_or_die: drop write_or_whine_pipe() Jeff King
2016-08-03 23:04           ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
2016-08-04 16:16         ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() Junio C Hamano
2016-08-03 16:42     ` [PATCH v4 04/12] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
2016-08-03 16:42     ` [PATCH v4 05/12] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
2016-08-03 16:42     ` [PATCH v4 06/12] pack-protocol: fix maximum pkt-line size larsxschneider
2016-08-03 16:42     ` [PATCH v4 07/12] run-command: add clean_on_exit_handler larsxschneider
2016-08-03 21:24       ` Jeff King
2016-08-03 22:15         ` Lars Schneider
2016-08-03 22:53           ` Jeff King
2016-08-03 23:09             ` Lars Schneider
2016-08-03 23:15               ` Jeff King
2016-08-05 13:08                 ` Lars Schneider
2016-08-05 21:19                   ` Torsten Bögershausen
2016-08-05 21:50                     ` Lars Schneider
2016-08-03 16:42     ` [PATCH v4 08/12] convert: quote filter names in error messages larsxschneider
2016-08-03 16:42     ` [PATCH v4 09/12] convert: modernize tests larsxschneider
2016-08-03 16:42     ` [PATCH v4 10/12] convert: generate large test files only once larsxschneider
2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
2016-08-03 17:45       ` Junio C Hamano
2016-08-03 21:48         ` Lars Schneider
2016-08-03 22:46           ` Jeff King
2016-08-05 12:53             ` Lars Schneider
2016-08-03 20:29       ` Junio C Hamano
2016-08-03 21:37         ` Lars Schneider
2016-08-03 21:43           ` Junio C Hamano
2016-08-03 22:01             ` Lars Schneider
2016-08-05 21:34       ` Torsten Bögershausen
2016-08-05 21:49         ` Lars Schneider
2016-08-05 22:06         ` Junio C Hamano
2016-08-05 22:27           ` Jeff King
2016-08-06 11:55             ` Lars Schneider
2016-08-06 12:14               ` Jeff King
2016-08-06 18:19                 ` Lars Schneider
2016-08-08 15:02                   ` Jeff King
2016-08-08 16:21                     ` Lars Schneider
2016-08-08 16:26                       ` Jeff King
2016-08-06 20:40           ` Torsten Bögershausen
2016-08-03 16:42     ` [PATCH v4 12/12] convert: add filter.<driver>.process shutdown command option larsxschneider

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).