* [PATCH v6 00/13] Git filter protocol
@ 2016-08-25 11:07 larsxschneider
  2016-08-25 11:07 ` [PATCH v6 01/13] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
                   ` (13 more replies)
  0 siblings, 14 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

The goal of this series is to avoid launching a new clean/smudge filter
process for each file that is filtered.

A short summary about v1 to v5 can be found here:
https://git.github.io/rev_news/2016/08/17/edition-18/

This series is also published on the web:
https://github.com/larsxschneider/git/pull/10

Thanks a lot for your reviews,
Lars


## Changes since v5

### Peff
* http://public-inbox.org/git/20160810131541.ovpvgwdxjibae5gy%40sigill.intra.peff.net/
    * drop patch "pkt-line: add `gentle` parameter to format_packet()"
* http://public-inbox.org/git/20160810143321.q7mjirgr5ynml5ff@sigill.intra.peff.net/
    * drop patch "pkt-line: call packet_trace() only if a packet is actually send"
* http://public-inbox.org/git/20160810132814.gqnipsdwyzjmuqjy%40sigill.intra.peff.net/
    * make pkt-line write buffer static
    * replace PKTLINE_DATA_MAXLEN with sizeof(packet_write_buffer) - 4
      in pkt-line.c to make it easier to see that things are correct
* http://public-inbox.org/git/20160810133745.wagccvvf35o3pbwb%40sigill.intra.peff.net/
    * check max content length before using packet_write_fmt_gently()


### Junio
* http://public-inbox.org/git/xmqqpopg5yqf.fsf%40gitster.mtv.corp.google.com/
    *  change packet_write_gently_fmt() to packet_write_fmt_gently()
* http://public-inbox.org/git/xmqq8tw45vtg.fsf@gitster.mtv.corp.google.com/
    * rename packet_write() to packet_write_fmt() before adding new functions


### Stefan
* http://public-inbox.org/git/CAGZ79kboxgBRHSa2s7CKZ1Uo%3D13WT%3DrT8VHCNJNj_Q9jQzZAYw%40mail.gmail.com/
    * state more clearly that everything after version=2 is specific to version=2
    * do not duplicate protocol in commit message,
      summarize design decisions instead
    * use single line comment
    * remove unnecessary braces


### Dscho
* http://public-inbox.org/git/alpine.DEB.2.20.1608181617240.4924@virtualbox/
    * wrap commit messages at 72
    * s/CLOEXEX/CLOEXEC/


## Interdiff (v5..v6)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 6e563a6..6346700 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -393,13 +393,20 @@ standard input and standard output.

 Git starts the filter when it encounters the first file
 that needs to be cleaned or smudged. After the filter started
-Git sends a welcome message, a list of supported protocol
-version numbers, and a flush packet. Git expects to read the
-welcome message and one protocol version number from the
-previously sent list. Afterwards Git sends a list of supported
-capabilities and a flush packet. Git expects to read a list of
-desired capabilities, which must be a subset of the supported
-capabilities list, and a flush packet as response:
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate how the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of supported capabilities
+and a flush packet. Git expects to read a list of desired capabilities,
+which must be a subset of the supported capabilities list, and a flush
+packet as response:
 ------------------------
 packet:          git> git-filter-client
 packet:          git> version=2
diff --git a/convert.c b/convert.c
index e421f4a..362a0af 100644
--- a/convert.c
+++ b/convert.c
@@ -530,7 +530,9 @@ static int packet_write_list(int fd, const char *line, ...)
  {
    if (!line)
      break;
-   err = packet_write_gently_fmt(fd, "%s", line);
+   if (strlen(line) > PKTLINE_DATA_MAXLEN)
+     return -1;
+   err = packet_write_fmt_gently(fd, "%s", line);
    if (err)
      return err;
    line = va_arg(args, const char*);
@@ -686,11 +688,18 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len

  sigchain_push(SIGPIPE, SIG_IGN);

- err = packet_write_gently_fmt(process->in, "command=%s\n", filter_type);
+
+ err = (strlen(filter_type) > PKTLINE_DATA_MAXLEN);
+ if (err)
+   goto done;
+ err = packet_write_fmt_gently(process->in, "command=%s\n", filter_type);
  if (err)
    goto done;

- err = packet_write_gently_fmt(process->in, "pathname=%s\n", path);
+ err = (strlen(path) > PKTLINE_DATA_MAXLEN);
+ if (err)
+   goto done;
+ err = packet_write_fmt_gently(process->in, "pathname=%s\n", path);
  if (err)
    goto done;

@@ -722,9 +731,7 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len

  if (err || errno == EPIPE) {
    if (!strcmp(filter_status.buf, "error")) {
-     /*
-        * The filter signaled a problem with the file.
-        */
+     /* The filter signaled a problem with the file. */
    } else if (!strcmp(filter_status.buf, "error-all")) {
      /*
       * The filter signaled a permanent problem. Don't try to filter
@@ -818,9 +825,8 @@ static int read_convert_config(const char *var, const char *value, void *cb)
  if (!strcmp("clean", key))
    return git_config_string(&drv->clean, var, value);

- if (!strcmp("process", key)) {
+ if (!strcmp("process", key))
    return git_config_string(&drv->process, var, value);
- }

  if (!strcmp("required", key)) {
    drv->required = git_config_bool(var, value);
diff --git a/pkt-line.c b/pkt-line.c
index b98e291..3033aa3 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -3,7 +3,7 @@
 #include "run-command.h"

 char packet_buffer[LARGE_PACKET_MAX];
-char packet_write_buffer[LARGE_PACKET_MAX];
+static char packet_write_buffer[LARGE_PACKET_MAX];
 static const char *packet_trace_prefix = "git";
 static struct trace_key trace_packet = TRACE_KEY_INIT(PACKET);
 static struct trace_key trace_pack = TRACE_KEY_INIT(PACKFILE);
@@ -107,6 +107,7 @@ void packet_buf_flush(struct strbuf *buf)
 static void set_packet_header(char *buf, const int size)
 {
  static char hexchar[] = "0123456789abcdef";
+
  #define hex(a) (hexchar[(a) & 15])
  buf[0] = hex(size >> 12);
  buf[1] = hex(size >> 8);
@@ -115,7 +116,7 @@ static void set_packet_header(char *buf, const int size)
  #undef hex
 }

-static int format_packet(int gentle, struct strbuf *out, const char *fmt, va_list args)
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 {
  size_t orig_len, n;

@@ -124,15 +125,11 @@ static int format_packet(int gentle, struct strbuf *out, const char *fmt, va_lis
  strbuf_vaddf(out, fmt, args);
  n = out->len - orig_len;

- if (n > LARGE_PACKET_MAX) {
-   if (gentle)
-     return -1;
-   else
-     die("protocol error: impossibly long line");
- }
+ if (n > LARGE_PACKET_MAX)
+   die("protocol error: impossibly long line");

  set_packet_header(&out->buf[orig_len], n);
- return 0;
+ packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }

 void packet_write_fmt(int fd, const char *fmt, ...)
@@ -142,28 +139,26 @@ void packet_write_fmt(int fd, const char *fmt, ...)

  strbuf_reset(&buf);
  va_start(args, fmt);
- format_packet(0, &buf, fmt, args);
+ format_packet(&buf, fmt, args);
  va_end(args);
- packet_trace(buf.buf + 4, buf.len - 4, 1);
  write_or_die(fd, buf.buf, buf.len);
 }

-int packet_write_gently_fmt(int fd, const char *fmt, ...)
+int packet_write_fmt_gently(int fd, const char *fmt, ...)
 {
  static struct strbuf buf = STRBUF_INIT;
  va_list args;

  strbuf_reset(&buf);
  va_start(args, fmt);
- format_packet(1, &buf, fmt, args);
+ format_packet(&buf, fmt, args);
  va_end(args);
- packet_trace(buf.buf + 4, buf.len - 4, 1);
  return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
 }

 int packet_write_gently(const int fd_out, const char *buf, size_t size)
 {
- if (size > PKTLINE_DATA_MAXLEN)
+ if (size > sizeof(packet_write_buffer) - 4)
    return -1;
  packet_trace(buf, size, 1);
  memmove(packet_write_buffer + 4, buf, size);
@@ -177,7 +172,7 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
  va_list args;

  va_start(args, fmt);
- format_packet(0, buf, fmt, args);
+ format_packet(buf, fmt, args);
  va_end(args);
 }

@@ -185,13 +180,14 @@ int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out)
 {
  int err = 0;
  ssize_t bytes_to_write;
+
  while (!err) {
-   bytes_to_write = xread(fd_in, packet_write_buffer, PKTLINE_DATA_MAXLEN);
+   bytes_to_write = xread(fd_in, packet_write_buffer, sizeof(packet_write_buffer) - 4);
    if (bytes_to_write < 0)
      return COPY_READ_ERROR;
    if (bytes_to_write == 0)
      break;
-   if (bytes_to_write > PKTLINE_DATA_MAXLEN)
+   if (bytes_to_write > sizeof(packet_write_buffer) - 4)
      return COPY_WRITE_ERROR;
    err = packet_write_gently(fd_out, packet_write_buffer, bytes_to_write);
  }
@@ -205,9 +201,10 @@ int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int
  int err = 0;
  size_t bytes_written = 0;
  size_t bytes_to_write;
+
  while (!err) {
-   if ((len - bytes_written) > PKTLINE_DATA_MAXLEN)
-     bytes_to_write = PKTLINE_DATA_MAXLEN;
+   if ((len - bytes_written) > sizeof(packet_write_buffer) - 4)
+     bytes_to_write = sizeof(packet_write_buffer) - 4;
    else
      bytes_to_write = len - bytes_written;
    if (bytes_to_write == 0)
diff --git a/pkt-line.h b/pkt-line.h
index ddd6041..6a3b7cf 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -24,7 +24,7 @@ void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (print
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 int packet_flush_gently(int fd);
-int packet_write_gently_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out);
 int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out);



Lars Schneider (13):
  pkt-line: rename packet_write() to packet_write_fmt()
  pkt-line: extract set_packet_header()
  pkt-line: add packet_write_fmt_gently()
  pkt-line: add packet_flush_gently()
  pkt-line: add packet_write_gently()
  pkt-line: add functions to read/write flush terminated packet streams
  pack-protocol: fix maximum pkt-line size
  convert: quote filter names in error messages
  convert: modernize tests
  convert: generate large test files only once
  convert: make apply_filter() adhere to standard Git error handling
  convert: add filter.<driver>.process option
  read-cache: make sure file handles are not inherited by child
    processes

 Documentation/gitattributes.txt             | 146 ++++++++-
 Documentation/technical/protocol-common.txt |   6 +-
 builtin/archive.c                           |   4 +-
 builtin/receive-pack.c                      |   4 +-
 builtin/remote-ext.c                        |   4 +-
 builtin/upload-archive.c                    |   4 +-
 connect.c                                   |   2 +-
 convert.c                                   | 373 ++++++++++++++++++---
 daemon.c                                    |   2 +-
 http-backend.c                              |   2 +-
 pkt-line.c                                  | 142 +++++++-
 pkt-line.h                                  |  12 +-
 read-cache.c                                |   2 +-
 shallow.c                                   |   2 +-
 t/t0021-conversion.sh                       | 482 +++++++++++++++++++++++++---
 t/t0021/rot13-filter.pl                     | 176 ++++++++++
 unpack-trees.c                              |   1 +
 upload-pack.c                               |  30 +-
 18 files changed, 1272 insertions(+), 122 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

--
2.9.2



* [PATCH v6 01/13] pkt-line: rename packet_write() to packet_write_fmt()
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 11:07 ` [PATCH v6 02/13] pkt-line: extract set_packet_header() larsxschneider
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write() should be called packet_write_fmt() as the string
parameter can be formatted.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 builtin/archive.c        |  4 ++--
 builtin/receive-pack.c   |  4 ++--
 builtin/remote-ext.c     |  4 ++--
 builtin/upload-archive.c |  4 ++--
 connect.c                |  2 +-
 daemon.c                 |  2 +-
 http-backend.c           |  2 +-
 pkt-line.c               |  2 +-
 pkt-line.h               |  2 +-
 shallow.c                |  2 +-
 upload-pack.c            | 30 +++++++++++++++---------------
 11 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/builtin/archive.c b/builtin/archive.c
index a1e3b94..49f4914 100644
--- a/builtin/archive.c
+++ b/builtin/archive.c
@@ -47,10 +47,10 @@ static int run_remote_archiver(int argc, const char **argv,
 	if (name_hint) {
 		const char *format = archive_format_from_filename(name_hint);
 		if (format)
-			packet_write(fd[1], "argument --format=%s\n", format);
+			packet_write_fmt(fd[1], "argument --format=%s\n", format);
 	}
 	for (i = 1; i < argc; i++)
-		packet_write(fd[1], "argument %s\n", argv[i]);
+		packet_write_fmt(fd[1], "argument %s\n", argv[i]);
 	packet_flush(fd[1]);
 
 	buf = packet_read_line(fd[0], NULL);
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 15c323a..d60a412 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -199,7 +199,7 @@ static int receive_pack_config(const char *var, const char *value, void *cb)
 static void show_ref(const char *path, const unsigned char *sha1)
 {
 	if (sent_capabilities) {
-		packet_write(1, "%s %s\n", sha1_to_hex(sha1), path);
+		packet_write_fmt(1, "%s %s\n", sha1_to_hex(sha1), path);
 	} else {
 		struct strbuf cap = STRBUF_INIT;
 
@@ -212,7 +212,7 @@ static void show_ref(const char *path, const unsigned char *sha1)
 		if (push_cert_nonce)
 			strbuf_addf(&cap, " push-cert=%s", push_cert_nonce);
 		strbuf_addf(&cap, " agent=%s", git_user_agent_sanitized());
-		packet_write(1, "%s %s%c%s\n",
+		packet_write_fmt(1, "%s %s%c%s\n",
 			     sha1_to_hex(sha1), path, 0, cap.buf);
 		strbuf_release(&cap);
 		sent_capabilities = 1;
diff --git a/builtin/remote-ext.c b/builtin/remote-ext.c
index 88eb8f9..11b48bf 100644
--- a/builtin/remote-ext.c
+++ b/builtin/remote-ext.c
@@ -128,9 +128,9 @@ static void send_git_request(int stdin_fd, const char *serv, const char *repo,
 	const char *vhost)
 {
 	if (!vhost)
-		packet_write(stdin_fd, "%s %s%c", serv, repo, 0);
+		packet_write_fmt(stdin_fd, "%s %s%c", serv, repo, 0);
 	else
-		packet_write(stdin_fd, "%s %s%chost=%s%c", serv, repo, 0,
+		packet_write_fmt(stdin_fd, "%s %s%chost=%s%c", serv, repo, 0,
 			     vhost, 0);
 }
 
diff --git a/builtin/upload-archive.c b/builtin/upload-archive.c
index 2caedf1..dc872f6 100644
--- a/builtin/upload-archive.c
+++ b/builtin/upload-archive.c
@@ -88,11 +88,11 @@ int cmd_upload_archive(int argc, const char **argv, const char *prefix)
 	writer.git_cmd = 1;
 	if (start_command(&writer)) {
 		int err = errno;
-		packet_write(1, "NACK unable to spawn subprocess\n");
+		packet_write_fmt(1, "NACK unable to spawn subprocess\n");
 		die("upload-archive: %s", strerror(err));
 	}
 
-	packet_write(1, "ACK\n");
+	packet_write_fmt(1, "ACK\n");
 	packet_flush(1);
 
 	while (1) {
diff --git a/connect.c b/connect.c
index 722dc3f..5330d9c 100644
--- a/connect.c
+++ b/connect.c
@@ -730,7 +730,7 @@ struct child_process *git_connect(int fd[2], const char *url,
 		 * Note: Do not add any other headers here!  Doing so
 		 * will cause older git-daemon servers to crash.
 		 */
-		packet_write(fd[1],
+		packet_write_fmt(fd[1],
 			     "%s %s%chost=%s%c",
 			     prog, path, 0,
 			     target_host, 0);
diff --git a/daemon.c b/daemon.c
index e647254..2646d0f 100644
--- a/daemon.c
+++ b/daemon.c
@@ -281,7 +281,7 @@ static int daemon_error(const char *dir, const char *msg)
 {
 	if (!informative_errors)
 		msg = "access denied or repository not exported";
-	packet_write(1, "ERR %s: %s", msg, dir);
+	packet_write_fmt(1, "ERR %s: %s", msg, dir);
 	return -1;
 }
 
diff --git a/http-backend.c b/http-backend.c
index 0d59499..aa61c18 100644
--- a/http-backend.c
+++ b/http-backend.c
@@ -460,7 +460,7 @@ static void get_info_refs(char *arg)
 		hdr_str(content_type, buf.buf);
 		end_headers();
 
-		packet_write(1, "# service=git-%s\n", svc->name);
+		packet_write_fmt(1, "# service=git-%s\n", svc->name);
 		packet_flush(1);
 
 		argv[0] = svc->name;
diff --git a/pkt-line.c b/pkt-line.c
index 62fdb37..0a9b61c 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -118,7 +118,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-void packet_write(int fd, const char *fmt, ...)
+void packet_write_fmt(int fd, const char *fmt, ...)
 {
 	static struct strbuf buf = STRBUF_INIT;
 	va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 3cb9d91..1902fb3 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -20,7 +20,7 @@
  * side can't, we stay with pure read/write interfaces.
  */
 void packet_flush(int fd);
-void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
diff --git a/shallow.c b/shallow.c
index 4d554ca..1826114 100644
--- a/shallow.c
+++ b/shallow.c
@@ -260,7 +260,7 @@ static int advertise_shallow_grafts_cb(const struct commit_graft *graft, void *c
 {
 	int fd = *(int *)cb;
 	if (graft->nr_parent == -1)
-		packet_write(fd, "shallow %s\n", oid_to_hex(&graft->oid));
+		packet_write_fmt(fd, "shallow %s\n", oid_to_hex(&graft->oid));
 	return 0;
 }
 
diff --git a/upload-pack.c b/upload-pack.c
index d4cc414..8e77b00 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -393,13 +393,13 @@ static int get_common_commits(void)
 			if (multi_ack == 2 && got_common
 			    && !got_other && ok_to_give_up()) {
 				sent_ready = 1;
-				packet_write(1, "ACK %s ready\n", last_hex);
+				packet_write_fmt(1, "ACK %s ready\n", last_hex);
 			}
 			if (have_obj.nr == 0 || multi_ack)
-				packet_write(1, "NAK\n");
+				packet_write_fmt(1, "NAK\n");
 
 			if (no_done && sent_ready) {
-				packet_write(1, "ACK %s\n", last_hex);
+				packet_write_fmt(1, "ACK %s\n", last_hex);
 				return 0;
 			}
 			if (stateless_rpc)
@@ -416,20 +416,20 @@ static int get_common_commits(void)
 					const char *hex = sha1_to_hex(sha1);
 					if (multi_ack == 2) {
 						sent_ready = 1;
-						packet_write(1, "ACK %s ready\n", hex);
+						packet_write_fmt(1, "ACK %s ready\n", hex);
 					} else
-						packet_write(1, "ACK %s continue\n", hex);
+						packet_write_fmt(1, "ACK %s continue\n", hex);
 				}
 				break;
 			default:
 				got_common = 1;
 				memcpy(last_hex, sha1_to_hex(sha1), 41);
 				if (multi_ack == 2)
-					packet_write(1, "ACK %s common\n", last_hex);
+					packet_write_fmt(1, "ACK %s common\n", last_hex);
 				else if (multi_ack)
-					packet_write(1, "ACK %s continue\n", last_hex);
+					packet_write_fmt(1, "ACK %s continue\n", last_hex);
 				else if (have_obj.nr == 1)
-					packet_write(1, "ACK %s\n", last_hex);
+					packet_write_fmt(1, "ACK %s\n", last_hex);
 				break;
 			}
 			continue;
@@ -437,10 +437,10 @@ static int get_common_commits(void)
 		if (!strcmp(line, "done")) {
 			if (have_obj.nr > 0) {
 				if (multi_ack)
-					packet_write(1, "ACK %s\n", last_hex);
+					packet_write_fmt(1, "ACK %s\n", last_hex);
 				return 0;
 			}
-			packet_write(1, "NAK\n");
+			packet_write_fmt(1, "NAK\n");
 			return -1;
 		}
 		die("git upload-pack: expected SHA1 list, got '%s'", line);
@@ -650,7 +650,7 @@ static void receive_needs(void)
 		while (result) {
 			struct object *object = &result->item->object;
 			if (!(object->flags & (CLIENT_SHALLOW|NOT_SHALLOW))) {
-				packet_write(1, "shallow %s",
+				packet_write_fmt(1, "shallow %s",
 						oid_to_hex(&object->oid));
 				register_shallow(object->oid.hash);
 				shallow_nr++;
@@ -662,7 +662,7 @@ static void receive_needs(void)
 			struct object *object = shallows.objects[i].item;
 			if (object->flags & NOT_SHALLOW) {
 				struct commit_list *parents;
-				packet_write(1, "unshallow %s",
+				packet_write_fmt(1, "unshallow %s",
 					oid_to_hex(&object->oid));
 				object->flags &= ~CLIENT_SHALLOW;
 				/* make sure the real parents are parsed */
@@ -741,7 +741,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 		struct strbuf symref_info = STRBUF_INIT;
 
 		format_symref_info(&symref_info, cb_data);
-		packet_write(1, "%s %s%c%s%s%s%s%s agent=%s\n",
+		packet_write_fmt(1, "%s %s%c%s%s%s%s%s agent=%s\n",
 			     oid_to_hex(oid), refname_nons,
 			     0, capabilities,
 			     (allow_unadvertised_object_request & ALLOW_TIP_SHA1) ?
@@ -753,11 +753,11 @@ static int send_ref(const char *refname, const struct object_id *oid,
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 	} else {
-		packet_write(1, "%s %s\n", oid_to_hex(oid), refname_nons);
+		packet_write_fmt(1, "%s %s\n", oid_to_hex(oid), refname_nons);
 	}
 	capabilities = NULL;
 	if (!peel_ref(refname, peeled.hash))
-		packet_write(1, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
+		packet_write_fmt(1, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
 	return 0;
 }
 
-- 
2.9.2



* [PATCH v6 02/13] pkt-line: extract set_packet_header()
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
  2016-08-25 11:07 ` [PATCH v6 01/13] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 11:07 ` [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently() larsxschneider
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

set_packet_header() converts an integer to a 4 byte hex string. Make
this function locally available so that other pkt-line functions can
use it.
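
For illustration, a standalone mimic of the helper (the real function
stays static inside pkt-line.c, so this is not code from the patch):

    #include <stdio.h>

    /* Same encoding as set_packet_header(): 4 hex digits of total length. */
    static void mimic_set_packet_header(char *buf, const int size)
    {
            static const char hexchar[] = "0123456789abcdef";

            buf[0] = hexchar[(size >> 12) & 15];
            buf[1] = hexchar[(size >> 8) & 15];
            buf[2] = hexchar[(size >> 4) & 15];
            buf[3] = hexchar[size & 15];
    }

    int main(void)
    {
            char hdr[5] = { 0 };

            /* 26 payload bytes + 4 header bytes = 30 total -> "001e" */
            mimic_set_packet_header(hdr, 30);
            printf("%s\n", hdr);
            return 0;
    }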

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 0a9b61c..e8adc0f 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -97,10 +97,20 @@ void packet_buf_flush(struct strbuf *buf)
 	strbuf_add(buf, "0000", 4);
 }
 
-#define hex(a) (hexchar[(a) & 15])
-static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+static void set_packet_header(char *buf, const int size)
 {
 	static char hexchar[] = "0123456789abcdef";
+
+	#define hex(a) (hexchar[(a) & 15])
+	buf[0] = hex(size >> 12);
+	buf[1] = hex(size >> 8);
+	buf[2] = hex(size >> 4);
+	buf[3] = hex(size);
+	#undef hex
+}
+
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+{
 	size_t orig_len, n;
 
 	orig_len = out->len;
@@ -111,10 +121,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	if (n > LARGE_PACKET_MAX)
 		die("protocol error: impossibly long line");
 
-	out->buf[orig_len + 0] = hex(n >> 12);
-	out->buf[orig_len + 1] = hex(n >> 8);
-	out->buf[orig_len + 2] = hex(n >> 4);
-	out->buf[orig_len + 3] = hex(n);
+	set_packet_header(&out->buf[orig_len], n);
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-- 
2.9.2



* [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
  2016-08-25 11:07 ` [PATCH v6 01/13] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
  2016-08-25 11:07 ` [PATCH v6 02/13] pkt-line: extract set_packet_header() larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 18:12   ` Stefan Beller
  2016-08-25 21:41   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 04/13] pkt-line: add packet_flush_gently() larsxschneider
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_fmt() would die in case of a write error even though for
some callers an error would be acceptable. Add packet_write_fmt_gently()
which writes a formatted pkt-line and returns `0` for success and `-1`
for an error.
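
For illustration, a minimal caller sketch; send_command() is a made-up
name for this example and not part of the patch:

    #include "cache.h"
    #include "pkt-line.h"

    /* Hypothetical caller that must not die() when the peer goes away. */
    static int send_command(int fd, const char *cmd)
    {
            if (packet_write_fmt_gently(fd, "command=%s\n", cmd))
                    return error("could not write command packet");
            return 0;
    }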

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 12 ++++++++++++
 pkt-line.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index e8adc0f..3e8b2fb 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -137,6 +137,18 @@ void packet_write_fmt(int fd, const char *fmt, ...)
 	write_or_die(fd, buf.buf, buf.len);
 }
 
+int packet_write_fmt_gently(int fd, const char *fmt, ...)
+{
+	static struct strbuf buf = STRBUF_INIT;
+	va_list args;
+
+	strbuf_reset(&buf);
+	va_start(args, fmt);
+	format_packet(&buf, fmt, args);
+	va_end(args);
+	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
 	va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 1902fb3..3caea77 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
-- 
2.9.2



* [PATCH v6 04/13] pkt-line: add packet_flush_gently()
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (2 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently() larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 11:07 ` [PATCH v6 05/13] pkt-line: add packet_write_gently() larsxschneider
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_flush() would die in case of a write error even though for some
callers an error would be acceptable. Add packet_flush_gently() which
writes a pkt-line flush packet and returns `0` for success and `-1` for
failure.
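
As a hedged sketch (send_request() is an invented name; the packet keys
mirror those used later in this series):

    #include "cache.h"
    #include "pkt-line.h"

    /* Hypothetical: write a short packet list and terminate it gently. */
    static int send_request(int fd, const char *path)
    {
            if (packet_write_fmt_gently(fd, "command=smudge\n") ||
                packet_write_fmt_gently(fd, "pathname=%s\n", path) ||
                packet_flush_gently(fd))
                    return -1; /* let the caller decide how to recover */
            return 0;
    }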

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 6 ++++++
 pkt-line.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 3e8b2fb..cad26df 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -91,6 +91,12 @@ void packet_flush(int fd)
 	write_or_die(fd, "0000", 4);
 }
 
+int packet_flush_gently(int fd)
+{
+	packet_trace("0000", 4, 1);
+	return (write_in_full(fd, "0000", 4) == 4 ? 0 : -1);
+}
+
 void packet_buf_flush(struct strbuf *buf)
 {
 	packet_trace("0000", 4, 1);
diff --git a/pkt-line.h b/pkt-line.h
index 3caea77..3fa0899 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_flush_gently(int fd);
 int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
 /*
-- 
2.9.2



* [PATCH v6 05/13] pkt-line: add packet_write_gently()
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (3 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 04/13] pkt-line: add packet_flush_gently() larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 21:50   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_fmt() has two shortcomings. First, it uses format_packet()
which only lets the caller send string data via "%s". That means it
cannot be used for arbitrary data that may contain NULs. Second, it will
always die on error.

Add packet_write_gently() which writes arbitrary data and returns `0`
for success and `-1` for an error. This function is used by other
pkt-line functions in a subsequent patch.
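
A hedged usage sketch; send_chunk() is an invented helper name, and the
point is that the payload may contain NUL bytes while the caller gets a
return code instead of a die():

    #include "cache.h"
    #include "pkt-line.h"

    /* Hypothetical: send one binary-safe chunk as a single pkt-line. */
    static int send_chunk(int fd, const char *data, size_t len)
    {
            if (packet_write_gently(fd, data, len))
                    return error("could not write %"PRIuMAX" bytes",
                                 (uintmax_t)len);
            return 0;
    }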

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index cad26df..7e8a803 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 
 char packet_buffer[LARGE_PACKET_MAX];
+static char packet_write_buffer[LARGE_PACKET_MAX];
 static const char *packet_trace_prefix = "git";
 static struct trace_key trace_packet = TRACE_KEY_INIT(PACKET);
 static struct trace_key trace_pack = TRACE_KEY_INIT(PACKFILE);
@@ -155,6 +156,17 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
 	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
 }
 
+int packet_write_gently(const int fd_out, const char *buf, size_t size)
+{
+	if (size > sizeof(packet_write_buffer) - 4)
+		return -1;
+	packet_trace(buf, size, 1);
+	memmove(packet_write_buffer + 4, buf, size);
+	size += 4;
+	set_packet_header(packet_write_buffer, size);
+	return (write_in_full(fd_out, packet_write_buffer, size) == size ? 0 : -1);
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
 	va_list args;
-- 
2.9.2



* [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (4 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 05/13] pkt-line: add packet_write_gently() larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 18:46   ` Stefan Beller
  2016-08-25 22:27   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size larsxschneider
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_stream_with_flush_from_fd() and
packet_write_stream_with_flush_from_buf() write a stream of packets. All
content packets use the maximal packet size except for the last one.
After the last content packet a `flush` control packet is written.

packet_read_till_flush() reads arbitrary sized packets until it detects
a `flush` packet.
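
A hedged round-trip sketch, simplified from what convert.c does later in
this series; filter_roundtrip() is an invented name and real callers also
need status handling:

    #include "cache.h"
    #include "pkt-line.h"

    /*
     * Send len bytes from src as a flush-terminated packet stream, then
     * collect the flush-terminated response into out.
     */
    static int filter_roundtrip(int fd_to_filter, int fd_from_filter,
                                const char *src, size_t len,
                                struct strbuf *out)
    {
            if (packet_write_stream_with_flush_from_buf(src, len, fd_to_filter))
                    return -1;
            if (packet_read_till_flush(fd_from_filter, out) < 0)
                    return -1;
            return 0;
    }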

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 pkt-line.h |  7 +++++
 2 files changed, 98 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 7e8a803..3033aa3 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -176,6 +176,47 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 	va_end(args);
 }
 
+int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out)
+{
+	int err = 0;
+	ssize_t bytes_to_write;
+
+	while (!err) {
+		bytes_to_write = xread(fd_in, packet_write_buffer, sizeof(packet_write_buffer) - 4);
+		if (bytes_to_write < 0)
+			return COPY_READ_ERROR;
+		if (bytes_to_write == 0)
+			break;
+		if (bytes_to_write > sizeof(packet_write_buffer) - 4)
+			return COPY_WRITE_ERROR;
+		err = packet_write_gently(fd_out, packet_write_buffer, bytes_to_write);
+	}
+	if (!err)
+		err = packet_flush_gently(fd_out);
+	return err;
+}
+
+int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out)
+{
+	int err = 0;
+	size_t bytes_written = 0;
+	size_t bytes_to_write;
+
+	while (!err) {
+		if ((len - bytes_written) > sizeof(packet_write_buffer) - 4)
+			bytes_to_write = sizeof(packet_write_buffer) - 4;
+		else
+			bytes_to_write = len - bytes_written;
+		if (bytes_to_write == 0)
+			break;
+		err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write);
+		bytes_written += bytes_to_write;
+	}
+	if (!err)
+		err = packet_flush_gently(fd_out);
+	return err;
+}
+
 static int get_packet_data(int fd, char **src_buf, size_t *src_size,
 			   void *dst, unsigned size, int options)
 {
@@ -285,3 +326,53 @@ char *packet_read_line_buf(char **src, size_t *src_len, int *dst_len)
 {
 	return packet_read_line_generic(-1, src, src_len, dst_len);
 }
+
+ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out)
+{
+	int len, ret;
+	int options = PACKET_READ_GENTLE_ON_EOF;
+	char linelen[4];
+
+	size_t oldlen = sb_out->len;
+	size_t oldalloc = sb_out->alloc;
+
+	for (;;) {
+		/* Read packet header */
+		ret = get_packet_data(fd_in, NULL, NULL, linelen, 4, options);
+		if (ret < 0)
+			goto done;
+		len = packet_length(linelen);
+		if (len < 0)
+			die("protocol error: bad line length character: %.4s", linelen);
+		if (!len) {
+			/* Found a flush packet - Done! */
+			packet_trace("0000", 4, 0);
+			break;
+		}
+		len -= 4;
+
+		/* Read packet content */
+		strbuf_grow(sb_out, len);
+		ret = get_packet_data(fd_in, NULL, NULL, sb_out->buf + sb_out->len, len, options);
+		if (ret < 0)
+			goto done;
+
+		if (ret != len) {
+			error("protocol error: incomplete read (expected %d, got %d)", len, ret);
+			goto done;
+		}
+
+		packet_trace(sb_out->buf + sb_out->len, len, 0);
+		sb_out->len += len;
+	}
+
+done:
+	if (ret < 0) {
+		if (oldalloc == 0)
+			strbuf_release(sb_out);
+		else
+			strbuf_setlen(sb_out, oldlen);
+		return ret;  /* unexpected EOF */
+	}
+	return sb_out->len - oldlen;
+}
diff --git a/pkt-line.h b/pkt-line.h
index 3fa0899..9616117 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -25,6 +25,8 @@ void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 int packet_flush_gently(int fd);
 int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out);
+int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
@@ -77,6 +79,11 @@ char *packet_read_line(int fd, int *size);
  */
 char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
 
+/*
+ * Reads a stream of variable sized packets until a flush packet is detected.
+ */
+ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out);
+
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
 extern char packet_buffer[LARGE_PACKET_MAX];
-- 
2.9.2



* [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (5 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 18:59   ` Stefan Beller
  2016-08-25 11:07 ` [PATCH v6 08/13] convert: quote filter names in error messages larsxschneider
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

According to LARGE_PACKET_MAX in pkt-line.h, the maximum length of a
pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
therefore the pkt-line data component must not exceed 65516 bytes.
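
In other words (LARGE_PACKET_DATA_MAX is shown only for illustration; the
series itself writes `sizeof(packet_write_buffer) - 4` instead):

    #define LARGE_PACKET_MAX      65520                  /* 4-byte header + payload */
    #define LARGE_PACKET_DATA_MAX (LARGE_PACKET_MAX - 4) /* at most 65516 payload bytes */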

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/technical/protocol-common.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
index bf30167..ecedb34 100644
--- a/Documentation/technical/protocol-common.txt
+++ b/Documentation/technical/protocol-common.txt
@@ -67,9 +67,9 @@ with non-binary data the same whether or not they contain the trailing
 LF (stripping the LF if present, and not complaining when it is
 missing).
 
-The maximum length of a pkt-line's data component is 65520 bytes.
-Implementations MUST NOT send pkt-line whose length exceeds 65524
-(65520 bytes of payload + 4 bytes of length data).
+The maximum length of a pkt-line's data component is 65516 bytes.
+Implementations MUST NOT send pkt-line whose length exceeds 65520
+(65516 bytes of payload + 4 bytes of length data).
 
 Implementations SHOULD NOT send an empty pkt-line ("0004").
 
-- 
2.9.2



* [PATCH v6 08/13] convert: quote filter names in error messages
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (6 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-26 19:45   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 09/13] convert: modernize tests larsxschneider
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to
read in error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b1614bf..522e2c5 100644
--- a/convert.c
+++ b/convert.c
@@ -397,7 +397,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -415,13 +415,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -462,15 +462,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.9.2



* [PATCH v6 09/13] convert: modernize tests
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (7 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 08/13] convert: quote filter names in error messages larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-26 20:03   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 10/13] convert: generate large test files only once larsxschneider
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">" and "<".

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 62 +++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..7b45136 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -13,8 +13,8 @@ EOF
 chmod +x rot13.sh
 
 test_expect_success setup '
-	git config filter.rot13.smudge ./rot13.sh &&
-	git config filter.rot13.clean ./rot13.sh &&
+	test_config filter.rot13.smudge ./rot13.sh &&
+	test_config filter.rot13.clean ./rot13.sh &&
 
 	{
 	    echo "*.t filter=rot13"
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh <test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.9.2



* [PATCH v6 10/13] convert: generate large test files only once
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (8 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 09/13] convert: modernize tests larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 19:17   ` Stefan Beller
  2016-08-29 17:46   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 11/13] convert: make apply_filter() adhere to standard Git error handling larsxschneider
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Generate more interesting large test files with pseudo-random characters
interspersed and reuse these test files in multiple tests. Run tests
formerly marked as EXPENSIVE every time, but with a smaller data set.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7b45136..34c8eb9 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -4,6 +4,15 @@ test_description='blob conversion via gitattributes'
 
 . ./test-lib.sh
 
+if test_have_prereq EXPENSIVE
+then
+	T0021_LARGE_FILE_SIZE=2048
+	T0021_LARGISH_FILE_SIZE=100
+else
+	T0021_LARGE_FILE_SIZE=30
+	T0021_LARGISH_FILE_SIZE=2
+fi
+
 cat <<EOF >rot13.sh
 #!$SHELL_PATH
 tr \
@@ -31,7 +40,26 @@ test_expect_success setup '
 	cat test >test.i &&
 	git add test test.t test.i &&
 	rm -f test test.t test.i &&
-	git checkout -- test test.t test.i
+	git checkout -- test test.t test.i &&
+
+	mkdir generated-test-data &&
+	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
+	do
+		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
+		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
+		# Generate 1MB of empty data and 100 bytes of random characters
+		# printf "$(test-genrandom start $i)"
+		printf "%1048576d" 1 >>generated-test-data/large.file &&
+		printf "$RANDOM_STRING" >>generated-test-data/large.file &&
+		printf "%1048576d" 1 >>generated-test-data/large.file.rot13 &&
+		printf "$ROT_RANDOM_STRING" >>generated-test-data/large.file.rot13 &&
+
+		if test $i = $T0021_LARGISH_FILE_SIZE
+		then
+			cat generated-test-data/large.file >generated-test-data/largish.file &&
+			cat generated-test-data/large.file.rot13 >generated-test-data/largish.file.rot13
+		fi
+	done
 '
 
 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -199,9 +227,9 @@ test_expect_success 'required filter clean failure' '
 test_expect_success 'filtering large input to small output should use little memory' '
 	test_config filter.devnull.clean "cat >/dev/null" &&
 	test_config filter.devnull.required true &&
-	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
-	echo "30MB filter=devnull" >.gitattributes &&
-	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
+	cp generated-test-data/large.file large.file &&
+	echo "large.file filter=devnull" >.gitattributes &&
+	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add large.file
 '
 
 test_expect_success 'filter that does not read is fine' '
@@ -214,15 +242,15 @@ test_expect_success 'filter that does not read is fine' '
 	test_cmp expect actual
 '
 
-test_expect_success EXPENSIVE 'filter large file' '
+test_expect_success 'filter large file' '
 	test_config filter.largefile.smudge cat &&
 	test_config filter.largefile.clean cat &&
-	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
-	echo "2GB filter=largefile" >.gitattributes &&
-	git add 2GB 2>err &&
+	echo "large.file filter=largefile" >.gitattributes &&
+	cp generated-test-data/large.file large.file &&
+	git add large.file 2>err &&
 	test_must_be_empty err &&
-	rm -f 2GB &&
-	git checkout -- 2GB 2>err &&
+	rm -f large.file &&
+	git checkout -- large.file 2>err &&
 	test_must_be_empty err
 '
 
-- 
2.9.2



* [PATCH v6 11/13] convert: make apply_filter() adhere to standard Git error handling
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (9 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 10/13] convert: generate large test files only once larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-25 11:07 ` [PATCH v6 12/13] convert: add filter.<driver>.process option larsxschneider
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

apply_filter() returns a boolean that tells the caller if it
"did convert or did not convert". The variable `ret` was used throughout
the function to track errors, where `1` denoted success and `0`
failure. This is unusual for the Git source, where `0` denotes success.

Rename the variable and flip its value to make the function easier
to read for Git developers.
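
The convention the function moves towards, as a minimal sketch;
step_one(), step_two() and do_work() are placeholders, not functions
from this patch:

    /* Placeholders for illustration only. */
    static int step_one(void);
    static int step_two(void);

    /* Git convention: return 0 on success and a negative value on failure. */
    static int do_work(void)
    {
            int err = 0;

            if (step_one())
                    err = -1;
            if (!err && step_two())
                    err = -1;
            return err;
    }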

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index 522e2c5..bd17340 100644
--- a/convert.c
+++ b/convert.c
@@ -436,7 +436,7 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	 *
 	 * (child --> cmd) --> us
 	 */
-	int ret = 1;
+	int err = 0;
 	struct strbuf nbuf = STRBUF_INIT;
 	struct async async;
 	struct filter_params params;
@@ -463,22 +463,22 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
 		error("read from external filter '%s' failed", cmd);
-		ret = 0;
+		err = -1;
 	}
 	if (close(async.out)) {
 		error("read from external filter '%s' failed", cmd);
-		ret = 0;
+		err = -1;
 	}
 	if (finish_async(&async)) {
 		error("external filter '%s' failed", cmd);
-		ret = 0;
+		err = -1;
 	}
 
-	if (ret) {
+	if (!err) {
 		strbuf_swap(dst, &nbuf);
 	}
 	strbuf_release(&nbuf);
-	return ret;
+	return !err;
 }
 
 static struct convert_driver {
-- 
2.9.2



* [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (10 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 11/13] convert: make apply_filter() adhere to standard Git error handling larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-29 22:21   ` Junio C Hamano
  2016-08-25 11:07 ` [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes larsxschneider
  2016-08-29 15:39 ` [PATCH v6 00/13] Git filter protocol Lars Schneider
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter.<driver>.process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.
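
For illustration, a minimal configuration in the spirit of the tests
added below (the filter name "protocol" is taken from t/t0021; the
script path is a placeholder):

    [filter "protocol"]
        process = /path/to/rot13-filter.pl clean smudge
        required = true

combined with a `*.r filter=protocol` pattern in `.gitattributes`.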

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter.<driver>.process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error") is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two-step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response (see the sketch after this list).
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.
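
For illustration only (not part of the patch), a minimal filter-side
sketch of the two-step status response. The pkt-line helpers are
simplified stand-ins for the ones in t/t0021/rot13-filter.pl, and
$smudged_content / $processing_failed are hypothetical placeholders:

    use strict; use warnings;
    $| = 1;                              # autoflush STDOUT

    sub pkt_write { printf "%04x%s", length($_[0]) + 4, $_[0]; }
    sub pkt_flush { print "0000"; }

    my ($smudged_content, $processing_failed) = ("filtered data\n", 0);

    pkt_write("status=success\n");       # early status, sent before content
    pkt_flush();
    pkt_write($smudged_content);         # a real filter splits the content
    pkt_flush();                         # into packets of at most 65516 bytes
    if ($processing_failed) {
        pkt_write("status=error\n");     # re-set the status afterwards
    }
    pkt_flush();                         # an empty list keeps "success"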

Helped-by: Martin-Louis Bright <mlbright@gmail.com>
Reviewed-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt | 146 +++++++++++++++-
 convert.c                       | 349 +++++++++++++++++++++++++++++++++----
 pkt-line.h                      |   1 +
 t/t0021-conversion.sh           | 372 ++++++++++++++++++++++++++++++++++++++++
 t/t0021/rot13-filter.pl         | 176 +++++++++++++++++++
 unpack-trees.c                  |   1 +
 6 files changed, 1015 insertions(+), 30 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..6346700 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,13 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See the
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +381,144 @@ substitution.  For example:
 ------------------------
 
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (a string value) is defined via
+`filter.<driver>.process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using the following packet format
+(pkt-line, see technical/protocol-common.txt) based protocol over
+standard input and standard output.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate what the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of supported capabilities
+and a flush packet. Git expects to read a list of desired capabilities,
+which must be a subset of the supported capabilities list, and a flush
+packet as response:
+------------------------
+packet:          git> git-filter-client
+packet:          git> version=2
+packet:          git> version=42
+packet:          git> 0000
+packet:          git< git-filter-server
+packet:          git< version=2
+packet:          git> clean=true
+packet:          git> smudge=true
+packet:          git> not-yet-invented=true
+packet:          git> 0000
+packet:          git< clean=true
+packet:          git< smudge=true
+packet:          git< 0000
+------------------------
+Supported filter capabilities in version 2 are "clean" and
+"smudge".
+
+Afterwards Git sends a list of "key=value" pairs terminated with
+a flush packet. The list will contain at least the filter command
+(based on the supported capabilities) and the pathname of the file
+to filter relative to the repository root. Right after these packets
+Git sends the content split into zero or more pkt-line packets and a
+flush packet to terminate the content.
+------------------------
+packet:          git> command=smudge\n
+packet:          git> pathname=path/testfile.dat\n
+packet:          git> 0000
+packet:          git> CONTENT
+packet:          git> 0000
+------------------------
+
+The filter is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the filter does not experience
+problems then the list must contain a "success" status. Right after
+these packets the filter is expected to send the content in zero
+or more pkt-line packets and a flush packet at the end. Finally, a
+second list of "key=value" pairs terminated with a flush packet
+is expected. The filter can change the status in the second list.
+------------------------
+packet:          git< status=success\n
+packet:          git< 0000
+packet:          git< SMUDGED_CONTENT
+packet:          git< 0000
+packet:          git< 0000  # empty list!
+------------------------
+
+If the result content is empty then the filter is expected to respond
+with a success status and an empty list.
+------------------------
+packet:          git< status=success\n
+packet:          git< 0000
+packet:          git< 0000  # empty content!
+packet:          git< 0000  # empty list!
+------------------------
+
+In case the filter cannot or does not want to process the content,
+it is expected to respond with an "error" status. Depending on the
+`filter.<driver>.required` flag Git will interpret that as an error
+but it will not stop or restart the filter process.
+------------------------
+packet:          git< status=error\n
+packet:          git< 0000
+------------------------
+
+In case the filter cannot or does not want to process the content
+as well as any future content for the lifetime of the Git process,
+it is expected to respond with an "error-all" status. Depending on
+the `filter.<driver>.required` flag Git will interpret that as an
+error but it will not stop or restart the filter process.
+------------------------
+packet:          git< status=error-all\n
+packet:          git< 0000
+------------------------
+
+If the filter experiences an error during processing, then it can
+send the status "error". Depending on the `filter.<driver>.required`
+flag Git will interpret that as an error but it will not stop or
+restart the filter process.
+------------------------
+packet:          git< status=success\n
+packet:          git< 0000
+packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
+packet:          git< 0000
+packet:          git< status=error\n
+packet:          git< 0000
+------------------------
+
+If the filter dies during the communication or does not adhere to
+the protocol then Git will stop the filter process and restart it
+with the next file that needs to be processed.
+
+After the filter has processed a blob it is expected to wait for
+the next "key=value" list containing a command. When the Git process
+terminates, it will send a kill signal to the filter in that stage.
+
+A long running filter demo implementation can be found in
+`t/t0021/rot13-filter.pl` located in the Git core repository.
+If you develop your own long running filter process then the
+`GIT_TRACE_PACKET` environment variable can be very helpful
+for debugging (see linkgit:git[1]).
+
+If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
+is configured then these commands always take precedence over
+a configured `filter.<driver>.process` command.
+
+Please note that you cannot use an existing `filter.<driver>.clean`
+or `filter.<driver>.smudge` command with `filter.<driver>.process`
+because the former two use a different inter-process communication
+protocol than the latter one.
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/convert.c b/convert.c
index bd17340..362a0af 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "pkt-line.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -427,7 +428,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len, int fd,
+static int apply_single_file_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
 	/*
@@ -441,12 +442,6 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	struct async async;
 	struct filter_params params;
 
-	if (!cmd || !*cmd)
-		return 0;
-
-	if (!dst)
-		return 1;
-
 	memset(&async, 0, sizeof(async));
 	async.proc = filter_buffer_or_fd;
 	async.data = &params;
@@ -481,14 +476,318 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return !err;
 }
 
+#define CAP_CLEAN    (1u<<0)
+#define CAP_SMUDGE   (1u<<1)
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	int supported_capabilities;
+	const char *cmd;
+	struct child_process process;
+};
+
+static int cmd_process_map_initialized;
+static struct hashmap cmd_process_map;
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+                           const struct cmd2process *e2,
+                           const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
+{
+	struct cmd2process key;
+	hashmap_entry_init(&key, strhash(cmd));
+	key.cmd = cmd;
+	return hashmap_get(hashmap, &key, NULL);
+}
+
+static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
+{
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	/*
+	 * We kill the filter most likely because an error happened already.
+	 * That's why we are not interested in any error code here.
+	 */
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	hashmap_remove(hashmap, entry, NULL);
+	free(entry);
+}
+
+static int packet_write_list(int fd, const char *line, ...)
+{
+	va_list args;
+	int err;
+	va_start(args, line);
+	for (;;)
+	{
+		if (!line)
+			break;
+		if (strlen(line) > PKTLINE_DATA_MAXLEN)
+			return -1;
+		err = packet_write_fmt_gently(fd, "%s", line);
+		if (err)
+			return err;
+		line = va_arg(args, const char*);
+	}
+	va_end(args);
+	return packet_flush_gently(fd);
+}
+
+static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, const char *cmd)
+{
+	int err;
+	struct cmd2process *entry;
+	struct child_process *process;
+	const char *argv[] = { cmd, NULL };
+	struct string_list cap_list = STRING_LIST_INIT_NODUP;
+	char *cap_buf;
+	const char *cap_name;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	entry->supported_capabilities = 0;
+	process = &entry->process;
+
+	child_process_init(process);
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+
+	if (start_command(process)) {
+		error("cannot fork to run external filter '%s'", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_write_list(process->in, "git-filter-client", "version=2", NULL);
+	if (err)
+		goto done;
+
+	err = strcmp(packet_read_line(process->out, NULL), "git-filter-server");
+	if (err) {
+		error("external filter '%s' does not support long running filter protocol", cmd);
+		goto done;
+	}
+	err = strcmp(packet_read_line(process->out, NULL), "version=2");
+	if (err)
+		goto done;
+
+	err = packet_write_list(process->in, "clean=true", "smudge=true", NULL);
+
+	for (;;)
+	{
+		cap_buf = packet_read_line(process->out, NULL);
+		if (!cap_buf)
+			break;
+		string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+		if (cap_list.nr != 2 || strcmp(cap_list.items[1].string, "true"))
+			continue;
+
+		cap_name = cap_list.items[0].string;
+		if (!strcmp(cap_name, "clean")) {
+			entry->supported_capabilities |= CAP_CLEAN;
+		} else if (!strcmp(cap_name, "smudge")) {
+			entry->supported_capabilities |= CAP_SMUDGE;
+		} else {
+			warning(
+				"external filter '%s' requested unsupported filter capability '%s'",
+				cmd, cap_name
+			);
+		}
+
+		string_list_clear(&cap_list, 0);
+	}
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		error("initialization for external filter '%s' failed", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	hashmap_add(hashmap, entry);
+	return entry;
+}
+
+static void read_multi_file_filter_values(int fd, struct strbuf *status) {
+	struct strbuf **pair;
+	char *line;
+	for (;;) {
+		line = packet_read_line(fd, NULL);
+		if (!line)
+			break;
+		pair = strbuf_split_str(line, '=', 2);
+		if (pair[0] && pair[0]->len && pair[1]) {
+			if (!strcmp(pair[0]->buf, "status=")) {
+				strbuf_reset(status);
+				strbuf_addbuf(status, pair[1]);
+			}
+		}
+	}
+}
+
+static int apply_multi_file_filter(const char *path, const char *src, size_t len,
+                                   int fd, struct strbuf *dst, const char *cmd,
+                                   const int wanted_capability)
+{
+	int err;
+	struct cmd2process *entry;
+	struct child_process *process;
+	struct stat file_stat;
+	struct strbuf nbuf = STRBUF_INIT;
+	struct strbuf filter_status = STRBUF_INIT;
+	char *filter_type;
+
+	if (!cmd_process_map_initialized) {
+		cmd_process_map_initialized = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = find_multi_file_filter_entry(&cmd_process_map, cmd);
+	}
+
+	fflush(NULL);
+
+	if (!entry) {
+		entry = start_multi_file_filter(&cmd_process_map, cmd);
+		if (!entry)
+			return 0;
+	}
+	process = &entry->process;
+
+	if (!(wanted_capability & entry->supported_capabilities))
+		return 0;
+
+	if (CAP_CLEAN & wanted_capability)
+		filter_type = "clean";
+	else if (CAP_SMUDGE & wanted_capability)
+		filter_type = "smudge";
+	else
+		die("unexpected filter type");
+
+	if (fd >= 0 && !src) {
+		if (fstat(fd, &file_stat) == -1)
+			return 0;
+		len = xsize_t(file_stat.st_size);
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+
+	err = (strlen(filter_type) > PKTLINE_DATA_MAXLEN);
+	if (err)
+		goto done;
+	err = packet_write_fmt_gently(process->in, "command=%s\n", filter_type);
+	if (err)
+		goto done;
+
+	err = (strlen(path) > PKTLINE_DATA_MAXLEN);
+	if (err)
+		goto done;
+	err = packet_write_fmt_gently(process->in, "pathname=%s\n", path);
+	if (err)
+		goto done;
+
+	err = packet_flush_gently(process->in);
+	if (err)
+		goto done;
+
+	if (fd >= 0)
+		err = packet_write_stream_with_flush_from_fd(fd, process->in);
+	else
+		err = packet_write_stream_with_flush_from_buf(src, len, process->in);
+	if (err)
+		goto done;
+
+	read_multi_file_filter_values(process->out, &filter_status);
+	err = strcmp(filter_status.buf, "success");
+	if (err)
+		goto done;
+
+	err = packet_read_till_flush(process->out, &nbuf) < 0;
+	if (err)
+		goto done;
+
+	read_multi_file_filter_values(process->out, &filter_status);
+	err = strcmp(filter_status.buf, "success");
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		if (!strcmp(filter_status.buf, "error")) {
+			/* The filter signaled a problem with the file. */
+		} else if (!strcmp(filter_status.buf, "error-all")) {
+			/*
+			 * The filter signaled a permanent problem. Don't try to filter
+			 * files with the same command for the lifetime of the current
+			 * Git process.
+			 */
+			 entry->supported_capabilities &= ~wanted_capability;
+		} else {
+			/*
+			 * Something went wrong with the protocol filter.
+			 * Force shutdown and restart if another blob requires filtering!
+			 */
+			error("external filter '%s' failed", cmd);
+			kill_multi_file_filter(&cmd_process_map, entry);
+		}
+	} else {
+		strbuf_swap(dst, &nbuf);
+	}
+	strbuf_release(&nbuf);
+	return !err;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
+	const char *process;
 	int required;
 } *user_convert, **user_convert_tail;
 
+static int apply_filter(const char *path, const char *src, size_t len,
+                        int fd, struct strbuf *dst, struct convert_driver *drv,
+                        const int wanted_capability)
+{
+	const char* cmd = NULL;
+
+	if (!drv)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if ((CAP_CLEAN & wanted_capability) && drv->clean)
+		cmd = drv->clean;
+	else if ((CAP_SMUDGE & wanted_capability) && drv->smudge)
+		cmd = drv->smudge;
+
+	if (cmd && *cmd)
+		return apply_single_file_filter(path, src, len, fd, dst, cmd);
+	else if (drv->process && *drv->process)
+		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
+
+	return 0;
+}
+
 static int read_convert_config(const char *var, const char *value, void *cb)
 {
 	const char *key, *name;
@@ -526,6 +825,9 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("process", key))
+		return git_config_string(&drv->process, var, value);
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -823,7 +1125,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -856,18 +1158,12 @@ int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
 	int ret = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->clean;
-		required = ca.drv->required;
-	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
-	if (!ret && required)
+	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN);
+	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	if (ret && dst) {
@@ -889,9 +1185,9 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
-	assert(ca.drv->clean);
+	assert(ca.drv->clean || ca.drv->process);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -903,15 +1199,9 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 					    int normalizing)
 {
 	int ret = 0, ret_filter = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->smudge;
-		required = ca.drv->required;
-	}
 
 	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
 	if (ret) {
@@ -920,9 +1210,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 	}
 	/*
 	 * CRLF conversion can be skipped if normalizing, unless there
-	 * is a smudge filter.  The filter might expect CRLFs.
+	 * is a smudge or process filter (even if the process filter doesn't
+	 * support smudge).  The filters might expect CRLFs.
 	 */
-	if (filter || !normalizing) {
+	if ((ca.drv && (ca.drv->smudge || ca.drv->process)) || !normalizing) {
 		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 		if (ret) {
 			src = dst->buf;
@@ -930,8 +1221,8 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
-	if (!ret_filter && required)
+	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, CAP_SMUDGE);
+	if (!ret_filter && ca.drv && ca.drv->required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
 	return ret | ret_filter;
@@ -1383,7 +1674,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	struct stream_filter *filter = NULL;
 
 	convert_attrs(&ca, path);
-	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
+	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/pkt-line.h b/pkt-line.h
index 9616117..6a3b7cf 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -86,6 +86,7 @@ ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out);
 
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
+#define PKTLINE_DATA_MAXLEN (LARGE_PACKET_MAX - 4)
 extern char packet_buffer[LARGE_PACKET_MAX];
 
 #endif
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 34c8eb9..5ba694b 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -42,6 +42,9 @@ test_expect_success setup '
 	rm -f test test.t test.i &&
 	git checkout -- test test.t test.i &&
 
+	echo "content-test2" >test2.o &&
+	echo "content-test3-subdir" >test3-subdir.o &&
+
 	mkdir generated-test-data &&
 	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
 	do
@@ -296,4 +299,373 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+check_filter () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_count_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+		sed "s/^\([0-9]\) IN: clean/x IN: clean/" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_ignore_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" &&
+	cat >expected.log &&
+	grep -v "IN: clean" rot13-filter.log >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_no_call () {
+	rm -f rot13-filter.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	test_must_be_empty rot13-filter.log
+}
+
+check_rot13 () {
+	test_cmp $1 $2 &&
+	./../rot13.sh <$1 >expected &&
+	git cat-file blob :$2 >actual &&
+	test_cmp expected actual
+}
+
+test_expect_success PERL 'required process filter should filter data' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		mkdir testsubdir &&
+		cp ../test3-subdir.o testsubdir/test3-subdir.r &&
+		>test4-empty.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					1 IN: clean test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					1 IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF &&
+					x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					x IN: clean test2.r 14 [OK] -- OUT: 14 [OK]
+					x IN: clean test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					x IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		rm -f test?.r testsubdir/test3-subdir.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout empty \
+				<<-\EOF &&
+					start
+					wrote filter header
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout master \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+					IN: smudge test4-empty.r 0 [OK] -- OUT: 0 [OK]
+					IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+		check_rot13 ../test3-subdir.o testsubdir/test3-subdir.r
+	)
+'
+
+test_expect_success PERL 'required process filter should filter smudge data and one-shot filter should clean' '
+	test_config_global filter.protocol.clean ./../rot13.sh &&
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+
+		check_filter_no_call \
+			git add . &&
+
+		check_filter_no_call \
+			git commit . -m "test commit" &&
+
+		rm -f test?.r testsubdir/test3-subdir.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		git checkout empty &&
+
+		check_filter_ignore_clean \
+			git checkout master\
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r
+	)
+'
+
+test_expect_success PERL 'required process filter should clean only' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cp ../test.o test.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF
+					x IN: clean test.r 57 [OK] -- OUT: 57 [OK]
+					1 start
+					1 wrote filter header
+				EOF
+	)
+'
+
+test_expect_success PERL 'required process filter should process binary files larger than LARGE_PACKET_MAX' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.file filter=protocol" >.gitattributes &&
+		cat ../generated-test-data/largish.file.rot13 >large.rot13 &&
+		cat ../generated-test-data/largish.file >large.file &&
+		cat large.file >large.original &&
+
+		git add large.file .gitattributes &&
+		git commit . -m "test commit" &&
+
+		rm -f large.file &&
+		git checkout -- large.file &&
+		git cat-file blob :large.file >actual &&
+		test_cmp large.rot13 actual
+	)
+'
+
+test_expect_success PERL 'required process filter with clean error should fail' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "content-test3-subdir" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success PERL 'process filter should restart after unexpected write failure' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [WRITE FAIL]
+					start
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
+test_expect_success PERL 'process filter should not restart in case of an error' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "this will cause an error" >error.o &&
+		cp error.o error.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge error.r 25 [OK] -- OUT: 0 [ERROR]
+					IN: smudge test.r 57 [OK] -- OUT: 57 [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 [OK]
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+		test_cmp error.o error.r
+	)
+'
+
+test_expect_success PERL 'process filter should be able to signal an error for all future files' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "error this blob and all future blobs" >error-all.o &&
+		cp error-all.o error-all.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					start
+					wrote filter header
+					IN: smudge error-all.r 37 [OK] -- OUT: 0 [ERROR-ALL]
+				EOF
+
+		test_cmp ../test.o test.r &&
+		test_cmp ../test2.o test2.r &&
+		test_cmp error-all.o error-all.r
+	)
+'
+
+test_expect_success PERL 'invalid process filter must fail (and not hang!)' '
+	test_config_global filter.protocol.process cat &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		test_must_fail git add . 2> git_stderr.log &&
+		grep "not support long running filter protocol" git_stderr.log
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
new file mode 100755
index 0000000..0e1181d
--- /dev/null
+++ b/t/t0021/rot13-filter.pl
@@ -0,0 +1,176 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Long Running Filter Process"
+#
+# The script takes the list of supported protocol capabilities as
+# arguments ("clean", "smudge", etc).
+#
+# This implementation supports special test cases:
+# (1) If data with the pathname "clean-write-fail.r" is processed with
+#     a "clean" operation then the write operation will die.
+# (2) If data with the pathname "smudge-write-fail.r" is processed with
+#     a "smudge" operation then the write operation will die.
+# (3) If data with the pathname "error.r" is processed with any
+#     operation then the filter signals that it cannot or does not want
+#     to process the file.
+# (4) If data with the pathname "error-all.r" is processed with any
+#     operation then the filter signals that it cannot or does not want
+#     to process the file and any file after that is processed with the
+#     same command.
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+my @capabilities            = @ARGV;
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+sub packet_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+
+        # EOF - Git stopped talking to us!
+        exit();
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet ($content_size expected; $bytes_read read)";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+open my $debug, ">>", "rot13-filter.log";
+print $debug "start\n";
+$debug->flush();
+
+( packet_read() eq ( 0, "git-filter-client" ) ) || die "bad initialization";
+( packet_read() eq ( 0, "version=2" ) )         || die "bad version";
+( packet_read() eq ( 1, "" ) )                  || die "bad version end";
+
+packet_write("git-filter-server\n");
+packet_write("version=2\n");
+
+( packet_read() eq ( 0, "clean=true" ) )  || die "bad capability";
+( packet_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
+( packet_read() eq ( 1, "" ) )            || die "bad capability end";
+
+foreach (@capabilities) {
+    packet_write( $_ . "=true\n" );
+}
+packet_flush();
+print $debug "wrote filter header\n";
+$debug->flush();
+
+while (1) {
+    my ($command) = packet_read() =~ /^command=([^=]+)\n$/;
+    print $debug "IN: $command";
+    $debug->flush();
+
+    my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
+    print $debug " $pathname";
+    $debug->flush();
+
+    # Flush
+    packet_read();
+
+    my $input = "";
+    {
+        binmode(STDIN);
+        my $buffer;
+        my $done = 0;
+        while ( !$done ) {
+            ( $done, $buffer ) = packet_read();
+            $input .= $buffer;
+        }
+        print $debug " " . length($input) . " [OK] -- ";
+        $debug->flush();
+    }
+
+    my $output;
+    if ( $pathname eq "error.r" or $pathname eq "error-all.r" ) {
+        $output = "";
+    }
+    elsif ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    else {
+        die "bad command '$command'";
+    }
+
+    print $debug "OUT: " . length($output) . " ";
+    $debug->flush();
+
+    if ( $pathname eq "error.r" ) {
+        print $debug "[ERROR]\n";
+        $debug->flush();
+        packet_write("status=error\n");
+        packet_flush();
+    }
+    elsif ( $pathname eq "error-all.r" ) {
+        print $debug "[ERROR-ALL]\n";
+        $debug->flush();
+        packet_write("status=error-all\n");
+        packet_flush();
+    }
+    else {
+        packet_write("status=success\n");
+        packet_flush();
+
+        if ( $pathname eq "${command}-write-fail.r" ) {
+            print $debug "[WRITE FAIL]\n";
+            $debug->flush();
+            die "${command} write error";
+        }
+
+        while ( length($output) > 0 ) {
+            my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+            packet_write($packet);
+            if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+                $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+            }
+            else {
+                $output = "";
+            }
+        }
+        packet_flush();
+        print $debug "[OK]\n";
+        $debug->flush();
+        packet_flush();
+    }
+}
diff --git a/unpack-trees.c b/unpack-trees.c
index 11c37fb..f6798f8 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -10,6 +10,7 @@
 #include "attr.h"
 #include "split-index.h"
 #include "dir.h"
+#include "convert.h"
 
 /*
  * Error messages expected by scripts out of plumbing commands such as
-- 
2.9.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (11 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 12/13] convert: add filter.<driver>.process option larsxschneider
@ 2016-08-25 11:07 ` larsxschneider
  2016-08-29 18:05   ` Junio C Hamano
  2016-08-29 15:39 ` [PATCH v6 00/13] Git filter protocol Lars Schneider
  13 siblings, 1 reply; 66+ messages in thread
From: larsxschneider @ 2016-08-25 11:07 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, Johannes.Schindelin, jnareb, mlbright,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Consider the case of a file that requires filtering and is present in
branch A but not in branch B. If A is the current HEAD and we checkout B
then the following happens:

1. ce_compare_data() opens the file
2.   index_fd() detects that the file requires a clean filter and
     calls index_stream_convert_blob()
3.     index_stream_convert_blob() calls convert_to_git_filter_fd()
4.       convert_to_git_filter_fd() calls apply_filter() which creates a
         new long running filter process (in case it is the first file
         of this kind to be filtered)
5.       The new filter process inherits all file handles. This is the
         default on Linux/OSX and is explicitly defined in the
         `CreateProcessW` call in `mingw.c` on Windows.
6. ce_compare_data() closes the file
7. Git unlinks the file as it is not present in B

The unlink operation does not work on Windows because the filter process
still has an open handle to the file. Apparently that is no problem on
Linux/OSX, probably because "[...] the two file descriptors share open
file status flags" (see fork(2)).

Fix this problem by opening files in read-cache with the `O_CLOEXEC`
flag to ensure that the file descriptor does not remain open in a newly
spawned process. `O_CLOEXEC` is defined as `O_NOINHERIT` on Windows. A
similar fix for temporary file handles was applied on Git for Windows
already:
https://github.com/git-for-windows/git/commit/667b8b51ec850c3e1c7d75dee69dc13c29d1f162

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/read-cache.c b/read-cache.c
index db27766..f481dee 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -159,7 +159,7 @@ void fill_stat_cache_info(struct cache_entry *ce, struct stat *st)
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
 {
 	int match = -1;
-	int fd = open(ce->name, O_RDONLY);
+	int fd = open(ce->name, O_RDONLY | O_CLOEXEC);
 
 	if (fd >= 0) {
 		unsigned char sha1[20];
-- 
2.9.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-25 11:07 ` [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently() larsxschneider
@ 2016-08-25 18:12   ` Stefan Beller
  2016-08-25 18:47     ` Lars Schneider
  2016-08-25 21:41   ` Junio C Hamano
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Beller @ 2016-08-25 18:12 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

> +int packet_write_fmt_gently(int fd, const char *fmt, ...)
> +{
> +       static struct strbuf buf = STRBUF_INIT;
> +       va_list args;
> +
> +       strbuf_reset(&buf);
> +       va_start(args, fmt);
> +       format_packet(&buf, fmt, args);

format_packet also takes care of tracing the contents,
which was a bit unexpected to me.
Do we also want to trace failure?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 11:07 ` [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
@ 2016-08-25 18:46   ` Stefan Beller
  2016-08-25 19:33     ` Lars Schneider
  2016-08-25 22:31     ` Junio C Hamano
  2016-08-25 22:27   ` Junio C Hamano
  1 sibling, 2 replies; 66+ messages in thread
From: Stefan Beller @ 2016-08-25 18:46 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> packet_write_stream_with_flush_from_fd() and
> packet_write_stream_with_flush_from_buf() write a stream of packets. All
> content packets use the maximal packet size except for the last one.
> After the last content packet a `flush` control packet is written.
>
> packet_read_till_flush() reads arbitrary sized packets until it detects
> a `flush` packet.

So the API provided by these read/write functions is intended
to move huge chunks of data. And as it puts the data on the wire one
packet after the other, without the possibility to intervene and e.g.
send a side channel progress bar update, I would question the design
of this. If I understand correctly this will specifically be used for
large files locally, so e.g. a file of 5 GB (such as a virtual machine
tracked in Git) would require about 80k packets.
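(For reference, 5 GiB at 65,516 payload bytes per packet comes out
to roughly 82,000 packets.)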

Instead of having many packets of max length and then a remainder,
I would suggest to invent larger packets for this use case. Then we can
just send one packet instead.

Currently a packet consists of 4 bytes indicating the length in hex
and then the payload of length-4 bytes. As the length is in hex
the characters in the first 4 bytes are [0-9a-f], we can easily add another
meaning for the length, e.g.:

  A packet starts with the overall length and then the payload.
  If the first character of the length is 'v' the length is encoded as a
  variable length quantity[1]. The high bit of the char indicates if
  the next char is still part of the length field. The length must not exceed
  LLONG_MAX (which results in a payload of 9223 Petabyte, so
  enough for the foreseeable future).

  [1] A variable-length quantity (VLQ) is a universal code that uses
  an arbitrary number of bytes to represent an arbitrarily large integer.
  https://en.wikipedia.org/wiki/Variable-length_quantity
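
For concreteness, a rough sketch of such an encoding (illustration
only; not proposed code, helper names made up):

  use strict; use warnings;

  sub vlq_encode {
      my ($n) = @_;
      my @bytes = (chr($n & 0x7f));
      while ($n >>= 7) {
          unshift @bytes, chr(0x80 | ($n & 0x7f)); # high bit: more follows
      }
      return join "", @bytes;
  }

  sub vlq_decode {
      my ($buf) = @_;
      my $n = 0;
      for my $byte (map { ord } split //, $buf) {
          $n = ($n << 7) | ($byte & 0x7f);
          last unless $byte & 0x80;                # high bit clear: done
      }
      return $n;
  }

  my $len = 5 * 1024 ** 3;                         # a 5 GiB payload
  die unless vlq_decode(vlq_encode($len)) == $len; # round-trip check
  printf "%d encodes to %d length bytes\n", $len, length(vlq_encode($len));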

The neat thing about the packet system is we can dedicate packets
to different channels (such as the side channels), but with the provided
API here this makes it impossible to later add in these side channel
as it is a pure streaming API now. So let's remove the complication
of having to send multiple packets and just go with one large packet
instead.

--
    I understand that my proposal would require writing code again,
    but it has also some long term advantages in the networking stack
    of Git: There are some worries that a capabilities line in fetch/push
    might overflow in the far future, when there are lots of capabilities.

    Also a few days ago there was a proposal to add all symbolic refs
    to a capabilities line, which Peff shot down as "the packet may be
    too small".

    There is an incredible hack that allows transporting refs > 64kB IIRC.

    All these things could go away with the variable length encoded
    packets. But to make them go away in the future we would need
    to start with these variable length packets today. ;)

Just food for thought.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-25 18:12   ` Stefan Beller
@ 2016-08-25 18:47     ` Lars Schneider
  0 siblings, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-08-25 18:47 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On 25 Aug 2016, at 20:12, Stefan Beller <sbeller@google.com> wrote:

>> +int packet_write_fmt_gently(int fd, const char *fmt, ...)
>> +{
>> +       static struct strbuf buf = STRBUF_INIT;
>> +       va_list args;
>> +
>> +       strbuf_reset(&buf);
>> +       va_start(args, fmt);
>> +       format_packet(&buf, fmt, args);
> 
> format_packet also takes care of tracing the contents,
> which was a bit unexpected to me.

To me, too :)
http://public-inbox.org/git/20160810143321.q7mjirgr5ynml5ff@sigill.intra.peff.net/

The series is already pretty large and therefore I decided to leave this as-is.

> Do we also want to trace failure?

You mean in all new *_gently() functions? Good idea!

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size
  2016-08-25 11:07 ` [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size larsxschneider
@ 2016-08-25 18:59   ` Stefan Beller
  2016-08-25 19:35     ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Beller @ 2016-08-25 18:59 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> According to LARGE_PACKET_MAX in pkt-line.h the maximal length of a
> pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
> therefore the pkt-line data component must not exceed 65516 bytes.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---

I was recently encouraged off list to really try hard to go with
shorter series.
(Duh! Everyone knows that shorter series go in faster ;)

And as this is a strict bug fix of the documentation that makes sense
outside this series, I'd like to encourage you to send it as a separate
patch in case you need further rerolls.

Thanks,
Stefan

>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-25 11:07 ` [PATCH v6 10/13] convert: generate large test files only once larsxschneider
@ 2016-08-25 19:17   ` Stefan Beller
  2016-08-25 19:54     ` Lars Schneider
  2016-08-29 17:46   ` Junio C Hamano
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Beller @ 2016-08-25 19:17 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Generate more interesting large test files

How are the large test files more interesting?
(interesting in the sense of covering more potential bugs?
easier to debug? better to maintain, or just a pleasant read?)

> with pseudo random characters
> in between and reuse these test files in multiple tests. Run tests
> formerly marked as EXPENSIVE every time but with a smaller data set.

Sounds good to me.

>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 18:46   ` Stefan Beller
@ 2016-08-25 19:33     ` Lars Schneider
  2016-08-25 22:31     ` Junio C Hamano
  1 sibling, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-08-25 19:33 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright


> On 25 Aug 2016, at 20:46, Stefan Beller <sbeller@google.com> wrote:
> 
>> On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_write_stream_with_flush_from_fd() and
>> packet_write_stream_with_flush_from_buf() write a stream of packets. All
>> content packets use the maximal packet size except for the last one.
>> After the last content packet a `flush` control packet is written.
>> 
>> packet_read_till_flush() reads arbitrary sized packets until it detects
>> a `flush` packet.
> 
> So the API provided by these read/write functions is intended
> to move a huge chunks of data. And as it puts the data on the wire one
> packet after the other without the possibility to intervene and e.g. send
> a side channel progress bar update, I would question the design of this.
> If I understand correctly this will be specifically  used for large
> files locally,
> so e.g. a file of 5 GB (such as a virtual machine tracked in Git), would
> require about 80k packets.

Peff suggested this approach arguing that the overhead is negligible:
http://public-inbox.org/git/20160720134916.GB19359@sigill.intra.peff.net/


> Instead of having many packets of max length and then a remainder,
> I would suggest to invent larger packets for this use case. Then we can
> just send one packet instead.
> 
> Currently a packet consists of 4 bytes indicating the length in hex
> and then the payload of length-4 bytes. As the length is in hex
> the characters in the first 4 bytes are [0-9a-f], we can easily add another
> meaning for the length, e.g.:
> 
>  A packet starts with the overall length and then the payload.
>  If the first character of the length is 'v' the length is encoded as a
>  variable length quantity[1]. The high bit of the char indicates if
>  the next char is still part of the length field. The length must not exceed
>  LLONG_MAX (which results in a payload of 9223 Petabyte, so
>  enough for the foreseeable future).

Eventually I would like to resurrect Joey's cleanFromFile/smudgeToFile idea:

http://public-inbox.org/git/1468277112-9909-3-git-send-email-joeyh@joeyh.name/

Then we would not need to transfer that much data over the pipes.
However, I wonder if the large number of packets would actually be a
problem. Honestly, I would prefer not to change Git's packet format in
this already large series ;-)


>  [1] A variable-length quantity (VLQ) is a universal code that uses
>  an arbitrary number of bytes to represent an arbitrarily large integer.
>  https://en.wikipedia.org/wiki/Variable-length_quantity
> 
> The neat thing about the packet system is we can dedicate packets
> to different channels (such as the side channels), but with the provided
> API here this makes it impossible to later add in these side channel
> as it is a pure streaming API now. So let's remove the complication
> of having to send multiple packets and just go with one large packet
> instead.

I tried to design the protocol as flexible as possible for the future
with a version negotiation and a capabilities list. Therefore, I would
think it should be possible to implement these ideas in the future if
they are required.


> --
>    I understand that my proposal would require writing code again,
>    but it has also some long term advantages in the networking stack
>    of Git: There are some worries that a capabilities line in fetch/push
>    might overflow in the far future, when there are lots of capabilities.
> 
>    Also a few days ago there was a proposal to add all symbolic refs
>    to a capabilities line, which Peff shot down as "the packet may be
>    too small".
> 
>    There is an incredible hack that allows transporting refs > 64kB IIRC.
> 
>    All these things could go away with the variable length encoded
>    packets. But to make them go away in the future we would need
>    to start with these variable length packets today. ;)
> 
> Just food for thought.

Thanks for thinking it through that thoroughly! I understand your point
of view and I am curious what others think.

Cheers,
Lars


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size
  2016-08-25 18:59   ` Stefan Beller
@ 2016-08-25 19:35     ` Lars Schneider
  2016-08-26 19:44       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-25 19:35 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright


> On 25 Aug 2016, at 20:59, Stefan Beller <sbeller@google.com> wrote:
> 
>> On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> According to LARGE_PACKET_MAX in pkt-line.h the maximal length of a
>> pkt-line packet is 65520 bytes. The pkt-line header takes 4 bytes and
>> therefore the pkt-line data component must not exceed 65516 bytes.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
> 
> I was recently encouraged off list to really try hard to go with
> shorter series's.
> (Duh! Everyone knows that shorter series go in faster ;)
> 
> And as this is a strict bug fix of Documentation that makes sense
> outside this series,
> I'd like to encourage you to send this as a separate patch in case you
> need further
> rerolls.

Agreed!

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-25 19:17   ` Stefan Beller
@ 2016-08-25 19:54     ` Lars Schneider
  2016-08-29 17:52       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-25 19:54 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Jeff King, Junio C Hamano,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright



> On 25 Aug 2016, at 21:17, Stefan Beller <sbeller@google.com> wrote:
> 
>> On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Generate more interesting large test files
> 
> How are the large test files more interesting?
> (interesting in the notion of covering more potential bugs?
> easier to debug? better to maintain, or just a pleasant read?)

The old large test file was 1MB of zeros and 1 byte with a one, repeated
2048 times.

Since the filter uses 64k packets we would test a large number of
identical-looking packets.

That's why I thought the pseudo random content is more interesting.
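
Something along these lines, just to illustrate what I mean by pseudo
random content (this is not the actual test code):

    use strict; use warnings;
    srand(42);                                 # reproducible "random" data
    my @chars = ('a' .. 'z', 'A' .. 'Z', '0' .. '9', ' ', "\n");
    print $chars[int rand @chars] for 1 .. 1024 * 1024;   # ~1 MB of noise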


>> with pseudo random characters
>> in between and reuse these test files in multiple tests. Run tests
>> formerly marked as EXPENSIVE every time but with a smaller data set.
> 
> Sounds good to me.

Thank you.
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-25 11:07 ` [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently() larsxschneider
  2016-08-25 18:12   ` Stefan Beller
@ 2016-08-25 21:41   ` Junio C Hamano
  2016-08-26  9:17     ` Lars Schneider
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-25 21:41 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> packet_write_fmt() would die in case of a write error even though for
> some callers an error would be acceptable. Add packet_write_fmt_gently()
> which writes a formatted pkt-line and returns `0` for success and `-1`
> for an error.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 12 ++++++++++++
>  pkt-line.h |  1 +
>  2 files changed, 13 insertions(+)
>
> diff --git a/pkt-line.c b/pkt-line.c
> index e8adc0f..3e8b2fb 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -137,6 +137,18 @@ void packet_write_fmt(int fd, const char *fmt, ...)
>  	write_or_die(fd, buf.buf, buf.len);
>  }
>  
> +int packet_write_fmt_gently(int fd, const char *fmt, ...)
> +{
> +	static struct strbuf buf = STRBUF_INIT;
> +	va_list args;
> +
> +	strbuf_reset(&buf);
> +	va_start(args, fmt);
> +	format_packet(&buf, fmt, args);
> +	va_end(args);
> +	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
> +}

Even though it's only a handful of lines, it is a bit ugly to have a
completely copied implementation only to have _gently().  I suspect
that you should be able to

	static int packet_write_fmt_1(int fd, int gently,
					const char *fmt, va_list args)
        {
		struct strbuf buf = STRBUF_INIT;
		size_t count;

		format_packet(&buf, fmt, args);
		
		count = write_in_full(fd, buf.buf, buf.len);
                if (count == buf.len)
                	return 0;
		if (!gently) {
			check_pipe(errno);
                	die_errno("write error");
		}
                return -1;
	}

and then share that between the existing one:

	void packet_write_fmt(int fd, const char *fmt, ...)
        {
		va_list args;
	        va_start(args, fmt);
                packet_write_fmt_1(fd, 0, fmt, args);
                va_end(args);
	}

and the new one:

	int packet_write_fmt_gently(int fd, const char *fmt, ...)
        {
		int status;
		va_list args;
	        va_start(args, fmt);
                status = packet_write_fmt_1(fd, 1, fmt, args);
                va_end(args);
		return status;
	}
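
(The point of having the gentle variant at all is that the caller gets
to decide what a failed write means.  A hypothetical caller, with a
made-up file descriptor and message that are not from this series,
could then do

	if (packet_write_fmt_gently(process->in, "command=clean\n"))
		return error("could not send command to external filter");

instead of dying deep inside pkt-line.c.)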

>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>  {
>  	va_list args;
> diff --git a/pkt-line.h b/pkt-line.h
> index 1902fb3..3caea77 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -23,6 +23,7 @@ void packet_flush(int fd);
>  void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  void packet_buf_flush(struct strbuf *buf);
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  
>  /*
>   * Read a packetized line into the buffer, which must be at least size bytes

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 05/13] pkt-line: add packet_write_gently()
  2016-08-25 11:07 ` [PATCH v6 05/13] pkt-line: add packet_write_gently() larsxschneider
@ 2016-08-25 21:50   ` Junio C Hamano
  2016-08-26  9:40     ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-25 21:50 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> packet_write_fmt() has two shortcomings. First, it uses format_packet()
> which lets the caller only send string data via "%s". That means it
> cannot be used for arbitrary data that may contain NULs. Second, it will
> always die on error.

As you introduced _gently in 3/13, the latter is no longer a valid
excuse to add this function.  Just remove the sentence "Second, ...".

> Add packet_write_gently() which writes arbitrary data and returns `0`
> for success and `-1` for an error. This function is used by other
> pkt-line functions in a subsequent patch.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/pkt-line.c b/pkt-line.c
> index cad26df..7e8a803 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  
>  char packet_buffer[LARGE_PACKET_MAX];
> +static char packet_write_buffer[LARGE_PACKET_MAX];
>  static const char *packet_trace_prefix = "git";
>  static struct trace_key trace_packet = TRACE_KEY_INIT(PACKET);
>  static struct trace_key trace_pack = TRACE_KEY_INIT(PACKFILE);
> @@ -155,6 +156,17 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
>  	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
>  }
>  
> +int packet_write_gently(const int fd_out, const char *buf, size_t size)
> +{
> +	if (size > sizeof(packet_write_buffer) - 4)
> +		return -1;
> +	packet_trace(buf, size, 1);
> +	memmove(packet_write_buffer + 4, buf, size);
> +	size += 4;
> +	set_packet_header(packet_write_buffer, size);

It may not matter all that much, but from a code-reader's point of
view, when you know packet_write_buffer[] will contain things A and
B in this order, and when you have enough information to compute A
before starting to fill packet_write_buffer[], I would prefer to
see you actually fill the buffer in that natural order.

Do you anticipate future need of non-gently variant of this
function?  If so, perhaps a helper that takes a boolean "am I
working for the gently variant?" may help share more code.

> +	return (write_in_full(fd_out, packet_write_buffer, size) == size ? 0 : -1);
> +}
> +
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>  {
>  	va_list args;

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 11:07 ` [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
  2016-08-25 18:46   ` Stefan Beller
@ 2016-08-25 22:27   ` Junio C Hamano
  2016-08-26 10:13     ` Lars Schneider
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-25 22:27 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> packet_write_stream_with_flush_from_fd() and
> packet_write_stream_with_flush_from_buf() write a stream of packets. All
> content packets use the maximal packet size except for the last one.
> After the last content packet a `flush` control packet is written.
> packet_read_till_flush() reads arbitrary sized packets until it detects
> a `flush` packet.

These are awkwardly named and I couldn't guess what the input is (I
can tell one is to read from fd and the other is <mem,len> buffer,
but it is unclear if that is in packetized form or just raw data
stream to be copied to the end from their names) without reading the
implementation.  I _think_ you read a raw stream of data through the
end (either EOF or length limit) and write it out packetized, and
use the flush packet to mark the end of the stream.  In my mind,
that is "writing a packetized stream".  The words "packetizing" and
"stream" imply that the stream could consist of more data than what
would fit in a single packet, which in turn implies that there needs
to be a way to mark the end of one data item, so with_flush does not
necessarily have to be their names.

The counter-part would be "reading a packetized stream".

> +int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out)
> +{

Especially this one I am tempted to suggest "copy-to-packetized-stream",
as it reads a stream from one fd and then copies out while packetizing.

> +	int err = 0;
> +	ssize_t bytes_to_write;
> +
> +	while (!err) {
> +		bytes_to_write = xread(fd_in, packet_write_buffer, sizeof(packet_write_buffer) - 4);
> +		if (bytes_to_write < 0)
> +			return COPY_READ_ERROR;
> +		if (bytes_to_write == 0)
> +			break;
> +		if (bytes_to_write > sizeof(packet_write_buffer) - 4)
> +			return COPY_WRITE_ERROR;

... and you seem to agree with me by using COPY here.

> +		err = packet_write_gently(fd_out, packet_write_buffer, bytes_to_write);
> +	}
> +	if (!err)
> +		err = packet_flush_gently(fd_out);
> +	return err;
> +}
> +
> +int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out)
> +{
> +	int err = 0;
> +	size_t bytes_written = 0;
> +	size_t bytes_to_write;
> +
> +	while (!err) {
> +		if ((len - bytes_written) > sizeof(packet_write_buffer) - 4)
> +			bytes_to_write = sizeof(packet_write_buffer) - 4;
> +		else
> +			bytes_to_write = len - bytes_written;
> +		if (bytes_to_write == 0)
> +			break;

The lack of COPY_WRITE_ERROR puzzled me briefly here.  If you are
assuming that your math at the beginning of this loop is correct and
bytes_to_write will never exceed the write-buffer size, I think you
should be able to (and it would be better to) assume that the math
you do to tell xread() up to how many bytes it is allowed to read at
once is also correct, losing the COPY_WRITE_ERROR check in the other
function.  You can choose to play safer and do a check in this
function, too.  Either way, we would want to be consistent.

> +		err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write);
> +		bytes_written += bytes_to_write;
> +	}
> +	if (!err)
> +		err = packet_flush_gently(fd_out);
> +	return err;
> +}

> +ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out)
> +{
> +	int len, ret;
> +	int options = PACKET_READ_GENTLE_ON_EOF;
> +	char linelen[4];
> +
> +	size_t oldlen = sb_out->len;
> +	size_t oldalloc = sb_out->alloc;
> +
> +	for (;;) {
> +		/* Read packet header */
> +		ret = get_packet_data(fd_in, NULL, NULL, linelen, 4, options);
> +		if (ret < 0)
> +			goto done;
> +		len = packet_length(linelen);
> +		if (len < 0)
> +			die("protocol error: bad line length character: %.4s", linelen);
> +		if (!len) {
> +			/* Found a flush packet - Done! */
> +			packet_trace("0000", 4, 0);
> +			break;
> +		}
> +		len -= 4;
> +
> +		/* Read packet content */
> +		strbuf_grow(sb_out, len);
> +		ret = get_packet_data(fd_in, NULL, NULL, sb_out->buf + sb_out->len, len, options);
> +		if (ret < 0)
> +			goto done;
> +		if (ret != len) {
> +			error("protocol error: incomplete read (expected %d, got %d)", len, ret);
> +			goto done;
> +		}
> +
> +		packet_trace(sb_out->buf + sb_out->len, len, 0);

All of the above seems to pretty much duplicate the logic in
packet_read(), except that this user does not need options handling
it has.  Is optimizing that out the reason why you open-coded it
here?

Or is it because you cannot tell if you got a truly empty packet or
you got a flush from outside packet_read(), and you wanted to make
sure that you won't be fooled by a normal packet with 0-length
payload?

If the latter is the reason, it may be a viable alternative to
update packet_read() to take PACKET_READ_IGNORE_EMPTY_PACKET, i.e. a
new bit in its options parameter, so that a normal packet with
0-length payload is simply ignored there (i.e. even without
returning, packet_read() would repeat from the beginning when it got
such a packet).  That way, the above would become 

	strbuf_grow(); /* enough to hold max-packet-len more bytes */
	len = packet_read();
        if (!len)
        	/* we cannot get 0 unless we see flush */
                break;

which may be a lot cleaner?
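
Spelled out a bit more (only a sketch; the IGNORE_EMPTY bit is
hypothetical and does not exist yet):

	int len;
	int options = PACKET_READ_GENTLE_ON_EOF |
		      PACKET_READ_IGNORE_EMPTY_PACKET;

	for (;;) {
		/* make room for one maximum-sized payload */
		strbuf_grow(sb_out, LARGE_PACKET_MAX);
		len = packet_read(fd_in, NULL, NULL,
				  sb_out->buf + sb_out->len,
				  LARGE_PACKET_MAX - 4, options);
		if (len <= 0)
			break;	/* flush (0) or, with GENTLE_ON_EOF, error (-1) */
		strbuf_setlen(sb_out, sb_out->len + len);
	}

with your existing "done:" error handling kept for the negative case.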

> +		sb_out->len += len;
> +	}
> +
> +done:
> +	if (ret < 0) {
> +		if (oldalloc == 0)
> +			strbuf_release(sb_out);
> +		else
> +			strbuf_setlen(sb_out, oldlen);
> +		return ret;  /* unexpected EOF */
> +	}
> +	return sb_out->len - oldlen;
> +}
> diff --git a/pkt-line.h b/pkt-line.h
> index 3fa0899..9616117 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -25,6 +25,8 @@ void packet_buf_flush(struct strbuf *buf);
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  int packet_flush_gently(int fd);
>  int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out);
> +int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out);
>  
>  /*
>   * Read a packetized line into the buffer, which must be at least size bytes
> @@ -77,6 +79,11 @@ char *packet_read_line(int fd, int *size);
>   */
>  char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
>  
> +/*
> + * Reads a stream of variable sized packets until a flush packet is detected.
> + */
> +ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out);
> +
>  #define DEFAULT_PACKET_MAX 1000
>  #define LARGE_PACKET_MAX 65520
>  extern char packet_buffer[LARGE_PACKET_MAX];

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 18:46   ` Stefan Beller
  2016-08-25 19:33     ` Lars Schneider
@ 2016-08-25 22:31     ` Junio C Hamano
  2016-08-26  0:55       ` Jacob Keller
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-25 22:31 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Lars Schneider, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

Stefan Beller <sbeller@google.com> writes:

> So the API provided by these read/write functions is intended
> to move a huge chunks of data. And as it puts the data on the wire one
> packet after the other without the possibility to intervene and e.g. send
> a side channel progress bar update, I would question the design of this.

Hmph, I didn't think about it.

But shouldn't one be able to set up sideband and channel one such
large transfer on one band, while multiplexing other payload on
other bands?
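
(Very rough sketch of what I mean, reusing packet_write_gently() from
this series; the convention of "first payload byte picks the band,
1 for data, 2 for progress" is the usual sideband one, but none of
this code exists anywhere:

	static int write_on_band(int fd_out, int band,
				 const char *buf, size_t size)
	{
		static char banded[LARGE_PACKET_MAX];

		if (size > sizeof(banded) - 5)
			return -1;	/* 4 header bytes + 1 band byte */
		banded[0] = band;
		memcpy(banded + 1, buf, size);
		return packet_write_gently(fd_out, banded, size + 1);
	}

A progress update could then be interleaved on band #2 between the
data packets on band #1.)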

> If I understand correctly this will be specifically  used for large
> files locally,
> so e.g. a file of 5 GB (such as a virtual machine tracked in Git), would
> require about 80k packets.

What is wrong about that?  4*80k = 320kB overhead for length fields
to transfer 5GB worth of data?  I do not think it is worth worrying
about it.

But I am more surprised by seeing that "why not a single huge
packet" suggestion immediately after you talked about "without the
possibility to intervene".  They do not seem to be remotely related;
in fact, they are going into opposite directions.

Puzzled.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 22:31     ` Junio C Hamano
@ 2016-08-26  0:55       ` Jacob Keller
  2016-08-26 17:02         ` Stefan Beller
  2016-08-26 17:17         ` Junio C Hamano
  0 siblings, 2 replies; 66+ messages in thread
From: Jacob Keller @ 2016-08-26  0:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Stefan Beller, Lars Schneider, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Thu, Aug 25, 2016 at 3:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> What is wrong about that?  4*80k = 320kB overhead for length fields
> to transfer 5GB worth of data?  I do not think it is worth worrying
> about it.
>
> But I am more surprised by seeing that "why not a single huge
> packet" suggestion immediately after you talked about "without the
> possibility to intervene".  They do not seem to be remotely related;
> in fact, they are going into opposite directions.
>
> Puzzled.

Stefan's argument to me is thus "If we're already going to ignore
sideband packets here, why not go all the way and make variable length
packets and send a single packet of a maximum length? Doing this will
solve some set of future problems nicely and make this code easier."

I'm not sure I agree myself, but that's the logic as I understand it.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-25 21:41   ` Junio C Hamano
@ 2016-08-26  9:17     ` Lars Schneider
  2016-08-26 17:10       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-26  9:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 25 Aug 2016, at 23:41, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_write_fmt() would die in case of a write error even though for
>> some callers an error would be acceptable. Add packet_write_fmt_gently()
>> which writes a formatted pkt-line and returns `0` for success and `-1`
>> for an error.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 12 ++++++++++++
>> pkt-line.h |  1 +
>> 2 files changed, 13 insertions(+)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index e8adc0f..3e8b2fb 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -137,6 +137,18 @@ void packet_write_fmt(int fd, const char *fmt, ...)
>> 	write_or_die(fd, buf.buf, buf.len);
>> }
>> 
>> +int packet_write_fmt_gently(int fd, const char *fmt, ...)
>> +{
>> +	static struct strbuf buf = STRBUF_INIT;
>> +	va_list args;
>> +
>> +	strbuf_reset(&buf);
>> +	va_start(args, fmt);
>> +	format_packet(&buf, fmt, args);
>> +	va_end(args);
>> +	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
>> +}
> 
> Even though it's only a handful of lines, it is a bit ugly to have a
> completely copied implementation only to have _gently().  I suspect
> that you should be able to
> 
> 	static int packet_write_fmt_1(int fd, int gently,
> 					const char *fmt, va_list args)
>        {
> 		struct strbuf buf = STRBUF_INIT;
> 		size_t count;
> 
> 		format_packet(&buf, fmt, args);
> 		
> 		count = write_in_full(fd, buf.buf, buf.len);
>                if (count == buf.len)
>                	return 0;
> 		if (!gently) {
> 			check_pipe(errno);
>                	die_errno("write error");
> 		}
>                return -1;
> 	}
> 
> and then share that between the existing one:
> 
> 	void packet_write_fmt(int fd, const char *fmt, ...)
>        {
> 		va_list args;
> 	        va_start(args, fmt);
>                packet_write_fmt_1(fd, 0, fmt, args);
>                va_end(args);
> 	}
> 
> and the new one:
> 
> 	int packet_write_fmt_gently(int fd, const char *fmt, ...)
>        {
> 		int status;
> 		va_list args;
> 	        va_start(args, fmt);
>                status = packet_write_fmt_1(fd, 1, fmt, args);
>                va_end(args);
> 		return status;
> 	}

I agree with your criticism of the code duplication. 

However, I thought it would be OK, as Peff already 
tried to refactor it...
http://public-inbox.org/git/20160810150139.lpxyrqkr53s5f4sx@sigill.intra.peff.net/

... and I got the impression you agreed with Peff:
http://public-inbox.org/git/xmqqvaz84g9y.fsf@gitster.mtv.corp.google.com/


I will try to refactor it according to your suggestion above. 
Would "packet_write_fmt_1()" be an acceptable name or should 
I come up with something more expressive?

Thank you,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 05/13] pkt-line: add packet_write_gently()
  2016-08-25 21:50   ` Junio C Hamano
@ 2016-08-26  9:40     ` Lars Schneider
  2016-08-26 17:15       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-26  9:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 25 Aug 2016, at 23:50, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_write_fmt() has two shortcomings. First, it uses format_packet()
>> which lets the caller only send string data via "%s". That means it
>> cannot be used for arbitrary data that may contain NULs. Second, it will
>> always die on error.
> 
> As you introduced _gently in 3/13, the latter is no longer a valid
> excuse to add this function.  Just remove the sentence "Second, ...".

Agreed!


>> Add packet_write_gently() which writes arbitrary data and returns `0`
>> for success and `-1` for an error. This function is used by other
>> pkt-line functions in a subsequent patch.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> pkt-line.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>> 
>> diff --git a/pkt-line.c b/pkt-line.c
>> index cad26df..7e8a803 100644
>> --- a/pkt-line.c
>> +++ b/pkt-line.c
>> @@ -3,6 +3,7 @@
>> #include "run-command.h"
>> 
>> char packet_buffer[LARGE_PACKET_MAX];
>> +static char packet_write_buffer[LARGE_PACKET_MAX];
>> static const char *packet_trace_prefix = "git";
>> static struct trace_key trace_packet = TRACE_KEY_INIT(PACKET);
>> static struct trace_key trace_pack = TRACE_KEY_INIT(PACKFILE);
>> @@ -155,6 +156,17 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
>> 	return (write_in_full(fd, buf.buf, buf.len) == buf.len ? 0 : -1);
>> }
>> 
>> +int packet_write_gently(const int fd_out, const char *buf, size_t size)
>> +{
>> +	if (size > sizeof(packet_write_buffer) - 4)
>> +		return -1;
>> +	packet_trace(buf, size, 1);
>> +	memmove(packet_write_buffer + 4, buf, size);
>> +	size += 4;
>> +	set_packet_header(packet_write_buffer, size);
> 
> It may not matter all that much, but from a code-reader's point of
> view, when you know packet_write_buffer[] will contain things A and
> B in this order, and when you have enough information to compute A
> before starting to fill packet_write_buffer[], I would prefer to
> see you actually fill the buffer in that natural order.

I did that because when packet_write_stream_with_flush_from_fd()
calls packet_write_gently() then buf[] is actually packet_write_buffer[]:

https://github.com/larsxschneider/git/blob/d474e6a4c2523b87624a07111eb7a4f2dcd12426/pkt-line.c#L185-L192

Therefore I would overwrite the first 4 bytes. However, the code evolved
that way for some reason, but looking at it now I think that is an obscure,
likely meaningless optimization. I'll use a separate buffer in 
packet_write_stream_with_flush_from_fd() and then fix the order here
following your advice.
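
Something like this for the reordered function (untested sketch; with
the separate read buffer in the caller, buf can no longer alias
packet_write_buffer, so a plain memcpy() is enough):

	int packet_write_gently(const int fd_out, const char *buf, size_t size)
	{
		if (size > sizeof(packet_write_buffer) - 4)
			return -1;
		packet_trace(buf, size, 1);
		/* fill the buffer in its natural order: header first, then payload */
		size += 4;
		set_packet_header(packet_write_buffer, size);
		memcpy(packet_write_buffer + 4, buf, size - 4);
		return (write_in_full(fd_out, packet_write_buffer, size) == size ? 0 : -1);
	}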


> Do you anticipate future need of non-gently variant of this
> function?  If so, perhaps a helper that takes a boolean "am I
> working for the gently variant?" may help share more code.

With helper you mean "an additional boolean parameter"? I don't 
see a need for a non-gently variant right now but I will
add this parameter if you think it is a good idea. What would the
signature look like?

int packet_write_gently(const int fd_out, const char *buf, size_t size, int gentle)

This would follow type_from_string_gently() in object.h.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-25 22:27   ` Junio C Hamano
@ 2016-08-26 10:13     ` Lars Schneider
  2016-08-26 17:21       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-26 10:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 26 Aug 2016, at 00:27, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_write_stream_with_flush_from_fd() and
>> packet_write_stream_with_flush_from_buf() write a stream of packets. All
>> content packets use the maximal packet size except for the last one.
>> After the last content packet a `flush` control packet is written.
>> packet_read_till_flush() reads arbitrary sized packets until it detects
>> a `flush` packet.
> 
> These are awkwardly named and I couldn't guess what the input is (I
> can tell one is to read from fd and the other is <mem,len> buffer,
> but it is unclear if that is in packetized form or just raw data
> stream to be copied to the end from their names) without reading the
> implementation.  I _think_ you read a raw stream of data through the
> end (either EOF or length limit) and write it out packetized, and
> use the flush packet to mark the end of the stream.  In my mind,
> that is "writing a packetized stream".  The words "packetizing" and
> "stream" imply that the stream could consist of more data than what
> would fit in a single packet, which in turn implies that there needs
> to be a way to mark the end of one data item, so with_flush does not
> necessarily have to be their names.
> 
> The counter-part would be "reading a packetized stream".
> 
>> +int packet_write_stream_with_flush_from_fd(int fd_in, int fd_out)
>> +{
> 
> Especially this one I am tempted to suggest "copy-to-packetized-stream",
> as it reads a stream from one fd and then copies out while packetizing.

OK, what function names would be more clear from your point of view?

copy_to_packetized_stream_from_fd()
copy_to_packetized_stream_from_buf()
copy_to_packetized_stream_to_buf()

or

write_packetized_stream_from_fd()
write_packetized_stream_from_buf()
read_packetized_stream_to_buf()

?


>> +int packet_write_stream_with_flush_from_buf(const char *src_in, size_t len, int fd_out)
>> +{
>> +	int err = 0;
>> +	size_t bytes_written = 0;
>> +	size_t bytes_to_write;
>> +
>> +	while (!err) {
>> +		if ((len - bytes_written) > sizeof(packet_write_buffer) - 4)
>> +			bytes_to_write = sizeof(packet_write_buffer) - 4;
>> +		else
>> +			bytes_to_write = len - bytes_written;
>> +		if (bytes_to_write == 0)
>> +			break;
> 
> The lack of COPY_WRITE_ERROR puzzled me briefly here.  If you are
> assuming that your math at the beginning of this loop is correct and
> bytes_to_write will never exceed the write-buffer size, I think you
> should be able to (and it would be better to) assume that the math
> you do to tell xread() up to how many bytes it is allowed to read at
> once is also correct, losing the COPY_WRITE_ERROR check in the other
> function.  You can choose to play safer and do a check in this
> function, too.  Either way, we would want to be consistent.

OK. I'll remove the (I just realized meaningless) check in the other function:

+		if (bytes_to_write > sizeof(packet_write_buffer) - 4)
+			return COPY_WRITE_ERROR;

> 
>> +		err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write);
>> +		bytes_written += bytes_to_write;
>> +	}
>> +	if (!err)
>> +		err = packet_flush_gently(fd_out);
>> +	return err;
>> +}
> 
>> +ssize_t packet_read_till_flush(int fd_in, struct strbuf *sb_out)
>> +{
>> +	int len, ret;
>> +	int options = PACKET_READ_GENTLE_ON_EOF;
>> +	char linelen[4];
>> +
>> +	size_t oldlen = sb_out->len;
>> +	size_t oldalloc = sb_out->alloc;
>> +
>> +	for (;;) {
>> +		/* Read packet header */
>> +		ret = get_packet_data(fd_in, NULL, NULL, linelen, 4, options);
>> +		if (ret < 0)
>> +			goto done;
>> +		len = packet_length(linelen);
>> +		if (len < 0)
>> +			die("protocol error: bad line length character: %.4s", linelen);
>> +		if (!len) {
>> +			/* Found a flush packet - Done! */
>> +			packet_trace("0000", 4, 0);
>> +			break;
>> +		}
>> +		len -= 4;
>> +
>> +		/* Read packet content */
>> +		strbuf_grow(sb_out, len);
>> +		ret = get_packet_data(fd_in, NULL, NULL, sb_out->buf + sb_out->len, len, options);
>> +		if (ret < 0)
>> +			goto done;
>> +		if (ret != len) {
>> +			error("protocol error: incomplete read (expected %d, got %d)", len, ret);
>> +			goto done;
>> +		}
>> +
>> +		packet_trace(sb_out->buf + sb_out->len, len, 0);
> 
> All of the above seems to pretty much duplicate the logic in
> packet_read(), except that this user does not need options handling
> it has.  Is optimizing that out the reason why you open-coded it
> here?

No.

> Or is it because you cannot tell if you got a truly empty packet or
> you got a flush from outside packet_read(), and you wanted to make
> sure that you won't be fooled by a normal packet with 0-length
> payload?

Correct!

> 
> If the latter is the reason, it may be a viable alternative to
> update packet_read() to take PACKET_READ_IGNORE_EMPTY_PACKET, i.e. a
> new bit in its options parameter, so that a normal packet with
> 0-length payload is simply ignored there (i.e. even without
> returning, packet_read() would repeat from the beginning when it got
> such a packet).  That way, the above would become 
> 
> 	strbuf_grow(); /* enough to hold max-packet-len more bytes */
> 	len = packet_read();
>        if (!len)
>        	/* we cannot get 0 unless we see flush */
>                break;
> 
> which may be a lot cleaner?

Good idea! I will refactor it that way!


Thanks a lot for the review,
Lars


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-26  0:55       ` Jacob Keller
@ 2016-08-26 17:02         ` Stefan Beller
  2016-08-26 17:21           ` Jeff King
  2016-08-26 17:17         ` Junio C Hamano
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Beller @ 2016-08-26 17:02 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Junio C Hamano, Lars Schneider, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Thu, Aug 25, 2016 at 5:55 PM, Jacob Keller <jacob.keller@gmail.com> wrote:
> On Thu, Aug 25, 2016 at 3:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> What is wrong about that?  4*80k = 320kB overhead for length fields
>> to transfer 5GB worth of data?  I do not think it is worth worrying
>> about it.
>>
>> But I am more surprised by seeing that "why not a single huge
>> packet" suggestion immediately after you talked about "without the
>> possibility to intervene".  They do not seem to be remotely related;
>> in fact, they are going into opposite directions.
>>
>> Puzzled.
>
> Stefan's argument to me is thus "If we're already going to ignore
> sideband packets here, why not go all the way and make variable length
> packets and send a single packet of a maximum length? Doing this will
> solve some set of future problems nicely and make this code easier."
>
> I'm not sure I agree myself, but that's the logic as I understand it.

Yeah. To me it seems this design explicitly makes it hard for side bands.
As we do not need sidebands for local transfers, this is fine for sure.

(If we wanted to make it sideband friendly, I'd expect you could register
callbacks for either all packets or for a given sideband until the next
flush comes.)
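
Purely illustrative, nothing like this exists in the series or
elsewhere, but I imagine an interface roughly like:

	typedef int (*band_fn)(int band, const char *data,
			       size_t len, void *cb_data);

	/*
	 * Read packets until a flush and hand each payload to the
	 * callback; band 0 would mean "no sideband, raw data".
	 */
	int packet_read_until_flush_cb(int fd_in, band_fn cb, void *cb_data);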

So as hinted by this design, we want a protocol that
* doesn't care about sidebands
* cares about large data (hence maybe throughput)
* has easy/clean interface

And one large packet would suffice for these three points as well
and additionally has benefits for the network stuff.

The 320kB of additional transmission is negligible overhead, so I was not
concerned about the size, but rather the code being bloated, i.e. we need
one layer of additional code to cope with the repetitive packets.

---
My background is mostly submodule related, and whenever I come up
with a shiny novel idea that would help submodules tremendously, someone
(mostly Peff) comes along and suggests a broader more generic thing, that
works just as well for submodules but is applicable to all of Git.

So I picked up that way of thinking: if we write code here that helps with a
very special niche of things, can we make it so that the rest also benefits?

I may not understand all the requirements in this case, but to me it looks
like the "one packet" approach covers all of them and has the huge potential
to make other parts better in the long run.
-- 

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-26  9:17     ` Lars Schneider
@ 2016-08-26 17:10       ` Junio C Hamano
  2016-08-26 17:23         ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 17:10 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

> I agree with your criticism of the code duplication. 
>
> However, I thought it would be OK, as Peff already 
> tried to refactor it...
> http://public-inbox.org/git/20160810150139.lpxyrqkr53s5f4sx@sigill.intra.peff.net/
>
> ... and I got the impression you agreed with Peff:
> http://public-inbox.org/git/xmqqvaz84g9y.fsf@gitster.mtv.corp.google.com/

The former does not exactly show how ugly it was, but I do not have
to see it.  It is talking about eliminating the need for memcpy()
and duplicated header generation code, which the suggestion you are
responding to didn't even attempt.  If Peff said he tried an even
more aggressive refactoring and it ended up too ugly to live, I
believe him and agree with his assessment.

> I will try to refactor it according to your suggestion above. 
> Would "packet_write_fmt_1()" be an acceptable name or should 
> I come up with something more expressive?

The latter is preferable, but we do not mind too strongly about
the name of a file-scope static helper that will never be called
directly by anybody other than the two more public entry points the
helper was designed to serve.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 05/13] pkt-line: add packet_write_gently()
  2016-08-26  9:40     ` Lars Schneider
@ 2016-08-26 17:15       ` Junio C Hamano
  2016-08-29  9:40         ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 17:15 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

>> Do you anticipate future need of non-gently variant of this
>> function?  If so, perhaps a helper that takes a boolean "am I
>> working for the gently variant?" may help share more code.
>
> With helper you mean "an additional boolean parameter"? I don't 
> see a need for a non-gently variant right now but I will
>> add this parameter if you think it is a good idea. What would the
>> signature look like?
>
> int packet_write_gently(const int fd_out, const char *buf, size_t size, int gentle)
>
> This would follow type_from_string_gently() in object.h.

I actually imagined it would be more like your packet_write_fmt vs
packet_write_fmt_gently pair of functions.  If you do not have an
immediate need for a non-gentle packet_write() right now, but you
still foresee that it is likely some other caller may want one, you
could still prepare for it by doing a static

	packet_write_1(const int fd_out, const char *buf, size_t size, int gentle)

and make packet_write_gently() call it with gentle=1, without
actually introducing packet_write() nobody yet calls.
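
To be concrete, that would look something like this, mirroring the body
of packet_write_gently() from your patch (the error messages are just
placeholders, not something I feel strongly about):

	static int packet_write_1(const int fd_out, const char *buf,
				  size_t size, int gentle)
	{
		if (size > sizeof(packet_write_buffer) - 4) {
			if (gentle)
				return -1;
			die("packet write failed - data exceeds max packet size");
		}
		packet_trace(buf, size, 1);
		size += 4;
		set_packet_header(packet_write_buffer, size);
		memcpy(packet_write_buffer + 4, buf, size - 4);
		if (write_in_full(fd_out, packet_write_buffer, size) == size)
			return 0;
		if (gentle)
			return -1;
		die_errno("packet write failed");
	}

	int packet_write_gently(const int fd_out, const char *buf, size_t size)
	{
		return packet_write_1(fd_out, buf, size, 1);
	}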



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-26  0:55       ` Jacob Keller
  2016-08-26 17:02         ` Stefan Beller
@ 2016-08-26 17:17         ` Junio C Hamano
  1 sibling, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 17:17 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Stefan Beller, Lars Schneider, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

Jacob Keller <jacob.keller@gmail.com> writes:

> Stefan's argument to me is thus "If we're already going to ignore
> sideband packets here, why not go all the way and make variable length
> packets and send a single packet of a maximum length? Doing this will
> solve some set of future problems nicely and make this code easier."
>
> I'm not sure I agree myself, but that's the logic as I understand it.

I do not know if I agree, but now I understand the flow of logic.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-26 10:13     ` Lars Schneider
@ 2016-08-26 17:21       ` Junio C Hamano
  2016-08-29  9:43         ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 17:21 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

> OK, what function names would be more clear from your point of view?
>
> write_packetized_stream_from_fd()
> write_packetized_stream_from_buf()
> read_packetized_stream_to_buf()

Would

    write_packetized_from_fd()
    write_packetized_from_buf()
    read_packetized_to_buf()

be shorter, still understandable, be consistent with the existing
convention to name write_/read_ a function that shuffles bytes via a
file descriptor, and to the point, perhaps?
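
(i.e. the declarations added to pkt-line.h in 06/13 would simply read

	int write_packetized_from_fd(int fd_in, int fd_out);
	int write_packetized_from_buf(const char *src_in, size_t len, int fd_out);
	ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out);

with the signatures kept exactly as in your patch.)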


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-26 17:02         ` Stefan Beller
@ 2016-08-26 17:21           ` Jeff King
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-08-26 17:21 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jacob Keller, Junio C Hamano, Lars Schneider, git@vger.kernel.org,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

On Fri, Aug 26, 2016 at 10:02:52AM -0700, Stefan Beller wrote:

> Yeah. To me it seems this design explicitly makes it hard for side bands.
> As we do not need sidebands for local transfers, this is fine for sure.
> 
> (If we wanted to make it sideband friendly, I'd expect you could register
> callbacks for either all packets or for a given sideband until the next
> flush comes.)
> 
> So as hinted by this design, we want a protocol that
> * doesn't care about sidebands
> * cares about large data (hence maybe throughput)
> * has easy/clean interface
> 
> And one large packet would suffice for these three points as well
> and additionally has benefits for the network stuff.
> 
> The 320kB additional transmission are negligible overhead, so I was not
> concerned about the size, but rather the code being bloated, i.e. we need
> one layer of additional code to cope with the repetitive packets.

Maybe I don't understand what you mean by "one large packet". But if you
mean sending "I am about to send you 100MB" followed by the 100MB, the
point is that the sender does not necessarily have that exact value
ahead of time. So it would want to write it in chunks.

E.g., consider a clean which replaces s/foo/bar/ in its content. It
would write out all the content up to the next "foo", then write "bar",
and repeat.

We could let each chunk be arbitrarily big. I.e., "I am about to send
you 50MB", then "here are 3 bytes", then "here are the other 50MB".

But using a fixed-size length header makes the packets easy to generate
and parse. And 64KB is small enough that senders and receivers can
easily buffer single packets, making interfaces simpler. And it's big
enough that the overhead (4 bytes per 64KB) is negligible.
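
The framing really is just those four hex digits (the length counts the
header itself); a minimal sketch of the sending side, which is
essentially what the set_packet_header() helper in this series boils
down to:

	static void hex4_header(char *buf, size_t total_len)
	{
		static const char hex[] = "0123456789abcdef";

		/* total_len includes these 4 header bytes */
		buf[0] = hex[(total_len >> 12) & 0xf];
		buf[1] = hex[(total_len >> 8) & 0xf];
		buf[2] = hex[(total_len >> 4) & 0xf];
		buf[3] = hex[total_len & 0xf];
	}

The receiver reads four bytes, knows exactly how much to read next, and
"0000" doubles as the flush marker.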

Anyway. It certainly does not seem worth moving the network protocols to
a new data format. The compatibility changes would not be worth it. But
if we _were_ to do so, there are tons of efficient well-thought-out
data-marshaling formats we could use rather than inventing our own. But
I just don't see the benefit.

> ---
> My background is mostly submodule related, and whenever I come up
> with a shiny novel idea that would help submodules tremendously, someone
> (mostly Peff) comes along and suggests a broader more generic thing, that
> works just as well for submodules but is applicable to all of Git.

Heh. It is not always a good thing, if it derails the original purpose
of the code (and I include some of my suggestions in that). :)

The trick is to make it general enough, without getting lost in the
weeds of what _might_ happen in the future and wasting time (and
adding complexity).

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently()
  2016-08-26 17:10       ` Junio C Hamano
@ 2016-08-26 17:23         ` Jeff King
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-08-26 17:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, sbeller, Johannes.Schindelin, jnareb,
	mlbright

On Fri, Aug 26, 2016 at 10:10:50AM -0700, Junio C Hamano wrote:

> Lars Schneider <larsxschneider@gmail.com> writes:
> 
> > I agree with your criticism of the code duplication. 
> >
> > However, I thought it would be OK, as Peff already 
> > tried to refactor it...
> > http://public-inbox.org/git/20160810150139.lpxyrqkr53s5f4sx@sigill.intra.peff.net/
> >
> > ... and I got the impression you agreed with Peff:
> > http://public-inbox.org/git/xmqqvaz84g9y.fsf@gitster.mtv.corp.google.com/
> 
> The former does not exactly show how ugly it was, but I do not have
> to see it.  It is talking about eliminating the need for memcpy()
> and duplicated header generation code, which the suggestion you are
> responding to didn't even attempt.  If Peff said he tried an even
> more aggressive refactoring and it ended up too ugly to live, I
> believe him and agree with his assessment.

Right, what I had difficulty with was factoring out format_packet(). I
think the "gently" part is easy.

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size
  2016-08-25 19:35     ` Lars Schneider
@ 2016-08-26 19:44       ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 19:44 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Stefan Beller, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

Lars Schneider <larsxschneider@gmail.com> writes:

>> And as this is a strict bug fix of Documentation that makes sense
>> outside this series,
>
> Agreed!

Amen.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 08/13] convert: quote filter names in error messages
  2016-08-25 11:07 ` [PATCH v6 08/13] convert: quote filter names in error messages larsxschneider
@ 2016-08-26 19:45   ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 19:45 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard to
> read in error messages. Quote them to improve the readability.

Sounds good.  Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 09/13] convert: modernize tests
  2016-08-25 11:07 ` [PATCH v6 09/13] convert: modernize tests larsxschneider
@ 2016-08-26 20:03   ` Junio C Hamano
  2016-08-29 10:09     ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-26 20:03 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> Use `test_config` to set the config, check that files are empty with
> `test_must_be_empty`, compare files with `test_cmp`, and remove spaces
> after ">" and "<".

All of the above are good things to do, but the first one needs to
be done a bit carefully.

It is unclear in the above description if you made sure that no
subsequent test depends on the settings left by an earlier test before
replacing "git config" with "test_config".

> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 7bac2bc..7b45136 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -13,8 +13,8 @@ EOF
>  chmod +x rot13.sh
>  
>  test_expect_success setup '
> -	git config filter.rot13.smudge ./rot13.sh &&
> -	git config filter.rot13.clean ./rot13.sh &&
> +	test_config filter.rot13.smudge ./rot13.sh &&
> +	test_config filter.rot13.clean ./rot13.sh &&
>  
>  	{
>  	    echo "*.t filter=rot13"

For example, after this conversion, filter.rot13.* will be reverted
back to unconfigured once setup is done.

>  test_expect_success check '
>  
> -	cmp test.o test &&
> -	cmp test.o test.t &&
> +	test_cmp test.o test &&
> +	test_cmp test.o test.t &&
>  
>  	# ident should be stripped in the repository
>  	git diff --raw --exit-code :test :test.i &&

It happens to be true that this subsequent test does not do any
operation to cause data to come from and go into the object database
for any path that matches the pattern "*.t", because for whatever
reason the previous "setup" step happens to do more than just
"setup".  It already exercised the filtering by doing "git add" and
"git checkout".

If we were writing the script t0021 from scratch today, we would
have used test_config *AND* squashed these two tests into one
(i.e. making it a single "the filter and ident operation" test).
Then it is crystal clear that additional tests on commands that may
involve filtering will always be added to that combined test
(e.g. we may try "cat-file --filters" to ensure that the rot13 filter is
exercised).

But because we are not doing that, it may become tempting to add a
test for a new command that pays attention to a filter to either of
the tests, and it would have worked OK if this patch weren't there.
Such an addition will break the test, because in the second "check"
test, the filter commands are no longer active with this patch.

So while I do appreciate the change for the longer term, I am not
quite happy with it.  It just looks like an invitation for future
mistakes.

> @@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
>  
>  	# delete the files and check them out again, using a smudge filter
>  	# that will count the args and echo the command-line back to us
> -	git config filter.argc.smudge "sh ./argc.sh %f" &&
> +	test_config filter.argc.smudge "sh ./argc.sh %f" &&
>  	rm "$normal" "$special" &&
>  	git checkout -- "$normal" "$special" &&

This one is clearly OK.  Anything related to argc filter only
appears in this single test so it is not just OK to use test_config,
but it makes our intention very clear that nobody else would use the
argc filter.  I think similar reasoning would apply to the test_config
conversion in the remainder of the script.

As to other types of changes you did in this, I see no problem with
them.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 05/13] pkt-line: add packet_write_gently()
  2016-08-26 17:15       ` Junio C Hamano
@ 2016-08-29  9:40         ` Lars Schneider
  0 siblings, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-08-29  9:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jeff King, Stefan Beller, Johannes.Schindelin,
	Jakub Narębski, mlbright


> On 26 Aug 2016, at 19:15, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> Do you anticipate future need of non-gently variant of this
>>> function?  If so, perhaps a helper that takes a boolean "am I
>>> working for the gently variant?" may help share more code.
>> 
>> With helper you mean "an additional boolean parameter"? I don't 
>> see a need for a non-gently variant right now but I will
>> add this parameter if you think it is a good idea. How would the
>> signature look like?
>> 
>> int packet_write_gently(const int fd_out, const char *buf, size_t size, int gentle)
>> 
>> This would follow type_from_string_gently() in object.h.
> 
> I actually imagined it would be more like your packet_write_fmt vs
> packet_write_fmt_gently pair of functions.  If you do not have an
> immediate need for a non-gentle packet_write() right now, but you
> still foresee that it is likely some other caller may want one, you
> could still prepare for it by doing a static
> 
> 	packet_write_1(const int fd_out, const char *buf, size_t size, int gentle)
> 
> and make packet_write_gently() call it with gentle=1, without
> actually introducing packet_write() nobody yet calls.

I see. In that case I would like to keep packet_write_gently() as is
because I don't see the need for a non-gently variant right now.

If there is a need for packet_write() then we could just add it and
move the packet_write_gently() code to packet_write_1() following your
suggestion. No caller would need to change for this refactoring.

If you strongly disagree then I would use the "two function" approach
you suggested above right away, though.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams
  2016-08-26 17:21       ` Junio C Hamano
@ 2016-08-29  9:43         ` Lars Schneider
  0 siblings, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-08-29  9:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 26 Aug 2016, at 19:21, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> OK, what function names would be more clear from your point of view?
>> 
>> write_packetized_stream_from_fd()
>> write_packetized_stream_from_buf()
>> read_packetized_stream_to_buf()
> 
> Would
> 
>    write_packetized_from_fd()
>    write_packetized_from_buf()
>    read_packetized_to_buf()
> 
> be shorter, still understandable, be consistent with the existing
> convention to name write_/read_ a function that shuffles bytes via a
> file descriptor, and to the point, perhaps?

Agreed.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 09/13] convert: modernize tests
  2016-08-26 20:03   ` Junio C Hamano
@ 2016-08-29 10:09     ` Lars Schneider
  0 siblings, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-08-29 10:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 26 Aug 2016, at 22:03, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Use `test_config` to set the config, check that files are empty with
>> `test_must_be_empty`, compare files with `test_cmp`, and remove spaces
>> after ">" and "<".
> 
> All of the above are good things to do, but the first one needs to
> be done a bit carefully.
> 
> It is unclear in the above description if you made sure that no
> subsequent test depends on the settings left by an earlier test before
> replacing "git config" with "test_config".

I assumed that I would see test failures if a subsequent test depends
on the settings left by earlier tests. I'll add this comment to the
commit message.


>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
>> index 7bac2bc..7b45136 100755
>> --- a/t/t0021-conversion.sh
>> +++ b/t/t0021-conversion.sh
>> @@ -13,8 +13,8 @@ EOF
>> chmod +x rot13.sh
>> 
>> test_expect_success setup '
>> -	git config filter.rot13.smudge ./rot13.sh &&
>> -	git config filter.rot13.clean ./rot13.sh &&
>> +	test_config filter.rot13.smudge ./rot13.sh &&
>> +	test_config filter.rot13.clean ./rot13.sh &&
>> 
>> 	{
>> 	    echo "*.t filter=rot13"
> 
> For example, after this conversion, filter.rot13.* will be reverted
> back to unconfigured once setup is done.
> 
>> test_expect_success check '
>> 
>> -	cmp test.o test &&
>> -	cmp test.o test.t &&
>> +	test_cmp test.o test &&
>> +	test_cmp test.o test.t &&
>> 
>> 	# ident should be stripped in the repository
>> 	git diff --raw --exit-code :test :test.i &&
> 
> It happens to be true that this subsequent test does not do any
> operation to cause data to come from and go into the object database
> for any path that matches the pattern "*.t", because for whatever
> reason the previous "setup" step happens to do more than just
> "setup".  It already exercised the filtering by doing "git add" and
> "git checkout".
> 
> If we were writing the script t0021 from scratch today, we would
> have used test_config *AND* squashed these two tests into one
> (i.e. making it a single "the filter and ident operation" test).
> Then it is crystal clear that additional tests on commands that may
> involve filtering will always be added to that combined test
> (e.g. we may try "cat-file --filters" to ensure that the rot13 filter is
> exercised).
> 
> But because we are not doing that, it may become tempting to add a
> test for a new command that pays attention to a filter to either of
> the tests, and it would have worked OK if this patch weren't there.
> Such an addition will break the test, because in the second "check"
> test, the filter commands are no longer active with this patch.
> 
> So while I do appreciate the change for the longer term, I am not
> quite happy with it.  It just looks like an invitation for future
> mistakes.

I'll follow your judgement here. However, from my point of view a future
addition that causes test failures is no issue, because these test failures
would be analyzed and the tests could then be refactored accordingly.


>> @@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
>> 
>> 	# delete the files and check them out again, using a smudge filter
>> 	# that will count the args and echo the command-line back to us
>> -	git config filter.argc.smudge "sh ./argc.sh %f" &&
>> +	test_config filter.argc.smudge "sh ./argc.sh %f" &&
>> 	rm "$normal" "$special" &&
>> 	git checkout -- "$normal" "$special" &&
> 
> This one is clearly OK.  Anything related to argc filter only
> appears in this single test so it is not just OK to use test_config,
> but it makes our intention very clear that nobody else would use the
> argc filter.  I think similar reasoning would apply to the test_config
> conversion in the remainder of the script.

OK, then I will keep these test_config's and revert the first one.


> As to other types of changes you did in this, I see no problem with
> them.


Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 00/13] Git filter protocol
  2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
                   ` (12 preceding siblings ...)
  2016-08-25 11:07 ` [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes larsxschneider
@ 2016-08-29 15:39 ` Lars Schneider
  2016-08-29 18:09   ` Junio C Hamano
  13 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-29 15:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jeff King, Stefan Beller, Johannes.Schindelin,
	Jakub Narębski, Martin-Louis Bright


> On 25 Aug 2016, at 13:07, larsxschneider@gmail.com wrote:
> 
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> The goal of this series is to avoid launching a new clean/smudge filter
> process for each file that is filtered.
> 
> A short summary about v1 to v5 can be found here:
> https://git.github.io/rev_news/2016/08/17/edition-18/
> 
> This series is also published on web:
> https://github.com/larsxschneider/git/pull/10

Hi Junio,

I am working on v7 of this patch series and I noticed that it does
not merge cleanly anymore (not even with git/git master). Should
I rebase my patch series? If yes, what would be the most
convenient base for you?

Thanks,
Lars


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-25 11:07 ` [PATCH v6 10/13] convert: generate large test files only once larsxschneider
  2016-08-25 19:17   ` Stefan Beller
@ 2016-08-29 17:46   ` Junio C Hamano
  2016-08-30 11:41     ` Lars Schneider
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 17:46 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 7b45136..34c8eb9 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -4,6 +4,15 @@ test_description='blob conversion via gitattributes'
>  
>  . ./test-lib.sh
>  
> +if test_have_prereq EXPENSIVE
> +then
> +	T0021_LARGE_FILE_SIZE=2048
> +	T0021_LARGISH_FILE_SIZE=100
> +else
> +	T0021_LARGE_FILE_SIZE=30
> +	T0021_LARGISH_FILE_SIZE=2
> +fi

Minor: do we need T0021_ prefix?  What are you trying to avoid
collisions with?

> +	git checkout -- test test.t test.i &&
> +
> +	mkdir generated-test-data &&
> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
> +	do
> +		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
> +		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"

In an earlier iteration of the loop, with a lower $i, what guarantees that
some bytes survive "tr -dc"?

> +		# Generate 1MB of empty data and 100 bytes of random characters

100 bytes?  It seems to me that you are giving 1MB and then $i bytes
or fewer (which can sometimes be zero) of random string.

> +		# printf "$(test-genrandom start $i)"
> +		printf "%1048576d" 1 >>generated-test-data/large.file &&
> +		printf "$RANDOM_STRING" >>generated-test-data/large.file &&
> +		printf "%1048576d" 1 >>generated-test-data/large.file.rot13 &&
> +		printf "$ROT_RANDOM_STRING" >>generated-test-data/large.file.rot13 &&
> +
> +		if test $i = $T0021_LARGISH_FILE_SIZE
> +		then
> +			cat generated-test-data/large.file >generated-test-data/largish.file &&
> +			cat generated-test-data/large.file.rot13 >generated-test-data/largish.file.rot13
> +		fi
> +	done

This "now we are done with the loop, so copy them to the second
pair" needs to be in the loop?  Shouldn't it come after 'done'?

I do not quite get the point of this complexity.  You are using
exactly the same seed "end" every time, so in the first round you
have 1M of SP, letter '1', letter 'S' (from the genrandom), then
in the second round you have 1M of SP, letter '1', letter 'S' and
letter 'p' (the last two from the genrandom), and go on.  Is it
significant for the purpose of your test that the cruft inserted
between the repetition of 1M of SP gets longer by one byte but they
all share the same prefix (e.g. "1S", "1Sp", "1SpZ", "1SpZT",
... are what you insert between a large run of spaces)?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-25 19:54     ` Lars Schneider
@ 2016-08-29 17:52       ` Junio C Hamano
  2016-08-30 11:47         ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 17:52 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Stefan Beller, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 25 Aug 2016, at 21:17, Stefan Beller <sbeller@google.com> wrote:
>> 
>>> On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>> 
>>> Generate more interesting large test files
>> 
>> How are the large test files more interesting?
>> (interesting in the notion of covering more potential bugs?
>> easier to debug? better to maintain, or just a pleasant read?)
>
> The old large test file was 1MB of zeros and 1 byte with a one, repeated 2048 times.
>
> Since the filter uses 64k packets we would test a large number of identical-looking packets.
>
> That's why I thought the pseudo-random content is more interesting.

I guess my real question is why it is not just a single invocation
of test-genrandom that gives you the whole test file; if you are
using 20MB, the simplest would be to grab 20MB out of test-genrandom.
With that hopefully you won't see a large number of identical-looking
packets, no?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-25 11:07 ` [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes larsxschneider
@ 2016-08-29 18:05   ` Junio C Hamano
  2016-08-29 19:03     ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 18:05 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> Consider the case of a file that requires filtering and is present in
> branch A but not in branch B. If A is the current HEAD and we checkout B
> then the following happens:
>
> 1. ce_compare_data() opens the file
> 2.   index_fd() detects that the file requires a clean filter to be run and
>      calls index_stream_convert_blob()
> 3.     index_stream_convert_blob() calls convert_to_git_filter_fd()
> 4.       convert_to_git_filter_fd() calls apply_filter() which creates a
>          new long-running filter process (in case it is the first file
>          of this kind to be filtered)
> 5.       The new filter process inherits all file handles. This is the
>          default on Linux/OSX and is explicitly defined in the
>          `CreateProcessW` call in `mingw.c` on Windows.
> 6. ce_compare_data() closes the file
> 7. Git unlinks the file as it is not present in B
>
> The unlink operation does not work on Windows because the filter process
> still has an open handle to the file. Apparently that is no problem on
> Linux/OSX, probably because "[...] the two file descriptors share open
> file status flags" (see fork(2)).

Wait a minute.  "that is no problem" may be true as long as "that"
is "unlinking the now-gone file in the filesystem", but the reason
does not have anything to do with the "open-file status flags";
unlike Windows, you _can_ unlink a file that has an open file
descriptor on it.

And even on POSIX systems, if you are doing a long-running helper,
any open file descriptor in the parent process when the long-running
helper is spawned will become a leaked fd.  CLOEXEC is a possible
solution (but not necessarily the only or the best one) to the fd
leak in this case.

How much does the code that spawns these long-running helpers know
about the file descriptors that happen to be open?  The parent is
very likely to have pack windows open into .pack files and they need
to be closed on the child side after fork(2) starts the child
process but before execve(2) runs the helper, if we want to avoid
file descriptor leaks.

> Fix this problem by opening files in read-cache with the `O_CLOEXEC`
> flag to ensure that the file descriptor does not remain open in a newly
> spawned process. `O_CLOEXEC` is defined as `O_NOINHERIT` on Windows. A
> similar fix for temporary file handles was applied on Git for Windows
> already:
> https://github.com/git-for-windows/git/commit/667b8b51ec850c3e1c7d75dee69dc13c29d1f162

After a few iterations, the final version of the same commit is now
in our tree as d5cb9cbd ("Git 2.10-rc2", 2016-08-26).

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 00/13] Git filter protocol
  2016-08-29 15:39 ` [PATCH v6 00/13] Git filter protocol Lars Schneider
@ 2016-08-29 18:09   ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 18:09 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git Mailing List, Jeff King, Stefan Beller, Johannes.Schindelin,
	Jakub Narębski, Martin-Louis Bright

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 25 Aug 2016, at 13:07, larsxschneider@gmail.com wrote:
>> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> The goal of this series is to avoid launching a new clean/smudge filter
>> process for each file that is filtered.
>> 
>> A short summary about v1 to v5 can be found here:
>> https://git.github.io/rev_news/2016/08/17/edition-18/
>> 
>> This series is also published on web:
>> https://github.com/larsxschneider/git/pull/10
>
> Hi Junio,
>
> I am working on v7 of this patch series and I noticed that it does
> not merge cleanly anymore (not even with git/git master). Should
> I rebase my patch series? If yes, what would be the most
> convenient base for you?

Depends on what you are conflicting with. I think you want to use
something that has at least 05d1ed61 ("mingw: ensure temporary file
handles are not inherited by child processes", 2016-08-22).  Perhaps
2.10 final?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-29 18:05   ` Junio C Hamano
@ 2016-08-29 19:03     ` Lars Schneider
  2016-08-29 19:45       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-29 19:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 29 Aug 2016, at 20:05, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Consider the case of a file that requires filtering and is present in
>> branch A but not in branch B. If A is the current HEAD and we checkout B
>> then the following happens:
>> 
>> 1. ce_compare_data() opens the file
>> 2.   index_fd() detects that the file requires a clean filter to be run and
>>     calls index_stream_convert_blob()
>> 3.     index_stream_convert_blob() calls convert_to_git_filter_fd()
>> 4.       convert_to_git_filter_fd() calls apply_filter() which creates a
>>         new long-running filter process (in case it is the first file
>>         of this kind to be filtered)
>> 5.       The new filter process inherits all file handles. This is the
>>         default on Linux/OSX and is explicitly defined in the
>>         `CreateProcessW` call in `mingw.c` on Windows.
>> 6. ce_compare_data() closes the file
>> 7. Git unlinks the file as it is not present in B
>> 
>> The unlink operation does not work on Windows because the filter process
>> still has an open handle to the file. Apparently that is no problem on
>> Linux/OSX, probably because "[...] the two file descriptors share open
>> file status flags" (see fork(2)).
> 
> Wait a minute.  "that is no problem" may be true as long as "that"
> is "unlinking the now-gone file in the filesystem", but the reason
> does not have anything to do with the "open-file status flags";
> unlike Windows, you _can_ unlink a file that has an open file
> descriptor on it.

I see. Thanks for the explanation.

> 
> And even on POSIX systems, if you are doing a long-running helper,
> any open file descriptor in the parent process when the long-running
> helper is spawned will become a leaked fd.  CLOEXEC is a possible
> solution (but not necessarily the only or the best one) to the fd
> leak in this case.
> 
> How much does the code that spawns these long-running helpers know
> about the file descriptors that happen to be open?

Nothing really.

>  The parent is
> very likely to have pack windows open into .pack files and they need
> to be closed on the child side after fork(2) starts the child
> process but before execve(2) runs the helper, if we want to avoid
> file descriptor leaks.

I think I understand what you are saying. However, during my tests
.pack file fd's were never a problem. I use start_command() [1]
which wraps the fork() and exec calls [2].

How would I find the open .pack file fd's? Should I go through 
/proc/PID/fd? Why is this no problem for other longer running 
commands such as the git-credential-cache--daemon or git-daemon?

Thanks,
Lars


[1] https://github.com/larsxschneider/git/blob/protocol-filter/v6/convert.c#L566
[2] https://github.com/larsxschneider/git/blob/protocol-filter/v6/run-command.c#L345-L412



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-29 19:03     ` Lars Schneider
@ 2016-08-29 19:45       ` Junio C Hamano
  2016-08-30 12:32         ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 19:45 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

> I see. Thanks for the explanation.

I expect the updated log message to explain the issue correctly
then.

>> And even on POSIX systems, if you are doing a long-running helper,
>> any open file descriptor in the parent process when the long-running
>> helper is spawned will become a leaked fd.  CLOEXEC is a possible
>> solution (but not necessarily the only or the best one) to the fd
>> leak in this case.
>> 
>> How much does the code that spawns these long-running helpers know
>> about the file descriptors that happen to be open?
>
> Nothing really.

OK.

>> The parent is
>> very likely to have pack windows open into .pack files and they need
>> to be closed on the child side after fork(2) starts the child
>> process but before execve(2) runs the helper, if we want to avoid
>> file descriptor leaks.
>
> I think I understand what you are saying. However, during my tests
> .pack file fd's were never a problem.

I do not expect that during the lifetime of your long-running helper
anybody would try to unlink an existing packfile, so it is unlikely
that the "cannot unlink an open file on Windows" issue would come up.  And
the cross-platform problem I pointed out is a fd leak; leaks would
not show up until you run out of the resource, just like you
wouldn't notice a small memory leak here and there UNLESS you actively
look for them.  I would be surprised if your "tests" found anything.

> How would I find the open .pack file fd's?  Should I go through
> /proc/PID/fd? Why is this no problem for other longer running
> commands such as the git-credential-cache--daemon or git-daemon?

Nobody said this is "no problem for others".  While discussing the
patch that started this thread, we noticed that any open file
descriptor the main process has when the long-running clean/smudge
helper is spawned _is_ a problem.  Other helpers may share the same
problem, and if they do, that is all the more reason that fixing the
file descriptor leak is a good thing.

The primary thing I was wondering was if we should open the window
into packfiles with CLOEXEC, just like we recently started opening
the tempfiles and lockfiles with the flag.  The reason why I asked
if the site that spawns (i.e. the run_command API) has easy access to
the list of file descriptors that we opened ONLY for ourselves is
because that would make it possible to consider an alternative
approach: close them before execve(2) in the run_command API.  That
would save us from having to sprinkle existing calls to open(2) with
CLOEXEC.  But if that is not the case, opening them with CLOEXEC is
a much better solution than going outside the program to "find" them
in non-portable ways.
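
As an aside, for descriptors that are already open (such as existing pack
windows), the close-on-exec flag can also be set after the fact with plain
POSIX fcntl(2).  A generic sketch, not code taken from git:

#include <fcntl.h>

/* Mark an already-open descriptor so it is closed across execve(2). */
static int set_cloexec(int fd)
{
        int flags = fcntl(fd, F_GETFD);

        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}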

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-25 11:07 ` [PATCH v6 12/13] convert: add filter.<driver>.process option larsxschneider
@ 2016-08-29 22:21   ` Junio C Hamano
  2016-08-30 16:27     ` Lars Schneider
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-29 22:21 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

larsxschneider@gmail.com writes:

> +In case the filter cannot or does not want to process the content,
> +it is expected to respond with an "error" status. Depending on the
> +`filter.<driver>.required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.
> +------------------------
> +packet:          git< status=error\n
> +packet:          git< 0000
> +------------------------
> +
> +In case the filter cannot or does not want to process the content
> +as well as any future content for the lifetime of the Git process,
> +it is expected to respond with an "error-all" status. Depending on
> +the `filter.<driver>.required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.
> +------------------------
> +packet:          git< status=error-all\n
> +packet:          git< 0000
> +------------------------

This part of the document is well-written to help filter-writers.

One thing that was unclear from the above to me, when read as a
potential filter-writer, is when I am supposed to exit(2).  After I
tell Git with error-all (I would have called it "abort", but that's
OK) that I desire no further communication, am I free to go?  Or do
I wait until Git somehow disconnects (perhaps by closing the packet
stream I have been reading)?

> +If the filter dies during the communication or does not adhere to
> +the protocol then Git will stop the filter process and restart it
> +with the next file that needs to be processed.

Hmph, is there a reason not to retry a half-converted-and-failed
blob with the fresh process?  Note that this is not "you must do it
that way", and it is not even "I think doing so may be a better
idea".  I merely want to know the reason behind this decision.

> +After the filter has processed a blob it is expected to wait for
> +the next "key=value" list containing a command. When the Git process
> +terminates, it will send a kill signal to the filter in that stage.

The "kill" may not be very nice.  As Git side _knows_ that the
filter is waiting for the next command, having an explicit
"shutdown" command would give the filter a chance to implement a
clean exit--it may have some housekeeping tasks it wants to perform
once it is done.  The "explicit shutdown" could just be "the pipe
gets closed", so from the implementation point of view there may not
be anything you need to further add to this patch (after all, when
we exit, the pipes to them would be closed), but the shutdown
protocol and the expectation on the behaviour of filter processes
would need to be documented.

> +If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
> +is configured then these commands always take precedence over
> +a configured `filter.<driver>.process` command.

It may make more sense to give precedence to the .process (which is
a late-comer) if defined, ignoring .clean and .smudge, than the
other way around.

> +Please note that you cannot use an existing `filter.<driver>.clean`
> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
> +because the former two use a different inter process communication
> +protocol than the latter one.

Would it be a useful sample program we can ship in contrib/ if you
created a "filter adapter" that reads these two configuration
variables and act as a filter.<driver>.process?

During an imaginary session of "git add .", I think I found where
you start THE filter process upon the first path that needs to be
filtered with one for the configured <driver>, and I think the same
place is where you reuse THE filter process, but I am not sure where
you are cleaning up by killing the filter once all paths are added.
Wouldn't you need some hooks at strategic places after such bulk
operation to tell the multi-file-filter machinery to walk all the
entries in cmd_process_map and tell the remaining filter processes
that they have no more tasks, or something?  Are you relying on
these processes to exit upon a read failure after we exit and the
pipe going to the filter is severed?

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-29 17:46   ` Junio C Hamano
@ 2016-08-30 11:41     ` Lars Schneider
  2016-08-30 16:37       ` Jeff King
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-30 11:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git Mailing List, Jeff King, sbeller, Johannes.Schindelin, jnareb,
	mlbright


> On 29 Aug 2016, at 19:46, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
>> index 7b45136..34c8eb9 100755
>> --- a/t/t0021-conversion.sh
>> +++ b/t/t0021-conversion.sh
>> @@ -4,6 +4,15 @@ test_description='blob conversion via gitattributes'
>> 
>> . ./test-lib.sh
>> 
>> +if test_have_prereq EXPENSIVE
>> +then
>> +	T0021_LARGE_FILE_SIZE=2048
>> +	T0021_LARGISH_FILE_SIZE=100
>> +else
>> +	T0021_LARGE_FILE_SIZE=30
>> +	T0021_LARGISH_FILE_SIZE=2
>> +fi
> 
> Minor: do we need T0021_ prefix?  What are you trying to avoid
> collisions with?

Not necessary. I'll remove the prefix.


>> +	git checkout -- test test.t test.i &&
>> +
>> +	mkdir generated-test-data &&
>> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
>> +	do
>> +		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
>> +		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
> 
> In earlier iteration of loop with lower $i, what guarantees that
> some bytes survive "tr -dc"?

Nothing really, good catch! The seed "end" always produces an "S" as its first
character, which would survive "tr -dc". However, that is clunky. I will always set "1"
as the first character of $RANDOM_STRING to mitigate the problem.

> 
>> +		# Generate 1MB of empty data and 100 bytes of random characters
> 
> 100 bytes?  It seems to me that you are giving 1MB and then $i-byte
> or less (which sometimes can be zero) of random string.

Outdated comment. Will fix!

> 
>> +		# printf "$(test-genrandom start $i)"
>> +		printf "%1048576d" 1 >>generated-test-data/large.file &&
>> +		printf "$RANDOM_STRING" >>generated-test-data/large.file &&
>> +		printf "%1048576d" 1 >>generated-test-data/large.file.rot13 &&
>> +		printf "$ROT_RANDOM_STRING" >>generated-test-data/large.file.rot13 &&
>> +
>> +		if test $i = $T0021_LARGISH_FILE_SIZE
>> +		then
>> +			cat generated-test-data/large.file >generated-test-data/largish.file &&
>> +			cat generated-test-data/large.file.rot13 >generated-test-data/largish.file.rot13
>> +		fi
>> +	done
> 
> This "now we are done with the loop, so copy them to the second
> pair" needs to be in the loop?  Shouldn't it come after 'done'?

No, it does not need to be in the loop. I think I could do this
after the loop instead:

head -c $((1048576*$T0021_LARGISH_FILE_SIZE)) generated-test-data/large.file >generated-test-data/largish.file


> I do not quite get the point of this complexity.  You are using
> exactly the same seed "end" every time, so in the first round you
> have 1M of SP, letter '1', letter 'S' (from the genrandom), then
> in the second round you have 1M of SP, letter '1', letter 'S' and
> letter 'p' (the last two from the genrandom), and go on.  Is it
> significant for the purpose of your test that the cruft inserted
> between the repetition of 1M of SP gets longer by one byte but they
> all share the same prefix (e.g. "1S", "1Sp", "1SpZ", "1SpZT",
> ... are what you insert between a large run of spaces)?

The pktline packets have a constant size. If the cruft between 1M of SP 
has a constant size as well then the generated packets for the test data
would repeat themselves. That's why I increased the length after every 1M
of SP.

However, I realized that this test complexity is not necessary. I'll
simplify it in the next round.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-29 17:52       ` Junio C Hamano
@ 2016-08-30 11:47         ` Lars Schneider
  2016-08-30 16:55           ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-30 11:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Stefan Beller, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright


> On 29 Aug 2016, at 19:52, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> On 25 Aug 2016, at 21:17, Stefan Beller <sbeller@google.com> wrote:
>>> 
>>>> On Thu, Aug 25, 2016 at 4:07 AM,  <larsxschneider@gmail.com> wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>>>> Generate more interesting large test files
>>> 
>>> How are the large test files more interesting?
>>> (interesting in the notion of covering more potential bugs?
>>> easier to debug? better to maintain, or just a pleasant read?)
>> 
>> The old large test file was 1MB of zeros and 1 byte with a one, repeated 2048 times.
>> 
>> Since the filter uses 64k packets we would test a large number of identical-looking packets.
>> 
>> That's why I thought the pseudo-random content is more interesting.
> 
> I guess my real question is why it is not just a single invocation
> of test-genrandom that gives you the whole test file; if you are
> using 20MB, the simplest would be to grab 20MB out of test-genrandom.
> With that hopefully you won't see a large number of identical-looking
> packets, no?

True, but applying rot13 (via tr ...) on 20+ MB takes quite a bit of
time. That's why I came up with the 1M SP in between.

However, I realized that testing a large amount of data is not really
necessary for the final series. A single packet is 64k. A 500k pseudo-random
test file should be sufficient. This will make the test way simpler.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-29 19:45       ` Junio C Hamano
@ 2016-08-30 12:32         ` Lars Schneider
  2016-08-30 14:54           ` Torsten Bögershausen
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-30 12:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 29 Aug 2016, at 21:45, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> I see. Thanks for the explanation.
> 
> I expect the updated log message to explain the issue correctly
> then.

Sure!


>>> The parent is
>>> very likely to have pack windows open into .pack files and they need
>>> to be closed on the child side after fork(2) starts the child
>>> process but before execve(2) runs the helper, if we want to avoid
>>> file descriptor leaks.
>> 
>> I think I understand what you are saying. However, during my tests
>> .pack file fd's were never a problem.
> 
> I do not expect that during the lifetime of your long-running helper
> anybody would try to unlink an existing packfile, so it is unlikely
> that the "cannot unlink an open file on Windows" issue would come up.  And
> the cross-platform problem I pointed out is a fd leak; leaks would
> not show up until you run out of the resource, just like you
> wouldn't notice a small memory leak here and there UNLESS you actively
> look for them.  I would be surprised if your "tests" found anything.
> 
>> How would I find the open .pack file fd's?  Should I go through
>> /proc/PID/fd? Why is this no problem for other longer running
>> commands such as the git-credential-cache--daemon or git-daemon?
> 
> Nobody said this is "no problem for others".  While discussing the
> patch that started this thread, we noticed that any open file
> descriptor the main process has when the long-running clean/smudge
> helper is spawned _is_ a problem.  Other helpers may share the same
> problem, and if they do, that is all the more reason that fixing the
> file descriptor leak is a good thing.
> 
> The primary thing I was wondering was if we should open the window
> into packfiles with CLOEXEC, just like we recently started opening
> the tempfiles and lockfiles with the flag.  The reason why I asked
> if the site that spawns (i.e. the run_command API) has easy access to
> the list of file descriptors that we opened ONLY for ourselves is
> because that would make it possible to consider an alternative
> approach: close them before execve(2) in the run_command API.  That
> would save us from having to sprinkle existing calls to open(2) with
> CLOEXEC.  But if that is not the case, opening them with CLOEXEC is
> a much better solution than going outside the program to "find" them
> in non-portable ways.

The pack files are opened before the filter process is forked. Therefore,
I think CLOEXEC makes sense for them. Should this change be part of this 
series? If yes, would it look like below? Should I adjust the function name?

Thanks,
Lars

diff --git a/sha1_file.c b/sha1_file.c
index d5e1121..759991e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1485,7 +1485,7 @@ int check_sha1_signature(const unsigned char *sha1, void *map,
 
 int git_open_noatime(const char *name)
 {
-	static int sha1_file_open_flag = O_NOATIME;
+	static int sha1_file_open_flag = O_NOATIME | O_CLOEXEC;
 
 	for (;;) {
 		int fd;
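
As a side note, a platform whose headers lack O_CLOEXEC would fail to compile
this.  One common portability guard (sketched here as an assumption, not taken
from git's compat layer) is to define the flag away so that it has no effect:

/* Hypothetical compile-time fallback if the platform lacks O_CLOEXEC. */
#ifndef O_CLOEXEC
#define O_CLOEXEC 0
#endif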




^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-30 12:32         ` Lars Schneider
@ 2016-08-30 14:54           ` Torsten Bögershausen
  2016-09-01 17:15             ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Torsten Bögershausen @ 2016-08-30 14:54 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, git, peff, sbeller, Johannes.Schindelin, jnareb,
	mlbright


> 
> diff --git a/sha1_file.c b/sha1_file.c
> index d5e1121..759991e 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1485,7 +1485,7 @@ int check_sha1_signature(const unsigned char *sha1, void *map,
>  
>  int git_open_noatime(const char *name)

Hm, should the function then be renamed to

git_open_noatime_cloexec()

>  {
> -	static int sha1_file_open_flag = O_NOATIME;
> +	static int sha1_file_open_flag = O_NOATIME | O_CLOEXEC;
>  
>  	for (;;) {
>  		int fd;
> 
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-29 22:21   ` Junio C Hamano
@ 2016-08-30 16:27     ` Lars Schneider
  2016-08-30 18:59       ` Junio C Hamano
  0 siblings, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-30 16:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


> On 30 Aug 2016, at 00:21, Junio C Hamano <gitster@pobox.com> wrote:
> 
> larsxschneider@gmail.com writes:
> 
>> +In case the filter cannot or does not want to process the content,
>> +it is expected to respond with an "error" status. Depending on the
>> +`filter.<driver>.required` flag Git will interpret that as error
>> +but it will not stop or restart the filter process.
>> +------------------------
>> +packet:          git< status=error\n
>> +packet:          git< 0000
>> +------------------------
>> +
>> +In case the filter cannot or does not want to process the content
>> +as well as any future content for the lifetime of the Git process,
>> +it is expected to respond with an "error-all" status. Depending on
>> +the `filter.<driver>.required` flag Git will interpret that as error
>> +but it will not stop or restart the filter process.
>> +------------------------
>> +packet:          git< status=error-all\n
>> +packet:          git< 0000
>> +------------------------
> 
> This part of the document is well-written to help filter-writers.

Thanks!


> One thing that was unclear from the above to me, when read as a
> potential filter-writer, is when I am supposed to exit(2).  After I
> tell Git with error-all (I would have called it "abort", but that's
> OK) that I desire no further communication, am I free to go?  Or do
> I wait until Git somehow disconnects (perhaps by closing the packet
> stream I have been reading)?

The filter can exit right after the "error-all". If the filter does
not exit then Git will kill the filter. I'll add this to the docs.

"abort" could be ambiguous because it could be read as "abort only
this file". "abort-all" would work, though. Would you prefer to see
"error" replaced by "abort" and "error-all" by "abort-all"?


>> +If the filter dies during the communication or does not adhere to
>> +the protocol then Git will stop the filter process and restart it
>> +with the next file that needs to be processed.
> 
> Hmph, is there a reason not to retry a half-converted-and-failed
> blob with the fresh process?  Note that this is not "you must do it
> that way", and it is not even "I think doing so may be a better
> idea".  I merely want to know the reason behind this decision.

A filter that dies during communication or does not adhere to the protocol
is a faulty filter. Feeding the faulty filter after restart with the same 
blob would likely cause the same error. 

There are legitimate reasons for retries. E.g. if the filter communicates
with the network. In these cases I expect the filter to handle the retry
logic. Git just writes to and reads from pipes. I don't expect frequent
problems in that area. Plus the existing filter mechanism has no retry
either.

Later on we could easily add a "retry" capability if we deem it necessary,
though.


>> +After the filter has processed a blob it is expected to wait for
>> +the next "key=value" list containing a command. When the Git process
>> +terminates, it will send a kill signal to the filter in that stage.
> 
> The "kill" may not be very nice.  As Git side _knows_ that the
> filter is waiting for the next command, having an explicit
> "shutdown" command would give the filter a chance to implement a
> clean exit--it may have some housekeeping tasks it wants to perform
> once it is done.  The "explicit shutdown" could just be "the pipe
> gets closed", so from the implementation point of view there may not
> be anything you need to further add to this patch (after all, when
> we exit, the pipes to them would be closed), but the shutdown
> protocol and the expectation on the behaviour of filter processes
> would need to be documented.

I implemented a shutdown command in v4 [1][2] but dropped it in v5 after
a discussion with Peff [3].

[1] http://public-inbox.org/git/20160803164225.46355-8-larsxschneider@gmail.com/
[2] http://public-inbox.org/git/20160803164225.46355-13-larsxschneider@gmail.com/
[3] http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/

My main reasons to drop it:

A) There is no central place in Git that could execute code *after*
   all filter operations are complete and *before* Git exits. Therefore,
   I had to add a "clean_on_exit_handler()" to "run-command" [1]. This
   change made this series even larger and therefore harder to review.

B) If we communicate "shutdown" to the filter then we need to give the
   filter some time to perform the exit before the filter is killed on
   Git exit. I wasn't able to come up with a good answer how long Git 
   should wait for the exit.

Do you think I should resurrect the "shutdown" patch?


>> +If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
>> +is configured then these commands always take precedence over
>> +a configured `filter.<driver>.process` command.
> 
> It may make more sense to give precedence to the .process (which is
> a late-comer) if defined, ignoring .clean and .smudge, than the
> other way around.

I agree.


>> +Please note that you cannot use an existing `filter.<driver>.clean`
>> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
>> +because the former two use a different inter process communication
>> +protocol than the latter one.
> 
> Would it be a useful sample program we can ship in contrib/ if you
> created a "filter adapter" that reads these two configuration
> variables and act as a filter.<driver>.process?

You mean a v2 filter that would use v1 filters under the hood?
If we would drop v1, then this would be useful. Otherwise I don't
see any need for such a thing.


> During an imaginary session of "git add .", I think I found where
> you start THE filter process upon the first path that needs to be
> filtered with one for the configured <driver>, and I think the same
> place is where you reuse THE filter process, but I am not sure where
> you are cleaning up by killing the filter once all paths are added.

I don't clean them up. The filter gets killed when Git exits.


> Wouldn't you need some hooks at strategic places after such bulk
> operation to tell the multi-file-filter machinery to walk all the
> entries in cmd_process_map and tell the remaining filter processes
> that they have no more tasks, or something?

That would be nice. I tried to find these strategic places but it
turned out to be harder than I thought because Git can exit in a lot
of places.


>  Are you relying on
> these processes to exit upon a read failure after we exit and the
> pipe going to the filter is severed?

Not really. The filter process is killed by the run-command machinery
with the atexit child_cleanup_handler.
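
Roughly, such an atexit()-based cleanup boils down to the sketch below; the
child list and the function names are invented for illustration and are not
git's actual run-command internals:

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

static pid_t children_to_clean[32];
static int nr_children_to_clean;

/* Registered once via atexit(); signals every tracked child on exit. */
static void cleanup_children_on_exit(void)
{
        int i;

        for (i = 0; i < nr_children_to_clean; i++)
                kill(children_to_clean[i], SIGTERM);
}

/* Called after spawning a long-running filter to track it for cleanup. */
static void mark_child_for_cleanup(pid_t pid)
{
        static int registered;

        if (!registered) {
                atexit(cleanup_children_on_exit);
                registered = 1;
        }
        if (nr_children_to_clean < 32)
                children_to_clean[nr_children_to_clean++] = pid;
}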

Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-30 11:41     ` Lars Schneider
@ 2016-08-30 16:37       ` Jeff King
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff King @ 2016-08-30 16:37 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Git Mailing List, sbeller, Johannes.Schindelin,
	jnareb, mlbright

On Tue, Aug 30, 2016 at 01:41:59PM +0200, Lars Schneider wrote:

> >> +	git checkout -- test test.t test.i &&
> >> +
> >> +	mkdir generated-test-data &&
> >> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
> >> +	do
> >> +		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
> >> +		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
> > 
> > In earlier iteration of loop with lower $i, what guarantees that
> > some bytes survive "tr -dc"?
> 
> Nothing really, good catch! The seed "end" always produces an "S" as its first
> character, which would survive "tr -dc". However, that is clunky. I will always set "1"
> as the first character of $RANDOM_STRING to mitigate the problem.

It seems odd that you would generate a larger set of random bytes and
then throw most of them away (about 1 in 5, on average). So you don't
actually know how long your inputs are, and you're wasting time
generating bytes which are discarded.

The goal looks like it is just to clean up the string to only-ASCII
characters. Perhaps converting to base64 or hex would be conceptually
simpler? Like:

  test-genrandom ... |
  perl -pe 's/./hex(ord($&))/sge'

> > I do not quite get the point of this complexity.  You are using
> > exactly the same seed "end" every time, so in the first round you
> > have 1M of SP, letter '1', letter 'S' (from the genrandom), then
> > in the second round you have 1M of SP, letter '1', letter 'S' and
> > letter 'p' (the last two from the genrandom), and go on.  Is it
> > significant for the purpose of your test that the cruft inserted
> > between the repetition of 1M of SP gets longer by one byte but they
> > all share the same prefix (e.g. "1S", "1Sp", "1SpZ", "1SpZT",
> > ... are what you insert between a large run of spaces)?
> 
> The pktline packets have a constant size. If the cruft between 1M of SP 
> has a constant size as well then the generated packets for the test data
> would repeat themselves. That's why I increased the length after every 1M
> of SP.
> 
> However, I realized that this test complexity is not necessary. I'll
> simplify it in the next round.

I was also confused by this, and wondered if other patterns (perhaps
using a single longer genrandom) might be applicable. Simplification (or
explanation in comments about what properties the content _needs_ to
have) would be welcome. :)

-Peff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 10/13] convert: generate large test files only once
  2016-08-30 11:47         ` Lars Schneider
@ 2016-08-30 16:55           ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-30 16:55 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Stefan Beller, git@vger.kernel.org, Jeff King,
	Johannes Schindelin, Jakub Narębski, Martin-Louis Bright

Lars Schneider <larsxschneider@gmail.com> writes:

> True, but applying rot13 (via tr ...) on 20+ MB takes quite a bit of
> time. That's why I came up with the 1M SP in between.

Ah, OK, that makes sense (it may not necessarily make it a good
decision, but it certainly explains it).

> However, I realized that testing a large amount of data is not really
> necessary for the final series. A single packet is 64k. A 500k pseudo random
> test file should be sufficient. This will make the test way simpler.

;-)  Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 16:27     ` Lars Schneider
@ 2016-08-30 18:59       ` Junio C Hamano
  2016-08-30 20:38         ` Lars Schneider
  2016-08-30 20:46         ` Jakub Narębski
  0 siblings, 2 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-08-30 18:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

>> This part of the document is well-written to help filter-writers.
>
> Thanks!

Don't thank me; thank yourself and reviewers of the previous rounds.


> The filter can exit right after the "error-all". If the filter does
> not exit then Git will kill the filter. I'll add this to the docs.

OK.

> "abort" could be ambiguous because it could be read as "abort only
> this file". "abort-all" would work, though. Would you prefer to see
> "error" replaced by "abort" and "error-all" by "abort-all"?

No.

I was primarily reacting to "-all" part, so anything that ends with
"-all" is equally ugly from my point of view and not an improvement.

As I said, "error-all" as long as other reviewers are happy with is
OK by me, too.

> A filter that dies during communication or does not adhere to the protocol
> is a faulty filter. Feeding the faulty filter after restart with the same 
> blob would likely cause the same error. 

Why does Git restart it and continue with the rest of the blobs
then?  Is there a reason to believe it may produce correct results
for other blobs if you run it again?

> B) If we communicate "shutdown" to the filter then we need to give the
>    filter some time to perform the exit before the filter is killed on
>    Git exit. I wasn't able to come up with a good answer how long Git 
>    should wait for the exit.

Would that be an issue?  A filter is buggy if, when told to shut down,
it ignores the request and hangs around forever.  I do not think we
even need to kill it.  It is not Git's problem.

I personally would be reluctant to become a filter process writer if
the system it will be talking to can kill it without giving it a
chance to do a graceful exit, but perhaps that is just me.  I don't
know if closing the pipe going there where you are having Git to
kill the process on the other end is any more involved than what you
have without extra patches.

>>> +Please note that you cannot use an existing `filter.<driver>.clean`
>>> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
>>> +because the former two use a different inter process communication
>>> +protocol than the latter one.
>> 
>> Would it be a useful sample program we can ship in contrib/ if you
>> created a "filter adapter" that reads these two configuration
>> variables and act as a filter.<driver>.process?
>
> You mean a v2 filter that would use v1 filters under the hood?
> If we would drop v1, then this would be useful. Otherwise I don't
> see any need for such a thing.

I meant it as primarily an example people can learn from when they
want to write their own.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 18:59       ` Junio C Hamano
@ 2016-08-30 20:38         ` Lars Schneider
  2016-08-30 22:23           ` Junio C Hamano
  2016-08-30 20:46         ` Jakub Narębski
  1 sibling, 1 reply; 66+ messages in thread
From: Lars Schneider @ 2016-08-30 20:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright


On 30 Aug 2016, at 20:59, Junio C Hamano <gitster@pobox.com> wrote:
> 
>> "abort" could be ambiguous because it could be read as "abort only
>> this file". "abort-all" would work, though. Would you prefer to see
>> "error" replaced by "abort" and "error-all" by "abort-all"?
> 
> No.
> 
> I was primarily reacting to "-all" part, so anything that ends with
> "-all" is equally ugly from my point of view and not an improvement.
> 
> As I said, "error-all" as long as other reviewers are happy with is
> OK by me, too.

Now, I see your point. How about "error-and-stop" or "error-stop"
instead of "error-all"?


>> A filter that dies during communication or does not adhere to the protocol
>> is a faulty filter. Feeding the faulty filter after restart with the same 
>> blob would likely cause the same error. 
> 
> Why does Git restart it and continue with the rest of the blobs
> then?  Is there a reason to believe it may produce correct results
> for other blobs if you run it again?

Good argument. However, my intention here was to mimic the v1 filter
mechanism. If a filter fails then Git will run the same v1 filter on the
next file that needs to be filtered. If `filter.<driver>.required=false`
then a failure is even a legitimate result.


>> B) If we communicate "shutdown" to the filter then we need to give the
>>   filter some time to perform the exit before the filter is killed on
>>   Git exit. I wasn't able to come up with a good answer how long Git 
>>   should wait for the exit.
> 
> Would that be an issue?  A filter is buggy if, when told to shut down,
> it ignores the request and hangs around forever.  I do not think we
> even need to kill it.  It is not Git's problem.
> I personally would be reluctant to become a filter process writer if
> the system it will be talking to can kill it without giving it a
> chance to do a graceful exit, but perhaps that is just me.  I don't
> know if closing the pipe going there where you are having Git to
> kill the process on the other end is any more involved than what you
> have without extra patches.

I am not sure I understand your last sentence. Just to be clear:
Would you prefer it if Git just closed the pipe to the filter process
on Git exit and left the filter running? I think I wouldn't even need
to close the pipe explicitly. The pipe would be closed automatically when
git exits or dies, right? The filter could detect EOF and exit on its own.
That would be OK with me.


>>>> +Please note that you cannot use an existing `filter.<driver>.clean`
>>>> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
>>>> +because the former two use a different inter process communication
>>>> +protocol than the latter one.
>>> 
>>> Would it be a useful sample program we can ship in contrib/ if you
>>> created a "filter adapter" that reads these two configuration
>>> variables and act as a filter.<driver>.process?
>> 
>> You mean a v2 filter that would use v1 filters under the hood?
>> If we would drop v1, then this would be useful. Otherwise I don't
>> see any need for such a thing.
> 
> I meant it as primarily an example people can learn from when they
> want to write their own.

I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
already.


Thanks,
Lars

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 18:59       ` Junio C Hamano
  2016-08-30 20:38         ` Lars Schneider
@ 2016-08-30 20:46         ` Jakub Narębski
  2016-09-05 19:47           ` Lars Schneider
  1 sibling, 1 reply; 66+ messages in thread
From: Jakub Narębski @ 2016-08-30 20:46 UTC (permalink / raw)
  To: Junio C Hamano, Lars Schneider
  Cc: git, peff, sbeller, Johannes.Schindelin, mlbright

Hello,

W dniu 30.08.2016 o 20:59, Junio C Hamano pisze:
> Lars Schneider <larsxschneider@gmail.com> writes:
[...]

>> The filter can exit right after the "error-all". If the filter does
>> not exit then Git will kill the filter. I'll add this to the docs.
> 
> OK.

Is it explicit kill, or implicit kill: close pipe and end process?

>> "abort" could be ambiguous because it could be read as "abort only
>> this file". "abort-all" would work, though. Would you prefer to see
>> "error" replaced by "abort" and "error-all" by "abort-all"?
> 
> No.
> 
> I was primarily reacting to "-all" part, so anything that ends with
> "-all" is equally ugly from my point of view and not an improvement.
> 
> As I said, "error-all" as long as other reviewers are happy with is
> OK by me, too.

I'm rather partial to "abort" instead of "error-all", or "quit"/"exit"
(but I prefer "abort" or "bail-out"), as it better reflects what this
response is about - ending the filter process.
 
>> A filter that dies during communication or does not adhere to the protocol
>> is a faulty filter. Feeding the faulty filter after restart with the same 
>> blob would likely cause the same error. 
> 
> Why does Git restart it and continue with the rest of the blobs
> then?  Is there a reason to believe it may produce correct results
> for other blobs if you run it again?

I guess the idea is that a single filter can be run on different types
of blobs, and it could fail on some types (some files) and not others.
Like an LFS-type filter failing on files larger than 2 GiB,
or an iconv-like filter converting UTF-16 to UTF-8 failing on invalid
byte sequences.
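
When such a filter rejects a blob, it answers with the "status=error" response
quoted earlier in the thread.  Framed as pkt-lines, that could look roughly
like the following hand-rolled sketch (illustration only, not git's pkt-line
API):

#include <stdio.h>

/* Emit "status=<value>\n" as one pkt-line followed by a flush packet. */
static void send_status(FILE *out, const char *status)
{
        char payload[64];
        int n = snprintf(payload, sizeof(payload), "status=%s\n", status);

        fprintf(out, "%04x", n + 4);    /* 4-byte hex length header */
        fwrite(payload, 1, n, out);
        fputs("0000", out);             /* flush packet */
        fflush(out);
}

/* e.g. send_status(stdout, "error") for a single rejected blob */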

>> B) If we communicate "shutdown" to the filter then we need to give the
>>    filter some time to perform the exit before the filter is killed on
>>    Git exit. I wasn't able to come up with a good answer how long Git 
>>    should wait for the exit.
> 
> Would that be an issue?  A filter is buggy if, when told to shut down,
> it ignores the request and hangs around forever.  I do not think we
> even need to kill it.  It is not Git's problem.

I think the idea was for Git to request shutdown, and for the filter to tell
when it has finished (or just exit, with Git getting SIGCHLD, I guess).
It is hard to tell how long to wait, for example for the filter process
to send a "sync" command, or finish unmounting, or something...

> I personally would be reluctant to become a filter process writer if
> the system it will be talking to can kill it without giving it a
> chance to do a graceful exit, but perhaps that is just me.  I don't
> know if closing the pipe going there where you are having Git to
> kill the process on the other end is any more involved than what you
> have without extra patches.

Isn't it the same problem with v1 filters being killed on Git process
exit?  Unless a v2 filter wants to do something _after_ writing output
to Git, it should be safe... unless the Git process is killed, but that
is no different from the filter being stray-killed.

>>>> +Please note that you cannot use an existing `filter.<driver>.clean`
>>>> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
>>>> +because the former two use a different inter process communication
>>>> +protocol than the latter one.
>>>
>>> Would it be a useful sample program we can ship in contrib/ if you
>>> created a "filter adapter" that reads these two configuration
>>> variables and act as a filter.<driver>.process?
>>
>> You mean a v2 filter that would use v1 filters under the hood?
>> If we would drop v1, then this would be useful. Otherwise I don't
>> see any need for such a thing.
> 
> I meant it as primarily an example people can learn from when they
> want to write their own.

Errr... if it were wrapping just any v1 filter, it would be useless (with
performance as bad as or worse than an ordinary v1 clean or smudge filter).  It
might make sense if the v1 filter and the v2 wrapper were written in the
same [dynamic] language, and the wrapper could just load / require the
filter as a function / procedure, [compile it], and use it.
For example a smudge/clean filter in Perl (e.g. rot13), and the v2 wrapper
in Perl too.

Best,
-- 
Jakub Narębski 
 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 20:38         ` Lars Schneider
@ 2016-08-30 22:23           ` Junio C Hamano
  2016-08-31  4:57             ` Torsten Bögershausen
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2016-08-30 22:23 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, peff, sbeller, Johannes.Schindelin, jnareb, mlbright

Lars Schneider <larsxschneider@gmail.com> writes:

> On 30 Aug 2016, at 20:59, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>>> "abort" could be ambiguous because it could be read as "abort only
>>> this file". "abort-all" would work, though. Would you prefer to see
>>> "error" replaced by "abort" and "error-all" by "abort-all"?
>> 
>> No.
>> 
>> I was primarily reacting to "-all" part, so anything that ends with
>> "-all" is equally ugly from my point of view and not an improvement.
>> 
>> As I said, "error-all" as long as other reviewers are happy with is
>> OK by me, too.
>
> Now, I see your point. How about "error-and-stop" or "error-stop"
> instead of "error-all"?

Not really.  I was primarily reacting to having "error" and
"error-all", that share the same prefix.  Changing "-all" suffix to
"-stop" does not make all that difference to me.  But in any case,
as long as other reviewers are happy with it, it is OK by me, too.

> Good argument. However, my intention here was to mimic the v1 filter
> mechanism.

I am not making any argument and in no way against the chosen
behaviour.  That "intention here" that was revealed after two
iterations is something we would want to see as the reasoning
behind the choice of the final behaviour recorded somewhere,
and now that I have drawn it out of you, I have achieved what I set out
to do initially ;-)

> I am not sure I understand your last sentence. Just to be clear:
> Would you prefer it, if Git would just close the pipe to the filter process
> on Git exit and leave the filter running?

As a potential filter writer, I would probably feel it the easiest
to handle if there is no signal involved.  My job would be to read
from the command pipe, react to it, and when the command pipe gives
me EOF, I am expected to exit myself.  That would be simple enough.
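
In code, that filter-side contract is little more than a read loop that treats
EOF on the command pipe as the cue to clean up and leave.  A bare-bones sketch
with hand-rolled pkt-line reading (it skips the version/capability handshake
and is not git's pkt-line API):

#include <stdio.h>
#include <stdlib.h>

/* Read one pkt-line payload into buf; returns its length, 0 for a
 * flush packet, or -1 on EOF or error. */
static int read_pkt_line(char *buf, size_t size)
{
        char hdr[5];
        unsigned long len;

        if (fread(hdr, 1, 4, stdin) != 4)
                return -1;              /* git closed the command pipe */
        hdr[4] = '\0';
        len = strtoul(hdr, NULL, 16);
        if (len == 0)
                return 0;               /* flush packet */
        if (len < 4 || len - 4 >= size)
                return -1;
        len -= 4;
        if (fread(buf, 1, len, stdin) != len)
                return -1;
        buf[len] = '\0';
        return (int)len;
}

int main(void)
{
        char buf[65520];
        int n;

        while ((n = read_pkt_line(buf, sizeof(buf))) >= 0) {
                /* ... parse "command=..." lines and filter blob data ... */
        }
        /* EOF from git: do any housekeeping, then exit on our own. */
        return 0;
}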

>> I meant it as primarily an example people can learn from when they
>> want to write their own.
>
> I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
> already.

I would expect them to peek at contrib/, but do you seriously expect
people to comb through t/ directory?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 22:23           ` Junio C Hamano
@ 2016-08-31  4:57             ` Torsten Bögershausen
  2016-08-31 13:14               ` Jakub Narębski
  0 siblings, 1 reply; 66+ messages in thread
From: Torsten Bögershausen @ 2016-08-31  4:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, peff, sbeller, Johannes.Schindelin, jnareb,
	mlbright

On Tue, Aug 30, 2016 at 03:23:10PM -0700, Junio C Hamano wrote:
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
> > On 30 Aug 2016, at 20:59, Junio C Hamano <gitster@pobox.com> wrote:
> >> 
> >>> "abort" could be ambiguous because it could be read as "abort only
> >>> this file". "abort-all" would work, though. Would you prefer to see
> >>> "error" replaced by "abort" and "error-all" by "abort-all"?
> >> 
> >> No.
> >> 
> >> I was primarily reacting to "-all" part, so anything that ends with
> >> "-all" is equally ugly from my point of view and not an improvement.
> >> 
> >> As I said, "error-all" as long as other reviewers are happy with is
> >> OK by me, too.
> >
> > Now, I see your point. How about "error-and-stop" or "error-stop"
> > instead of "error-all"?
> 
> Not really.  I was primarily reacting to having "error" and
> "error-all", that share the same prefix.  Changing "-all" suffix to
> "-stop" does not make all that difference to me.  But in any case,
> as long as other reviewers are happy with it, it is OK by me, too.
> 
> > Good argument. However, my intention here was to mimic the v1 filter
> > mechanism.
> 
> I am not making any argument and in no way against the chosen
> behaviour.  That "intention here" that was revealed after two
> iterations is something we would want to see as the reasoning
> behind the choice of the final behaviour recorded somewhere,
> and now I drew it out of you, I achieved what I set out to do
> initially ;-)
> 
> > I am not sure I understand your last sentence. Just to be clear:
> > Would you prefer it if Git just closed the pipe to the filter process
> > on Git exit and left the filter running?
> 
> As a potential filter writer, I would probably feel it the easiest
> to handle if there is no signal involved.  My job would be to read
> from the command pipe, react to it, and when the command pipe gives
> me EOF, I am expected to exit myself.  That would be simple enough.
> 
> >> I meant it as primarily an example people can learn from when they
> >> want to write their own.
> >
> > I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
> > already.
> 
> I would expect them to peek at contrib/, but do you seriously expect
> people to comb through the t/ directory?

How about a filter written in C, placed in ./t/helper/?
At least I feel that a filter written in C could be a starting point
for others who prefer, depending on the task the filter is going to do,
to use shell scripts, Perl, Python, or any other high-level language.

A test case where data cannot be filtered would be a minimum.
As Jakub pointed out, you can use iconv with good and bad data.
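
To make that concrete, here is a rough sketch (not code from this
series; error handling is deliberately minimal) of the pkt-line
framing such a C helper would have to read: a 4-character hex length
that includes the four length bytes themselves, with "0000" being a
flush packet:

#include <stdio.h>
#include <stdlib.h>

/* Returns the payload length, 0 for a flush packet, -1 on EOF/error. */
static int read_pkt_line(FILE *in, char *buf, size_t bufsize)
{
        char hex[5] = { 0 };
        long len;

        if (fread(hex, 1, 4, in) != 4)
                return -1;              /* EOF: the filter should exit */
        len = strtol(hex, NULL, 16);
        if (len == 0)
                return 0;               /* flush packet */
        if (len < 4 || (size_t)(len - 4) > bufsize)
                return -1;
        len -= 4;
        if (fread(buf, 1, (size_t)len, in) != (size_t)len)
                return -1;
        return (int)len;
}

int main(void)
{
        char buf[65516];
        int len;

        while ((len = read_pkt_line(stdin, buf, sizeof(buf))) >= 0) {
                if (len == 0)
                        fprintf(stderr, "flush\n");
                else
                        fprintf(stderr, "packet: %d bytes\n", len);
        }
        return 0;
}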


* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-31  4:57             ` Torsten Bögershausen
@ 2016-08-31 13:14               ` Jakub Narębski
  0 siblings, 0 replies; 66+ messages in thread
From: Jakub Narębski @ 2016-08-31 13:14 UTC (permalink / raw)
  To: Torsten Bögershausen, Junio C Hamano
  Cc: Lars Schneider, git, Jeff King, Stefan Beller,
	Johannes Schindelin, Martin-Louis Bright

Hi,

On 31.08.2016 06:57, Torsten Bögershausen wrote:
> On Tue, Aug 30, 2016 at 03:23:10PM -0700, Junio C Hamano wrote:
>> Lars Schneider <larsxschneider@gmail.com> writes:
>>> On 30 Aug 2016, at 20:59, Junio C Hamano <gitster@pobox.com> wrote:

Torsten, could you please in the future remove irrelevant parts of the
cited email you are responding to?  Thanks in advance,

[...]
>>>> I meant it as primarily an example people can learn from when they
>>>> want to write their own.
>>>
>>> I think `t/t0021/rot13-filter.pl` (part of this patch) serves this purpose 
>>> already.
>>
>> I would expect them to peek at contrib/, but do you seriously expect
>> people to comb through the t/ directory?
> 
> How about a filter written in C, placed in ./t/helper/?
>
> At least I feel that a filter written in C could be a starting point
> for others who prefer, depending on the task the filter is going to do,
> to use shell scripts, Perl, Python, or any other high-level language.

People do not look into the t/helper/ area, but they do look into
contrib/.  Git provides examples of how to use its features there
(among other things).  Examples of how to use the fast-import
feature / protocol are there, in contrib/fast-import.

I think that C would not be a good choice, as the required low-level
wrangling of the protocol would obscure the relevant parts... well,
perhaps unless pkt-line was reused.  A high-level language would be
better.  The contrib/fast-import directory includes examples in shell
script, Perl, and Python.  The same could be done here (possibly with
more languages).
 
> A test case where data cannot be filtered would be a minimum.
> As Jakub pointed out, you can use iconv with good and bad data.

iconv might be a good example, but I think it is not a practical
one; are there development environments that do not handle UTF-8?

More practical would be, for example, an LFS-ish solution that stores
files as-is in a specified directory.  ROT13 is also a good, if not
very practical, example.  Or rezipping all ZIP-compression-based
files (ODF, JAR/WAR, DOCX / OOXML, CBZ, EPUB, APK, etc.).

-- 
Jakub Narębski



* Re: [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes
  2016-08-30 14:54           ` Torsten Bögershausen
@ 2016-09-01 17:15             ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2016-09-01 17:15 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, git, peff, sbeller, Johannes.Schindelin, jnareb,
	mlbright

Torsten Bögershausen <tboegi@web.de> writes:

>> 
>> diff --git a/sha1_file.c b/sha1_file.c
>> index d5e1121..759991e 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1485,7 +1485,7 @@ int check_sha1_signature(const unsigned char *sha1, void *map,
>>  
>>  int git_open_noatime(const char *name)
>
> Hm, should the function then be renamed into
>
> git_open_noatime_cloexec()
>
>>  {
>> -	static int sha1_file_open_flag = O_NOATIME;
>> +	static int sha1_file_open_flag = O_NOATIME | O_CLOEXEC;

Perhaps.

In any case, this is probably something that can and should be done
outside this series.
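
For context, here is a simplified sketch of the open-with-fallback
pattern the hunk above touches; the retry logic is paraphrased from
memory rather than quoted from the tree, so treat it as an
illustration, not the actual sha1_file.c code:

#define _GNU_SOURCE             /* for O_NOATIME on Linux */
#include <fcntl.h>
#include <errno.h>

int git_open_noatime(const char *name)
{
        static int sha1_file_open_flag = O_NOATIME | O_CLOEXEC;

        for (;;) {
                int fd;

                errno = 0;
                fd = open(name, O_RDONLY | sha1_file_open_flag);
                if (fd >= 0)
                        return fd;

                /*
                 * If the failure is not "file does not exist", one of
                 * the optional flags may be unsupported on this
                 * platform; drop them and retry, remembering the
                 * decision in the static variable for later calls.
                 */
                if (errno != ENOENT && sha1_file_open_flag) {
                        sha1_file_open_flag = 0;
                        continue;
                }
                return fd;
        }
}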

I am tempted to suggest that the patch 13/13 under discussion may
also want to be done outside the scope of, and before, this series.
Even though with the current system an inherited file descriptor
would not cause issues for v1 filter processes, there is no good
reason to expose this file descriptor to them.

Thanks.



* Re: [PATCH v6 12/13] convert: add filter.<driver>.process option
  2016-08-30 20:46         ` Jakub Narębski
@ 2016-09-05 19:47           ` Lars Schneider
  0 siblings, 0 replies; 66+ messages in thread
From: Lars Schneider @ 2016-09-05 19:47 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, git, peff, sbeller, Johannes.Schindelin, mlbright


> On 30 Aug 2016, at 22:46, Jakub Narębski <jnareb@gmail.com> wrote:
> 
>>> The filter can exit right after the "error-all". If the filter does
>>> not exit then Git will kill the filter. I'll add this to the docs.
>> 
>> OK.
> 
> Is it explicit kill, or implicit kill: close pipe and end process?

I close the pipe and call "finish_command".
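
For readers following along, that amounts to roughly the following
with Git's run-command API (stop_filter() is a made-up name for
illustration, not the function in the patch):

#include "run-command.h"        /* struct child_process, finish_command() */
#include <unistd.h>

static void stop_filter(struct child_process *process)
{
        /*
         * Closing our end of the filter's stdin makes it see EOF on
         * its command pipe; finish_command() then waits for the
         * process and collects its exit status.
         */
        close(process->in);
        finish_command(process);
}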


>>> "abort" could be ambiguous because it could be read as "abort only
>>> this file". "abort-all" would work, though. Would you prefer to see
>>> "error" replaced by "abort" and "error-all" by "abort-all"?
>> 
>> No.
>> 
>> I was primarily reacting to the "-all" part, so anything that ends
>> with "-all" is equally ugly from my point of view and not an
>> improvement.
>>
>> As I said, "error-all", as long as other reviewers are happy with
>> it, is OK by me, too.
> 
> I'm rather partial to "abort" instead of "error-all", or "quit"/"exit"
> (but I prefer "abort" or "bail-out"), as it better reflects what this
> response is about: ending the filter process.

After some thinking I agree with "abort" instead of "error-all".


>>> A filter that dies during communication or does not adhere to the protocol
>>> is a faulty filter. Feeding the faulty filter after restart with the same 
>>> blob would likely cause the same error. 
>> 
>> Why does Git restart it and continue with the rest of the blobs
>> then?  Is there a reason to believe it may produce correct results
>> for other blobs if you run it again?
> 
> I guess the idea is that a single filter can be run on different
> types of blobs, and it could fail on some types (some files) and not
> others.  For example, an LFS-type filter might fail on files larger
> than 2 GiB, or an iconv-like filter converting UTF-16 to UTF-8 might
> fail on invalid byte sequences.

This mimics the v1 behavior and I will explain that in the documentation.


>>> B) If we communicate "shutdown" to the filter then we need to give the
>>>   filter some time to perform the exit before the filter is killed on
>>>   Git exit. I wasn't able to come up with a good answer for how
>>>   long Git should wait for the exit.
>> 
>> Would that be an issue?  A filter is buggy if, when told to shut
>> down, it ignores the request and hangs around forever.  I do not
>> think we even need to kill it.  It is not Git's problem.
> 
> I think the idea was for Git to request a shutdown, and for the
> filter to tell it when it has finished (or to just exit, with Git
> getting SIGCHLD, I guess).  It is hard to tell how long to wait, for
> example for the filter process to send a "sync" command, finish
> unmounting, or do something similar...

I like Junio's approach because then we don't need to wait at all...


>> I personally would be reluctant to become a filter process writer if
>> the system it will be talking to can kill it without giving it a
>> chance to do a graceful exit, but perhaps that is just me.  I don't
>> know if closing the pipe going to it, instead of having Git kill the
>> process on the other end as you do now, is any more involved than
>> what you have without extra patches.
> 
> Isn't it the same problem as with v1 filters being killed on Git
> process exit?  Unless a v2 filter wants to do something _after_
> writing its output to Git, it should be safe... unless the Git
> process itself is killed, but that is no different from the filter
> being stray-killed.

Yes, it should be safe. However, I think it is probably nicer if the filter
process can shut down gracefully instead of being killed.


>>>>> +Please note that you cannot use an existing `filter.<driver>.clean`
>>>>> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
>>>>> +because the former two use a different inter process communication
>>>>> +protocol than the latter one.
>>>> 
>>>> Would it be a useful sample program we can ship in contrib/ if you
>>>> created a "filter adapter" that reads these two configuration
>>>> variables and acts as a filter.<driver>.process?
>>> 
>>> You mean a v2 filter that would use v1 filters under the hood?
>>> If we would drop v1, then this would be useful. Otherwise I don't
>>> see any need for such a thing.
>> 
>> I meant it as primarily an example people can learn from when they
>> want to write their own.
> 
> Errr... if it wrapped an arbitrary v1 filter, it would be useless
> (performance as bad as, or worse than, an ordinary v1 clean or
> smudge filter).  It might make sense if the v1 filter and the v2
> wrapper were written in the same [dynamic] language, and the wrapper
> could just load / require the filter as a function / procedure,
> [compile it], and use it.  For example, a smudge/clean filter in
> Perl (e.g. rot13), with the v2 wrapper in Perl too.

I'll provide a simple Perl rot13 v2 filter in contrib.
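
(For reference, the transformation itself is trivial; a C version of
rot13, just to show what such a filter does to the content, could
look like this:)

#include <ctype.h>
#include <stddef.h>

/* Rotate alphabetic characters by 13 places, in place. */
static void rot13(char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++) {
                if (isalpha((unsigned char)buf[i])) {
                        char base = islower((unsigned char)buf[i]) ? 'a' : 'A';
                        buf[i] = (char)((buf[i] - base + 13) % 26 + base);
                }
        }
}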

Thanks,
Lars


end of thread, other threads:[~2016-09-05 19:47 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-25 11:07 [PATCH v6 00/13] Git filter protocol larsxschneider
2016-08-25 11:07 ` [PATCH v6 01/13] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
2016-08-25 11:07 ` [PATCH v6 02/13] pkt-line: extract set_packet_header() larsxschneider
2016-08-25 11:07 ` [PATCH v6 03/13] pkt-line: add packet_write_fmt_gently() larsxschneider
2016-08-25 18:12   ` Stefan Beller
2016-08-25 18:47     ` Lars Schneider
2016-08-25 21:41   ` Junio C Hamano
2016-08-26  9:17     ` Lars Schneider
2016-08-26 17:10       ` Junio C Hamano
2016-08-26 17:23         ` Jeff King
2016-08-25 11:07 ` [PATCH v6 04/13] pkt-line: add packet_flush_gently() larsxschneider
2016-08-25 11:07 ` [PATCH v6 05/13] pkt-line: add packet_write_gently() larsxschneider
2016-08-25 21:50   ` Junio C Hamano
2016-08-26  9:40     ` Lars Schneider
2016-08-26 17:15       ` Junio C Hamano
2016-08-29  9:40         ` Lars Schneider
2016-08-25 11:07 ` [PATCH v6 06/13] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
2016-08-25 18:46   ` Stefan Beller
2016-08-25 19:33     ` Lars Schneider
2016-08-25 22:31     ` Junio C Hamano
2016-08-26  0:55       ` Jacob Keller
2016-08-26 17:02         ` Stefan Beller
2016-08-26 17:21           ` Jeff King
2016-08-26 17:17         ` Junio C Hamano
2016-08-25 22:27   ` Junio C Hamano
2016-08-26 10:13     ` Lars Schneider
2016-08-26 17:21       ` Junio C Hamano
2016-08-29  9:43         ` Lars Schneider
2016-08-25 11:07 ` [PATCH v6 07/13] pack-protocol: fix maximum pkt-line size larsxschneider
2016-08-25 18:59   ` Stefan Beller
2016-08-25 19:35     ` Lars Schneider
2016-08-26 19:44       ` Junio C Hamano
2016-08-25 11:07 ` [PATCH v6 08/13] convert: quote filter names in error messages larsxschneider
2016-08-26 19:45   ` Junio C Hamano
2016-08-25 11:07 ` [PATCH v6 09/13] convert: modernize tests larsxschneider
2016-08-26 20:03   ` Junio C Hamano
2016-08-29 10:09     ` Lars Schneider
2016-08-25 11:07 ` [PATCH v6 10/13] convert: generate large test files only once larsxschneider
2016-08-25 19:17   ` Stefan Beller
2016-08-25 19:54     ` Lars Schneider
2016-08-29 17:52       ` Junio C Hamano
2016-08-30 11:47         ` Lars Schneider
2016-08-30 16:55           ` Junio C Hamano
2016-08-29 17:46   ` Junio C Hamano
2016-08-30 11:41     ` Lars Schneider
2016-08-30 16:37       ` Jeff King
2016-08-25 11:07 ` [PATCH v6 11/13] convert: make apply_filter() adhere to standard Git error handling larsxschneider
2016-08-25 11:07 ` [PATCH v6 12/13] convert: add filter.<driver>.process option larsxschneider
2016-08-29 22:21   ` Junio C Hamano
2016-08-30 16:27     ` Lars Schneider
2016-08-30 18:59       ` Junio C Hamano
2016-08-30 20:38         ` Lars Schneider
2016-08-30 22:23           ` Junio C Hamano
2016-08-31  4:57             ` Torsten Bögershausen
2016-08-31 13:14               ` Jakub Narębski
2016-08-30 20:46         ` Jakub Narębski
2016-09-05 19:47           ` Lars Schneider
2016-08-25 11:07 ` [PATCH v6 13/13] read-cache: make sure file handles are not inherited by child processes larsxschneider
2016-08-29 18:05   ` Junio C Hamano
2016-08-29 19:03     ` Lars Schneider
2016-08-29 19:45       ` Junio C Hamano
2016-08-30 12:32         ` Lars Schneider
2016-08-30 14:54           ` Torsten Bögershausen
2016-09-01 17:15             ` Junio C Hamano
2016-08-29 15:39 ` [PATCH v6 00/13] Git filter protocol Lars Schneider
2016-08-29 18:09   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
