git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v8 00/11] Git filter protocol
@ 2016-09-20 19:02 larsxschneider
  2016-09-20 19:02 ` [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
                   ` (11 more replies)
  0 siblings, 12 replies; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

The goal of this series is to avoid launching a new clean/smudge filter
process for each file that is filtered.

A short summary about v1 to v5 can be found here:
https://git.github.io/rev_news/2016/08/17/edition-18/

This series is also published on web:
https://github.com/larsxschneider/git/pull/12

Thanks a lot to
  Stefan, Torsten, Junio, Jeff, and Ramsay
for very helpful reviews,
Lars



## Major changes since v7

* explicitly define all packets as text packets terminated by an LF (except CONTENT and flush)
* move check_pipe() from write_or_die to run_command and reuse it



## All changes since v7

### Stefan

* http://public-inbox.org/git/CAGZ79kY0GaWuuh_MzKL6FZ7KWF2Kwhfh9qnEYd-qX8VDQWNmCQ@mail.gmail.com/
    * move check_pipe() from write_or_die to run_command and reuse it
    * use error() (== -1) as return value

* http://public-inbox.org/git/CAGZ79kZdroDdD5SHP+-9svSTYbJfn2vsFXAwC4aen3hMVEOOPA@mail.gmail.com/
    * remove verbose return value explanation in commit messages
    * on "packet_flush_gently" introduction, mention that the function is used later


### Torsten

* http://public-inbox.org/git/20160910164056.GA14646@tb-raspi/
    * remove unnecessary parenthesis


* http://public-inbox.org/git/20160910062919.GB11001@tb-raspi/
    * explicitly define all packets as text packets terminated by an LF (except CONTENT and flush)



### Junio

* http://public-inbox.org/git/xmqq8tuvx1sz.fsf@gitster.mtv.corp.google.com/
    * fix SP in Perl script
    * use `unsigned int` for CAP_CLEAN and CAP_SMUDGE
    * fix pointer notation
    * remove invalid "convert.h" include


### Ramsay

* http://public-inbox.org/git/6373d68b-574d-59f3-7b8d-60dd3a673806@ramsayjones.plus.com
    * declare packet_write_gently() static


### Lars
* add SP in paths for test case
* fix "{" code formatting



## Interdiff (v7..v8)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ac000ea..946dcad 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -385,9 +385,11 @@ Long Running Filter Process
 If the filter command (a string value) is defined via
 `filter.<driver>.process` then Git can process all blobs with a
 single filter invocation for the entire life of a single Git
-command. This is achieved by using the following packet format
-(pkt-line, see technical/protocol-common.txt) based protocol over
-standard input and standard output.
+command. This is achieved by using a packet format (pkt-line,
+see technical/protocol-common.txt) based protocol over standard
+input and standard output as follows. All packets are considered
+text and therefore are terminated by an LF. Exceptions are the
+"*CONTENT" packets and the flush packet.

 Git starts the filter when it encounters the first file
 that needs to be cleaned or smudged. After the filter started
@@ -430,8 +432,8 @@ to filter relative to the repository root. Right after these packets
 Git sends the content split in zero or more pkt-line packets and a
 flush packet to terminate content.
 ------------------------
-packet:          git> command=smudge\n
-packet:          git> pathname=path/testfile.dat\n
+packet:          git> command=smudge
+packet:          git> pathname=path/testfile.dat
 packet:          git> 0000
 packet:          git> CONTENT
 packet:          git> 0000
@@ -445,7 +447,7 @@ or more pkt-line packets and a flush packet at the end. Finally, a
 second list of "key=value" pairs terminated with a flush packet
 is expected. The filter can change the status in the second list.
 ------------------------
-packet:          git< status=success\n
+packet:          git< status=success
 packet:          git< 0000
 packet:          git< SMUDGED_CONTENT
 packet:          git< 0000
@@ -455,7 +457,7 @@ packet:          git< 0000  # empty list!
 If the result content is empty then the filter is expected to respond
 with a success status and an empty list.
 ------------------------
-packet:          git< status=success\n
+packet:          git< status=success
 packet:          git< 0000
 packet:          git< 0000  # empty content!
 packet:          git< 0000  # empty list!
@@ -466,7 +468,7 @@ it is expected to respond with an "error" status. Depending on the
 `filter.<driver>.required` flag Git will interpret that as error
 but it will not stop or restart the filter process.
 ------------------------
-packet:          git< status=error\n
+packet:          git< status=error
 packet:          git< 0000
 ------------------------

@@ -476,11 +478,11 @@ completely) sent. Depending on the `filter.<driver>.required` flag
 Git will interpret that as error but it will not stop or restart the
 filter process.
 ------------------------
-packet:          git< status=success\n
+packet:          git< status=success
 packet:          git< 0000
 packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
 packet:          git< 0000
-packet:          git< status=error\n
+packet:          git< status=error
 packet:          git< 0000
 ------------------------

@@ -500,7 +502,7 @@ the `filter.<driver>.required` flag Git will interpret that as error
 for the content as well as any future content for the lifetime of the
 Git process but it will not stop or restart the filter process.
 ------------------------
-packet:          git< status=abort\n
+packet:          git< status=abort
 packet:          git< 0000
 ------------------------

@@ -510,8 +512,8 @@ the command pipe on exit. The filter is expected to detect EOF
 and exit gracefully on its own.

 A long running filter demo implementation can be found in
-`contrib/long-running-filter/example.pl` located in the Git
-core repository. If you develop your own long running filter
+`contrib/long-running-filter/example.pl` located in the Git
+core repository. If you develop your own long running filter
 process then the `GIT_TRACE_PACKET` environment variables can be
 very helpful for debugging (see linkgit:git[1]).

diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
index 279fbfb..c13a631 100755
--- a/contrib/long-running-filter/example.pl
+++ b/contrib/long-running-filter/example.pl
@@ -9,7 +9,7 @@ use warnings;

 my $MAX_PACKET_CONTENT_SIZE = 65516;

-sub packet_read {
+sub packet_bin_read {
     my $buffer;
     my $bytes_read = read STDIN, $buffer, 4;
     if ( $bytes_read == 0 ) {
@@ -37,38 +37,50 @@ sub packet_read {
     }
 }

-sub packet_write {
+sub packet_txt_read {
+    my ( $res, $buf ) = packet_bin_read();
+    unless ( $buf =~ /\n$/ ) {
+        die "A non-binary line SHOULD BE terminated by an LF.";
+    }
+    return ( $res, substr( $buf, 0, -1 ) );
+}
+
+sub packet_bin_write {
     my ($packet) = @_;
     print STDOUT sprintf( "%04x", length($packet) + 4 );
     print STDOUT $packet;
     STDOUT->flush();
 }

+sub packet_txt_write {
+    packet_bin_write( $_[0] . "\n" );
+}
+
 sub packet_flush {
     print STDOUT sprintf( "%04x", 0 );
     STDOUT->flush();
 }

-( packet_read() eq ( 0, "git-filter-client" ) ) || die "bad initialization";
-( packet_read() eq ( 0, "version=2" ) )         || die "bad version";
-( packet_read() eq ( 1, "" ) )                  || die "bad version end";
+( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";

-packet_write("git-filter-server\n");
-packet_write("version=2\n");
+packet_txt_write("git-filter-server");
+packet_txt_write("version=2");

-( packet_read() eq ( 0, "clean=true" ) )  || die "bad capability";
-( packet_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
-( packet_read() eq ( 1, "" ) )            || die "bad capability end";
+( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
+( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";

-packet_write( "clean=true\n" );
-packet_write( "smudge=true\n" );
+packet_txt_write("clean=true");
+packet_txt_write("smudge=true");
 packet_flush();

 while (1) {
-    my ($command) = packet_read() =~ /^command=([^=]+)\n$/;
-    my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
+    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
+    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;

-    packet_read();
+    packet_bin_read();

     my $input = "";
     {
@@ -76,7 +88,7 @@ while (1) {
         my $buffer;
         my $done = 0;
         while ( !$done ) {
-            ( $done, $buffer ) = packet_read();
+            ( $done, $buffer ) = packet_bin_read();
             $input .= $buffer;
         }
     }
@@ -94,11 +106,11 @@ while (1) {
         die "bad command '$command'";
     }

-    packet_write("status=success\n");
+    packet_txt_write("status=success");
     packet_flush();
     while ( length($output) > 0 ) {
         my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
-        packet_write($packet);
+        packet_bin_write($packet);
         if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
             $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
         }
@@ -106,6 +118,6 @@ while (1) {
             $output = "";
         }
     }
-    packet_flush(); # flush content!
-    packet_flush(); # empty list!
+    packet_flush();    # flush content!
+    packet_flush();    # empty list!
 }
diff --git a/convert.c b/convert.c
index 0ed48ed..bd66257 100644
--- a/convert.c
+++ b/convert.c
@@ -472,16 +472,13 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
    return 0; /* error was already reported */

  if (strbuf_read(&nbuf, async.out, len) < 0) {
-   error("read from external filter '%s' failed", cmd);
-   err = -1;
+   err = error("read from external filter '%s' failed", cmd);
  }
  if (close(async.out)) {
-   error("read from external filter '%s' failed", cmd);
-   err = -1;
+   err = error("read from external filter '%s' failed", cmd);
  }
  if (finish_async(&async)) {
-   error("external filter '%s' failed", cmd);
-   err = -1;
+   err = error("external filter '%s' failed", cmd);
  }

  if (!err) {
@@ -496,7 +493,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le

 struct cmd2process {
  struct hashmap_entry ent; /* must be the first member! */
- int supported_capabilities;
+ unsigned int supported_capabilities;
  const char *cmd;
  struct child_process process;
 };
@@ -541,13 +538,12 @@ static int packet_write_list(int fd, const char *line, ...)
  va_list args;
  int err;
  va_start(args, line);
- for (;;)
- {
+ for (;;) {
    if (!line)
      break;
    if (strlen(line) > PKTLINE_DATA_MAXLEN)
      return -1;
-   err = packet_write_fmt_gently(fd, "%s", line);
+   err = packet_write_fmt_gently(fd, "%s\n", line);
    if (err)
      return err;
    line = va_arg(args, const char*);
@@ -601,8 +597,7 @@ static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, cons

  err = packet_write_list(process->in, "clean=true", "smudge=true", NULL);

- for (;;)
- {
+ for (;;) {
    cap_buf = packet_read_line(process->out, NULL);
    if (!cap_buf)
      break;
@@ -658,7 +653,7 @@ static void read_multi_file_filter_values(int fd, struct strbuf *status) {

 static int apply_multi_file_filter(const char *path, const char *src, size_t len,
                                    int fd, struct strbuf *dst, const char *cmd,
-                                   const int wanted_capability)
+                                   const unsigned int wanted_capability)
 {
  int err;
  struct cmd2process *entry;
@@ -703,17 +698,18 @@ static int apply_multi_file_filter(const char *path, const char *src, size_t len

  sigchain_push(SIGPIPE, SIG_IGN);

-
- err = (strlen(filter_type) > PKTLINE_DATA_MAXLEN);
+ err = strlen(filter_type) > PKTLINE_DATA_MAXLEN;
  if (err)
    goto done;
+
  err = packet_write_fmt_gently(process->in, "command=%s\n", filter_type);
  if (err)
    goto done;

- err = (strlen(path) > PKTLINE_DATA_MAXLEN);
+ err = strlen(path) > PKTLINE_DATA_MAXLEN;
  if (err)
    goto done;
+
  err = packet_write_fmt_gently(process->in, "pathname=%s\n", path);
  if (err)
    goto done;
@@ -780,9 +776,9 @@ static struct convert_driver {

 static int apply_filter(const char *path, const char *src, size_t len,
                         int fd, struct strbuf *dst, struct convert_driver *drv,
-                        const int wanted_capability)
+                        const unsigned int wanted_capability)
 {
- const char* cmd = NULL;
+ const char *cmd = NULL;

  if (!drv)
    return 0;
diff --git a/pkt-line.c b/pkt-line.c
index 5001a07..a0a8543 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -96,8 +96,7 @@ int packet_flush_gently(int fd)
  packet_trace("0000", 4, 1);
  if (write_in_full(fd, "0000", 4) == 4)
    return 0;
- error("flush packet write failed");
- return -1;
+ return error("flush packet write failed");
 }

 void packet_buf_flush(struct strbuf *buf)
@@ -146,19 +145,10 @@ static int packet_write_fmt_1(int fd, int gently,
    return 0;

  if (!gently) {
-   if (errno == EPIPE) {
-     if (in_async())
-       async_exit(141);
-
-     signal(SIGPIPE, SIG_DFL);
-     raise(SIGPIPE);
-     /* Should never happen, but just in case... */
-     exit(141);
-   }
-   die_errno("packet write error");
+   check_pipe(errno);
+   die_errno("packet write with format failed");
  }
- error("packet write failed");
- return -1;
+ return error("packet write with format failed");
 }

 void packet_write_fmt(int fd, const char *fmt, ...)
@@ -181,13 +171,12 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
  return status;
 }

-int packet_write_gently(const int fd_out, const char *buf, size_t size)
+static int packet_write_gently(const int fd_out, const char *buf, size_t size)
 {
  static char packet_write_buffer[LARGE_PACKET_MAX];

  if (size > sizeof(packet_write_buffer) - 4) {
-   error("packet write failed");
-   return -1;
+   return error("packet write failed - data exceeds max packet size");
  }
  packet_trace(buf, size, 1);
  size += 4;
@@ -195,9 +184,7 @@ int packet_write_gently(const int fd_out, const char *buf, size_t size)
  memcpy(packet_write_buffer + 4, buf, size - 4);
  if (write_in_full(fd_out, packet_write_buffer, size) == size)
    return 0;
-
- error("packet write failed");
- return -1;
+ return error("packet write failed");
 }

 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
diff --git a/run-command.c b/run-command.c
index 5a4dbb6..b72f6d1 100644
--- a/run-command.c
+++ b/run-command.c
@@ -6,6 +6,19 @@
 #include "thread-utils.h"
 #include "strbuf.h"

+void check_pipe(int err)
+{
+ if (err == EPIPE) {
+   if (in_async())
+     async_exit(141);
+
+   signal(SIGPIPE, SIG_DFL);
+   raise(SIGPIPE);
+   /* Should never happen, but just in case... */
+   exit(141);
+ }
+}
+
 void child_process_init(struct child_process *child)
 {
  memset(child, 0, sizeof(*child));
diff --git a/run-command.h b/run-command.h
index 5066649..e7c5f71 100644
--- a/run-command.h
+++ b/run-command.h
@@ -54,6 +54,8 @@ int finish_command(struct child_process *);
 int finish_command_in_signal(struct child_process *);
 int run_command(struct child_process *);

+void check_pipe(int err);
+
 /*
  * Returns the path to the hook file, or NULL if the hook is missing
  * or disabled. Note that this points to static storage that will be
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 1c98ac3..210c4f6 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -34,7 +34,7 @@ test_expect_success setup '
  git checkout -- test test.t test.i &&

  echo "content-test2" >test2.o &&
- echo "content-test3-subdir" >test3-subdir.o
+ echo "content-test3 - subdir" >"test3 - subdir.o"
 '

 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -317,9 +317,9 @@ check_filter_no_call () {
 }

 check_rot13 () {
- test_cmp $1 $2 &&
- ./../rot13.sh <$1 >expected &&
- git cat-file blob :$2 >actual &&
+ test_cmp "$1" "$2" &&
+ ./../rot13.sh <"$1" >expected &&
+ git cat-file blob :"$2" >actual &&
  test_cmp expected actual
 }

@@ -340,7 +340,7 @@ test_expect_success PERL 'required process filter should filter data' '
    cp ../test.o test.r &&
    cp ../test2.o test2.r &&
    mkdir testsubdir &&
-   cp ../test3-subdir.o testsubdir/test3-subdir.r &&
+   cp "../test3 - subdir.o" "testsubdir/test3 - subdir.r" &&
    >test4-empty.r &&

    check_filter \
@@ -349,7 +349,7 @@ test_expect_success PERL 'required process filter should filter data' '
          1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
          1 IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
          1 IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
-         1 IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 . [OK]
+         1 IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
          1 START
          1 STOP
          1 wrote filter header
@@ -361,13 +361,13 @@ test_expect_success PERL 'required process filter should filter data' '
          x IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
          x IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
          x IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
-         x IN: clean testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 . [OK]
+         x IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
          1 START
          1 STOP
          1 wrote filter header
        EOF

-   rm -f test?.r testsubdir/test3-subdir.r &&
+   rm -f test?.r "testsubdir/test3 - subdir.r" &&

    check_filter_ignore_clean \
      git checkout . \
@@ -375,7 +375,7 @@ test_expect_success PERL 'required process filter should filter data' '
          START
          wrote filter header
          IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
-         IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 . [OK]
+         IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
          STOP
        EOF

@@ -395,13 +395,13 @@ test_expect_success PERL 'required process filter should filter data' '
          IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
          IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
          IN: smudge test4-empty.r 0 [OK] -- OUT: 0  [OK]
-         IN: smudge testsubdir/test3-subdir.r 21 [OK] -- OUT: 21 . [OK]
+         IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
          STOP
        EOF

    check_rot13 ../test.o test.r &&
    check_rot13 ../test2.o test2.r &&
-   check_rot13 ../test3-subdir.o testsubdir/test3-subdir.r
+   check_rot13 "../test3 - subdir.o" "testsubdir/test3 - subdir.r"
  )
 '

diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
index 8e27877..8958f71 100755
--- a/t/t0021/rot13-filter.pl
+++ b/t/t0021/rot13-filter.pl
@@ -34,7 +34,7 @@ sub rot13 {
     return $str;
 }

-sub packet_read {
+sub packet_bin_read {
     my $buffer;
     my $bytes_read = read STDIN, $buffer, 4;
     if ( $bytes_read == 0 ) {
@@ -63,13 +63,25 @@ sub packet_read {
     }
 }

-sub packet_write {
+sub packet_txt_read {
+    my ( $res, $buf ) = packet_bin_read();
+    unless ( $buf =~ /\n$/ ) {
+        die "A non-binary line SHOULD BE terminated by an LF.";
+    }
+    return ( $res, substr( $buf, 0, -1 ) );
+}
+
+sub packet_bin_write {
     my ($packet) = @_;
     print STDOUT sprintf( "%04x", length($packet) + 4 );
     print STDOUT $packet;
     STDOUT->flush();
 }

+sub packet_txt_write {
+    packet_bin_write( $_[0] . "\n" );
+}
+
 sub packet_flush {
     print STDOUT sprintf( "%04x", 0 );
     STDOUT->flush();
@@ -78,35 +90,35 @@ sub packet_flush {
 print $debug "START\n";
 $debug->flush();

-( packet_read() eq ( 0, "git-filter-client" ) ) || die "bad initialization";
-( packet_read() eq ( 0, "version=2" ) )         || die "bad version";
-( packet_read() eq ( 1, "" ) )                  || die "bad version end";
+( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";

-packet_write("git-filter-server\n");
-packet_write("version=2\n");
+packet_txt_write("git-filter-server");
+packet_txt_write("version=2");

-( packet_read() eq ( 0, "clean=true" ) )  || die "bad capability";
-( packet_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
-( packet_read() eq ( 1, "" ) )            || die "bad capability end";
+( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
+( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";

 foreach (@capabilities) {
-    packet_write( $_ . "=true\n" );
+    packet_txt_write( $_ . "=true" );
 }
 packet_flush();
 print $debug "wrote filter header\n";
 $debug->flush();

 while (1) {
-    my ($command) = packet_read() =~ /^command=([^=]+)\n$/;
+    my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
     print $debug "IN: $command";
     $debug->flush();

-    my ($pathname) = packet_read() =~ /^pathname=([^=]+)\n$/;
+    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
     print $debug " $pathname";
     $debug->flush();

     # Flush
-    packet_read();
+    packet_bin_read();

     my $input = "";
     {
@@ -114,7 +126,7 @@ while (1) {
         my $buffer;
         my $done = 0;
         while ( !$done ) {
-            ( $done, $buffer ) = packet_read();
+            ( $done, $buffer ) = packet_bin_read();
             $input .= $buffer;
         }
         print $debug " " . length($input) . " [OK] -- ";
@@ -141,17 +153,17 @@ while (1) {
     if ( $pathname eq "error.r" ) {
         print $debug "[ERROR]\n";
         $debug->flush();
-        packet_write("status=error\n");
+        packet_txt_write("status=error");
         packet_flush();
     }
     elsif ( $pathname eq "abort.r" ) {
         print $debug "[ABORT]\n";
         $debug->flush();
-        packet_write("status=abort\n");
+        packet_txt_write("status=abort");
         packet_flush();
     }
     else {
-        packet_write("status=success\n");
+        packet_txt_write("status=success");
         packet_flush();

         if ( $pathname eq "${command}-write-fail.r" ) {
@@ -162,7 +174,7 @@ while (1) {

         while ( length($output) > 0 ) {
             my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
-            packet_write($packet);
+            packet_bin_write($packet);
             print $debug ".";
             if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
                 $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
diff --git a/unpack-trees.c b/unpack-trees.c
index f6798f8..11c37fb 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -10,7 +10,6 @@
 #include "attr.h"
 #include "split-index.h"
 #include "dir.h"
-#include "convert.h"

 /*
  * Error messages expected by scripts out of plumbing commands such as
diff --git a/write_or_die.c b/write_or_die.c
index 0734432..eab8c8d 100644
--- a/write_or_die.c
+++ b/write_or_die.c
@@ -1,19 +1,6 @@
 #include "cache.h"
 #include "run-command.h"

-static void check_pipe(int err)
-{
- if (err == EPIPE) {
-   if (in_async())
-     async_exit(141);
-
-   signal(SIGPIPE, SIG_DFL);
-   raise(SIGPIPE);
-   /* Should never happen, but just in case... */
-   exit(141);
- }
-}
-
 /*
  * Some cases use stdio, but want to flush after the write
  * to get error handling (and to get better interactive



Lars Schneider (11):
  pkt-line: rename packet_write() to packet_write_fmt()
  pkt-line: extract set_packet_header()
  run-command: move check_pipe() from write_or_die to run_command
  pkt-line: add packet_write_fmt_gently()
  pkt-line: add packet_flush_gently()
  pkt-line: add packet_write_gently()
  pkt-line: add functions to read/write flush terminated packet streams
  convert: quote filter names in error messages
  convert: modernize tests
  convert: make apply_filter() adhere to standard Git error handling
  convert: add filter.<driver>.process option

 Documentation/gitattributes.txt        | 156 +++++++++++-
 builtin/archive.c                      |   4 +-
 builtin/receive-pack.c                 |   4 +-
 builtin/remote-ext.c                   |   4 +-
 builtin/upload-archive.c               |   4 +-
 connect.c                              |   2 +-
 contrib/long-running-filter/example.pl | 123 ++++++++++
 convert.c                              | 369 ++++++++++++++++++++++++----
 daemon.c                               |   2 +-
 http-backend.c                         |   2 +-
 pkt-line.c                             | 147 +++++++++++-
 pkt-line.h                             |  12 +-
 run-command.c                          |  13 +
 run-command.h                          |   2 +
 shallow.c                              |   2 +-
 t/t0021-conversion.sh                  | 423 ++++++++++++++++++++++++++++++---
 t/t0021/rot13-filter.pl                | 191 +++++++++++++++
 upload-pack.c                          |  30 +--
 write_or_die.c                         |  13 -
 19 files changed, 1379 insertions(+), 124 deletions(-)
 create mode 100755 contrib/long-running-filter/example.pl
 create mode 100755 t/t0021/rot13-filter.pl

--
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt()
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-24 21:14   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 02/11] pkt-line: extract set_packet_header() larsxschneider
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write() should be called packet_write_fmt() as the string
parameter can be formatted.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 builtin/archive.c        |  4 ++--
 builtin/receive-pack.c   |  4 ++--
 builtin/remote-ext.c     |  4 ++--
 builtin/upload-archive.c |  4 ++--
 connect.c                |  2 +-
 daemon.c                 |  2 +-
 http-backend.c           |  2 +-
 pkt-line.c               |  2 +-
 pkt-line.h               |  2 +-
 shallow.c                |  2 +-
 upload-pack.c            | 30 +++++++++++++++---------------
 11 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/builtin/archive.c b/builtin/archive.c
index a1e3b94..49f4914 100644
--- a/builtin/archive.c
+++ b/builtin/archive.c
@@ -47,10 +47,10 @@ static int run_remote_archiver(int argc, const char **argv,
 	if (name_hint) {
 		const char *format = archive_format_from_filename(name_hint);
 		if (format)
-			packet_write(fd[1], "argument --format=%s\n", format);
+			packet_write_fmt(fd[1], "argument --format=%s\n", format);
 	}
 	for (i = 1; i < argc; i++)
-		packet_write(fd[1], "argument %s\n", argv[i]);
+		packet_write_fmt(fd[1], "argument %s\n", argv[i]);
 	packet_flush(fd[1]);
 
 	buf = packet_read_line(fd[0], NULL);
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 011db00..1ce7682 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -218,7 +218,7 @@ static int receive_pack_config(const char *var, const char *value, void *cb)
 static void show_ref(const char *path, const unsigned char *sha1)
 {
 	if (sent_capabilities) {
-		packet_write(1, "%s %s\n", sha1_to_hex(sha1), path);
+		packet_write_fmt(1, "%s %s\n", sha1_to_hex(sha1), path);
 	} else {
 		struct strbuf cap = STRBUF_INIT;
 
@@ -233,7 +233,7 @@ static void show_ref(const char *path, const unsigned char *sha1)
 		if (advertise_push_options)
 			strbuf_addstr(&cap, " push-options");
 		strbuf_addf(&cap, " agent=%s", git_user_agent_sanitized());
-		packet_write(1, "%s %s%c%s\n",
+		packet_write_fmt(1, "%s %s%c%s\n",
 			     sha1_to_hex(sha1), path, 0, cap.buf);
 		strbuf_release(&cap);
 		sent_capabilities = 1;
diff --git a/builtin/remote-ext.c b/builtin/remote-ext.c
index 88eb8f9..11b48bf 100644
--- a/builtin/remote-ext.c
+++ b/builtin/remote-ext.c
@@ -128,9 +128,9 @@ static void send_git_request(int stdin_fd, const char *serv, const char *repo,
 	const char *vhost)
 {
 	if (!vhost)
-		packet_write(stdin_fd, "%s %s%c", serv, repo, 0);
+		packet_write_fmt(stdin_fd, "%s %s%c", serv, repo, 0);
 	else
-		packet_write(stdin_fd, "%s %s%chost=%s%c", serv, repo, 0,
+		packet_write_fmt(stdin_fd, "%s %s%chost=%s%c", serv, repo, 0,
 			     vhost, 0);
 }
 
diff --git a/builtin/upload-archive.c b/builtin/upload-archive.c
index 2caedf1..dc872f6 100644
--- a/builtin/upload-archive.c
+++ b/builtin/upload-archive.c
@@ -88,11 +88,11 @@ int cmd_upload_archive(int argc, const char **argv, const char *prefix)
 	writer.git_cmd = 1;
 	if (start_command(&writer)) {
 		int err = errno;
-		packet_write(1, "NACK unable to spawn subprocess\n");
+		packet_write_fmt(1, "NACK unable to spawn subprocess\n");
 		die("upload-archive: %s", strerror(err));
 	}
 
-	packet_write(1, "ACK\n");
+	packet_write_fmt(1, "ACK\n");
 	packet_flush(1);
 
 	while (1) {
diff --git a/connect.c b/connect.c
index 722dc3f..5330d9c 100644
--- a/connect.c
+++ b/connect.c
@@ -730,7 +730,7 @@ struct child_process *git_connect(int fd[2], const char *url,
 		 * Note: Do not add any other headers here!  Doing so
 		 * will cause older git-daemon servers to crash.
 		 */
-		packet_write(fd[1],
+		packet_write_fmt(fd[1],
 			     "%s %s%chost=%s%c",
 			     prog, path, 0,
 			     target_host, 0);
diff --git a/daemon.c b/daemon.c
index 425aad0..afce1b9 100644
--- a/daemon.c
+++ b/daemon.c
@@ -281,7 +281,7 @@ static int daemon_error(const char *dir, const char *msg)
 {
 	if (!informative_errors)
 		msg = "access denied or repository not exported";
-	packet_write(1, "ERR %s: %s", msg, dir);
+	packet_write_fmt(1, "ERR %s: %s", msg, dir);
 	return -1;
 }
 
diff --git a/http-backend.c b/http-backend.c
index adc8c8c..eef0a36 100644
--- a/http-backend.c
+++ b/http-backend.c
@@ -464,7 +464,7 @@ static void get_info_refs(struct strbuf *hdr, char *arg)
 		hdr_str(hdr, content_type, buf.buf);
 		end_headers(hdr);
 
-		packet_write(1, "# service=git-%s\n", svc->name);
+		packet_write_fmt(1, "# service=git-%s\n", svc->name);
 		packet_flush(1);
 
 		argv[0] = svc->name;
diff --git a/pkt-line.c b/pkt-line.c
index 62fdb37..0a9b61c 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -118,7 +118,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-void packet_write(int fd, const char *fmt, ...)
+void packet_write_fmt(int fd, const char *fmt, ...)
 {
 	static struct strbuf buf = STRBUF_INIT;
 	va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 3cb9d91..1902fb3 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -20,7 +20,7 @@
  * side can't, we stay with pure read/write interfaces.
  */
 void packet_flush(int fd);
-void packet_write(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
diff --git a/shallow.c b/shallow.c
index 54e2db7..d666e24 100644
--- a/shallow.c
+++ b/shallow.c
@@ -260,7 +260,7 @@ static int advertise_shallow_grafts_cb(const struct commit_graft *graft, void *c
 {
 	int fd = *(int *)cb;
 	if (graft->nr_parent == -1)
-		packet_write(fd, "shallow %s\n", oid_to_hex(&graft->oid));
+		packet_write_fmt(fd, "shallow %s\n", oid_to_hex(&graft->oid));
 	return 0;
 }
 
diff --git a/upload-pack.c b/upload-pack.c
index ca7f941..cd47de6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -393,13 +393,13 @@ static int get_common_commits(void)
 			if (multi_ack == 2 && got_common
 			    && !got_other && ok_to_give_up()) {
 				sent_ready = 1;
-				packet_write(1, "ACK %s ready\n", last_hex);
+				packet_write_fmt(1, "ACK %s ready\n", last_hex);
 			}
 			if (have_obj.nr == 0 || multi_ack)
-				packet_write(1, "NAK\n");
+				packet_write_fmt(1, "NAK\n");
 
 			if (no_done && sent_ready) {
-				packet_write(1, "ACK %s\n", last_hex);
+				packet_write_fmt(1, "ACK %s\n", last_hex);
 				return 0;
 			}
 			if (stateless_rpc)
@@ -416,20 +416,20 @@ static int get_common_commits(void)
 					const char *hex = sha1_to_hex(sha1);
 					if (multi_ack == 2) {
 						sent_ready = 1;
-						packet_write(1, "ACK %s ready\n", hex);
+						packet_write_fmt(1, "ACK %s ready\n", hex);
 					} else
-						packet_write(1, "ACK %s continue\n", hex);
+						packet_write_fmt(1, "ACK %s continue\n", hex);
 				}
 				break;
 			default:
 				got_common = 1;
 				memcpy(last_hex, sha1_to_hex(sha1), 41);
 				if (multi_ack == 2)
-					packet_write(1, "ACK %s common\n", last_hex);
+					packet_write_fmt(1, "ACK %s common\n", last_hex);
 				else if (multi_ack)
-					packet_write(1, "ACK %s continue\n", last_hex);
+					packet_write_fmt(1, "ACK %s continue\n", last_hex);
 				else if (have_obj.nr == 1)
-					packet_write(1, "ACK %s\n", last_hex);
+					packet_write_fmt(1, "ACK %s\n", last_hex);
 				break;
 			}
 			continue;
@@ -437,10 +437,10 @@ static int get_common_commits(void)
 		if (!strcmp(line, "done")) {
 			if (have_obj.nr > 0) {
 				if (multi_ack)
-					packet_write(1, "ACK %s\n", last_hex);
+					packet_write_fmt(1, "ACK %s\n", last_hex);
 				return 0;
 			}
-			packet_write(1, "NAK\n");
+			packet_write_fmt(1, "NAK\n");
 			return -1;
 		}
 		die("git upload-pack: expected SHA1 list, got '%s'", line);
@@ -650,7 +650,7 @@ static void receive_needs(void)
 		while (result) {
 			struct object *object = &result->item->object;
 			if (!(object->flags & (CLIENT_SHALLOW|NOT_SHALLOW))) {
-				packet_write(1, "shallow %s",
+				packet_write_fmt(1, "shallow %s",
 						oid_to_hex(&object->oid));
 				register_shallow(object->oid.hash);
 				shallow_nr++;
@@ -662,7 +662,7 @@ static void receive_needs(void)
 			struct object *object = shallows.objects[i].item;
 			if (object->flags & NOT_SHALLOW) {
 				struct commit_list *parents;
-				packet_write(1, "unshallow %s",
+				packet_write_fmt(1, "unshallow %s",
 					oid_to_hex(&object->oid));
 				object->flags &= ~CLIENT_SHALLOW;
 				/* make sure the real parents are parsed */
@@ -741,7 +741,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 		struct strbuf symref_info = STRBUF_INIT;
 
 		format_symref_info(&symref_info, cb_data);
-		packet_write(1, "%s %s%c%s%s%s%s%s agent=%s\n",
+		packet_write_fmt(1, "%s %s%c%s%s%s%s%s agent=%s\n",
 			     oid_to_hex(oid), refname_nons,
 			     0, capabilities,
 			     (allow_unadvertised_object_request & ALLOW_TIP_SHA1) ?
@@ -753,11 +753,11 @@ static int send_ref(const char *refname, const struct object_id *oid,
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 	} else {
-		packet_write(1, "%s %s\n", oid_to_hex(oid), refname_nons);
+		packet_write_fmt(1, "%s %s\n", oid_to_hex(oid), refname_nons);
 	}
 	capabilities = NULL;
 	if (!peel_ref(refname, peeled.hash))
-		packet_write(1, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
+		packet_write_fmt(1, "%s %s^{}\n", oid_to_hex(&peeled), refname_nons);
 	return 0;
 }
 
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 02/11] pkt-line: extract set_packet_header()
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
  2016-09-20 19:02 ` [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-24 21:22   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command larsxschneider
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

set_packet_header() converts an integer to a 4 byte hex string. Make
this function locally available so that other pkt-line functions can
use it.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 0a9b61c..e8adc0f 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -97,10 +97,20 @@ void packet_buf_flush(struct strbuf *buf)
 	strbuf_add(buf, "0000", 4);
 }
 
-#define hex(a) (hexchar[(a) & 15])
-static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+static void set_packet_header(char *buf, const int size)
 {
 	static char hexchar[] = "0123456789abcdef";
+
+	#define hex(a) (hexchar[(a) & 15])
+	buf[0] = hex(size >> 12);
+	buf[1] = hex(size >> 8);
+	buf[2] = hex(size >> 4);
+	buf[3] = hex(size);
+	#undef hex
+}
+
+static void format_packet(struct strbuf *out, const char *fmt, va_list args)
+{
 	size_t orig_len, n;
 
 	orig_len = out->len;
@@ -111,10 +121,7 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	if (n > LARGE_PACKET_MAX)
 		die("protocol error: impossibly long line");
 
-	out->buf[orig_len + 0] = hex(n >> 12);
-	out->buf[orig_len + 1] = hex(n >> 8);
-	out->buf[orig_len + 2] = hex(n >> 4);
-	out->buf[orig_len + 3] = hex(n);
+	set_packet_header(&out->buf[orig_len], n);
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
  2016-09-20 19:02 ` [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
  2016-09-20 19:02 ` [PATCH v8 02/11] pkt-line: extract set_packet_header() larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-24 22:12   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently() larsxschneider
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Move check_pipe() to run_command and make it public. This is necessary
to call the function from pkt-line in a subsequent patch.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 run-command.c  | 13 +++++++++++++
 run-command.h  |  2 ++
 write_or_die.c | 13 -------------
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/run-command.c b/run-command.c
index 5a4dbb6..b72f6d1 100644
--- a/run-command.c
+++ b/run-command.c
@@ -6,6 +6,19 @@
 #include "thread-utils.h"
 #include "strbuf.h"
 
+void check_pipe(int err)
+{
+	if (err == EPIPE) {
+		if (in_async())
+			async_exit(141);
+
+		signal(SIGPIPE, SIG_DFL);
+		raise(SIGPIPE);
+		/* Should never happen, but just in case... */
+		exit(141);
+	}
+}
+
 void child_process_init(struct child_process *child)
 {
 	memset(child, 0, sizeof(*child));
diff --git a/run-command.h b/run-command.h
index 5066649..e7c5f71 100644
--- a/run-command.h
+++ b/run-command.h
@@ -54,6 +54,8 @@ int finish_command(struct child_process *);
 int finish_command_in_signal(struct child_process *);
 int run_command(struct child_process *);
 
+void check_pipe(int err);
+
 /*
  * Returns the path to the hook file, or NULL if the hook is missing
  * or disabled. Note that this points to static storage that will be
diff --git a/write_or_die.c b/write_or_die.c
index 0734432..eab8c8d 100644
--- a/write_or_die.c
+++ b/write_or_die.c
@@ -1,19 +1,6 @@
 #include "cache.h"
 #include "run-command.h"
 
-static void check_pipe(int err)
-{
-	if (err == EPIPE) {
-		if (in_async())
-			async_exit(141);
-
-		signal(SIGPIPE, SIG_DFL);
-		raise(SIGPIPE);
-		/* Should never happen, but just in case... */
-		exit(141);
-	}
-}
-
 /*
  * Some cases use stdio, but want to flush after the write
  * to get error handling (and to get better interactive
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently()
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (2 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-24 22:27   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 05/11] pkt-line: add packet_flush_gently() larsxschneider
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_fmt() would die in case of a write error even though for
some callers an error would be acceptable. Add packet_write_fmt_gently()
which writes a formatted pkt-line like packet_write_fmt() but does not
die in case of an error. The function is used in a subsequent patch.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 34 ++++++++++++++++++++++++++++++----
 pkt-line.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index e8adc0f..3b465fd 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -125,16 +125,42 @@ static void format_packet(struct strbuf *out, const char *fmt, va_list args)
 	packet_trace(out->buf + orig_len + 4, n - 4, 1);
 }
 
+static int packet_write_fmt_1(int fd, int gently,
+                              const char *fmt, va_list args)
+{
+	struct strbuf buf = STRBUF_INIT;
+	size_t count;
+
+	format_packet(&buf, fmt, args);
+	count = write_in_full(fd, buf.buf, buf.len);
+	if (count == buf.len)
+		return 0;
+
+	if (!gently) {
+		check_pipe(errno);
+		die_errno("packet write with format failed");
+	}
+	return error("packet write with format failed");
+}
+
 void packet_write_fmt(int fd, const char *fmt, ...)
 {
-	static struct strbuf buf = STRBUF_INIT;
 	va_list args;
 
-	strbuf_reset(&buf);
 	va_start(args, fmt);
-	format_packet(&buf, fmt, args);
+	packet_write_fmt_1(fd, 0, fmt, args);
+	va_end(args);
+}
+
+int packet_write_fmt_gently(int fd, const char *fmt, ...)
+{
+	int status;
+	va_list args;
+
+	va_start(args, fmt);
+	status = packet_write_fmt_1(fd, 1, fmt, args);
 	va_end(args);
-	write_or_die(fd, buf.buf, buf.len);
+	return status;
 }
 
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
diff --git a/pkt-line.h b/pkt-line.h
index 1902fb3..3caea77 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 05/11] pkt-line: add packet_flush_gently()
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (3 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently() larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-24 22:56   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 06/11] pkt-line: add packet_write_gently() larsxschneider
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_flush() would die in case of a write error even though for some
callers an error would be acceptable. Add packet_flush_gently() which
writes a pkt-line flush packet like packet_flush() but does not die in
case of an error. The function is used in a subsequent patch.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 8 ++++++++
 pkt-line.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 3b465fd..19f0271 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -91,6 +91,14 @@ void packet_flush(int fd)
 	write_or_die(fd, "0000", 4);
 }
 
+int packet_flush_gently(int fd)
+{
+	packet_trace("0000", 4, 1);
+	if (write_in_full(fd, "0000", 4) == 4)
+		return 0;
+	return error("flush packet write failed");
+}
+
 void packet_buf_flush(struct strbuf *buf)
 {
 	packet_trace("0000", 4, 1);
diff --git a/pkt-line.h b/pkt-line.h
index 3caea77..3fa0899 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -23,6 +23,7 @@ void packet_flush(int fd);
 void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int packet_flush_gently(int fd);
 int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 
 /*
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 06/11] pkt-line: add packet_write_gently()
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (4 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 05/11] pkt-line: add packet_flush_gently() larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-25 11:26   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

packet_write_fmt_gently() uses format_packet() which lets the caller
only send string data via "%s". That means it cannot be used for
arbitrary data that may contain NULs.

Add packet_write_gently() which writes arbitrary data and does not die
in case of an error. The function is used by other pkt-line functions in
a subsequent patch.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 19f0271..fc0ac12 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -171,6 +171,22 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
 	return status;
 }
 
+static int packet_write_gently(const int fd_out, const char *buf, size_t size)
+{
+	static char packet_write_buffer[LARGE_PACKET_MAX];
+
+	if (size > sizeof(packet_write_buffer) - 4) {
+		return error("packet write failed - data exceeds max packet size");
+	}
+	packet_trace(buf, size, 1);
+	size += 4;
+	set_packet_header(packet_write_buffer, size);
+	memcpy(packet_write_buffer + 4, buf, size - 4);
+	if (write_in_full(fd_out, packet_write_buffer, size) == size)
+		return 0;
+	return error("packet write failed");
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
 	va_list args;
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (5 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 06/11] pkt-line: add packet_write_gently() larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-25 13:46   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 08/11] convert: quote filter names in error messages larsxschneider
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

write_packetized_from_fd() and write_packetized_from_buf() write a
stream of packets. All content packets use the maximal packet size
except for the last one. After the last content packet a `flush` control
packet is written.

read_packetized_to_buf() reads arbitrary sized packets until it detects
a `flush` packet.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 pkt-line.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 pkt-line.h |  7 +++++++
 2 files changed, 75 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index fc0ac12..a0a8543 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -196,6 +196,47 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 	va_end(args);
 }
 
+int write_packetized_from_fd(int fd_in, int fd_out)
+{
+	static char buf[PKTLINE_DATA_MAXLEN];
+	int err = 0;
+	ssize_t bytes_to_write;
+
+	while (!err) {
+		bytes_to_write = xread(fd_in, buf, sizeof(buf));
+		if (bytes_to_write < 0)
+			return COPY_READ_ERROR;
+		if (bytes_to_write == 0)
+			break;
+		err = packet_write_gently(fd_out, buf, bytes_to_write);
+	}
+	if (!err)
+		err = packet_flush_gently(fd_out);
+	return err;
+}
+
+int write_packetized_from_buf(const char *src_in, size_t len, int fd_out)
+{
+	static char buf[PKTLINE_DATA_MAXLEN];
+	int err = 0;
+	size_t bytes_written = 0;
+	size_t bytes_to_write;
+
+	while (!err) {
+		if ((len - bytes_written) > sizeof(buf))
+			bytes_to_write = sizeof(buf);
+		else
+			bytes_to_write = len - bytes_written;
+		if (bytes_to_write == 0)
+			break;
+		err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write);
+		bytes_written += bytes_to_write;
+	}
+	if (!err)
+		err = packet_flush_gently(fd_out);
+	return err;
+}
+
 static int get_packet_data(int fd, char **src_buf, size_t *src_size,
 			   void *dst, unsigned size, int options)
 {
@@ -305,3 +346,30 @@ char *packet_read_line_buf(char **src, size_t *src_len, int *dst_len)
 {
 	return packet_read_line_generic(-1, src, src_len, dst_len);
 }
+
+ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out)
+{
+	int paket_len;
+	int options = PACKET_READ_GENTLE_ON_EOF;
+
+	size_t oldlen = sb_out->len;
+	size_t oldalloc = sb_out->alloc;
+
+	for (;;) {
+		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
+		paket_len = packet_read(fd_in, NULL, NULL,
+			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);
+		if (paket_len <= 0)
+			break;
+		sb_out->len += paket_len;
+	}
+
+	if (paket_len < 0) {
+		if (oldalloc == 0)
+			strbuf_release(sb_out);
+		else
+			strbuf_setlen(sb_out, oldlen);
+		return paket_len;
+	}
+	return sb_out->len - oldlen;
+}
diff --git a/pkt-line.h b/pkt-line.h
index 3fa0899..6df8449 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -25,6 +25,8 @@ void packet_buf_flush(struct strbuf *buf);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
 int packet_flush_gently(int fd);
 int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
+int write_packetized_from_fd(int fd_in, int fd_out);
+int write_packetized_from_buf(const char *src_in, size_t len, int fd_out);
 
 /*
  * Read a packetized line into the buffer, which must be at least size bytes
@@ -77,6 +79,11 @@ char *packet_read_line(int fd, int *size);
  */
 char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
 
+/*
+ * Reads a stream of variable sized packets until a flush packet is detected.
+ */
+ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out);
+
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
 extern char packet_buffer[LARGE_PACKET_MAX];
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 08/11] convert: quote filter names in error messages
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (6 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-25 14:03   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 09/11] convert: modernize tests larsxschneider
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard
to read in error messages. Quote them to improve the readability.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index 077f5e6..986c239 100644
--- a/convert.c
+++ b/convert.c
@@ -412,7 +412,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	child_process.out = out;
 
 	if (start_command(&child_process))
-		return error("cannot fork to run external filter %s", params->cmd);
+		return error("cannot fork to run external filter '%s'", params->cmd);
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
@@ -430,13 +430,13 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
-		error("cannot feed the input to external filter %s", params->cmd);
+		error("cannot feed the input to external filter '%s'", params->cmd);
 
 	sigchain_pop(SIGPIPE);
 
 	status = finish_command(&child_process);
 	if (status)
-		error("external filter %s failed %d", params->cmd, status);
+		error("external filter '%s' failed %d", params->cmd, status);
 
 	strbuf_release(&cmd);
 	return (write_err || status);
@@ -477,15 +477,15 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
+		error("read from external filter '%s' failed", cmd);
 		ret = 0;
 	}
 	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
+		error("external filter '%s' failed", cmd);
 		ret = 0;
 	}
 
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 09/11] convert: modernize tests
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (7 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 08/11] convert: quote filter names in error messages larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-25 14:43   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling larsxschneider
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Use `test_config` to set the config, check that files are empty with
`test_must_be_empty`, compare files with `test_cmp`, and remove spaces
after ">" and "<".

Please note that the "rot13" filter configured in "setup" keeps using
`git config` instead of `test_config` because subsequent tests might
depend on it.

Reviewed-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 t/t0021-conversion.sh | 58 +++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index e799e59..dc50938 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
 
 test_expect_success check '
 
-	cmp test.o test &&
-	cmp test.o test.t &&
+	test_cmp test.o test &&
+	test_cmp test.o test.t &&
 
 	# ident should be stripped in the repository
 	git diff --raw --exit-code :test :test.i &&
@@ -47,10 +47,10 @@ test_expect_success check '
 	embedded=$(sed -ne "$script" test.i) &&
 	test "z$id" = "z$embedded" &&
 
-	git cat-file blob :test.t > test.r &&
+	git cat-file blob :test.t >test.r &&
 
-	./rot13.sh < test.o > test.t &&
-	cmp test.r test.t
+	./rot13.sh <test.o >test.t &&
+	test_cmp test.r test.t
 '
 
 # If an expanded ident ever gets into the repository, we want to make sure that
@@ -130,7 +130,7 @@ test_expect_success 'filter shell-escaped filenames' '
 
 	# delete the files and check them out again, using a smudge filter
 	# that will count the args and echo the command-line back to us
-	git config filter.argc.smudge "sh ./argc.sh %f" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -141,7 +141,7 @@ test_expect_success 'filter shell-escaped filenames' '
 	test_cmp expect "$special" &&
 
 	# do the same thing, but with more args in the filter expression
-	git config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
+	test_config filter.argc.smudge "sh ./argc.sh %f --my-extra-arg" &&
 	rm "$normal" "$special" &&
 	git checkout -- "$normal" "$special" &&
 
@@ -154,9 +154,9 @@ test_expect_success 'filter shell-escaped filenames' '
 '
 
 test_expect_success 'required filter should filter data' '
-	git config filter.required.smudge ./rot13.sh &&
-	git config filter.required.clean ./rot13.sh &&
-	git config filter.required.required true &&
+	test_config filter.required.smudge ./rot13.sh &&
+	test_config filter.required.clean ./rot13.sh &&
+	test_config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
@@ -165,17 +165,17 @@ test_expect_success 'required filter should filter data' '
 
 	rm -f test.r &&
 	git checkout -- test.r &&
-	cmp test.o test.r &&
+	test_cmp test.o test.r &&
 
 	./rot13.sh <test.o >expected &&
 	git cat-file blob :test.r >actual &&
-	cmp expected actual
+	test_cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
-	git config filter.failsmudge.smudge false &&
-	git config filter.failsmudge.clean cat &&
-	git config filter.failsmudge.required true &&
+	test_config filter.failsmudge.smudge false &&
+	test_config filter.failsmudge.clean cat &&
+	test_config filter.failsmudge.required true &&
 
 	echo "*.fs filter=failsmudge" >.gitattributes &&
 
@@ -186,9 +186,9 @@ test_expect_success 'required filter smudge failure' '
 '
 
 test_expect_success 'required filter clean failure' '
-	git config filter.failclean.smudge cat &&
-	git config filter.failclean.clean false &&
-	git config filter.failclean.required true &&
+	test_config filter.failclean.smudge cat &&
+	test_config filter.failclean.clean false &&
+	test_config filter.failclean.required true &&
 
 	echo "*.fc filter=failclean" >.gitattributes &&
 
@@ -197,8 +197,8 @@ test_expect_success 'required filter clean failure' '
 '
 
 test_expect_success 'filtering large input to small output should use little memory' '
-	git config filter.devnull.clean "cat >/dev/null" &&
-	git config filter.devnull.required true &&
+	test_config filter.devnull.clean "cat >/dev/null" &&
+	test_config filter.devnull.required true &&
 	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
 	echo "30MB filter=devnull" >.gitattributes &&
 	GIT_MMAP_LIMIT=1m GIT_ALLOC_LIMIT=1m git add 30MB
@@ -207,7 +207,7 @@ test_expect_success 'filtering large input to small output should use little mem
 test_expect_success 'filter that does not read is fine' '
 	test-genrandom foo $((128 * 1024 + 1)) >big &&
 	echo "big filter=epipe" >.gitattributes &&
-	git config filter.epipe.clean "echo xyzzy" &&
+	test_config filter.epipe.clean "echo xyzzy" &&
 	git add big &&
 	git cat-file blob :big >actual &&
 	echo xyzzy >expect &&
@@ -215,20 +215,20 @@ test_expect_success 'filter that does not read is fine' '
 '
 
 test_expect_success EXPENSIVE 'filter large file' '
-	git config filter.largefile.smudge cat &&
-	git config filter.largefile.clean cat &&
+	test_config filter.largefile.smudge cat &&
+	test_config filter.largefile.clean cat &&
 	for i in $(test_seq 1 2048); do printf "%1048576d" 1; done >2GB &&
 	echo "2GB filter=largefile" >.gitattributes &&
 	git add 2GB 2>err &&
-	! test -s err &&
+	test_must_be_empty err &&
 	rm -f 2GB &&
 	git checkout -- 2GB 2>err &&
-	! test -s err
+	test_must_be_empty err
 '
 
 test_expect_success "filter: clean empty file" '
-	git config filter.in-repo-header.clean  "echo cleaned && cat" &&
-	git config filter.in-repo-header.smudge "sed 1d" &&
+	test_config filter.in-repo-header.clean  "echo cleaned && cat" &&
+	test_config filter.in-repo-header.smudge "sed 1d" &&
 
 	echo "empty-in-worktree    filter=in-repo-header" >>.gitattributes &&
 	>empty-in-worktree &&
@@ -240,8 +240,8 @@ test_expect_success "filter: clean empty file" '
 '
 
 test_expect_success "filter: smudge empty file" '
-	git config filter.empty-in-repo.clean "cat >/dev/null" &&
-	git config filter.empty-in-repo.smudge "echo smudged && cat" &&
+	test_config filter.empty-in-repo.clean "cat >/dev/null" &&
+	test_config filter.empty-in-repo.smudge "echo smudged && cat" &&
 
 	echo "empty-in-repo filter=empty-in-repo" >>.gitattributes &&
 	echo dead data walking >empty-in-repo &&
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (8 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 09/11] convert: modernize tests larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-25 14:47   ` Jakub Narębski
  2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
  2016-09-28 21:49 ` [PATCH v8 00/11] Git filter protocol Junio C Hamano
  11 siblings, 1 reply; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

apply_filter() returns a boolean that tells the caller if it
"did convert or did not convert". The variable `ret` was used throughout
the function to track errors whereas `1` denoted success and `0`
failure. This is unusual for the Git source where `0` denotes success.

Rename the variable and flip its value to make the function easier
readable for Git developers.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/convert.c b/convert.c
index 986c239..597f561 100644
--- a/convert.c
+++ b/convert.c
@@ -451,7 +451,7 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	 *
 	 * (child --> cmd) --> us
 	 */
-	int ret = 1;
+	int err = 0;
 	struct strbuf nbuf = STRBUF_INIT;
 	struct async async;
 	struct filter_params params;
@@ -477,23 +477,20 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 		return 0;	/* error was already reported */
 
 	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter '%s' failed", cmd);
-		ret = 0;
+		err = error("read from external filter '%s' failed", cmd);
 	}
 	if (close(async.out)) {
-		error("read from external filter '%s' failed", cmd);
-		ret = 0;
+		err = error("read from external filter '%s' failed", cmd);
 	}
 	if (finish_async(&async)) {
-		error("external filter '%s' failed", cmd);
-		ret = 0;
+		err = error("external filter '%s' failed", cmd);
 	}
 
-	if (ret) {
+	if (!err) {
 		strbuf_swap(dst, &nbuf);
 	}
 	strbuf_release(&nbuf);
-	return ret;
+	return !err;
 }
 
 static struct convert_driver {
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (9 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling larsxschneider
@ 2016-09-20 19:02 ` larsxschneider
  2016-09-26 22:41   ` Jakub Narębski
                     ` (2 more replies)
  2016-09-28 21:49 ` [PATCH v8 00/11] Git filter protocol Junio C Hamano
  11 siblings, 3 replies; 71+ messages in thread
From: larsxschneider @ 2016-09-20 19:02 UTC (permalink / raw)
  To: git
  Cc: peff, gitster, sbeller, jnareb, mlbright, tboegi, ramsay,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter.<driver>.process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter.<driver>.process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error) is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response.
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.

Helped-by: Martin-Louis Bright <mlbright@gmail.com>
Reviewed-by: Jakub Narebski <jnareb@gmail.com>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt        | 156 +++++++++++++-
 contrib/long-running-filter/example.pl | 123 +++++++++++
 convert.c                              | 348 ++++++++++++++++++++++++++++---
 pkt-line.h                             |   1 +
 t/t0021-conversion.sh                  | 365 ++++++++++++++++++++++++++++++++-
 t/t0021/rot13-filter.pl                | 191 +++++++++++++++++
 6 files changed, 1153 insertions(+), 31 deletions(-)
 create mode 100755 contrib/long-running-filter/example.pl
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 7aff940..946dcad 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -373,6 +379,154 @@ not exist, or may have different contents. So, smudge and clean commands
 should not try to access the file on disk, but only act as filters on the
 content provided to them on standard input.
 
+Long Running Filter Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the filter command (a string value) is defined via
+`filter.<driver>.process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using a packet format (pkt-line,
+see technical/protocol-common.txt) based protocol over standard
+input and standard output as follows. All packets are considered
+text and therefore are terminated by an LF. Exceptions are the
+"*CONTENT" packets and the flush packet.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate how the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of supported capabilities
+and a flush packet. Git expects to read a list of desired capabilities,
+which must be a subset of the supported capabilities list, and a flush
+packet as response:
+------------------------
+packet:          git> git-filter-client
+packet:          git> version=2
+packet:          git> version=42
+packet:          git> 0000
+packet:          git< git-filter-server
+packet:          git< version=2
+packet:          git> clean=true
+packet:          git> smudge=true
+packet:          git> not-yet-invented=true
+packet:          git> 0000
+packet:          git< clean=true
+packet:          git< smudge=true
+packet:          git< 0000
+------------------------
+Supported filter capabilities in version 2 are "clean" and
+"smudge".
+
+Afterwards Git sends a list of "key=value" pairs terminated with
+a flush packet. The list will contain at least the filter command
+(based on the supported capabilities) and the pathname of the file
+to filter relative to the repository root. Right after these packets
+Git sends the content split in zero or more pkt-line packets and a
+flush packet to terminate content.
+------------------------
+packet:          git> command=smudge
+packet:          git> pathname=path/testfile.dat
+packet:          git> 0000
+packet:          git> CONTENT
+packet:          git> 0000
+------------------------
+
+The filter is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the filter does not experience
+problems then the list must contain a "success" status. Right after
+these packets the filter is expected to send the content in zero
+or more pkt-line packets and a flush packet at the end. Finally, a
+second list of "key=value" pairs terminated with a flush packet
+is expected. The filter can change the status in the second list.
+------------------------
+packet:          git< status=success
+packet:          git< 0000
+packet:          git< SMUDGED_CONTENT
+packet:          git< 0000
+packet:          git< 0000  # empty list!
+------------------------
+
+If the result content is empty then the filter is expected to respond
+with a success status and an empty list.
+------------------------
+packet:          git< status=success
+packet:          git< 0000
+packet:          git< 0000  # empty content!
+packet:          git< 0000  # empty list!
+------------------------
+
+In case the filter cannot or does not want to process the content,
+it is expected to respond with an "error" status. Depending on the
+`filter.<driver>.required` flag Git will interpret that as error
+but it will not stop or restart the filter process.
+------------------------
+packet:          git< status=error
+packet:          git< 0000
+------------------------
+
+If the filter experiences an error during processing, then it can
+send the status "error" after the content was (partially or
+completely) sent. Depending on the `filter.<driver>.required` flag
+Git will interpret that as error but it will not stop or restart the
+filter process.
+------------------------
+packet:          git< status=success
+packet:          git< 0000
+packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
+packet:          git< 0000
+packet:          git< status=error
+packet:          git< 0000
+------------------------
+
+If the filter dies during the communication or does not adhere to
+the protocol then Git will stop the filter process and restart it
+with the next file that needs to be processed. Depending on the
+`filter.<driver>.required` flag Git will interpret that as error.
+
+The error handling for all cases above mimic the behavior of
+the `filter.<driver>.clean` / `filter.<driver>.smudge` error
+handling.
+
+In case the filter cannot or does not want to process the content
+as well as any future content for the lifetime of the Git process,
+it is expected to respond with an "abort" status. Depending on
+the `filter.<driver>.required` flag Git will interpret that as error
+for the content as well as any future content for the lifetime of the
+Git process but it will not stop or restart the filter process.
+------------------------
+packet:          git< status=abort
+packet:          git< 0000
+------------------------
+
+After the filter has processed a blob it is expected to wait for
+the next "key=value" list containing a command. Git will close
+the command pipe on exit. The filter is expected to detect EOF
+and exit gracefully on its own.
+
+A long running filter demo implementation can be found in
+`contrib/long-running-filter/example.pl` located in the Git
+core repository. If you develop your own long running filter
+process then the `GIT_TRACE_PACKET` environment variables can be
+very helpful for debugging (see linkgit:git[1]).
+
+If a `filter.<driver>.process` command is configured then it
+always takes precedence over a configured `filter.<driver>.clean`
+or `filter.<driver>.smudge` command.
+
+Please note that you cannot use an existing `filter.<driver>.clean`
+or `filter.<driver>.smudge` command with `filter.<driver>.process`
+because the former two use a different inter process communication
+protocol than the latter one.
+
+
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
new file mode 100755
index 0000000..c13a631
--- /dev/null
+++ b/contrib/long-running-filter/example.pl
@@ -0,0 +1,123 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+
+sub packet_bin_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+
+        # EOF - Git stopped talking to us!
+        exit();
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet ($content_size expected; $bytes_read read)";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_txt_read {
+    my ( $res, $buf ) = packet_bin_read();
+    unless ( $buf =~ /\n$/ ) {
+        die "A non-binary line SHOULD BE terminated by an LF.";
+    }
+    return ( $res, substr( $buf, 0, -1 ) );
+}
+
+sub packet_bin_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_txt_write {
+    packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
+
+packet_txt_write("git-filter-server");
+packet_txt_write("version=2");
+
+( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
+( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";
+
+packet_txt_write("clean=true");
+packet_txt_write("smudge=true");
+packet_flush();
+
+while (1) {
+    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
+    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
+
+    packet_bin_read();
+
+    my $input = "";
+    {
+        binmode(STDIN);
+        my $buffer;
+        my $done = 0;
+        while ( !$done ) {
+            ( $done, $buffer ) = packet_bin_read();
+            $input .= $buffer;
+        }
+    }
+
+    my $output;
+    if ( $command eq "clean" ) {
+        ### Perform clean here ###
+        $output = $input;
+    }
+    elsif ( $command eq "smudge" ) {
+        ### Perform smudge here ###
+        $output = $input;
+    }
+    else {
+        die "bad command '$command'";
+    }
+
+    packet_txt_write("status=success");
+    packet_flush();
+    while ( length($output) > 0 ) {
+        my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+        packet_bin_write($packet);
+        if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+            $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+        }
+        else {
+            $output = "";
+        }
+    }
+    packet_flush();    # flush content!
+    packet_flush();    # empty list!
+}
diff --git a/convert.c b/convert.c
index 597f561..bd66257 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "pkt-line.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -442,7 +443,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len, int fd,
+static int apply_single_file_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
 	/*
@@ -456,12 +457,6 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	struct async async;
 	struct filter_params params;
 
-	if (!cmd || !*cmd)
-		return 0;
-
-	if (!dst)
-		return 1;
-
 	memset(&async, 0, sizeof(async));
 	async.proc = filter_buffer_or_fd;
 	async.data = &params;
@@ -493,14 +488,317 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	return !err;
 }
 
+#define CAP_CLEAN    (1u<<0)
+#define CAP_SMUDGE   (1u<<1)
+
+struct cmd2process {
+	struct hashmap_entry ent; /* must be the first member! */
+	unsigned int supported_capabilities;
+	const char *cmd;
+	struct child_process process;
+};
+
+static int cmd_process_map_initialized;
+static struct hashmap cmd_process_map;
+
+static int cmd2process_cmp(const struct cmd2process *e1,
+                           const struct cmd2process *e2,
+                           const void *unused)
+{
+	return strcmp(e1->cmd, e2->cmd);
+}
+
+static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
+{
+	struct cmd2process key;
+	hashmap_entry_init(&key, strhash(cmd));
+	key.cmd = cmd;
+	return hashmap_get(hashmap, &key, NULL);
+}
+
+static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
+{
+	if (!entry)
+		return;
+	sigchain_push(SIGPIPE, SIG_IGN);
+	/*
+	 * We kill the filter most likely because an error happened already.
+	 * That's why we are not interested in any error code here.
+	 */
+	close(entry->process.in);
+	close(entry->process.out);
+	sigchain_pop(SIGPIPE);
+	finish_command(&entry->process);
+	hashmap_remove(hashmap, entry, NULL);
+	free(entry);
+}
+
+static int packet_write_list(int fd, const char *line, ...)
+{
+	va_list args;
+	int err;
+	va_start(args, line);
+	for (;;) {
+		if (!line)
+			break;
+		if (strlen(line) > PKTLINE_DATA_MAXLEN)
+			return -1;
+		err = packet_write_fmt_gently(fd, "%s\n", line);
+		if (err)
+			return err;
+		line = va_arg(args, const char*);
+	}
+	va_end(args);
+	return packet_flush_gently(fd);
+}
+
+static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, const char *cmd)
+{
+	int err;
+	struct cmd2process *entry;
+	struct child_process *process;
+	const char *argv[] = { cmd, NULL };
+	struct string_list cap_list = STRING_LIST_INIT_NODUP;
+	char *cap_buf;
+	const char *cap_name;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(entry, strhash(cmd));
+	entry->cmd = cmd;
+	entry->supported_capabilities = 0;
+	process = &entry->process;
+
+	child_process_init(process);
+	process->argv = argv;
+	process->use_shell = 1;
+	process->in = -1;
+	process->out = -1;
+
+	if (start_command(process)) {
+		error("cannot fork to run external filter '%s'", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_write_list(process->in, "git-filter-client", "version=2", NULL);
+	if (err)
+		goto done;
+
+	err = strcmp(packet_read_line(process->out, NULL), "git-filter-server");
+	if (err) {
+		error("external filter '%s' does not support long running filter protocol", cmd);
+		goto done;
+	}
+	err = strcmp(packet_read_line(process->out, NULL), "version=2");
+	if (err)
+		goto done;
+
+	err = packet_write_list(process->in, "clean=true", "smudge=true", NULL);
+
+	for (;;) {
+		cap_buf = packet_read_line(process->out, NULL);
+		if (!cap_buf)
+			break;
+		string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+		if (cap_list.nr != 2 || strcmp(cap_list.items[1].string, "true"))
+			continue;
+
+		cap_name = cap_list.items[0].string;
+		if (!strcmp(cap_name, "clean")) {
+			entry->supported_capabilities |= CAP_CLEAN;
+		} else if (!strcmp(cap_name, "smudge")) {
+			entry->supported_capabilities |= CAP_SMUDGE;
+		} else {
+			warning(
+				"external filter '%s' requested unsupported filter capability '%s'",
+				cmd, cap_name
+			);
+		}
+
+		string_list_clear(&cap_list, 0);
+	}
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		error("initialization for external filter '%s' failed", cmd);
+		kill_multi_file_filter(hashmap, entry);
+		return NULL;
+	}
+
+	hashmap_add(hashmap, entry);
+	return entry;
+}
+
+static void read_multi_file_filter_values(int fd, struct strbuf *status) {
+	struct strbuf **pair;
+	char *line;
+	for (;;) {
+		line = packet_read_line(fd, NULL);
+		if (!line)
+			break;
+		pair = strbuf_split_str(line, '=', 2);
+		if (pair[0] && pair[0]->len && pair[1]) {
+			if (!strcmp(pair[0]->buf, "status=")) {
+				strbuf_reset(status);
+				strbuf_addbuf(status, pair[1]);
+			}
+		}
+	}
+}
+
+static int apply_multi_file_filter(const char *path, const char *src, size_t len,
+                                   int fd, struct strbuf *dst, const char *cmd,
+                                   const unsigned int wanted_capability)
+{
+	int err;
+	struct cmd2process *entry;
+	struct child_process *process;
+	struct stat file_stat;
+	struct strbuf nbuf = STRBUF_INIT;
+	struct strbuf filter_status = STRBUF_INIT;
+	char *filter_type;
+
+	if (!cmd_process_map_initialized) {
+		cmd_process_map_initialized = 1;
+		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = find_multi_file_filter_entry(&cmd_process_map, cmd);
+	}
+
+	fflush(NULL);
+
+	if (!entry) {
+		entry = start_multi_file_filter(&cmd_process_map, cmd);
+		if (!entry)
+			return 0;
+	}
+	process = &entry->process;
+
+	if (!(wanted_capability & entry->supported_capabilities))
+		return 0;
+
+	if (CAP_CLEAN & wanted_capability)
+		filter_type = "clean";
+	else if (CAP_SMUDGE & wanted_capability)
+		filter_type = "smudge";
+	else
+		die("unexpected filter type");
+
+	if (fd >= 0 && !src) {
+		if (fstat(fd, &file_stat) == -1)
+			return 0;
+		len = xsize_t(file_stat.st_size);
+	}
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = strlen(filter_type) > PKTLINE_DATA_MAXLEN;
+	if (err)
+		goto done;
+
+	err = packet_write_fmt_gently(process->in, "command=%s\n", filter_type);
+	if (err)
+		goto done;
+
+	err = strlen(path) > PKTLINE_DATA_MAXLEN;
+	if (err)
+		goto done;
+
+	err = packet_write_fmt_gently(process->in, "pathname=%s\n", path);
+	if (err)
+		goto done;
+
+	err = packet_flush_gently(process->in);
+	if (err)
+		goto done;
+
+	if (fd >= 0)
+		err = write_packetized_from_fd(fd, process->in);
+	else
+		err = write_packetized_from_buf(src, len, process->in);
+	if (err)
+		goto done;
+
+	read_multi_file_filter_values(process->out, &filter_status);
+	err = strcmp(filter_status.buf, "success");
+	if (err)
+		goto done;
+
+	err = read_packetized_to_buf(process->out, &nbuf) < 0;
+	if (err)
+		goto done;
+
+	read_multi_file_filter_values(process->out, &filter_status);
+	err = strcmp(filter_status.buf, "success");
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		if (!strcmp(filter_status.buf, "error")) {
+			/* The filter signaled a problem with the file. */
+		} else if (!strcmp(filter_status.buf, "abort")) {
+			/*
+			 * The filter signaled a permanent problem. Don't try to filter
+			 * files with the same command for the lifetime of the current
+			 * Git process.
+			 */
+			 entry->supported_capabilities &= ~wanted_capability;
+		} else {
+			/*
+			 * Something went wrong with the protocol filter.
+			 * Force shutdown and restart if another blob requires filtering!
+			 */
+			error("external filter '%s' failed", cmd);
+			kill_multi_file_filter(&cmd_process_map, entry);
+		}
+	} else {
+		strbuf_swap(dst, &nbuf);
+	}
+	strbuf_release(&nbuf);
+	return !err;
+}
+
 static struct convert_driver {
 	const char *name;
 	struct convert_driver *next;
 	const char *smudge;
 	const char *clean;
+	const char *process;
 	int required;
 } *user_convert, **user_convert_tail;
 
+static int apply_filter(const char *path, const char *src, size_t len,
+                        int fd, struct strbuf *dst, struct convert_driver *drv,
+                        const unsigned int wanted_capability)
+{
+	const char *cmd = NULL;
+
+	if (!drv)
+		return 0;
+
+	if (!dst)
+		return 1;
+
+	if (!drv->process && (CAP_CLEAN & wanted_capability) && drv->clean)
+		cmd = drv->clean;
+	else if (!drv->process && (CAP_SMUDGE & wanted_capability) && drv->smudge)
+		cmd = drv->smudge;
+
+	if (cmd && *cmd)
+		return apply_single_file_filter(path, src, len, fd, dst, cmd);
+	else if (drv->process && *drv->process)
+		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
+
+	return 0;
+}
+
 static int read_convert_config(const char *var, const char *value, void *cb)
 {
 	const char *key, *name;
@@ -538,6 +836,9 @@ static int read_convert_config(const char *var, const char *value, void *cb)
 	if (!strcmp("clean", key))
 		return git_config_string(&drv->clean, var, value);
 
+	if (!strcmp("process", key))
+		return git_config_string(&drv->process, var, value);
+
 	if (!strcmp("required", key)) {
 		drv->required = git_config_bool(var, value);
 		return 0;
@@ -839,7 +1140,7 @@ int would_convert_to_git_filter_fd(const char *path)
 	if (!ca.drv->required)
 		return 0;
 
-	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
 }
 
 const char *get_convert_attr_ascii(const char *path)
@@ -872,18 +1173,12 @@ int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
 	int ret = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->clean;
-		required = ca.drv->required;
-	}
 
-	ret |= apply_filter(path, src, len, -1, dst, filter);
-	if (!ret && required)
+	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN);
+	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	if (ret && dst) {
@@ -905,9 +1200,9 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
-	assert(ca.drv->clean);
+	assert(ca.drv->clean || ca.drv->process);
 
-	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
 	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
@@ -919,15 +1214,9 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 					    int normalizing)
 {
 	int ret = 0, ret_filter = 0;
-	const char *filter = NULL;
-	int required = 0;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
-	if (ca.drv) {
-		filter = ca.drv->smudge;
-		required = ca.drv->required;
-	}
 
 	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
 	if (ret) {
@@ -936,9 +1225,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 	}
 	/*
 	 * CRLF conversion can be skipped if normalizing, unless there
-	 * is a smudge filter.  The filter might expect CRLFs.
+	 * is a smudge or process filter (even if the process filter doesn't
+	 * support smudge).  The filters might expect CRLFs.
 	 */
-	if (filter || !normalizing) {
+	if ((ca.drv && (ca.drv->smudge || ca.drv->process)) || !normalizing) {
 		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
 		if (ret) {
 			src = dst->buf;
@@ -946,8 +1236,8 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, -1, dst, filter);
-	if (!ret_filter && required)
+	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, CAP_SMUDGE);
+	if (!ret_filter && ca.drv && ca.drv->required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
 	return ret | ret_filter;
@@ -1399,7 +1689,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	struct stream_filter *filter = NULL;
 
 	convert_attrs(&ca, path);
-	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
+	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/pkt-line.h b/pkt-line.h
index 6df8449..3d873f3 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -86,6 +86,7 @@ ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out);
 
 #define DEFAULT_PACKET_MAX 1000
 #define LARGE_PACKET_MAX 65520
+#define PKTLINE_DATA_MAXLEN (LARGE_PACKET_MAX - 4)
 extern char packet_buffer[LARGE_PACKET_MAX];
 
 #endif
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index dc50938..210c4f6 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -31,7 +31,10 @@ test_expect_success setup '
 	cat test >test.i &&
 	git add test test.t test.i &&
 	rm -f test test.t test.i &&
-	git checkout -- test test.t test.i
+	git checkout -- test test.t test.i &&
+
+	echo "content-test2" >test2.o &&
+	echo "content-test3 - subdir" >"test3 - subdir.o"
 '
 
 script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
@@ -279,4 +282,364 @@ test_expect_success 'diff does not reuse worktree files that need cleaning' '
 	test_line_count = 0 count
 '
 
+check_filter () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_count_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	cat >expected.log &&
+	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
+		sed "s/^\([0-9]\) IN: clean/x IN: clean/" >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_ignore_clean () {
+	rm -f rot13-filter.log actual.log &&
+	"$@" &&
+	cat >expected.log &&
+	grep -v "IN: clean" rot13-filter.log >actual.log &&
+	test_cmp expected.log actual.log
+}
+
+check_filter_no_call () {
+	rm -f rot13-filter.log &&
+	"$@" 2> git_stderr.log &&
+	test_must_be_empty git_stderr.log &&
+	test_must_be_empty rot13-filter.log
+}
+
+check_rot13 () {
+	test_cmp "$1" "$2" &&
+	./../rot13.sh <"$1" >expected &&
+	git cat-file blob :"$2" >actual &&
+	test_cmp expected actual
+}
+
+test_expect_success PERL 'required process filter should filter data' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		mkdir testsubdir &&
+		cp "../test3 - subdir.o" "testsubdir/test3 - subdir.r" &&
+		>test4-empty.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
+					1 IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
+					1 IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
+					1 IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
+					1 START
+					1 STOP
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF &&
+					x IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
+					x IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
+					x IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
+					x IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
+					1 START
+					1 STOP
+					1 wrote filter header
+				EOF
+
+		rm -f test?.r "testsubdir/test3 - subdir.r" &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					START
+					wrote filter header
+					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
+					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
+					STOP
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout empty \
+				<<-\EOF &&
+					START
+					wrote filter header
+					STOP
+				EOF
+
+		check_filter_ignore_clean \
+			git checkout master \
+				<<-\EOF &&
+					START
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
+					IN: smudge test4-empty.r 0 [OK] -- OUT: 0  [OK]
+					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
+					STOP
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+		check_rot13 "../test3 - subdir.o" "testsubdir/test3 - subdir.r"
+	)
+'
+
+test_expect_success PERL 'required process filter should clean only and take precedence' '
+	test_config_global filter.protocol.clean ./../rot13.sh &&
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+		git add . &&
+		git commit . -m "test commit" &&
+		git branch empty &&
+
+		cp ../test.o test.r &&
+
+		check_filter \
+			git add . \
+				<<-\EOF &&
+					1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
+					1 START
+					1 STOP
+					1 wrote filter header
+				EOF
+
+		check_filter_count_clean \
+			git commit . -m "test commit" \
+				<<-\EOF
+					x IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
+					1 START
+					1 STOP
+					1 wrote filter header
+				EOF
+	)
+'
+
+generate_test_data () {
+	LEN=$1
+	NAME=$2
+	test-genrandom end $LEN |
+		perl -pe "s/./chr((ord($&) % 26) + 97)/sge" >../$NAME.file &&
+	cp ../$NAME.file . &&
+	./../rot13.sh <../$NAME.file >../$NAME.file.rot13
+}
+
+test_expect_success PERL 'required process filter should process multiple packets' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		# Generate data that requires 3 packets
+		PKTLINE_DATA_MAXLEN=65516 &&
+
+		generate_test_data $(($PKTLINE_DATA_MAXLEN        )) 1pkt_1__ &&
+		generate_test_data $(($PKTLINE_DATA_MAXLEN     + 1)) 2pkt_1+1 &&
+		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2 - 1)) 2pkt_2-1 &&
+		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2    )) 2pkt_2__ &&
+		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2 + 1)) 3pkt_2+1 &&
+
+		echo "*.file filter=protocol" >.gitattributes &&
+		check_filter \
+			git add *.file .gitattributes \
+				<<-\EOF &&
+					1 IN: clean 1pkt_1__.file 65516 [OK] -- OUT: 65516 . [OK]
+					1 IN: clean 2pkt_1+1.file 65517 [OK] -- OUT: 65517 .. [OK]
+					1 IN: clean 2pkt_2-1.file 131031 [OK] -- OUT: 131031 .. [OK]
+					1 IN: clean 2pkt_2__.file 131032 [OK] -- OUT: 131032 .. [OK]
+					1 IN: clean 3pkt_2+1.file 131033 [OK] -- OUT: 131033 ... [OK]
+					1 START
+					1 STOP
+					1 wrote filter header
+				EOF
+		git commit . -m "test commit" &&
+
+		rm -f *.file &&
+		git checkout -- *.file &&
+
+		for f in *.file
+		do
+			git cat-file blob :$f >actual &&
+			test_cmp ../$f.rot13 actual
+		done
+	)
+'
+
+test_expect_success PERL 'required process filter should with clean error should fail' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		echo "this is going to fail" >clean-write-fail.r &&
+		echo "content-test3-subdir" >test3.r &&
+
+		# Note: There are three clean paths in convert.c we just test one here.
+		test_must_fail git add .
+	)
+'
+
+test_expect_success PERL 'process filter should restart after unexpected write failure' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "this is going to fail" >smudge-write-fail.o &&
+		cat smudge-write-fail.o >smudge-write-fail.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					START
+					wrote filter header
+					IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [WRITE FAIL]
+					START
+					wrote filter header
+					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
+					STOP
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+
+		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
+		./../rot13.sh <smudge-write-fail.o >expected &&
+		git cat-file blob :smudge-write-fail.r >actual &&
+		test_cmp expected actual							  # Clean worked!
+	)
+'
+
+test_expect_success PERL 'process filter should not restart in case of an error' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "this will cause an error" >error.o &&
+		cp error.o error.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					START
+					wrote filter header
+					IN: smudge error.r 25 [OK] -- OUT: 0 [ERROR]
+					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
+					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
+					STOP
+				EOF
+
+		check_rot13 ../test.o test.r &&
+		check_rot13 ../test2.o test2.r &&
+		test_cmp error.o error.r
+	)
+'
+
+test_expect_success PERL 'process filter should be able to signal an error for all future files' '
+	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		cp ../test2.o test2.r &&
+		echo "error this blob and all future blobs" >abort.o &&
+		cp abort.o abort.r &&
+		git add . &&
+		git commit . -m "test commit" &&
+		rm -f *.r &&
+
+		check_filter_ignore_clean \
+			git checkout . \
+				<<-\EOF &&
+					START
+					wrote filter header
+					IN: smudge abort.r 37 [OK] -- OUT: 0 [ABORT]
+					STOP
+				EOF
+
+		test_cmp ../test.o test.r &&
+		test_cmp ../test2.o test2.r &&
+		test_cmp abort.o abort.r
+	)
+'
+
+test_expect_success PERL 'invalid process filter must fail (and not hang!)' '
+	test_config_global filter.protocol.process cat &&
+	test_config_global filter.protocol.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=protocol" >.gitattributes &&
+
+		cp ../test.o test.r &&
+		test_must_fail git add . 2> git_stderr.log &&
+		grep "not support long running filter protocol" git_stderr.log
+	)
+'
+
 test_done
diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
new file mode 100755
index 0000000..8958f71
--- /dev/null
+++ b/t/t0021/rot13-filter.pl
@@ -0,0 +1,191 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git filter protocol version 2
+# See Documentation/gitattributes.txt, section "Filter Protocol"
+#
+# The script takes the list of supported protocol capabilities as
+# arguments ("clean", "smudge", etc).
+#
+# This implementation supports special test cases:
+# (1) If data with the pathname "clean-write-fail.r" is processed with
+#     a "clean" operation then the write operation will die.
+# (2) If data with the pathname "smudge-write-fail.r" is processed with
+#     a "smudge" operation then the write operation will die.
+# (3) If data with the pathname "error.r" is processed with any
+#     operation then the filter signals that it cannot or does not want
+#     to process the file.
+# (4) If data with the pathname "abort.r" is processed with any
+#     operation then the filter signals that it cannot or does not want
+#     to process the file and any file after that is processed with the
+#     same command.
+#
+
+use strict;
+use warnings;
+
+my $MAX_PACKET_CONTENT_SIZE = 65516;
+my @capabilities            = @ARGV;
+
+open my $debug, ">>", "rot13-filter.log";
+
+sub rot13 {
+    my ($str) = @_;
+    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
+    return $str;
+}
+
+sub packet_bin_read {
+    my $buffer;
+    my $bytes_read = read STDIN, $buffer, 4;
+    if ( $bytes_read == 0 ) {
+
+        # EOF - Git stopped talking to us!
+        print $debug "STOP\n";
+        exit();
+    }
+    elsif ( $bytes_read != 4 ) {
+        die "invalid packet size '$bytes_read' field";
+    }
+    my $pkt_size = hex($buffer);
+    if ( $pkt_size == 0 ) {
+        return ( 1, "" );
+    }
+    elsif ( $pkt_size > 4 ) {
+        my $content_size = $pkt_size - 4;
+        $bytes_read = read STDIN, $buffer, $content_size;
+        if ( $bytes_read != $content_size ) {
+            die "invalid packet ($content_size expected; $bytes_read read)";
+        }
+        return ( 0, $buffer );
+    }
+    else {
+        die "invalid packet size";
+    }
+}
+
+sub packet_txt_read {
+    my ( $res, $buf ) = packet_bin_read();
+    unless ( $buf =~ /\n$/ ) {
+        die "A non-binary line SHOULD BE terminated by an LF.";
+    }
+    return ( $res, substr( $buf, 0, -1 ) );
+}
+
+sub packet_bin_write {
+    my ($packet) = @_;
+    print STDOUT sprintf( "%04x", length($packet) + 4 );
+    print STDOUT $packet;
+    STDOUT->flush();
+}
+
+sub packet_txt_write {
+    packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+    print STDOUT sprintf( "%04x", 0 );
+    STDOUT->flush();
+}
+
+print $debug "START\n";
+$debug->flush();
+
+( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
+
+packet_txt_write("git-filter-server");
+packet_txt_write("version=2");
+
+( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
+( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";
+
+foreach (@capabilities) {
+    packet_txt_write( $_ . "=true" );
+}
+packet_flush();
+print $debug "wrote filter header\n";
+$debug->flush();
+
+while (1) {
+    my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+    print $debug "IN: $command";
+    $debug->flush();
+
+    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
+    print $debug " $pathname";
+    $debug->flush();
+
+    # Flush
+    packet_bin_read();
+
+    my $input = "";
+    {
+        binmode(STDIN);
+        my $buffer;
+        my $done = 0;
+        while ( !$done ) {
+            ( $done, $buffer ) = packet_bin_read();
+            $input .= $buffer;
+        }
+        print $debug " " . length($input) . " [OK] -- ";
+        $debug->flush();
+    }
+
+    my $output;
+    if ( $pathname eq "error.r" or $pathname eq "abort.r" ) {
+        $output = "";
+    }
+    elsif ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
+        $output = rot13($input);
+    }
+    else {
+        die "bad command '$command'";
+    }
+
+    print $debug "OUT: " . length($output) . " ";
+    $debug->flush();
+
+    if ( $pathname eq "error.r" ) {
+        print $debug "[ERROR]\n";
+        $debug->flush();
+        packet_txt_write("status=error");
+        packet_flush();
+    }
+    elsif ( $pathname eq "abort.r" ) {
+        print $debug "[ABORT]\n";
+        $debug->flush();
+        packet_txt_write("status=abort");
+        packet_flush();
+    }
+    else {
+        packet_txt_write("status=success");
+        packet_flush();
+
+        if ( $pathname eq "${command}-write-fail.r" ) {
+            print $debug "[WRITE FAIL]\n";
+            $debug->flush();
+            die "${command} write error";
+        }
+
+        while ( length($output) > 0 ) {
+            my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
+            packet_bin_write($packet);
+            print $debug ".";
+            if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
+                $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
+            }
+            else {
+                $output = "";
+            }
+        }
+        packet_flush();
+        print $debug " [OK]\n";
+        $debug->flush();
+        packet_flush();
+    }
+}
-- 
2.10.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt()
  2016-09-20 19:02 ` [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
@ 2016-09-24 21:14   ` Jakub Narębski
  2016-09-26 18:49     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-24 21:14 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

Hello Lars,

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:

> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_write() should be called packet_write_fmt() as the string
> parameter can be formatted.

I would say:

  packet_write() should be called packet_write_fmt() because it
  is printf-like function where first parameter is format string.
  
Or something like that.  But such minor change might be not worth
yet another reroll of this patch series.

Perhaps it would be a good idea to explain the reasoning behind
this change:

  This is important distinction to know from the name if the
  function accepts arbitrary binary data and/or arbitrary
  strings to be written - packet_write[_fmt()] do not.

> 
> Suggested-by: Junio C Hamano <gitster@pobox.com>

Just so nobody wonders later why this patch was needed/suggested.

> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  builtin/archive.c        |  4 ++--
>  builtin/receive-pack.c   |  4 ++--
>  builtin/remote-ext.c     |  4 ++--
>  builtin/upload-archive.c |  4 ++--
>  connect.c                |  2 +-
>  daemon.c                 |  2 +-
>  http-backend.c           |  2 +-
>  pkt-line.c               |  2 +-

The header of the renamed function looks now very nice:

 void packet_write_fmt(int fd, const char *fmt, ...)
                   ^^^                     ^^^

>  pkt-line.h               |  2 +-
>  shallow.c                |  2 +-
>  upload-pack.c            | 30 +++++++++++++++---------------
>  11 files changed, 29 insertions(+), 29 deletions(-)

Diffstat looks correct.  Was the patch generated by doing search
and replace?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 02/11] pkt-line: extract set_packet_header()
  2016-09-20 19:02 ` [PATCH v8 02/11] pkt-line: extract set_packet_header() larsxschneider
@ 2016-09-24 21:22   ` Jakub Narębski
  2016-09-26 18:53     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-24 21:22 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> Subject: [PATCH v8 02/11] pkt-line: extract set_packet_header()
> 
> set_packet_header() converts an integer to a 4 byte hex string. Make
> this function locally available so that other pkt-line functions can
> use it.

Ah. I have trouble understanding this commit message, as the
set_packet_header() was not available before this patch, but it
is good if one reads it together with commit summary / title.

Writing

  Extracted set_packet_header() function converts...

or

  New set_packet_header() function converts... 

would make it more clear, but it is all right as it is now.
Perhaps also

  ... could use it.

as currently no other pkt-line function but the one set_packet_header()
was extracted from, namely format_packet(), uses it.

But that is just nitpicking; no need to change on that account.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command
  2016-09-20 19:02 ` [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command larsxschneider
@ 2016-09-24 22:12   ` Jakub Narębski
  2016-09-26 16:13     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-24 22:12 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Move check_pipe() to run_command and make it public. This is necessary
> to call the function from pkt-line in a subsequent patch.

All right.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  run-command.c  | 13 +++++++++++++
>  run-command.h  |  2 ++
>  write_or_die.c | 13 -------------
>  3 files changed, 15 insertions(+), 13 deletions(-)

Diffstat looks correct.

Not to add to your burden, but perhaps somebody could add to his/her
TODO documenting check_pipe() in Documentation/technical/api-run-command.txt
Or is it not worth it?

Best regards,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently()
  2016-09-20 19:02 ` [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently() larsxschneider
@ 2016-09-24 22:27   ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-24 22:27 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, gitster, sbeller, mlbright, tboegi, ramsay

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_write_fmt() would die in case of a write error even though for
> some callers an error would be acceptable. Add packet_write_fmt_gently()
> which writes a formatted pkt-line like packet_write_fmt() but does not
> die in case of an error. The function is used in a subsequent patch.

Looks good.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 34 ++++++++++++++++++++++++++++++----
>  pkt-line.h |  1 +
>  2 files changed, 31 insertions(+), 4 deletions(-)


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 05/11] pkt-line: add packet_flush_gently()
  2016-09-20 19:02 ` [PATCH v8 05/11] pkt-line: add packet_flush_gently() larsxschneider
@ 2016-09-24 22:56   ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-24 22:56 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_flush() would die in case of a write error even though for some
> callers an error would be acceptable. Add packet_flush_gently() which
> writes a pkt-line flush packet like packet_flush() but does not die in
> case of an error. The function is used in a subsequent patch.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>

Looks good.

I guess the difference in treatment from packet_write_fmt_gently() in
the previous patch is that there isn't anything to extract to form
a common code function... I have skipped a few iterations of this series.

> ---
>  pkt-line.c | 8 ++++++++
>  pkt-line.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 3b465fd..19f0271 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -91,6 +91,14 @@ void packet_flush(int fd)
>  	write_or_die(fd, "0000", 4);
>  }
>  
> +int packet_flush_gently(int fd)
> +{
> +	packet_trace("0000", 4, 1);
> +	if (write_in_full(fd, "0000", 4) == 4)
> +		return 0;
> +	return error("flush packet write failed");
> +}
> +
>  void packet_buf_flush(struct strbuf *buf)
>  {
>  	packet_trace("0000", 4, 1);
> diff --git a/pkt-line.h b/pkt-line.h
> index 3caea77..3fa0899 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -23,6 +23,7 @@ void packet_flush(int fd);
>  void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  void packet_buf_flush(struct strbuf *buf);
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int packet_flush_gently(int fd);
>  int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 06/11] pkt-line: add packet_write_gently()
  2016-09-20 19:02 ` [PATCH v8 06/11] pkt-line: add packet_write_gently() larsxschneider
@ 2016-09-25 11:26   ` Jakub Narębski
  2016-09-26 19:21     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-25 11:26 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> packet_write_fmt_gently() uses format_packet() which lets the caller
> only send string data via "%s". That means it cannot be used for
> arbitrary data that may contain NULs.
> 
> Add packet_write_gently() which writes arbitrary data and does not die
> in case of an error. The function is used by other pkt-line functions in
> a subsequent patch.

Nice; obviously needed for sending binary data.

I wonder how send-pack / receive-pack handles sending binary files.
Though this is outside of scope of this patch series, it is something
to think about for later.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index 19f0271..fc0ac12 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -171,6 +171,22 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
>  	return status;
>  }
>  
> +static int packet_write_gently(const int fd_out, const char *buf, size_t size)

I'm not sure what naming convention the rest of Git uses, but isn't
it more like '*data' rather than '*buf' here?

> +{
> +	static char packet_write_buffer[LARGE_PACKET_MAX];

I think there should be warning (as a comment before function
declaration, or before function definition), that packet_write_gently()
is not thread-safe (nor reentrant, but the latter does not matter here,
I think).

Thread-safe vs reentrant: http://stackoverflow.com/a/33445858/46058

This is not something terribly important; I guess git code has tons
of functions not marked as thread-unsafe...

> +
> +	if (size > sizeof(packet_write_buffer) - 4) {

First, wouldn't the following be more readable:

  +	if (size + 4 > LARGE_PACKET_MAX) {

> +		return error("packet write failed - data exceeds max packet size");
> +	}

Second, CodingGuidelines is against using braces (blocks) for one
line conditionals: "We avoid using braces unnecessarily."

But this is just me nitpicking.

> +	packet_trace(buf, size, 1);
> +	size += 4;
> +	set_packet_header(packet_write_buffer, size);
> +	memcpy(packet_write_buffer + 4, buf, size - 4);
> +	if (write_in_full(fd_out, packet_write_buffer, size) == size)

Hmmm... in some places we use original size, in others (original) size + 4;
perhaps it would be more readable to add a new local temporary variable

	size_t full_size = size + 4;

Or perhaps use 'data_size' and 'packet_size', where 'packet_size = data_size + 4'.
But that might be too chatty for variable names ;-)

> +		return 0;
> +	return error("packet write failed");
> +}

Compared to previous iterations, where there were two versions
of this function, IIRC sharing no common code: one taking buffer
which had to be with place for packet size info, one with a separate
local buffer for packet size only and using two writes.  This
version uses static buffer (thus not thread-safe, I think; and
not reentrant), and memcpy.

Anyway, if reentrant / thread-safe version would be required,
or not doing memcpy turns out to be important with respect to
performance, we can provide with the *_r version:

  static int packet_write_gently_r(const int fd_out, const char *data, size_t size,
                                   char *restrict buf)

We can check if 'buf + 4 == data', and if it is, we can skip
memcpy() as an optimization.

This is something for the future, but not very important for
having this patch series accepted.

> +
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>  {
>  	va_list args;
> 

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-20 19:02 ` [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
@ 2016-09-25 13:46   ` Jakub Narębski
  2016-09-26 20:23     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-25 13:46 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> write_packetized_from_fd() and write_packetized_from_buf() write a
> stream of packets. All content packets use the maximal packet size
> except for the last one. After the last content packet a `flush` control
> packet is written.
> 
> read_packetized_to_buf() reads arbitrary sized packets until it detects
> a `flush` packet.

I guess that read_packetized_to_fd(), for completeness, is not needed
for the filter protocol (though it might be useful for the receive
side of send-pack / receive-pack).

Also, should it be read_packetized_to_strbuf()?  I guess using strbuf
to read is here because we might need more size to read in full, isn't
it.

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  pkt-line.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  pkt-line.h |  7 +++++++
>  2 files changed, 75 insertions(+)
> 
> diff --git a/pkt-line.c b/pkt-line.c
> index fc0ac12..a0a8543 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -196,6 +196,47 @@ void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
>  	va_end(args);
>  }
>  
> +int write_packetized_from_fd(int fd_in, int fd_out)

I wonder if it would be worth it to name parameters in such way that
it is known from the name which one is to be packetized, for example
fd_out_pkt here...

But it might be not worth it; you can get it from the function name.

> +{
> +	static char buf[PKTLINE_DATA_MAXLEN];

Static buffer means not thread-safe and not reentrant. It would be
nice to have this information in a comment for this function, but
it is not necessary.

Also, is using static variable better than using global variable
`packet_buffer`?  Well, scope for weird interactions is smaller...

Sidenote: we have LARGE_PACKET_MAX (used in previous patch), but
PKTLINE_DATA_MAXLEN not LARGE_PACKET_DATA_MAX.

> +	int err = 0;
> +	ssize_t bytes_to_write;
> +
> +	while (!err) {
> +		bytes_to_write = xread(fd_in, buf, sizeof(buf));
> +		if (bytes_to_write < 0)
> +			return COPY_READ_ERROR;
> +		if (bytes_to_write == 0)
> +			break;
> +		err = packet_write_gently(fd_out, buf, bytes_to_write);
> +	}
> +	if (!err)
> +		err = packet_flush_gently(fd_out);
> +	return err;
> +}

Looks good: clean and readable.

Sidenote (probably outside of scope of this patch): what are the
errors that we can get from this function, beside COPY_READ_ERROR
of course?

> +
> +int write_packetized_from_buf(const char *src_in, size_t len, int fd_out)
> +{
> +	static char buf[PKTLINE_DATA_MAXLEN];
> +	int err = 0;
> +	size_t bytes_written = 0;
> +	size_t bytes_to_write;

Those two variables, instead of modifying the values of len and/or src_in,
make code very easy to read.

> +
> +	while (!err) {
> +		if ((len - bytes_written) > sizeof(buf))
> +			bytes_to_write = sizeof(buf);
> +		else
> +			bytes_to_write = len - bytes_written;
> +		if (bytes_to_write == 0)
> +			break;
> +		err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write);
> +		bytes_written += bytes_to_write;
> +	}
> +	if (!err)
> +		err = packet_flush_gently(fd_out);
> +	return err;
> +}

Looks good: clean and readable.

> +
>  static int get_packet_data(int fd, char **src_buf, size_t *src_size,
>  			   void *dst, unsigned size, int options)
>  {
> @@ -305,3 +346,30 @@ char *packet_read_line_buf(char **src, size_t *src_len, int *dst_len)
>  {
>  	return packet_read_line_generic(-1, src, src_len, dst_len);
>  }
> +
> +ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out)

It's a bit strange that the signature of write_packetized_from_buf() is
that different from read_packetized_to_buf().  This includes the return
value: int vs ssize_t.  As I have checked, write() and read() both
use ssize_t, while fread() and fwrite() both use size_t.

Perhaps this function should be named read_packetized_to_strbuf()
(err, I asked this already)?

> +{
> +	int paket_len;

Possible typo: shouldn't it be called packet_len?
Shouldn't it be initialized to 0?

  +	int packet_len = 0;

> +	int options = PACKET_READ_GENTLE_ON_EOF;

Why is this even a local variable?  It is never changed, and it is
used only in one place; we can inline it.

If it would be needed in subsequent patches, then such information
should be included in the commit message.

> +
> +	size_t oldlen = sb_out->len;
> +	size_t oldalloc = sb_out->alloc;

Just a nitpick (feel free to ignore): doesn't this looks better:

  +	size_t old_len   = sb_out->len;
  +	size_t old_alloc = sb_out->alloc;

Also perhaps s/old_/orig_/g.

> +
> +	for (;;) {

I see that you used the more popular way of coding forever loop:

  $ git grep 'for (;;)' -- '*.c'  | wc -l
  120
  $ git grep 'while (1)' -- '*.c' | wc -l
  86


> +		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
> +		paket_len = packet_read(fd_in, NULL, NULL,
> +			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);

A question (which perhaps was answered during the development of this
patch series): why is this +1 in PKTLINE_DATA_MAXLEN+1 here?

> +		if (paket_len <= 0)
> +			break;
> +		sb_out->len += paket_len;
> +	}
> +
> +	if (paket_len < 0) {
> +		if (oldalloc == 0)
> +			strbuf_release(sb_out);
> +		else
> +			strbuf_setlen(sb_out, oldlen);

A question (maybe I don't understand strbufs): why there is a special
case for oldalloc == 0?

> +		return paket_len;
> +	}
> +	return sb_out->len - oldlen;
> +}
> diff --git a/pkt-line.h b/pkt-line.h
> index 3fa0899..6df8449 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -25,6 +25,8 @@ void packet_buf_flush(struct strbuf *buf);
>  void packet_buf_write(struct strbuf *buf, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
>  int packet_flush_gently(int fd);
>  int packet_write_fmt_gently(int fd, const char *fmt, ...) __attribute__((format (printf, 2, 3)));
> +int write_packetized_from_fd(int fd_in, int fd_out);
> +int write_packetized_from_buf(const char *src_in, size_t len, int fd_out);
>  
>  /*
>   * Read a packetized line into the buffer, which must be at least size bytes
> @@ -77,6 +79,11 @@ char *packet_read_line(int fd, int *size);
>   */
>  char *packet_read_line_buf(char **src_buf, size_t *src_len, int *size);
>  
> +/*
> + * Reads a stream of variable sized packets until a flush packet is detected.
> + */
> +ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out);
> +
>  #define DEFAULT_PACKET_MAX 1000
>  #define LARGE_PACKET_MAX 65520
>  extern char packet_buffer[LARGE_PACKET_MAX];
> 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 08/11] convert: quote filter names in error messages
  2016-09-20 19:02 ` [PATCH v8 08/11] convert: quote filter names in error messages larsxschneider
@ 2016-09-25 14:03   ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-25 14:03 UTC (permalink / raw)
  To: larsxschneider, git; +Cc: peff, gitster, sbeller, mlbright, tboegi, ramsay

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git filter driver commands with spaces (e.g. `filter.sh foo`) are hard
> to read in error messages. Quote them to improve the readability.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  convert.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)

Looks good (those are all sites matching 'error.*%s' in convert.c).

> -		return error("cannot fork to run external filter %s", params->cmd);
> +		return error("cannot fork to run external filter '%s'", params->cmd);

> -		error("cannot feed the input to external filter %s", params->cmd);
> +		error("cannot feed the input to external filter '%s'", params->cmd);

> -		error("external filter %s failed %d", params->cmd, status);
> +		error("external filter '%s' failed %d", params->cmd, status);

> -		error("read from external filter %s failed", cmd);
> +		error("read from external filter '%s' failed", cmd);

> -		error("read from external filter %s failed", cmd);
> +		error("read from external filter '%s' failed", cmd);

> -		error("external filter %s failed", cmd);
> +		error("external filter '%s' failed", cmd);


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 09/11] convert: modernize tests
  2016-09-20 19:02 ` [PATCH v8 09/11] convert: modernize tests larsxschneider
@ 2016-09-25 14:43   ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-25 14:43 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Use `test_config` to set the config, check that files are empty with
> `test_must_be_empty`, compare files with `test_cmp`, and remove spaces
> after ">" and "<".

That's good.

> 
> Please note that the "rot13" filter configured in "setup" keeps using
> `git config` instead of `test_config` because subsequent tests might
> depend on it.

This is good information to have for doing review (which could include
"post-mortem" review during bisect, so it should be in commit message
proper).

> 
> Reviewed-by: Stefan Beller <sbeller@google.com>

I have not reviewed this patch in detail, but it looks good.
A bit of nitpicking below.

> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  t/t0021-conversion.sh | 58 +++++++++++++++++++++++++--------------------------
>  1 file changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index e799e59..dc50938 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -38,8 +38,8 @@ script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
>  
>  test_expect_success check '

This patch is "while at it" already for this patch series, done
I guess for new tests to both use modern style, and be consistent
with the rest of test...

...that said, if you could modernize _naming_ of tests.  The t0021
test is quite inconsistent, and uses:

 * standard short names, like 'setup', without quotes (once),
   which is I think all right
 * cryptic short names, like 'check', without quotes (once)
 * snake_case name, like 'expanded_in_repo', without quotes (once)
 
>  test_expect_success "filter: clean empty file" '
>  test_expect_success "filter: smudge empty file" '

 * double quoted names (twice, see above)
 * proper modern names, with single quotes (the rest),
   which is as almost all the rest should be using

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling
  2016-09-20 19:02 ` [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling larsxschneider
@ 2016-09-25 14:47   ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-25 14:47 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> apply_filter() returns a boolean that tells the caller if it
> "did convert or did not convert". The variable `ret` was used throughout
> the function to track errors whereas `1` denoted success and `0`
> failure. This is unusual for the Git source where `0` denotes success.
> 
> Rename the variable and flip its value to make the function easier
> readable for Git developers.

This also allow to use the 'err = error("<error message>");' idiom,
isn't it...

> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  convert.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)

...which allows to delete some lines of code.  Very nice.

> -	int ret = 1;
> +	int err = 0;

> -		error("read from external filter '%s' failed", cmd);
> -		ret = 0;
> +		err = error("read from external filter '%s' failed", cmd);

> -	if (ret) {
> +	if (!err) {

> -	return ret;
> +	return !err;

Looks good.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command
  2016-09-24 22:12   ` Jakub Narębski
@ 2016-09-26 16:13     ` Lars Schneider
  2016-09-26 16:21       ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-26 16:13 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 25 Sep 2016, at 00:12, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Move check_pipe() to run_command and make it public. This is necessary
>> to call the function from pkt-line in a subsequent patch.
> 
> All right.

Does this mean I can add your "Acked-by: Jakub Narebski <jnareb@gmail.com>" ?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command
  2016-09-26 16:13     ` Lars Schneider
@ 2016-09-26 16:21       ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-26 16:21 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

On 26 September 2016 at 18:13, Lars Schneider <larsxschneider@gmail.com> wrote:
>> On 25 Sep 2016, at 00:12, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>
>>> Move check_pipe() to run_command and make it public. This is necessary
>>> to call the function from pkt-line in a subsequent patch.
>>
>> All right.
>
> Does this mean I can add your "Acked-by: Jakub Narebski <jnareb@gmail.com>" ?

Well, Acked-by makes sense if it is from subsystem maintainer. I can only
claim gitweb subsystem where my ACKs might make sense.

This "All right" is here to note that I have read this patch (and not
skipped it),
and I have't found anything to complain about or nitpick ;-P

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt()
  2016-09-24 21:14   ` Jakub Narębski
@ 2016-09-26 18:49     ` Lars Schneider
  2016-09-28 23:15       ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-26 18:49 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


On 24 Sep 2016, at 23:14, Jakub Narębski <jnareb@gmail.com> wrote:

> Hello Lars,
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> packet_write() should be called packet_write_fmt() as the string
>> parameter can be formatted.
> 
> I would say:
> 
>  packet_write() should be called packet_write_fmt() because it
>  is printf-like function where first parameter is format string.
> 
> Or something like that.  But such minor change might be not worth
> yet another reroll of this patch series.
> 
> Perhaps it would be a good idea to explain the reasoning behind
> this change:
> 
>  This is important distinction to know from the name if the
>  function accepts arbitrary binary data and/or arbitrary
>  strings to be written - packet_write[_fmt()] do not.

packet_write() should be called packet_write_fmt() because it is a
printf-like function that takes a format string as first parameter.

packet_write_fmt() should be used for text strings only. Arbitrary
binary data should use a new packet_write() function that is introduced
in a subsequent patch.

Better?


>> pkt-line.h               |  2 +-
>> shallow.c                |  2 +-
>> upload-pack.c            | 30 +++++++++++++++---------------
>> 11 files changed, 29 insertions(+), 29 deletions(-)
> 
> Diffstat looks correct.  Was the patch generated by doing search
> and replace?

Yes.

- Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 02/11] pkt-line: extract set_packet_header()
  2016-09-24 21:22   ` Jakub Narębski
@ 2016-09-26 18:53     ` Lars Schneider
  0 siblings, 0 replies; 71+ messages in thread
From: Lars Schneider @ 2016-09-26 18:53 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


On 24 Sep 2016, at 23:22, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Subject: [PATCH v8 02/11] pkt-line: extract set_packet_header()
>> 
>> set_packet_header() converts an integer to a 4 byte hex string. Make
>> this function locally available so that other pkt-line functions can
>> use it.
> 
> Ah. I have trouble understanding this commit message, as the
> set_packet_header() was not available before this patch, but it
> is good if one reads it together with commit summary / title.
> 
> Writing
> 
>  Extracted set_packet_header() function converts...
> 
> or
> 
>  New set_packet_header() function converts... 
> 
> would make it more clear, but it is all right as it is now.
> Perhaps also
> 
>  ... could use it.
> 
> as currently no other pkt-line function but the one set_packet_header()
> was extracted from, namely format_packet(), uses it.
> 
> But that is just nitpicking; no need to change on that account.

Changed it:

Extracted set_packet_header() function converts an integer to a 4 byte 
hex string. Make this function locally available so that other pkt-line 
functions could use it.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 06/11] pkt-line: add packet_write_gently()
  2016-09-25 11:26   ` Jakub Narębski
@ 2016-09-26 19:21     ` Lars Schneider
  2016-09-27  8:39       ` Jeff King
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-26 19:21 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


On 25 Sep 2016, at 13:26, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> ...
>> 
>> +static int packet_write_gently(const int fd_out, const char *buf, size_t size)
> 
> I'm not sure what naming convention the rest of Git uses, but isn't
> it more like '*data' rather than '*buf' here?

pkt-line seems to use 'buf' or 'buffer' for everything else.


>> +{
>> +	static char packet_write_buffer[LARGE_PACKET_MAX];
> 
> I think there should be warning (as a comment before function
> declaration, or before function definition), that packet_write_gently()
> is not thread-safe (nor reentrant, but the latter does not matter here,
> I think).
> 
> Thread-safe vs reentrant: http://stackoverflow.com/a/33445858/46058
> 
> This is not something terribly important; I guess git code has tons
> of functions not marked as thread-unsafe...

I agree that the function is not thread-safe. However, I can't find 
an example in the Git source that marks a function as not thread-safe.
Unless is it explicitly stated in the coding guidelines I would prefer
not to start way to mark functions.


>> +	if (size > sizeof(packet_write_buffer) - 4) {
> 
> First, wouldn't the following be more readable:
> 
>  +	if (size + 4 > LARGE_PACKET_MAX) {

Peff suggested that here:
http://public-inbox.org/git/20160810132814.gqnipsdwyzjmuqjy@sigill.intra.peff.net/


>> +		return error("packet write failed - data exceeds max packet size");
>> +	}
> 
> Second, CodingGuidelines is against using braces (blocks) for one
> line conditionals: "We avoid using braces unnecessarily."
> 
> But this is just me nitpicking.

Fixed.


>> +	packet_trace(buf, size, 1);
>> +	size += 4;
>> +	set_packet_header(packet_write_buffer, size);
>> +	memcpy(packet_write_buffer + 4, buf, size - 4);
>> +	if (write_in_full(fd_out, packet_write_buffer, size) == size)
> 
> Hmmm... in some places we use original size, in others (original) size + 4;
> perhaps it would be more readable to add a new local temporary variable
> 
> 	size_t full_size = size + 4;

Agreed. I introduced "packet_size".

Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-25 13:46   ` Jakub Narębski
@ 2016-09-26 20:23     ` Lars Schneider
  2016-09-27  8:14       ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-26 20:23 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


On 25 Sep 2016, at 15:46, Jakub Narębski <jnareb@gmail.com> wrote:

> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>


>> +{
>> +	static char buf[PKTLINE_DATA_MAXLEN];
> 
> Sidenote: we have LARGE_PACKET_MAX (used in previous patch), but
> PKTLINE_DATA_MAXLEN not LARGE_PACKET_DATA_MAX.

Agreed, I will rename it.


> 
>> +	int err = 0;
>> +	ssize_t bytes_to_write;
>> +
>> +	while (!err) {
>> +		bytes_to_write = xread(fd_in, buf, sizeof(buf));
>> +		if (bytes_to_write < 0)
>> +			return COPY_READ_ERROR;
>> +		if (bytes_to_write == 0)
>> +			break;
>> +		err = packet_write_gently(fd_out, buf, bytes_to_write);
>> +	}
>> +	if (!err)
>> +		err = packet_flush_gently(fd_out);
>> +	return err;
>> +}
> 
> Looks good: clean and readable.
> 
> Sidenote (probably outside of scope of this patch): what are the
> errors that we can get from this function, beside COPY_READ_ERROR
> of course?

Everything that is returned by "read()"


>> +
>> static int get_packet_data(int fd, char **src_buf, size_t *src_size,
>> 			   void *dst, unsigned size, int options)
>> {
>> @@ -305,3 +346,30 @@ char *packet_read_line_buf(char **src, size_t *src_len, int *dst_len)
>> {
>> 	return packet_read_line_generic(-1, src, src_len, dst_len);
>> }
>> +
>> +ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out)
> 
> It's a bit strange that the signature of write_packetized_from_buf() is
> that different from read_packetized_to_buf().  This includes the return
> value: int vs ssize_t.  As I have checked, write() and read() both
> use ssize_t, while fread() and fwrite() both use size_t.

read_packetized_to_buf() returns the number of bytes read or a negative 
error code.

write_packetized_from_buf() returns 0 if the call was successful and an 
error code if not.

That's the reason these two functions have a different signature


> Perhaps this function should be named read_packetized_to_strbuf()
> (err, I asked this already)?

I agree with the rename as makes it distinct from 
write_packetized_from_buf().


>> +{
>> +	int paket_len;
> 
> Possible typo: shouldn't it be called packet_len?

True!


> Shouldn't it be initialized to 0?

Well, it is set for sure later. That's why I think it is not necessary.

Plus, Eric Wong thought me not to:
"Compilers complain about uninitialized variables."
http://public-inbox.org/git/20160725072745.GB11634@starla/
(Note: he was talking about pointers there :-)


>> +	int options = PACKET_READ_GENTLE_ON_EOF;
> 
> Why is this even a local variable?  It is never changed, and it is
> used only in one place; we can inline it.

Removed.


>> +
>> +	size_t oldlen = sb_out->len;
>> +	size_t oldalloc = sb_out->alloc;
> 
> Just a nitpick (feel free to ignore): doesn't this looks better:
> 
>  +	size_t old_len   = sb_out->len;
>  +	size_t old_alloc = sb_out->alloc;
> 
> Also perhaps s/old_/orig_/g.

Agreed. That matches the other variables better.


>> +		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
>> +		paket_len = packet_read(fd_in, NULL, NULL,
>> +			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);
> 
> A question (which perhaps was answered during the development of this
> patch series): why is this +1 in PKTLINE_DATA_MAXLEN+1 here?

Nice catch. I think this is wrong:
https://github.com/git/git/blob/6fe1b1407ed91823daa5d487abe457ff37463349/pkt-line.c#L196

It should be "if (len > size)" ... then we don't need the "+1" here.
(but I need to think a bit more about this)

> 
>> +		if (paket_len <= 0)
>> +			break;
>> +		sb_out->len += paket_len;
>> +	}
>> +
>> +	if (paket_len < 0) {
>> +		if (oldalloc == 0)
>> +			strbuf_release(sb_out);
>> +		else
>> +			strbuf_setlen(sb_out, oldlen);
> 
> A question (maybe I don't understand strbufs): why there is a special
> case for oldalloc == 0?

I tried to mimic the behavior of strbuf_read() [1]. The error handling
was introduced in 2fc647 [2] to ease error handling:

"This allows for easier error handling, as callers only need to call
strbuf_release() if A) the command succeeded or B) if they would have had
to do so anyway because they added something to the strbuf themselves."

[1] https://github.com/git/git/blob/cda1bbd474805e653dda8a71d4ea3790e2a66cbb/strbuf.c#L377-L383
[2] https://github.com/git/git/commit/2fc647004ac7016128372a85db8245581e493812


Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
@ 2016-09-26 22:41   ` Jakub Narębski
  2016-09-30 18:56     ` Lars Schneider
  2016-09-27 15:37   ` Jakub Narębski
  2016-09-28 23:14   ` Jakub Narębski
  2 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-26 22:41 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

Part first of the review of 11/11.

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git's clean/smudge mechanism invokes an external filter process for
> every single blob that is affected by a filter. If Git filters a lot of
> blobs then the startup time of the external filter processes can become
> a significant part of the overall Git execution time.
> 
> In a preliminary performance test this developer used a clean/smudge
> filter written in golang to filter 12,000 files. This process took 364s
> with the existing filter mechanism and 5s with the new mechanism. See
> details here: https://github.com/github/git-lfs/pull/1382
> 
> This patch adds the `filter.<driver>.process` string option which, if
> used, keeps the external filter process running and processes all blobs
> with the packet format (pkt-line) based protocol over standard input and
> standard output. The full protocol is explained in detail in
> `Documentation/gitattributes.txt`.

That is a good description.  Enough detail to explain the new feature,
all without duplicating information with (added) docs.

> 
> A few key decisions:
> 
> * The long running filter process is referred to as filter protocol
>   version 2 because the existing single shot filter invocation is
>   considered version 1.

All right.

> * Git sends a welcome message and expects a response right after the
>   external filter process has started. This ensures that Git will not
>   hang if a version 1 filter is incorrectly used with the
>   filter.<driver>.process option for version 2 filters. In addition,
>   Git can detect this kind of error and warn the user.

On one hand side, this involved handshake means that implementing
a filter process script is harder; you need to write quite a lot of
boilerplate (though the example or examples would help).

On the other hand, this handshake is what allows good error detection,
easy extendability of the protocol, and forward-compatibility.  Which,
as we agreed (AFAIU), is more important.

> * The status of a filter operation (e.g. "success" or "error) is set
>   before the actual response and (if necessary!) re-set after the
>   response. The advantage of this two step status response is that if
>   the filter detects an error early, then the filter can communicate
>   this and Git does not even need to create structures to read the
>   response.

That's nice (well, among others I have argued for this :-))

> * All status responses are pkt-line lists terminated with a flush
>   packet. This allows us to send other status fields with the same
>   protocol in the future.

Good.

This also makes protocol simple, easier to implement (on Git side),
and easier to parse (on filter side).

> 
> Helped-by: Martin-Louis Bright <mlbright@gmail.com>
> Reviewed-by: Jakub Narebski <jnareb@gmail.com>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt        | 156 +++++++++++++-
>  contrib/long-running-filter/example.pl | 123 +++++++++++
>  convert.c                              | 348 ++++++++++++++++++++++++++++---
>  pkt-line.h                             |   1 +
>  t/t0021-conversion.sh                  | 365 ++++++++++++++++++++++++++++++++-
>  t/t0021/rot13-filter.pl                | 191 +++++++++++++++++
>  6 files changed, 1153 insertions(+), 31 deletions(-)

That's quite a large change.  Large changes are harder to review.
I was thinking about how one could split this change.  I guess
that it is better to keep the new feature, its documentation, and
its tests together.  But perhaps the example in `contrib/`
(which is newer, and thus less reviewed) would be better in
separate commit.

There is also another change that could be split off this patch
into purely preparatory commit, that is one that stands alone
but doesn't make much sense alone.  I would write about this
proposal (important only if there would be yet another iteration
of this patch series) a bit later.

>  create mode 100755 contrib/long-running-filter/example.pl
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 7aff940..946dcad 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate.  If a long running `process` filter is used
   ^^^^

Should we use this terminology here?  I have not read the preceding
part of documentation, so I don't know if it talks about "blobs" or
if it uses "files" and/or "file contents".

Though this is very minor nitpick.

> +in place of `clean` and/or `smudge` filters, then Git can process
> +all blobs with a single filter command invocation for the entire
> +life of a single Git command, for example `git add --all`.  See
> +section below for the description of the protocol used to
> +communicate with a `process` filter.

Good introduction of long lived filter feature (`process` filter).

>  
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -373,6 +379,154 @@ not exist, or may have different contents. So, smudge and clean commands
>  should not try to access the file on disk, but only act as filters on the
>  content provided to them on standard input.
>  
> +Long Running Filter Process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +If the filter command (a string value) is defined via
> +`filter.<driver>.process` then Git can process all blobs with a
> +single filter invocation for the entire life of a single Git
> +command. This is achieved by using a packet format (pkt-line,
> +see technical/protocol-common.txt) based protocol over standard
> +input and standard output as follows. All packets are considered
> +text and therefore are terminated by an LF. Exceptions are the
> +"*CONTENT" packets and the flush packet.

I guess that reasoning here is that all but CONTENT packets are
metadata, and thus to aid debuggability of the protocol are "text",
as considered by pkt-line.

Perhaps a bit more readable would be the following (but current is
just fine; I am nitpicking):

  All packets, except for the "{star}CONTENT" packets and the "0000"
  flush packer, are considered text and therefore are terminated by
  a LF.

Or maybe:

  All metadata is considered text, and thus send as a text packet,
  that is terminated with the newline (LF) character.  The file
  contents is send as binary data, as is, without appending LF.
  The flush packet is a packet composed of 4 bytes, represented
  in ASCII as "0000".

Anyway, this is all right as it is now; we can always polish it
later.

> +
> +Git starts the filter when it encounters the first file
> +that needs to be cleaned or smudged. After the filter started
> +Git sends a welcome message ("git-filter-client"), a list of
> +supported protocol version numbers, and a flush packet. Git expects

I see that example below explains what is the format of sending
this list of process filter protocol version numbers supported by
Git (I suppose it is to future proof adding new versions of protocol,
and removing old ones if they are somehow buggy). It would be nice,
I think, to explain it in more detail:

  a list of supported protocol version numbers, each version in
  a separate text packet using the "version=<n>" format,

But with an example few paragraphs below it might be not necessary.

I think it might be a good idea to describe what flush packet is
somewhere in this document; on the other hand referring (especially
if hyperlinked) to pkt-line technical documentation might be good
enough / better.  I'm unsure, but I tend on the side that referring
to technical documentation is better.

> +to read a welcome response message ("git-filter-server") and exactly
> +one protocol version number from the previously sent list. All further

I guess that is to provide forward-compatibility, isn't it?  Also,
"Git expects..." probably means filter process MUST send, in the
RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.

> +communication will be based on the selected version. The remaining
> +protocol description below documents "version=2". Please note that
> +"version=42" in the example below does not exist and is only there
> +to illustrate how the protocol would look like with more than one
> +version.

All right.

> +
> +After the version negotiation Git sends a list of supported capabilities
> +and a flush packet.

Is it that Git SHOULD send list of ALL supported capabilities, or is
it that Git SHOULD NOT send capabilities it does not support, and that
it MAY send only those capabilities it needs (so for example if command
uses only `smudge`, it may not send `clean`, so that filter driver doesn't
need to initialize data it would not need).

I guess with the example few lines below there is no need to explain
the format of capabilities (or use BNF / EBNF notation to define it).

I wonder why it is "<capability>=true", and not "capability=<capability>".
Is there a case where we would want to send "<capability>=false".  Or
is it to allow configurable / value based capabilities?  Isn't it going
a bit too far: is there even a hind of an idea for parametrize-able
capability? YAGNI is a thing...

A few new capabilities that we might want to support in the near future
is "size", "stream", which are options describing how to communicate,
and "cleanFromFile", "smudgeToFile", which are new types of operations...
but neither needs any parameter.

I guess that adding new capabilities doesn't require having to come up
with the new version of the protocol, isn't it.

>                       Git expects to read a list of desired capabilities,
> +which must be a subset of the supported capabilities list, and a flush
> +packet as response:

All right, with Git speaking first, having Git provide list of supported
capabilities first is quite natural.

> +------------------------
> +packet:          git> git-filter-client

I guess we assume that from the above description it is obvious that
this is

  +packet:          git> git-filter-client\n

All right.

> +packet:          git> version=2
> +packet:          git> version=42

"List" means "each in separate packet", right.  Here also

  +packet:          git> version=2\n

> +packet:          git> 0000

As I wrote, I hope everybody would understand that is a flush packet,
that is packet composed literally of 4 characters / bytes "0000",
and not binary or text packet with "0000" as contents, that is
"00040000" packet or "00050000\n" packet.

But as it is consistent with other examples, and with GIT_TRACE_PACKET
output, I think both skipping trailing \n for text packets, and
writing "0000" for flush packet is all right.  Sorry for the noise.

Sidenote (you don't have to answer to): do we use "0004" packet as
a keep-alive packet anywhere?

> +packet:          git< git-filter-server
> +packet:          git< version=2
> +packet:          git> clean=true
> +packet:          git> smudge=true
> +packet:          git> not-yet-invented=true

Hmmm... should we hint at the use of kebab-case versus snake_case
or camelCase for new capabilities?

> +packet:          git> 0000
> +packet:          git< clean=true
> +packet:          git< smudge=true
> +packet:          git< 0000
> +------------------------
> +Supported filter capabilities in version 2 are "clean" and
> +"smudge".

I think we can add new capabilities without increasing version number
of the protocol.  But then we can and should update this part of the
documentation.

All right.

> +
> +Afterwards Git sends a list of "key=value" pairs terminated with
> +a flush packet. The list will contain at least the filter command
> +(based on the supported capabilities) and the pathname of the file
> +to filter relative to the repository root. Right after these packets
> +Git sends the content split in zero or more pkt-line packets and a
> +flush packet to terminate content.

All right, the example below shows what are the names of 'variables;
in those packets (that is "command=" ( "clean" | "smudge" ), and
"pathname=" <pathname>).

> +------------------------
> +packet:          git> command=smudge
> +packet:          git> pathname=path/testfile.dat
> +packet:          git> 0000
> +packet:          git> CONTENT
> +packet:          git> 0000
> +------------------------

I think it is important to mention that (at least with current
`filter.<driver>.process` implementation, that is absent future
"stream" capability / option) the filter process needs to read
*whole contents* at once, *before* writing anything.  Otherwise
it can lead to deadlock.

This is especially important in that it is different (!) from the
current behavior of `clean` and `smudge` filters, which can
stream their response because Git invokes them async.

> +
> +The filter is expected to respond with a list of "key=value" pairs
> +terminated with a flush packet. If the filter does not experience
> +problems then the list must contain a "success" status.

Perhaps "status" packet with value "success", or

                                   If the filter does not experience
  +problems then the list must contain a "status=success" line.

Possibly s/line./packet./

But as I see with how it is explained further, 'a "success" status'
works too.  No need to change, then.

>                                                          Right after
> +these packets the filter is expected to send the content in zero
> +or more pkt-line packets and a flush packet at the end. Finally, a
> +second list of "key=value" pairs terminated with a flush packet
> +is expected. The filter can change the status in the second list.
> +------------------------
> +packet:          git< status=success
> +packet:          git< 0000
> +packet:          git< SMUDGED_CONTENT
> +packet:          git< 0000
> +packet:          git< 0000  # empty list!
> +------------------------

All right.  Empty list with no change in status looks good.

An alternative would be to assume different meaning to the "status"
before and after sending contents, e.g.

   packet:          git< received=ok
   packet:          git< 0000
   packet:          git< SMUDGED_CONTENT
   packet:          git< 0000
   packet:          git< sent=ok
   packet:          git< 0000

But I think current solution is good enough.

> +
> +If the result content is empty then the filter is expected to respond
> +with a success status and an empty list.
> +------------------------
> +packet:          git< status=success
> +packet:          git< 0000
> +packet:          git< 0000  # empty content!
> +packet:          git< 0000  # empty list!
> +------------------------

All right.  This follows from the definition, but it is nice to have
it spelled in full.

> +
> +In case the filter cannot or does not want to process the content,
> +it is expected to respond with an "error" status. Depending on the
> +`filter.<driver>.required` flag Git will interpret that as error
> +but it will not stop or restart the filter process.

Right, and Git would not try to read contents from the filter then.

> +------------------------
> +packet:          git< status=error
> +packet:          git< 0000
> +------------------------
> +
> +If the filter experiences an error during processing, then it can
> +send the status "error" after the content was (partially or
> +completely) sent. Depending on the `filter.<driver>.required` flag
> +Git will interpret that as error but it will not stop or restart the
> +filter process.
> +------------------------
> +packet:          git< status=success
> +packet:          git< 0000
> +packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
> +packet:          git< 0000
> +packet:          git< status=error
> +packet:          git< 0000
> +------------------------

Good.  A question is if the filter process can send "status=abort"
after partial contents, or does it need to wait for the next command?

> +
> +If the filter dies during the communication or does not adhere to
> +the protocol then Git will stop the filter process and restart it
> +with the next file that needs to be processed. Depending on the
> +`filter.<driver>.required` flag Git will interpret that as error.
> +
> +The error handling for all cases above mimic the behavior of
> +the `filter.<driver>.clean` / `filter.<driver>.smudge` error
> +handling.

Good.

> +
> +In case the filter cannot or does not want to process the content
> +as well as any future content for the lifetime of the Git process,
> +it is expected to respond with an "abort" status. Depending on
> +the `filter.<driver>.required` flag Git will interpret that as error
> +for the content as well as any future content for the lifetime of the
> +Git process but it will not stop or restart the filter process.
> +------------------------
> +packet:          git< status=abort
> +packet:          git< 0000
> +------------------------

I assume this is obvious that if filter process tells "abort", Git
would not try to send further files (regardless of the value of
`filter.<driver>.required`).

> +
> +After the filter has processed a blob it is expected to wait for
> +the next "key=value" list containing a command. Git will close
> +the command pipe on exit. The filter is expected to detect EOF
> +and exit gracefully on its own.

Good to have it documented.  

Anyway, as it is Git command that spawns the filter driver process,
assuming that the filter process doesn't daemonize itself, wouldn't
the operating system reap it after its parent process, that is the
git command it invoked, dies? So detecting EOF is good, but not
strictly necessary for simple filter that do not need to free
its resources, or can leave freeing resources to the operating
system? But I may be wrong here.

> +
> +A long running filter demo implementation can be found in
> +`contrib/long-running-filter/example.pl` located in the Git
> +core repository. If you develop your own long running filter
> +process then the `GIT_TRACE_PACKET` environment variables can be
> +very helpful for debugging (see linkgit:git[1]).

Very good... though I wonder if adding demo implementation should
not be left for a separate commit.

> +
> +If a `filter.<driver>.process` command is configured then it
> +always takes precedence over a configured `filter.<driver>.clean`
> +or `filter.<driver>.smudge` command.

This is a change from what I remember of previous iterations of this
patch series, but I see that it might be the best solution of the
three possible (that I can think of):

* `filter.<driver>.clean` or `filter.<driver>.smudge` command
  always takes precedence over `filter.<driver>.process`

  ADVANTAGES: 
   - can convert only half of filter to process filter

  DISADVANTAGES: 
   - uncommon "older type wins"
   - cannot provide fallback of one-shot filters for older Git

* `filter.<driver>.process` command always takes precedence
  over `filter.<driver>.clean` or `filter.<driver>.smudge` command
  (this is the one chosen)

  ADVANTAGES:
   - can provide `clean` and `smudge` filters as fallback for
     older Git (e.g. installed locally, and not always in PATH)

  DISADVANTAGES:
   - need to convert both `clean` and `smudge` part into `process`
     at once

* `filter.<driver>.process` command takes precedence over
  `filter.<driver>.clean` if it supports "clean" capability, and
  similarly for `filter.<driver>.smudge`

  ADVANTAGES:
   - can convert only half of filter to process filter
   - can provide `clean` and `smudge` filters as fallback for
     older Git (e.g. installed locally, and not always in PATH)
  
  DISADVANTAGES:
   - you need to see the filter implementation to know which
     one would be invoked
   - complicated to understand, reason about, and implement

> +
> +Please note that you cannot use an existing `filter.<driver>.clean`
> +or `filter.<driver>.smudge` command with `filter.<driver>.process`
> +because the former two use a different inter process communication
> +protocol than the latter one.

Good.

> +
> +
>  Interaction between checkin/checkout attributes
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  
> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 0000000..c13a631
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl

To repeat myself, I think it would serve better as a separate patch.

Perhaps even you could add the example filter in the Go language
from https://github.com/github/git-lfs/pull/1382 (there is already
at least one Go program in contrib/).  In a separate patch, of course.
The commands/command_filter.go.

> @@ -0,0 +1,123 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git filter protocol version 2
> +# See Documentation/gitattributes.txt, section "Filter Protocol"

We might add that it is a pass-thru filter, a skeleton to be
modified.

> +#
> +
> +use strict;
> +use warnings;
> +
> +my $MAX_PACKET_CONTENT_SIZE = 65516;

All right, constants (the built in / in core ones) are strange...
Variables are easier to use.

> +
> +sub packet_bin_read {
> +    my $buffer;
> +    my $bytes_read = read STDIN, $buffer, 4;
> +    if ( $bytes_read == 0 ) {
> +
> +        # EOF - Git stopped talking to us!
> +        exit();
> +    }
> +    elsif ( $bytes_read != 4 ) {

That is a different Perl coding convention that the one I am used to,
but it is no less valid.  Consistent style is more important.  This is
the contrib/ area anyway.

> +        die "invalid packet size '$bytes_read' field";

This would read "invalid packet size '000' field", for example.
Perhaps the following would be (slightly) better:

  +        die "invalid packet size field: '$bytes_read'";

> +    }
> +    my $pkt_size = hex($buffer);
> +    if ( $pkt_size == 0 ) {
> +        return ( 1, "" );

It feels a bit strange to me to return a list / a pair combining
status / error condition with the actual result.  I'm more used
to packing such in hash reference.  But I think it is all right.

> +    }
> +    elsif ( $pkt_size > 4 ) {

Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive
one?  Or is it forbidden?

We can declare that Git should not use it for filter process anyway.

> +        my $content_size = $pkt_size - 4;
> +        $bytes_read = read STDIN, $buffer, $content_size;
> +        if ( $bytes_read != $content_size ) {
> +            die "invalid packet ($content_size expected; $bytes_read read)";

This error message would read "invalid packet (12 expected; 10 read)";
I think it would be better to rephrase it as

  +            die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";

> +        }
> +        return ( 0, $buffer );
> +    }
> +    else {
> +        die "invalid packet size";

I'm not sure if it is worth it (especially for the demo script),
but perhaps we could show what this invalid size was?

  +        die "invalid packet size value '$pkt_size'";

> +    }
> +}
> +
> +sub packet_txt_read {
> +    my ( $res, $buf ) = packet_bin_read();
> +    unless ( $buf =~ /\n$/ ) {

Wouldn't

  +    unless ( $buf =~ s/\n$// ) {

or (less so)

  +    unless ( $buf =~ s/\n$\z// ) {

be more idiomatic (and not require use of 'substr')?  Remember,
the s/// substitution quote-like operator returns number of
substitutions in the scalar context.

> +        die "A non-binary line SHOULD BE terminated by an LF.";

This is SHOULD be, not MUST be, so perhaps 'warn' would be enough.
Not that Git should send us such line.

> +    }
> +    return ( $res, substr( $buf, 0, -1 ) );

This would be not necessary if s/// instead of m// was used.

> +}
> +
> +sub packet_bin_write {
> +    my ($packet) = @_;

This is equivalent to

  +    my $packet = shift;

which, I think, is more common for single-parameter subroutines.

Also, this is $data (or $buf), not $packet.

> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
> +    print STDOUT $packet;
> +    STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +    packet_bin_write( $_[0] . "\n" );
> +}

Nice.

> +
> +sub packet_flush {
> +    print STDOUT sprintf( "%04x", 0 );

We could use simply

  +    print STDOUT "0000";

but this is more explicit.  Good.

> +    STDOUT->flush();
> +}
> +

Perhaps some comment that main begins here?

> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";

Actually, it is overly strict.  It should not fail if there
are other "version=3", "version=4" etc. lines.

> +
> +packet_txt_write("git-filter-server");
> +packet_txt_write("version=2");

It needs to do

  +packet_flush();

here.

> +
> +( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
> +( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
> +( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";

It is also overly strict.  The capabilities can be ordered in any
way, and there can be additional capabilities which this script
do not understand.  It is all right to have such capabilities.

All this makes it better to extract the handshake / metadata part
into a subroutine.

> +
> +packet_txt_write("clean=true");
> +packet_txt_write("smudge=true");
> +packet_flush();

All right.

> +
> +while (1) {
> +    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
> +    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;

Do we require this order?  If it is, is that explained in the
documentation?

> +
> +    packet_bin_read();

I think there can be other auxiliary data (like "size=<n>") that
filter do not need to understand.

Anyway, using packet_bin_read() is quite unreadable.  What you
mean is to wait until flush packet, or expect flush packet.

   	my $done = 0;
   	while ( !$done ) {
   		( $done, undef ) = packet_bin_read();
   	}

Or

   	wait_for_flush();

Or

   	skip_until_flush();

> +
> +    my $input = "";
> +    {
> +        binmode(STDIN);
> +        my $buffer;
> +        my $done = 0;
> +        while ( !$done ) {
> +            ( $done, $buffer ) = packet_bin_read();
> +            $input .= $buffer;
> +        }
> +    }

All right.

> +
> +    my $output;
> +    if ( $command eq "clean" ) {
> +        ### Perform clean here ###
> +        $output = $input;

Perhaps we should also mention here how to handle errors
(the "status=error" and "status=abort" both for upfront
and partial contents case).

> +    }
> +    elsif ( $command eq "smudge" ) {
> +        ### Perform smudge here ###
> +        $output = $input;

Same as above.

> +    }
> +    else {
> +        die "bad command '$command'";
> +    }

All right.

> +
> +    packet_txt_write("status=success");
> +    packet_flush();
> +    while ( length($output) > 0 ) {
> +        my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
> +        packet_bin_write($packet);
> +        if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
> +            $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
> +        }
> +        else {
> +            $output = "";
> +        }
> +    }
> +    packet_flush();    # flush content!

All right.

> +    packet_flush();    # empty list!

This is less "empty list!", and more keeping "status=success" unchanged.

> +}
> diff --git a/convert.c b/convert.c
> index 597f561..bd66257 100644
> --- a/convert.c
> +++ b/convert.c

I'll stop here, and I'll finish the review later.

To be continued,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-26 20:23     ` Lars Schneider
@ 2016-09-27  8:14       ` Lars Schneider
  2016-09-27  9:00         ` Jeff King
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-27  8:14 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


On 26 Sep 2016, at 22:23, Lars Schneider <larsxschneider@gmail.com> wrote:

> 
> On 25 Sep 2016, at 15:46, Jakub Narębski <jnareb@gmail.com> wrote:
> 
>> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>>> From: Lars Schneider <larsxschneider@gmail.com>
> 
> 
>>> +		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
>>> +		paket_len = packet_read(fd_in, NULL, NULL,
>>> +			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);
>> 
>> A question (which perhaps was answered during the development of this
>> patch series): why is this +1 in PKTLINE_DATA_MAXLEN+1 here?
> 
> Nice catch. I think this is wrong:
> https://github.com/git/git/blob/6fe1b1407ed91823daa5d487abe457ff37463349/pkt-line.c#L196
> 
> It should be "if (len > size)" ... then we don't need the "+1" here.
> (but I need to think a bit more about this)

After looking at it with fresh eyes I think the existing code is probably correct,
but maybe a bit confusing.

packet_read() adds a '\0' at the end of the destination buffer:
https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/pkt-line.c#L206

That is why the destination buffer needs to be one byte larger than the expected content.

However, in this particular case that wouldn't be necessary because the destination
buffer is a 'strbuf' that allocates an extra byte for '\0' at the end. But we are not
supposed to write to this extra byte:
https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/strbuf.h#L25-L31


I see two options:


(1) I leave the +1 as is and add a comment why the extra byte is necessary.

    Pro: No change in existing code necessary
    Con: The destination buffer has two '\0' at the end.


(2) I add an option PACKET_READ_DISABLE_NUL_TERMINATION. If the option is
    set then no '\0' byte is added to the end.

    Pro: Correct solution, no byte wasted.
    Con: Change in existing code required.


Any preference?


Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 06/11] pkt-line: add packet_write_gently()
  2016-09-26 19:21     ` Lars Schneider
@ 2016-09-27  8:39       ` Jeff King
  2016-09-27 19:33         ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Jeff King @ 2016-09-27  8:39 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, git, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

On Mon, Sep 26, 2016 at 09:21:10PM +0200, Lars Schneider wrote:

> On 25 Sep 2016, at 13:26, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> > W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> >> From: Lars Schneider <larsxschneider@gmail.com>
> >> ...
> >> 
> >> +static int packet_write_gently(const int fd_out, const char *buf, size_t size)
> > 
> > I'm not sure what naming convention the rest of Git uses, but isn't
> > it more like '*data' rather than '*buf' here?
> 
> pkt-line seems to use 'buf' or 'buffer' for everything else.

I do not think we have definite rules, but I would generally expect to
see "data" as an opaque thing (e.g., passing "void *data" to callbacks).
"buf" or "buffer" makes sense here, but I don't think it really matters
that much either way.

> >> +	static char packet_write_buffer[LARGE_PACKET_MAX];
> > 
> > I think there should be warning (as a comment before function
> > declaration, or before function definition), that packet_write_gently()
> > is not thread-safe (nor reentrant, but the latter does not matter here,
> > I think).
> > 
> > Thread-safe vs reentrant: http://stackoverflow.com/a/33445858/46058
> > 
> > This is not something terribly important; I guess git code has tons
> > of functions not marked as thread-unsafe...
> 
> I agree that the function is not thread-safe. However, I can't find 
> an example in the Git source that marks a function as not thread-safe.
> Unless is it explicitly stated in the coding guidelines I would prefer
> not to start way to mark functions.

I'd agree. A large number of functions in git are not reentrant, and I
would not want to give the impression that those missing a warning are
safe to use.

> >> +	if (size > sizeof(packet_write_buffer) - 4) {
> > 
> > First, wouldn't the following be more readable:
> > 
> >  +	if (size + 4 > LARGE_PACKET_MAX) {
> 
> Peff suggested that here:
> http://public-inbox.org/git/20160810132814.gqnipsdwyzjmuqjy@sigill.intra.peff.net/

There is a good reason to do size checks as a subtraction from a known
quantity: you can be sure that you are not introducing an overflow
(e.g., Jakub's suggestion does the wrong thing when "size" is within 4
bytes of its maximum value). That's unlikely in this case, but then so
is the size exceeding LARGE_PACKET_MAX in the first place (arguably this
should be a die("BUG"), because it is the caller's responsibility to
split their packets.

-Peff

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-27  8:14       ` Lars Schneider
@ 2016-09-27  9:00         ` Jeff King
  2016-09-27 12:10           ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jeff King @ 2016-09-27  9:00 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, git, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

On Tue, Sep 27, 2016 at 10:14:16AM +0200, Lars Schneider wrote:

> >>> +		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
> >>> +		paket_len = packet_read(fd_in, NULL, NULL,
> >>> +			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);
> [...]
> After looking at it with fresh eyes I think the existing code is probably correct,
> but maybe a bit confusing.
> 
> packet_read() adds a '\0' at the end of the destination buffer:
> https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/pkt-line.c#L206
> 
> That is why the destination buffer needs to be one byte larger than the expected content.
> 
> However, in this particular case that wouldn't be necessary because the destination
> buffer is a 'strbuf' that allocates an extra byte for '\0' at the end. But we are not
> supposed to write to this extra byte:
> https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/strbuf.h#L25-L31

Right. The allocation happens as part of strbuf_grow(), but whatever
fills the buffer is expected to write the actual NUL (either manually,
or by calling strbuf_setlen().

I see the bit you quoted warns not to touch the extra byte yourself,
though I wonder if that is a bit heavy-handed (I guess it would matter
if we moved the extra 1-byte growth into strbuf_setlen(), but I find
that a rather unlikely change).

That being said, why don't you just use LARGE_PACKET_MAX here? It is
already the accepted size for feeding to packet_read(), and we know it
has enough space to hold a NUL terminator. Yes, we may over-allocate by
4 bytes, but that isn't really relevant. Strbufs over-allocate anyway.
So just:

  for (;;) {
        strbuf_grow(sb_out, LARGE_PACKET_MAX);
        packet_len = packet_read(fd_in, NULL, NULL,
	                         sb_out->buf + sb_out->len, LARGE_PACKET_MAX,
				 options);
        if (packet_len <= 0)
                break;
        /*
         * no need for strbuf_setlen() here; packet_read always adds a
         * NUL terminator.
         */
        sb_out->len += packet_len;
  }

You _could_ make the final line of the loop use strbuf_setlen(); it
would just overwrite something we already know is a NUL (and we know
that no extra allocation is necessary).

Also, using LARGE_PACKET_MAX fixes the fact that this patch is using
PKTLINE_DATA_MAXLEN before it is actually defined. :)

You might want to occasionally run:

  git rebase -x make

to make sure all of your incremental steps are valid (or even "make
test" if you are extremely patient; I often do that once after a big
round of complex interactive-rebase reordering).

> I see two options:
> 
> 
> (1) I leave the +1 as is and add a comment why the extra byte is necessary.
> 
>     Pro: No change in existing code necessary
>     Con: The destination buffer has two '\0' at the end.
> 
> 
> (2) I add an option PACKET_READ_DISABLE_NUL_TERMINATION. If the option is
>     set then no '\0' byte is added to the end.
> 
>     Pro: Correct solution, no byte wasted.
>     Con: Change in existing code required.
> 
> 
> Any preference?

Of the two, I prefer (1), though I like what I suggested above even more
(big surprise, I know).

-Peff

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-27  9:00         ` Jeff King
@ 2016-09-27 12:10           ` Lars Schneider
  2016-09-27 12:13             ` Jeff King
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-27 12:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narębski, git, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 27 Sep 2016, at 11:00, Jeff King <peff@peff.net> wrote:
> 
> On Tue, Sep 27, 2016 at 10:14:16AM +0200, Lars Schneider wrote:
> 
>>>>> +		strbuf_grow(sb_out, PKTLINE_DATA_MAXLEN+1);
>>>>> +		paket_len = packet_read(fd_in, NULL, NULL,
>>>>> +			sb_out->buf + sb_out->len, PKTLINE_DATA_MAXLEN+1, options);
>> [...]
>> After looking at it with fresh eyes I think the existing code is probably correct,
>> but maybe a bit confusing.
>> 
>> packet_read() adds a '\0' at the end of the destination buffer:
>> https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/pkt-line.c#L206
>> 
>> That is why the destination buffer needs to be one byte larger than the expected content.
>> 
>> However, in this particular case that wouldn't be necessary because the destination
>> buffer is a 'strbuf' that allocates an extra byte for '\0' at the end. But we are not
>> supposed to write to this extra byte:
>> https://github.com/git/git/blob/21f862b498925194f8f1ebe8203b7a7df756555b/strbuf.h#L25-L31
> 
> Right. The allocation happens as part of strbuf_grow(), but whatever
> fills the buffer is expected to write the actual NUL (either manually,
> or by calling strbuf_setlen().
> 
> I see the bit you quoted warns not to touch the extra byte yourself,
> though I wonder if that is a bit heavy-handed (I guess it would matter
> if we moved the extra 1-byte growth into strbuf_setlen(), but I find
> that a rather unlikely change).
> 
> That being said, why don't you just use LARGE_PACKET_MAX here? It is
> already the accepted size for feeding to packet_read(), and we know it
> has enough space to hold a NUL terminator. Yes, we may over-allocate by
> 4 bytes, but that isn't really relevant. Strbufs over-allocate anyway.

TBH in that case I would prefer the "PKTLINE_DATA_MAXLEN+1" solution with
an additional comment explaining "+1".

Would that be OK for you?

I am not worried about the extra 4 bytes. I am worried that we make it harder
to see what is going on if we use LARGE_PACKET_MAX.


> So just:
> 
>  for (;;) {
>        strbuf_grow(sb_out, LARGE_PACKET_MAX);
>        packet_len = packet_read(fd_in, NULL, NULL,
> 	                         sb_out->buf + sb_out->len, LARGE_PACKET_MAX,
> 				 options);
>        if (packet_len <= 0)
>                break;
>        /*
>         * no need for strbuf_setlen() here; packet_read always adds a
>         * NUL terminator.
>         */
>        sb_out->len += packet_len;
>  }
> 
> You _could_ make the final line of the loop use strbuf_setlen(); it
> would just overwrite something we already know is a NUL (and we know
> that no extra allocation is necessary).
> 
> Also, using LARGE_PACKET_MAX fixes the fact that this patch is using
> PKTLINE_DATA_MAXLEN before it is actually defined. :)

Yeah, I noticed that too and fixed it in v9 :-) Thanks for the reminder!


> You might want to occasionally run:
> 
>  git rebase -x make
> 
> to make sure all of your incremental steps are valid (or even "make
> test" if you are extremely patient; I often do that once after a big
> round of complex interactive-rebase reordering).

That is a good suggestion. I'll add that to my "tool-belt" :-)


Thanks,
Lars


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams
  2016-09-27 12:10           ` Lars Schneider
@ 2016-09-27 12:13             ` Jeff King
  0 siblings, 0 replies; 71+ messages in thread
From: Jeff King @ 2016-09-27 12:13 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, git, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

On Tue, Sep 27, 2016 at 02:10:50PM +0200, Lars Schneider wrote:

> > That being said, why don't you just use LARGE_PACKET_MAX here? It is
> > already the accepted size for feeding to packet_read(), and we know it
> > has enough space to hold a NUL terminator. Yes, we may over-allocate by
> > 4 bytes, but that isn't really relevant. Strbufs over-allocate anyway.
> 
> TBH in that case I would prefer the "PKTLINE_DATA_MAXLEN+1" solution with
> an additional comment explaining "+1".
> 
> Would that be OK for you?
> 
> I am not worried about the extra 4 bytes. I am worried that we make it harder
> to see what is going on if we use LARGE_PACKET_MAX.

I guess I don't feel to strongly either way. My interest in
LARGE_PACKET_MAX is mostly that this is how all the rest of the
packet_read() callers behave.

-Peff

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
  2016-09-26 22:41   ` Jakub Narębski
@ 2016-09-27 15:37   ` Jakub Narębski
  2016-09-30 19:38     ` Lars Schneider
  2016-09-28 23:14   ` Jakub Narębski
  2 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-27 15:37 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

Part second of the review of 11/11.

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:

> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 0000000..c13a631
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl
[...]
> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";

What I would like to see here is some kind of packet_read_list()
or packet_txt_read_list() that reads until flush packet or EOF,
and returns list of chomp-ed lines (without LF terminator).

Then you can examine those lines:

   my @lines = packet_read_list();
   $lines[0] eq "git-filter-client"      or die "bad initialization: '$lines[0]'";
   grep { $_ eq "version=2" } @lines     or die "bad version: version=2 not found";

Note: I have not checked that I got operator precedence right.

> +
> +packet_txt_write("git-filter-server");
> +packet_txt_write("version=2");

Here using packet_write_list() would help us to not forget to
send the flush packet:

  packet_write_list(
  	"git-filter-server",
  	"version=2"
  );

[...]
> diff --git a/convert.c b/convert.c
> index 597f561..bd66257 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  #include "quote.h"
>  #include "sigchain.h"
> +#include "pkt-line.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -442,7 +443,7 @@ static int filter_buffer_or_fd(int in, int out, void *data)
>  	return (write_err || status);
>  }
>  
> -static int apply_filter(const char *path, const char *src, size_t len, int fd,
> +static int apply_single_file_filter(const char *path, const char *src, size_t len, int fd,
>                          struct strbuf *dst, const char *cmd)
>  {
>  	/*
> @@ -456,12 +457,6 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	struct async async;
>  	struct filter_params params;
>  
> -	if (!cmd || !*cmd)
> -		return 0;
> -
> -	if (!dst)
> -		return 1;
> -
>  	memset(&async, 0, sizeof(async));
>  	async.proc = filter_buffer_or_fd;
>  	async.data = &params;

I have reordered a few chunks of this patch to make it easier to
see what happens here, and to review this part of patch.

[moved here from further in the patch]
> +static int apply_filter(const char *path, const char *src, size_t len,
> +                        int fd, struct strbuf *dst, struct convert_driver *drv,
> +                        const unsigned int wanted_capability)
> +{
> +	const char *cmd = NULL;
> +
> +	if (!drv)
> +		return 0;
> +
> +	if (!dst)
> +		return 1;

To reduce the size of this patch (which is not yet of the size that
would make vger reject the email :-/), perhaps this split into 
apply_single_file_filter() and apply_filter(), without yet adding
apply_multi_file_filter(). The apply_filter() would be semi-simple
wrapper, with the same signature as above.

> +
> +	if (!drv->process && (CAP_CLEAN & wanted_capability) && drv->clean)

This is just a very minor nitpicking, but wouldn't it be easier
to read with those checks reordered?

  +	if ((wanted_capability & CAP_CLEAN) && !drv->process && drv->clean)

That is: if we want `clean` capability, and `process` is not set,
and there is `clean` filter.

Though the following would also work...

  +	if ((wanted_capability & CAP_CLEAN) && !drv->process)

> +		cmd = drv->clean;
> +	else if (!drv->process && (CAP_SMUDGE & wanted_capability) && drv->smudge)
> +		cmd = drv->smudge;
> +
> +	if (cmd && *cmd)

... thanks to this check (the 'cmd' part, which we need to check anyway).

'cmd = drv->clean', then 'if (cmd)' is the same as 'if (drv->clean)',
then 'cmd = drv->clean', then 'if (cmd)', isn't it.

Not sure if it would be more readable, or less readable.

CAP_CLEAN and CAP_SMUDGE are, in theory, mutually exclusive.  Note that
the above order prefers `smudge` to `clean` if both given, while in other
places we prefer `clean` to `smudge` if both given.


> +		return apply_single_file_filter(path, src, len, fd, dst, cmd);
> +	else if (drv->process && *drv->process)
> +		return apply_multi_file_filter(path, src, len, fd, dst, drv->process, wanted_capability);
> +
> +	return 0;
> +}

Nice and clean wrapper.

> +

[moved here from further in the patch]
> @@ -839,7 +1140,7 @@ int would_convert_to_git_filter_fd(const char *path)
>  	if (!ca.drv->required)
>  		return 0;
>  
> -	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
> +	return apply_filter(path, NULL, 0, -1, NULL, ca.drv, CAP_CLEAN);
>  }
>  
>  const char *get_convert_attr_ascii(const char *path)

This would also be a part of move adding apply_single_file_filter()
and converting apply_filter() to wrapper around single / multi file
filters.

> @@ -493,14 +488,317 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
>  	return !err;
>  }
>  
> +#define CAP_CLEAN    (1u<<0)
> +#define CAP_SMUDGE   (1u<<1)

It's a pity that ANSI C does not include binary constants, like e.g.
modern Perl, that is '0b0001u' and '0b0010u'; we could use hexadecimal
constants '0x01u' and '0x02u', but perhaps the above is more readable,
and as performant.

> +
> +struct cmd2process {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	unsigned int supported_capabilities;
> +	const char *cmd;
> +	struct child_process process;
> +};
> +
> +static int cmd_process_map_initialized;
> +static struct hashmap cmd_process_map;
> +
> +static int cmd2process_cmp(const struct cmd2process *e1,
> +                           const struct cmd2process *e2,
> +                           const void *unused)
> +{
> +	return strcmp(e1->cmd, e2->cmd);
> +}
> +
> +static struct cmd2process *find_multi_file_filter_entry(struct hashmap *hashmap, const char *cmd)
> +{
> +	struct cmd2process key;
> +	hashmap_entry_init(&key, strhash(cmd));
> +	key.cmd = cmd;
> +	return hashmap_get(hashmap, &key, NULL);
> +}


All right, basic hashmap for the list of command processes, here
so that we can find correct driver for current file, and reuse it
if it was started already.

I see that git code does not use /* ............ */ etc to separate
subsections / subparts of a file, so I won't ask for it ;-(

> +
> +static void kill_multi_file_filter(struct hashmap *hashmap, struct cmd2process *entry)
> +{
> +	if (!entry)
> +		return;
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	/*
> +	 * We kill the filter most likely because an error happened already.
> +	 * That's why we are not interested in any error code here.
> +	 */

Good explanation.

> +	close(entry->process.in);
> +	close(entry->process.out);
> +	sigchain_pop(SIGPIPE);
> +	finish_command(&entry->process);
> +	hashmap_remove(hashmap, entry, NULL);
> +	free(entry);
> +}

That's more 'kill_and_remove_...', but that would make too long
function name.

Small and readable.  Nice.

> +
> +static int packet_write_list(int fd, const char *line, ...)
> +{
> +	va_list args;
> +	int err;
> +	va_start(args, line);
> +	for (;;) {
> +		if (!line)
> +			break;
> +		if (strlen(line) > PKTLINE_DATA_MAXLEN)

Here we see that having PKTLINE_DATA_MAXLEN (or LARGE_PACKET_DATA_MAX)
constant is quite useful.

> +			return -1;
> +		err = packet_write_fmt_gently(fd, "%s\n", line);

I wonder if adding the fact that we are writing text packets
to function name would be worth it.  Nah.  Also, it is file-local
(static) function.

> +		if (err)
> +			return err;
> +		line = va_arg(args, const char*);
> +	}
> +	va_end(args);
> +	return packet_flush_gently(fd);
> +}

Nice abstraction.

> +
> +static struct cmd2process *start_multi_file_filter(struct hashmap *hashmap, const char *cmd)
> +{
> +	int err;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	const char *argv[] = { cmd, NULL };
> +	struct string_list cap_list = STRING_LIST_INIT_NODUP;
> +	char *cap_buf;
> +	const char *cap_name;
> +
> +	entry = xmalloc(sizeof(*entry));
> +	hashmap_entry_init(entry, strhash(cmd));
> +	entry->cmd = cmd;
> +	entry->supported_capabilities = 0;
> +	process = &entry->process;
> +
> +	child_process_init(process);
> +	process->argv = argv;
> +	process->use_shell = 1;
> +	process->in = -1;
> +	process->out = -1;
> +
> +	if (start_command(process)) {
> +		error("cannot fork to run external filter '%s'", cmd);
> +		kill_multi_file_filter(hashmap, entry);
> +		return NULL;
> +	}

I guess there is a reason why we init hashmap entry, try to start
external process, then kill entry of unable to start, instead of
trying to start external process, and adding hashmap entry when
we succeed?

> +
> +	sigchain_push(SIGPIPE, SIG_IGN);

I guess that this is here to handle errors writing to filter
by ourself, isn't it?

> +
> +	err = packet_write_list(process->in, "git-filter-client", "version=2", NULL);
> +	if (err)
> +		goto done;

Ugh, error / exception handling in C.

> +
> +	err = strcmp(packet_read_line(process->out, NULL), "git-filter-server");
> +	if (err) {
> +		error("external filter '%s' does not support long running filter protocol", cmd);
> +		goto done;
> +	}
> +	err = strcmp(packet_read_line(process->out, NULL), "version=2");
> +	if (err)

We could have described the error here better.

  +		error("external filter '%s' does not support filter protocol version 2", cmd);

But this is probably not necessary; it should be rare to find
filter process that support the protocol halfway right.

> +		goto done;

I guess this would get more complicated if/when there is need
for new version of the protocol.

Shouldn't we read flush packet here?  Ah, sorry, we know that we
should get only two lines from the `process` filter driver, and
not variable number of lines, so there is no need to flush here.

Disregard my comments about lack of flush packet in the example
of long running filter script.  Well, unless the protocol itself
would get adjusted to always use flush packet to terminate set
of metadata lines, even if number of lines is fixed.

> +
> +	err = packet_write_list(process->in, "clean=true", "smudge=true", NULL);

So I see that Git sends all capabilities it supports, not only
those that given git command needs (which might be hard to find
out).

If it were possible at this point of code for Git to know, for
example, that it would only do `clean` operation, shouldn't it
write "clean=true", "smudge=false"? ;-PPP

Note that this "=true" is totally spurious.  Maybe "capability=clean",
or just "clean" would make a better protocol?

> +
> +	for (;;) {
> +		cap_buf = packet_read_line(process->out, NULL);
> +		if (!cap_buf)
> +			break;
> +		string_list_split_in_place(&cap_list, cap_buf, '=', 1);
> +
> +		if (cap_list.nr != 2 || strcmp(cap_list.items[1].string, "true"))
> +			continue;
> +
> +		cap_name = cap_list.items[0].string;
> +		if (!strcmp(cap_name, "clean")) {
> +			entry->supported_capabilities |= CAP_CLEAN;
> +		} else if (!strcmp(cap_name, "smudge")) {
> +			entry->supported_capabilities |= CAP_SMUDGE;
> +		} else {
> +			warning(
> +				"external filter '%s' requested unsupported filter capability '%s'",
> +				cmd, cap_name
> +			);
> +		}
> +
> +		string_list_clear(&cap_list, 0);
> +	}

I guess there is a reason why it was not extracted into helper
function?

Well, both because handling of variable-length response, where
multiple lines must be analyzed, happens only once, and also
because returning list of variable-length strings in C is hard
(alloca? string_list?).

> +
> +done:
> +	sigchain_pop(SIGPIPE);
> +
> +	if (err || errno == EPIPE) {
> +		error("initialization for external filter '%s' failed", cmd);
> +		kill_multi_file_filter(hashmap, entry);
> +		return NULL;
> +	}

Good.

> +
> +	hashmap_add(hashmap, entry);
> +	return entry;
> +}
> +
> +static void read_multi_file_filter_values(int fd, struct strbuf *status) {

This is more

  +static void read_multi_file_filter_status(int fd, struct strbuf *status) {

It doesn't read arbitrary values, it examines 'metadata' from
filter for "status=<foo>" lines.

> +	struct strbuf **pair;

Shouldn't it be initialized to NULL, like in strbuf_split_buf()
code?

> +	char *line;
> +	for (;;) {
> +		line = packet_read_line(fd, NULL);
> +		if (!line)
> +			break;
> +		pair = strbuf_split_str(line, '=', 2);

Why, oh why, there is no Documentation/technical/api-strbuf.txt?
Well, strbuf.h is really well commented... but perhaps not enough.

> +		if (pair[0] && pair[0]->len && pair[1]) {
> +			if (!strcmp(pair[0]->buf, "status=")) {
> +				strbuf_reset(status);
> +				strbuf_addbuf(status, pair[1]);
> +			}

So it is last status=<foo> line wins behavior?

> +		}

Shouldn't we free 'struct strbuf **pair', maybe allocated by the
strbuf_split_str() function, and reset to NULL?

> +	}
> +}
> +
> +static int apply_multi_file_filter(const char *path, const char *src, size_t len,
> +                                   int fd, struct strbuf *dst, const char *cmd,
> +                                   const unsigned int wanted_capability)
> +{
> +	int err;
> +	struct cmd2process *entry;
> +	struct child_process *process;
> +	struct stat file_stat;
> +	struct strbuf nbuf = STRBUF_INIT;

This name doesn't tell us much, but I guess there is precedence?

> +	struct strbuf filter_status = STRBUF_INIT;
> +	char *filter_type;
> +
> +	if (!cmd_process_map_initialized) {
> +		cmd_process_map_initialized = 1;
> +		hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0);
> +		entry = NULL;
> +	} else {
> +		entry = find_multi_file_filter_entry(&cmd_process_map, cmd);
> +	}
> +
> +	fflush(NULL);

Why this fflush(NULL) is needed here?

> +
> +	if (!entry) {
> +		entry = start_multi_file_filter(&cmd_process_map, cmd);
> +		if (!entry)
> +			return 0;
> +	}
> +	process = &entry->process;

All right, we start process filter, or get existing instance.

> +
> +	if (!(wanted_capability & entry->supported_capabilities))
> +		return 0;

If filter doesn't support wanted capability, then Git just
wouldn't filter.  Looks good to me.

> +
> +	if (CAP_CLEAN & wanted_capability)
> +		filter_type = "clean";
> +	else if (CAP_SMUDGE & wanted_capability)
> +		filter_type = "smudge";
> +	else
> +		die("unexpected filter type");

This should never happen; we should always request one of those
capabilities, and only one.

> +
> +	if (fd >= 0 && !src) {
> +		if (fstat(fd, &file_stat) == -1)
> +			return 0;
> +		len = xsize_t(file_stat.st_size);
> +	}

Errr... is it necessary?  The protocol no longer provides size=<n>
hint, and neither uses such hint if provided.

> +
> +	sigchain_push(SIGPIPE, SIG_IGN);

Right, we want to handle errors ourself.

> +
> +	err = strlen(filter_type) > PKTLINE_DATA_MAXLEN;
> +	if (err)
> +		goto done;

Errr... this should never happen.  We control which capabilities
we pass, it can be only "clean" or "smudge", nothing else. Those
would always be shorter than PKTLINE_DATA_MAXLEN.

Never mind that that is "command=smudge\n" etc. that needs to
be shorter that PKTLINE_DATA_MAXLEN!

So, IMHO it should be at most assert, and needs to be corrected
anyway.

> +
> +	err = packet_write_fmt_gently(process->in, "command=%s\n", filter_type);
> +	if (err)
> +		goto done;
> +
> +	err = strlen(path) > PKTLINE_DATA_MAXLEN;

Actually

  +	err = strlen(path) > PKTLINE_DATA_MAXLEN - strlen("pathname=\n");

This version was chosen in the very unlikely case if
strlen(path) + strlen("pathname=\n") would overflow.

  +	err = strlen("pathname=") + strlen(path) + strlen("\n") > PKTLINE_DATA_MAXLEN;

;-)

> +	if (err)
> +		goto done;

This should never happen, PATH_MAX everywhere is much shorter
than PKTLINE_DATA_MAXLEN / LARGE_PACKET_MAX.  Or is it?

Anyway, we should probably explain or warn

   		error("path name too long: '%s'", path);

Though if length of pathname is of the order of 2^16, I don't
think printing it would help :-)

> +
> +	err = packet_write_fmt_gently(process->in, "pathname=%s\n", path);
> +	if (err)
> +		goto done;
> +
> +	err = packet_flush_gently(process->in);
> +	if (err)
> +		goto done;

All right, this list of values, currently composed of "command=<sth>"
and "pathname=<sth>" - both of which are required, may be variable
length, so we need flush packet.

> +
> +	if (fd >= 0)
> +		err = write_packetized_from_fd(fd, process->in);
> +	else
> +		err = write_packetized_from_buf(src, len, process->in);
> +	if (err)
> +		goto done;

Looks good, and I think it is better if the caller decided rather
that write_packetized(fd, src, len, process->in) deciding.

Note for implementers: write in full, read in full, no streaming
support (Git doesn't start to read filter output until it writes
to filter in full).  This is opposed to what `clean` and `smudge`
filters support.

> +
> +	read_multi_file_filter_values(process->out, &filter_status);
> +	err = strcmp(filter_status.buf, "success");
> +	if (err)
> +		goto done;
> +
> +	err = read_packetized_to_buf(process->out, &nbuf) < 0;
> +	if (err)
> +		goto done;
> +
> +	read_multi_file_filter_values(process->out, &filter_status);
> +	err = strcmp(filter_status.buf, "success");

Looks good to me (LGTM).

> +
> +done:
> +	sigchain_pop(SIGPIPE);
> +
> +	if (err || errno == EPIPE) {
> +		if (!strcmp(filter_status.buf, "error")) {
> +			/* The filter signaled a problem with the file. */
> +		} else if (!strcmp(filter_status.buf, "abort")) {
> +			/*
> +			 * The filter signaled a permanent problem. Don't try to filter
> +			 * files with the same command for the lifetime of the current
> +			 * Git process.
> +			 */
> +			 entry->supported_capabilities &= ~wanted_capability;
> +		} else {
> +			/*
> +			 * Something went wrong with the protocol filter.
> +			 * Force shutdown and restart if another blob requires filtering!

Is this exclamation mark '!' here necessary?

> +			 */
> +			error("external filter '%s' failed", cmd);
> +			kill_multi_file_filter(&cmd_process_map, entry);
> +		}

Looks good.  Three error conditions: resumable error from filter,
failure of filter (kill, would restart if necessary), and abort.

> +	} else {
> +		strbuf_swap(dst, &nbuf);
> +	}
> +	strbuf_release(&nbuf);
> +	return !err;

I guess this is for `filter.<driver>.required` to handle correctly
filter error-ing out, or filter failing, while not aborting
if filter simply doesn't support `clean` or `smudge` capability.

> +}
> +
>  static struct convert_driver {
>  	const char *name;
>  	struct convert_driver *next;
>  	const char *smudge;
>  	const char *clean;
> +	const char *process;

LGTM.

>  	int required;
>  } *user_convert, **user_convert_tail;
>  

[a section of chunk moved up]

>  static int read_convert_config(const char *var, const char *value, void *cb)
>  {
>  	const char *key, *name;
> @@ -538,6 +836,9 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>  	if (!strcmp("clean", key))
>  		return git_config_string(&drv->clean, var, value);
>  
> +	if (!strcmp("process", key))
> +		return git_config_string(&drv->process, var, value);
> +

LGTM.

>  	if (!strcmp("required", key)) {
>  		drv->required = git_config_bool(var, value);
>  		return 0;

[a chunk of diff moved up]

> @@ -872,18 +1173,12 @@ int convert_to_git(const char *path, const char *src, size_t len,
>                     struct strbuf *dst, enum safe_crlf checksafe)
>  {
>  	int ret = 0;
> -	const char *filter = NULL;

All right, this was (I think) moved into apply_filter()...

> -	int required = 0;

...but this looks like just a removal of a temporary variable,
which could have been done in a separate preparatory patch.

>  	struct conv_attrs ca;
>  
>  	convert_attrs(&ca, path);
> -	if (ca.drv) {
> -		filter = ca.drv->clean;
> -		required = ca.drv->required;
> -	}
>  
> -	ret |= apply_filter(path, src, len, -1, dst, filter);
> -	if (!ret && required)
> +	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN);
> +	if (!ret && ca.drv && ca.drv->required)
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);

Looks good.  (And could be a part of patch adding apply_filter()
as wrapper.)

>  
>  	if (ret && dst) {
> @@ -905,9 +1200,9 @@ void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
>  	convert_attrs(&ca, path);
>  
>  	assert(ca.drv);
> -	assert(ca.drv->clean);
> +	assert(ca.drv->clean || ca.drv->process);

Hmmm... asserts.  Well, they were here.

>  
> -	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
> +	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN))
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>  

Looks good.  (And could be a part of patch adding apply_filter()
as wrapper.)

>  	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
> @@ -919,15 +1214,9 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  					    int normalizing)
>  {
>  	int ret = 0, ret_filter = 0;
> -	const char *filter = NULL;
> -	int required = 0;
>  	struct conv_attrs ca;
>  
>  	convert_attrs(&ca, path);
> -	if (ca.drv) {
> -		filter = ca.drv->smudge;
> -		required = ca.drv->required;
> -	}

Well, this is the same change as a bit eaelier.

>  
>  	ret |= ident_to_worktree(path, src, len, dst, ca.ident);
>  	if (ret) {
> @@ -936,9 +1225,10 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  	}
>  	/*
>  	 * CRLF conversion can be skipped if normalizing, unless there
> -	 * is a smudge filter.  The filter might expect CRLFs.
> +	 * is a smudge or process filter (even if the process filter doesn't
> +	 * support smudge).  The filters might expect CRLFs.
>  	 */
> -	if (filter || !normalizing) {
> +	if ((ca.drv && (ca.drv->smudge || ca.drv->process)) || !normalizing) {
>  		ret |= crlf_to_worktree(path, src, len, dst, ca.crlf_action);
>  		if (ret) {
>  			src = dst->buf;
> @@ -946,8 +1236,8 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  		}
>  	}
>  
> -	ret_filter = apply_filter(path, src, len, -1, dst, filter);
> -	if (!ret_filter && required)
> +	ret_filter = apply_filter(path, src, len, -1, dst, ca.drv, CAP_SMUDGE);
> +	if (!ret_filter && ca.drv && ca.drv->required)
>  		die("%s: smudge filter %s failed", path, ca.drv->name);
>  
>  	return ret | ret_filter;

Looks good to me.  I understand ca.drv is checked so that ca.drv->required
and/or ca.drv->smudge / ca.drv->clean / ca.drv->process can be safely
checked, isn't it.

> @@ -1399,7 +1689,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
>  	struct stream_filter *filter = NULL;
>  
>  	convert_attrs(&ca, path);
> -	if (ca.drv && (ca.drv->smudge || ca.drv->clean))
> +	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))

LGTM, understandable change.

>  		return NULL;
>  
>  	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
> diff --git a/pkt-line.h b/pkt-line.h
> index 6df8449..3d873f3 100644
> --- a/pkt-line.h
> +++ b/pkt-line.h
> @@ -86,6 +86,7 @@ ssize_t read_packetized_to_buf(int fd_in, struct strbuf *sb_out);
>  
>  #define DEFAULT_PACKET_MAX 1000
>  #define LARGE_PACKET_MAX 65520
> +#define PKTLINE_DATA_MAXLEN (LARGE_PACKET_MAX - 4)

What the... didn't you use PKTLINE_DATA_MAXLEN in one of
earlier patches in this series?  How this even...?

>  extern char packet_buffer[LARGE_PACKET_MAX];
>  
>  #endif
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index dc50938..210c4f6 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh

I'll stop here, and I'll finish the review later.

To be continued,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 06/11] pkt-line: add packet_write_gently()
  2016-09-27  8:39       ` Jeff King
@ 2016-09-27 19:33         ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-27 19:33 UTC (permalink / raw)
  To: Jeff King, Lars Schneider
  Cc: git, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

W dniu 27.09.2016 o 10:39, Jeff King pisze:
> On Mon, Sep 26, 2016 at 09:21:10PM +0200, Lars Schneider wrote:
> 
>> On 25 Sep 2016, at 13:26, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>>> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> ...
>>>>
>>>> +static int packet_write_gently(const int fd_out, const char *buf, size_t size)
>>>
>>> I'm not sure what naming convention the rest of Git uses, but isn't
>>> it more like '*data' rather than '*buf' here?
>>
>> pkt-line seems to use 'buf' or 'buffer' for everything else.
> 
> I do not think we have definite rules, but I would generally expect to
> see "data" as an opaque thing (e.g., passing "void *data" to callbacks).
> "buf" or "buffer" makes sense here, but I don't think it really matters
> that much either way.

True.

>>>> +	static char packet_write_buffer[LARGE_PACKET_MAX];
>>>
>>> I think there should be warning (as a comment before function
>>> declaration, or before function definition), that packet_write_gently()
>>> is not thread-safe (nor reentrant, but the latter does not matter here,
>>> I think).
>>>
>>> Thread-safe vs reentrant: http://stackoverflow.com/a/33445858/46058
>>>
>>> This is not something terribly important; I guess git code has tons
>>> of functions not marked as thread-unsafe...
>>
>> I agree that the function is not thread-safe. However, I can't find 
>> an example in the Git source that marks a function as not thread-safe.
>> Unless is it explicitly stated in the coding guidelines I would prefer
>> not to start way to mark functions.

There is *one* example: "fill_textconv is not remotely thread-safe;"
comment in grep.c, but not in diff.{c,h} where it is declared/defined.

Also, it is static function; we should know if it is thread-safe
or not.

I am thinking about supporting streaming in the future, and perhaps
also running different filter drivers (for different files) in parallel.
I guess that using "static __thread char packet_write_buffer[...]"
is out of question (still not reentrant)?

> 
> I'd agree. A large number of functions in git are not reentrant, and I
> would not want to give the impression that those missing a warning are
> safe to use.

The fact tha git code is undercommented and underdocumented does not
mean that we should not add comments and documentation.

> 
>>>> +	if (size > sizeof(packet_write_buffer) - 4) {
>>>
>>> First, wouldn't the following be more readable:
>>>
>>>  +	if (size + 4 > LARGE_PACKET_MAX) {
>>
>> Peff suggested that here:
>> http://public-inbox.org/git/20160810132814.gqnipsdwyzjmuqjy@sigill.intra.peff.net/
> 
> There is a good reason to do size checks as a subtraction from a known
> quantity: you can be sure that you are not introducing an overflow
> (e.g., Jakub's suggestion does the wrong thing when "size" is within 4
> bytes of its maximum value). That's unlikely in this case, but then so
> is the size exceeding LARGE_PACKET_MAX in the first place (arguably this
> should be a die("BUG"), because it is the caller's responsibility to
> split their packets.

Right.  I should train myself to watch for overflows.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
                   ` (10 preceding siblings ...)
  2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
@ 2016-09-28 21:49 ` Junio C Hamano
  2016-09-29 10:28   ` Lars Schneider
  11 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2016-09-28 21:49 UTC (permalink / raw)
  To: larsxschneider; +Cc: git, peff, sbeller, jnareb, mlbright, tboegi, ramsay

I suspect that you are preparing a reroll already, but the one that
is sitting in 'pu' seems to be flaky in t/t0021 and I seem to see
occasional failures from it.

I didn't trace where the test goes wrong, but one easy mistake you
could make (I am not saying that is the reason of the failure) is to
assume your filter will not be called under certain condition (like
immediately after you checked out from the index to the working
tree), when the automated test goes fast enough and get you into a
"racy git" situation---the filter may be asked to filter the
contents from the working tree again to re-validate what's there is
still what is in the index.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
  2016-09-26 22:41   ` Jakub Narębski
  2016-09-27 15:37   ` Jakub Narębski
@ 2016-09-28 23:14   ` Jakub Narębski
  2016-10-01 15:34     ` Lars Schneider
  2 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-28 23:14 UTC (permalink / raw)
  To: Lars Schneider, git
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Martin-Louis Bright,
	Torsten Bögershausen, Ramsay Jones

Part third (and last) of the review of v8 11/11.

W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com napisał:
[...]
> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
> new file mode 100755
> index 0000000..c13a631
> --- /dev/null
> +++ b/contrib/long-running-filter/example.pl
[...]
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index dc50938..210c4f6 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh

One thing that could have been done as yet another preparatory
patch would be to modernize existing t/t0021-conversion.sh tests.
For example use here-doc instead of series of echo-s, use cp
to copy files and not echo, etc.

> @@ -31,7 +31,10 @@ test_expect_success setup '
>  	cat test >test.i &&
>  	git add test test.t test.i &&
>  	rm -f test test.t test.i &&
> -	git checkout -- test test.t test.i
> +	git checkout -- test test.t test.i &&
> +
> +	echo "content-test2" >test2.o &&
> +	echo "content-test3 - subdir" >"test3 - subdir.o"

I see that you prepare here a few uncommitted files, but both
their names and their contents leave much to be desired - you
don't know from the name and contents what they are for.

And the '"subdir"' file which is not in subdirectory is
especially egregious.

>  '
>  
>  script='s/^\$Id: \([0-9a-f]*\) \$/\1/p'
> @@ -279,4 +282,364 @@ test_expect_success 'diff does not reuse worktree files that need cleaning' '
>  	test_line_count = 0 count
>  '
>  

A small comment on parameters of this function would be nice.

> +check_filter () {
> +	rm -f rot13-filter.log actual.log &&
> +	"$@" 2> git_stderr.log &&
> +	test_must_be_empty git_stderr.log &&
> +	cat >expected.log &&

This is too clever by half.  Having a function that both tests
the behavior and prepares 'expected' file is too much.

In my opinion preparation of 'expected.log' file should be moved
to another function or functions.

Also, if we are running sort on output, I think we should also
run sort on 'expected.log', so that what we write doesn't need to
be created sorted (so we don't have to sort expected lines by hand).
Or maybe we should run the same transformation on rot13-filter.log
and on the contents of expected.log.

> +	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" >actual.log &&
> +	test_cmp expected.log actual.log
> +}
> +
> +check_filter_count_clean () {
> +	rm -f rot13-filter.log actual.log &&
> +	"$@" 2> git_stderr.log &&
> +	test_must_be_empty git_stderr.log &&

All those functions (well, wait?) have common setup, which we can
extract into separate shell function, I think.  IMVHO.

> +	cat >expected.log &&
> +	sort rot13-filter.log | uniq -c | sed "s/^[ ]*//" |
> +		sed "s/^\([0-9]\) IN: clean/x IN: clean/" >actual.log &&
> +	test_cmp expected.log actual.log
> +}
> +
> +check_filter_ignore_clean () {
> +	rm -f rot13-filter.log actual.log &&
> +	"$@" &&

Why we don't check for stderr here?

> +	cat >expected.log &&
> +	grep -v "IN: clean" rot13-filter.log >actual.log &&
> +	test_cmp expected.log actual.log
> +}
> +
> +check_filter_no_call () {
> +	rm -f rot13-filter.log &&
> +	"$@" 2> git_stderr.log &&
> +	test_must_be_empty git_stderr.log &&
> +	test_must_be_empty rot13-filter.log
> +}
> +

A small comment on parameters of this function would be nice.
And a comment what it does.

> +check_rot13 () {
> +	test_cmp "$1" "$2" &&
> +	./../rot13.sh <"$1" >expected &&

Why there is .. in this invocation?

> +	git cat-file blob :"$2" >actual &&
> +	test_cmp expected actual
> +}
> +
> +test_expect_success PERL 'required process filter should filter data' '
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&

Don't you think that creating a fresh test repository for each
separate test is a bit too much?  I guess that you want for
each and every test to be completely independent, but this setup
and teardown is a bit excessive.

Other tests in the same file (should we reuse the test, or use
new test file) do not use this method.

> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		git branch empty &&

Err... I think it would be better to name it 'empty-branch'
(or 'almost-empty-branch', as it does include .gitattributes file).
See my mistake below (marked <del>...</del>).

> +
> +		cp ../test.o test.r &&
> +		cp ../test2.o test2.r &&

What does this test2.o / test2.r file tests, that test.o / test.r
doesn't?  The name doesn't tell us.

Why it is test.r, but test2.r?  Why it isn't test1.r?

> +		mkdir testsubdir &&
> +		cp "../test3 - subdir.o" "testsubdir/test3 - subdir.r" &&

Why it needs to have different contents?

I guess that you test two things here: file in a subdirectory,
and file with spaces in names.  Shouldn't it be better split
into two separate test files?

> +		>test4-empty.r &&

You test ordinary file, file in subdirectory, file with filename
containing spaces, and an empty file.

Other tests of single file `clean`/`smudge` filters use filename
that requires mangling; maybe we should use similar file?

        special="name  with '\''sq'\'' and \$x" &&
        echo some test text >"$special" &&

In case of `process` filter, a special filename could look like
this:

        process_special="name=with equals and\nembedded newlines\n" &&
        echo some test text >"$process_special" &&

> +
> +		check_filter \
> +			git add . \

I assume that this kind of test is here also to check that
we are not regressing / backsliding, and we do not start to
run "clean" operation more than once per file for "git add",
isn't it?

> +				<<-\EOF &&
> +					1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
> +					1 IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
> +					1 IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
> +					1 IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
> +					1 START
> +					1 STOP
> +					1 wrote filter header
> +				EOF

First, this indentation level confirms that the check_filter
function is too clever by half, and that preparing expected.log
file should be a separate step.

Second, if we run "sort" on contents to be in expected.log, we
can write it in more natural, and less fragile way:

  +		sort >expected.log <<-\EOF &&
  +			1 START
  +			1 wrote filter header
  +			1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
  +			1 IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
  +			1 IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
  +			1 IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
  +			1 STOP
  +		EOF

Third, why the filter even writes output size? It is no longer
part of `process` filter driver protocol, and it makes test more
fragile.

If we are to keep sizes, then to make test less fragile with
respect to changes in contents of tested files, we should use
variables containing file size:

   		test_r_size=$(wc -c test.r)
   		...
   		sort >expected.log <<-EOF &&
   		...
   			1 IN: clean test.r $test_r_size [OK] -- OUT: $test_r_size . [OK]

> +
> +		check_filter_count_clean \
> +			git commit . -m "test commit" \

I guess that you use "git commit ." (not very visible this '.')
in order to force cleaning of all files, isn't it?

Use of *_count_clean function is here, from what I remember,
because 'git commit .' sometimes call `clean` multiple times
for the same file (?), and sometimes it calls `smudge` (probably
as part of some optimization?).

I guess that fixing "git commit" so that calls clean operation
at most once per file is left for a separate patch series; this
one is long enough and involved enough as it is.

> +				<<-\EOF &&
> +					x IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
> +					x IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
> +					x IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
> +					x IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
> +					1 START
> +					1 STOP
> +					1 wrote filter header
> +				EOF
> +
> +		rm -f test?.r "testsubdir/test3 - subdir.r" &&

Why 'test?.r' when we are removing only 'test2.r'; why not be explicit?

> +
> +		check_filter_ignore_clean \
> +			git checkout . \
> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
> +					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]

Ah, I see that there are no shenningans for `clean`
operation, calling op multiple time for single file.

> +					STOP
> +				EOF
> +
> +		check_filter_ignore_clean \
> +			git checkout empty \

<del>
First, isn't it test4-empty.r?  Trying to check out non-existent
file should not run filter, isn't it?  How the heck this passed???
There is no branch 'empty'.

Second, the one-shot filter tests have empty-in-worktree and
empty-in-repo files; why not reuse them?
</del>

My mistake, but the branch is named a bit strange.

> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					STOP
> +				EOF

Why is even filter process invoked?  If this is not expected, perhaps
simply ignore what checking out almost empty branch (one without any
files marked for filtering) does.

Shouldn't we test_expect_failure no-call?

> +
> +		check_filter_ignore_clean \
> +			git checkout master \

Does this checks different code path than 'git checkout .'? For
example, does this test increase code coverage (e.g. as measured
by gcov)?  If not, then this test could be safely dropped.

> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
> +					IN: smudge test4-empty.r 0 [OK] -- OUT: 0  [OK]
> +					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]

Can we assume that Git would pass files to filter in alphabetical
order?  This assumption might make the test unnecessary fragile.

> +					STOP
> +				EOF
> +
> +		check_rot13 ../test.o test.r &&
> +		check_rot13 ../test2.o test2.r &&
> +		check_rot13 "../test3 - subdir.o" "testsubdir/test3 - subdir.r"

All right.

> +	)
> +'
> +
> +test_expect_success PERL 'required process filter should clean only and take precedence' '

Trying to describe it better results in overly long description,
which probably means that this test should be split into few
smaller ones:

 - `process` filter takes precedence over `clean` and/or `smudge`
   filters, regardless if it supports relevant ("clean" or "smudge")
   capability or not

 - `process` filter that includes only "clean" capability should
   clean only (be used only for 'clean' operation)

 - required process filter should do something (???)
   

> +	test_config_global filter.protocol.clean ./../rot13.sh &&
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean" &&
> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		git branch empty &&
> +
> +		cp ../test.o test.r &&
> +
> +		check_filter \
> +			git add . \
> +				<<-\EOF &&
> +					1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
> +					1 START
> +					1 STOP
> +					1 wrote filter header
> +				EOF
> +
> +		check_filter_count_clean \
> +			git commit . -m "test commit" \

Is this part really necessary?  I think it duplicates what we
have tested earlier, and would not catch any new errors.  Removing
spurious/redundant tests results in faster testsuite, which is
quite important.

> +				<<-\EOF
> +					x IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
> +					1 START
> +					1 STOP
> +					1 wrote filter header
> +				EOF

And this test checks only the first one from the list.
Well, actually the first part, without "regardless if it
supports relevant ('clean' [...]) capability or not".

> +	)
> +'
> +

In my opinion all functions should be placed at beginning,
or even in separate file (if they are used in more than
one test).

> +generate_test_data () {

The name is not good, it doesn't describe what kind of data
we want to generate.

> +	LEN=$1
> +	NAME=$2
> +	test-genrandom end $LEN |

Why do you use 'end' as <seed_string> parameter to test-genrandom?

> +		perl -pe "s/./chr((ord($&) % 26) + 97)/sge" >../$NAME.file &&

Those constants (26 and 97) are a bit cryptic; magical constants.
I guess this is

  +		perl -pe "s/./chr((ord($&) % (ord('z') - ord('a') + 1) + ord('a'))/sge" >../$NAME.file &&

or

  +		perl -pe "s/./chr((ord($&) % 26 + ord('a'))/sge" >../$NAME.file &&

That is, convert to a-z range (why not ASCII printable characters,
that is characters from ' ' / chr(32) to '~' / chr(126), which is
95 characters instead of 26?)

I guess this is so we can be sure that rot13 filter would work
(note: the filter is defined for A-Za-z, not only a-z, never
the mind pass-through for other characters).

> +	cp ../$NAME.file . &&

Do we re-generate this file each time?

> +	./../rot13.sh <../$NAME.file >../$NAME.file.rot13

Anyway, I wonder if taking the last two lines out of the function
(as they are not about _generating_ a file) would make it more
readable or not.

> +}
> +
> +test_expect_success PERL 'required process filter should process multiple packets' '
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> +	test_config_global filter.protocol.required true &&
> +
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		# Generate data that requires 3 packets
> +		PKTLINE_DATA_MAXLEN=65516 &&

Shouldn't this be set once per whole test?

> +
> +		generate_test_data $(($PKTLINE_DATA_MAXLEN        )) 1pkt_1__ &&
> +		generate_test_data $(($PKTLINE_DATA_MAXLEN     + 1)) 2pkt_1+1 &&
> +		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2 - 1)) 2pkt_2-1 &&
> +		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2    )) 2pkt_2__ &&
> +		generate_test_data $(($PKTLINE_DATA_MAXLEN * 2 + 1)) 3pkt_2+1 &&

Looks good to me.

> +
> +		echo "*.file filter=protocol" >.gitattributes &&
> +		check_filter \
> +			git add *.file .gitattributes \

Should it be shell expansion, or git expansion, that is

   			git add '*.file' .gitattributes


> +				<<-\EOF &&
> +					1 IN: clean 1pkt_1__.file 65516 [OK] -- OUT: 65516 . [OK]
> +					1 IN: clean 2pkt_1+1.file 65517 [OK] -- OUT: 65517 .. [OK]
> +					1 IN: clean 2pkt_2-1.file 131031 [OK] -- OUT: 131031 .. [OK]
> +					1 IN: clean 2pkt_2__.file 131032 [OK] -- OUT: 131032 .. [OK]
> +					1 IN: clean 3pkt_2+1.file 131033 [OK] -- OUT: 131033 ... [OK]

I think it would be better for those sizes to be calculated,
not entered by hand.  Though in this case this doesn't matter
much - it would always be this size.

> +					1 START
> +					1 STOP
> +					1 wrote filter header
> +				EOF
> +		git commit . -m "test commit" &&

Is this needed / necessary?

> +
> +		rm -f *.file &&
> +		git checkout -- *.file &&

Is this necessary?  I guess this checks that it doesn't crash, but
we do not check that smudge operation works correctly, as we did
for clean.

> +
> +		for f in *.file
> +		do
> +			git cat-file blob :$f >actual &&
> +			test_cmp ../$f.rot13 actual
> +		done

Wasn't there helper function for this?

> +	)
> +'
> +
> +test_expect_success PERL 'required process filter should with clean error should fail' '
                                                     ^^^^^^                  ^^^^^^

Errr... what?  You have 'should' twice here.

Also, does it matter that the error is during clean operation?
We don't test that error during smudge operation is handled in
the same way, do we?

> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&

Do we need to pass 'clean smudge', or does it provide both by
default?

> +	test_config_global filter.protocol.required true &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cp ../test.o test.r &&
> +		echo "this is going to fail" >clean-write-fail.r &&
> +		echo "content-test3-subdir" >test3.r &&
> +
> +		# Note: There are three clean paths in convert.c we just test one here.

What does this comment is about?  What 'three clean paths'?

> +		test_must_fail git add .
> +	)
> +'
> +
> +test_expect_success PERL 'process filter should restart after unexpected write failure' '
> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cp ../test.o test.r &&
> +		cp ../test2.o test2.r &&

Note that the preparation step is almost the same, and we
repeat it over, and over, and over (no shell function for
this; and we always do full setup / teardown).

> +		echo "this is going to fail" >smudge-write-fail.o &&
> +		cat smudge-write-fail.o >smudge-write-fail.r &&

This cat is cp.

> +		git add . &&
> +		git commit . -m "test commit" &&

You don't need to commit for 'git checkout <path>' (e.g. for .)
or 'git cat-file -p :<file>' to work.

> +		rm -f *.r &&
> +
> +		check_filter_ignore_clean \
> +			git checkout . \
> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					IN: smudge smudge-write-fail.r 22 [OK] -- OUT: 22 [WRITE FAIL]
> +					START
> +					wrote filter header
> +					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
> +					STOP
> +				EOF
> +
> +		check_rot13 ../test.o test.r &&
> +		check_rot13 ../test2.o test2.r &&

Looks good.

> +
> +		! test_cmp smudge-write-fail.o smudge-write-fail.r && # Smudge failed!
> +		./../rot13.sh <smudge-write-fail.o >expected &&
> +		git cat-file blob :smudge-write-fail.r >actual &&
> +		test_cmp expected actual							  # Clean worked!

This is almost negation of check_rot13 - perhaps a helper function
would help here (check_not_rot13?).

Also, what this comment is about, and why so far to the right?

> +	)
> +'
> +
> +test_expect_success PERL 'process filter should not restart in case of an error' '

Errr... what? This description is not clear.  Did you mean
that filter should not be restarted if it *signals* an error
with file (either before sending anything, or after sending
partial contents)?

> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cp ../test.o test.r &&
> +		cp ../test2.o test2.r &&
> +		echo "this will cause an error" >error.o &&
> +		cp error.o error.r &&

And here you (correctly) use cp, and not cat.

> +		git add . &&
> +		git commit . -m "test commit" &&
> +		rm -f *.r &&
> +
> +		check_filter_ignore_clean \
> +			git checkout . \
> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					IN: smudge error.r 25 [OK] -- OUT: 0 [ERROR]
> +					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
> +					STOP
> +				EOF
> +
> +		check_rot13 ../test.o test.r &&
> +		check_rot13 ../test2.o test2.r &&
> +		test_cmp error.o error.r

Looks good to me.

> +	)
> +'
> +
> +test_expect_success PERL 'process filter should be able to signal an error for all future files' '

Did you mean here that filter can abort processing of
all future files?

> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cp ../test.o test.r &&
> +		cp ../test2.o test2.r &&
> +		echo "error this blob and all future blobs" >abort.o &&
> +		cp abort.o abort.r &&
> +		git add . &&
> +		git commit . -m "test commit" &&
> +		rm -f *.r &&
> +
> +		check_filter_ignore_clean \
> +			git checkout . \
> +				<<-\EOF &&
> +					START
> +					wrote filter header
> +					IN: smudge abort.r 37 [OK] -- OUT: 0 [ABORT]
> +					STOP

How can we know that 'abort' file is processed first?
Though more resilent solution would be harder to create...

> +				EOF
> +
> +		test_cmp ../test.o test.r &&
> +		test_cmp ../test2.o test2.r &&
> +		test_cmp abort.o abort.r
> +	)
> +'
> +
> +test_expect_success PERL 'invalid process filter must fail (and not hang!)' '
> +	test_config_global filter.protocol.process cat &&

We could use rot13.sh, that is one-shot filter here.

> +	test_config_global filter.protocol.required true &&

All right, filter is required to easier distinguish it not
working from not filtered.

> +	rm -rf repo &&
> +	mkdir repo &&
> +	(
> +		cd repo &&
> +		git init &&
> +
> +		echo "*.r filter=protocol" >.gitattributes &&
> +
> +		cp ../test.o test.r &&
> +		test_must_fail git add . 2> git_stderr.log &&
> +		grep "not support long running filter protocol" git_stderr.log

Shouldn't this use gettext poison (or rather C locale)?
This error message could be translated in the future.

> +	)
> +'
> +
>  test_done

I wonder how does the code coverage for the new v2 filter
code looks like...

Anyway, I think it would be good idea to write at the beginning
of new tests (be they in old test, or in new test) what we want
to test:

 - that 'clean' and 'smudge' operations are invoked, for all
   possible combinations (covering all code paths), and that
   filter is invoked only once
 - that special types of files work:
   * empty file (in worktree, in index, in repo)
   * file in subdirectory
   * filename with special characters
   * large file (test marked as EXPENSIVE), multiple maximum
     packet size
   * perhaps binary file?
 - that 'process' overrides old-style 'clean' and 'smudge'
   filters, regardless of the former capabilities
 - that limiting capabilities works
 - that requiring filter works correctly (doubles number of
   tests, at least for a subset of them)
 - that filter is restarted if it fails on non-required,
   fails git command if required
 - that filter can error out out of filtering a file,
   upfront and after partial contents, without restart;
   fails git command if required
 - that filter can abort,
   fails git command if required (?)

> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl
> new file mode 100755
> index 0000000..8958f71
> --- /dev/null
> +++ b/t/t0021/rot13-filter.pl
> @@ -0,0 +1,191 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git filter protocol version 2
> +# See Documentation/gitattributes.txt, section "Filter Protocol"
> +#
> +# The script takes the list of supported protocol capabilities as
> +# arguments ("clean", "smudge", etc).
> +#
> +# This implementation supports special test cases:
> +# (1) If data with the pathname "clean-write-fail.r" is processed with
> +#     a "clean" operation then the write operation will die.
> +# (2) If data with the pathname "smudge-write-fail.r" is processed with
> +#     a "smudge" operation then the write operation will die.
> +# (3) If data with the pathname "error.r" is processed with any
> +#     operation then the filter signals that it cannot or does not want
> +#     to process the file.
> +# (4) If data with the pathname "abort.r" is processed with any
> +#     operation then the filter signals that it cannot or does not want
> +#     to process the file and any file after that is processed with the
> +#     same command.

Nice to have this description.

BTW. why write-fail is per operation (clean or smudge), but error and abort
is not?


> +#
> +
> +use strict;
> +use warnings;

I guess there is some duplication with the code in contrib, isn't it?

> +
> +my $MAX_PACKET_CONTENT_SIZE = 65516;
> +my @capabilities            = @ARGV;
> +
> +open my $debug, ">>", "rot13-filter.log";

   	or die "cannot open file for appending: $!";

Good, three argument open.  Bad (?), not error handling.

> +
> +sub rot13 {
> +    my ($str) = @_;
   ^^^^

Why 4 spaces, and not TAB character?


I think

   	my $str = shift;

is more idiomatic Perl.

> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;

Why not use tr/// version of this quote-like operation?
Or do you follow prior art here?

> +    return $str;
> +}
> +
> +sub packet_bin_read {
> +    my $buffer;
> +    my $bytes_read = read STDIN, $buffer, 4;
> +    if ( $bytes_read == 0 ) {
> +
> +        # EOF - Git stopped talking to us!
> +        print $debug "STOP\n";
> +        exit();
> +    }
> +    elsif ( $bytes_read != 4 ) {
> +        die "invalid packet size '$bytes_read' field";

Errr, $bytes_read is not packet size field.  It is $buffer.
Also, error message looks strange

   		invalid packet size '004' field

Shouldn't it be at end?

> +    }
> +    my $pkt_size = hex($buffer);

$pkt_size greater than $MAX_PACKET_CONTENT_SIZE is also an error,
as is sizes 1-3 (not that it matters much, at least here).

> +    if ( $pkt_size == 0 ) {
> +        return ( 1, "" );

It feels a bit strange to me to return list instead of hashref,
but this is a matter of opinion.

> +    }
> +    elsif ( $pkt_size > 4 ) {
> +        my $content_size = $pkt_size - 4;
> +        $bytes_read = read STDIN, $buffer, $content_size;
> +        if ( $bytes_read != $content_size ) {
> +            die "invalid packet ($content_size expected; $bytes_read read)";

It would read, strangely

   		   "invalid packet (8 expected, 7 read)"

The "size" or "bytes" is missing from this output.

> +        }
> +        return ( 0, $buffer );
> +    }
> +    else {
> +        die "invalid packet size";

Is keep-alive packet valid ("0004")?

> +    }
> +}
> +
> +sub packet_txt_read {
> +    my ( $res, $buf ) = packet_bin_read();
> +    unless ( $buf =~ /\n$/ ) {
> +        die "A non-binary line SHOULD BE terminated by an LF.";

First, if SHOULD BE, then perhaps 'warn' not 'die'... though for
tests it is probably better to 'die'.

Second, we should probably print (a fragment of) this line.

> +    }
> +    return ( $res, substr( $buf, 0, -1 ) );

Same comment as for example file in contrib/ - use s/// and no
need for substr stuff.

> +}
> +
> +sub packet_bin_write {
> +    my ($packet) = @_;
> +    print STDOUT sprintf( "%04x", length($packet) + 4 );
> +    print STDOUT $packet;
> +    STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +    packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +    print STDOUT sprintf( "%04x", 0 );
> +    STDOUT->flush();
> +}

Looks good to me (though same comments as to contrib/ file applies).

> +
> +print $debug "START\n";
> +$debug->flush();
> +
> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
> +
> +packet_txt_write("git-filter-server");
> +packet_txt_write("version=2");
> +
> +( packet_txt_read() eq ( 0, "clean=true" ) )  || die "bad capability";
> +( packet_txt_read() eq ( 0, "smudge=true" ) ) || die "bad capability";
> +( packet_bin_read() eq ( 1, "" ) )            || die "bad capability end";
> +
> +foreach (@capabilities) {
> +    packet_txt_write( $_ . "=true" );
> +}
> +packet_flush();
> +print $debug "wrote filter header\n";

Or perhaps "handshake end"?

> +$debug->flush();
> +
> +while (1) {
> +    my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
> +    print $debug "IN: $command";
> +    $debug->flush();
> +
> +    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;

All right, here list context is necessary.

> +    print $debug " $pathname";

No " pathname=$pathname" ?

> +    $debug->flush();
> +
> +    # Flush
> +    packet_bin_read();

Same comment as earlier: read_flush, or read_varlist (till flush)
to have would be better.

> +
> +    my $input = "";
> +    {
> +        binmode(STDIN);
> +        my $buffer;
> +        my $done = 0;
> +        while ( !$done ) {
> +            ( $done, $buffer ) = packet_bin_read();
> +            $input .= $buffer;
> +        }
> +        print $debug " " . length($input) . " [OK] -- ";
> +        $debug->flush();
> +    }
> +
> +    my $output;
> +    if ( $pathname eq "error.r" or $pathname eq "abort.r" ) {
> +        $output = "";
> +    }
> +    elsif ( $command eq "clean" and grep( /^clean$/, @capabilities ) ) {
> +        $output = rot13($input);
> +    }
> +    elsif ( $command eq "smudge" and grep( /^smudge$/, @capabilities ) ) {
> +        $output = rot13($input);
> +    }
> +    else {
> +        die "bad command '$command'";
> +    }
> +
> +    print $debug "OUT: " . length($output) . " ";

Shouldn't we write the length of output only if we don't error out,
abort, or fail?

> +    $debug->flush();
> +
> +    if ( $pathname eq "error.r" ) {
> +        print $debug "[ERROR]\n";
> +        $debug->flush();
> +        packet_txt_write("status=error");
> +        packet_flush();
> +    }
> +    elsif ( $pathname eq "abort.r" ) {
> +        print $debug "[ABORT]\n";
> +        $debug->flush();
> +        packet_txt_write("status=abort");
> +        packet_flush();
> +    }

Looks good, so this is upfront status=error or status-abort.

> +    else {
> +        packet_txt_write("status=success");
> +        packet_flush();
> +
> +        if ( $pathname eq "${command}-write-fail.r" ) {
> +            print $debug "[WRITE FAIL]\n";
> +            $debug->flush();
> +            die "${command} write error";
> +        }
> +
> +        while ( length($output) > 0 ) {
> +            my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
> +            packet_bin_write($packet);
> +            print $debug ".";

All right, so number of dots is the number of packets.  This is
surprisingly opaque.

> +            if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
> +                $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
> +            }
> +            else {
> +                $output = "";
> +            }
> +        }
> +        packet_flush();
> +        print $debug " [OK]\n";
> +        $debug->flush();
> +        packet_flush();

Should we test partial contents case?  Or failure during printing?
What happens then - is file cleared by Git, or left partially converted?

> +    }
> +}
> 

Keep up good work.  Looks quite good.
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt()
  2016-09-26 18:49     ` Lars Schneider
@ 2016-09-28 23:15       ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-09-28 23:15 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

W dniu 26.09.2016 o 20:49, Lars Schneider pisze: 
> On 24 Sep 2016, at 23:14, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>>
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>
>>> packet_write() should be called packet_write_fmt() as the string
>>> parameter can be formatted.
>>
>> I would say:
>>
>>  packet_write() should be called packet_write_fmt() because it
>>  is printf-like function where first parameter is format string.
>>
>> Or something like that.  But such minor change might be not worth
>> yet another reroll of this patch series.
>>
>> Perhaps it would be a good idea to explain the reasoning behind
>> this change:
>>
>>  This is important distinction to know from the name if the
>>  function accepts arbitrary binary data and/or arbitrary
>>  strings to be written - packet_write[_fmt()] do not.
> 
> packet_write() should be called packet_write_fmt() because it is a
> printf-like function that takes a format string as first parameter.
> 
> packet_write_fmt() should be used for text strings only. Arbitrary
> binary data should use a new packet_write() function that is introduced
> in a subsequent patch.
> 
> Better?

Better.

> 
>>> pkt-line.h               |  2 +-
>>> shallow.c                |  2 +-
>>> upload-pack.c            | 30 +++++++++++++++---------------
>>> 11 files changed, 29 insertions(+), 29 deletions(-)
>>
>> Diffstat looks correct.  Was the patch generated by doing search
>> and replace?
> 
> Yes.

Good.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-28 21:49 ` [PATCH v8 00/11] Git filter protocol Junio C Hamano
@ 2016-09-29 10:28   ` Lars Schneider
  2016-09-29 11:57     ` Torsten Bögershausen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-29 10:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Jakub Narębski,
	Martin-Louis Bright, Torsten Bögershausen, ramsay


> On 28 Sep 2016, at 23:49, Junio C Hamano <gitster@pobox.com> wrote:
> 
> I suspect that you are preparing a reroll already, but the one that
> is sitting in 'pu' seems to be flaky in t/t0021 and I seem to see
> occasional failures from it.
> 
> I didn't trace where the test goes wrong, but one easy mistake you
> could make (I am not saying that is the reason of the failure) is to
> assume your filter will not be called under certain condition (like
> immediately after you checked out from the index to the working
> tree), when the automated test goes fast enough and get you into a
> "racy git" situation---the filter may be asked to filter the
> contents from the working tree again to re-validate what's there is
> still what is in the index.

Thanks for the heads-up! 

This is what happens:

1) Git exits
2) The filter process receives EOF and prints "STOP" to the log
3) t0021 checks the content of the log

Sometimes 3 happened before 2 which makes the test fail.
(Example: https://travis-ci.org/git/git/jobs/162660563 )

I added a this to wait until the filter process terminates:

+wait_for_filter_termination () {
+	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
+	do
+		echo "Waiting for /t0021/rot13-filter.pl to finish..."
+		sleep 1
+	done
+}

Does this look OK to you?

- Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 10:28   ` Lars Schneider
@ 2016-09-29 11:57     ` Torsten Bögershausen
  2016-09-29 16:57       ` Junio C Hamano
  2016-09-29 20:59       ` Jakub Narębski
  0 siblings, 2 replies; 71+ messages in thread
From: Torsten Bögershausen @ 2016-09-29 11:57 UTC (permalink / raw)
  To: Lars Schneider, Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Jakub Narębski,
	Martin-Louis Bright, ramsay



On 29/09/16 12:28, Lars Schneider wrote:
>> On 28 Sep 2016, at 23:49, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> I suspect that you are preparing a reroll already, but the one that
>> is sitting in 'pu' seems to be flaky in t/t0021 and I seem to see
>> occasional failures from it.
>>
>> I didn't trace where the test goes wrong, but one easy mistake you
>> could make (I am not saying that is the reason of the failure) is to
>> assume your filter will not be called under certain condition (like
>> immediately after you checked out from the index to the working
>> tree), when the automated test goes fast enough and get you into a
>> "racy git" situation---the filter may be asked to filter the
>> contents from the working tree again to re-validate what's there is
>> still what is in the index.
> Thanks for the heads-up!
>
> This is what happens:
>
> 1) Git exits
> 2) The filter process receives EOF and prints "STOP" to the log
> 3) t0021 checks the content of the log
>
> Sometimes 3 happened before 2 which makes the test fail.
> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>
> I added a this to wait until the filter process terminates:
>
> +wait_for_filter_termination () {
> +	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
> +	do
> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
> +		sleep 1
> +	done
> +}
>
> Does this look OK to you?
Do we need the ps at all ?
How about this:

+wait_for_filter_termination () {
+	while ! grep "STOP"  LOGFILENAME >/dev/null
+	do
+		echo "Waiting for /t0021/rot13-filter.pl to finish..."
+		sleep 1
+	done
+}



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 11:57     ` Torsten Bögershausen
@ 2016-09-29 16:57       ` Junio C Hamano
  2016-09-29 17:57         ` Lars Schneider
                           ` (2 more replies)
  2016-09-29 20:59       ` Jakub Narębski
  1 sibling, 3 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-09-29 16:57 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Torsten Bögershausen <tboegi@web.de> writes:

>> 1) Git exits
>> 2) The filter process receives EOF and prints "STOP" to the log
>> 3) t0021 checks the content of the log
>>
>> Sometimes 3 happened before 2 which makes the test fail.
>> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>>
>> I added a this to wait until the filter process terminates:
>>
>> +wait_for_filter_termination () {
>> +	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
>> +	do
>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>> +		sleep 1
>> +	done
>> +}
>>
>> Does this look OK to you?
> Do we need the ps at all ?
> How about this:
>
> +wait_for_filter_termination () {
> +	while ! grep "STOP"  LOGFILENAME >/dev/null
> +	do
> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
> +		sleep 1
> +	done
> +}

Running "ps" and grepping for a command is not suitable for script
to reliably tell things, so it is out of question.  Compared to
that, your version looks slightly better, but what if the machinery
that being tested, i.e. the part that drives the filter process, is
buggy or becomes buggy and causes the filter process that writes
"STOP" to die before it actually writes that string?

I have a feeling that the machinery being tested needs to be fixed
so that the sequence is always be:

    0) Git spawns the filter process, as it needs some contents to
       be filtered.

    1) Git did everything it needed to do and decides that is time
       to go.

    2) Filter process receives EOF and prints "STOP" to the log.

    3) Git waits until the filter process finishes.

    4) t0021, after Git finishes, checks the log.

Repeated sleep combined with grep is probably just sweeping the real
problem under the rug.  Do we have enough information to do the
above?

An inspiration may be in the way we centrally clean all tempfiles
and lockfiles before exiting.  We have a central registry of these
files that need cleaning up and have a single atexit(3) handler to
clean them up.  Perhaps we need a registry that filter processes
spawned by the mechanism Lars introduces in this series, and have an
atexit(3) handler that closes the pipe to them (which signals the
filters that it is time for them to go) and wait(2) on them, or
something?  I do not think we want any kill(2) to be involved in
this clean-up procedure, but I do think we should wait(2) on what we
spawn, as long as these processes are meant to be shut down when the
main process of Git exits (this is different from things like
credential-cache daemon where they are expected to persist and meant
to serve multiple Git processes).



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 16:57       ` Junio C Hamano
@ 2016-09-29 17:57         ` Lars Schneider
  2016-09-29 18:18           ` Torsten Bögershausen
  2016-09-29 21:27           ` Junio C Hamano
  2016-09-29 18:02         ` Jeff King
  2016-09-29 20:50         ` Lars Schneider
  2 siblings, 2 replies; 71+ messages in thread
From: Lars Schneider @ 2016-09-29 17:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay


> On 29 Sep 2016, at 18:57, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>>> 1) Git exits
>>> 2) The filter process receives EOF and prints "STOP" to the log
>>> 3) t0021 checks the content of the log
>>> 
>>> Sometimes 3 happened before 2 which makes the test fail.
>>> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>>> 
>>> I added a this to wait until the filter process terminates:
>>> 
>>> +wait_for_filter_termination () {
>>> +	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
>>> +	do
>>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>>> +		sleep 1
>>> +	done
>>> +}
>>> 
>>> Does this look OK to you?
>> Do we need the ps at all ?
>> How about this:
>> 
>> +wait_for_filter_termination () {
>> +	while ! grep "STOP"  LOGFILENAME >/dev/null
>> +	do
>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>> +		sleep 1
>> +	done
>> +}
> 
> Running "ps" and grepping for a command is not suitable for script
> to reliably tell things, so it is out of question.  Compared to
> that, your version looks slightly better, but what if the machinery
> that being tested, i.e. the part that drives the filter process, is
> buggy or becomes buggy and causes the filter process that writes
> "STOP" to die before it actually writes that string?
> 
> I have a feeling that the machinery being tested needs to be fixed
> so that the sequence is always be:
> 
>    0) Git spawns the filter process, as it needs some contents to
>       be filtered.
> 
>    1) Git did everything it needed to do and decides that is time
>       to go.
> 
>    2) Filter process receives EOF and prints "STOP" to the log.
> 
>    3) Git waits until the filter process finishes.
> 
>    4) t0021, after Git finishes, checks the log.
> 
> Repeated sleep combined with grep is probably just sweeping the real
> problem under the rug.  Do we have enough information to do the
> above?
> 
> An inspiration may be in the way we centrally clean all tempfiles
> and lockfiles before exiting.  We have a central registry of these
> files that need cleaning up and have a single atexit(3) handler to
> clean them up.  Perhaps we need a registry that filter processes
> spawned by the mechanism Lars introduces in this series, and have an
> atexit(3) handler that closes the pipe to them (which signals the
> filters that it is time for them to go) and wait(2) on them, or
> something?  I do not think we want any kill(2) to be involved in
> this clean-up procedure, but I do think we should wait(2) on what we
> spawn, as long as these processes are meant to be shut down when the
> main process of Git exits (this is different from things like
> credential-cache daemon where they are expected to persist and meant
> to serve multiple Git processes).

We discussed that issue in v4 and v6:
http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/
http://public-inbox.org/git/xmqqbn0a3wy3.fsf@gitster.mtv.corp.google.com/

My impression was that you don't want Git to wait for the filter process.
If Git waits for the filter process - how long should Git wait?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 16:57       ` Junio C Hamano
  2016-09-29 17:57         ` Lars Schneider
@ 2016-09-29 18:02         ` Jeff King
  2016-09-29 21:19           ` Junio C Hamano
  2016-09-29 20:50         ` Lars Schneider
  2 siblings, 1 reply; 71+ messages in thread
From: Jeff King @ 2016-09-29 18:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, git, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

On Thu, Sep 29, 2016 at 09:57:57AM -0700, Junio C Hamano wrote:

> > +wait_for_filter_termination () {
> > +	while ! grep "STOP"  LOGFILENAME >/dev/null
> > +	do
> > +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
> > +		sleep 1
> > +	done
> > +}
> 
> Running "ps" and grepping for a command is not suitable for script
> to reliably tell things, so it is out of question.  Compared to
> that, your version looks slightly better, but what if the machinery
> that being tested, i.e. the part that drives the filter process, is
> buggy or becomes buggy and causes the filter process that writes
> "STOP" to die before it actually writes that string?

I'm of the opinion that any busy-waiting is a good sign that something
is suboptimal. The right solution here seems like it should be signaling
the test script via a descriptor.

I don't necessarily agree, though, that the timing of filter-process
cleanup needs to be part of the public interface. So in your list:

>     3) Git waits until the filter process finishes.

That seems simple and elegant, but I can think of reasons we might not
want to wait (e.g., if the filter has to do some maintenance task and
does not the user to have to wait).

OTOH, we already face this in git, and we solve it by explicitly
backgrounding the maintenance task (i.e., auto-gc). So one could argue
that it is the responsibility of the filter process to manage its own
processes. It certainly makes the interaction with git simpler.

-Peff

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 17:57         ` Lars Schneider
@ 2016-09-29 18:18           ` Torsten Bögershausen
  2016-09-29 18:38             ` Johannes Sixt
  2016-09-29 21:27           ` Junio C Hamano
  1 sibling, 1 reply; 71+ messages in thread
From: Torsten Bögershausen @ 2016-09-29 18:18 UTC (permalink / raw)
  To: Lars Schneider, Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Jakub Narębski,
	Martin-Louis Bright, ramsay



On 29/09/16 19:57, Lars Schneider wrote:
>> On 29 Sep 2016, at 18:57, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Torsten Bögershausen <tboegi@web.de> writes:
>>
>>>> 1) Git exits
>>>> 2) The filter process receives EOF and prints "STOP" to the log
>>>> 3) t0021 checks the content of the log
>>>>
>>>> Sometimes 3 happened before 2 which makes the test fail.
>>>> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>>>>
>>>> I added a this to wait until the filter process terminates:
>>>>
>>>> +wait_for_filter_termination () {
>>>> +	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
>>>> +	do
>>>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>>>> +		sleep 1
>>>> +	done
>>>> +}
>>>>
>>>> Does this look OK to you?
>>> Do we need the ps at all ?
>>> How about this:
>>>
>>> +wait_for_filter_termination () {
>>> +	while ! grep "STOP"  LOGFILENAME >/dev/null
>>> +	do
>>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>>> +		sleep 1
>>> +	done
>>> +}
>> Running "ps" and grepping for a command is not suitable for script
>> to reliably tell things, so it is out of question.  Compared to
>> that, your version looks slightly better, but what if the machinery
>> that being tested, i.e. the part that drives the filter process, is
>> buggy or becomes buggy and causes the filter process that writes
>> "STOP" to die before it actually writes that string?
>>
>> I have a feeling that the machinery being tested needs to be fixed
>> so that the sequence is always be:
>>
>>     0) Git spawns the filter process, as it needs some contents to
>>        be filtered.
>>
>>     1) Git did everything it needed to do and decides that is time
>>        to go.
>>
>>     2) Filter process receives EOF and prints "STOP" to the log.
>>
>>     3) Git waits until the filter process finishes.
>>
>>     4) t0021, after Git finishes, checks the log.
>>
>> Repeated sleep combined with grep is probably just sweeping the real
>> problem under the rug.  Do we have enough information to do the
>> above?
>>
>> An inspiration may be in the way we centrally clean all tempfiles
>> and lockfiles before exiting.  We have a central registry of these
>> files that need cleaning up and have a single atexit(3) handler to
>> clean them up.  Perhaps we need a registry that filter processes
>> spawned by the mechanism Lars introduces in this series, and have an
>> atexit(3) handler that closes the pipe to them (which signals the
>> filters that it is time for them to go) and wait(2) on them, or
>> something?  I do not think we want any kill(2) to be involved in
>> this clean-up procedure, but I do think we should wait(2) on what we
>> spawn, as long as these processes are meant to be shut down when the
>> main process of Git exits (this is different from things like
>> credential-cache daemon where they are expected to persist and meant
>> to serve multiple Git processes).
> We discussed that issue in v4 and v6:
> http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/
> http://public-inbox.org/git/xmqqbn0a3wy3.fsf@gitster.mtv.corp.google.com/
>
> My impression was that you don't want Git to wait for the filter process.
> If Git waits for the filter process - how long should Git wait?
>
> Thanks,
> Lars

Hm,
I would agree that  Git should not wait for the filter.
But does the test suite need to wait for the filter ?
May be, in this case we test the filter and Git, which is good.
Adding a 1 second delay, if, and only if, there is a racy condition,
is not that bad (or do we have better ways to check for a process to
be terminated ?)



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 18:18           ` Torsten Bögershausen
@ 2016-09-29 18:38             ` Johannes Sixt
  0 siblings, 0 replies; 71+ messages in thread
From: Johannes Sixt @ 2016-09-29 18:38 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Junio C Hamano, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Am 29.09.2016 um 20:18 schrieb Torsten Bögershausen:
> I would agree that  Git should not wait for the filter.
> But does the test suite need to wait for the filter ?

We have fixed a test case on Windows recently where a process hung 
around too long (5babb5bd). So, yes, the test suite has to wait for the 
filter.

-- Hannes


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 16:57       ` Junio C Hamano
  2016-09-29 17:57         ` Lars Schneider
  2016-09-29 18:02         ` Jeff King
@ 2016-09-29 20:50         ` Lars Schneider
  2016-09-29 21:12           ` Junio C Hamano
  2 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-29 20:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay


> On 29 Sep 2016, at 18:57, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Torsten Bögershausen <tboegi@web.de> writes:
> 
>>> 1) Git exits
>>> 2) The filter process receives EOF and prints "STOP" to the log
>>> 3) t0021 checks the content of the log
>>> 
>>> Sometimes 3 happened before 2 which makes the test fail.
>>> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>>> 
>>> I added a this to wait until the filter process terminates:
>>> 
>>> +wait_for_filter_termination () {
>>> +	while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
>>> +	do
>>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>>> +		sleep 1
>>> +	done
>>> +}
>>> 
>>> Does this look OK to you?
>> Do we need the ps at all ?
>> How about this:
>> 
>> +wait_for_filter_termination () {
>> +	while ! grep "STOP"  LOGFILENAME >/dev/null
>> +	do
>> +		echo "Waiting for /t0021/rot13-filter.pl to finish..."
>> +		sleep 1
>> +	done
>> +}
> 
> Running "ps" and grepping for a command is not suitable for script
> to reliably tell things, so it is out of question.  Compared to
> that, your version looks slightly better, but what if the machinery
> that being tested, i.e. the part that drives the filter process, is
> buggy or becomes buggy and causes the filter process that writes
> "STOP" to die before it actually writes that string?
> 
> I have a feeling that the machinery being tested needs to be fixed
> so that the sequence is always be:
> 
>    0) Git spawns the filter process, as it needs some contents to
>       be filtered.
> 
>    1) Git did everything it needed to do and decides that is time
>       to go.
> 
>    2) Filter process receives EOF and prints "STOP" to the log.
> 
>    3) Git waits until the filter process finishes.
> 
>    4) t0021, after Git finishes, checks the log.


A pragmatic approach:

I could drop the "STOP" message that the filter writes to the log
on exit and everything would work as is. We could argue that this 
is OK because Git doesn't care anyways if the filter process has 
stopped or not.

Would that be OK for everyone?

- Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 11:57     ` Torsten Bögershausen
  2016-09-29 16:57       ` Junio C Hamano
@ 2016-09-29 20:59       ` Jakub Narębski
  2016-09-29 21:17         ` Junio C Hamano
  1 sibling, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-09-29 20:59 UTC (permalink / raw)
  To: Torsten Bögershausen, Lars Schneider, Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Martin-Louis Bright, Ramsay Jones

W dniu 29.09.2016 o 13:57, Torsten Bögershausen pisze: 
> On 29/09/16 12:28, Lars Schneider wrote:

>> This is what happens:
>>
>> 1) Git exits
>> 2) The filter process receives EOF and prints "STOP" to the log
>> 3) t0021 checks the content of the log
>>
>> Sometimes 3 happened before 2 which makes the test fail.
>> (Example: https://travis-ci.org/git/git/jobs/162660563 )
>>
>> I added a this to wait until the filter process terminates:
>>
>> +wait_for_filter_termination () {
>> +    while ps | grep -v grep | grep -F "/t0021/rot13-filter.pl" >/dev/null 2>&1
>> +    do
>> +        echo "Waiting for /t0021/rot13-filter.pl to finish..."
>> +        sleep 1
>> +    done
>> +}
>>
>> Does this look OK to you?
> Do we need the ps at all ?
> How about this:
> 
> +wait_for_filter_termination () {
> +    while ! grep "STOP"  LOGFILENAME >/dev/null
> +    do
> +        echo "Waiting for /t0021/rot13-filter.pl to finish..."
> +        sleep 1
> +    done
> +}

Or even better: make filter driver write its pid to pidfile, and then
"wait $(cat rot13-filter.pid)".  That's what we do in lib-git-daemon.sh
(I think).

If the problem is exit status of "wait" builtin, then filter driver
can remove its pidfile after writing "STOP", just before ending.

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 20:50         ` Lars Schneider
@ 2016-09-29 21:12           ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-09-29 21:12 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Lars Schneider <larsxschneider@gmail.com> writes:

> A pragmatic approach:
>
> I could drop the "STOP" message that the filter writes to the log
> on exit and everything would work as is. We could argue that this 
> is OK because Git doesn't care anyways if the filter process has 
> stopped or not.

That would mean you can leave the process running while the test
framework tries to remove the trash directory when we are done,
creating the same bug J6t mentioned in the thread, no?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 20:59       ` Jakub Narębski
@ 2016-09-29 21:17         ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-09-29 21:17 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Torsten Bögershausen, Lars Schneider, git, Jeff King,
	Stefan Beller, Martin-Louis Bright, Ramsay Jones

Jakub Narębski <jnareb@gmail.com> writes:

> Or even better: make filter driver write its pid to pidfile, and then
> "wait $(cat rot13-filter.pid)".  That's what we do in lib-git-daemon.sh
> (I think).

I am not sure if "wait"ing on a random process that is not a direct
child is a reasonable thing to do, but I like the direction.

Communicate with a pidfile and wait until "kill -0 $that_pid" fails,
or something like that, would be clean enough.




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 18:02         ` Jeff King
@ 2016-09-29 21:19           ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-09-29 21:19 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Lars Schneider, git, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Jeff King <peff@peff.net> writes:

> I don't necessarily agree, though, that the timing of filter-process
> cleanup needs to be part of the public interface. So in your list:
>
>>     3) Git waits until the filter process finishes.
>
> That seems simple and elegant, but I can think of reasons we might not
> want to wait (e.g., if the filter has to do some maintenance task and
> does not the user to have to wait).
>
> OTOH, we already face this in git, and we solve it by explicitly
> backgrounding the maintenance task (i.e., auto-gc). So one could argue
> that it is the responsibility of the filter process to manage its own
> processes. It certainly makes the interaction with git simpler.

Yup, that summarizes my thinking a lot better than I managed to do
in the previous message.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 17:57         ` Lars Schneider
  2016-09-29 18:18           ` Torsten Bögershausen
@ 2016-09-29 21:27           ` Junio C Hamano
  2016-10-01 18:59             ` Lars Schneider
  1 sibling, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2016-09-29 21:27 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Lars Schneider <larsxschneider@gmail.com> writes:

> We discussed that issue in v4 and v6:
> http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/
> http://public-inbox.org/git/xmqqbn0a3wy3.fsf@gitster.mtv.corp.google.com/
>
> My impression was that you don't want Git to wait for the filter process.
> If Git waits for the filter process - how long should Git wait?

I am not sure where you got that impression.  I did say that I do
not want Git to _KILL_ my filter process.  That does not mean I want
Git to go away without waiting for me.

If the filter process refuses to die forever when Git told it to
shutdown (by closing the pipe to it, for example), that filter
process is simply buggy.  I think we want users to become aware of
that, instead of Git leaving it behind, which essentially is to
sweep the problem under the rug.

I agree with what Peff said elsewhere in the thread; if a filter
process wants to take time to clean things up while letting Git
proceed, it can do its own process management, but I think it is
sensible for Git to wait the filter process it directly spawned.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-26 22:41   ` Jakub Narębski
@ 2016-09-30 18:56     ` Lars Schneider
  2016-10-04 20:50       ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-30 18:56 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 27 Sep 2016, at 00:41, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> Part first of the review of 11/11.
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> index 7aff940..946dcad 100644
>> --- a/Documentation/gitattributes.txt
>> +++ b/Documentation/gitattributes.txt
>> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the command is
>> fed the blob object from its standard input, and its standard
>> output is used to update the worktree file.  Similarly, the
>> `clean` command is used to convert the contents of worktree file
>> -upon checkin.
>> +upon checkin. By default these commands process only a single
>> +blob and terminate.  If a long running `process` filter is used
>   ^^^^
> 
> Should we use this terminology here?  I have not read the preceding
> part of documentation, so I don't know if it talks about "blobs" or
> if it uses "files" and/or "file contents".

I used that because it was used in the paragraph above already.


>> +Long Running Filter Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +If the filter command (a string value) is defined via
>> +`filter.<driver>.process` then Git can process all blobs with a
>> +single filter invocation for the entire life of a single Git
>> +command. This is achieved by using a packet format (pkt-line,
>> +see technical/protocol-common.txt) based protocol over standard
>> +input and standard output as follows. All packets are considered
>> +text and therefore are terminated by an LF. Exceptions are the
>> +"*CONTENT" packets and the flush packet.
> 
> I guess that reasoning here is that all but CONTENT packets are
> metadata, and thus to aid debuggability of the protocol are "text",
> as considered by pkt-line.
> 
> Perhaps a bit more readable would be the following (but current is
> just fine; I am nitpicking):
> 
>  All packets, except for the "{star}CONTENT" packets and the "0000"
>  flush packer, are considered text and therefore are terminated by
>  a LF.

OK, I use that!


> I think it might be a good idea to describe what flush packet is
> somewhere in this document; on the other hand referring (especially
> if hyperlinked) to pkt-line technical documentation might be good
> enough / better.  I'm unsure, but I tend on the side that referring
> to technical documentation is better.

I have this line in the first paragraph of the Long Running Filter process:
"packet format (pkt-line, see technical/protocol-common.txt) based protocol"

> 
>> +to read a welcome response message ("git-filter-server") and exactly
>> +one protocol version number from the previously sent list. All further
> 
> I guess that is to provide forward-compatibility, isn't it?  Also,
> "Git expects..." probably means filter process MUST send, in the
> RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.

True. I feel "expects" reads better but I am happy to change it if
you feel strong about it.


>> +
>> +After the version negotiation Git sends a list of supported capabilities
>> +and a flush packet.
> 
> Is it that Git SHOULD send list of ALL supported capabilities, or is
> it that Git SHOULD NOT send capabilities it does not support, and that
> it MAY send only those capabilities it needs (so for example if command
> uses only `smudge`, it may not send `clean`, so that filter driver doesn't
> need to initialize data it would not need).

"After the version negotiation Git sends a list of all capabilities that
it supports and a flush packet."

Better?


> I wonder why it is "<capability>=true", and not "capability=<capability>".
> Is there a case where we would want to send "<capability>=false".  Or
> is it to allow configurable / value based capabilities?  Isn't it going
> a bit too far: is there even a hind of an idea for parametrize-able
> capability? YAGNI is a thing...

Peff suggested that format and I think it is OK:
http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi@sigill.intra.peff.net/


> A few new capabilities that we might want to support in the near future
> is "size", "stream", which are options describing how to communicate,
> and "cleanFromFile", "smudgeToFile", which are new types of operations...
> but neither needs any parameter.
> 
> I guess that adding new capabilities doesn't require having to come up
> with the new version of the protocol, isn't it.

Correct.


>> +packet:          git< git-filter-server
>> +packet:          git< version=2
>> +packet:          git> clean=true
>> +packet:          git> smudge=true
>> +packet:          git> not-yet-invented=true
> 
> Hmmm... should we hint at the use of kebab-case versus snake_case
> or camelCase for new capabilities?

I personally prefer kebab-case but I think that is a discussion for
future contributions ;-)


>> +------------------------
>> +packet:          git> command=smudge
>> +packet:          git> pathname=path/testfile.dat
>> +packet:          git> 0000
>> +packet:          git> CONTENT
>> +packet:          git> 0000
>> +------------------------
> 
> I think it is important to mention that (at least with current
> `filter.<driver>.process` implementation, that is absent future
> "stream" capability / option) the filter process needs to read
> *whole contents* at once, *before* writing anything.  Otherwise
> it can lead to deadlock.
> 
> This is especially important in that it is different (!) from the
> current behavior of `clean` and `smudge` filters, which can
> stream their response because Git invokes them async.

I added this:
" Please note, that the filter
must not send any response before it received the content and the
final flush packet. "


>> +
>> +If the filter experiences an error during processing, then it can
>> +send the status "error" after the content was (partially or
>> +completely) sent. Depending on the `filter.<driver>.required` flag
>> +Git will interpret that as error but it will not stop or restart the
>> +filter process.
>> +------------------------
>> +packet:          git< status=success
>> +packet:          git< 0000
>> +packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> +packet:          git< 0000
>> +packet:          git< status=error
>> +packet:          git< 0000
>> +------------------------
> 
> Good.  A question is if the filter process can send "status=abort"
> after partial contents, or does it need to wait for the next command?

I added:
"expected to respond with an "abort" status at any point in
the protocol."


>> +
>> +After the filter has processed a blob it is expected to wait for
>> +the next "key=value" list containing a command. Git will close
>> +the command pipe on exit. The filter is expected to detect EOF
>> +and exit gracefully on its own.
> 
> Good to have it documented.  
> 
> Anyway, as it is Git command that spawns the filter driver process,
> assuming that the filter process doesn't daemonize itself, wouldn't
> the operating system reap it after its parent process, that is the
> git command it invoked, dies? So detecting EOF is good, but not
> strictly necessary for simple filter that do not need to free
> its resources, or can leave freeing resources to the operating
> system? But I may be wrong here.

The filter process runs independent of Git.


>> +
>> +
>> Interaction between checkin/checkout attributes
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 
>> diff --git a/contrib/long-running-filter/example.pl b/contrib/long-running-filter/example.pl
>> new file mode 100755
>> index 0000000..c13a631
>> --- /dev/null
>> +++ b/contrib/long-running-filter/example.pl
> 
> To repeat myself, I think it would serve better as a separate patch.

OK


>> +        die "invalid packet size '$bytes_read' field";
> 
> This would read "invalid packet size '000' field", for example.
> Perhaps the following would be (slightly) better:
> 
>  +        die "invalid packet size field: '$bytes_read'";

OK


>> +    }
>> +    elsif ( $pkt_size > 4 ) {
> 
> Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive
> one?  Or is it forbidden?

"Implementations SHOULD NOT send an empty pkt-line ("0004")."
Source: Documentation/technical/protocol-common.txt


>> +            die "invalid packet ($content_size expected; $bytes_read read)";
> 
> This error message would read "invalid packet (12 expected; 10 read)";
> I think it would be better to rephrase it as
> 
>  +            die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";

OK


>> +        die "invalid packet size";
> 
> I'm not sure if it is worth it (especially for the demo script),
> but perhaps we could show what this invalid size was?
> 
>  +        die "invalid packet size value '$pkt_size'";

OK


>> +sub packet_txt_read {
>> +    my ( $res, $buf ) = packet_bin_read();
>> +    unless ( $buf =~ /\n$/ ) {
> 
> Wouldn't
> 
>  +    unless ( $buf =~ s/\n$// ) {
> 
> or (less so)
> 
>  +    unless ( $buf =~ s/\n$\z// ) {
> 
> be more idiomatic (and not require use of 'substr')?  Remember,
> the s/// substitution quote-like operator returns number of
> substitutions in the scalar context.

OK.


>> +        die "A non-binary line SHOULD BE terminated by an LF.";
> 
> This is SHOULD be, not MUST be, so perhaps 'warn' would be enough.
> Not that Git should send us such line.

Actually it MUST per protocol definition. I'll change it to MUST.


>> +    my ($packet) = @_;
> 
> This is equivalent to
> 
>  +    my $packet = shift;
> 
> which, I think, is more common for single-parameter subroutines.
> 
> Also, this is $data (or $buf), not $packet.

OK


> Perhaps some comment that main begins here?
> 
>> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
>> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
>> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
> 
> Actually, it is overly strict.  It should not fail if there
> are other "version=3", "version=4" etc. lines.

True, but I think for an example this is OK. I'll add a note
to the file header.


>> +
>> +while (1) {
>> +    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
>> +    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
> 
> Do we require this order?  If it is, is that explained in the
> documentation?

Git sends that order right now but the filter should not rely
on that order.


>> +    packet_flush();    # empty list!
> 
> This is less "empty list!", and more keeping "status=success" unchanged.

OK


OK means, I agree and I added your suggestion to v9.
Thanks a lot for your review and the comments!

Cheers,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-27 15:37   ` Jakub Narębski
@ 2016-09-30 19:38     ` Lars Schneider
  2016-10-04 21:00       ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-09-30 19:38 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 27 Sep 2016, at 17:37, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> Part second of the review of 11/11.
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com pisze:
> 
>> +
>> +	if (!drv->process && (CAP_CLEAN & wanted_capability) && drv->clean)
> 
> This is just a very minor nitpicking, but wouldn't it be easier
> to read with those checks reordered?
> 
>  +	if ((wanted_capability & CAP_CLEAN) && !drv->process && drv->clean)

OK


>> +
>> +	if (start_command(process)) {
>> +		error("cannot fork to run external filter '%s'", cmd);
>> +		kill_multi_file_filter(hashmap, entry);
>> +		return NULL;
>> +	}
> 
> I guess there is a reason why we init hashmap entry, try to start
> external process, then kill entry of unable to start, instead of
> trying to start external process, and adding hashmap entry when
> we succeed?

Yes. This way I can reuse the kill_multi_file_filter() function.


>> +
>> +	sigchain_push(SIGPIPE, SIG_IGN);
> 
> I guess that this is here to handle errors writing to filter
> by ourself, isn't it?

Yes.


>> +		error("external filter '%s' does not support long running filter protocol", cmd);
> 
> We could have described the error here better.
> 
>  +		error("external filter '%s' does not support filter protocol version 2", cmd);

OK


>> +static void read_multi_file_filter_values(int fd, struct strbuf *status) {
> 
> This is more
> 
>  +static void read_multi_file_filter_status(int fd, struct strbuf *status) {
> 
> It doesn't read arbitrary values, it examines 'metadata' from
> filter for "status=<foo>" lines.

True!


>> +		if (pair[0] && pair[0]->len && pair[1]) {
>> +			if (!strcmp(pair[0]->buf, "status=")) {
>> +				strbuf_reset(status);
>> +				strbuf_addbuf(status, pair[1]);
>> +			}
> 
> So it is last status=<foo> line wins behavior?

Correct.


> 
>> +		}
> 
> Shouldn't we free 'struct strbuf **pair', maybe allocated by the
> strbuf_split_str() function, and reset to NULL?

True. strbuf_list_free() should be enough.


>> 
>> +	fflush(NULL);
> 
> Why this fflush(NULL) is needed here?

This flushes all open output streams. The single filter does the same.


>> 
>> +	if (fd >= 0 && !src) {
>> +		if (fstat(fd, &file_stat) == -1)
>> +			return 0;
>> +		len = xsize_t(file_stat.st_size);
>> +	}
> 
> Errr... is it necessary?  The protocol no longer provides size=<n>
> hint, and neither uses such hint if provided.

We require the size in write_packetized_from_buf() later.


>> +
>> +	err = strlen(filter_type) > PKTLINE_DATA_MAXLEN;
>> +	if (err)
>> +		goto done;
> 
> Errr... this should never happen.  We control which capabilities
> we pass, it can be only "clean" or "smudge", nothing else. Those
> would always be shorter than PKTLINE_DATA_MAXLEN.
> 
> Never mind that that is "command=smudge\n" etc. that needs to
> be shorter that PKTLINE_DATA_MAXLEN!
> 
> So, IMHO it should be at most assert, and needs to be corrected
> anyway.

OK!


> This should never happen, PATH_MAX everywhere is much shorter
> than PKTLINE_DATA_MAXLEN / LARGE_PACKET_MAX.  Or is it?
> 
> Anyway, we should probably explain or warn
> 
>   		error("path name too long: '%s'", path);

OK


>> +			/*
>> +			 * Something went wrong with the protocol filter.
>> +			 * Force shutdown and restart if another blob requires filtering!
> 
> Is this exclamation mark '!' here necessary?
> 

No.


Thanks,
Lars


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-28 23:14   ` Jakub Narębski
@ 2016-10-01 15:34     ` Lars Schneider
  2016-10-04 21:34       ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-10-01 15:34 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 29 Sep 2016, at 01:14, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> Part third (and last) of the review of v8 11/11.
> 
> W dniu 20.09.2016 o 21:02, larsxschneider@gmail.com napisał:
> 
> 
>> @@ -31,7 +31,10 @@ test_expect_success setup '
>> 	cat test >test.i &&
>> 	git add test test.t test.i &&
>> 	rm -f test test.t test.i &&
>> -	git checkout -- test test.t test.i
>> +	git checkout -- test test.t test.i &&
>> +
>> +	echo "content-test2" >test2.o &&
>> +	echo "content-test3 - subdir" >"test3 - subdir.o"
> 
> I see that you prepare here a few uncommitted files, but both
> their names and their contents leave much to be desired - you
> don't know from the name and contents what they are for.
> 
> And the '"subdir"' file which is not in subdirectory is
> especially egregious.

These are 3 files with somewhat random test content. I renamed
"subdir" to "spaces".


>> +check_filter () {
>> +	rm -f rot13-filter.log actual.log &&
>> +	"$@" 2> git_stderr.log &&
>> +	test_must_be_empty git_stderr.log &&
>> +	cat >expected.log &&
> 
> This is too clever by half.  Having a function that both tests
> the behavior and prepares 'expected' file is too much.
> 
> In my opinion preparation of 'expected.log' file should be moved
> to another function or functions.
> 
> Also, if we are running sort on output, I think we should also
> run sort on 'expected.log', so that what we write doesn't need to
> be created sorted (so we don't have to sort expected lines by hand).
> Or maybe we should run the same transformation on rot13-filter.log
> and on the contents of expected.log.

Agreed. Very good suggestion!


>> +check_filter_ignore_clean () {
>> +	rm -f rot13-filter.log actual.log &&
>> +	"$@" &&
> 
> Why we don't check for stderr here?

Because this function is used by "git checkout" which writes all
kinds of stuff to stderr. I added "--quiet --no-progress" to
disable this behavior. 


>> +check_rot13 () {
>> +	test_cmp "$1" "$2" &&
>> +	./../rot13.sh <"$1" >expected &&
> 
> Why there is .. in this invocation?

Because this script is located in the root of the current test directory.


>> +	git cat-file blob :"$2" >actual &&
>> +	test_cmp expected actual
>> +}
>> +
>> +test_expect_success PERL 'required process filter should filter data' '
>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
>> +	test_config_global filter.protocol.required true &&
>> +	rm -rf repo &&
>> +	mkdir repo &&
>> +	(
>> +		cd repo &&
>> +		git init &&
> 
> Don't you think that creating a fresh test repository for each
> separate test is a bit too much?  I guess that you want for
> each and every test to be completely independent, but this setup
> and teardown is a bit excessive.
> 
> Other tests in the same file (should we reuse the test, or use
> new test file) do not use this method.

I see your point. However, I am always annoyed if Git tests are
entangled because it makes working with them way way harder.
This test test runs in 4.5s on a slow Travis CI machine. I think
that is OK considering that we have tests running 3.5min (t3404).


>> +		echo "*.r filter=protocol" >.gitattributes &&
>> +		git add . &&
>> +		git commit . -m "test commit" &&
>> +		git branch empty &&
> 
> Err... I think it would be better to name it 'empty-branch'
> (or 'almost-empty-branch', as it does include .gitattributes file).
> See my mistake below (marked <del>...</del>).

"empty-branch". OK


>> +
>> +		cp ../test.o test.r &&
>> +		cp ../test2.o test2.r &&
> 
> What does this test2.o / test2.r file tests, that test.o / test.r
> doesn't?  The name doesn't tell us.

This just tests multiple files with different content.


> Why it is test.r, but test2.r?  Why it isn't test1.r?

test.r already existed (created in setup test).


>> +		mkdir testsubdir &&
>> +		cp "../test3 - subdir.o" "testsubdir/test3 - subdir.r" &&
> 
> Why it needs to have different contents?

To check that the filer does the right thing with multiple files
and contents.



>> +		>test4-empty.r &&
> 
> You test ordinary file, file in subdirectory, file with filename
> containing spaces, and an empty file.
> 
> Other tests of single file `clean`/`smudge` filters use filename
> that requires mangling; maybe we should use similar file?
> 
>        special="name  with '\''sq'\'' and \$x" &&
>        echo some test text >"$special" &&

OK.


> In case of `process` filter, a special filename could look like
> this:
> 
>        process_special="name=with equals and\nembedded newlines\n" &&
>        echo some test text >"$process_special" &&

I think this test would create trouble on Windows. I'll stick to
the special characters used in the single shot filter.


>> +				<<-\EOF &&
>> +					1 IN: clean test.r 57 [OK] -- OUT: 57 . [OK]
>> +					1 IN: clean test2.r 14 [OK] -- OUT: 14 . [OK]
>> +					1 IN: clean test4-empty.r 0 [OK] -- OUT: 0  [OK]
>> +					1 IN: clean testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
>> +					1 START
>> +					1 STOP
>> +					1 wrote filter header
>> +				EOF
> 
> First, this indentation level confirms that the check_filter
> function is too clever by half, and that preparing expected.log
> file should be a separate step.

Agreed.


> Second, if we run "sort" on contents to be in expected.log, we
> can write it in more natural, and less fragile way:

Agreed.


> Third, why the filter even writes output size? It is no longer
> part of `process` filter driver protocol, and it makes test more
> fragile.

I would prefer to leave that in. I think it is good for the test to
check that we are transmitting the amount of content that what we 
think we transmit.


> If we are to keep sizes, then to make test less fragile with
> respect to changes in contents of tested files, we should use
> variables containing file size:
> 
>   		test_r_size=$(wc -c test.r)
>   		...
>   		sort >expected.log <<-EOF &&
>   		...
>   			1 IN: clean test.r $test_r_size [OK] -- OUT: $test_r_size . [OK]

Agreed.


>> 
>> +		rm -f test?.r "testsubdir/test3 - subdir.r" &&
> 
> Why 'test?.r' when we are removing only 'test2.r'; why not be explicit?

True!


>> +				<<-\EOF &&
>> +					START
>> +					wrote filter header
>> +					STOP
>> +				EOF
> 
> Why is even filter process invoked?  If this is not expected, perhaps
> simply ignore what checking out almost empty branch (one without any
> files marked for filtering) does.
> 
> Shouldn't we test_expect_failure no-call?

Because a clean operation could happen. I added a clean operation to
the expected log in order to make this visible (expected log is stripped
of clean operations in the same way as the actual log per your suggestion
above).


>> +
>> +		check_filter_ignore_clean \
>> +			git checkout master \
> 
> Does this checks different code path than 'git checkout .'? For
> example, does this test increase code coverage (e.g. as measured
> by gcov)?  If not, then this test could be safely dropped.

We checked out the "empty-branch" before. That's why we check here
that the smudge filter runs for all files (smudge filter did not run
for all files with `git checkout .`).


>> +				<<-\EOF &&
>> +					START
>> +					wrote filter header
>> +					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
>> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
>> +					IN: smudge test4-empty.r 0 [OK] -- OUT: 0  [OK]
>> +					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
> 
> Can we assume that Git would pass files to filter in alphabetical
> order?  This assumption might make the test unnecessary fragile.

I have never experienced another behavior. If we see fragility we could
sort the result...


>> 
>> +test_expect_success PERL 'required process filter should clean only and take precedence' '
> 
> Trying to describe it better results in overly long description,
> which probably means that this test should be split into few
> smaller ones:
> 
> - `process` filter takes precedence over `clean` and/or `smudge`
>   filters, regardless if it supports relevant ("clean" or "smudge")
>   capability or not
> 
> - `process` filter that includes only "clean" capability should
>   clean only (be used only for 'clean' operation)

Agreed!


> In my opinion all functions should be placed at beginning,
> or even in separate file (if they are used in more than
> one test).

OK


>> +generate_test_data () {
> 
> The name is not good, it doesn't describe what kind of data
> we want to generate.

"generate_random_characters" ok?!

>> +		perl -pe "s/./chr((ord($&) % 26) + 97)/sge" >../$NAME.file &&
> 
> Those constants (26 and 97) are a bit cryptic; magical constants.
> I guess this is
> 
>  +		perl -pe "s/./chr((ord($&) % (ord('z') - ord('a') + 1) + ord('a'))/sge" >../$NAME.file &&
> 
> or
> 
>  +		perl -pe "s/./chr((ord($&) % 26 + ord('a'))/sge" >../$NAME.file &&

OK!


> Do we re-generate this file each time?
> 
>> +	./../rot13.sh <../$NAME.file >../$NAME.file.rot13
> 
> Anyway, I wonder if taking the last two lines out of the function
> (as they are not about _generating_ a file) would make it more
> readable or not.

Agreed.


>> +
>> +		echo "*.file filter=protocol" >.gitattributes &&
>> +		check_filter \
>> +			git add *.file .gitattributes \
> 
> Should it be shell expansion, or git expansion, that is
> 
>   			git add '*.file' .gitattributes

Both have the same output. Would the difference matter?


>> +					1 START
>> +					1 STOP
>> +					1 wrote filter header
>> +				EOF
>> +		git commit . -m "test commit" &&
> 
> Is this needed / necessary?

Yes, to test the smudge afterwards!

> 
>> +
>> +		rm -f *.file &&
>> +		git checkout -- *.file &&
> 
> Is this necessary?  I guess this checks that it doesn't crash, but
> we do not check that smudge operation works correctly, as we did
> for clean.

Good point. Smudge check added!


>> +		for f in *.file
>> +		do
>> +			git cat-file blob :$f >actual &&
>> +			test_cmp ../$f.rot13 actual
>> +		done
> 
> Wasn't there helper function for this?

True :-)


>> +test_expect_success PERL 'required process filter should with clean error should fail' '
>                                                     ^^^^^^                  ^^^^^^
> 
> Errr... what?  You have 'should' twice here.

Fixed


> Also, does it matter that the error is during clean operation?
> We don't test that error during smudge operation is handled in
> the same way, do we?

Clean and smudge should hit the same code paths here. Therefore I think
it is sufficient to test clean only.


>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
> 
> Do we need to pass 'clean smudge', or does it provide both by
> default?

We need to pass them. Default is empty.


>> +		git add . &&
>> +		git commit . -m "test commit" &&
> 
> You don't need to commit for 'git checkout <path>' (e.g. for .)
> or 'git cat-file -p :<file>' to work.

True!


>> +	)
>> +'
>> +
>> +test_expect_success PERL 'process filter should not restart in case of an error' '
> 
> Errr... what? This description is not clear.  Did you mean
> that filter should not be restarted if it *signals* an error
> with file (either before sending anything, or after sending
> partial contents)?

OK renamed to "process filter should not be restarted if it signals an error"


>> +test_expect_success PERL 'process filter should be able to signal an error for all future files' '
> 
> Did you mean here that filter can abort processing of
> all future files?

"process filter signals abort once to abort processing of all future files", better?


>> +
>> +		cp ../test.o test.r &&
>> +		test_must_fail git add . 2> git_stderr.log &&
>> +		grep "not support long running filter protocol" git_stderr.log
> 
> Shouldn't this use gettext poison (or rather C locale)?
> This error message could be translated in the future.

I would prefer to adjust that when we translate it.


>> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
> 
> Why not use tr/// version of this quote-like operation?
> Or do you follow prior art here?

I am not Perl expert. That worked for me :-)


>> +sub packet_bin_read {
>> +    my $buffer;
>> +    my $bytes_read = read STDIN, $buffer, 4;
>> +    if ( $bytes_read == 0 ) {
>> +
>> +        # EOF - Git stopped talking to us!
>> +        print $debug "STOP\n";
>> +        exit();
>> +    }
>> +    elsif ( $bytes_read != 4 ) {
>> +        die "invalid packet size '$bytes_read' field";
> 
> Errr, $bytes_read is not packet size field.  It is $buffer.
> Also, error message looks strange
> 
>   		invalid packet size '004' field
> 
> Shouldn't it be at end?

True. Fixed!


>> +        }
>> +        return ( 0, $buffer );
>> +    }
>> +    else {
>> +        die "invalid packet size";
> 
> Is keep-alive packet valid ("0004")?

No.


>> 
>> +packet_flush();
>> +print $debug "wrote filter header\n";
> 
> Or perhaps "handshake end"?

"init handshake complete", ok?


>> +    print $debug " $pathname";
> 
> No " pathname=$pathname" ?

Yes, otherwise it gets too verbose in the tests.


>> +        while ( length($output) > 0 ) {
>> +            my $packet = substr( $output, 0, $MAX_PACKET_CONTENT_SIZE );
>> +            packet_bin_write($packet);
>> +            print $debug ".";
> 
> All right, so number of dots is the number of packets.  This is
> surprisingly opaque.

I added a comment.


> 
>> +            if ( length($output) > $MAX_PACKET_CONTENT_SIZE ) {
>> +                $output = substr( $output, $MAX_PACKET_CONTENT_SIZE );
>> +            }
>> +            else {
>> +                $output = "";
>> +            }
>> +        }
>> +        packet_flush();
>> +        print $debug " [OK]\n";
>> +        $debug->flush();
>> +        packet_flush();
> 
> Should we test partial contents case?  Or failure during printing?
> What happens then - is file cleared by Git, or left partially converted?

Git will clear the file on any error (it doesn't matter when the error happens).

---

I am astonished how many valuable suggestion you were able to make
even though I am working with this code for months now.

Thanks a lot for taking the time to review my code that thoroughly.

- Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-09-29 21:27           ` Junio C Hamano
@ 2016-10-01 18:59             ` Lars Schneider
  2016-10-01 20:48               ` Jakub Narębski
  2016-10-03 17:02               ` Junio C Hamano
  0 siblings, 2 replies; 71+ messages in thread
From: Lars Schneider @ 2016-10-01 18:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay


> On 29 Sep 2016, at 23:27, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> We discussed that issue in v4 and v6:
>> http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/
>> http://public-inbox.org/git/xmqqbn0a3wy3.fsf@gitster.mtv.corp.google.com/
>> 
>> My impression was that you don't want Git to wait for the filter process.
>> If Git waits for the filter process - how long should Git wait?
> 
> I am not sure where you got that impression.  I did say that I do
> not want Git to _KILL_ my filter process.  That does not mean I want
> Git to go away without waiting for me.
> 
> If the filter process refuses to die forever when Git told it to
> shutdown (by closing the pipe to it, for example), that filter
> process is simply buggy.  I think we want users to become aware of
> that, instead of Git leaving it behind, which essentially is to
> sweep the problem under the rug.
> 
> I agree with what Peff said elsewhere in the thread; if a filter
> process wants to take time to clean things up while letting Git
> proceed, it can do its own process management, but I think it is
> sensible for Git to wait the filter process it directly spawned.

To realize the approach above I prototyped the run-command patch below:

I added an "exit_timeout" variable to the "child_process" struct.
On exit, Git will close the pipe to the process and wait "exit_timeout" 
seconds until it kills the child process. If "exit_timeout" is negative
then Git will wait until the process is done.

If we use that in the long running filter process, then we could make
the timeout even configurable. E.g. with "filter.<driver>.process-timeout".

What do you think about this solution? 

Thanks,
Lars



diff --git a/run-command.c b/run-command.c
index 3269362..a933066 100644
--- a/run-command.c
+++ b/run-command.c
@@ -21,6 +21,8 @@ void child_process_clear(struct child_process *child)
 
 struct child_to_clean {
 	pid_t pid;
+	int stdin;
+	int timeout;
 	struct child_to_clean *next;
 };
 static struct child_to_clean *children_to_clean;
@@ -28,9 +30,30 @@ static int installed_child_cleanup_handler;
 
 static void cleanup_children(int sig, int in_signal)
 {
+	int status;
+	struct timeval tv;
+	time_t secs;
+
 	while (children_to_clean) {
 		struct child_to_clean *p = children_to_clean;
 		children_to_clean = p->next;
+
+		if (p->timeout != 0 && p->stdin > 0)
+			close(p->stdin);
+
+		if (p->timeout < 0) {
+			// Wait until the process finishes
+			while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
+				;	/* nothing */
+		} else if (p->timeout != 0) {
+			// Wait until the process finishes or timeout
+			gettimeofday(&tv, NULL);
+			secs = tv.tv_sec;
+			while (getpgid(p->pid) >= 0 && tv.tv_sec - secs < p->timeout) {
+				gettimeofday(&tv, NULL);
+			}
+		}
+
 		kill(p->pid, sig);
 		if (!in_signal)
 			free(p);
@@ -49,10 +72,12 @@ static void cleanup_children_on_exit(void)
 	cleanup_children(SIGTERM, 0);
 }
 
-static void mark_child_for_cleanup(pid_t pid)
+static void mark_child_for_cleanup(pid_t pid, int timeout, int stdin)
 {
 	struct child_to_clean *p = xmalloc(sizeof(*p));
 	p->pid = pid;
+	p->stdin = stdin;
+	p->timeout = timeout;
 	p->next = children_to_clean;
 	children_to_clean = p;
 
@@ -422,7 +447,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0)
 		error_errno("cannot fork() for %s", cmd->argv[0]);
 	else if (cmd->clean_on_exit)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->exit_timeout, cmd->in);
 
 	/*
 	 * Wait for child's execvp. If the execvp succeeds (or if fork()
@@ -483,7 +508,7 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
 		error_errno("cannot spawn %s", cmd->argv[0]);
 	if (cmd->clean_on_exit && cmd->pid >= 0)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup(cmd->pid, cmd->exit_timeout, cmd->in);
 
 	argv_array_clear(&nargv);
 	cmd->argv = sargv;
@@ -765,7 +790,7 @@ int start_async(struct async *async)
 		exit(!!async->proc(proc_in, proc_out, async->data));
 	}
 
-	mark_child_for_cleanup(async->pid);
+	mark_child_for_cleanup(async->pid, 0, -1);
 
 	if (need_in)
 		close(fdin[0]);
diff --git a/run-command.h b/run-command.h
index cf29a31..f2eca33 100644
--- a/run-command.h
+++ b/run-command.h
@@ -33,6 +33,7 @@ struct child_process {
 	int in;
 	int out;
 	int err;
+	int exit_timeout;
 	const char *dir;
 	const char *const *env;
 	unsigned no_stdin:1;



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-01 18:59             ` Lars Schneider
@ 2016-10-01 20:48               ` Jakub Narębski
  2016-10-03 17:13                 ` Lars Schneider
  2016-10-03 17:02               ` Junio C Hamano
  1 sibling, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-10-01 20:48 UTC (permalink / raw)
  To: Lars Schneider, Junio C Hamano
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Martin-Louis Bright, Ramsay Jones

W dniu 01.10.2016 o 20:59, Lars Schneider pisze: 
> On 29 Sep 2016, at 23:27, Junio C Hamano <gitster@pobox.com> wrote:
>> Lars Schneider <larsxschneider@gmail.com> writes:
>>
>>> We discussed that issue in v4 and v6:
>>> http://public-inbox.org/git/20160803225313.pk3tfe5ovz4y3i7l@sigill.intra.peff.net/
>>> http://public-inbox.org/git/xmqqbn0a3wy3.fsf@gitster.mtv.corp.google.com/
>>>
>>> My impression was that you don't want Git to wait for the filter process.
>>> If Git waits for the filter process - how long should Git wait?
>>
>> I am not sure where you got that impression.  I did say that I do
>> not want Git to _KILL_ my filter process.  That does not mean I want
>> Git to go away without waiting for me.
>>
>> If the filter process refuses to die forever when Git told it to
>> shutdown (by closing the pipe to it, for example), that filter
>> process is simply buggy.  I think we want users to become aware of
>> that, instead of Git leaving it behind, which essentially is to
>> sweep the problem under the rug.

Well, it would be good to tell users _why_ Git is hanging, see below.

>>
>> I agree with what Peff said elsewhere in the thread; if a filter
>> process wants to take time to clean things up while letting Git
>> proceed, it can do its own process management, but I think it is
>> sensible for Git to wait the filter process it directly spawned.
> 
> To realize the approach above I prototyped the run-command patch below:
> 
> I added an "exit_timeout" variable to the "child_process" struct.
> On exit, Git will close the pipe to the process and wait "exit_timeout" 
> seconds until it kills the child process. If "exit_timeout" is negative
> then Git will wait until the process is done.

That might be good approach.  Probably the default would be to wait.

> 
> If we use that in the long running filter process, then we could make
> the timeout even configurable. E.g. with "filter.<driver>.process-timeout".

Sidenote: we prefer camelCase rather than kebab-case for config
variables, that is, "filter.<driver>.processTimeout".

Also, how would one set default value of timeout for all process
based filters?

> 
> What do you think about this solution?

I think this addition be done after, assuming that we come up
with good default behavior (e.g. wait for filter processes
to finish).

Also, we would probably want to add some progress information
in a similar way to progress info for checkout, that is display
it after a few seconds waiting.

This could be, for example:

  Waiting for filter '<driver>' to finish... done

With timeout it could look like this (where underlined part
is interactive, that is changes every second):

  Waiting 10s for '<driver>' filter process to finish.
          ^^^

And then either

  Filter '<driver>' killed

or

  Filter '<driver>' finished


> diff --git a/run-command.c b/run-command.c
> index 3269362..a933066 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -21,6 +21,8 @@ void child_process_clear(struct child_process *child)
>  
>  struct child_to_clean {
>  	pid_t pid;
> +	int stdin;
> +	int timeout;
>  	struct child_to_clean *next;
>  };
>  static struct child_to_clean *children_to_clean;
> @@ -28,9 +30,30 @@ static int installed_child_cleanup_handler;
>  
>  static void cleanup_children(int sig, int in_signal)
>  {
> +	int status;
> +	struct timeval tv;
> +	time_t secs;
> +
>  	while (children_to_clean) {
>  		struct child_to_clean *p = children_to_clean;
>  		children_to_clean = p->next;
> +
> +		if (p->timeout != 0 && p->stdin > 0)
> +			close(p->stdin);

Why are you not closing stdin of filter driver process if timeout
is zero?  Is it maybe some kind of special value?  If it is, this
is not documented.

> +
> +		if (p->timeout < 0) {
> +			// Wait until the process finishes
> +			while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
> +				;	/* nothing */

Ah, this loop is here because waiting on waitpid() can be interrupted
by the delivery of a signal to the calling process; though the result
is -1, not just any < 0.

Looks good (but we would want some progress information, probably).

> +		} else if (p->timeout != 0) {
> +			// Wait until the process finishes or timeout
> +			gettimeofday(&tv, NULL);
> +			secs = tv.tv_sec;
> +			while (getpgid(p->pid) >= 0 && tv.tv_sec - secs < p->timeout) {
> +				gettimeofday(&tv, NULL);
> +			}

WTF?  Busy wait loop???

Also, if we want to wait for child without blocking, then instead
of cryptic getpgid(p->pid) maybe use waitpid(p->pid, &status, WNOHANG);
it is more explicit.

There is also another complication: there can be more than one
long-running filter driver used.  With this implementation we
wait for each of one in sequence, e.g. 10s + 10s + 10s.

> +		}
> +
>  		kill(p->pid, sig);
>  		if (!in_signal)
>  			free(p);
> @@ -49,10 +72,12 @@ static void cleanup_children_on_exit(void)
>  	cleanup_children(SIGTERM, 0);
>  }
>  
> -static void mark_child_for_cleanup(pid_t pid)
> +static void mark_child_for_cleanup(pid_t pid, int timeout, int stdin)

Hmmmm... three parameters is not too much, but we might want to
pass "struct child_process *" directly if this number grows.

>  {
>  	struct child_to_clean *p = xmalloc(sizeof(*p));
>  	p->pid = pid;
> +	p->stdin = stdin;
> +	p->timeout = timeout;
>  	p->next = children_to_clean;
>  	children_to_clean = p;
>  
> @@ -422,7 +447,7 @@ int start_command(struct child_process *cmd)
>  	if (cmd->pid < 0)
>  		error_errno("cannot fork() for %s", cmd->argv[0]);
>  	else if (cmd->clean_on_exit)
> -		mark_child_for_cleanup(cmd->pid);
> +		mark_child_for_cleanup(cmd->pid, cmd->exit_timeout, cmd->in);
>  
>  	/*
>  	 * Wait for child's execvp. If the execvp succeeds (or if fork()
> @@ -483,7 +508,7 @@ int start_command(struct child_process *cmd)
>  	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
>  		error_errno("cannot spawn %s", cmd->argv[0]);
>  	if (cmd->clean_on_exit && cmd->pid >= 0)
> -		mark_child_for_cleanup(cmd->pid);
> +		mark_child_for_cleanup(cmd->pid, cmd->exit_timeout, cmd->in);
>  
>  	argv_array_clear(&nargv);
>  	cmd->argv = sargv;
> @@ -765,7 +790,7 @@ int start_async(struct async *async)
>  		exit(!!async->proc(proc_in, proc_out, async->data));
>  	}
>  
> -	mark_child_for_cleanup(async->pid);
> +	mark_child_for_cleanup(async->pid, 0, -1);

Eh?  A bit magic.

>  
>  	if (need_in)
>  		close(fdin[0]);
> diff --git a/run-command.h b/run-command.h
> index cf29a31..f2eca33 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -33,6 +33,7 @@ struct child_process {
>  	int in;
>  	int out;
>  	int err;
> +	int exit_timeout;

I guess it is no problem adding new field to child_process struct;
it is not constrained for memory...

>  	const char *dir;
>  	const char *const *env;
>  	unsigned no_stdin:1;
> 

Where we read the value of the configuration variable?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-01 18:59             ` Lars Schneider
  2016-10-01 20:48               ` Jakub Narębski
@ 2016-10-03 17:02               ` Junio C Hamano
  2016-10-03 17:35                 ` Lars Schneider
  2016-10-04 12:11                 ` Jeff King
  1 sibling, 2 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-10-03 17:02 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Lars Schneider <larsxschneider@gmail.com> writes:

>> If the filter process refuses to die forever when Git told it to
>> shutdown (by closing the pipe to it, for example), that filter
>> process is simply buggy.  I think we want users to become aware of
>> that, instead of Git leaving it behind, which essentially is to
>> sweep the problem under the rug.
>> 
>> I agree with what Peff said elsewhere in the thread; if a filter
>> process wants to take time to clean things up while letting Git
>> proceed, it can do its own process management, but I think it is
>> sensible for Git to wait the filter process it directly spawned.
>
> To realize the approach above I prototyped the run-command patch below:
>
> I added an "exit_timeout" variable to the "child_process" struct.
> On exit, Git will close the pipe to the process and wait "exit_timeout" 
> seconds until it kills the child process. If "exit_timeout" is negative
> then Git will wait until the process is done.

> If we use that in the long running filter process, then we could make
> the timeout even configurable. E.g. with "filter.<driver>.process-timeout".
>
> What do you think about this solution? 

Is such a configuration (or timeout in general) necessary?  I
suspect that a need for timeout, especially needing timeout and
needing to get killed that happens so often to require a
configuration variable, is a sign of something else seriously wrong.

What's the justification for a filter to _require_ getting killed
all the time when it is spawned?  Otherwise you wouldn't configure
"this driver does not die when told, so we need a timeout" variable.
Is it a sign of the flaw in the protocol to talk to it?  e.g. Git
has a way to tell it to die, but it somehow is very hard to hear
from filter's end and honor that request?

I think that we would need some timeout in the mechanism, but not to
be used for "killing".

You would decide to "kill" an filter process in two cases: the
filter is buggy and refuses to die when Git tells it to exit, or the
code in Git waiting for its death is somehow miscounting its
children, and thought it told to die one process but in fact it
didn't (perhaps it told somebody else to die), or it thought it
hasn't seen the child die when in fact it already did.

Calling kill(2) and exiting would hide these two kind of bugs from
end users.  Not doing so would give the end users a hung Git, which
is a VERY GOOD thing.  Otherwise you would not notice bugs and lose
the opportunity to diagnose and fix it.

The timeout would be good for you to give a message "filter process
running the script '%s' is not exiting; I am waiting for it".  The
user is still left with a hung Git, and can then see if that process
is hanging around.  If it is, then we found a buggy filter.  Or we
found a buggy Git.  Either needs to be fixed.  I do not think it
would help anybody by doing a kill(2) to sweep possible bugs under
the rug.

Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-01 20:48               ` Jakub Narębski
@ 2016-10-03 17:13                 ` Lars Schneider
  2016-10-04 19:04                   ` Jakub Narębski
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-10-03 17:13 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Torsten Bögershausen, git, Jeff King,
	Stefan Beller, Martin-Louis Bright, Ramsay Jones


> On 01 Oct 2016, at 22:48, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 01.10.2016 o 20:59, Lars Schneider pisze: 
>> On 29 Sep 2016, at 23:27, Junio C Hamano <gitster@pobox.com> wrote:
>>> Lars Schneider <larsxschneider@gmail.com> writes:
>>> 
>>> If the filter process refuses to die forever when Git told it to
>>> shutdown (by closing the pipe to it, for example), that filter
>>> process is simply buggy.  I think we want users to become aware of
>>> that, instead of Git leaving it behind, which essentially is to
>>> sweep the problem under the rug.
> 
> Well, it would be good to tell users _why_ Git is hanging, see below.

Agreed. Do you think it is OK to write the message to stderr?


>>> I agree with what Peff said elsewhere in the thread; if a filter
>>> process wants to take time to clean things up while letting Git
>>> proceed, it can do its own process management, but I think it is
>>> sensible for Git to wait the filter process it directly spawned.
>> 
>> To realize the approach above I prototyped the run-command patch below:
>> 
>> I added an "exit_timeout" variable to the "child_process" struct.
>> On exit, Git will close the pipe to the process and wait "exit_timeout" 
>> seconds until it kills the child process. If "exit_timeout" is negative
>> then Git will wait until the process is done.
> 
> That might be good approach.  Probably the default would be to wait.

I think I would prefer a 2sec timeout or something as default. This way
we can ensure Git would not wait indefinitely for a buggy filter by default.


>> If we use that in the long running filter process, then we could make
>> the timeout even configurable. E.g. with "filter.<driver>.process-timeout".
> 
> Sidenote: we prefer camelCase rather than kebab-case for config
> variables, that is, "filter.<driver>.processTimeout".

Sure!


> Also, how would one set default value of timeout for all process
> based filters?

I think we don't need that because a timeout is always specific
to a filter (if the 2sec default is not sufficient).


>> 
>> +			while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
>> +				;	/* nothing */
> 
> Ah, this loop is here because waiting on waitpid() can be interrupted
> by the delivery of a signal to the calling process; though the result
> is -1, not just any < 0.

"< 0" is also used in wait_or_whine()


>> +			while (getpgid(p->pid) >= 0 && tv.tv_sec - secs < p->timeout) {
>> +				gettimeofday(&tv, NULL);
>> +			}
> 
> WTF?  Busy wait loop???

This was just a quick prototype to show "my thinking direction". 
I wasn't expecting a review. Sorry :-)


> Also, if we want to wait for child without blocking, then instead
> of cryptic getpgid(p->pid) maybe use waitpid(p->pid, &status, WNOHANG);
> it is more explicit.

Sure!


> There is also another complication: there can be more than one
> long-running filter driver used.  With this implementation we
> wait for each of one in sequence, e.g. 10s + 10s + 10s.

Good idea, I fixed that in the version below!


>> 
>> -static void mark_child_for_cleanup(pid_t pid)
>> +static void mark_child_for_cleanup(pid_t pid, int timeout, int stdin)
> 
> Hmmmm... three parameters is not too much, but we might want to
> pass "struct child_process *" directly if this number grows.

I used parameters because this function is also used with the async struct... 

I've attached a more serious patch for review below.
What do you think?

Thanks,
Lars



diff --git a/run-command.c b/run-command.c
index 3269362..ca0feef 100644
--- a/run-command.c
+++ b/run-command.c
@@ -21,6 +21,9 @@ void child_process_clear(struct child_process *child)
 
 struct child_to_clean {
 	pid_t pid;
+	char *name;
+	int stdin;
+	int timeout;
 	struct child_to_clean *next;
 };
 static struct child_to_clean *children_to_clean;
@@ -28,12 +31,53 @@ static int installed_child_cleanup_handler;
 
 static void cleanup_children(int sig, int in_signal)
 {
+	int status;
+	struct timeval tv;
+	time_t secs;
+	struct child_to_clean *p = children_to_clean;
+
+	// Send EOF to children as indicator that Git will exit soon
+	while (p) {
+		if (p->timeout != 0) {
+			if (p->stdin > 0)
+				close(p->stdin);
+		}
+		p = p->next;
+	}
+
 	while (children_to_clean) {
-		struct child_to_clean *p = children_to_clean;
+		p = children_to_clean;
 		children_to_clean = p->next;
+
+		if (p->timeout != 0) {
+			fprintf(stderr, _("Waiting for '%s' to finish..."), p->name);
+			if (p->timeout < 0) {
+				// No timeout given - wait indefinitely
+				while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
+					;	/* nothing */
+			} else {
+				// Wait until timeout
+				gettimeofday(&tv, NULL);
+				secs = tv.tv_sec;
+				while (!waitpid(p->pid, &status, WNOHANG) &&
+					   tv.tv_sec - secs < p->timeout) {
+					fprintf(stderr, _(" \rWaiting %lds for '%s' to finish..."),
+						p->timeout - tv.tv_sec + secs - 1, p->name);
+					gettimeofday(&tv, NULL);
+					sleep_millisec(10);
+				}
+			}
+			if (waitpid(p->pid, &status, WNOHANG))
+				fprintf(stderr, _("done!\n"));
+			else
+				fprintf(stderr, _("timeout. Killing...\n"));
+		}
+
 		kill(p->pid, sig);
-		if (!in_signal)
+		if (!in_signal) {
+			free(p->name);
 			free(p);
+		}
 	}
 }
 
@@ -49,10 +93,18 @@ static void cleanup_children_on_exit(void)
 	cleanup_children(SIGTERM, 0);
 }
 
-static void mark_child_for_cleanup(pid_t pid)
+static void mark_child_for_cleanup_with_timeout(pid_t pid, const char *name, int stdin, int timeout)
 {
 	struct child_to_clean *p = xmalloc(sizeof(*p));
 	p->pid = pid;
+	p->timeout = timeout;
+	p->stdin = stdin;
+	if (name) {
+		p->name = xmalloc(strlen(name) + 1);
+		strcpy(p->name, name);
+	} else {
+		p->name = "process";
+	}
 	p->next = children_to_clean;
 	children_to_clean = p;
 
@@ -63,6 +115,13 @@ static void mark_child_for_cleanup(pid_t pid)
 	}
 }
 
+#ifdef NO_PTHREADS
+static void mark_child_for_cleanup(pid_t pid, const char *name, int timeout, int stdin)
+{
+	mark_child_for_cleanup_with_timeout(pid, NULL, 0, 0);
+}
+#endif
+
 static void clear_child_for_cleanup(pid_t pid)
 {
 	struct child_to_clean **pp;
@@ -422,7 +481,8 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0)
 		error_errno("cannot fork() for %s", cmd->argv[0]);
 	else if (cmd->clean_on_exit)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup_with_timeout(
+			cmd->pid, cmd->argv[0], cmd->in, cmd->clean_on_exit_timeout);
 
 	/*
 	 * Wait for child's execvp. If the execvp succeeds (or if fork()
@@ -483,7 +543,8 @@ int start_command(struct child_process *cmd)
 	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
 		error_errno("cannot spawn %s", cmd->argv[0]);
 	if (cmd->clean_on_exit && cmd->pid >= 0)
-		mark_child_for_cleanup(cmd->pid);
+		mark_child_for_cleanup_with_timeout(
+			cmd->pid, cmd->argv[0], cmd->in, cmd->clean_on_exit_timeout);
 
 	argv_array_clear(&nargv);
 	cmd->argv = sargv;
diff --git a/run-command.h b/run-command.h
index cf29a31..4c1c1f4 100644
--- a/run-command.h
+++ b/run-command.h
@@ -43,6 +43,16 @@ struct child_process {
 	unsigned stdout_to_stderr:1;
 	unsigned use_shell:1;
 	unsigned clean_on_exit:1;
+	/*
+	 * clean_on_exit_timeout is only considered if clean_on_exit is set.
+	 * - Specify 0 to kill the child on Git exit (default)
+	 * - Specify a negative value to close the child's stdin on Git exit
+	 *   and wait indefinitely for the child's termination.
+	 * - Specify a positive value to close the child's stdin on Git exit
+	 *   and wait clean_on_exit_timeout seconds for the child's
+	 *   termination.
+	 */
+	int clean_on_exit_timeout;
 };
 
 #define CHILD_PROCESS_INIT { NULL, ARGV_ARRAY_INIT, ARGV_ARRAY_INIT }


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-03 17:02               ` Junio C Hamano
@ 2016-10-03 17:35                 ` Lars Schneider
  2016-10-04 12:11                 ` Jeff King
  1 sibling, 0 replies; 71+ messages in thread
From: Lars Schneider @ 2016-10-03 17:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, git, Jeff King, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay


> On 03 Oct 2016, at 19:02, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> If the filter process refuses to die forever when Git told it to
>>> shutdown (by closing the pipe to it, for example), that filter
>>> process is simply buggy.  I think we want users to become aware of
>>> that, instead of Git leaving it behind, which essentially is to
>>> sweep the problem under the rug.
>>> 
>>> I agree with what Peff said elsewhere in the thread; if a filter
>>> process wants to take time to clean things up while letting Git
>>> proceed, it can do its own process management, but I think it is
>>> sensible for Git to wait the filter process it directly spawned.
>> 
>> To realize the approach above I prototyped the run-command patch below:
>> 
>> I added an "exit_timeout" variable to the "child_process" struct.
>> On exit, Git will close the pipe to the process and wait "exit_timeout" 
>> seconds until it kills the child process. If "exit_timeout" is negative
>> then Git will wait until the process is done.
> 
>> If we use that in the long running filter process, then we could make
>> the timeout even configurable. E.g. with "filter.<driver>.process-timeout".
>> 
>> What do you think about this solution? 
> 
> Is such a configuration (or timeout in general) necessary?  I
> suspect that a need for timeout, especially needing timeout and
> needing to get killed that happens so often to require a
> configuration variable, is a sign of something else seriously wrong.
> 
> What's the justification for a filter to _require_ getting killed
> all the time when it is spawned?  Otherwise you wouldn't configure
> "this driver does not die when told, so we need a timeout" variable.
> Is it a sign of the flaw in the protocol to talk to it?  e.g. Git
> has a way to tell it to die, but it somehow is very hard to hear
> from filter's end and honor that request?
> 
> I think that we would need some timeout in the mechanism, but not to
> be used for "killing".
> 
> You would decide to "kill" an filter process in two cases: the
> filter is buggy and refuses to die when Git tells it to exit, or the
> code in Git waiting for its death is somehow miscounting its
> children, and thought it told to die one process but in fact it
> didn't (perhaps it told somebody else to die), or it thought it
> hasn't seen the child die when in fact it already did.

Agreed.


> Calling kill(2) and exiting would hide these two kind of bugs from
> end users.  Not doing so would give the end users a hung Git, which
> is a VERY GOOD thing.  Otherwise you would not notice bugs and lose
> the opportunity to diagnose and fix it.

Aha. I assumed that a hung Git because of a buggy filter would be a no-no.
Thanks for this clarification.


> The timeout would be good for you to give a message "filter process
> running the script '%s' is not exiting; I am waiting for it".  The
> user is still left with a hung Git, and can then see if that process
> is hanging around.  If it is, then we found a buggy filter.  Or we
> found a buggy Git.  Either needs to be fixed.  I do not think it
> would help anybody by doing a kill(2) to sweep possible bugs under
> the rug.

I could achieve that with this run-command patch: 
http://public-inbox.org/git/E9946E9F-6EE5-492B-B122-9078CEB88044@gmail.com/
(I'll remove the "timeout after x seconds" parts and keep the "wait until 
done" part with stderr output)


Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-03 17:02               ` Junio C Hamano
  2016-10-03 17:35                 ` Lars Schneider
@ 2016-10-04 12:11                 ` Jeff King
  2016-10-04 16:47                   ` Junio C Hamano
  1 sibling, 1 reply; 71+ messages in thread
From: Jeff King @ 2016-10-04 12:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, Torsten Bögershausen, git, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

On Mon, Oct 03, 2016 at 10:02:14AM -0700, Junio C Hamano wrote:

> The timeout would be good for you to give a message "filter process
> running the script '%s' is not exiting; I am waiting for it".  The
> user is still left with a hung Git, and can then see if that process
> is hanging around.  If it is, then we found a buggy filter.  Or we
> found a buggy Git.  Either needs to be fixed.  I do not think it
> would help anybody by doing a kill(2) to sweep possible bugs under
> the rug.

I would argue that we should not even bother with such a timeout. This
is an exceptional, buggy condition, and hanging is not at all restricted
to this particular case. If git is hanging, then the right tools are
"ps" or "strace" to figure out what is going on. I know that not all
users are comfortable with those tools, but enough are in practice that
the bugs get ironed out, without git having to carry a bunch of extra
timing code that is essentially never exercised.

-Peff

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-04 12:11                 ` Jeff King
@ 2016-10-04 16:47                   ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-10-04 16:47 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Torsten Bögershausen, git, Stefan Beller,
	Jakub Narębski, Martin-Louis Bright, ramsay

Jeff King <peff@peff.net> writes:

> On Mon, Oct 03, 2016 at 10:02:14AM -0700, Junio C Hamano wrote:
>
>> The timeout would be good for you to give a message "filter process
>> running the script '%s' is not exiting; I am waiting for it".  The
>> user is still left with a hung Git, and can then see if that process
>> is hanging around.  If it is, then we found a buggy filter.  Or we
>> found a buggy Git.  Either needs to be fixed.  I do not think it
>> would help anybody by doing a kill(2) to sweep possible bugs under
>> the rug.
>
> I would argue that we should not even bother with such a timeout. This
> is an exceptional, buggy condition, and hanging is not at all restricted
> to this particular case. If git is hanging, then the right tools are
> "ps" or "strace" to figure out what is going on. I know that not all
> users are comfortable with those tools, but enough are in practice that
> the bugs get ironed out, without git having to carry a bunch of extra
> timing code that is essentially never exercised.

OK.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-03 17:13                 ` Lars Schneider
@ 2016-10-04 19:04                   ` Jakub Narębski
  2016-10-06 13:13                     ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-10-04 19:04 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Torsten Bögershausen, git, Jeff King,
	Stefan Beller, Martin-Louis Bright, Ramsay Jones

W dniu 03.10.2016 o 19:13, Lars Schneider pisze: 
>> On 01 Oct 2016, at 22:48, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 01.10.2016 o 20:59, Lars Schneider pisze: 
>>> On 29 Sep 2016, at 23:27, Junio C Hamano <gitster@pobox.com> wrote:
>>>> Lars Schneider <larsxschneider@gmail.com> writes:
>>>>
>>>> If the filter process refuses to die forever when Git told it to
>>>> shutdown (by closing the pipe to it, for example), that filter
>>>> process is simply buggy.  I think we want users to become aware of
>>>> that, instead of Git leaving it behind, which essentially is to
>>>> sweep the problem under the rug.
>>
>> Well, it would be good to tell users _why_ Git is hanging, see below.
> 
> Agreed. Do you think it is OK to write the message to stderr?

On the other hand, this is why GIT_TRACE (and GIT_TRACE_PERFORMANCE)
was invented for.  We do not signal troubles with single-shot filters,
so I guess doing it for multi-file filters is not needed.
 
>>>> I agree with what Peff said elsewhere in the thread; if a filter
>>>> process wants to take time to clean things up while letting Git
>>>> proceed, it can do its own process management, but I think it is
>>>> sensible for Git to wait the filter process it directly spawned.
>>>
>>> To realize the approach above I prototyped the run-command patch below:
>>>
>>> I added an "exit_timeout" variable to the "child_process" struct.
>>> On exit, Git will close the pipe to the process and wait "exit_timeout" 
>>> seconds until it kills the child process. If "exit_timeout" is negative
>>> then Git will wait until the process is done.
>>
>> That might be good approach.  Probably the default would be to wait.
> 
> I think I would prefer a 2sec timeout or something as default. This way
> we can ensure Git would not wait indefinitely for a buggy filter by default.

Actually this waiting for multi-file filter is only about waiting for
the shutdown process of the filter.  The filter could still hang during
processing a file, and git would hang too, if I understand it correctly.

[...]
>> Also, how would one set default value of timeout for all process
>> based filters?
> 
> I think we don't need that because a timeout is always specific
> to a filter (if the 2sec default is not sufficient).

All right (assuming that timeouts are good idea). 

>>>
>>> +			while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
>>> +				;	/* nothing */
>>
>> Ah, this loop is here because waiting on waitpid() can be interrupted
>> by the delivery of a signal to the calling process; though the result
>> is -1, not just any < 0.
> 
> "< 0" is also used in wait_or_whine()

O.K. (though it doesn't necessary mean that it is correct, there
is another point for using "< 0").
 
[...]
>> There is also another complication: there can be more than one
>> long-running filter driver used.  With this implementation we
>> wait for each of one in sequence, e.g. 10s + 10s + 10s.
> 
> Good idea, I fixed that in the version below!
> 
[...]
> [...] this function is also used with the async struct... 

Hmmm... now I wonder if it is a good idea (similar treatment for
single-file async-invoked filter, and multi-file pkt-line filters).

For single-file one-shot filter (correct me if I am wrong):

 - git sends contents to filter, signals end with EOF
   (after process is started)
 - in an async process:
   - process is started
   - git reads contents from filter, until EOF
   - if process did not end, it is killed


For multi-process pkt-line based filter (simplified):

 - process is started
 - handshake
 - for each file
   - file is send to filter process over pkt-line,
     end signalled with flush packet
   - git reads from filter from pkt-line, until flush
 - ...


See how single-shot filter is sent EOF, though in different part
of code.  We need to signal multi-file filter that no more files
will be coming.  Simplest solution is to send EOF (we could send
"command=shutdown" for example...) to filter, and wait for EOF
from filter (or for "status=finished" and EOF).

We could kill multi-file filter after sending last file and
receiving full response... but I think single-shot filter gets
killed only because it allows for very simple filters, and reusing
existing commands as filters.

[...]
> diff --git a/run-command.c b/run-command.c
> index 3269362..ca0feef 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -21,6 +21,9 @@ void child_process_clear(struct child_process *child)
>  
>  struct child_to_clean {
>  	pid_t pid;
> +	char *name;

I guess it is here for output purposes?

Should we store full command here, or just name of <driver>?

> +	int stdin;

I guess the name `stdin` for file _descriptor_ is something
used in other parts of convert.c code, isn't it?

> +	int timeout;

Hmmm... we assume that timeout is in seconds, not millis or other
value, isn't it.  timeout_sec would perhaps be unnecessarily long.

>  	struct child_to_clean *next;
>  };
>  static struct child_to_clean *children_to_clean;
> @@ -28,12 +31,53 @@ static int installed_child_cleanup_handler;
>  
>  static void cleanup_children(int sig, int in_signal)
>  {
> +	int status;
> +	struct timeval tv;
> +	time_t secs;
> +	struct child_to_clean *p = children_to_clean;
> +
> +	// Send EOF to children as indicator that Git will exit soon
> +	while (p) {
> +		if (p->timeout != 0) {

Here we use timeout == 0 as a special case, a special indicator
(IIUC for the single-shot filter case, where it is closed already).
This is not documented.  Somebody setting timeout to "0" would
be surprised, isn't it?

> +			if (p->stdin > 0)
> +				close(p->stdin);
> +		}
> +		p = p->next;
> +	}
> +
>  	while (children_to_clean) {
> -		struct child_to_clean *p = children_to_clean;
> +		p = children_to_clean;
>  		children_to_clean = p->next;
> +
> +		if (p->timeout != 0) {
> +			fprintf(stderr, _("Waiting for '%s' to finish..."), p->name);
> +			if (p->timeout < 0) {
> +				// No timeout given - wait indefinitely
> +				while ((waitpid(p->pid, &status, 0)) < 0 && errno == EINTR)
> +					;	/* nothing */
> +			} else {
> +				// Wait until timeout
> +				gettimeofday(&tv, NULL);
> +				secs = tv.tv_sec;
> +				while (!waitpid(p->pid, &status, WNOHANG) &&
> +					   tv.tv_sec - secs < p->timeout) {
> +					fprintf(stderr, _(" \rWaiting %lds for '%s' to finish..."),
> +						p->timeout - tv.tv_sec + secs - 1, p->name);
> +					gettimeofday(&tv, NULL);
> +					sleep_millisec(10);
> +				}
> +			}

I wonder if we have some progress-printing code we can borrow
from, or just plain use (like progress report for long checkout).

> +			if (waitpid(p->pid, &status, WNOHANG))
> +				fprintf(stderr, _("done!\n"));
> +			else
> +				fprintf(stderr, _("timeout. Killing...\n"));
> +		}
> +
>  		kill(p->pid, sig);
> -		if (!in_signal)
> +		if (!in_signal) {
> +			free(p->name);
>  			free(p);
> +		}
>  	}
>  }
>  
> @@ -49,10 +93,18 @@ static void cleanup_children_on_exit(void)
>  	cleanup_children(SIGTERM, 0);
>  }
>  
> -static void mark_child_for_cleanup(pid_t pid)
> +static void mark_child_for_cleanup_with_timeout(pid_t pid, const char *name, int stdin, int timeout)
>  {
>  	struct child_to_clean *p = xmalloc(sizeof(*p));
>  	p->pid = pid;
> +	p->timeout = timeout;
> +	p->stdin = stdin;
> +	if (name) {
> +		p->name = xmalloc(strlen(name) + 1);
> +		strcpy(p->name, name);

Don't we have xstrdup() for that, or am I mistaken?

> +	} else {
> +		p->name = "process";

Hmmmm...

> +	}
>  	p->next = children_to_clean;
>  	children_to_clean = p;
>  
> @@ -63,6 +115,13 @@ static void mark_child_for_cleanup(pid_t pid)
>  	}
>  }
>  
> +#ifdef NO_PTHREADS
> +static void mark_child_for_cleanup(pid_t pid, const char *name, int timeout, int stdin)
> +{
> +	mark_child_for_cleanup_with_timeout(pid, NULL, 0, 0);
> +}
> +#endif

Uh?

> +
>  static void clear_child_for_cleanup(pid_t pid)
>  {
>  	struct child_to_clean **pp;
> @@ -422,7 +481,8 @@ int start_command(struct child_process *cmd)
>  	if (cmd->pid < 0)
>  		error_errno("cannot fork() for %s", cmd->argv[0]);
>  	else if (cmd->clean_on_exit)
> -		mark_child_for_cleanup(cmd->pid);
> +		mark_child_for_cleanup_with_timeout(
> +			cmd->pid, cmd->argv[0], cmd->in, cmd->clean_on_exit_timeout);

All right, nice abstraction.

>  
>  	/*
>  	 * Wait for child's execvp. If the execvp succeeds (or if fork()
> @@ -483,7 +543,8 @@ int start_command(struct child_process *cmd)
>  	if (cmd->pid < 0 && (!cmd->silent_exec_failure || errno != ENOENT))
>  		error_errno("cannot spawn %s", cmd->argv[0]);
>  	if (cmd->clean_on_exit && cmd->pid >= 0)
> -		mark_child_for_cleanup(cmd->pid);
> +		mark_child_for_cleanup_with_timeout(
> +			cmd->pid, cmd->argv[0], cmd->in, cmd->clean_on_exit_timeout);
>  
>  	argv_array_clear(&nargv);
>  	cmd->argv = sargv;
> diff --git a/run-command.h b/run-command.h
> index cf29a31..4c1c1f4 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -43,6 +43,16 @@ struct child_process {
>  	unsigned stdout_to_stderr:1;
>  	unsigned use_shell:1;
>  	unsigned clean_on_exit:1;
> +	/*
> +	 * clean_on_exit_timeout is only considered if clean_on_exit is set.
> +	 * - Specify 0 to kill the child on Git exit (default)
> +	 * - Specify a negative value to close the child's stdin on Git exit
> +	 *   and wait indefinitely for the child's termination.
> +	 * - Specify a positive value to close the child's stdin on Git exit
> +	 *   and wait clean_on_exit_timeout seconds for the child's
> +	 *   termination.

All right, so here is this documentation...

> +	 */
> +	int clean_on_exit_timeout;
>  };
>  
>  #define CHILD_PROCESS_INIT { NULL, ARGV_ARRAY_INIT, ARGV_ARRAY_INIT }
> 
> 

For full patch, you would need also to add to Documentation/config.txt

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-30 18:56     ` Lars Schneider
@ 2016-10-04 20:50       ` Jakub Narębski
  2016-10-06 13:16         ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-10-04 20:50 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

[Some of answers may get invalidated by v9]

W dniu 30.09.2016 o 20:56, Lars Schneider pisze:
>> On 27 Sep 2016, at 00:41, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> Part first of the review of 11/11.

[...]
>>> +to read a welcome response message ("git-filter-server") and exactly
>>> +one protocol version number from the previously sent list. All further
>>
>> I guess that is to provide forward-compatibility, isn't it?  Also,
>> "Git expects..." probably means filter process MUST send, in the
>> RFC2119 (https://tools.ietf.org/html/rfc2119) meaning.
> 
> True. I feel "expects" reads better but I am happy to change it if
> you feel strong about it.

I don't have strong opinion about this.  I guess following the example
of pkt-line format description may be a good idea.  We are not writing
an RFC here... but having be explicit is better than be good read :-P

>>> +
>>> +After the version negotiation Git sends a list of supported capabilities
>>> +and a flush packet.
>>
>> Is it that Git SHOULD send list of ALL supported capabilities, or is
>> it that Git SHOULD NOT send capabilities it does not support, and that
>> it MAY send only those capabilities it needs (so for example if command
>> uses only `smudge`, it may not send `clean`, so that filter driver doesn't
>> need to initialize data it would not need).
> 
> "After the version negotiation Git sends a list of all capabilities that
> it supports and a flush packet."
> 
> Better?

Better.

I wonder if it is a matter of current implementation, namely that at
the point of code where Git is sending capabilities it doesn't know
which of them it will be using; at least in some of cases.

Because if it would be possible for Git to not send capabilities which
it supports, but is sure that it would not be using for a given operation,
then it would be good to do that.  It might be that filter driver must
do some prep work for each of capabilities, so skipping some of prep
would make git faster.  Though all this is for now theoretical musings...
it might be an argument for such description of protocol so it does
not prevent Git from sending only supported capabilities it needs.

>> I wonder why it is "<capability>=true", and not "capability=<capability>".
>> Is there a case where we would want to send "<capability>=false".  Or
>> is it to allow configurable / value based capabilities?  Isn't it going
>> a bit too far: is there even a hind of an idea for parametrize-able
>> capability? YAGNI is a thing...
> 
> Peff suggested that format and I think it is OK:
> http://public-inbox.org/git/20160803224619.bwtbvmslhuicx2qi@sigill.intra.peff.net/

It wouldn't kill you to summarize the argument, would it?

From what I understand the argument is that "<capability>=true" allows
for simplist parsing into a [intermediate] hash, while the other
proposal of using"capability=<capability>" would require something more
sophisticated.  And that it is better to be explicit rather than
use synonyms / shortcuts for "<capability>".

Though one can argue that "<foo>" is synonym / shortcut for "<foo>=true";
it would not complicate parsing much.

Nb. we don't use trick of 'parse metadata to hash' in neither example,
nor filter driver used in test...

[...]
>>> +
>>> +After the filter has processed a blob it is expected to wait for
>>> +the next "key=value" list containing a command. Git will close
>>> +the command pipe on exit. The filter is expected to detect EOF
>>> +and exit gracefully on its own.

Is this still true?

>>
>> Good to have it documented.  
>>
>> Anyway, as it is Git command that spawns the filter driver process,
>> assuming that the filter process doesn't daemonize itself, wouldn't
>> the operating system reap it after its parent process, that is the
>> git command it invoked, dies? So detecting EOF is good, but not
>> strictly necessary for simple filter that do not need to free
>> its resources, or can leave freeing resources to the operating
>> system? But I may be wrong here.
> 
> The filter process runs independent of Git.

Ah.  So without some way to tell long-lived filter process that
it can shut down, because no further data will be incoming, or
killing it by Git, it would hang indefinitely?

[...]
>>> +    }
>>> +    elsif ( $pkt_size > 4 ) {
>>
>> Isn't a packet of $pkt_size == 4 a valid packet, a keep-alive
>> one?  Or is it forbidden?
> 
> "Implementations SHOULD NOT send an empty pkt-line ("0004")."
> Source: Documentation/technical/protocol-common.txt

All right.  Not that we need keep-alive packet for communication
between two processes on the same host.

Though I wonder why this rule is here...

[...]
>>> +( packet_txt_read() eq ( 0, "git-filter-client" ) ) || die "bad initialize";
>>> +( packet_txt_read() eq ( 0, "version=2" ) )         || die "bad version";
>>> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad version end";
>>
>> Actually, it is overly strict.  It should not fail if there
>> are other "version=3", "version=4" etc. lines.
> 
> True, but I think for an example this is OK. I'll add a note
> to the file header.

Anyway it would be better to have helper subroutine to get metadata
lines (packets till flush) and parse them into array or a hash...

>>> +
>>> +while (1) {
>>> +    my ($command)  = packet_txt_read() =~ /^command=([^=]+)$/;
>>> +    my ($pathname) = packet_txt_read() =~ /^pathname=([^=]+)$/;
>>
>> Do we require this order?  If it is, is that explained in the
>> documentation?
> 
> Git sends that order right now but the filter should not rely
> on that order.

...and the latter would make ignoring order of lines simpler.

Though I wonder if we can ensure in the protocol documentation
that those lines would always be send in this order.

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-09-30 19:38     ` Lars Schneider
@ 2016-10-04 21:00       ` Jakub Narębski
  2016-10-06 21:27         ` Lars Schneider
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Narębski @ 2016-10-04 21:00 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

[Some of answers and comments may got invalidated by v9]

W dniu 30.09.2016 o 21:38, Lars Schneider pisze:
>> On 27 Sep 2016, at 17:37, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> Part second of the review of 11/11.
[...]
>>> +
>>> +	if (start_command(process)) {
>>> +		error("cannot fork to run external filter '%s'", cmd);
>>> +		kill_multi_file_filter(hashmap, entry);
>>> +		return NULL;
>>> +	}
>>
>> I guess there is a reason why we init hashmap entry, try to start
>> external process, then kill entry of unable to start, instead of
>> trying to start external process, and adding hashmap entry when
>> we succeed?
> 
> Yes. This way I can reuse the kill_multi_file_filter() function.

I don't quite understand.  If you didn't fill the entry before
using start_command(process), you would not need kill_multi_file_filter(),
which in that case IIUC just removes the just created entry from hashmap.
Couldn't you add entry to hashmap in the 'else' part?  Or would it
be racy?

[...]
>>> +static void read_multi_file_filter_values(int fd, struct strbuf *status) {
>>
>> This is more
>>
>>  +static void read_multi_file_filter_status(int fd, struct strbuf *status) {
>>
>> It doesn't read arbitrary values, it examines 'metadata' from
>> filter for "status=<foo>" lines.
> 
> True!
>
>>> +		if (pair[0] && pair[0]->len && pair[1]) {
>>> +			if (!strcmp(pair[0]->buf, "status=")) {
>>> +				strbuf_reset(status);
>>> +				strbuf_addbuf(status, pair[1]);
>>> +			}
>>
>> So it is last status=<foo> line wins behavior?
> 
> Correct.

Perhaps this should be described in code comment.
 
>>>
>>> +	fflush(NULL);
>>
>> Why this fflush(NULL) is needed here?
> 
> This flushes all open output streams. The single filter does the same.

I know what it does, but I don't know why.  But "single filter does it"
is good enough for me.  Still would want to know why, though ;-)
 
>>>
>>> +	if (fd >= 0 && !src) {
>>> +		if (fstat(fd, &file_stat) == -1)
>>> +			return 0;
>>> +		len = xsize_t(file_stat.st_size);
>>> +	}
>>
>> Errr... is it necessary?  The protocol no longer provides size=<n>
>> hint, and neither uses such hint if provided.
> 
> We require the size in write_packetized_from_buf() later.

Don't we use write_packetized_from_fd() in the case of fd >= 0?

[...]

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-10-01 15:34     ` Lars Schneider
@ 2016-10-04 21:34       ` Jakub Narębski
  0 siblings, 0 replies; 71+ messages in thread
From: Jakub Narębski @ 2016-10-04 21:34 UTC (permalink / raw)
  To: Lars Schneider
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones

[Some of answers and comments may got invalidated by v9]

W dniu 01.10.2016 o 17:34, Lars Schneider pisze:
>> On 29 Sep 2016, at 01:14, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> Part third (and last) of the review of v8 11/11.

[...]
>>> +
>>> +test_expect_success PERL 'required process filter should filter data' '
>>> +	test_config_global filter.protocol.process "$TEST_DIRECTORY/t0021/rot13-filter.pl clean smudge" &&
>>> +	test_config_global filter.protocol.required true &&
>>> +	rm -rf repo &&
>>> +	mkdir repo &&
>>> +	(
>>> +		cd repo &&
>>> +		git init &&
>>
>> Don't you think that creating a fresh test repository for each
>> separate test is a bit too much?  I guess that you want for
>> each and every test to be completely independent, but this setup
>> and teardown is a bit excessive.
>>
>> Other tests in the same file (should we reuse the test, or use
>> new test file) do not use this method.
> 
> I see your point. However, I am always annoyed if Git tests are
> entangled because it makes working with them way way harder.
> This test test runs in 4.5s on a slow Travis CI machine. I think
> that is OK considering that we have tests running 3.5min (t3404).

All right, this is good argument... though inconsistency (assuming
that we don't switch to separate test for multi-file filter) could
be argument against.

Though I wonder if test preparation could be extracted into a
common function...

[...]
>>> +		cp ../test.o test.r &&
>>> +		cp ../test2.o test2.r &&
>>
>> What does this test2.o / test2.r file tests, that test.o / test.r
>> doesn't?  The name doesn't tell us.
> 
> This just tests multiple files with different content.

All right, so it is here to test multiple files (and that filter
actually process multiple files).

>> Why it is test.r, but test2.r?  Why it isn't test1.r?
> 
> test.r already existed (created in setup test).

With each test in separate repository we could copy test.r prepared
in 'setup' into test1.r in each of multi-file tests.

[...]
>>> +		>test4-empty.r &&
>>
>> You test ordinary file, file in subdirectory, file with filename
>> containing spaces, and an empty file.
>>
>> Other tests of single file `clean`/`smudge` filters use filename
>> that requires mangling; maybe we should use similar file?
>>
>>        special="name  with '\''sq'\'' and \$x" &&
>>        echo some test text >"$special" &&
> 
> OK.

This test was required because of %f to pass filename as parameter
coupled with the fact that we use `clean` and `smudge` as shell
script fragment (so e.g. pipes and redirection would work in
one-shot filter definition).

This is not the case with multi-file filter, where filenames are
passed internally, and we don't need to worry about shell quoting
at all.
 
>> In case of `process` filter, a special filename could look like
>> this:
>>
>>        process_special="name=with equals and\nembedded newlines\n" &&
>>        echo some test text >"$process_special" &&
> 
> I think this test would create trouble on Windows. I'll stick to
> the special characters used in the single shot filter.

This would test... example parser.  Well, all right, better not
give problems for Windows.

But... you can create such files in Git Bash:

  $ touch "$(echo -n -e "fo=o\nbar\n")"

and though they look strange

  $ ls -1 fo*
  'fo=o'$'\n''bar'

but work all right

  $ echo "strangest" >>"$(echo -n -e "fo=o\nbar\n")"
  $ name="$(echo -n -e "fo=o\nbar\n")"
  $ cat "$name"
  strangest

>> Third, why the filter even writes output size? It is no longer
>> part of `process` filter driver protocol, and it makes test more
>> fragile.
> 
> I would prefer to leave that in. I think it is good for the test to
> check that we are transmitting the amount of content that what we 
> think we transmit.

Right, we test that we processed full file this way, in the multi
packet test. 

>>> +				<<-\EOF &&
>>> +					START
>>> +					wrote filter header
>>> +					STOP
>>> +				EOF
>>
>> Why is even filter process invoked?  If this is not expected, perhaps
>> simply ignore what checking out almost empty branch (one without any
>> files marked for filtering) does.
>>
>> Shouldn't we test_expect_failure no-call?
> 
> Because a clean operation could happen. I added a clean operation to
> the expected log in order to make this visible (expected log is stripped
> of clean operations in the same way as the actual log per your suggestion
> above).

If we are testing that if there is no "smudge" capability, then
there were no "smudge" operations, why we don't test just that:
that grepping for "smudge" in long doesn't find anything.

Current version feels convoluted (and would stop working if Git
is improved to not run "clean" in this case for optimization).
 
>>> +
>>> +		check_filter_ignore_clean \
>>> +			git checkout master \
>>
>> Does this checks different code path than 'git checkout .'? For
>> example, does this test increase code coverage (e.g. as measured
>> by gcov)?  If not, then this test could be safely dropped.
> 
> We checked out the "empty-branch" before. That's why we check here
> that the smudge filter runs for all files (smudge filter did not run
> for all files with `git checkout .`).

All right, it runs for more files, but does it cover different
code paths?  If not, it only makes test run longer...

>>> +				<<-\EOF &&
>>> +					START
>>> +					wrote filter header
>>> +					IN: smudge test.r 57 [OK] -- OUT: 57 . [OK]
>>> +					IN: smudge test2.r 14 [OK] -- OUT: 14 . [OK]
>>> +					IN: smudge test4-empty.r 0 [OK] -- OUT: 0  [OK]
>>> +					IN: smudge testsubdir/test3 - subdir.r 23 [OK] -- OUT: 23 . [OK]
>>
>> Can we assume that Git would pass files to filter in alphabetical
>> order?  This assumption might make the test unnecessary fragile.
> 
> I have never experienced another behavior. If we see fragility we could
> sort the result...
 
All right (perhaps comment for future would be good idea, though).
 
>>>
>>> +test_expect_success PERL 'required process filter should clean only and take precedence' '
>>
>> Trying to describe it better results in overly long description,
>> which probably means that this test should be split into few
>> smaller ones:
>>
>> - `process` filter takes precedence over `clean` and/or `smudge`
>>   filters, regardless if it supports relevant ("clean" or "smudge")
>>   capability or not
>>
>> - `process` filter that includes only "clean" capability should
>>   clean only (be used only for 'clean' operation)
> 
> Agreed!
> 
> 
>> In my opinion all functions should be placed at beginning,
>> or even in separate file (if they are used in more than
>> one test).
> 
> OK
> 
> 
>>> +generate_test_data () {
>>
>> The name is not good, it doesn't describe what kind of data
>> we want to generate.
> 
> "generate_random_characters" ok?!

All right.

[...]
>>> +		echo "*.file filter=protocol" >.gitattributes &&
>>> +		check_filter \
>>> +			git add *.file .gitattributes \
>>
>> Should it be shell expansion, or git expansion, that is
>>
>>   			git add '*.file' .gitattributes
> 
> Both have the same output. Would the difference matter?

In one case *.file is expanded by shell, then expansion passed
as parameters to `git add` (perhaps not on MS Windows); in the
other '*.file' is passed as pattern to `git add` and expanded
by Git itself (this might be case for both patterns on Win).

But this doesn't matter here, anyway. I think.

[...]
>>> +
>>> +test_expect_success PERL 'process filter should not restart in case of an error' '
>>
>> Errr... what? This description is not clear.  Did you mean
>> that filter should not be restarted if it *signals* an error
>> with file (either before sending anything, or after sending
>> partial contents)?
> 
> OK renamed to "process filter should not be restarted if it signals an error"

This is better.
 
>>> +test_expect_success PERL 'process filter should be able to signal an error for all future files' '
>>
>> Did you mean here that filter can abort processing of
>> all future files?
> 
> "process filter signals abort once to abort processing of all future files", better?

"process filter aborting stops processing of all further files", maybe?

>>> +
>>> +		cp ../test.o test.r &&
>>> +		test_must_fail git add . 2> git_stderr.log &&
>>> +		grep "not support long running filter protocol" git_stderr.log
>>
>> Shouldn't this use gettext poison (or rather C locale)?
>> This error message could be translated in the future.
> 
> I would prefer to adjust that when we translate it.

All right, good enough.

>>> +    $str =~ y/A-Za-z/N-ZA-Mn-za-m/;
>>
>> Why not use tr/// version of this quote-like operation?
>> Or do you follow prior art here?
> 
> I am not Perl expert. That worked for me :-)

y/// and tr/// are the same operator.  Though y/// is supposedly
more Perl-ish, and I think it is used more in Git tests (or rather
it functions).

[...]
>>> +packet_flush();
>>> +print $debug "wrote filter header\n";
>>
>> Or perhaps "handshake end"?
> 
> "init handshake complete", ok?

Better.
 
>>> +    print $debug " $pathname";
>>
>> No " pathname=$pathname" ?
> 
> Yes, otherwise it gets too verbose in the tests.

All right.  And the lines gets too long.

[...]

Regards,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-04 19:04                   ` Jakub Narębski
@ 2016-10-06 13:13                     ` Lars Schneider
  2016-10-06 16:01                       ` Jeff King
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Schneider @ 2016-10-06 13:13 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Torsten Bögershausen, git, Jeff King,
	Stefan Beller, Martin-Louis Bright, Ramsay Jones


> On 04 Oct 2016, at 21:04, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> W dniu 03.10.2016 o 19:13, Lars Schneider pisze: 
>>> On 01 Oct 2016, at 22:48, Jakub Narębski <jnareb@gmail.com> wrote:
>>> W dniu 01.10.2016 o 20:59, Lars Schneider pisze: 
>>>> On 29 Sep 2016, at 23:27, Junio C Hamano <gitster@pobox.com> wrote:
>>>>> Lars Schneider <larsxschneider@gmail.com> writes:
>>>>> 
>>>>> If the filter process refuses to die forever when Git told it to
>>>>> shutdown (by closing the pipe to it, for example), that filter
>>>>> process is simply buggy.  I think we want users to become aware of
>>>>> that, instead of Git leaving it behind, which essentially is to
>>>>> sweep the problem under the rug.
>>> 
>>> Well, it would be good to tell users _why_ Git is hanging, see below.
>> 
>> Agreed. Do you think it is OK to write the message to stderr?
> 
> On the other hand, this is why GIT_TRACE (and GIT_TRACE_PERFORMANCE)
> was invented for.  We do not signal troubles with single-shot filters,
> so I guess doing it for multi-file filters is not needed.

I am on the fence with this one.

@Junio/Peff:
Where would you prefer to see a "Waiting for filter 'XYZ'... " message?
On stderr or via GIT_TRACE?


> 
>>>>> I agree with what Peff said elsewhere in the thread; if a filter
>>>>> process wants to take time to clean things up while letting Git
>>>>> proceed, it can do its own process management, but I think it is
>>>>> sensible for Git to wait the filter process it directly spawned.
>>>> 
>>>> To realize the approach above I prototyped the run-command patch below:
>>>> 
>>>> I added an "exit_timeout" variable to the "child_process" struct.
>>>> On exit, Git will close the pipe to the process and wait "exit_timeout" 
>>>> seconds until it kills the child process. If "exit_timeout" is negative
>>>> then Git will wait until the process is done.
>>> 
>>> That might be good approach.  Probably the default would be to wait.
>> 
>> I think I would prefer a 2sec timeout or something as default. This way
>> we can ensure Git would not wait indefinitely for a buggy filter by default.
> 
> Actually this waiting for multi-file filter is only about waiting for
> the shutdown process of the filter.  The filter could still hang during
> processing a file, and git would hang too, if I understand it correctly.

Correct.


>> [...] this function is also used with the async struct... 
> 
> Hmmm... now I wonder if it is a good idea (similar treatment for
> single-file async-invoked filter, and multi-file pkt-line filters).
> 
> For single-file one-shot filter (correct me if I am wrong):
> 
> - git sends contents to filter, signals end with EOF
>   (after process is started)
> - in an async process:
>   - process is started
>   - git reads contents from filter, until EOF
>   - if process did not end, it is killed
> 
> 
> For multi-process pkt-line based filter (simplified):
> 
> - process is started
> - handshake
> - for each file
>   - file is send to filter process over pkt-line,
>     end signalled with flush packet
>   - git reads from filter from pkt-line, until flush
> - ...
> 
> 
> See how single-shot filter is sent EOF, though in different part
> of code.  We need to signal multi-file filter that no more files
> will be coming.  Simplest solution is to send EOF (we could send
> "command=shutdown" for example...) to filter, and wait for EOF
> from filter (or for "status=finished" and EOF).

That's what we do. EOF does signal the multi-filter to shutdown.


> For full patch, you would need also to add to Documentation/config.txt

Why config.txt?


Thanks,
Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-10-04 20:50       ` Jakub Narębski
@ 2016-10-06 13:16         ` Lars Schneider
  0 siblings, 0 replies; 71+ messages in thread
From: Lars Schneider @ 2016-10-06 13:16 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 04 Oct 2016, at 22:50, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> [Some of answers may get invalidated by v9]
> 
> W dniu 30.09.2016 o 20:56, Lars Schneider pisze:
>>> On 27 Sep 2016, at 00:41, Jakub Narębski <jnareb@gmail.com> wrote:
>>> 
>>>> +
>>>> +After the filter has processed a blob it is expected to wait for
>>>> +the next "key=value" list containing a command. Git will close
>>>> +the command pipe on exit. The filter is expected to detect EOF
>>>> +and exit gracefully on its own.
> 
> Is this still true?

Yes


>>> 
>>> Good to have it documented.  
>>> 
>>> Anyway, as it is Git command that spawns the filter driver process,
>>> assuming that the filter process doesn't daemonize itself, wouldn't
>>> the operating system reap it after its parent process, that is the
>>> git command it invoked, dies? So detecting EOF is good, but not
>>> strictly necessary for simple filter that do not need to free
>>> its resources, or can leave freeing resources to the operating
>>> system? But I may be wrong here.
>> 
>> The filter process runs independent of Git.
> 
> Ah.  So without some way to tell long-lived filter process that
> it can shut down, because no further data will be incoming, or
> killing it by Git, it would hang indefinitely?

Yes

- Lars

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-06 13:13                     ` Lars Schneider
@ 2016-10-06 16:01                       ` Jeff King
  2016-10-06 17:17                         ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Jeff King @ 2016-10-06 16:01 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jakub Narębski, Junio C Hamano, Torsten Bögershausen,
	git, Stefan Beller, Martin-Louis Bright, Ramsay Jones

On Thu, Oct 06, 2016 at 03:13:19PM +0200, Lars Schneider wrote:

> >>> Well, it would be good to tell users _why_ Git is hanging, see below.
> >> 
> >> Agreed. Do you think it is OK to write the message to stderr?
> > 
> > On the other hand, this is why GIT_TRACE (and GIT_TRACE_PERFORMANCE)
> > was invented for.  We do not signal troubles with single-shot filters,
> > so I guess doing it for multi-file filters is not needed.
> 
> I am on the fence with this one.
> 
> @Junio/Peff:
> Where would you prefer to see a "Waiting for filter 'XYZ'... " message?
> On stderr or via GIT_TRACE?

I am not sure if I have followed all of this discussion, but I am of the
opinion that Git should not be doing any timeouts at all. There are
simply too many places where the filter (or any other process that git
is depending on) could inexplicably hang, and I
do not want to pepper random timeouts for all parts of the procedure
where we say "woah, this is taking longer than expected" (nor do I want
to have a timeout for _one_ spot, and ignore all the others).

If this is debugging output of the form "now I am calling wait() on all
of the filters", without respect to any timeout, that sounds reasonable.
Though I would argue that run-command should simply trace_printf()
any processes it is waiting for, which covers _any_ process, not just
the filters. That seems like a good match for the rest of the GIT_TRACE
output, which describes how and when we spawn the sub-processes.

Something like:

diff --git a/run-command.c b/run-command.c
index 5a4dbb6..b884605 100644
--- a/run-command.c
+++ b/run-command.c
@@ -226,6 +226,9 @@ static int wait_or_whine(pid_t pid, const char *argv0, int in_signal)
 	pid_t waiting;
 	int failed_errno = 0;
 
+	if (!in_signal)
+		trace_printf("waiting for pid %d", (int)pid);
+
 	while ((waiting = waitpid(pid, &status, 0)) < 0 && errno == EINTR)
 		;	/* nothing */
 	if (in_signal)

but it does not have to be part of this series, I think.

-Peff

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 00/11] Git filter protocol
  2016-10-06 16:01                       ` Jeff King
@ 2016-10-06 17:17                         ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2016-10-06 17:17 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Jakub Narębski, Torsten Bögershausen,
	git, Stefan Beller, Martin-Louis Bright, Ramsay Jones

Jeff King <peff@peff.net> writes:

> I am not sure if I have followed all of this discussion, but I am of the
> opinion that Git should not be doing any timeouts at all.
> ...
> If this is debugging output of the form "now I am calling wait() on all
> of the filters", without respect to any timeout, that sounds reasonable.
> Though I would argue that run-command should simply trace_printf()
> any processes it is waiting for, which covers _any_ process, not just
> the filters. That seems like a good match for the rest of the GIT_TRACE
> output, which describes how and when we spawn the sub-processes.

Yup, I agree that having trace_printf() report the wait for any
process is the cleanest way to go.  As you guessed the reason why
Lars is bringing up "now we are waiting for this filter" is because
I suggested it as a way to encourage users to file bugs when they
see a hung Git.  Originally Lars wanted to have a timeout on wait
and after the timeout wanted to kill the process, and because I
really did not want such a random "you are too slow to die, so I'll
send a signal to you and exit myself without making sure you died"
there, I suggested that if we were to have a timeout, that would be
to timeout the wait only to have a chance to tell the user "we are
stuck waiting on this thing" (and then go back to wait), as it would
either be a buggy filter (i.e. the users need to debug their own
filter code) or a buggy use of wait on Git side (i.e. we would want
to hear about such bugs from them).

Without such a "wait with timeout so that we can tell the user", we
can still respond to "my 'git checkout' hangs forever" with "try
running with GIT_TRACE" as you outlined above, so I do not think we
need the timeout.

Thanks for straightening us out.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v8 11/11] convert: add filter.<driver>.process option
  2016-10-04 21:00       ` Jakub Narębski
@ 2016-10-06 21:27         ` Lars Schneider
  0 siblings, 0 replies; 71+ messages in thread
From: Lars Schneider @ 2016-10-06 21:27 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: git, Jeff King, Junio C Hamano, Stefan Beller,
	Martin-Louis Bright, Torsten Bögershausen, Ramsay Jones


> On 04 Oct 2016, at 23:00, Jakub Narębski <jnareb@gmail.com> wrote:
> 
> [Some of answers and comments may got invalidated by v9]
> 
> W dniu 30.09.2016 o 21:38, Lars Schneider pisze:
>>> On 27 Sep 2016, at 17:37, Jakub Narębski <jnareb@gmail.com> wrote:
>>> 
>>> Part second of the review of 11/11.
> [...]
>>>> +
>>>> +	if (start_command(process)) {
>>>> +		error("cannot fork to run external filter '%s'", cmd);
>>>> +		kill_multi_file_filter(hashmap, entry);
>>>> +		return NULL;
>>>> +	}
>>> 
>>> I guess there is a reason why we init hashmap entry, try to start
>>> external process, then kill entry of unable to start, instead of
>>> trying to start external process, and adding hashmap entry when
>>> we succeed?
>> 
>> Yes. This way I can reuse the kill_multi_file_filter() function.
> 
> I don't quite understand.  If you didn't fill the entry before
> using start_command(process), you would not need kill_multi_file_filter(),
> which in that case IIUC just removes the just created entry from hashmap.
> Couldn't you add entry to hashmap in the 'else' part?  Or would it
> be racy?

You are right. I'll fix that.


>> 
>>>> +		if (pair[0] && pair[0]->len && pair[1]) {
>>>> +			if (!strcmp(pair[0]->buf, "status=")) {
>>>> +				strbuf_reset(status);
>>>> +				strbuf_addbuf(status, pair[1]);
>>>> +			}
>>> 
>>> So it is last status=<foo> line wins behavior?
>> 
>> Correct.
> 
> Perhaps this should be described in code comment.

OK


>>>> 
>>>> +	fflush(NULL);
>>> 
>>> Why this fflush(NULL) is needed here?
>> 
>> This flushes all open output streams. The single filter does the same.
> 
> I know what it does, but I don't know why.  But "single filter does it"
> is good enough for me.  Still would want to know why, though ;-)

TBH I am not 100% sure why, too. I think this ensures that we don't have 
any outdated/unrelated/previous data in the stream buffers.


>>>> 
>>>> +	if (fd >= 0 && !src) {
>>>> +		if (fstat(fd, &file_stat) == -1)
>>>> +			return 0;
>>>> +		len = xsize_t(file_stat.st_size);
>>>> +	}
>>> 
>>> Errr... is it necessary?  The protocol no longer provides size=<n>
>>> hint, and neither uses such hint if provided.
>> 
>> We require the size in write_packetized_from_buf() later.
> 
> Don't we use write_packetized_from_fd() in the case of fd >= 0?

Of course! Ah too many refactorings :-)
I'll remove that.

Thank you,
Lars


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2016-10-06 21:27 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-20 19:02 [PATCH v8 00/11] Git filter protocol larsxschneider
2016-09-20 19:02 ` [PATCH v8 01/11] pkt-line: rename packet_write() to packet_write_fmt() larsxschneider
2016-09-24 21:14   ` Jakub Narębski
2016-09-26 18:49     ` Lars Schneider
2016-09-28 23:15       ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 02/11] pkt-line: extract set_packet_header() larsxschneider
2016-09-24 21:22   ` Jakub Narębski
2016-09-26 18:53     ` Lars Schneider
2016-09-20 19:02 ` [PATCH v8 03/11] run-command: move check_pipe() from write_or_die to run_command larsxschneider
2016-09-24 22:12   ` Jakub Narębski
2016-09-26 16:13     ` Lars Schneider
2016-09-26 16:21       ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 04/11] pkt-line: add packet_write_fmt_gently() larsxschneider
2016-09-24 22:27   ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 05/11] pkt-line: add packet_flush_gently() larsxschneider
2016-09-24 22:56   ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 06/11] pkt-line: add packet_write_gently() larsxschneider
2016-09-25 11:26   ` Jakub Narębski
2016-09-26 19:21     ` Lars Schneider
2016-09-27  8:39       ` Jeff King
2016-09-27 19:33         ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 07/11] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
2016-09-25 13:46   ` Jakub Narębski
2016-09-26 20:23     ` Lars Schneider
2016-09-27  8:14       ` Lars Schneider
2016-09-27  9:00         ` Jeff King
2016-09-27 12:10           ` Lars Schneider
2016-09-27 12:13             ` Jeff King
2016-09-20 19:02 ` [PATCH v8 08/11] convert: quote filter names in error messages larsxschneider
2016-09-25 14:03   ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 09/11] convert: modernize tests larsxschneider
2016-09-25 14:43   ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 10/11] convert: make apply_filter() adhere to standard Git error handling larsxschneider
2016-09-25 14:47   ` Jakub Narębski
2016-09-20 19:02 ` [PATCH v8 11/11] convert: add filter.<driver>.process option larsxschneider
2016-09-26 22:41   ` Jakub Narębski
2016-09-30 18:56     ` Lars Schneider
2016-10-04 20:50       ` Jakub Narębski
2016-10-06 13:16         ` Lars Schneider
2016-09-27 15:37   ` Jakub Narębski
2016-09-30 19:38     ` Lars Schneider
2016-10-04 21:00       ` Jakub Narębski
2016-10-06 21:27         ` Lars Schneider
2016-09-28 23:14   ` Jakub Narębski
2016-10-01 15:34     ` Lars Schneider
2016-10-04 21:34       ` Jakub Narębski
2016-09-28 21:49 ` [PATCH v8 00/11] Git filter protocol Junio C Hamano
2016-09-29 10:28   ` Lars Schneider
2016-09-29 11:57     ` Torsten Bögershausen
2016-09-29 16:57       ` Junio C Hamano
2016-09-29 17:57         ` Lars Schneider
2016-09-29 18:18           ` Torsten Bögershausen
2016-09-29 18:38             ` Johannes Sixt
2016-09-29 21:27           ` Junio C Hamano
2016-10-01 18:59             ` Lars Schneider
2016-10-01 20:48               ` Jakub Narębski
2016-10-03 17:13                 ` Lars Schneider
2016-10-04 19:04                   ` Jakub Narębski
2016-10-06 13:13                     ` Lars Schneider
2016-10-06 16:01                       ` Jeff King
2016-10-06 17:17                         ` Junio C Hamano
2016-10-03 17:02               ` Junio C Hamano
2016-10-03 17:35                 ` Lars Schneider
2016-10-04 12:11                 ` Jeff King
2016-10-04 16:47                   ` Junio C Hamano
2016-09-29 18:02         ` Jeff King
2016-09-29 21:19           ` Junio C Hamano
2016-09-29 20:50         ` Lars Schneider
2016-09-29 21:12           ` Junio C Hamano
2016-09-29 20:59       ` Jakub Narębski
2016-09-29 21:17         ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).