git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Jakub Narębski" <jnareb@gmail.com>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Martin-Louis Bright" <mlbright@gmail.com>,
	"Eric Wong" <e@80x24.org>, "Jeff King" <peff@peff.net>
Subject: Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option)
Date: Wed, 3 Aug 2016 20:30:18 +0200	[thread overview]
Message-ID: <607c07fe-5b6f-fd67-13e1-705020c267ee@gmail.com> (raw)
In-Reply-To: <ABE7D2DB-C45F-4F29-8CC2-8D873FD6C36A@gmail.com>

[I'm sorry for taking so long in writing this, as I see there is v4 already]

Greetings,


I'll answer to individual emails in more detail later, but I'd like to
go back to the drawing board, and attempt to summarize the discussion and
the proposal so far.

The ultimate goal is to be able to run filter drivers faster for both `clean`
and `smudge` operations.  This is done by starting filter driver once per
git command invocation, instead of once per file being processed.  Git needs
to pass actual contents of files to filter driver, and get its output.

We want the protocol between Git and filter driver process to be extensible,
so that new features can be added without modifying protocol.


1. CONFIGURATION

As I wrote, there are different ways of configuring new-type filter driver:

 * Using a separate variable to mark filter as using new protocol
   (the original approach):

   	[filter "protocol"]
		protocolVersion = v2
		clean  = rot13-clean-filter.pl
		smudge = rot13-smudge-filter.pl

   PROS: allows to have separate clean and smudge filters
   CONS: does not allow using old-style per-file filter together with new;
         easy to make mistake and use old-style filter, leading to hang

 * Creating new variables for new filter type, separate for each phase,
   for example `cleanProcess` and `smudgeProcess` (or `processClean` and
   `processSmudge`).

   	[filter "protocol"]
   		cleanProcess  = rot13-clean-filter.pl
   		smudgeProcess = rot13-smudge-filter.pl

   PROS: allows to have separate clean and smudge filters;
         makes possible to use per-file and per-command filters together
   CONS: proliferation of additional variables, (esp. when extending it);
   NOTE: need to decide precedence between `clean` and `cleanProcess`, etc.

 # Using a single variable for new filter type, and decide on which phase
   (which operation) is supported by filter driver during the handshake
   *(current approach)*

   	[filter "protocol"]
   		process = rot13-filtes.pl

   PROS: per-file and per-command filters possible with precedence rule;
         extensible to other types of drivers: textconv, diff, etc.
         only one invocation for commands which use both clean and smudge
   CONS: need single driver to be responsible for both clean and smudge;
         need to run driver to know that it does not support given
           operation (workaround exists)


2. HANDSHAKE (INITIALIZATION)

Next, there is deciding on and designing the handshake between Git (between
Git command) and the filter driver process.  With the `filter.<driver>.process`
solution the driver needs to tell which operations among (for now) "clean"
and "smudge" it does support.  Plus it provides a way to extend protocol,
adding new features, like support for streaming, cleaning from file or
smudging to file, providing size upfront, perhaps even progress report.

Current handshake consist of filter driver printing a signature, version
number and capabilities, in that order.  Git checks that it is well formed
and matches expectations, and notes which of "clean" and "smudge" operations
are supported by the filter.

There is no interaction from the Git side in the handshake, for example to
set options and expectations common to all files being filtered.  Take
one possible extension of protocol: supporting streaming.  The filter
driver needs to know whether it needs to read all the input, or whether
it can start printing output while input is incoming (e.g. to reduce
memory consumption)... though we may simply decide it to be next version
of the protocol.

On the other hand if the handshake began with Git sending some initializer
info to the filter driver, we probably could detect one-shot filter
misconfigured as process-filter.

Note that we need some way of deciding where handshake ends, either by
specifying number of entries (currently: three lines / pkt-line packets),
or providing some terminator ("smart" transport protocol uses flush packet
for this).

Current handshake (in symbolic form):

    git< [signature]    git-filter-protocol
    git< [version]      version 2
    git< [capabilites]  clean smudge

It is expected that the handshake is limited to this information, and
that they are in this order; so naming them doesn't buy us much

    git< [capabilites]  capabilities clean smudge

or

    git< [capabilites]  capabilities=clean smudge

or

    git< [capabilites]  capabilities: clean smudge

If capabilities are to be third item, adding "capabilities", as if Git would
look at the name and select what to do based on this name, doesn't buy us
anything.  Well, beside self-documenting of the protocol.  The "smart" protocol
do not use "capabilities" as prefix/name either.

We would probably do not want to move from strict-order of information, that
is "positional parameters".  It would require to implement a parser, both for
the Git side and for the filter driver process side.

On the other hand requiring flush packet to end the handshake doesn't bring
much overhead (it is 4 bytes, it is not over the network), and improves
extendability.  Well, so does using names, be it "<var> <value>", 
"<var>=<value>", "<var>: <value>...", "<var>=[<value>, <value>...]", etc.


Let's take a look how other parts of Git communicate with external process
(a "helper").

The git-credential(1) protocol uses <variable>=<value> syntax.  But capabilities
form a list; "<var>=<val1> <val2>" doesn't look that well.  Credential helper
only uses scalar (single) values.

The gitremote-helpers(1) protocol is command / response; for example helper
responds to "capabilities" command with the list of capabilities.  Here commands
and parameters are space separated, e.g. "option <name> <value>".

The "smart" transport protocol (send-pack and receive-pack) had to (ab)use
a quirk of implementation to extend protocol with capabilities negotiation.
Here the capabilities list is sent without any prefix; some capabilities
are parametrized, and use <capability>=<value> syntax (for example
"symref=HEAD:refs/heads/master").  The handshake is closed with flush
packet, but as it consist of variable-length ref advertisement, it needs
to have explicit terminator of the each part of the "handshake".


3. SENDING CONTENTS (FILE TO BE FILTERED AND FILTER OUTPUT)

Next thing to design is decision how to send contents to be filtered
to the filter driver process, and how to get filtered output from the
filter driver process.

One thing I think we can agree on early, is sending data to filter
process on its standard input, and receiving filtered result from its
standard output.

Because Git is sending (and receiving) multiple files, it needs some
way to distinguish where one file ends and the next begins, in both
directions, to and from filter.  Also, the `clean` and `smudge`
filters support expansion of the '%f' placeholder, so at least 
some filter drivers need name of the file being filtered.  So the
protocol must send it somehow to the filter driver.

There are different approaches possible; here are ones that were used,
and ones I thought about.

 * Send whole data to filter at once, and receiver all data at once,
   for example using something akin to the 'tar' archive, or 
   uncompressed 'zip' archive (both are implemented in Git for the
   `git archive` command).  Or just list of sizes and pathnames,
   empty entry as terminator, and then contents of all files
   concatenated.

   PROS:
   - can use the one-shot infrastructure implemented already
   CONS:
   - complicates Git code and filter driver code unnecessarily
   - difficult to implement error handling, esp. soft errors
     on filter driver side (error for single file, perhaps during
     output)
   - in synchronous version (non-streaming) requires absurd amout
     of memory / storage for the filter driver process

 * Send/receive data file by file, using <size> + <content>,
   that is, send size (plus other data like the filename), then
   file contents.

   This was the protocol used in the first iteration of series.

   PROS:
   - simple to implement on Git and on filter driver side
   NOTE:
   - you need to loop over read / user read_in_full anyway
   CONS:
   - no way to signal an error encountered during output, e.g. LFS
     network/server failure for after some contents were actually
     sent
   - impossible to implement streaming for filters that do not
     know size of output without examining full input

 # Send/receive data file by file, using some kind of chunking,
   with a end-of-file marker.  The solution used by Git is
   pkt-line, with flush packet used to signal end of file.

   This is protocol used by the current implementation.

   PROS:
   - no need to know size upfront, so easier streaming support
   - you can signal error that happened during output, after
     some data were sent, as well as error known upfront
   - tracing support for free (GIT_TRACE_PACKET)
   CONS:
   - filter driver program slightly more difficult to implement
   - some negligible amount of overhead

If we want in the end to implement streaming, then the last solution
is the way to go.


4. PER-FILE HANDSHAKE - SENDING FILE TO FILTER

Let's assume that for simplicity we want to implement (for now) only
the synchronous (non-streaming) case, where we send whole contents
of a file to filter driver process, and *then* read filter driver
output.  This is enough for git-LFS solutions, which were the reason
for this patch series.  But we want to keep the protocol flexible
enough so that streaming and other features could be added easily.

First, if we choose the solution where one process is responsible
for both "clean" and "smudge" operations (and in the future possibly
also "cleanFromFile" and "smudgeToFile"), Git needs to tell the
driver which operation to perform.

Together with operation Git can send additional information
(sub-capabilities)... or we can use a separate line / packet to
send it.

If we are using pkt-line, then the convention is that text lines
are terminated using LF ("\n") character.  This needs to be stated
explicitly in the documentation for filter.<driver>.process writers.

    git> packet:  [operation] clean size=67\n

We could denote that it is operation name, but it is obvious from
position in the stream, thus not really needed.

Then we need to provide the filename; some filters supposedly need
this ('%f' in per-file `clean` / `smudge`).  Note that filename can
contain internal space characters, and could contain newlines, equal
signs; anything that is not NUL ("\0") character.

    git> packet:  [pathname] subdir/sample-file.r\n

In most cases filename would be text, so perhaps we should use "\n"
terminator (which filter driver would have to strip).  We could use
"filename=" prefix, but it is not necessary.  We know where / when
to expect the pathname (relative to project root).

If we would want to be able to add variable number of packets to
the handshake, then Git should send flush packet to signal the
end of the handshake.  But IMVHO it is unnecessary complication
of the protocol; there is enough flexibility in it.  We know
that handshake consists of two packets.

The Git would sent contents of the file to be filtered, using
as many pack lines as needed (note: large file support needs
to be tested, at least as expensive test).  Flush packet is
used to signal the end of the file.

    git> packets:  <file contents>
    git> flush packet


5. FILTER DRIVER PROCESS RESPONSE

First filter should, in my opinion, reply that it received the
request (or the command, in the case of streaming supported).
Also, in this response it can provide further information to
Git process.

    git< packet: [received]  ok size=67\n

This response could be used to refuse to filter specific file
upfront (for example if the file is not present in the artifactory
for git-LFS solutions).

   git< packet: [rejected]  reject\n

We can even provide the reasoning to Git (maybe in the future
extension)... or filter driver can print the explanation to the
standard error (but then, no --quiet / --verbose support).

   git< packet: [rejected]  reject with-message\n
   git< packet: [message]   File not found on server\n
   git< flush packet

Another response, which I think should be standarized, or at
least described in the documentation, is filter driver refusing
to filter further (e.g. git-LFS and network is down), to be not
restarted by Git.

   git< packet: [quit]      quit msg=Server error\n

or

   git< packet: [quit]      quit Server error\n

or

   git< packet: [quit]      quit with-message\n
   git< packet: [message]   Server error\n
   git< flush packet

Maybe this is over-engineering, but I don't think so.

Next comes the output from the filter driver (filtered contents),
using possibly multiple pkt-lines, ending with a flush packet:

    git< packets:  <filtered contents>
    git< flush packet

Note that empty file would consist of zero pack lines of contents,
and one flush packet.

Finally, to allow handling of [resumable] errors that occurred
during sending file contents, especially for the future streaming
filters case, we want to confirm that we send whole file
successfully.

    git< packet: [status]   success\n

If there was an error during process, making data receives so far
invalid, filter driver should tell about it

    git< packet: [status]   fail\n

or

    git< packet: [status]   reject\n

This may happen for example for UCS-2 <-> UTF-8 filter when invalid
byte sequence is encountered.  This may happen for git-LFS if the
server fails during fetch, and spare / slave server doesn't have
a file.

We may want to quit filtering at this point, and not to send another
file.

   git< packet: [status]    quit\n

There is place for extra information after the status, and in the
future we can allow variable length information too.


Best,
-- 
Jakub Narębski

  reply	other threads:[~2016-08-03 18:31 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20160727000605.49982-1-larsxschneider%40gmail.com/>
2016-07-29 23:37 ` [PATCH v3 00/10] Git filter protocol larsxschneider
2016-07-29 23:37   ` [PATCH v3 01/10] pkt-line: extract set_packet_header() larsxschneider
2016-07-30 10:30     ` Jakub Narębski
2016-08-01 11:33       ` Lars Schneider
2016-08-03 20:05         ` Jakub Narębski
2016-08-05 11:52           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 02/10] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
2016-07-30 10:49     ` Jakub Narębski
2016-08-01 12:00       ` Lars Schneider
2016-08-03 20:12         ` Jakub Narębski
2016-08-05 12:02           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 03/10] pkt-line: add packet_flush_gentle() larsxschneider
2016-07-30 12:04     ` Jakub Narębski
2016-08-01 12:28       ` Lars Schneider
2016-07-31 20:36     ` Torstem Bögershausen
2016-07-31 21:45       ` Lars Schneider
2016-08-02 19:56         ` Torsten Bögershausen
2016-08-05  9:59           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 04/10] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
2016-07-30 12:29     ` Jakub Narębski
2016-08-01 12:18       ` Lars Schneider
2016-08-03 20:15         ` Jakub Narębski
2016-07-29 23:37   ` [PATCH v3 05/10] pack-protocol: fix maximum pkt-line size larsxschneider
2016-07-30 13:58     ` Jakub Narębski
2016-08-01 12:23       ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 06/10] run-command: add clean_on_exit_handler larsxschneider
2016-07-30  9:50     ` Johannes Sixt
2016-08-01 11:14       ` Lars Schneider
2016-08-02  5:53         ` Johannes Sixt
2016-08-02  7:41           ` Lars Schneider
2016-07-29 23:37   ` [PATCH v3 07/10] convert: quote filter names in error messages larsxschneider
2016-07-29 23:37   ` [PATCH v3 08/10] convert: modernize tests larsxschneider
2016-07-29 23:38   ` [PATCH v3 09/10] convert: generate large test files only once larsxschneider
2016-07-29 23:38   ` [PATCH v3 10/10] convert: add filter.<driver>.process option larsxschneider
2016-07-30 22:05     ` Jakub Narębski
2016-07-31  9:42       ` Jakub Narębski
2016-07-31 19:49         ` Lars Schneider
2016-07-31 22:59           ` Jakub Narębski
2016-08-01 13:32       ` Lars Schneider
2016-08-03 18:30         ` Jakub Narębski [this message]
2016-08-05 10:32           ` Designing the filter process protocol (was: Re: [PATCH v3 10/10] convert: add filter.<driver>.process option) Lars Schneider
2016-08-06 18:24           ` Lars Schneider
2016-08-03 22:47         ` [PATCH v3 10/10] convert: add filter.<driver>.process option Jakub Narębski
2016-07-31 22:19     ` Jakub Narębski
2016-08-01 17:55       ` Lars Schneider
2016-08-04  0:42         ` Jakub Narębski
2016-08-03 13:10       ` Lars Schneider
2016-08-04 10:18         ` Jakub Narębski
2016-08-05 13:20           ` Lars Schneider
2016-08-03 16:42   ` [PATCH v4 00/12] Git filter protocol larsxschneider
2016-08-03 16:42     ` [PATCH v4 01/12] pkt-line: extract set_packet_header() larsxschneider
2016-08-03 20:18       ` Junio C Hamano
2016-08-03 21:12         ` Jeff King
2016-08-03 21:27           ` Jeff King
2016-08-04 16:14           ` Junio C Hamano
2016-08-05 14:55             ` Lars Schneider
2016-08-05 16:31               ` Junio C Hamano
2016-08-05 17:31             ` Lars Schneider
2016-08-05 17:41               ` Junio C Hamano
2016-08-03 21:56         ` Lars Schneider
2016-08-03 16:42     ` [PATCH v4 02/12] pkt-line: add direct_packet_write() and direct_packet_write_data() larsxschneider
2016-08-03 16:42     ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() larsxschneider
2016-08-03 21:39       ` Jeff King
2016-08-03 22:56         ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
2016-08-03 22:56           ` [PATCH 1/7] trace: handle NULL argument in trace_disable() Jeff King
2016-08-03 22:58           ` [PATCH 2/7] trace: stop using write_or_whine_pipe() Jeff King
2016-08-03 22:58           ` [PATCH 3/7] trace: use warning() for printing trace errors Jeff King
2016-08-04 20:41             ` Junio C Hamano
2016-08-04 21:21               ` Jeff King
2016-08-04 21:28                 ` Junio C Hamano
2016-08-05  7:56                   ` Jeff King
2016-08-05  7:59                   ` Christian Couder
2016-08-05 18:41                     ` Junio C Hamano
2016-08-03 23:00           ` [PATCH 4/7] trace: cosmetic fixes for error messages Jeff King
2016-08-04 20:42             ` Junio C Hamano
2016-08-05  8:00               ` Jeff King
2016-08-03 23:00           ` [PATCH 5/7] trace: correct variable name in write() error message Jeff King
2016-08-03 23:01           ` [PATCH 6/7] trace: disable key after write error Jeff King
2016-08-04 20:45             ` Junio C Hamano
2016-08-04 21:22               ` Jeff King
2016-08-05  7:58                 ` Jeff King
2016-08-03 23:01           ` [PATCH 7/7] write_or_die: drop write_or_whine_pipe() Jeff King
2016-08-03 23:04           ` [PATCH 0/7] minor trace fixes and cosmetic improvements Jeff King
2016-08-04 16:16         ` [PATCH v4 03/12] pkt-line: add packet_flush_gentle() Junio C Hamano
2016-08-03 16:42     ` [PATCH v4 04/12] pkt-line: call packet_trace() only if a packet is actually send larsxschneider
2016-08-03 16:42     ` [PATCH v4 05/12] pkt-line: add functions to read/write flush terminated packet streams larsxschneider
2016-08-03 16:42     ` [PATCH v4 06/12] pack-protocol: fix maximum pkt-line size larsxschneider
2016-08-03 16:42     ` [PATCH v4 07/12] run-command: add clean_on_exit_handler larsxschneider
2016-08-03 21:24       ` Jeff King
2016-08-03 22:15         ` Lars Schneider
2016-08-03 22:53           ` Jeff King
2016-08-03 23:09             ` Lars Schneider
2016-08-03 23:15               ` Jeff King
2016-08-05 13:08                 ` Lars Schneider
2016-08-05 21:19                   ` Torsten Bögershausen
2016-08-05 21:50                     ` Lars Schneider
2016-08-03 16:42     ` [PATCH v4 08/12] convert: quote filter names in error messages larsxschneider
2016-08-03 16:42     ` [PATCH v4 09/12] convert: modernize tests larsxschneider
2016-08-03 16:42     ` [PATCH v4 10/12] convert: generate large test files only once larsxschneider
2016-08-03 16:42     ` [PATCH v4 11/12] convert: add filter.<driver>.process option larsxschneider
2016-08-03 17:45       ` Junio C Hamano
2016-08-03 21:48         ` Lars Schneider
2016-08-03 22:46           ` Jeff King
2016-08-05 12:53             ` Lars Schneider
2016-08-03 20:29       ` Junio C Hamano
2016-08-03 21:37         ` Lars Schneider
2016-08-03 21:43           ` Junio C Hamano
2016-08-03 22:01             ` Lars Schneider
2016-08-05 21:34       ` Torsten Bögershausen
2016-08-05 21:49         ` Lars Schneider
2016-08-05 22:06         ` Junio C Hamano
2016-08-05 22:27           ` Jeff King
2016-08-06 11:55             ` Lars Schneider
2016-08-06 12:14               ` Jeff King
2016-08-06 18:19                 ` Lars Schneider
2016-08-08 15:02                   ` Jeff King
2016-08-08 16:21                     ` Lars Schneider
2016-08-08 16:26                       ` Jeff King
2016-08-06 20:40           ` Torsten Bögershausen
2016-08-03 16:42     ` [PATCH v4 12/12] convert: add filter.<driver>.process shutdown command option larsxschneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=607c07fe-5b6f-fd67-13e1-705020c267ee@gmail.com \
    --to=jnareb@gmail.com \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    --cc=mlbright@gmail.com \
    --cc=peff@peff.net \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).