From: Stefan Beller <sbeller@google.com>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Jakub Narębski" <jnareb@gmail.com>,
	mlbright@gmail.com, "Eric Wong" <e@80x24.org>,
	"Jeff King" <peff@peff.net>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	ben@wijen.net
Subject: Re: [PATCH v5 14/15] convert: add filter.<driver>.process option
Date: Fri, 12 Aug 2016 09:33:18 -0700
Message-ID: <CAGZ79kboxgBRHSa2s7CKZ1Uo=13WT=rT8VHCNJNj_Q9jQzZAYw@mail.gmail.com>
In-Reply-To: <20160810130411.12419-15-larsxschneider@gmail.com>

On Wed, Aug 10, 2016 at 6:04 AM,  <larsxschneider@gmail.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.
>
> In a preliminary performance test this developer used a clean/smudge filter
> written in golang to filter 12,000 files. This process took 364s with the
> existing filter mechanism and 5s with the new mechanism. See details here:
> https://github.com/github/git-lfs/pull/1382
>
> This patch adds the `filter.<driver>.process` string option which, if used,
> keeps the external filter process running and processes all blobs with
> the packet format (pkt-line) based protocol over standard input and standard
> output described below.
>
> Git starts the filter when it encounters the first file
> that needs to be cleaned or smudged. After the filter started
> Git sends a welcome message, a list of supported protocol
> version numbers, and a flush packet. Git expects to read the
> welcome message and one protocol version number from the
> previously sent list. Afterwards Git sends a list of supported
> capabilities and a flush packet. Git expects to read a list of
> desired capabilities, which must be a subset of the supported
> capabilities list, and a flush packet as response:
> ------------------------
> packet:          git> git-filter-client
> packet:          git> version=2
> packet:          git> version=42
> packet:          git> 0000
> packet:          git< git-filter-server
> packet:          git< version=2

Is what follows specific to version=2?
A hypothetical version 42 might deem capabilities a bad idea.

> packet:          git> clean=true
> packet:          git> smudge=true
> packet:          git> not-yet-invented=true
> packet:          git> 0000
> packet:          git< clean=true
> packet:          git< smudge=true
> packet:          git< 0000
> ------------------------
> Supported filter capabilities in version 2 are "clean" and
> "smudge".

I assume version 2 is an example here and we actually start with v1?

Can you clarify why we need welcome messages?
(Is there a technical reason, or better debuggability for humans?)
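
As an aside, to make the framing in these traces concrete, here is a
minimal sketch of how a single pkt-line is built: four hex digits giving
the total length (header included), followed by the payload, with "0000"
serving as the flush packet. The helper name format_pkt_line() and the
macros are made up for this illustration; they are not what pkt-line.c
provides.

```c
#include <stdio.h>
#include <string.h>

#define PKT_HEADER_LEN 4
#define PKT_FLUSH "0000" /* a flush packet carries no payload */

/*
 * Hypothetical helper, for illustration only: frame `payload` as one
 * pkt-line in `out`. Returns the number of bytes written (excluding
 * the terminating NUL), or -1 if the result would not fit.
 */
static int format_pkt_line(char *out, size_t out_sz, const char *payload)
{
	size_t total = strlen(payload) + PKT_HEADER_LEN;

	/*
	 * The four-digit hex header cannot express more than 0xffff
	 * (Git's real limit is lower than that).
	 */
	if (total > 0xffff || total + 1 > out_sz)
		return -1;
	snprintf(out, out_sz, "%04x%s", (unsigned int)total, payload);
	return (int)total;
}
```

Under these assumptions a payload of "version=2" (without a trailing
newline) would be framed as "000dversion=2".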

>
> Afterwards Git sends a list of "key=value" pairs terminated with
> a flush packet. The list will contain at least the filter command
> (based on the supported capabilities) and the pathname of the file
> to filter relative to the repository root. Right after these packets
> Git sends the content split in zero or more pkt-line packets and a
> flush packet to terminate content.
> ------------------------
> packet:          git> command=smudge\n
> packet:          git> pathname=path/testfile.dat\n
> packet:          git> 0000
> packet:          git> CONTENT
> packet:          git> 0000
> ------------------------
>
> The filter is expected to respond with a list of "key=value" pairs
> terminated with a flush packet. If the filter does not experience
> problems then the list must contain a "success" status. Right after
> these packets the filter is expected to send the content in zero
> or more pkt-line packets and a flush packet at the end. Finally, a
> second list of "key=value" pairs terminated with a flush packet
> is expected. The filter can change the status in the second list.
> ------------------------
> packet:          git< status=success\n
> packet:          git< 0000
> packet:          git< SMUDGED_CONTENT
> packet:          git< 0000
> packet:          git< 0000  # empty list!
> ------------------------
>
> If the result content is empty then the filter is expected to respond
> with a success status and an empty list.
> ------------------------
> packet:          git< status=success\n
> packet:          git< 0000
> packet:          git< 0000  # empty content!
> packet:          git< 0000  # empty list!
> ------------------------

Why do we need the last flush packet? Wouldn't we expect exactly as many
status responses as blobs we send out? Or do we plan on interleaved
operation, i.e. Git sends out 10 files faster than the filter process can
handle them, and the answers trickle in slowly?
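
To check my reading of the response layout described above, here is a
rough sketch of how the reader side could arrive at the final status:
read the first "key=value" list, skip the content, then let the second
list override the status. The packet stream is modelled as an in-memory
array in which NULL stands for a flush packet; that model and all
function names are my own assumptions, not code from the patch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Consume "key=value" packets up to and including the next flush
 * (NULL). If a "status=" key is seen, copy its value (without the
 * trailing newline) into `status`. Returns the index after the flush.
 */
static size_t read_status_list(const char **pkts, size_t n, size_t pos,
			       char *status, size_t status_sz)
{
	for (; pos < n && pkts[pos]; pos++) {
		if (!strncmp(pkts[pos], "status=", 7)) {
			snprintf(status, status_sz, "%s", pkts[pos] + 7);
			status[strcspn(status, "\n")] = '\0';
		}
	}
	return pos < n ? pos + 1 : pos;
}

/* Skip content packets up to and including the terminating flush. */
static size_t skip_content(const char **pkts, size_t n, size_t pos)
{
	while (pos < n && pkts[pos])
		pos++;
	return pos < n ? pos + 1 : pos;
}

/* Returns 1 if the response ends with status "success", 0 otherwise. */
static int response_succeeded(const char **pkts, size_t n)
{
	char status[32] = "";
	size_t pos = read_status_list(pkts, n, 0, status, sizeof(status));

	if (strcmp(status, "success"))
		return 0; /* "error" or "error-all": no content follows */
	pos = skip_content(pkts, n, pos);
	/* An empty second list leaves the first status untouched. */
	read_status_list(pkts, n, pos, status, sizeof(status));
	return !strcmp(status, "success");
}
```

In this model the trailing flush is what terminates the (possibly empty)
second list, which is where a late "status=error" would show up.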

>
> In case the filter cannot or does not want to process the content,
> it is expected to respond with an "error" status. Depending on the
> `filter.<driver>.required` flag Git will interpret that as error
> but it will not stop or restart the filter process.
> ------------------------
> packet:          git< status=error\n
> packet:          git< 0000
> ------------------------
>
> In case the filter cannot or does not want to process the content
> as well as any future content for the lifetime of the Git process,
> it is expected to respond with an "error-all" status. Depending on
> the `filter.<driver>.required` flag Git will interpret that as error
> but it will not stop or restart the filter process.
> ------------------------
> packet:          git< status=error-all\n
> packet:          git< 0000
> ------------------------
>
> If the filter experiences an error during processing, then it can
> send the status "error". Depending on the `filter.<driver>.required`
> flag Git will interpret that as error but it will not stop or restart
> the filter process.
> ------------------------
> packet:          git< status=success\n

So the first success is essentially meaningless?
Would it make sense to move the success status after the content
in all cases?

> packet:          git< 0000
> packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
> packet:          git< 0000
> packet:          git< status=error\n
> packet:          git< 0000
> ------------------------
>
> If the filter dies during the communication or does not adhere to
> the protocol then Git will stop the filter process and restart it
> with the next file that needs to be processed.
>
> After the filter has processed a blob it is expected to wait for
> the next "key=value" list containing a command. When the Git process
> terminates, it will send a kill signal to the filter in that stage.
>
> If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
> is configured then these commands always take precedence over
> a configured `filter.<driver>.process` command.

Okay. I think you can omit most of the commit message, as it duplicates
the documentation.

Instead, the commit message can answer questions that are not covered by
the documentation. (See the questions above, which can be summarized
as "Why do we do it this way and not differently?")


> +       if (err || errno == EPIPE) {
> +               if (!strcmp(filter_status.buf, "error")) {
> +                       /*
> +                    * The filter signaled a problem with the file.
> +                    */

/* This could go into a single line comment. */

> +               } else if (!strcmp(filter_status.buf, "error-all")) {
> +                       /*
> +                        * The filter signaled a permanent problem. Don't try to filter
> +                        * files with the same command for the lifetime of the current
> +                        * Git process.
> +                        */
> +                        entry->supported_capabilities &= ~wanted_capability;
> +               } else {
> +                       /*
> +                        * Something went wrong with the protocol filter.
> +                        * Force shutdown and restart if another blob requires filtering!
> +                        */
> +                       error("external filter '%s' failed", cmd);

"failed" ... Can you give the user more information so that they can debug
this more easily? (e.g. the blob/path, or the actual vs. expected state)
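
To illustrate the kind of message I have in mind, a rough sketch
follows. format_filter_failure() is a made-up helper, and I am assuming
that the path and the last status reported by the filter are available
at this point; the real code would presumably pass these to error()
directly.

```c
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: build a failure message
 * that names the filter command, the path being filtered, and the last
 * status reported by the filter (if any).
 */
static int format_filter_failure(char *buf, size_t sz, const char *cmd,
				 const char *path, const char *status)
{
	return snprintf(buf, sz,
			"external filter '%s' failed for path '%s' (last status: '%s')",
			cmd, path, status && *status ? status : "none");
}
```

The exact wording is of course up for discussion; the point is only
that the path and the protocol state end up in the message.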


> +
>  static int read_convert_config(const char *var, const char *value, void *cb)
>  {
>         const char *key, *name;
> @@ -526,6 +818,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>         if (!strcmp("clean", key))
>                 return git_config_string(&drv->clean, var, value);
>
> +       if (!strcmp("process", key)) {
> +               return git_config_string(&drv->process, var, value);
> +       }

optional nit: braces unnecessary

Thanks,
Stefan
