Re: [PATCH v5 14/15] convert: add filter.<driver>.process option

From: Lars Schneider <larsxschneider@gmail.com>
To: Stefan Beller <sbeller@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Jakub Narębski" <jnareb@gmail.com>,
	"Martin-Louis Bright" <mlbright@gmail.com>,
	"Eric Wong" <e@80x24.org>, "Jeff King" <peff@peff.net>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	ben@wijen.net
Subject: Re: [PATCH v5 14/15] convert: add filter.<driver>.process option
Date: Fri, 12 Aug 2016 18:59:18 +0200
Message-ID: <509A907F-B1B5-4244-B1C7-F1190296208D@gmail.com>
In-Reply-To: <CAGZ79kboxgBRHSa2s7CKZ1Uo=13WT=rT8VHCNJNj_Q9jQzZAYw@mail.gmail.com>


> On 12 Aug 2016, at 18:33, Stefan Beller <sbeller@google.com> wrote:
> 
> On Wed, Aug 10, 2016 at 6:04 AM,  <larsxschneider@gmail.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>> 
>> In a preliminary performance test this developer used a clean/smudge filter
>> written in golang to filter 12,000 files. This process took 364s with the
>> existing filter mechanism and 5s with the new mechanism. See details here:
>> https://github.com/github/git-lfs/pull/1382
>> 
>> This patch adds the `filter.<driver>.process` string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the packet format (pkt-line) based protocol over standard input and standard
>> output described below.
>> 
>> Git starts the filter when it encounters the first file
>> that needs to be cleaned or smudged. After the filter has started
>> Git sends a welcome message, a list of supported protocol
>> version numbers, and a flush packet. Git expects to read the
>> welcome message and one protocol version number from the
>> previously sent list. Afterwards Git sends a list of supported
>> capabilities and a flush packet. Git expects to read a list of
>> desired capabilities, which must be a subset of the supported
>> capabilities list, and a flush packet as response:
>> ------------------------
>> packet:          git> git-filter-client
>> packet:          git> version=2
>> packet:          git> version=42
>> packet:          git> 0000
>> packet:          git< git-filter-server
>> packet:          git< version=2
> 
> what follows is specific to version=2?
> version 42 may deem capabilities a bad idea?

"version=42" is just an example to show how the initialization could look
in a distant future when we support yet another protocol version.

You are correct, what follows is specific to version=2. I will state
that more clearly in the documentation.

Can you try to rephrase "version 42 may deem capabilities a bad idea?"
I am not sure I understand what you mean.
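
To make the init exchange concrete, here is a minimal sketch in C of how
the packets above could be framed and how a filter could pick its version.
It assumes the usual pkt-line framing (four hex digits giving the total
length including the header itself). `pkt_encode()` and `pick_version()`
are hypothetical helpers for illustration, not functions from this patch,
and the 0xffff bound is only what four hex digits can express (Git's real
limit is lower).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: frame `payload` as one pkt-line in `out`.
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if the packet does not fit.
 */
static int pkt_encode(char *out, size_t outsz, const char *payload)
{
	size_t len = strlen(payload) + 4;	/* length field counts its own 4 bytes */

	if (len > 0xffff || len + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04zx%s", len, payload);
	return (int)len;
}

/* Hypothetical helper: return `want` if Git offered it, else -1. */
static int pick_version(const int *offered, size_t n, int want)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (offered[i] == want)
			return want;
	return -1;
}
```

With the list from the example (2 and 42), a filter that only speaks
version 2 would answer with "version=2" and ignore the rest.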


> 
>> packet:          git> clean=true
>> packet:          git> smudge=true
>> packet:          git> not-yet-invented=true
>> packet:          git> 0000
>> packet:          git< clean=true
>> packet:          git< smudge=true
>> packet:          git< 0000
>> ------------------------
>> Supported filter capabilities in version 2 are "clean" and
>> "smudge".
> 
> I assume version 2 is an example here and we actually start with v1?

No, it is actually called version 2 because I consider the current
clean/smudge protocol version 1.


> Can you clarify why we need welcome messages?
> (Is there a technical reason, or better debuggability for humans?)

The welcome message is necessary to distinguish the long-running
filter protocol (v2) from the current one-shot filter protocol (v1).
This becomes important if a user tries to use a v1 clean/smudge
filter with the v2 git config settings.
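
As a rough sketch of that check (not the code in this patch): the first
payload read from the filter is compared against the expected welcome
string, so the output of a v1 filter is rejected immediately.
`is_long_running_filter()` is a hypothetical name, and tolerating a
trailing LF is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check: does the first payload read from the filter
 * identify it as a long-running (v2) filter?
 */
static int is_long_running_filter(const char *payload, size_t len)
{
	static const char welcome[] = "git-filter-server";
	size_t wlen = sizeof(welcome) - 1;

	/* Tolerate one trailing LF; whether one is sent is an assumption. */
	if (len == wlen + 1 && payload[len - 1] == '\n')
		len--;
	return len == wlen && !memcmp(payload, welcome, wlen);
}
```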


>> Afterwards Git sends a list of "key=value" pairs terminated with
>> a flush packet. The list will contain at least the filter command
>> (based on the supported capabilities) and the pathname of the file
>> to filter relative to the repository root. Right after these packets
>> Git sends the content split in zero or more pkt-line packets and a
>> flush packet to terminate content.
>> ------------------------
>> packet:          git> command=smudge\n
>> packet:          git> pathname=path/testfile.dat\n
>> packet:          git> 0000
>> packet:          git> CONTENT
>> packet:          git> 0000
>> ------------------------
>> 
>> The filter is expected to respond with a list of "key=value" pairs
>> terminated with a flush packet. If the filter does not experience
>> problems then the list must contain a "success" status. Right after
>> these packets the filter is expected to send the content in zero
>> or more pkt-line packets and a flush packet at the end. Finally, a
>> second list of "key=value" pairs terminated with a flush packet
>> is expected. The filter can change the status in the second list.
>> ------------------------
>> packet:          git< status=success\n
>> packet:          git< 0000
>> packet:          git< SMUDGED_CONTENT
>> packet:          git< 0000
>> packet:          git< 0000  # empty list!
>> ------------------------
>> 
>> If the result content is empty then the filter is expected to respond
>> with a success status and an empty list.
>> ------------------------
>> packet:          git< status=success\n
>> packet:          git< 0000
>> packet:          git< 0000  # empty content!
>> packet:          git< 0000  # empty list!
>> ------------------------
> 
> Why do we need the last flush packet? We'd expect as many successes
> as we send out contents? Do we plan on interleaving operation, i.e.
> Git sends out 10 files but the filter process is not as fast as Git sending
> out and the answers trickle in slowly?

Git filter processes run sequentially right now (unfortunately).

re flush: please see Peff's answer:
http://public-inbox.org/git/20160812163809.3wdkuqegxfjam2yn%40sigill.intra.peff.net/
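
Since everything is sequential, each part of a response can be read up to
its flush packet before the next part is looked at. A minimal sketch in C
of such a reader over an in-memory buffer, assuming the usual pkt-line
framing; `hexval()` and `read_until_flush()` are hypothetical helpers, not
the functions added in this series:

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical reader: consume pkt-lines from `in` until a flush packet
 * ("0000"), appending the payloads to `out`. Returns the number of bytes
 * consumed (including the flush), or -1 on malformed or truncated input.
 * `*outlen` receives the total payload size.
 */
static long read_until_flush(const char *in, size_t inlen,
			     char *out, size_t outsz, size_t *outlen)
{
	size_t pos = 0;

	*outlen = 0;
	while (inlen - pos >= 4) {
		size_t len = 0;
		int i;

		for (i = 0; i < 4; i++) {
			int v = hexval(in[pos + i]);
			if (v < 0)
				return -1;	/* header is not four hex digits */
			len = len * 16 + (size_t)v;
		}
		pos += 4;
		if (len == 0)
			return (long)pos;	/* flush packet ends this part */
		if (len < 4)
			return -1;		/* lengths 1..3 cannot be valid */
		len -= 4;			/* payload size without the header */
		if (len > inlen - pos || len > outsz - *outlen)
			return -1;
		memcpy(out + *outlen, in + pos, len);
		*outlen += len;
		pos += len;
	}
	return -1;				/* input ended before the flush */
}
```

Empty content is then just a flush packet on its own: the reader returns
4 and reports a payload size of 0.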


>> In case the filter cannot or does not want to process the content,
>> it is expected to respond with an "error" status. Depending on the
>> `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet:          git< status=error\n
>> packet:          git< 0000
>> ------------------------
>> 
>> In case the filter cannot or does not want to process the content
>> as well as any future content for the lifetime of the Git process,
>> it is expected to respond with an "error-all" status. Depending on
>> the `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet:          git< status=error-all\n
>> packet:          git< 0000
>> ------------------------
>> 
>> If the filter experiences an error during processing, then it can
>> send the status "error". Depending on the `filter.<driver>.required`
>> flag Git will interpret that as error but it will not stop or restart
>> the filter process.
>> ------------------------
>> packet:          git< status=success\n
> 
> So the first success is meaningless essentially?
> Would it make sense to move the success behind the content sending
> in all cases?

Again, I refer to Peff's answer.
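
To illustrate how the two lists interact, here is a minimal sketch in C,
assuming the receiver feeds every "key=value" pair of the first list and
then of the second list through the same function, so that the last
"status" seen wins. `update_status()` is a hypothetical helper for this
example, not the code in the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: if `pair` is "status=<value>" (optionally
 * LF-terminated), copy <value> into `out` and return 1. Otherwise
 * leave `out` untouched and return 0.
 */
static int update_status(const char *pair, char *out, size_t outsz)
{
	static const char key[] = "status=";
	size_t klen = sizeof(key) - 1;
	size_t vlen;

	if (strncmp(pair, key, klen))
		return 0;			/* some other key */
	vlen = strcspn(pair + klen, "\n");	/* value without the LF */
	if (vlen + 1 > outsz)
		return 0;			/* value does not fit */
	memcpy(out, pair + klen, vlen);
	out[vlen] = '\0';
	return 1;
}
```

An empty second list leaves the initial "success" in place; a second list
carrying "status=error" replaces it after the content has been read.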


>> packet:          git< 0000
>> packet:          git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> packet:          git< 0000
>> packet:          git< status=error\n
>> packet:          git< 0000
>> ------------------------
>> 
>> If the filter dies during the communication or does not adhere to
>> the protocol then Git will stop the filter process and restart it
>> with the next file that needs to be processed.
>> 
>> After the filter has processed a blob it is expected to wait for
>> the next "key=value" list containing a command. When the Git process
>> terminates, it will send a kill signal to the filter in that stage.
>> 
>> If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
>> is configured then these commands always take precedence over
>> a configured `filter.<driver>.process` command.
> 
> okay. I think you can omit most of the commit message as it is a duplicate
> of the documentation?

Yes, it duplicates the documentation.


> Instead the commit message can answer questions that are not part of
> the documentation. (See the questions above which can be summarized
> as "Why do we do it this way and not differently?")

OK, point taken. I will write a new commit message for v6.


> 
>> +       if (err || errno == EPIPE) {
>> +               if (!strcmp(filter_status.buf, "error")) {
>> +                       /*
>> +                        * The filter signaled a problem with the file.
>> +                        */
> 
> /* This could go into a single line comment. */

OK, will change.


>> +               } else if (!strcmp(filter_status.buf, "error-all")) {
>> +                       /*
>> +                        * The filter signaled a permanent problem. Don't try to filter
>> +                        * files with the same command for the lifetime of the current
>> +                        * Git process.
>> +                        */
>> +                        entry->supported_capabilities &= ~wanted_capability;
>> +               } else {
>> +                       /*
>> +                        * Something went wrong with the protocol filter.
>> +                        * Force shutdown and restart if another blob requires filtering!
>> +                        */
>> +                       error("external filter '%s' failed", cmd);
> 
> failed .. Can you give more information to the user so that they can
> debug more easily? (blob/path or state / expected state)

Agreed, will add!
However, we don't give this information with the current clean/smudge interface.
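
For reference, the three-way status handling in the quoted hunk can be
sketched in isolation as follows. This is a simplified illustration, not
the patch code: `struct filter_entry`, the `CAP_*` values, and
`handle_failure()` are assumptions made for the example. Only the
`supported_capabilities &= ~wanted_capability` step is taken from the hunk.

```c
#include <string.h>

/* Assumed capability bits, one per filter command. */
#define CAP_CLEAN  (1u << 0)
#define CAP_SMUDGE (1u << 1)

/* Assumed, stripped-down stand-in for the per-filter state. */
struct filter_entry {
	unsigned int supported_capabilities;
};

/*
 * Hypothetical helper: react to a failed filter call.
 * Returns 1 if the filter process can keep running, 0 if the caller
 * should shut it down (and restart it for the next blob).
 */
static int handle_failure(struct filter_entry *entry,
			  unsigned int wanted_capability,
			  const char *status)
{
	if (!strcmp(status, "error"))
		return 1;	/* problem with this file only */
	if (!strcmp(status, "error-all")) {
		/* Never send this command again in the current Git process. */
		entry->supported_capabilities &= ~wanted_capability;
		return 1;
	}
	return 0;		/* anything else is treated as a protocol failure */
}
```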


>> +
>> static int read_convert_config(const char *var, const char *value, void *cb)
>> {
>>        const char *key, *name;
>> @@ -526,6 +818,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>>        if (!strcmp("clean", key))
>>                return git_config_string(&drv->clean, var, value);
>> 
>> +       if (!strcmp("process", key)) {
>> +               return git_config_string(&drv->process, var, value);
>> +       }
> 
> optional nit: braces unnecessary

Agreed, will remove!


Thanks a lot for the review,
Lars
