From: Lars Schneider <larsxschneider@gmail.com>
To: Stefan Beller <sbeller@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
"Junio C Hamano" <gitster@pobox.com>,
"Jakub Narębski" <jnareb@gmail.com>,
"Martin-Louis Bright" <mlbright@gmail.com>,
"Eric Wong" <e@80x24.org>, "Jeff King" <peff@peff.net>,
"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
ben@wijen.net
Subject: Re: [PATCH v5 14/15] convert: add filter.<driver>.process option
Date: Fri, 12 Aug 2016 18:59:18 +0200
Message-ID: <509A907F-B1B5-4244-B1C7-F1190296208D@gmail.com>
In-Reply-To: <CAGZ79kboxgBRHSa2s7CKZ1Uo=13WT=rT8VHCNJNj_Q9jQzZAYw@mail.gmail.com>
> On 12 Aug 2016, at 18:33, Stefan Beller <sbeller@google.com> wrote:
>
> On Wed, Aug 10, 2016 at 6:04 AM, <larsxschneider@gmail.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
>>
>> In a preliminary performance test this developer used a clean/smudge filter
>> written in golang to filter 12,000 files. This process took 364s with the
>> existing filter mechanism and 5s with the new mechanism. See details here:
>> https://github.com/github/git-lfs/pull/1382
>>
>> This patch adds the `filter.<driver>.process` string option which, if used,
>> keeps the external filter process running and processes all blobs with
>> the packet format (pkt-line) based protocol over standard input and standard
>> output described below.
>>
>> Git starts the filter when it encounters the first file
>> that needs to be cleaned or smudged. After the filter has started,
>> Git sends a welcome message, a list of supported protocol
>> version numbers, and a flush packet. Git expects to read the
>> welcome message and one protocol version number from the
>> previously sent list. Afterwards Git sends a list of supported
>> capabilities and a flush packet. Git expects to read a list of
>> desired capabilities, which must be a subset of the supported
>> capabilities list, and a flush packet as response:
>> ------------------------
>> packet: git> git-filter-client
>> packet: git> version=2
>> packet: git> version=42
>> packet: git> 0000
>> packet: git< git-filter-server
>> packet: git< version=2
>
> what follows is specific to version=2?
> version 42 may deem capabilities a bad idea?
"version=42" is just an example to show what the initialization could look
like in a distant future when we support yet another protocol version.
You are correct, what follows is specific to version=2. I will state
that more clearly in the documentation.
Can you try to rephrase "version 42 may deem capabilities a bad idea?"
I am not sure I understand what you mean.
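
For readers less familiar with the framing in the listing above: every
pkt-line starts with four hex digits that give the total packet length
(header included), and "0000" is the flush packet. A minimal sketch of
how a filter might decode that header follows. The function names are
hypothetical and not taken from Git's pkt-line code, and rejecting the
lengths 1 to 3 is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if it is not hex. */
static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical helper: decode the 4-byte pkt-line header.
 * Returns the total packet length (header + payload), 0 for a
 * flush packet ("0000"), or -1 if the header is malformed.
 */
static int pkt_total_len(const char *hdr)
{
	int len = 0;

	for (int i = 0; i < 4; i++) {
		int v = hexval(hdr[i]);

		if (v < 0)
			return -1;
		len = (len << 4) | v;
	}
	/* Assumption: a non-flush packet is at least as long as its header. */
	if (len > 0 && len < 4)
		return -1;
	return len;
}
```

The payload length is the returned value minus four; a return value of
zero means "end of list" or "end of content", depending on the stage.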
>
>> packet: git> clean=true
>> packet: git> smudge=true
>> packet: git> not-yet-invented=true
>> packet: git> 0000
>> packet: git< clean=true
>> packet: git< smudge=true
>> packet: git< 0000
>> ------------------------
>> Supported filter capabilities in version 2 are "clean" and
>> "smudge".
>
> I assume version 2 is an example here and we actually start with v1?
No, it is actually called version 2 because I consider the current
clean/smudge protocol version 1.
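
The capability exchange quoted above can be sketched as a bitmask
check: Git offers a set, the filter answers with the subset it wants,
and anything the filter does not know (such as "not-yet-invented") is
simply not echoed back. This is only an illustration. The constant and
function names are hypothetical, and the payloads are assumed to arrive
without a trailing newline.

```c
#include <string.h>

/* Hypothetical capability bits for protocol version 2. */
#define CAP_CLEAN  (1u << 0)
#define CAP_SMUDGE (1u << 1)

/* Map one "name=true" payload to its bit; unknown names map to 0. */
static unsigned cap_from_line(const char *line)
{
	if (!strcmp(line, "clean=true"))
		return CAP_CLEAN;
	if (!strcmp(line, "smudge=true"))
		return CAP_SMUDGE;
	return 0; /* e.g. "not-yet-invented=true" */
}

/* The filter's answer must be a subset of what Git offered. */
static int caps_are_subset(unsigned desired, unsigned offered)
{
	return (desired & ~offered) == 0;
}
```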
> Can you clarify why we need welcome messages?
> (Is there a technical reason, or better debuggability for humans?)
The welcome message is necessary to distinguish the long running
filter protocol (v2) from the current one-shot filter protocol (v1).
This becomes important if a user tries to use a v1 clean/smudge
filter with the v2 git config settings.
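
To illustrate: a v1 filter would answer with filtered content instead
of the welcome message, so the very first packet already tells the two
apart. A minimal sketch of that check, assuming Git only understands
version 2 and that payloads arrive without a trailing newline. The enum
and function names are hypothetical, not the ones used in the patch.

```c
#include <string.h>

enum handshake_result {
	HANDSHAKE_OK,
	HANDSHAKE_NOT_V2_FILTER, /* no welcome message: likely a v1 filter */
	HANDSHAKE_BAD_VERSION    /* welcome was fine, version was not offered */
};

/*
 * Hypothetical check of the two payloads Git reads back from the
 * filter during initialization.
 */
static enum handshake_result check_handshake(const char *welcome,
					     const char *version)
{
	if (strcmp(welcome, "git-filter-server"))
		return HANDSHAKE_NOT_V2_FILTER;
	if (strcmp(version, "version=2"))
		return HANDSHAKE_BAD_VERSION;
	return HANDSHAKE_OK;
}
```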
>> Afterwards Git sends a list of "key=value" pairs terminated with
>> a flush packet. The list will contain at least the filter command
>> (based on the supported capabilities) and the pathname of the file
>> to filter relative to the repository root. Right after these packets
>> Git sends the content split in zero or more pkt-line packets and a
>> flush packet to terminate content.
>> ------------------------
>> packet: git> command=smudge\n
>> packet: git> pathname=path/testfile.dat\n
>> packet: git> 0000
>> packet: git> CONTENT
>> packet: git> 0000
>> ------------------------
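
On the wire, the request header from the listing above would look
roughly like what the following sketch produces: two "key=value\n"
pkt-lines followed by a flush packet. This is not the code from the
patch. Both helper names are hypothetical, and the sketch writes into a
caller-supplied buffer instead of a file descriptor to keep it short.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append one "key=value\n" pkt-line at *pos. */
static int append_kv(char *buf, size_t bufsz, size_t *pos,
		     const char *key, const char *value)
{
	/* payload is key + '=' + value + '\n'; the header adds 4 bytes */
	size_t len = strlen(key) + strlen(value) + 2 + 4;

	if (*pos + len + 1 > bufsz)
		return -1;
	snprintf(buf + *pos, bufsz - *pos, "%04zx%s=%s\n", len, key, value);
	*pos += len;
	return 0;
}

/*
 * Hypothetical helper: build the list that precedes the content.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int build_request_header(char *buf, size_t bufsz,
				const char *command, const char *pathname)
{
	size_t pos = 0;

	if (append_kv(buf, bufsz, &pos, "command", command) < 0 ||
	    append_kv(buf, bufsz, &pos, "pathname", pathname) < 0)
		return -1;
	if (pos + 5 > bufsz)
		return -1;
	memcpy(buf + pos, "0000", 5); /* flush packet terminates the list */
	return (int)(pos + 4);
}
```

The content packets and their terminating flush packet would follow
right after these bytes.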
>>
>> The filter is expected to respond with a list of "key=value" pairs
>> terminated with a flush packet. If the filter does not experience
>> problems then the list must contain a "success" status. Right after
>> these packets the filter is expected to send the content in zero
>> or more pkt-line packets and a flush packet at the end. Finally, a
>> second list of "key=value" pairs terminated with a flush packet
>> is expected. The filter can change the status in the second list.
>> ------------------------
>> packet: git< status=success\n
>> packet: git< 0000
>> packet: git< SMUDGED_CONTENT
>> packet: git< 0000
>> packet: git< 0000 # empty list!
>> ------------------------
>>
>> If the result content is empty then the filter is expected to respond
>> with a success status and an empty list.
>> ------------------------
>> packet: git< status=success\n
>> packet: git< 0000
>> packet: git< 0000 # empty content!
>> packet: git< 0000 # empty list!
>> ------------------------
>
> Why do we need the last flush packet? We'd expect as many successes
> as we send out contents? Do we plan on interleaving operation, i.e.
> Git sends out 10 files but the filter process is not as fast as Git sending
> out and the answers trickle in slowly?
Git filter processes run sequentially right now (unfortunately).
Regarding the last flush packet, please see Peff's answer:
http://public-inbox.org/git/20160812163809.3wdkuqegxfjam2yn%40sigill.intra.peff.net/
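
The status rules from the quoted text can be sketched like this: the
list before the content carries a first status, the list after the
content may replace it, and an empty second list leaves the first one
in place. The function names are hypothetical, and passing NULL for an
absent status is a convention of this sketch only.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: pick the status that counts for one blob.
 * `first` comes from the list sent before the content, `second`
 * from the list sent after it (NULL if that list was empty).
 */
static const char *effective_status(const char *first, const char *second)
{
	return second ? second : first;
}

/* The filtered content is usable only if the effective status is "success". */
static int content_is_usable(const char *first, const char *second)
{
	const char *status = effective_status(first, second);

	return status && !strcmp(status, "success");
}
```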
>> In case the filter cannot or does not want to process the content,
>> it is expected to respond with an "error" status. Depending on the
>> `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet: git< status=error\n
>> packet: git< 0000
>> ------------------------
>>
>> In case the filter cannot or does not want to process the content
>> as well as any future content for the lifetime of the Git process,
>> it is expected to respond with an "error-all" status. Depending on
>> the `filter.<driver>.required` flag Git will interpret that as error
>> but it will not stop or restart the filter process.
>> ------------------------
>> packet: git< status=error-all\n
>> packet: git< 0000
>> ------------------------
>>
>> If the filter experiences an error during processing, then it can
>> send the status "error". Depending on the `filter.<driver>.required`
>> flag Git will interpret that as error but it will not stop or restart
>> the filter process.
>> ------------------------
>> packet: git< status=success\n
>
> So the first success is meaningless essentially?
> Would it make sense to move the success behind the content sending
> in all cases?
Again, I refer to Peff's answer.
>> packet: git< 0000
>> packet: git< HALF_WRITTEN_ERRONEOUS_CONTENT
>> packet: git< 0000
>> packet: git< status=error\n
>> packet: git< 0000
>> ------------------------
>>
>> If the filter dies during the communication or does not adhere to
>> the protocol then Git will stop the filter process and restart it
>> with the next file that needs to be processed.
>>
>> After the filter has processed a blob it is expected to wait for
>> the next "key=value" list containing a command. When the Git process
>> terminates, it will send a kill signal to the filter in that stage.
>>
>> If a `filter.<driver>.clean` or `filter.<driver>.smudge` command
>> is configured then these commands always take precedence over
>> a configured `filter.<driver>.process` command.
>
> okay. I think you can omit most of the commit message as it is a duplicate
> of the documentation?
Yes it duplicates the documentation.
> Instead the commit message can answer questions that are not part of
> the documentation. (See the questions above which can be summarized
> as "Why do we do it this way and not differently?")
OK, point taken. I will write a new commit message for v6.
>
>> + if (err || errno == EPIPE) {
>> + if (!strcmp(filter_status.buf, "error")) {
>> + /*
>> + * The filter signaled a problem with the file.
>> + */
>
> /* This could go into a single line comment. */
OK, will change.
>> + } else if (!strcmp(filter_status.buf, "error-all")) {
>> + /*
>> + * The filter signaled a permanent problem. Don't try to filter
>> + * files with the same command for the lifetime of the current
>> + * Git process.
>> + */
>> + entry->supported_capabilities &= ~wanted_capability;
>> + } else {
>> + /*
>> + * Something went wrong with the protocol filter.
>> + * Force shutdown and restart if another blob requires filtering!
>> + */
>> + error("external filter '%s' failed", cmd);
>
> failed .. Can you give more information to the user so that they can debug
> more easily? (blob/path or state / expected state)
Agreed, will add!
However, we don't give this information with the current clean/smudge interface.
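
The three branches of the quoted hunk can be summarized in a small
standalone sketch: "error" affects only the current file, "error-all"
clears the capability bit so the command is not tried again during this
Git process, and anything else is treated as a protocol failure. The
enum, constants, and function name are hypothetical. Only the
`&= ~wanted_capability` step is taken from the patch.

```c
#include <string.h>

/* Hypothetical capability bits. */
#define CAP_CLEAN  (1u << 0)
#define CAP_SMUDGE (1u << 1)

enum filter_action {
	FILTER_KEEP_RUNNING,    /* problem with this file only */
	FILTER_DISABLE_COMMAND, /* stop using this command for this Git process */
	FILTER_RESTART          /* protocol failure: shut down, restart on demand */
};

/* Decide what to do after a failed request, based on the reported status. */
static enum filter_action handle_failure(const char *status,
					 unsigned *supported_capabilities,
					 unsigned wanted_capability)
{
	if (!strcmp(status, "error"))
		return FILTER_KEEP_RUNNING;
	if (!strcmp(status, "error-all")) {
		*supported_capabilities &= ~wanted_capability;
		return FILTER_DISABLE_COMMAND;
	}
	return FILTER_RESTART;
}
```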
>> +
>> static int read_convert_config(const char *var, const char *value, void *cb)
>> {
>> const char *key, *name;
>> @@ -526,6 +818,10 @@ static int read_convert_config(const char *var, const char *value, void *cb)
>> if (!strcmp("clean", key))
>> return git_config_string(&drv->clean, var, value);
>>
>> + if (!strcmp("process", key)) {
>> + return git_config_string(&drv->process, var, value);
>> + }
>
> optional nit: braces unnecessary
Agreed, will remove!
Thanks a lot for the review,
Lars