* Looking for help to understand external filter driver code @ 2016-07-19 16:40 Lars Schneider 2016-07-19 17:56 ` Junio C Hamano 0 siblings, 1 reply; 10+ messages in thread From: Lars Schneider @ 2016-07-19 16:40 UTC (permalink / raw) To: Git Mailing List; +Cc: Junio C Hamano, Johannes Sixt Hi, a long time ago in aa4ed4 Junio introduced the external filter driver definition. Since that time we fork the Git process and then we fork again to run the external filter. This is probably a super stupid question with an obvious answer... but can anyone help me to understand the code and explain why we fork twice? Wouldn't it be sufficient to just fork once for the external filter since we cannot process anything in parallel anyways? In 546bb5 Hannes refactored Junio's code to use the async infrastructure that is used today. Thank you, Lars ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 16:40 Looking for help to understand external filter driver code Lars Schneider @ 2016-07-19 17:56 ` Junio C Hamano 2016-07-19 18:53 ` Junio C Hamano 0 siblings, 1 reply; 10+ messages in thread From: Junio C Hamano @ 2016-07-19 17:56 UTC (permalink / raw) To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt Lars Schneider <larsxschneider@gmail.com> writes: > a long time ago in aa4ed4 Junio introduced the external filter > driver definition. Since that time we fork the Git process and > then we fork again to run the external filter. This is probably a > super stupid question with an obvious answer... but can anyone > help me to understand the code and explain why we fork twice? In "git show aa4ed4" you find this picture: /* * Create a pipeline to have the command filter the buffer's * contents. * * (child --> cmd) --> us */ where "child" is a fork of the original Git process; it still has the <src, size> buffered data inherited from the original process, so it can write_in_full() into the pipe going to the cmd process. "us" is the original process that has the reading end of the pipe that connects us to the "(child --> cmd)" processes, so that it can read the filtered result from them. The key benefit of this arrangement is the above can be done without having to do poll() to flip between reading and writing that is needed to avoid deadlocking, which kept the code simpler. A later conversion of the write side into async does not fundamentally change anything from the original arrangement. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 17:56 ` Junio C Hamano @ 2016-07-19 18:53 ` Junio C Hamano 2016-07-19 20:44 ` Lars Schneider 0 siblings, 1 reply; 10+ messages in thread From: Junio C Hamano @ 2016-07-19 18:53 UTC (permalink / raw) To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt Junio C Hamano <gitster@pobox.com> writes: > The key benefit of this arrangement is the above can be done without > having to do poll() to flip between reading and writing that is > needed to avoid deadlocking, which kept the code simpler. A later > conversion of the write side into async does not fundamentally > change anything from the original arrangement. Translation: I was too lazy to worry about doing poll()/select() when I did it originally. As long as you can do so correctly, be my guest to reduce one process by having the main process do both reading and writing. ;-) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 18:53 ` Junio C Hamano @ 2016-07-19 20:44 ` Lars Schneider 2016-07-19 21:33 ` Junio C Hamano 0 siblings, 1 reply; 10+ messages in thread From: Lars Schneider @ 2016-07-19 20:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt On 19 Jul 2016, at 20:53, Junio C Hamano <gitster@pobox.com> wrote: > Junio C Hamano <gitster@pobox.com> writes: > >> The key benefit of this arrangement is the above can be done without >> having to do poll() to flip between reading and writing that is >> needed to avoid deadlocking, which kept the code simpler. A later >> conversion of the write side into async does not fundamentally >> change anything from the original arrangement. > > Translation: I was too lazy to worry about doing poll()/select() > when I did it originally. As long as you can do so correctly, be my > guest to reduce one process by having the main process do both > reading and writing. > > ;-) Thanks a lot for this explanation. My goal is to add an option to the clean/smudge filter config that instructs Git to keep the external filter process running. Whenever Git encounters a file to be filtered then it would continue to talk to the filter process over a simple pipe protocol: Git writes --> 4 byte filename length Git writes --> filename string Git writes --> 4 byte content length Git writes --> content string Git reads <-- 4 byte filtered content length Git reads <-- filtered content I still need to read more about poll()/select() but it looks to me as if these functions are only required if Git doesn't know what to expect. With the sketched protocol above that wouldn't be the case. Therefore, I wonder if I would need to use poll()/select()? Thanks, Lars ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 20:44 ` Lars Schneider @ 2016-07-19 21:33 ` Junio C Hamano 2016-07-19 22:01 ` Lars Schneider ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: Junio C Hamano @ 2016-07-19 21:33 UTC (permalink / raw) To: Lars Schneider; +Cc: Git Mailing List, Johannes Sixt Lars Schneider <larsxschneider@gmail.com> writes: > Git writes --> 4 byte filename length > Git writes --> filename string Why limit to 32GB? Perhaps NUL termination is more appropriate here? > Git writes --> 4 byte content length > Git writes --> content string > Git reads <-- 4 byte filtered content length > Git reads <-- filtered content Do you really need to force the sender to know the length in advance? Together with the sequential nature of the above exchange, i.e. the filter is forbidden from producing even a single byte of its output before reading everything Git feeds it, you are making it impossible to use filters that perform streaming conversion. Of course, with the "sequential" thing, you do not have to worry about deadlocking hence no need for poll/select, but I am not sure that is a good thing. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 21:33 ` Junio C Hamano @ 2016-07-19 22:01 ` Lars Schneider 2016-07-20 2:33 ` Torsten Bögershausen 2016-07-20 8:59 ` Jakub Narębski 2016-07-20 13:49 ` Jeff King 2 siblings, 1 reply; 10+ messages in thread From: Lars Schneider @ 2016-07-19 22:01 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt On 19 Jul 2016, at 23:33, Junio C Hamano <gitster@pobox.com> wrote: > Lars Schneider <larsxschneider@gmail.com> writes: > >> Git writes --> 4 byte filename length >> Git writes --> filename string > > Why limit to 32GB? Perhaps NUL termination is more appropriate > here? OK, I will use NUL termination for the filename. You're also right about the limit - I will use 8 byte to encode the content length. >> Git writes --> 4 byte content length >> Git writes --> content string >> Git reads <-- 4 byte filtered content length >> Git reads <-- filtered content > > Do you really need to force the sender to know the length in > advance? Together with the sequential nature of the above exchange, > i.e. the filter is forbidden from producing even a single byte of > its output before reading everything Git feeds it, you are making it > impossible to use filters that perform streaming conversion. That is correct. However, for my particular use case streaming conversion wouldn't be useful anyways: https://github.com/github/git-lfs/pull/1382 > Of course, with the "sequential" thing, you do not have to worry > about deadlocking hence no need for poll/select, but I am not sure > that is a good thing. Thanks for the confirmation. I consider to exchange a "filter protocol version" right after the filter process has started. That way someone could add a more evolved "filter driver protocol" later on that supports streaming and the external filter could pick whatever protocol is most appropriate (and supported). Could that be an acceptable compromise to get a serious review of the "sequential" thing? Thanks, Lars ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 22:01 ` Lars Schneider @ 2016-07-20 2:33 ` Torsten Bögershausen 0 siblings, 0 replies; 10+ messages in thread From: Torsten Bögershausen @ 2016-07-20 2:33 UTC (permalink / raw) To: Lars Schneider, Junio C Hamano; +Cc: Git Mailing List, Johannes Sixt On 07/20/2016 12:01 AM, Lars Schneider wrote: > > On 19 Jul 2016, at 23:33, Junio C Hamano <gitster@pobox.com> wrote: > >> Lars Schneider <larsxschneider@gmail.com> writes: >> >>> Git writes --> 4 byte filename length >>> Git writes --> filename string >> >> Why limit to 32GB? Perhaps NUL termination is more appropriate >> here? > OK, I will use NUL termination for the filename. > You're also right about the limit - I will use 8 byte to encode the > content length. Is there any reason to encode the file length in binary format? With all the discussions about big endianess, little endianess, 4GiB or 32 GiB. How about simply writing the length as ASCII ? Unless we don't want to have a "spare" field for future extensions, it could be good to add an option field, which may be empty. On top of that, do we want a field separator different from the line separator ? How about this: <options><TAB><length><TAB><filename><NUL> <options> may be "var=value;var2=value2" or simply "" ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 21:33 ` Junio C Hamano 2016-07-19 22:01 ` Lars Schneider @ 2016-07-20 8:59 ` Jakub Narębski 2016-07-20 9:43 ` Lars Schneider 2016-07-20 13:49 ` Jeff King 2 siblings, 1 reply; 10+ messages in thread From: Jakub Narębski @ 2016-07-20 8:59 UTC (permalink / raw) To: Junio C Hamano, Lars Schneider; +Cc: Git Mailing List, Johannes Sixt W dniu 2016-07-19 o 23:33, Junio C Hamano pisze: > Lars Schneider <larsxschneider@gmail.com> writes: > >> > Git writes --> 4 byte filename length >> > Git writes --> filename string > > Why limit to 32GB? Perhaps NUL termination is more appropriate > here? Errr, I think limiting _filename_ to 32GB is a reasonable assumption, for forever... >> > Git writes --> 4 byte content length >> > Git writes --> content string ...while limiting file _content_ to 32GB might not be future-proof enough. ;-) -- Jakub Narębski ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-20 8:59 ` Jakub Narębski @ 2016-07-20 9:43 ` Lars Schneider 0 siblings, 0 replies; 10+ messages in thread From: Lars Schneider @ 2016-07-20 9:43 UTC (permalink / raw) To: Jakub Narębski; +Cc: Junio C Hamano, Git Mailing List, Johannes Sixt > On 20 Jul 2016, at 10:59, Jakub Narębski <jnareb@gmail.com> wrote: > > W dniu 2016-07-19 o 23:33, Junio C Hamano pisze: >> Lars Schneider <larsxschneider@gmail.com> writes: >> >>>> Git writes --> 4 byte filename length >>>> Git writes --> filename string >> >> Why limit to 32GB? Perhaps NUL termination is more appropriate >> here? > > Errr, I think limiting _filename_ to 32GB is a reasonable > assumption, for forever... Well, Java packages can get reaaaally long :D > >>>> Git writes --> 4 byte content length >>>> Git writes --> content string > > ...while limiting file _content_ to 32GB might not be > future-proof enough. > > ;-) I think this is what Junio meant to say. At least I interpreted it that way. - Lars ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Looking for help to understand external filter driver code 2016-07-19 21:33 ` Junio C Hamano 2016-07-19 22:01 ` Lars Schneider 2016-07-20 8:59 ` Jakub Narębski @ 2016-07-20 13:49 ` Jeff King 2 siblings, 0 replies; 10+ messages in thread From: Jeff King @ 2016-07-20 13:49 UTC (permalink / raw) To: Junio C Hamano; +Cc: Lars Schneider, Git Mailing List, Johannes Sixt On Tue, Jul 19, 2016 at 02:33:09PM -0700, Junio C Hamano wrote: > > Git writes --> 4 byte content length > > Git writes --> content string > > Git reads <-- 4 byte filtered content length > > Git reads <-- filtered content > > Do you really need to force the sender to know the length in > advance? Together with the sequential nature of the above exchange, > i.e. the filter is forbidden from producing even a single byte of > its output before reading everything Git feeds it, you are making it > impossible to use filters that perform streaming conversion. Another option: use pkt-lines with a flush packet to indicate end-of-input. That allows arbitrary sized data, with streaming, and reuses existing concepts from git. There is proportional overhead, but it's only 4 bytes per 64k, which is a tiny percent. It does make some implementations easier if they know the size ahead of time, though, so if we are _sure_ that nobody will want streaming later, it may not be a good tradeoff. If we do print a size ahead of time, the "normal" thing in git would be to do so in base-10 ascii followed by a newline (e.g., as found in "cat-file --batch", or fast-import's "data" command). -Peff ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2016-07-20 13:49 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-07-19 16:40 Looking for help to understand external filter driver code Lars Schneider 2016-07-19 17:56 ` Junio C Hamano 2016-07-19 18:53 ` Junio C Hamano 2016-07-19 20:44 ` Lars Schneider 2016-07-19 21:33 ` Junio C Hamano 2016-07-19 22:01 ` Lars Schneider 2016-07-20 2:33 ` Torsten Bögershausen 2016-07-20 8:59 ` Jakub Narębski 2016-07-20 9:43 ` Lars Schneider 2016-07-20 13:49 ` Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).