git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* RFC: Enable delayed responses to Git clean/smudge filter requests
@ 2016-11-14 21:09 Lars Schneider
  2016-11-15  1:03 ` Eric Wong
  2017-01-09 20:44 ` Stefan Beller
  0 siblings, 2 replies; 16+ messages in thread
From: Lars Schneider @ 2016-11-14 21:09 UTC (permalink / raw)
  To: Git Mailing List

Hi,

Git always performs a clean/smudge filter on files in sequential order.
Sometimes a filter operation can take a noticeable amount of time. 
This blocks the entire Git process.

I would like to give a filter process the possibility to answer Git with
"I got your request, I am processing it, ask me for the result later!".

I see the following way to realize this:

In unpack-trees.c:check_updates() [1] we loop through the cache 
entries and "ask me later" could be an acceptable return value of the 
checkout_entry() call. The loop could run until all entries returned
success or error.

The filter machinery is triggered in various other places in Git and
all places that want to support "ask me later" would need to be patched 
accordingly.

--

Do you think this could be a viable approach?
Do you see a better way?

Thanks,
Lars


[1] https://github.com/git/git/blob/3ab228137f980ff72dbdf5064a877d07bec76df9/unpack-trees.c#L267


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-14 21:09 RFC: Enable delayed responses to Git clean/smudge filter requests Lars Schneider
@ 2016-11-15  1:03 ` Eric Wong
  2016-11-15 14:29   ` Lars Schneider
  2017-01-09 20:44 ` Stefan Beller
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Wong @ 2016-11-15  1:03 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git

Lars Schneider <larsxschneider@gmail.com> wrote:
> Hi,
> 
> Git always performs a clean/smudge filter on files in sequential order.
> Sometimes a filter operation can take a noticeable amount of time. 
> This blocks the entire Git process.

I have the same problem in many places which aren't git :>

> I would like to give a filter process the possibility to answer Git with
> "I got your request, I am processing it, ask me for the result later!".
> 
> I see the following way to realize this:
> 
> In unpack-trees.c:check_updates() [1] we loop through the cache 
> entries and "ask me later" could be an acceptable return value of the 
> checkout_entry() call. The loop could run until all entries returned
> success or error.
> 
> The filter machinery is triggered in various other places in Git and
> all places that want to support "ask me later" would need to be patched 
> accordingly.

That all sounds reasonable.

The filter itself would need to be aware of parallelism
if it lives for multiple objects, right?

> Do you think this could be a viable approach?

It'd probably require a bit of work, but yes, I think it's viable.

We already do this with curl_multi requests for parallel
fetching from dumb HTTP servers, but that's driven by curl
internals operating with a select/poll loop.

Perhaps the curl API could be a good example for doing this.

> Do you see a better way?

Nope, I prefer non-blocking state machines to threads for
debuggability and determinism.

Anyways, I'll plan on doing something similar (in Perl) with the
synchronous parts of public-inbox which relies on "cat-file --batch"
at some point... (my rotational disks are sloooooooow :<)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-15  1:03 ` Eric Wong
@ 2016-11-15 14:29   ` Lars Schneider
  2016-11-15 18:03     ` Junio C Hamano
  2016-11-15 18:27     ` Eric Wong
  0 siblings, 2 replies; 16+ messages in thread
From: Lars Schneider @ 2016-11-15 14:29 UTC (permalink / raw)
  To: Eric Wong; +Cc: git


> On 15 Nov 2016, at 02:03, Eric Wong <e@80x24.org> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> wrote:
>> Hi,
>> 
>> Git always performs a clean/smudge filter on files in sequential order.
>> Sometimes a filter operation can take a noticeable amount of time. 
>> This blocks the entire Git process.
> 
> I have the same problem in many places which aren't git :>
> 
>> I would like to give a filter process the possibility to answer Git with
>> "I got your request, I am processing it, ask me for the result later!".
>> 
>> I see the following way to realize this:
>> 
>> In unpack-trees.c:check_updates() [1] we loop through the cache 
>> entries and "ask me later" could be an acceptable return value of the 
>> checkout_entry() call. The loop could run until all entries returned
>> success or error.
>> 
>> The filter machinery is triggered in various other places in Git and
>> all places that want to support "ask me later" would need to be patched 
>> accordingly.
> 
> That all sounds reasonable.
> 
> The filter itself would need to be aware of parallelism
> if it lives for multiple objects, right?

Correct. This way Git doesn't need to deal with threading...


>> Do you think this could be a viable approach?
> 
> It'd probably require a bit of work, but yes, I think it's viable.
> 
> We already do this with curl_multi requests for parallel
> fetching from dumb HTTP servers, but that's driven by curl
> internals operating with a select/poll loop.
> 
> Perhaps the curl API could be a good example for doing this.

Thanks for the pointer!


>> Do you see a better way?
> 
> Nope, I prefer non-blocking state machines to threads for
> debuggability and determinism.

Agreed!


> Anyways, I'll plan on doing something similar (in Perl) with the
> synchronous parts of public-inbox which relies on "cat-file --batch"
> at some point... (my rotational disks are sloooooooow :<)

That sounds interesting! What changes to you have in mind for 
"cat-file --batch"? We are thinking about performance improvements
in that area, too. I would be happy to help reviewing your patches!

Thanks a lot for your RFC feedback,
Lars

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-15 14:29   ` Lars Schneider
@ 2016-11-15 18:03     ` Junio C Hamano
  2016-11-16  9:53       ` Lars Schneider
  2016-11-24 15:45       ` Lars Schneider
  2016-11-15 18:27     ` Eric Wong
  1 sibling, 2 replies; 16+ messages in thread
From: Junio C Hamano @ 2016-11-15 18:03 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Eric Wong, git

Lars Schneider <larsxschneider@gmail.com> writes:

>> The filter itself would need to be aware of parallelism
>> if it lives for multiple objects, right?
>
> Correct. This way Git doesn't need to deal with threading...

I think you need to be careful about three things (at least; there
may be more):

 * Codepaths that check out multiple cache entries do rely on the
   order of checkout.  We checkout removals first to make room so
   that creation of a path X can succeed if an existing path X/Y
   that used to want to see X as a directory can succeed (see the
   use of checkout_entry() by "git checkout", which does have two
   separate loops to explicitly guarantee this), for example.  I
   think "remove all and then create" you do not specifically have
   to worry about with the proposed change, but you may need to
   inspect and verify there aren't other kind of order dependency.

 * Done naively, it will lead to unmaintainable code, like this:

   + struct list_of_cache_entries *list = ...;
     for (i = 0; i < active_nr; i++)
   -    checkout_entry(active_cache[i], state, NULL);
   +    if (checkout_entry(active_cache[i], state, NULL) == DELAYED)
   +       add_cache_to_queue(&list, active_cache[i]);
   + while (list) {
   +    wait_for_checkout_to_finish(*list);
   +    list = list->next;
   + }

   I do not think we want to see such a rewrite all over the
   codepaths.  It might be OK to add such a "these entries are known
   to be delayed" list in struct checkout so that the above becomes
   more like this:

     for (i = 0; i < active_nr; i++)
        checkout_entry(active_cache[i], state, NULL);
   + checkout_entry_finish(state);

   That is, addition of a single "some of the checkout_entry() calls
   done so far might have been lazy, and I'll give them a chance to
   clean up" might be palatable.  Anything more than that on the
   caller side is not.

 * You'd need to rein in the maximum parallelism somehow, as you do
   not want to see hundreds of competing filter processes starting
   only to tell the main loop over an index with hundreds of entries
   that they are delayed checkouts.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-15 14:29   ` Lars Schneider
  2016-11-15 18:03     ` Junio C Hamano
@ 2016-11-15 18:27     ` Eric Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Wong @ 2016-11-15 18:27 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git

Lars Schneider <larsxschneider@gmail.com> wrote:
> > On 15 Nov 2016, at 02:03, Eric Wong <e@80x24.org> wrote:
> 
> > Anyways, I'll plan on doing something similar (in Perl) with the
> > synchronous parts of public-inbox which relies on "cat-file --batch"
> > at some point... (my rotational disks are sloooooooow :<)
> 
> That sounds interesting! What changes to you have in mind for 
> "cat-file --batch"? We are thinking about performance improvements
> in that area, too. I would be happy to help reviewing your patches!

I'm not touching "cat-file --batch" itself, just the Perl code
that uses it.  (still trying my best to avoid working on C or
any compiled languages);

There'll probably be a Perl queue mechanism so there's still
only one cat-file process running per-inbox to avoid overloading
on seeks; but the Perl daemons should be able to handle network
duties and other in-memory operations while it waits for
cat-file.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-15 18:03     ` Junio C Hamano
@ 2016-11-16  9:53       ` Lars Schneider
  2016-11-16 18:15         ` Junio C Hamano
  2016-11-16 22:41         ` Jakub Narębski
  2016-11-24 15:45       ` Lars Schneider
  1 sibling, 2 replies; 16+ messages in thread
From: Lars Schneider @ 2016-11-16  9:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Wong, git


On 15 Nov 2016, at 19:03, Junio C Hamano <gitster@pobox.com> wrote:

> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> The filter itself would need to be aware of parallelism
>>> if it lives for multiple objects, right?
>> 
>> Correct. This way Git doesn't need to deal with threading...
> 
> I think you need to be careful about three things (at least; there
> may be more):
> 
> * Codepaths that check out multiple cache entries do rely on the
>   order of checkout.  We checkout removals first to make room so
>   that creation of a path X can succeed if an existing path X/Y
>   that used to want to see X as a directory can succeed (see the
>   use of checkout_entry() by "git checkout", which does have two
>   separate loops to explicitly guarantee this), for example.  I
>   think "remove all and then create" you do not specifically have
>   to worry about with the proposed change, but you may need to
>   inspect and verify there aren't other kind of order dependency.

OK


> * Done naively, it will lead to unmaintainable code, like this:
> 
>   + struct list_of_cache_entries *list = ...;
>     for (i = 0; i < active_nr; i++)
>   -    checkout_entry(active_cache[i], state, NULL);
>   +    if (checkout_entry(active_cache[i], state, NULL) == DELAYED)
>   +       add_cache_to_queue(&list, active_cache[i]);
>   + while (list) {
>   +    wait_for_checkout_to_finish(*list);
>   +    list = list->next;
>   + }
> 
>   I do not think we want to see such a rewrite all over the
>   codepaths.  It might be OK to add such a "these entries are known
>   to be delayed" list in struct checkout so that the above becomes
>   more like this:
> 
>     for (i = 0; i < active_nr; i++)
>        checkout_entry(active_cache[i], state, NULL);
>   + checkout_entry_finish(state);
> 
>   That is, addition of a single "some of the checkout_entry() calls
>   done so far might have been lazy, and I'll give them a chance to
>   clean up" might be palatable.  Anything more than that on the
>   caller side is not.

I haven't thought hard about the implementation, yet, but I'll try 
to stick to your suggestion and change as less code as possible on 
the caller sides.


> * You'd need to rein in the maximum parallelism somehow, as you do
>   not want to see hundreds of competing filter processes starting
>   only to tell the main loop over an index with hundreds of entries
>   that they are delayed checkouts.

I intend to implement this feature only for the new long running filter
process protocol. OK with you?


Thanks,
Lars

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16  9:53       ` Lars Schneider
@ 2016-11-16 18:15         ` Junio C Hamano
  2016-11-16 18:47           ` Lars Schneider
  2016-11-16 22:41         ` Jakub Narębski
  1 sibling, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2016-11-16 18:15 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Eric Wong, git

Lars Schneider <larsxschneider@gmail.com> writes:

>> * You'd need to rein in the maximum parallelism somehow, as you do
>>   not want to see hundreds of competing filter processes starting
>>   only to tell the main loop over an index with hundreds of entries
>>   that they are delayed checkouts.
>
> I intend to implement this feature only for the new long running filter
> process protocol. OK with you?

Do you mean that a long-running filter process interacting with
convert_to_worktree() called from checkout_entry() will be the only
codepath that will spawn multiple processes or threads?  

That is fine, but it does not change the fact that you still need to
limit the maximum parallelism there.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16 18:15         ` Junio C Hamano
@ 2016-11-16 18:47           ` Lars Schneider
  2016-11-16 19:19             ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Lars Schneider @ 2016-11-16 18:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Wong, git


> On 16 Nov 2016, at 19:15, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> * You'd need to rein in the maximum parallelism somehow, as you do
>>>  not want to see hundreds of competing filter processes starting
>>>  only to tell the main loop over an index with hundreds of entries
>>>  that they are delayed checkouts.
>> 
>> I intend to implement this feature only for the new long running filter
>> process protocol. OK with you?
> 
> Do you mean that a long-running filter process interacting with
> convert_to_worktree() called from checkout_entry() will be the only
> codepath that will spawn multiple processes or threads?  
> 
> That is fine, but it does not change the fact that you still need to
> limit the maximum parallelism there.

Filters using the long running protocol are spawned only once by Git. 
The filter process would get all the smudge requests via the pipe
protocol and is supposed to manage the parallelism on its own.

- Lars

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16 18:47           ` Lars Schneider
@ 2016-11-16 19:19             ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2016-11-16 19:19 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Eric Wong, git

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 16 Nov 2016, at 19:15, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>> Lars Schneider <larsxschneider@gmail.com> writes:
>> 
>>>> * You'd need to rein in the maximum parallelism somehow, as you do
>>>>  not want to see hundreds of competing filter processes starting
>>>>  only to tell the main loop over an index with hundreds of entries
>>>>  that they are delayed checkouts.
>>> 
>>> I intend to implement this feature only for the new long running filter
>>> process protocol. OK with you?
>> 
>> Do you mean that a long-running filter process interacting with
>> convert_to_worktree() called from checkout_entry() will be the only
>> codepath that will spawn multiple processes or threads?  
>> 
>> That is fine, but it does not change the fact that you still need to
>> limit the maximum parallelism there.
>
> Filters using the long running protocol are spawned only once by Git. 
> The filter process would get all the smudge requests via the pipe
> protocol and is supposed to manage the parallelism on its own.

Yes, I think we are on the same page.  You need to be careful not to
let the filter process go berserk spawning too many threads or
processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16  9:53       ` Lars Schneider
  2016-11-16 18:15         ` Junio C Hamano
@ 2016-11-16 22:41         ` Jakub Narębski
  2016-11-16 23:46           ` Junio C Hamano
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Narębski @ 2016-11-16 22:41 UTC (permalink / raw)
  To: Lars Schneider, Junio C Hamano; +Cc: Eric Wong, git

W dniu 16.11.2016 o 10:53, Lars Schneider pisze:
> On 15 Nov 2016, at 19:03, Junio C Hamano <gitster@pobox.com> wrote:
>> Lars Schneider <larsxschneider@gmail.com> writes:
>>
>>>> The filter itself would need to be aware of parallelism
>>>> if it lives for multiple objects, right?
>>>
>>> Correct. This way Git doesn't need to deal with threading...
[...]

>> * You'd need to rein in the maximum parallelism somehow, as you do
>>   not want to see hundreds of competing filter processes starting
>>   only to tell the main loop over an index with hundreds of entries
>>   that they are delayed checkouts.
> 
> I intend to implement this feature only for the new long running filter
> process protocol. OK with you?

If I remember and understand it correctly, current version of long
running process protocol processes files sequentially, one by one:
git sends file to filter wholly, and receives response wholly.

In the single-file filter case, git calls filter process as async
task, in a separate thread, so that one thread feeds the filter,
and main thread (I think?) reads from it, to avoid deadlocks.

Couldn't something like this be done for long running filter process,
via protocol extension?  Namely, Git would send in an async task
content to be filtered, and filter process would stream the response
back, in any order.  If it would be not enough, we could borrow
idea of channels, and be sending few files back concurrently in
parallel, as separate channels... though that would probably
require quite a bit of change in caller.

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16 22:41         ` Jakub Narębski
@ 2016-11-16 23:46           ` Junio C Hamano
  2016-11-17  9:19             ` Lars Schneider
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2016-11-16 23:46 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Lars Schneider, Eric Wong, git

Jakub Narębski <jnareb@gmail.com> writes:

>> I intend to implement this feature only for the new long running filter
>> process protocol. OK with you?
>
> If I remember and understand it correctly, current version of long
> running process protocol processes files sequentially, one by one:
> git sends file to filter wholly, and receives response wholly.
>
> In the single-file filter case, git calls filter process as async
> task, in a separate thread, so that one thread feeds the filter,
> and main thread (I think?) reads from it, to avoid deadlocks.
>
> Couldn't something like this be done for long running filter process,
> via protocol extension?

My reading of the message you are responding to is that Lars means
doing so by "implement this feature".  Instead of returning the
filtered bytes, a new protocol lets his filter to say "No result yet
for you to process, ask me later".


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-16 23:46           ` Junio C Hamano
@ 2016-11-17  9:19             ` Lars Schneider
  0 siblings, 0 replies; 16+ messages in thread
From: Lars Schneider @ 2016-11-17  9:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narębski, Eric Wong, git


> On 17 Nov 2016, at 00:46, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Jakub Narębski <jnareb@gmail.com> writes:
> 
>>> I intend to implement this feature only for the new long running filter
>>> process protocol. OK with you?
>> 
>> If I remember and understand it correctly, current version of long
>> running process protocol processes files sequentially, one by one:
>> git sends file to filter wholly, and receives response wholly.
>> 
>> In the single-file filter case, git calls filter process as async
>> task, in a separate thread, so that one thread feeds the filter,
>> and main thread (I think?) reads from it, to avoid deadlocks.
>> 
>> Couldn't something like this be done for long running filter process,
>> via protocol extension?
> 
> My reading of the message you are responding to is that Lars means
> doing so by "implement this feature".  Instead of returning the
> filtered bytes, a new protocol lets his filter to say "No result yet
> for you to process, ask me later".

Correct!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-15 18:03     ` Junio C Hamano
  2016-11-16  9:53       ` Lars Schneider
@ 2016-11-24 15:45       ` Lars Schneider
  2016-11-28 21:48         ` Junio C Hamano
  1 sibling, 1 reply; 16+ messages in thread
From: Lars Schneider @ 2016-11-24 15:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Wong, git


> On 15 Nov 2016, at 19:03, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> The filter itself would need to be aware of parallelism
>>> if it lives for multiple objects, right?
>> 
>> Correct. This way Git doesn't need to deal with threading...
> 
> I think you need to be careful about three things (at least; there
> may be more):
> 
> ...
> 
> * Done naively, it will lead to unmaintainable code, like this:
> 
>  ...

Hi Junio,

I started to work on the "delayed responses to Git clean/smudge filter
requests" implementation and I am looking for a recommendation regarding 
code maintainability:

Deep down in convert.c:636 `apply_multi_file_filter()` [1] the filter learns
from the external process that the filter response is not yet available.
I need to transport this information quite a few levels up the call
stack. 


# Option 1
I could do this by explicitly passing a pointer such as "int *is_delayed" 
to the function. This would mean I need to update the function definitions 
for all functions on my way through the stack:

int apply_multi_file_filter()
int apply_filter()
int convert_to_working_tree_internal()
int convert_to_working_tree()
...

# Option 2
All these functions pass-through an "int" return value that communicates
if the filter succeeded or failed. I could define a special return value
to communicate the third state: delayed. 


What way do you think is better from a maintenance point of view?
I prefer option 2 but I fear that these "special" values could confuse
future readers of the code.

Thanks,
Lars


[1] https://github.com/git/git/blob/e2b2d6a172b76d44cb7b1ddb12ea5bfac9613a44/convert.c#L673-L777


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-24 15:45       ` Lars Schneider
@ 2016-11-28 21:48         ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2016-11-28 21:48 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Eric Wong, git

Lars Schneider <larsxschneider@gmail.com> writes:

> What way do you think is better from a maintenance point of view?
> I prefer option 2 but I fear that these "special" values could confuse
> future readers of the code.

I recall getting confused by the redefinition of the meaning of
return value from the grep_directory() function when we started
threading the working-tree grep codepath at around 5b594f457a;
compared to that, as long as the "special" (and "normal") values are
made symbolic constants, I do not think it is too bad.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2016-11-14 21:09 RFC: Enable delayed responses to Git clean/smudge filter requests Lars Schneider
  2016-11-15  1:03 ` Eric Wong
@ 2017-01-09 20:44 ` Stefan Beller
  2017-01-11 12:57   ` Lars Schneider
  1 sibling, 1 reply; 16+ messages in thread
From: Stefan Beller @ 2017-01-09 20:44 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Git Mailing List

On Mon, Nov 14, 2016 at 1:09 PM, Lars Schneider
<larsxschneider@gmail.com> wrote:
> Hi,
>
> Git always performs a clean/smudge filter on files in sequential order.
> Sometimes a filter operation can take a noticeable amount of time.
> This blocks the entire Git process.
>
> I would like to give a filter process the possibility to answer Git with
> "I got your request, I am processing it, ask me for the result later!".
>
> I see the following way to realize this:
>
> In unpack-trees.c:check_updates() [1] we loop through the cache
> entries and "ask me later" could be an acceptable return value of the
> checkout_entry() call. The loop could run until all entries returned
> success or error.

Late to this thread, but here is an answer nevertheless.

I am currently working on getting submodules working
for working tree modifying commands (prominently checkout, but
also read-tree -u and any other caller that uses the code in
unpack-trees.)

Once the submodules are supported and used, I anticipate that
putting the files in the working tree on disk will become a bottle neck,
i.e. the checkout taking way too long for an oversized project.

So in the future we have to do something to make checkout fast
again, which IMHO is threading. My current vision is to have checkout
automatically choose a number of threads based on expected workload,
c.f. preload-index.c, line 18-25.

> The filter machinery is triggered in various other places in Git and
> all places that want to support "ask me later" would need to be patched
> accordingly.

I think this makes sense, even in a threaded git-checkout.
I assume this idea is implemented before threading hits checkout,
so a question on the design:

Who determines the workload that is acceptable?
From reading this email, it seems to be solely the filter that uses
as many threads/processes as it thinks is ok.

Would it be possible to enhance the protocol further to have
Git also mingle with the workload, i.e. tell the filter it is
allowed to use up (N-M) threads, as it itself already uses
M out of N configured threads?

(I do not want to discuss the details here, but only if such a thing
is viable with this approach as well)

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: RFC: Enable delayed responses to Git clean/smudge filter requests
  2017-01-09 20:44 ` Stefan Beller
@ 2017-01-11 12:57   ` Lars Schneider
  0 siblings, 0 replies; 16+ messages in thread
From: Lars Schneider @ 2017-01-11 12:57 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Git Mailing List


> On 09 Jan 2017, at 21:44, Stefan Beller <sbeller@google.com> wrote:
> 
> On Mon, Nov 14, 2016 at 1:09 PM, Lars Schneider
> <larsxschneider@gmail.com> wrote:
>> Hi,
>> 
>> Git always performs a clean/smudge filter on files in sequential order.
>> Sometimes a filter operation can take a noticeable amount of time.
>> This blocks the entire Git process.
>> 
>> I would like to give a filter process the possibility to answer Git with
>> "I got your request, I am processing it, ask me for the result later!".
>> 
>> I see the following way to realize this:
>> 
>> In unpack-trees.c:check_updates() [1] we loop through the cache
>> entries and "ask me later" could be an acceptable return value of the
>> checkout_entry() call. The loop could run until all entries returned
>> success or error.
> 
> Late to this thread, but here is an answer nevertheless.
> 
> I am currently working on getting submodules working
> for working tree modifying commands (prominently checkout, but
> also read-tree -u and any other caller that uses the code in
> unpack-trees.)
> 
> Once the submodules are supported and used, I anticipate that
> putting the files in the working tree on disk will become a bottle neck,
> i.e. the checkout taking way too long for an oversized project.
> 
> So in the future we have to do something to make checkout fast
> again, which IMHO is threading. My current vision is to have checkout
> automatically choose a number of threads based on expected workload,
> c.f. preload-index.c, line 18-25.

That sounds interesting! We are using "submodule.fetchjobs=0" to process
submodules in parallel already and it works great! Thanks a lot for
implementing this!


>> The filter machinery is triggered in various other places in Git and
>> all places that want to support "ask me later" would need to be patched
>> accordingly.
> 
> I think this makes sense, even in a threaded git-checkout.
> I assume this idea is implemented before threading hits checkout,
> so a question on the design:
> 
> Who determines the workload that is acceptable?
> From reading this email, it seems to be solely the filter that uses
> as many threads/processes as it thinks is ok.

Correct.


> Would it be possible to enhance the protocol further to have
> Git also mingle with the workload, i.e. tell the filter it is
> allowed to use up (N-M) threads, as it itself already uses
> M out of N configured threads?
> 
> (I do not want to discuss the details here, but only if such a thing
> is viable with this approach as well)

Yes, I think we could give a filter these kind of hints. However, it
would, of course, be up to the filter implementation to follow the hints.

In case you curious, here is the discussion on v1 of the delay implementation:
http://public-inbox.org/git/20170108191736.47359-1-larsxschneider@gmail.com/


Cheers,
Lars

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, back to index

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-14 21:09 RFC: Enable delayed responses to Git clean/smudge filter requests Lars Schneider
2016-11-15  1:03 ` Eric Wong
2016-11-15 14:29   ` Lars Schneider
2016-11-15 18:03     ` Junio C Hamano
2016-11-16  9:53       ` Lars Schneider
2016-11-16 18:15         ` Junio C Hamano
2016-11-16 18:47           ` Lars Schneider
2016-11-16 19:19             ` Junio C Hamano
2016-11-16 22:41         ` Jakub Narębski
2016-11-16 23:46           ` Junio C Hamano
2016-11-17  9:19             ` Lars Schneider
2016-11-24 15:45       ` Lars Schneider
2016-11-28 21:48         ` Junio C Hamano
2016-11-15 18:27     ` Eric Wong
2017-01-09 20:44 ` Stefan Beller
2017-01-11 12:57   ` Lars Schneider

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox