git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Ramsay Jones <ramsay@ramsayjones.plus.com>,
	Jacob Keller <jacob.keller@gmail.com>, Jeff King <peff@peff.net>,
	Jonathan Nieder <jrnieder@gmail.com>,
	Johannes Schindelin <johannes.schindelin@gmail.com>,
	Jens Lehmann <Jens.Lehmann@web.de>,
	Vitali Lovich <vlovich@gmail.com>,
	Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCHv4 06/14] run-command: add an asynchronous parallel child processor
Date: Wed, 23 Sep 2015 10:53:07 -0700	[thread overview]
Message-ID: <CAGZ79kYaqFRPfRORbknTyez5u0d6_BD0d5wBtAnjSkb4sUBUBA@mail.gmail.com> (raw)
In-Reply-To: <xmqqeghpzm0y.fsf@gitster.mtv.corp.google.com>

On Tue, Sep 22, 2015 at 11:29 PM, Junio C Hamano <gitster@pobox.com> wrote:

>
> And this one, when get_next_task() says "nothing more to do", is
> clearly "we returned without starting anything", so according to the
> comment it should be returning 0, but the code returns 1, which
> looks incorrect.
>
>> +     if (start_command(&pp->children[i].process))
>> +             pp->start_failure(pp->data,
>> +                               &pp->children[i].process,
>> +                               &pp->children[i].err);
>
> What should happen if start_failure returns without dying?
> Shouldn't this function return something, without doing the
> remainder of it?  i.e.
>
>         if (start_command(...)) {
>                 pp->start_failur(...);
>                 return SOMETHING;
>         }

Right, I forgot about that code path as I was blinded by the obviousness
("If you cannot start a process, of course we'll die").

So for this SOMETHING we need to decide if it should signal
that an immediate retry can be done. But then we could perform the
immediate retry ourselves:

startnewtask:
    if (!pp->get_next_task(pp->data,
                   &pp->children[i].process,
                   &pp->children[i].err))
        return 0;

    if (start_command(&pp->children[i].process)) {
        pp->start_failure(pp->data,
                  &pp->children[i].process,
                  &pp->children[i].err);
        goto startnewtask;
    }

But this could result in an endless loop.
Even if we would decide to return to the caller run_processes_parallel
and let them decide to try again, this version of the patch may produce an
infinite loop there.

The other alternative would be to make SOMETHING signal to not
immediately try again. ("We failed to start a child process, give it some time
by doing the poll/output and try again")

This however could not finish all workloads reliably as we may fail to start
the first child, such that there are 0 children processes running and the
control loop in run_processes_parallel shuts down the whole parallel processor.

So for now I'd lean on having the SOMETHING be the same boolean as
a successful start (failure -1, successful start 1, no more pending work 0)
and the difference between -1 and 1 can be sorted out in a later patch, which
introduces workloads with failing children.

>> +
>> +     while (1) {
>> +             while (pp.nr_processes < pp.max_processes &&
>> +                    !pp_start_one(&pp))
>> +                     ; /* nothing */
>> +             if (!pp.nr_processes)
>> +                     break;
>
> This inner loop is why I think "did we or did we not spawn a new
> process?" is not a great interface.

Right, we actually need to return whether we have nothing more to do
("Don't even try to call me again") or if we did something useful and expect
to do more useful things in the next call. (Either starting anew command or
finding out it failed).

This would be indicated by the -1/1/0 return signals.

>
> The reason why it is not a great interface is because there are two
> possible reasons why pp_start_one() does not spawn a new process,
> and this caller wants to behave differently depending on why it did
> not spawn a new process.  They are:
>
>  * get_next_task() truly ran out of things to do.
>
>  * get_next_task() gave us a task, but it did not start, and
>    start_failure was set not to die (e.g. the function can be used
>    to tell the next_task machinery that it needs to return a
>    replacement task for the one that failed to run.  That way, upon
>    next call to get_next_task, a replacement task can run instead of
>    the old one that failed).
>
> For the former, we want to stop looping, for the latter, we
> definitely do want to keep looping, as we want to make another call
> to get_next_task() to grab the replacement task for the one that
> just failed.
>
> So I think it makes more sense to define the meaning of the return
> value from pp_start_one() differently from the way this patch
> defines.  "Return 0 when we truly ran out of things to do, otherwise
> return non-zero", for example, would make more sense.

ok, we have the same opinion. I just documented poorly.

>  The return
> value does not tell you if the call resulted in one more process,
> but that is not any loss, as you can look at pp.nr_processes
> yourself if you really cared.
>
> With that, the above caller could be updated, with optional gradual
> ramp_up, like so:
>
>         #define RAMP_UP_LIMIT 2
>
>         while (1) {
>                 int ramp_up;
>                 int no_more_task;
>
>                 for (no_more_task = 0, ramp_up = RAMP_UP_LIMIT;
>                      !no_more_task && ramp_up && pp.nr_processes < pp.max_processes;
>                      ramp_up--)
>                         if (!pp_start_one(&pp))
>                                 no_more_task = 1;

I would not have the no_more_task variable, but just reuse
ramp_up and set it to zero in case of !pp_start_one(&pp).

I am not sure if the ramp up machinery is really needed.
I modified the test-run-command test function to start up to 400 processes.
(Most people will use less than 400 processes in the next 5 years), and run
just as in t0061:

    ./test-run-command run-command-parallel-400 sh -c "printf
\"%s\n%s\n\" Hello World"

The output felt immediate (not slowed down or anything). The numbers seem to
support that

    real 0m0.110s
    user 0m0.045s
    sys 0m0.366s

Any delay below 0.1 second cannot really be perceived by a human. You
can sure tell
a difference of 0.1 second in say 2 acoustic signals or light flashes,
but you cannot
tell that the output "was slow". So IMHO the ramp up machinery doesn't
have a high
priority for now.


>
>                 if (!pp.nr_processes && no_more_task)
>                         break;
>
> If you prefer to swamp the system with a thundering herd at the
> beginning, you can define RAMP_UP_LIMIT to really a high value
> instead, e.g. "#define RAMPUP_LIMIT pp.max_processes".  I however
> would not recommend it because doing so would hurt the perceived
> latency at the beginning.
>
> After the system goes into a steady state, how you set RAMP_UP_LIMIT
> would not make that much difference, as your slots should be almost
> always full and you will be replenishing an open slot with a single
> task as each running task finishes, and you would not be running
> more than one pp_start_one() at a time anyway.

Yeah there we could have a simple

    if (pp->nr_processes == pp->max_processes)
        poll_timeout = 5 seconds
    else
        poll_timeout = 10..100 milliseconds

>
>> +             pp_buffer_stderr(&pp);
>> +             pp_output(&pp);
>> +             pp_collect_finished(&pp);
>> +     }
>> +
>> +     pp_cleanup(&pp);
>> +
>> +     return 0;
>> +}

  reply	other threads:[~2015-09-23 17:53 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-23  1:45 [PATCHv4 00/14] fetch submodules in parallel and a preview on parallel "submodule update" Stefan Beller
2015-09-23  1:45 ` [PATCHv4 01/14] submodule: Send "Fetching submodule <foo>" to standard error Stefan Beller
2015-09-23  1:45 ` [PATCHv4 02/14] xread: poll on non blocking fds Stefan Beller
2015-09-23  1:45 ` [PATCHv4 03/14] xread_nonblock: add functionality to read from fds without blocking Stefan Beller
2015-09-23  1:45 ` [PATCHv4 04/14] strbuf: add strbuf_read_once to read " Stefan Beller
2015-09-23  1:45 ` [PATCHv4 05/14] run-command: factor out return value computation Stefan Beller
2015-09-23  1:45 ` [PATCHv4 06/14] run-command: add an asynchronous parallel child processor Stefan Beller
2015-09-23  6:29   ` Junio C Hamano
2015-09-23 17:53     ` Stefan Beller [this message]
2015-09-23 18:04       ` Junio C Hamano
2015-09-23 19:34         ` Junio C Hamano
2015-09-23 19:39           ` Stefan Beller
2015-09-23 19:47             ` Junio C Hamano
2015-09-23  6:47   ` Junio C Hamano
2015-09-23 14:59     ` Junio C Hamano
2015-09-23 17:54       ` Junio C Hamano
2015-09-23 23:41         ` [PATCHv5] Another squash on " Stefan Beller
2015-09-24  2:17           ` Junio C Hamano
2015-09-24 21:13             ` [PATCH 0/2] " Stefan Beller
2015-09-24 21:13               ` [PATCH 2/2] SQUASH for "fetch_populated_submodules: use new parallel job processing" Stefan Beller
2015-09-24 21:13               ` [PATCH 1/2] SQUASH??? Stefan Beller
2015-09-25  0:49                 ` Junio C Hamano
2015-09-25  1:09                   ` Junio C Hamano
2015-09-25 17:52                   ` Stefan Beller
2015-09-25 17:56                     ` Junio C Hamano
2015-09-25  1:08               ` [PATCH 0/2] Another squash on run-command: add an asynchronous parallel child processor Junio C Hamano
2015-09-25 18:56                 ` Stefan Beller
2015-09-25 19:04                   ` Junio C Hamano
2015-09-25 19:19                     ` Stefan Beller
2015-09-25 19:32                   ` Junio C Hamano
2015-09-23  1:45 ` [PATCHv4 07/14] fetch_populated_submodules: use new parallel job processing Stefan Beller
2015-09-23  1:45 ` [PATCHv4 08/14] submodules: allow parallel fetching, add tests and documentation Stefan Beller
2015-09-23  1:45 ` [PATCHv4 09/14] submodule-config: Untangle logic in parse_config Stefan Beller
2015-09-23  1:45 ` [PATCHv4 10/14] submodule config: keep update strategy around Stefan Beller
2015-09-23  1:45 ` [PATCHv4 11/14] git submodule update: cmd_update_recursive Stefan Beller
2015-09-23  1:45 ` [PATCHv4 12/14] git submodule update: cmd_update_clone Stefan Beller
2015-09-23 20:13   ` Junio C Hamano
2015-09-23  1:45 ` [PATCHv4 13/14] git submodule update: cmd_update_fetch Stefan Beller
2015-09-23  1:45 ` [PATCHv4 14/14] Rewrite submodule update in C Stefan Beller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGZ79kYaqFRPfRORbknTyez5u0d6_BD0d5wBtAnjSkb4sUBUBA@mail.gmail.com \
    --to=sbeller@google.com \
    --cc=Jens.Lehmann@web.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    --cc=johannes.schindelin@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=sunshine@sunshineco.com \
    --cc=vlovich@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).