git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Matheus Tavares <matheus.bernardino@usp.br>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff Hostetler <git@jeffhostetler.com>
Subject: Re: [PATCH 3/5] parallel-checkout: add configuration options
Date: Wed, 31 Mar 2021 06:33:43 +0200	[thread overview]
Message-ID: <CAP8UFD3s3NUpi2eyPWFa5bL4rez1wNtj5-dUpv8dJLo2CKYu9A@mail.gmail.com> (raw)
In-Reply-To: <8c83e92445b4131e9b8f8e2aa29b00717b257d13.1616015337.git.matheus.bernardino@usp.br>

On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:

> The above benchmarks show that parallel checkout is most effective on
> repositories located on an SSD or over a distributed file system. For
> local file systems on spinning disks, and/or older machines, the
> parallelism does not always bring a good performance. For this reason,
> the default value for checkout.workers is one, a.k.a. sequential
> checkout.

I wonder how many people are still using HDD, and how many will still
use them in a few years.

I think having 1 as the default value for checkout.workers might be
good for a while for backward compatibility and stability, while
people who are interested can test parallel checkout on different
environments. But we might want, in a few releases, after some bugs,
if any, have been fixed, to use a default, maybe 10, that will provide
significant speedup for most people, and will justify the added
complexity, especially as your numbers still show a small speedup for
HDD when using 10.

> To decide the default value for checkout.thresholdForParallelism,
> another benchmark was executed in the "Local SSD" setup, where parallel
> checkout showed to be beneficial. This time, we compared the runtime of
> a `git checkout -f`, with and without parallelism, after randomly
> removing an increasing number of files from the Linux working tree. The
> "sequential fallback" column bellow corresponds to the executions where

s/bellow/below/

> checkout.workers was 10 but checkout.thresholdForParallelism was equal
> to the number of to-be-updated files plus one (so that we end up writing
> sequentially). Each test case was sampled 15 times, and each sample had
> a randomly different set of files removed. Here are the results:
>
>              sequential fallback   10 workers           speedup
> 10   files    772.3 ms ± 12.6 ms   769.0 ms ± 13.6 ms   1.00 ± 0.02
> 20   files    780.5 ms ± 15.8 ms   775.2 ms ±  9.2 ms   1.01 ± 0.02
> 50   files    806.2 ms ± 13.8 ms   767.4 ms ±  8.5 ms   1.05 ± 0.02
> 100  files    833.7 ms ± 21.4 ms   750.5 ms ± 16.8 ms   1.11 ± 0.04
> 200  files    897.6 ms ± 30.9 ms   730.5 ms ± 14.7 ms   1.23 ± 0.05
> 500  files   1035.4 ms ± 48.0 ms   677.1 ms ± 22.3 ms   1.53 ± 0.09
> 1000 files   1244.6 ms ± 35.6 ms   654.0 ms ± 38.3 ms   1.90 ± 0.12
> 2000 files   1488.8 ms ± 53.4 ms   658.8 ms ± 23.8 ms   2.26 ± 0.12
>
> From the above numbers, 100 files seems to be a reasonable default value
> for the threshold setting.
>
> Note: Up to 1000 files, we observe a drop in the execution time of the
> parallel code with an increase in the number of files. This is a rather
> odd behavior, but it was observed in multiple repetitions. Above 1000
> files, the execution time increases according to the number of files, as
> one would expect.
>
> About the test environments: Local SSD tests were executed on an
> i7-7700HQ (4 cores with hyper-threading) running Manjaro Linux. Local
> HDD tests were executed on an Intel(R) Xeon(R) E3-1230 (also 4 cores
> with hyper-threading), HDD Seagate Barracuda 7200.14 SATA 3.1, running
> Debian. NFS and EFS tests were executed on an Amazon EC2 c5n.xlarge
> instance, with 4 vCPUs. The Linux NFS server was running on a m6g.large
> instance with 2 vCPUSs and a 1 TB EBS GP2 volume. Before each timing,
> the linux repository was removed (or checked out back to its previous
> state), and `sync && sysctl vm.drop_caches=3` was executed.

Thanks for all these tests and details!

> Co-authored-by: Jeff Hostetler <jeffhost@microsoft.com>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>

[...]

> @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void)
>         return parallel_checkout.status;
>  }
>
> +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100

Using a "static const int" might be a bit better.

> +void get_parallel_checkout_configs(int *num_workers, int *threshold)
> +{
> +       if (git_config_get_int("checkout.workers", num_workers))
> +               *num_workers = 1;

I think it's better when an important default value like this "1" is
made more visible using a "static const int" or a "#define".

> +       else if (*num_workers < 1)
> +               *num_workers = online_cpus();
> +
> +       if (git_config_get_int("checkout.thresholdForParallelism", threshold))
> +               *threshold = DEFAULT_THRESHOLD_FOR_PARALLELISM;
> +}
> +
>  void init_parallel_checkout(void)
>  {
>         if (parallel_checkout.status != PC_UNINITIALIZED)
> @@ -554,11 +569,9 @@ static void write_items_sequentially(struct checkout *state)
>                 write_pc_item(&parallel_checkout.items[i], state);
>  }
>
> -#define DEFAULT_NUM_WORKERS 2

As I say above, it doesn't look like a good idea to have removed this.

> -int run_parallel_checkout(struct checkout *state)
> +int run_parallel_checkout(struct checkout *state, int num_workers, int threshold)
>  {
> -       int ret, num_workers = DEFAULT_NUM_WORKERS;
> +       int ret;
>
>         if (parallel_checkout.status != PC_ACCEPTING_ENTRIES)
>                 BUG("cannot run parallel checkout: uninitialized or already running");
> @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state)
>         if (parallel_checkout.nr < num_workers)
>                 num_workers = parallel_checkout.nr;
>
> -       if (num_workers <= 1) {
> +       if (num_workers <= 1 || parallel_checkout.nr < threshold) {

Here we check the number of workers...

>                 write_items_sequentially(state);
>         } else {
>                 struct pc_worker *workers = setup_workers(state, num_workers);

[...]

> @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o,
>                 }
>         }
>         stop_progress(&progress);
> -       errs |= run_parallel_checkout(&state);
> +       if (pc_workers > 1)

...but the number of workers was already checked here.


> +               errs |= run_parallel_checkout(&state, pc_workers, pc_threshold);
>         errs |= finish_delayed_checkout(&state, NULL);
>         git_attr_set_direction(GIT_ATTR_CHECKIN);

  reply	other threads:[~2021-03-31  4:35 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-17 21:12 [PATCH 0/5] Parallel Checkout (part 2) Matheus Tavares
2021-03-17 21:12 ` [PATCH 1/5] unpack-trees: add basic support for parallel checkout Matheus Tavares
2021-03-31  4:22   ` Christian Couder
2021-04-02 14:39     ` Matheus Tavares Bernardino
2021-03-17 21:12 ` [PATCH 2/5] parallel-checkout: make it truly parallel Matheus Tavares
2021-03-31  4:32   ` Christian Couder
2021-04-02 14:42     ` Matheus Tavares Bernardino
2021-03-17 21:12 ` [PATCH 3/5] parallel-checkout: add configuration options Matheus Tavares
2021-03-31  4:33   ` Christian Couder [this message]
2021-04-02 14:45     ` Matheus Tavares Bernardino
2021-03-17 21:12 ` [PATCH 4/5] parallel-checkout: support progress displaying Matheus Tavares
2021-03-17 21:12 ` [PATCH 5/5] parallel-checkout: add design documentation Matheus Tavares
2021-03-31  5:36   ` Christian Couder
2021-03-18 20:56 ` [PATCH 0/5] Parallel Checkout (part 2) Junio C Hamano
2021-03-19  3:24   ` Matheus Tavares
2021-03-19 22:58     ` Junio C Hamano
2021-03-31  5:42 ` Christian Couder
2021-04-08 16:16 ` [PATCH v2 " Matheus Tavares
2021-04-08 16:17   ` [PATCH v2 1/5] unpack-trees: add basic support for parallel checkout Matheus Tavares
2021-04-08 16:17   ` [PATCH v2 2/5] parallel-checkout: make it truly parallel Matheus Tavares
2021-04-08 16:17   ` [PATCH v2 3/5] parallel-checkout: add configuration options Matheus Tavares
2021-04-08 16:17   ` [PATCH v2 4/5] parallel-checkout: support progress displaying Matheus Tavares
2021-04-08 16:17   ` [PATCH v2 5/5] parallel-checkout: add design documentation Matheus Tavares
2021-04-08 19:52   ` [PATCH v2 0/5] Parallel Checkout (part 2) Junio C Hamano
2021-04-16 21:43   ` Junio C Hamano
2021-04-17 19:57     ` Matheus Tavares Bernardino
2021-04-19  9:41     ` Christian Couder
2021-04-19  0:14   ` [PATCH v3 " Matheus Tavares
2021-04-19  0:14     ` [PATCH v3 1/5] unpack-trees: add basic support for parallel checkout Matheus Tavares
2021-04-19  0:14     ` [PATCH v3 2/5] parallel-checkout: make it truly parallel Matheus Tavares
2021-04-19  0:14     ` [PATCH v3 3/5] parallel-checkout: add configuration options Matheus Tavares
2021-04-19  0:14     ` [PATCH v3 4/5] parallel-checkout: support progress displaying Matheus Tavares
2021-04-19  0:14     ` [PATCH v3 5/5] parallel-checkout: add design documentation Matheus Tavares
2021-04-19  9:36       ` Christian Couder
2021-04-19 19:53     ` [PATCH v4 0/5] Parallel Checkout (part 2) Matheus Tavares
2021-04-19 19:53       ` [PATCH v4 1/5] unpack-trees: add basic support for parallel checkout Matheus Tavares
2021-04-19 19:53       ` [PATCH v4 2/5] parallel-checkout: make it truly parallel Matheus Tavares
2021-04-19 19:53       ` [PATCH v4 3/5] parallel-checkout: add configuration options Matheus Tavares
2021-04-19 19:53       ` [PATCH v4 4/5] parallel-checkout: support progress displaying Matheus Tavares
2021-04-19 19:53       ` [PATCH v4 5/5] parallel-checkout: add design documentation Matheus Tavares

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAP8UFD3s3NUpi2eyPWFa5bL4rez1wNtj5-dUpv8dJLo2CKYu9A@mail.gmail.com \
    --to=christian.couder@gmail.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=matheus.bernardino@usp.br \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).