Re: [PATCH v3 1/7] benchtests: Add bench-qsort - Adhemerval Zanella via Libc-alpha

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org>
To: Noah Goldstein <goldstein.w.n@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort
Date: Fri, 15 Oct 2021 14:19:29 -0300	[thread overview]
Message-ID: <90308963-e059-5e7d-257f-18f487f5f1a7@linaro.org> (raw)
In-Reply-To: <CAFUsyfK8oNc5=wsCZTDeaZ4_bpFnNrgzxr-vvfy=LqF22++mfg@mail.gmail.com>



On 15/10/2021 13:39, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/10/2021 00:19, Noah Goldstein wrote:
>>> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> This patch adds a qsort benchmark.  I tried to model the benchmark
>>>> taking in consideration the possible input variation in both internal
>>>> element size, element numbers, and internal state for 1. real word cases
>>>> and 2. possible scenarios based on hardware characteristics.
>>>>
>>>> For 1. I tracked qsort usage (using a simple preload library to dump its
>>>> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
>>>> GCC 8 bootstrap build shows 51786641 call to qsort with the following
>>>> characterics:
>>>>
>>>> Key: number of elements:
>>>> key=2 : 39.74
>>>> key=3 : 19.23
>>>> key=4 : 9.77
>>>> key=1 : 8.44
>>>> key=0 : 6.60
>>>> key=5 : 4.40
>>>> key=7 : 2.37
>>>> key=6 : 2.25
>>>> key=9 : 1.48
>>>> key=8 : 0.97
>>>>
>>>> Key: element size in bytes:
>>>> key=8 : 91.74
>>>> key=32 : 3.62
>>>> key=4 : 2.42
>>>> key=40 : 1.20
>>>> key=16 : 0.67
>>>> key=24 : 0.30
>>>> key=48 : 0.05
>>>> key=56 : 0.00
>>>> key=1 : 0.00
>>>>
>>>> Key: total size (number of elements * element size):
>>>> key=16 : 35.98
>>>> key=24 : 18.67
>>>> key=32 : 9.79
>>>> key=8 : 8.28
>>>> key=0 : 6.60
>>>> key=40 : 4.21
>>>> key=64 : 3.15
>>>> key=48 : 2.24
>>>> key=56 : 2.15
>>>> key=80 : 1.45
>>>>
>>>> So for GCC:
>>>>
>>>>   - 80% of total qsort usage are done with 10 elements of less.
>>>>   - All usages are done element size of maximum of 56 bytes.
>>>>   - 90% of calls are done with array of maximum size of 80 bytes or
>>>>     less.
>>>>
>>>> (GCC on recent version now uses a internal qsort implementation instead
>>>> of the libc one).
>>>>
>>>> The Firefox usage was done with 2 hours of usage, with first 10 minutes
>>>> activelly openning and closing different types of sites.  It resulted in
>>>> 21042 calls with following characteristics:
>>>>
>>>> Key: number of elements:
>>>> key=7 : 24.40
>>>> key=1 : 10.44
>>>> key=3 : 6.33
>>>> key=4 : 5.81
>>>> key=2 : 5.46
>>>> key=6 : 4.80
>>>> key=17 : 4.54
>>>> key=0 : 3.07
>>>> key=5 : 3.05
>>>> key=9 : 2.51
>>>> key=12 : 2.06
>>>>
>>>> Key: element size in bytes:
>>>> key=8 : 94.49
>>>> key=28 : 4.40
>>>> key=2 : 0.70
>>>> key=16 : 0.19
>>>> key=36 : 0.07
>>>> key=12 : 0.07
>>>> key=40 : 0.07
>>>> key=24 : 0.03
>>>>
>>>> Key: total size (number of elements * element size):
>>>> key=56 : 24.20
>>>> key=8 : 10.27
>>>> key=24 : 6.36
>>>> key=32 : 5.86
>>>> key=16 : 5.46
>>>> key=48 : 4.80
>>>> key=476 : 3.75
>>>> key=0 : 3.07
>>>> key=40 : 3.05
>>>> key=72 : 2.50
>>>>
>>>> So for Firefox:
>>>>
>>>>   - 72% of total qsort usage are done with 18 elements of less.
>>>>   - All usages are done element size of maximum of 40 bytes.
>>>>   - 70% of calls are done with array of maximum size of 476 bytes or
>>>>     less.
>>>>
>>>> For 2. I used the idea of a machine with 3 levels of cache with sizes
>>>> L1 32kb, L2 256kb, and L3 4096Kb.
>>>>
>>>> It resulted in a benchmark with following traits:
>>>>
>>>>   * It checks four types of input arrays: sorted, mostly sorted, random,
>>>>     repeated, and bitonic:
>>>>
>>>>     - 'sorted': the array is already sorted.
>>>>     - 'mostly sorted': the array will have a certain number of random
>>>>        position with random values (current ratio used is 20%).
>>>>     - 'random' the array will contain random elements.
>>>>     - 'repeated' the array will have random elements with a certain
>>>>        number (current ratio is 20%) of a repeated element distributed
>>>>        randomly.
>>>>     - 'bitonic': elements will be strictly increasing to up middle of the
>>>>        array then strictly decreasing.  This is to trigger the
>>>>        pattern-defeting input for the median of 3 pivot selection used
>>>>        in quicksort implementation.
>>>>
>>>>   * Three elements sizes are checked: uint32_t, uint64_t, and an
>>>>     element with 32 bytes (but using the uint32_t comparison checks).
>>>>     These element sizes are used to 1. to avoid include the comparison
>>>>     function itself and/or memory copy in sort benchmark itself, and 2.
>>>>     because key of size_t are the most used for both GCC and Firefox.
>>>>
>>>>   * Five different element numbers: 64 (which cover mostly of used
>>>>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
>>>>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>>>>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
>>>>     --nelem option.
>>>>
>>>> Checked on x86_64-linux-gnu
>>>> ---
>>>>  benchtests/Makefile      |   2 +-
>>>>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 196 insertions(+), 1 deletion(-)
>>>>  create mode 100644 benchtests/bench-qsort.c
>>>>
>>>> diff --git a/benchtests/Makefile b/benchtests/Makefile
>>>> index 1530939a8c..781cbca8af 100644
>>>> --- a/benchtests/Makefile
>>>> +++ b/benchtests/Makefile
>>>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>>>>  include ../gen-locales.mk
>>>>  endif
>>>>
>>>> -stdlib-benchset := strtod
>>>> +stdlib-benchset := strtod qsort
>>>>
>>>>  stdio-common-benchset := sprintf
>>>>
>>>> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
>>>> new file mode 100644
>>>> index 0000000000..905aa6d7b6
>>>> --- /dev/null
>>>> +++ b/benchtests/bench-qsort.c
>>>> @@ -0,0 +1,195 @@
>>>> +/* Measure qsort function.
>>>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>>>> +   This file is part of the GNU C Library.
>>>> +
>>>> +   The GNU C Library is free software; you can redistribute it and/or
>>>> +   modify it under the terms of the GNU Lesser General Public
>>>> +   License as published by the Free Software Foundation; either
>>>> +   version 2.1 of the License, or (at your option) any later version.
>>>> +
>>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> +   Lesser General Public License for more details.
>>>> +
>>>> +   You should have received a copy of the GNU Lesser General Public
>>>> +   License along with the GNU C Library; if not, see
>>>> +   <http://www.gnu.org/licenses/>.  */
>>>> +
>>>> +#include <array_length.h>
>>>> +#include <bench-timing.h>
>>>> +#include <errno.h>
>>>> +#include <getopt.h>
>>>> +#include <json-lib.h>
>>>> +#include <stdio.h>
>>>> +#include <unistd.h>
>>>> +
>>>> +#include <stdlib/tst-qsort-common.c>
>>>> +
>>>> +/* Number of elements of determined type to be measured.  */
>>>> +static const size_t default_nelems[] =
>>>> +{
>>>> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
>>>> +                              on GCC build.  */
>>>> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
>>>> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
>>>> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
>>>> +};
>>>> +
>>>> +#define OPT_NELEM 10000
>>>> +#define OPT_SEED  10001
>>>> +#define CMDLINE_OPTIONS \
>>>> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
>>>> +  { "seed", required_argument, NULL, OPT_SEED },
>>>> +
>>>> +static const size_t max_nelem = 16;
>>>> +static size_t *elems = NULL;
>>>> +static size_t nelem = 0;
>>>> +static uint64_t seed = 0;
>>>> +static bool seed_set = false;
>>>> +
>>>> +static void
>>>> +cmdline_process_function (int c)
>>>> +{
>>>> +  switch (c)
>>>> +    {
>>>> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
>>>> +        The elements sizes as passed with a ':' as the delimiter, for
>>>> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
>>>> +      case OPT_NELEM:
>>>> +        {
>>>> +         elems = xmalloc (max_nelem * sizeof (size_t));
>>>> +         nelem = 0;
>>>> +
>>>> +         char *saveptr;
>>>> +         char *token;
>>>> +         token = strtok_r (optarg, ":", &saveptr);
>>>> +         if (token == NULL)
>>>> +           {
>>>> +             printf ("error: invalid --nelem value\n");
>>>> +             exit (EXIT_FAILURE);
>>>> +           }
>>>> +         do
>>>> +           {
>>>> +             if (nelem == max_nelem)
>>>> +               {
>>>> +                 printf ("error: invalid --nelem value (max elem)\n");
>>>> +                 exit (EXIT_FAILURE);
>>>> +               }
>>>> +             elems[nelem++] = atol (token);
>>>> +             token = strtok_r (saveptr, ":", &saveptr);
>>>> +           } while (token != NULL);
>>>> +        }
>>>> +      break;
>>>> +
>>>> +      /* handle the --seed option to use a different seed than a random one.
>>>> +        The SEED used should be a uint64_t number.  */
>>>> +      case OPT_SEED:
>>>> +       {
>>>> +         uint64_t value = strtoull (optarg, NULL, 0);
>>>> +         if (errno == ERANGE || value > UINT64_MAX)
>>>> +           {
>>>> +             printf ("error: seed should be a value in range of "
>>>> +                     "[0, UINT64_MAX]\n");
>>>> +             exit (EXIT_FAILURE);
>>>> +           }
>>>> +         seed = value;
>>>> +         seed_set = true;
>>>> +       }
>>>> +    }
>>>> +}
>>>> +
>>>> +#define CMDLINE_PROCESS cmdline_process_function
>>>> +
>>>> +static int
>>>> +do_test (void)
>>>> +{
>>>> +  random_init ();
>>>> +
>>>> +
>>>> +  json_ctx_t json_ctx;
>>>> +
>>>> +  json_init (&json_ctx, 0, stdout);
>>>> +
>>>> +  json_document_begin (&json_ctx);
>>>> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
>>>> +
>>>> +  json_attr_object_begin (&json_ctx, "functions");
>>>> +  json_attr_object_begin (&json_ctx, "qsort");
>>>> +  json_attr_uint (&json_ctx, "seed", seed);
>>>> +
>>>> +  json_array_begin (&json_ctx, "results");
>>>> +
>>>> +  const size_t *welem = elems == NULL ? default_nelems : elems;
>>>> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
>>>> +
>>>> +  struct test_t
>>>> +  {
>>>> +    size_t type_size;
>>>> +    cmpfunc_t cmpfunc;
>>>> +  };
>>>> +  static const struct test_t tests[] =
>>>> +  {
>>>> +    { sizeof (uint32_t), uint32_t_cmp },
>>>> +    { sizeof (uint64_t), uint64_t_cmp },
>>>> +    /* Test swap with large elements.  */
>>>> +    { 32,                uint32_t_cmp },
>>>> +  };
>>>> +
>>>> +  const size_t inner_loop_iters = 16;
>>>> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
>>>> +    {
>>>> +      for (const struct array_t *arraytype = arraytypes;
>>>> +          arraytype < array_end (arraytypes);
>>>> +          ++arraytype)
>>>> +       {
>>>> +         for (int k = 0; k < wnelem; k++)
>>>> +           {
>>>> +             json_element_object_begin (&json_ctx);
>>>> +
>>>> +             size_t arraysize = welem[k] * test->type_size;
>>>> +
>>>> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
>>>> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
>>>> +             json_attr_string (&json_ctx, "property", arraytype->name);
>>>> +
>>>> +             void *base = create_array (welem[k], test->type_size,
>>>> +                                        arraytype->type);
>>>> +             void *work = xmalloc (arraysize);
>>>> +
>>>> +             timing_t total = 0;
>>>> +
>>>> +             for (int n = 0; n < inner_loop_iters; n++)
>>>> +               {
>>>> +                 memcpy (work, base, arraysize);
>>>> +
>>>> +                 timing_t start, end, diff;
>>>> +                 TIMING_NOW (start);
>>>> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
>>>> +                 TIMING_NOW (end);
>>>> +
>>>> +                 TIMING_DIFF (diff, start, end);
>>>> +                 TIMING_ACCUM (total, diff);
>>>> +               }
>>>> +
>>> To avoid overhead from timing. Maybe you could allocate / initialize
>>> multiple buffers to test a given sort on. Then instead of having to
>>> re-memcpy / time you just loop through the N initialized buffers
>>> with one timing call. For the smaller sorts this may help decrease
>>> the timing overhead.
>>
>> I think this make sense, although it would require more memory. I changed
>> to:
>>
>> ---
>>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>>     {
>>       for (size_t at = 0; at < array_length (arraytypes); at++)
>>         {
>>           for (int ne = 0; ne < wnelem; ne++)
>>             {
>>               void *inputs[INNER_LOOP_ITERS];
>>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>>                 inputs[i] = create_array (welem[ne], test->type_size,
>>                                           arraytypes[at].type);
> 
> I think it may pay to use a 1-dimensional array. i.e
> 
> size_t qsort_bytes = welem[ne] * test->type_size;
> void * inputs = malloc(qsort_bytes * NTESTS);
> // initialize [inputs, inputs + qsort_bytes]
> for (int i = 1; i < NTESTS; ++i)
>     memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes)
> 
> then increment inputs by 'qsort_bytes' in the test loop.
> 
> NTESTS may be fine being INNER_LOOP_ITERS but
> if we are concerned about using too much memory
> we could add another loop through INNER_LOOP_ITERS / NTESTS
> and accumulate the time diffs.

Right, I added a fill_array() (which is similar to create_array,
but do not allocate memory) I changed to:

  size_t qsort_size = welem[ne] * test->type_size;
  char *inputs = xmalloc (qsort_size * INNER_LOOP_ITERS);                                          
  for (int i = 0; i < INNER_LOOP_ITERS; i++)
    fill_array (inputs + qsort_size * i, welem[ne],                                                
                est->type_size, arraytypes[at].type);

next prev parent reply	other threads:[~2021-10-15 17:19 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
2021-09-06 18:30     ` Adhemerval Zanella via Libc-alpha
2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
2021-10-15 12:52     ` Adhemerval Zanella via Libc-alpha
2021-10-15 16:39       ` Noah Goldstein via Libc-alpha
2021-10-15 17:19         ` Adhemerval Zanella via Libc-alpha [this message]
2021-09-03 17:11 ` [PATCH v3 2/7] support: Fix getopt_long with CMDLINE_OPTIONS Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305) Adhemerval Zanella via Libc-alpha
2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
2021-10-15 13:29       ` Adhemerval Zanella via Libc-alpha
2021-10-15 17:17         ` Noah Goldstein via Libc-alpha
2021-10-15 17:34           ` Adhemerval Zanella via Libc-alpha
2021-10-15 17:45             ` Noah Goldstein via Libc-alpha
2021-10-15 17:56               ` Adhemerval Zanella via Libc-alpha
2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
2021-10-15 16:45       ` Noah Goldstein via Libc-alpha
2021-10-15 17:21         ` Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 4/7] stdlib: Move insertion sort out qsort Adhemerval Zanella via Libc-alpha
2021-09-06 20:35   ` Fangrui Song via Libc-alpha
2021-09-06 20:48     ` Fangrui Song via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 5/7] stdlib: qsort: Move some macros to inline function Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
2021-09-06 18:43     ` Adhemerval Zanella via Libc-alpha
2021-09-06 20:23   ` Fangrui Song via Libc-alpha
2021-10-13  3:53   ` Noah Goldstein via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 7/7] stdlib: Remove use of mergesort on qsort (BZ #21719) Adhemerval Zanella via Libc-alpha
2021-09-03 19:18 ` [PATCH v3 0/7] Use introsort for qsort Paul Eggert
2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
2021-09-06 18:19       ` Adhemerval Zanella via Libc-alpha
2021-09-07  0:14     ` Paul Eggert
2021-09-07 14:32       ` Adhemerval Zanella via Libc-alpha
2021-09-07 17:39         ` Paul Eggert
2021-09-07 18:07           ` Adhemerval Zanella via Libc-alpha
2021-09-07 19:28             ` Paul Eggert
2021-09-08 11:56               ` Adhemerval Zanella via Libc-alpha
2021-09-09  0:39                 ` Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=90308963-e059-5e7d-257f-18f487f5f1a7@linaro.org \
    --to=libc-alpha@sourceware.org \
    --cc=adhemerval.zanella@linaro.org \
    --cc=goldstein.w.n@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).