From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-bounces+e=80x24.org@sourceware.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS3215 2.6.0.0/16
X-Spam-Status: No, score=-4.5 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI,NICE_REPLY_A,
	RCVD_IN_DNSWL_MED,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham
	autolearn_force=no version=3.4.2
Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id BBFBD1F670
	for <e@80x24.org>; Fri, 15 Oct 2021 12:52:59 +0000 (UTC)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id E4C9F3857C76
	for <e@80x24.org>; Fri, 15 Oct 2021 12:52:58 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E4C9F3857C76
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1634302378;
	bh=XPJs5N20CYaWgXb2xEU00HjmyuNKSdMx/Q6yPmPiATE=;
	h=Subject:To:References:Date:In-Reply-To:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=ivXjCCNpK3kJuPbDxRlWBzuCbUTga2RRFo10ETA+bpONHImWQW0WUgKG3+Xht3GrJ
	 mwJtWf9tmc09a+Q1OT9t6GPbYnRz5JFMm6GM5dBjbpj9k+1elA4SLfOvbPV4HOzyXz
	 P43lxpnuJ3kqgYpTjOwlEimgcqhr0oTVcuwnDT+I=
Received: from mail-ua1-x92d.google.com (mail-ua1-x92d.google.com
 [IPv6:2607:f8b0:4864:20::92d])
 by sourceware.org (Postfix) with ESMTPS id C08BF385802D
 for <libc-alpha@sourceware.org>; Fri, 15 Oct 2021 12:52:37 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C08BF385802D
Received: by mail-ua1-x92d.google.com with SMTP id u5so17642691uao.13
 for <libc-alpha@sourceware.org>; Fri, 15 Oct 2021 05:52:37 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:subject:to:cc:references:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=XPJs5N20CYaWgXb2xEU00HjmyuNKSdMx/Q6yPmPiATE=;
 b=KHP6/fWCJKyF64kx2EwISLBB8Z12y6XrdFmjBGcZSFN/H7ntRK4/D1R9XvSVu9s0eo
 3e5Uobk5OEWNJoAedtfzMC7tGFYpO8+5ovOA4paXQOzmmzFO1mKF4Wke3VEa/PuYQC9a
 VscQx1ElhRcgvARt5pfTH2mk5ftVQtLTEhTWqVeg6Sf//vl3pmBkYHOJz8KZySVvMZ9+
 6K5zyHMCa3vDmMzBQcGUc/1bMCte1SZ4KkShYKKZnLoN860KfpOGzkys/kkjBDd3DkfF
 ejknkCVfQZZ0fHYqDTjJOBEKB7HjSV06ul3dQ+nmOGlMioQtXJslfpkE3gZ9Xhlet4tb
 /oGA==
X-Gm-Message-State: AOAM532xFhW+IcztfFfY2MOhoAUH51Fbc9bfRTBE3Q156st4kZAR3vr/
 Jfv39kB1OxObgyANCFjCDahq6VYq0scn3A==
X-Google-Smtp-Source: ABdhPJwuKV3y10zHX1W/ZInGy72d89b7KBTjb9CwIdxQU3+vbtwcA4OkUG/1PI95A6R6YVKs3Ar1WQ==
X-Received: by 2002:a05:6102:e06:: with SMTP id
 o6mr9219269vst.53.1634302356959; 
 Fri, 15 Oct 2021 05:52:36 -0700 (PDT)
Received: from ?IPv6:2804:431:c7ca:c6c7:f05e:9652:ab99:7fa2?
 ([2804:431:c7ca:c6c7:f05e:9652:ab99:7fa2])
 by smtp.gmail.com with ESMTPSA id p6sm3826999vko.16.2021.10.15.05.52.35
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Fri, 15 Oct 2021 05:52:36 -0700 (PDT)
Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort
To: Noah Goldstein <goldstein.w.n@gmail.com>
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-2-adhemerval.zanella@linaro.org>
 <CAFUsyfKw97xiQmcaRGP0QjVq2sq-KN9HbEv_BDzAFAt3UZfoMA@mail.gmail.com>
Message-ID: <b8dadcca-a668-aeda-529b-b01266a16d88@linaro.org>
Date: Fri, 15 Oct 2021 09:52:34 -0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.13.0
MIME-Version: 1.0
In-Reply-To: <CAFUsyfKw97xiQmcaRGP0QjVq2sq-KN9HbEv_BDzAFAt3UZfoMA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: GNU C Library <libc-alpha@sourceware.org>
Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces+e=80x24.org@sourceware.org>


On 13/10/2021 00:19, Noah Goldstein wrote:
> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> This patch adds a qsort benchmark.  I tried to model the benchmark
>> taking in consideration the possible input variation in both internal
>> element size, element numbers, and internal state for 1. real word cases
>> and 2. possible scenarios based on hardware characteristics.
>>
>> For 1. I tracked qsort usage (using a simple preload library to dump its
>> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
>> GCC 8 bootstrap build shows 51786641 call to qsort with the following
>> characterics:
>>
>> Key: number of elements:
>> key=2 : 39.74
>> key=3 : 19.23
>> key=4 : 9.77
>> key=1 : 8.44
>> key=0 : 6.60
>> key=5 : 4.40
>> key=7 : 2.37
>> key=6 : 2.25
>> key=9 : 1.48
>> key=8 : 0.97
>>
>> Key: element size in bytes:
>> key=8 : 91.74
>> key=32 : 3.62
>> key=4 : 2.42
>> key=40 : 1.20
>> key=16 : 0.67
>> key=24 : 0.30
>> key=48 : 0.05
>> key=56 : 0.00
>> key=1 : 0.00
>>
>> Key: total size (number of elements * element size):
>> key=16 : 35.98
>> key=24 : 18.67
>> key=32 : 9.79
>> key=8 : 8.28
>> key=0 : 6.60
>> key=40 : 4.21
>> key=64 : 3.15
>> key=48 : 2.24
>> key=56 : 2.15
>> key=80 : 1.45
>>
>> So for GCC:
>>
>>   - 80% of total qsort usage are done with 10 elements of less.
>>   - All usages are done element size of maximum of 56 bytes.
>>   - 90% of calls are done with array of maximum size of 80 bytes or
>>     less.
>>
>> (GCC on recent version now uses a internal qsort implementation instead
>> of the libc one).
>>
>> The Firefox usage was done with 2 hours of usage, with first 10 minutes
>> activelly openning and closing different types of sites.  It resulted in
>> 21042 calls with following characteristics:
>>
>> Key: number of elements:
>> key=7 : 24.40
>> key=1 : 10.44
>> key=3 : 6.33
>> key=4 : 5.81
>> key=2 : 5.46
>> key=6 : 4.80
>> key=17 : 4.54
>> key=0 : 3.07
>> key=5 : 3.05
>> key=9 : 2.51
>> key=12 : 2.06
>>
>> Key: element size in bytes:
>> key=8 : 94.49
>> key=28 : 4.40
>> key=2 : 0.70
>> key=16 : 0.19
>> key=36 : 0.07
>> key=12 : 0.07
>> key=40 : 0.07
>> key=24 : 0.03
>>
>> Key: total size (number of elements * element size):
>> key=56 : 24.20
>> key=8 : 10.27
>> key=24 : 6.36
>> key=32 : 5.86
>> key=16 : 5.46
>> key=48 : 4.80
>> key=476 : 3.75
>> key=0 : 3.07
>> key=40 : 3.05
>> key=72 : 2.50
>>
>> So for Firefox:
>>
>>   - 72% of total qsort usage are done with 18 elements of less.
>>   - All usages are done element size of maximum of 40 bytes.
>>   - 70% of calls are done with array of maximum size of 476 bytes or
>>     less.
>>
>> For 2. I used the idea of a machine with 3 levels of cache with sizes
>> L1 32kb, L2 256kb, and L3 4096Kb.
>>
>> It resulted in a benchmark with following traits:
>>
>>   * It checks four types of input arrays: sorted, mostly sorted, random,
>>     repeated, and bitonic:
>>
>>     - 'sorted': the array is already sorted.
>>     - 'mostly sorted': the array will have a certain number of random
>>        position with random values (current ratio used is 20%).
>>     - 'random' the array will contain random elements.
>>     - 'repeated' the array will have random elements with a certain
>>        number (current ratio is 20%) of a repeated element distributed
>>        randomly.
>>     - 'bitonic': elements will be strictly increasing to up middle of the
>>        array then strictly decreasing.  This is to trigger the
>>        pattern-defeting input for the median of 3 pivot selection used
>>        in quicksort implementation.
>>
>>   * Three elements sizes are checked: uint32_t, uint64_t, and an
>>     element with 32 bytes (but using the uint32_t comparison checks).
>>     These element sizes are used to 1. to avoid include the comparison
>>     function itself and/or memory copy in sort benchmark itself, and 2.
>>     because key of size_t are the most used for both GCC and Firefox.
>>
>>   * Five different element numbers: 64 (which cover mostly of used
>>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
>>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
>>     --nelem option.
>>
>> Checked on x86_64-linux-gnu
>> ---
>>  benchtests/Makefile      |   2 +-
>>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 196 insertions(+), 1 deletion(-)
>>  create mode 100644 benchtests/bench-qsort.c
>>
>> diff --git a/benchtests/Makefile b/benchtests/Makefile
>> index 1530939a8c..781cbca8af 100644
>> --- a/benchtests/Makefile
>> +++ b/benchtests/Makefile
>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>>  include ../gen-locales.mk
>>  endif
>>
>> -stdlib-benchset := strtod
>> +stdlib-benchset := strtod qsort
>>
>>  stdio-common-benchset := sprintf
>>
>> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
>> new file mode 100644
>> index 0000000000..905aa6d7b6
>> --- /dev/null
>> +++ b/benchtests/bench-qsort.c
>> @@ -0,0 +1,195 @@
>> +/* Measure qsort function.
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include <array_length.h>
>> +#include <bench-timing.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <json-lib.h>
>> +#include <stdio.h>
>> +#include <unistd.h>
>> +
>> +#include <stdlib/tst-qsort-common.c>
>> +
>> +/* Number of elements of determined type to be measured.  */
>> +static const size_t default_nelems[] =
>> +{
>> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
>> +                              on GCC build.  */
>> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
>> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
>> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
>> +};
>> +
>> +#define OPT_NELEM 10000
>> +#define OPT_SEED  10001
>> +#define CMDLINE_OPTIONS \
>> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
>> +  { "seed", required_argument, NULL, OPT_SEED },
>> +
>> +static const size_t max_nelem = 16;
>> +static size_t *elems = NULL;
>> +static size_t nelem = 0;
>> +static uint64_t seed = 0;
>> +static bool seed_set = false;
>> +
>> +static void
>> +cmdline_process_function (int c)
>> +{
>> +  switch (c)
>> +    {
>> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
>> +        The elements sizes as passed with a ':' as the delimiter, for
>> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
>> +      case OPT_NELEM:
>> +        {
>> +         elems = xmalloc (max_nelem * sizeof (size_t));
>> +         nelem = 0;
>> +
>> +         char *saveptr;
>> +         char *token;
>> +         token = strtok_r (optarg, ":", &saveptr);
>> +         if (token == NULL)
>> +           {
>> +             printf ("error: invalid --nelem value\n");
>> +             exit (EXIT_FAILURE);
>> +           }
>> +         do
>> +           {
>> +             if (nelem == max_nelem)
>> +               {
>> +                 printf ("error: invalid --nelem value (max elem)\n");
>> +                 exit (EXIT_FAILURE);
>> +               }
>> +             elems[nelem++] = atol (token);
>> +             token = strtok_r (saveptr, ":", &saveptr);
>> +           } while (token != NULL);
>> +        }
>> +      break;
>> +
>> +      /* handle the --seed option to use a different seed than a random one.
>> +        The SEED used should be a uint64_t number.  */
>> +      case OPT_SEED:
>> +       {
>> +         uint64_t value = strtoull (optarg, NULL, 0);
>> +         if (errno == ERANGE || value > UINT64_MAX)
>> +           {
>> +             printf ("error: seed should be a value in range of "
>> +                     "[0, UINT64_MAX]\n");
>> +             exit (EXIT_FAILURE);
>> +           }
>> +         seed = value;
>> +         seed_set = true;
>> +       }
>> +    }
>> +}
>> +
>> +#define CMDLINE_PROCESS cmdline_process_function
>> +
>> +static int
>> +do_test (void)
>> +{
>> +  random_init ();
>> +
>> +
>> +  json_ctx_t json_ctx;
>> +
>> +  json_init (&json_ctx, 0, stdout);
>> +
>> +  json_document_begin (&json_ctx);
>> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
>> +
>> +  json_attr_object_begin (&json_ctx, "functions");
>> +  json_attr_object_begin (&json_ctx, "qsort");
>> +  json_attr_uint (&json_ctx, "seed", seed);
>> +
>> +  json_array_begin (&json_ctx, "results");
>> +
>> +  const size_t *welem = elems == NULL ? default_nelems : elems;
>> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
>> +
>> +  struct test_t
>> +  {
>> +    size_t type_size;
>> +    cmpfunc_t cmpfunc;
>> +  };
>> +  static const struct test_t tests[] =
>> +  {
>> +    { sizeof (uint32_t), uint32_t_cmp },
>> +    { sizeof (uint64_t), uint64_t_cmp },
>> +    /* Test swap with large elements.  */
>> +    { 32,                uint32_t_cmp },
>> +  };
>> +
>> +  const size_t inner_loop_iters = 16;
>> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
>> +    {
>> +      for (const struct array_t *arraytype = arraytypes;
>> +          arraytype < array_end (arraytypes);
>> +          ++arraytype)
>> +       {
>> +         for (int k = 0; k < wnelem; k++)
>> +           {
>> +             json_element_object_begin (&json_ctx);
>> +
>> +             size_t arraysize = welem[k] * test->type_size;
>> +
>> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
>> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
>> +             json_attr_string (&json_ctx, "property", arraytype->name);
>> +
>> +             void *base = create_array (welem[k], test->type_size,
>> +                                        arraytype->type);
>> +             void *work = xmalloc (arraysize);
>> +
>> +             timing_t total = 0;
>> +
>> +             for (int n = 0; n < inner_loop_iters; n++)
>> +               {
>> +                 memcpy (work, base, arraysize);
>> +
>> +                 timing_t start, end, diff;
>> +                 TIMING_NOW (start);
>> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
>> +                 TIMING_NOW (end);
>> +
>> +                 TIMING_DIFF (diff, start, end);
>> +                 TIMING_ACCUM (total, diff);
>> +               }
>> +
> To avoid overhead from timing. Maybe you could allocate / initialize
> multiple buffers to test a given sort on. Then instead of having to
> re-memcpy / time you just loop through the N initialized buffers
> with one timing call. For the smaller sorts this may help decrease
> the timing overhead.

I think this make sense, although it would require more memory. I changed
to:

---
  for (const struct test_t *test = tests; test < array_end (tests); ++test)
    {
      for (size_t at = 0; at < array_length (arraytypes); at++)
        {
          for (int ne = 0; ne < wnelem; ne++)
            {
              void *inputs[INNER_LOOP_ITERS];
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                inputs[i] = create_array (welem[ne], test->type_size,
                                          arraytypes[at].type);

              timing_t start, end, diff;
              TIMING_NOW (start);
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
              TIMING_NOW (end);

              TIMING_DIFF (diff, start, end);

              json_element_object_begin (&json_ctx);
              json_attr_uint (&json_ctx, "nmemb", welem[ne]);
              json_attr_uint (&json_ctx, "type_size", test->type_size);
              json_attr_string (&json_ctx, "property", arraytypes[at].name);
              uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
              json_attr_uint (&json_ctx, "timings", timing);
              json_element_object_end (&json_ctx);

              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                free (inputs[i]);
            }
        }
    }
---