From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS17314 8.43.84.0/22 X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, PDS_RDNS_DYNAMIC_FP,RCVD_IN_DNSWL_HI,RDNS_DYNAMIC,SPF_HELO_PASS, SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id DA03F1F670 for ; Wed, 13 Oct 2021 03:20:01 +0000 (UTC) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id B2BC73858C3A for ; Wed, 13 Oct 2021 03:20:00 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org B2BC73858C3A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1634095200; bh=XD1Pdou+iVB3qMlRgTglYfIrAMpP0Lc9bnEDBJolt1o=; h=References:In-Reply-To:Date:Subject:To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=I/wJt4G9oJMsYC4hVagKnU6+duelnrVwB3k/8A26eaPmUlrG20Pbfg9InIszKiLn6 mQeZLCsaqogbk15io1VpnJRlrsZoqu6Dd2FFokjXQEeX1eKav50l0h9UaYp9scVzIP zEIrht7tb1QYwSjSU0ShgI5HObrHo/lNoxZt7gpg= Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by sourceware.org (Postfix) with ESMTPS id B7EAB3858C27 for ; Wed, 13 Oct 2021 03:19:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B7EAB3858C27 Received: by mail-pf1-x434.google.com with SMTP id q19so1220774pfl.4 for ; Tue, 12 Oct 2021 20:19:36 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=XD1Pdou+iVB3qMlRgTglYfIrAMpP0Lc9bnEDBJolt1o=; b=IojlH765RQuGHjXk7/zQRnDAHOcd/wD9Jj2bGAuG1quVFW5KjELgYufCmNZ/0qyVaP t0Bfo7dQN1lOxnwz0HOBwRofg4stX9Bvt2/1xl8U/Tb+ZzAktJaM1nnLUQE63CNR3yLY 63I0GEgrRXq+trTcIT7DjlHTgND/4XUMjWIbBRXyAGqkGI7V6nSs9HmzFTnVZfX30yox sczjIa2VVBNFuGUvHfWjHvWitp5cG0d/zFoHFkjPjzGlIYbsfrewm3mgEAJK8XOSkmol g4erGKI6+8QCGQBRlt6GhbkyLsKwDDrYeA+CqVk4lXHMrdFjZp4FOJLlFennC2XK6LLe SMpg== X-Gm-Message-State: AOAM532WS2nWteQSlq1dpLzdTNizwpBWLjCgO7zTQrWt/8hCGyhk5W7c HkwonkDiF3XFPwH0+GkVTmpP50AnRmCfHIqj5hU= X-Google-Smtp-Source: ABdhPJyIOLOHyUR2jqqzvI6XPocRxBvk4UEbGnhpLd7DzC4yG9QnEgG8HQKJeWdxf+JzabL8xVggtbVQA30ze1YkcWw= X-Received: by 2002:a63:5b09:: with SMTP id p9mr25593219pgb.321.1634095175155; Tue, 12 Oct 2021 20:19:35 -0700 (PDT) MIME-Version: 1.0 References: <20210903171144.952737-1-adhemerval.zanella@linaro.org> <20210903171144.952737-2-adhemerval.zanella@linaro.org> In-Reply-To: <20210903171144.952737-2-adhemerval.zanella@linaro.org> Date: Tue, 12 Oct 2021 23:19:09 -0400 Message-ID: Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort To: Adhemerval Zanella Content-Type: text/plain; charset="UTF-8" X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Noah Goldstein via Libc-alpha Reply-To: Noah Goldstein Cc: GNU C Library Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org Sender: "Libc-alpha" On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha wrote: > > This patch adds a qsort benchmark. I tried to model the benchmark > taking in consideration the possible input variation in both internal > element size, element numbers, and internal state for 1. real word cases > and 2. possible scenarios based on hardware characteristics. > > For 1. I tracked qsort usage (using a simple preload library to dump its > usage and a script to pos-process it) on both GCC bootstrap and Firefox. > GCC 8 bootstrap build shows 51786641 call to qsort with the following > characterics: > > Key: number of elements: > key=2 : 39.74 > key=3 : 19.23 > key=4 : 9.77 > key=1 : 8.44 > key=0 : 6.60 > key=5 : 4.40 > key=7 : 2.37 > key=6 : 2.25 > key=9 : 1.48 > key=8 : 0.97 > > Key: element size in bytes: > key=8 : 91.74 > key=32 : 3.62 > key=4 : 2.42 > key=40 : 1.20 > key=16 : 0.67 > key=24 : 0.30 > key=48 : 0.05 > key=56 : 0.00 > key=1 : 0.00 > > Key: total size (number of elements * element size): > key=16 : 35.98 > key=24 : 18.67 > key=32 : 9.79 > key=8 : 8.28 > key=0 : 6.60 > key=40 : 4.21 > key=64 : 3.15 > key=48 : 2.24 > key=56 : 2.15 > key=80 : 1.45 > > So for GCC: > > - 80% of total qsort usage are done with 10 elements of less. > - All usages are done element size of maximum of 56 bytes. > - 90% of calls are done with array of maximum size of 80 bytes or > less. > > (GCC on recent version now uses a internal qsort implementation instead > of the libc one). > > The Firefox usage was done with 2 hours of usage, with first 10 minutes > activelly openning and closing different types of sites. It resulted in > 21042 calls with following characteristics: > > Key: number of elements: > key=7 : 24.40 > key=1 : 10.44 > key=3 : 6.33 > key=4 : 5.81 > key=2 : 5.46 > key=6 : 4.80 > key=17 : 4.54 > key=0 : 3.07 > key=5 : 3.05 > key=9 : 2.51 > key=12 : 2.06 > > Key: element size in bytes: > key=8 : 94.49 > key=28 : 4.40 > key=2 : 0.70 > key=16 : 0.19 > key=36 : 0.07 > key=12 : 0.07 > key=40 : 0.07 > key=24 : 0.03 > > Key: total size (number of elements * element size): > key=56 : 24.20 > key=8 : 10.27 > key=24 : 6.36 > key=32 : 5.86 > key=16 : 5.46 > key=48 : 4.80 > key=476 : 3.75 > key=0 : 3.07 > key=40 : 3.05 > key=72 : 2.50 > > So for Firefox: > > - 72% of total qsort usage are done with 18 elements of less. > - All usages are done element size of maximum of 40 bytes. > - 70% of calls are done with array of maximum size of 476 bytes or > less. > > For 2. I used the idea of a machine with 3 levels of cache with sizes > L1 32kb, L2 256kb, and L3 4096Kb. > > It resulted in a benchmark with following traits: > > * It checks four types of input arrays: sorted, mostly sorted, random, > repeated, and bitonic: > > - 'sorted': the array is already sorted. > - 'mostly sorted': the array will have a certain number of random > position with random values (current ratio used is 20%). > - 'random' the array will contain random elements. > - 'repeated' the array will have random elements with a certain > number (current ratio is 20%) of a repeated element distributed > randomly. > - 'bitonic': elements will be strictly increasing to up middle of the > array then strictly decreasing. This is to trigger the > pattern-defeting input for the median of 3 pivot selection used > in quicksort implementation. > > * Three elements sizes are checked: uint32_t, uint64_t, and an > element with 32 bytes (but using the uint32_t comparison checks). > These element sizes are used to 1. to avoid include the comparison > function itself and/or memory copy in sort benchmark itself, and 2. > because key of size_t are the most used for both GCC and Firefox. > > * Five different element numbers: 64 (which cover mostly of used > elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB > of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and > 24288/1048576 (L3 with 4 MB). The sizes are configurable by > --nelem option. > > Checked on x86_64-linux-gnu > --- > benchtests/Makefile | 2 +- > benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 196 insertions(+), 1 deletion(-) > create mode 100644 benchtests/bench-qsort.c > > diff --git a/benchtests/Makefile b/benchtests/Makefile > index 1530939a8c..781cbca8af 100644 > --- a/benchtests/Makefile > +++ b/benchtests/Makefile > @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \ > include ../gen-locales.mk > endif > > -stdlib-benchset := strtod > +stdlib-benchset := strtod qsort > > stdio-common-benchset := sprintf > > diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c > new file mode 100644 > index 0000000000..905aa6d7b6 > --- /dev/null > +++ b/benchtests/bench-qsort.c > @@ -0,0 +1,195 @@ > +/* Measure qsort function. > + Copyright (C) 2018 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + . */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > + > +/* Number of elements of determined type to be measured. */ > +static const size_t default_nelems[] = > +{ > + 256/sizeof(size_t), /* 64/128 which covers mostly used element number > + on GCC build. */ > + 32768/sizeof(size_t), /* 4096/8192 to fit on a L1 with 32 KB. */ > + 262144/sizeof(size_t), /* 32768/65536 to fit on a L2 with 256 KB. */ > + 4194304/sizeof(size_t), /* 524288/1048576 to fix on a L3 with 4 MB. */ > +}; > + > +#define OPT_NELEM 10000 > +#define OPT_SEED 10001 > +#define CMDLINE_OPTIONS \ > + { "nelem", required_argument, NULL, OPT_NELEM }, \ > + { "seed", required_argument, NULL, OPT_SEED }, > + > +static const size_t max_nelem = 16; > +static size_t *elems = NULL; > +static size_t nelem = 0; > +static uint64_t seed = 0; > +static bool seed_set = false; > + > +static void > +cmdline_process_function (int c) > +{ > + switch (c) > + { > + /* Handle the --nelem option to run different sizes than DEFAULT_ELEM. > + The elements sizes as passed with a ':' as the delimiter, for > + instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements. */ > + case OPT_NELEM: > + { > + elems = xmalloc (max_nelem * sizeof (size_t)); > + nelem = 0; > + > + char *saveptr; > + char *token; > + token = strtok_r (optarg, ":", &saveptr); > + if (token == NULL) > + { > + printf ("error: invalid --nelem value\n"); > + exit (EXIT_FAILURE); > + } > + do > + { > + if (nelem == max_nelem) > + { > + printf ("error: invalid --nelem value (max elem)\n"); > + exit (EXIT_FAILURE); > + } > + elems[nelem++] = atol (token); > + token = strtok_r (saveptr, ":", &saveptr); > + } while (token != NULL); > + } > + break; > + > + /* handle the --seed option to use a different seed than a random one. > + The SEED used should be a uint64_t number. */ > + case OPT_SEED: > + { > + uint64_t value = strtoull (optarg, NULL, 0); > + if (errno == ERANGE || value > UINT64_MAX) > + { > + printf ("error: seed should be a value in range of " > + "[0, UINT64_MAX]\n"); > + exit (EXIT_FAILURE); > + } > + seed = value; > + seed_set = true; > + } > + } > +} > + > +#define CMDLINE_PROCESS cmdline_process_function > + > +static int > +do_test (void) > +{ > + random_init (); > + > + > + json_ctx_t json_ctx; > + > + json_init (&json_ctx, 0, stdout); > + > + json_document_begin (&json_ctx); > + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); > + > + json_attr_object_begin (&json_ctx, "functions"); > + json_attr_object_begin (&json_ctx, "qsort"); > + json_attr_uint (&json_ctx, "seed", seed); > + > + json_array_begin (&json_ctx, "results"); > + > + const size_t *welem = elems == NULL ? default_nelems : elems; > + const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem; > + > + struct test_t > + { > + size_t type_size; > + cmpfunc_t cmpfunc; > + }; > + static const struct test_t tests[] = > + { > + { sizeof (uint32_t), uint32_t_cmp }, > + { sizeof (uint64_t), uint64_t_cmp }, > + /* Test swap with large elements. */ > + { 32, uint32_t_cmp }, > + }; > + > + const size_t inner_loop_iters = 16; > + for (const struct test_t *test = tests; test < array_end (tests); ++test) > + { > + for (const struct array_t *arraytype = arraytypes; > + arraytype < array_end (arraytypes); > + ++arraytype) > + { > + for (int k = 0; k < wnelem; k++) > + { > + json_element_object_begin (&json_ctx); > + > + size_t arraysize = welem[k] * test->type_size; > + > + json_attr_uint (&json_ctx, "nmemb", welem[k]); > + json_attr_uint (&json_ctx, "type_size", test->type_size); > + json_attr_string (&json_ctx, "property", arraytype->name); > + > + void *base = create_array (welem[k], test->type_size, > + arraytype->type); > + void *work = xmalloc (arraysize); > + > + timing_t total = 0; > + > + for (int n = 0; n < inner_loop_iters; n++) > + { > + memcpy (work, base, arraysize); > + > + timing_t start, end, diff; > + TIMING_NOW (start); > + qsort (work, welem[k], test->type_size, test->cmpfunc); > + TIMING_NOW (end); > + > + TIMING_DIFF (diff, start, end); > + TIMING_ACCUM (total, diff); > + } > + To avoid overhead from timing. Maybe you could allocate / initialize multiple buffers to test a given sort on. Then instead of having to re-memcpy / time you just loop through the N initialized buffers with one timing call. For the smaller sorts this may help decrease the timing overhead. > + json_attr_uint (&json_ctx, "timings", > + (double) total / (double) inner_loop_iters); > + json_element_object_end (&json_ctx); > + > + free (base); > + free (work); > + } > + } > + } > + > + json_array_end (&json_ctx); > + > + json_attr_object_end (&json_ctx); > + json_attr_object_end (&json_ctx); > + json_document_end (&json_ctx); > + > + return 0; > +} > + > +#define TIMEOUT 600 > +#include > -- > 2.30.2 >