From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-4.5 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI,NICE_REPLY_A, RCVD_IN_DNSWL_MED,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 9A9001F670 for ; Fri, 15 Oct 2021 17:19:54 +0000 (UTC) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id CA6F13857C6C for ; Fri, 15 Oct 2021 17:19:53 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org CA6F13857C6C DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1634318393; bh=3nQ+xiUF0h14VTz0LTUGTPoamajyvu8qm6qUxc6VRgo=; h=Subject:To:References:Date:In-Reply-To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=xLRNgNZHrGZvMYwgPtMhwi1BzGfMWdJPSb2DitjUIbG2riCej6zIf1xIg8aJ/qKhu n3/cQHuidPrKvOQl0QkT/5d2CO8rkQiDtV1pjRrGvzCiV7hPCzhZZnthto3PLOe4I8 Jqyz4qtlwzEqZhUE5xxdpPABVEUE++PErK7YI+4s= Received: from mail-ua1-x92f.google.com (mail-ua1-x92f.google.com [IPv6:2607:f8b0:4864:20::92f]) by sourceware.org (Postfix) with ESMTPS id DE6563858C60 for ; Fri, 15 Oct 2021 17:19:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org DE6563858C60 Received: by mail-ua1-x92f.google.com with SMTP id u5so19327210uao.13 for ; Fri, 15 Oct 2021 10:19:32 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=3nQ+xiUF0h14VTz0LTUGTPoamajyvu8qm6qUxc6VRgo=; b=61cYRzbciKlOfqVdoT2kThCitVdpR6jv18uBJ2CWy2muQkim4yJALh6HDHtYjK+gDq bEw0nAtapmW7gtdhNQ87kSodB2M3x9EqaBOyCwTZvngtB40ZAh6ZgFZfqbu47wZPAogt yNigxvNXwcy3TkQQ1FbfTFmfT80q/jHAK+EFIXycdxCaevoILg6MVrK2StX5QDvq8cvD DqnXumSOsEjHKoCRxFjm3M6jK0jeiLXiEz6xXkkYyy5JpMD3VqV9aNH5rKy7HC9yWCPl IzN6KJaAX4iKbBz3mEvwvNZmubgg7AX8K8+nv1wz0bvdOtiI7CbAxdVwM9mb+yi5X6Xx AWnw== X-Gm-Message-State: AOAM532FVfzo+6yOqNoa2tYz8/dyjzJxf0q+J7Ks6cVgx/IgebrS6IkB xnkqDgkMlpPozR5LJDfodMELk+Dvmg1/kQ== X-Google-Smtp-Source: ABdhPJwtekwKzfd2PqQA3Cg+1F/4WNJq2C9k0CsaFg7dK1hjZhtCXxt//513aiXS9WIwXh1f1LGNyw== X-Received: by 2002:a05:6102:c53:: with SMTP id y19mr15374258vss.55.1634318371927; Fri, 15 Oct 2021 10:19:31 -0700 (PDT) Received: from ?IPv6:2804:431:c7ca:c6c7:f05e:9652:ab99:7fa2? ([2804:431:c7ca:c6c7:f05e:9652:ab99:7fa2]) by smtp.gmail.com with ESMTPSA id n200sm4173242vke.2.2021.10.15.10.19.30 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 15 Oct 2021 10:19:31 -0700 (PDT) Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort To: Noah Goldstein References: <20210903171144.952737-1-adhemerval.zanella@linaro.org> <20210903171144.952737-2-adhemerval.zanella@linaro.org> Message-ID: <90308963-e059-5e7d-257f-18f487f5f1a7@linaro.org> Date: Fri, 15 Oct 2021 14:19:29 -0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Adhemerval Zanella via Libc-alpha Reply-To: Adhemerval Zanella Cc: GNU C Library Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org Sender: "Libc-alpha" On 15/10/2021 13:39, Noah Goldstein wrote: > On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella > wrote: >> >> >> >> On 13/10/2021 00:19, Noah Goldstein wrote: >>> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha >>> wrote: >>>> >>>> This patch adds a qsort benchmark. I tried to model the benchmark >>>> taking in consideration the possible input variation in both internal >>>> element size, element numbers, and internal state for 1. real word cases >>>> and 2. possible scenarios based on hardware characteristics. >>>> >>>> For 1. I tracked qsort usage (using a simple preload library to dump its >>>> usage and a script to pos-process it) on both GCC bootstrap and Firefox. >>>> GCC 8 bootstrap build shows 51786641 call to qsort with the following >>>> characterics: >>>> >>>> Key: number of elements: >>>> key=2 : 39.74 >>>> key=3 : 19.23 >>>> key=4 : 9.77 >>>> key=1 : 8.44 >>>> key=0 : 6.60 >>>> key=5 : 4.40 >>>> key=7 : 2.37 >>>> key=6 : 2.25 >>>> key=9 : 1.48 >>>> key=8 : 0.97 >>>> >>>> Key: element size in bytes: >>>> key=8 : 91.74 >>>> key=32 : 3.62 >>>> key=4 : 2.42 >>>> key=40 : 1.20 >>>> key=16 : 0.67 >>>> key=24 : 0.30 >>>> key=48 : 0.05 >>>> key=56 : 0.00 >>>> key=1 : 0.00 >>>> >>>> Key: total size (number of elements * element size): >>>> key=16 : 35.98 >>>> key=24 : 18.67 >>>> key=32 : 9.79 >>>> key=8 : 8.28 >>>> key=0 : 6.60 >>>> key=40 : 4.21 >>>> key=64 : 3.15 >>>> key=48 : 2.24 >>>> key=56 : 2.15 >>>> key=80 : 1.45 >>>> >>>> So for GCC: >>>> >>>> - 80% of total qsort usage are done with 10 elements of less. >>>> - All usages are done element size of maximum of 56 bytes. >>>> - 90% of calls are done with array of maximum size of 80 bytes or >>>> less. >>>> >>>> (GCC on recent version now uses a internal qsort implementation instead >>>> of the libc one). >>>> >>>> The Firefox usage was done with 2 hours of usage, with first 10 minutes >>>> activelly openning and closing different types of sites. It resulted in >>>> 21042 calls with following characteristics: >>>> >>>> Key: number of elements: >>>> key=7 : 24.40 >>>> key=1 : 10.44 >>>> key=3 : 6.33 >>>> key=4 : 5.81 >>>> key=2 : 5.46 >>>> key=6 : 4.80 >>>> key=17 : 4.54 >>>> key=0 : 3.07 >>>> key=5 : 3.05 >>>> key=9 : 2.51 >>>> key=12 : 2.06 >>>> >>>> Key: element size in bytes: >>>> key=8 : 94.49 >>>> key=28 : 4.40 >>>> key=2 : 0.70 >>>> key=16 : 0.19 >>>> key=36 : 0.07 >>>> key=12 : 0.07 >>>> key=40 : 0.07 >>>> key=24 : 0.03 >>>> >>>> Key: total size (number of elements * element size): >>>> key=56 : 24.20 >>>> key=8 : 10.27 >>>> key=24 : 6.36 >>>> key=32 : 5.86 >>>> key=16 : 5.46 >>>> key=48 : 4.80 >>>> key=476 : 3.75 >>>> key=0 : 3.07 >>>> key=40 : 3.05 >>>> key=72 : 2.50 >>>> >>>> So for Firefox: >>>> >>>> - 72% of total qsort usage are done with 18 elements of less. >>>> - All usages are done element size of maximum of 40 bytes. >>>> - 70% of calls are done with array of maximum size of 476 bytes or >>>> less. >>>> >>>> For 2. I used the idea of a machine with 3 levels of cache with sizes >>>> L1 32kb, L2 256kb, and L3 4096Kb. >>>> >>>> It resulted in a benchmark with following traits: >>>> >>>> * It checks four types of input arrays: sorted, mostly sorted, random, >>>> repeated, and bitonic: >>>> >>>> - 'sorted': the array is already sorted. >>>> - 'mostly sorted': the array will have a certain number of random >>>> position with random values (current ratio used is 20%). >>>> - 'random' the array will contain random elements. >>>> - 'repeated' the array will have random elements with a certain >>>> number (current ratio is 20%) of a repeated element distributed >>>> randomly. >>>> - 'bitonic': elements will be strictly increasing to up middle of the >>>> array then strictly decreasing. This is to trigger the >>>> pattern-defeting input for the median of 3 pivot selection used >>>> in quicksort implementation. >>>> >>>> * Three elements sizes are checked: uint32_t, uint64_t, and an >>>> element with 32 bytes (but using the uint32_t comparison checks). >>>> These element sizes are used to 1. to avoid include the comparison >>>> function itself and/or memory copy in sort benchmark itself, and 2. >>>> because key of size_t are the most used for both GCC and Firefox. >>>> >>>> * Five different element numbers: 64 (which cover mostly of used >>>> elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB >>>> of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and >>>> 24288/1048576 (L3 with 4 MB). The sizes are configurable by >>>> --nelem option. >>>> >>>> Checked on x86_64-linux-gnu >>>> --- >>>> benchtests/Makefile | 2 +- >>>> benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 196 insertions(+), 1 deletion(-) >>>> create mode 100644 benchtests/bench-qsort.c >>>> >>>> diff --git a/benchtests/Makefile b/benchtests/Makefile >>>> index 1530939a8c..781cbca8af 100644 >>>> --- a/benchtests/Makefile >>>> +++ b/benchtests/Makefile >>>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \ >>>> include ../gen-locales.mk >>>> endif >>>> >>>> -stdlib-benchset := strtod >>>> +stdlib-benchset := strtod qsort >>>> >>>> stdio-common-benchset := sprintf >>>> >>>> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c >>>> new file mode 100644 >>>> index 0000000000..905aa6d7b6 >>>> --- /dev/null >>>> +++ b/benchtests/bench-qsort.c >>>> @@ -0,0 +1,195 @@ >>>> +/* Measure qsort function. >>>> + Copyright (C) 2018 Free Software Foundation, Inc. >>>> + This file is part of the GNU C Library. >>>> + >>>> + The GNU C Library is free software; you can redistribute it and/or >>>> + modify it under the terms of the GNU Lesser General Public >>>> + License as published by the Free Software Foundation; either >>>> + version 2.1 of the License, or (at your option) any later version. >>>> + >>>> + The GNU C Library is distributed in the hope that it will be useful, >>>> + but WITHOUT ANY WARRANTY; without even the implied warranty of >>>> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >>>> + Lesser General Public License for more details. >>>> + >>>> + You should have received a copy of the GNU Lesser General Public >>>> + License along with the GNU C Library; if not, see >>>> + . */ >>>> + >>>> +#include >>>> +#include >>>> +#include >>>> +#include >>>> +#include >>>> +#include >>>> +#include >>>> + >>>> +#include >>>> + >>>> +/* Number of elements of determined type to be measured. */ >>>> +static const size_t default_nelems[] = >>>> +{ >>>> + 256/sizeof(size_t), /* 64/128 which covers mostly used element number >>>> + on GCC build. */ >>>> + 32768/sizeof(size_t), /* 4096/8192 to fit on a L1 with 32 KB. */ >>>> + 262144/sizeof(size_t), /* 32768/65536 to fit on a L2 with 256 KB. */ >>>> + 4194304/sizeof(size_t), /* 524288/1048576 to fix on a L3 with 4 MB. */ >>>> +}; >>>> + >>>> +#define OPT_NELEM 10000 >>>> +#define OPT_SEED 10001 >>>> +#define CMDLINE_OPTIONS \ >>>> + { "nelem", required_argument, NULL, OPT_NELEM }, \ >>>> + { "seed", required_argument, NULL, OPT_SEED }, >>>> + >>>> +static const size_t max_nelem = 16; >>>> +static size_t *elems = NULL; >>>> +static size_t nelem = 0; >>>> +static uint64_t seed = 0; >>>> +static bool seed_set = false; >>>> + >>>> +static void >>>> +cmdline_process_function (int c) >>>> +{ >>>> + switch (c) >>>> + { >>>> + /* Handle the --nelem option to run different sizes than DEFAULT_ELEM. >>>> + The elements sizes as passed with a ':' as the delimiter, for >>>> + instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements. */ >>>> + case OPT_NELEM: >>>> + { >>>> + elems = xmalloc (max_nelem * sizeof (size_t)); >>>> + nelem = 0; >>>> + >>>> + char *saveptr; >>>> + char *token; >>>> + token = strtok_r (optarg, ":", &saveptr); >>>> + if (token == NULL) >>>> + { >>>> + printf ("error: invalid --nelem value\n"); >>>> + exit (EXIT_FAILURE); >>>> + } >>>> + do >>>> + { >>>> + if (nelem == max_nelem) >>>> + { >>>> + printf ("error: invalid --nelem value (max elem)\n"); >>>> + exit (EXIT_FAILURE); >>>> + } >>>> + elems[nelem++] = atol (token); >>>> + token = strtok_r (saveptr, ":", &saveptr); >>>> + } while (token != NULL); >>>> + } >>>> + break; >>>> + >>>> + /* handle the --seed option to use a different seed than a random one. >>>> + The SEED used should be a uint64_t number. */ >>>> + case OPT_SEED: >>>> + { >>>> + uint64_t value = strtoull (optarg, NULL, 0); >>>> + if (errno == ERANGE || value > UINT64_MAX) >>>> + { >>>> + printf ("error: seed should be a value in range of " >>>> + "[0, UINT64_MAX]\n"); >>>> + exit (EXIT_FAILURE); >>>> + } >>>> + seed = value; >>>> + seed_set = true; >>>> + } >>>> + } >>>> +} >>>> + >>>> +#define CMDLINE_PROCESS cmdline_process_function >>>> + >>>> +static int >>>> +do_test (void) >>>> +{ >>>> + random_init (); >>>> + >>>> + >>>> + json_ctx_t json_ctx; >>>> + >>>> + json_init (&json_ctx, 0, stdout); >>>> + >>>> + json_document_begin (&json_ctx); >>>> + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); >>>> + >>>> + json_attr_object_begin (&json_ctx, "functions"); >>>> + json_attr_object_begin (&json_ctx, "qsort"); >>>> + json_attr_uint (&json_ctx, "seed", seed); >>>> + >>>> + json_array_begin (&json_ctx, "results"); >>>> + >>>> + const size_t *welem = elems == NULL ? default_nelems : elems; >>>> + const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem; >>>> + >>>> + struct test_t >>>> + { >>>> + size_t type_size; >>>> + cmpfunc_t cmpfunc; >>>> + }; >>>> + static const struct test_t tests[] = >>>> + { >>>> + { sizeof (uint32_t), uint32_t_cmp }, >>>> + { sizeof (uint64_t), uint64_t_cmp }, >>>> + /* Test swap with large elements. */ >>>> + { 32, uint32_t_cmp }, >>>> + }; >>>> + >>>> + const size_t inner_loop_iters = 16; >>>> + for (const struct test_t *test = tests; test < array_end (tests); ++test) >>>> + { >>>> + for (const struct array_t *arraytype = arraytypes; >>>> + arraytype < array_end (arraytypes); >>>> + ++arraytype) >>>> + { >>>> + for (int k = 0; k < wnelem; k++) >>>> + { >>>> + json_element_object_begin (&json_ctx); >>>> + >>>> + size_t arraysize = welem[k] * test->type_size; >>>> + >>>> + json_attr_uint (&json_ctx, "nmemb", welem[k]); >>>> + json_attr_uint (&json_ctx, "type_size", test->type_size); >>>> + json_attr_string (&json_ctx, "property", arraytype->name); >>>> + >>>> + void *base = create_array (welem[k], test->type_size, >>>> + arraytype->type); >>>> + void *work = xmalloc (arraysize); >>>> + >>>> + timing_t total = 0; >>>> + >>>> + for (int n = 0; n < inner_loop_iters; n++) >>>> + { >>>> + memcpy (work, base, arraysize); >>>> + >>>> + timing_t start, end, diff; >>>> + TIMING_NOW (start); >>>> + qsort (work, welem[k], test->type_size, test->cmpfunc); >>>> + TIMING_NOW (end); >>>> + >>>> + TIMING_DIFF (diff, start, end); >>>> + TIMING_ACCUM (total, diff); >>>> + } >>>> + >>> To avoid overhead from timing. Maybe you could allocate / initialize >>> multiple buffers to test a given sort on. Then instead of having to >>> re-memcpy / time you just loop through the N initialized buffers >>> with one timing call. For the smaller sorts this may help decrease >>> the timing overhead. >> >> I think this make sense, although it would require more memory. I changed >> to: >> >> --- >> for (const struct test_t *test = tests; test < array_end (tests); ++test) >> { >> for (size_t at = 0; at < array_length (arraytypes); at++) >> { >> for (int ne = 0; ne < wnelem; ne++) >> { >> void *inputs[INNER_LOOP_ITERS]; >> for (int i = 0; i < INNER_LOOP_ITERS; i++) >> inputs[i] = create_array (welem[ne], test->type_size, >> arraytypes[at].type); > > I think it may pay to use a 1-dimensional array. i.e > > size_t qsort_bytes = welem[ne] * test->type_size; > void * inputs = malloc(qsort_bytes * NTESTS); > // initialize [inputs, inputs + qsort_bytes] > for (int i = 1; i < NTESTS; ++i) > memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes) > > then increment inputs by 'qsort_bytes' in the test loop. > > NTESTS may be fine being INNER_LOOP_ITERS but > if we are concerned about using too much memory > we could add another loop through INNER_LOOP_ITERS / NTESTS > and accumulate the time diffs. Right, I added a fill_array() (which is similar to create_array, but do not allocate memory) I changed to: size_t qsort_size = welem[ne] * test->type_size; char *inputs = xmalloc (qsort_size * INNER_LOOP_ITERS); for (int i = 0; i < INNER_LOOP_ITERS; i++) fill_array (inputs + qsort_size * i, welem[ne], est->type_size, arraytypes[at].type);