From: Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Noah Goldstein
Date: Tue, 12 Oct 2021 23:29:40 -0400
Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
To: Adhemerval Zanella
Cc: GNU C Library
In-Reply-To: <20210903171144.952737-4-adhemerval.zanella@linaro.org>
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-4-adhemerval.zanella@linaro.org>
List-Id: Libc-alpha mailing list
On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:

> The optimization takes into consideration that the most common element
> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
> word boundary.  This is similar to the optimization done in lib/sort.c
> from Linux.
>
> This patch adds an optimized swap operation to qsort, based on the
> previous msort one.  Instead of a byte-wise operation, three variants
> are provided:
>
>   1. Using uint32_t loads and stores.
>   2. Using uint64_t loads and stores.
>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>
> Options 1 and 2 are selected only if the architecture defines
> _STRING_ARCH_unaligned or if the base pointer is aligned to the
> required type.
>
> It also fixes BZ#19305 by checking the number of elements against 1
> in addition to 0.
>
> Checked on x86_64-linux-gnu.
>
> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
> ---
>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 91 insertions(+), 18 deletions(-)
>
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index 23f2d28314..59458d151b 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -24,20 +24,85 @@
>  #include <limits.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <stdbool.h>
>
> -/* Byte-wise swap two items of size SIZE. */
> -#define SWAP(a, b, size)                  \
> -  do                                      \
> -    {                                     \
> -      size_t __size = (size);             \
> -      char *__a = (a), *__b = (b);        \
> -      do                                  \
> -        {                                 \
> -          char __tmp = *__a;              \
> -          *__a++ = *__b;                  \
> -          *__b++ = __tmp;                 \
> -        } while (--__size > 0);           \
> -    } while (0)
> +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
> +   along with the generic one as an optimization.  */
> +
> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
> +
> +/* Return true if elements can be copied using word loads and stores.
> +   Both the size and the base address must be a multiple of the
> +   alignment.  */
> +static inline bool
> +is_aligned_to_copy (const void *base, size_t size, size_t align)
> +{
> +  unsigned char lsbits = size;
> +#if !_STRING_ARCH_unaligned
> +  lsbits |= (unsigned char)(uintptr_t) base;
> +#endif
> +  return (lsbits & (align - 1)) == 0;
> +}
> +
> +#define SWAP_WORDS_64 (swap_func_t)0
> +#define SWAP_WORDS_32 (swap_func_t)1
> +#define SWAP_BYTES (swap_func_t)2
> +
> +static void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +    {
> +      n -= 8;
> +      uint64_t t = *(uint64_t *)(a + n);
> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> +      *(uint64_t *)(b + n) = t;
> +    } while (n);
> +}
> +
> +static void
> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +    {
> +      n -= 4;
> +      uint32_t t = *(uint32_t *)(a + n);
> +      *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
> +      *(uint32_t *)(b + n) = t;
> +    } while (n);
> +}
> +
> +static void
> +swap_bytes (void * restrict a, void * restrict b, size_t n)
> +{
> +  /* Use multiple small memcpys with constant size to enable inlining
> +     on most targets.  */
> +  enum { SWAP_GENERIC_SIZE = 32 };
> +  unsigned char tmp[SWAP_GENERIC_SIZE];
> +  while (n > SWAP_GENERIC_SIZE)
> +    {
> +      memcpy (tmp, a, SWAP_GENERIC_SIZE);
> +      a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      n -= SWAP_GENERIC_SIZE;
> +    }
> +  memcpy (tmp, a, n);
> +  memcpy (a, b, n);
> +  memcpy (b, tmp, n);
> +}
> +
> +/* Replace the indirect call with a series of if statements.  It should
> +   help the branch predictor.  */

1) Really? On Intel at least an indirect call that is always going to the
same place is certainly going to be predicted as well as, if not better
than, 2/3 branches + a direct call.

2) If you're going to just test which swap function to use, why bother
initializing swap_func? Why not just use an int?
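Something along these lines is what I have in mind; a rough sketch only,
untested, and the swap_kind names are mine, not anything in the patch:

/* Hypothetical alternative: select the variant with a plain enum instead
   of casting small integers to a function-pointer type.  Relies on the
   swap_words_64 / swap_words_32 / swap_bytes helpers added above.  */
enum swap_kind { SWAP_KIND_BYTES, SWAP_KIND_WORDS_32, SWAP_KIND_WORDS_64 };

static void
do_swap (void * restrict a, void * restrict b, size_t size,
         enum swap_kind kind)
{
  /* Same two compares and branches the sentinel pointers compile down
     to, just without laundering an integer through swap_func_t.  */
  if (kind == SWAP_KIND_WORDS_64)
    swap_words_64 (a, b, size);
  else if (kind == SWAP_KIND_WORDS_32)
    swap_words_32 (a, b, size);
  else
    swap_bytes (a, b, size);
}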
> +static void
> +do_swap (void * restrict a, void * restrict b, size_t size,
> +         swap_func_t swap_func)
> +{
> +  if (swap_func == SWAP_WORDS_64)
> +    swap_words_64 (a, b, size);
> +  else if (swap_func == SWAP_WORDS_32)
> +    swap_words_32 (a, b, size);
> +  else
> +    swap_bytes (a, b, size);
> +}
>
>  /* Discontinue quicksort algorithm when partition gets below this size.
>     This particular magic number was chosen to work best on a Sun 4/260. */
> @@ -97,6 +162,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>      /* Avoid lossage with unsigned arithmetic below.  */
>      return;
>
> +  swap_func_t swap_func;
> +  if (is_aligned_to_copy (pbase, size, 8))
> +    swap_func = SWAP_WORDS_64;
> +  else if (is_aligned_to_copy (pbase, size, 4))
> +    swap_func = SWAP_WORDS_32;
> +  else
> +    swap_func = SWAP_BYTES;
> +
>    if (total_elems > MAX_THRESH)
>      {
>        char *lo = base_ptr;
> @@ -120,13 +193,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>            char *mid = lo + size * ((hi - lo) / size >> 1);
>
>            if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -            SWAP (mid, lo, size);
> +            do_swap (mid, lo, size, swap_func);
>            if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
> -            SWAP (mid, hi, size);
> +            do_swap (mid, hi, size, swap_func);
>            else
>              goto jump_over;
>            if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -            SWAP (mid, lo, size);
> +            do_swap (mid, lo, size, swap_func);
>          jump_over:;
>
>            left_ptr  = lo + size;
> @@ -145,7 +218,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>
>                if (left_ptr < right_ptr)
>                  {
> -                  SWAP (left_ptr, right_ptr, size);
> +                  do_swap (left_ptr, right_ptr, size, swap_func);
>                    if (mid == left_ptr)
>                      mid = right_ptr;
>                    else if (mid == right_ptr)
> @@ -217,7 +290,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>          tmp_ptr = run_ptr;
>
>      if (tmp_ptr != base_ptr)
> -      SWAP (tmp_ptr, base_ptr, size);
> +      do_swap (tmp_ptr, base_ptr, size, swap_func);
>
>    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>
> --
> 2.30.2
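As an aside, while reviewing I put together the toy program below to
sanity-check which variant the alignment test picks for a few element
sizes.  The check is copied from the patch's is_aligned_to_copy with the
_STRING_ARCH_unaligned conditional dropped (i.e. the strict-alignment
case); main() and its names are mine, purely illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Strict-alignment form of the patch's check: both the size and the
   base address must be multiples of ALIGN.  */
static bool
is_aligned_to_copy (const void *base, size_t size, size_t align)
{
  return (((uintptr_t) base | size) & (align - 1)) == 0;
}

int
main (void)
{
  static const struct { const char *name; size_t size; } cases[] = {
    { "uint64_t elements", 8 },
    { "uint32_t elements", 4 },
    { "3-byte elements", 3 },
  };

  for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
    {
      /* malloc returns memory suitably aligned for any fundamental
         type, so the choice here is driven by the element size.  */
      void *base = malloc (cases[i].size * 16);
      const char *variant = is_aligned_to_copy (base, cases[i].size, 8)
                            ? "swap_words_64"
                            : (is_aligned_to_copy (base, cases[i].size, 4)
                               ? "swap_words_32" : "swap_bytes");
      printf ("%-18s -> %s\n", cases[i].name, variant);
      free (base);
    }
  return 0;
}

That prints what I'd expect: 8-byte elements take the 64-bit path,
4-byte elements the 32-bit path, and odd sizes fall back to the
byte-wise loop.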