Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
From: Adhemerval Zanella via Libc-alpha
To: Noah Goldstein
Cc: GNU C Library
Date: Fri, 15 Oct 2021 10:29:02 -0300
Message-ID: <788ad382-68eb-cf90-0bf7-681ed54177da@linaro.org>
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org> <20210903171144.952737-4-adhemerval.zanella@linaro.org>
On 13/10/2021 00:39, Noah Goldstein wrote:
> On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein wrote:
>> On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha wrote:
>>>
>>> The optimization takes into consideration that the most common element
>>> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
>>> word boundary.  This is similar to the optimization done in lib/sort.c
>>> from Linux.
>>>
>>> This patch adds an optimized swap operation to qsort based on the
>>> previous msort one.  Instead of a byte-wise operation, three variants
>>> are provided:
>>>
>>>   1. Using uint32_t loads and stores.
>>>   2. Using uint64_t loads and stores.
>>>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>>>
>>> Options 1. and 2. are selected only if the architecture defines
>>> _STRING_ARCH_unaligned or if the base pointer is aligned to the
>>> required type.
>>>
>>> It also fixes BZ#19305 by checking the input size against an element
>>> count of 1 besides 0.
>>>
>>> Checked on x86_64-linux-gnu.
>>>
>>> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
>>> ---
>>>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 91 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>> index 23f2d28314..59458d151b 100644
>>> --- a/stdlib/qsort.c
>>> +++ b/stdlib/qsort.c
>>> @@ -24,20 +24,85 @@
>>>  #include <limits.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <stdbool.h>
>>>
>>> -/* Byte-wise swap two items of size SIZE.  */
>>> -#define SWAP(a, b, size)						\
>>> -  do									\
>>> -    {									\
>>> -      size_t __size = (size);						\
>>> -      char *__a = (a), *__b = (b);					\
>>> -      do								\
>>> -	{								\
>>> -	  char __tmp = *__a;						\
>>> -	  *__a++ = *__b;						\
>>> -	  *__b++ = __tmp;						\
>>> -	} while (--__size > 0);						\
>>> -    } while (0)
>>> +/* Swap SIZE bytes between addresses A and B.  These helpers are
>>> +   provided along the generic one as an optimization.  */
>>> +
>>> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
>>> +
>>> +/* Return true if elements can be copied using word loads and stores.
>>> +   The size must be a multiple of the alignment, and the base address
>>> +   must be as well when unaligned access is not supported.  */
>>> +static inline bool
>>> +is_aligned_to_copy (const void *base, size_t size, size_t align)
>>> +{
>>> +  unsigned char lsbits = size;
>>> +#if !_STRING_ARCH_unaligned
>>> +  lsbits |= (unsigned char)(uintptr_t) base;
>>> +#endif
>>> +  return (lsbits & (align - 1)) == 0;
>>> +}
>>> +
>>> +#define SWAP_WORDS_64 (swap_func_t)0
>>> +#define SWAP_WORDS_32 (swap_func_t)1
>>> +#define SWAP_BYTES    (swap_func_t)2
>>> +
>>> +static void
>>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +    {
>>> +      n -= 8;
>>> +      uint64_t t = *(uint64_t *)(a + n);
>>> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>>> +      *(uint64_t *)(b + n) = t;
>>> +    } while (n);
>>> +}
>>> +
>>> +static void
>>> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +    {
>>> +      n -= 4;
>>> +      uint32_t t = *(uint32_t *)(a + n);
>>> +      *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>>> +      *(uint32_t *)(b + n) = t;
>>> +    } while (n);
>>> +}
>
> I'm not certain swap_words_32 / swap_words_8 will be optimal for larger
> key sizes.  Looking at GCC's implementation of swap_generic on modern
> x86_64: https://godbolt.org/z/638h3Y9va
> It's able to optimize the temporary buffer out of the loop and use xmm
> registers, which will likely win out for larger sizes.

It is probably not the most optimized code the compiler can generate,
since I tried to go for generic code that should give somewhat better
results in an architecture-agnostic way.  I was trying to mimic some of
the optimizations Linux did in 37d0ec34d111acfdb, and for swap_bytes I
used the same strategy as d3496c9f4f27d (Linux did not optimize anything
for the byte version).

I think it would be possible to tune it for a specific architecture
and/or compiler, but I would prefer to use a good-enough algorithm that
works reasonably well on multiple architectures.
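For reference, the quoted hunk stops before the patch's third variant, the
generic swap with a temporary buffer that the discussion above is about.
A minimal sketch of that shape (the function name, the 128-byte chunk size,
and the plain memcpy calls are my own choices here, not taken from the
patch) looks like this:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reconstruction of the generic byte-swap variant: swap N
   bytes between A and B through a small on-stack buffer, a chunk at a
   time.  Copying whole chunks with memcpy lets the compiler vectorize
   the inner copies, which is the effect Noah's godbolt link shows.  */

enum { SWAP_CHUNK = 128 };  /* assumed buffer size, not from the patch */

static void
swap_bytes (void *restrict a, void *restrict b, size_t n)
{
  unsigned char tmp[SWAP_CHUNK];
  while (n > 0)
    {
      size_t chunk = n > sizeof tmp ? sizeof tmp : n;
      memcpy (tmp, a, chunk);       /* tmp <- a */
      memcpy (a, b, chunk);         /* a   <- b */
      memcpy (b, tmp, chunk);       /* b   <- tmp */
      a = (unsigned char *) a + chunk;
      b = (unsigned char *) b + chunk;
      n -= chunk;
    }
}
```

The actual patch mentions mempcpy as well, so the real glibc code likely
differs in detail; the point is only that a chunked memcpy loop, unlike
the word-at-a-time helpers, keeps working for arbitrary element sizes.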