From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
From: Adhemerval Zanella via Libc-alpha
Reply-To: Adhemerval Zanella
To: Noah Goldstein
Cc: GNU C Library
Date: Fri, 15 Oct 2021 14:34:34 -0300
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-4-adhemerval.zanella@linaro.org>
 <788ad382-68eb-cf90-0bf7-681ed54177da@linaro.org>
List-Id: Libc-alpha mailing list

On 15/10/2021 14:17, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 8:29 AM Adhemerval Zanella wrote:
>>
>> On 13/10/2021 00:39, Noah Goldstein wrote:
>>>
>>> On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein wrote:
>>>
>>> On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha wrote:
>>>
>>> The optimization takes into consideration that the most common element
>>> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
>>> word boundary.  This is similar to the optimization done on lib/sort.c
>>> from Linux.
>>>
>>> This patch adds an optimized swap operation to qsort based on the
>>> previous msort one.  Instead of a byte-wise operation, three variants
>>> are provided:
>>>
>>>   1. Using uint32_t loads and stores.
>>>   2. Using uint64_t loads and stores.
>>>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>>>
>>> Options 1 and 2 are selected only if the architecture defines
>>> _STRING_ARCH_unaligned or if the base pointer is aligned to the
>>> required type.
>>>
>>> It also fixes BZ#19305 by checking the input size against an element
>>> count of 1 in addition to 0.
>>>
>>> Checked on x86_64-linux-gnu.
>>>
>>> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
>>> ---
>>>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 91 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>> index 23f2d28314..59458d151b 100644
>>> --- a/stdlib/qsort.c
>>> +++ b/stdlib/qsort.c
>>> @@ -24,20 +24,85 @@
>>>  #include <limits.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <stdbool.h>
>>>
>>> -/* Byte-wise swap two items of size SIZE.  */
>>> -#define SWAP(a, b, size)                                \
>>> -  do                                                    \
>>> -    {                                                   \
>>> -      size_t __size = (size);                           \
>>> -      char *__a = (a), *__b = (b);                      \
>>> -      do                                                \
>>> -        {                                               \
>>> -          char __tmp = *__a;                            \
>>> -          *__a++ = *__b;                                \
>>> -          *__b++ = __tmp;                               \
>>> -        } while (--__size > 0);                         \
>>> -    } while (0)
>>> +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
>>> +   alongside the generic one as an optimization.  */
>>> +
>>> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
>>> +
>>> +/* Return true if elements can be copied using word loads and stores.
>>> +   The size must be a multiple of the alignment, and the base address
>>> +   must be aligned as well when the architecture does not support
>>> +   unaligned access.  */
>>> +static inline bool
>>> +is_aligned_to_copy (const void *base, size_t size, size_t align)
>>> +{
>>> +  unsigned char lsbits = size;
>>> +#if !_STRING_ARCH_unaligned
>>> +  lsbits |= (unsigned char)(uintptr_t) base;
>>> +#endif
>>> +  return (lsbits & (align - 1)) == 0;
>>> +}
>>> +
>>> +#define SWAP_WORDS_64 (swap_func_t)0
>>> +#define SWAP_WORDS_32 (swap_func_t)1
>>> +#define SWAP_BYTES (swap_func_t)2
>>> +
>>> +static void
>>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +   {
>>> +     n -= 8;
>>> +     uint64_t t = *(uint64_t *)(a + n);
>>> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>>> +     *(uint64_t *)(b + n) = t;
>>> +   } while (n);
>>> +}
>>> +
>>> +static void
>>> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +   {
>>> +     n -= 4;
>>> +     uint32_t t = *(uint32_t *)(a + n);
>>> +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>>> +     *(uint32_t *)(b + n) = t;
>>> +   } while (n);
>>> +}
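(The hunk quoted above ends before the call sites, so the code that picks
among the three variants is not shown.  Going by the commit message, the
selection presumably looks something like the sketch below; the names
select_swap_func and do_swap are illustrative, not necessarily what the
patch actually uses, and swap_bytes is the generic fallback defined
elsewhere in the patch.)

  /* Sketch: pick the widest swap that the element size and the base
     alignment allow, per the commit message's selection rule.  */
  static swap_func_t
  select_swap_func (const void *base, size_t size)
  {
    if (is_aligned_to_copy (base, size, 8))
      return SWAP_WORDS_64;
    if (is_aligned_to_copy (base, size, 4))
      return SWAP_WORDS_32;
    return SWAP_BYTES;
  }

  /* Sketch: dispatch on the sentinel values defined in the hunk.  */
  static void
  do_swap (void * restrict a, void * restrict b, size_t size,
           swap_func_t swap_func)
  {
    if (swap_func == SWAP_WORDS_64)
      swap_words_64 (a, b, size);
    else if (swap_func == SWAP_WORDS_32)
      swap_words_32 (a, b, size);
    else
      swap_bytes (a, b, size);  /* generic fallback, not in the quoted hunk */
  }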
>>>
>>> I'm not certain swap_words_32 / swap_words_8 will be optimal for larger
>>> key sizes.  Looking at GCC's code generation for swap_generic on modern
>>> x86_64: https://godbolt.org/z/638h3Y9va
>>> It's able to optimize the temporary buffer out of the loop and use xmm
>>> registers, which will likely win out for larger sizes.
>>
>> It is probably not the most optimized code the compiler can generate,
>> since I tried to go for more generic code that should give somewhat
>> better results in an architecture-agnostic way.  I tried to mimic some
>> of the optimizations Linux did in 37d0ec34d111acfdb, and for swap_bytes
>> I used the same strategy as d3496c9f4f27d (Linux did not optimize
>> anything for the byte version).
>>
>> I think it would be possible to tune it for a specific architecture
>> and/or compiler, but I would prefer to use a good-enough algorithm that
>> works reasonably well on multiple architectures.
>
> That's fair.  Although I still think there are some improvements to be had.
>
> Looking at the assembly for all three, in fact, it seems GCC optimizes
> all of them to larger register copies: https://godbolt.org/z/bd9nnnoEY
>
> The biggest difference seems to be the setup / register spills for the
> generic version, so for the common case of a relatively small key the
> special cases for 4/8 make sense.
>
> Have you checked that GCC is able to use the conditions for selecting
> the swap function to optimize the functions themselves?  In the godbolt
> link I got reasonable value out of adding the invariants to
> swap_64/swap_32.

Not yet, but I think the generated code for both x86_64 and aarch64 seems
simple enough and should cover the common case (keys of size 4 or 8) fast
enough [1].  And since for v4 my plan is not to remove mergesort anymore
but to limit it to a static buffer, it might be possible to use the same
swap routines for both mergesort and quicksort.

> It also may be worth it to write a custom memcpy implementation for
> size = 0..SWAP_GENERIC_SIZE so it can be inlined (and probably be more
> optimized than what a generic memcpy can get away with).

I *really* do not want to go down this path.  My hope is that for small
sizes the compiler can inline the memcpy (which GCC does for the most
common architectures).

[1] https://godbolt.org/z/v7e4xxqGa
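(For readers following the memcpy inlining discussion above: a generic swap
along the lines of variant 3 in the commit message might look like the
sketch below.  SWAP_GENERIC_SIZE is the name used in the thread; the value
32, the function name swap_generic, and the byte-wise tail loop are
assumptions, not the patch's verbatim code.  mempcpy is a GNU extension,
hence _GNU_SOURCE when building outside glibc.)

  #define _GNU_SOURCE
  #include <string.h>

  enum { SWAP_GENERIC_SIZE = 32 };  /* assumed chunk size */

  /* Swap N bytes between A and B through a small fixed-size buffer, so
     each memcpy/mempcpy call has a constant size and can be inlined.  */
  static void
  swap_generic (void * restrict a, void * restrict b, size_t n)
  {
    unsigned char tmp[SWAP_GENERIC_SIZE];
    while (n >= SWAP_GENERIC_SIZE)
      {
        memcpy (tmp, a, SWAP_GENERIC_SIZE);
        a = mempcpy (a, b, SWAP_GENERIC_SIZE);
        b = mempcpy (b, tmp, SWAP_GENERIC_SIZE);
        n -= SWAP_GENERIC_SIZE;
      }
    /* Swap the remaining tail byte by byte.  */
    while (n > 0)
      {
        n--;
        unsigned char t = ((unsigned char *) a)[n];
        ((unsigned char *) a)[n] = ((unsigned char *) b)[n];
        ((unsigned char *) b)[n] = t;
      }
  }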