Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
From: Adhemerval Zanella via Libc-alpha
To: Noah Goldstein
Cc: GNU C Library
Date: Fri, 15 Oct 2021 10:29:02 -0300
Message-ID: <788ad382-68eb-cf90-0bf7-681ed54177da@linaro.org>
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org> <20210903171144.952737-4-adhemerval.zanella@linaro.org>
On 13/10/2021 00:39, Noah Goldstein wrote:
> On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein wrote:
>> On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha wrote:
>>>
>>> The optimization takes into consideration that the most common element
>>> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
>>> word boundary.  This is similar to the optimization done in lib/sort.c
>>> from Linux.
>>>
>>> This patch adds an optimized swap operation to qsort based on the
>>> previous msort one.  Instead of a byte-wise operation, three variants
>>> are provided:
>>>
>>>   1. Using uint32_t loads and stores.
>>>   2. Using uint64_t loads and stores.
>>>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>>>
>>> Options 1. and 2. are selected only if the architecture defines
>>> _STRING_ARCH_unaligned or if the base pointer is aligned to the
>>> required type.
>>>
>>> It also fixes BZ#19305 by checking the input size against an element
>>> count of 1 besides 0.
>>>
>>> Checked on x86_64-linux-gnu.
>>>
>>> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
>>> ---
>>>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 91 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>> index 23f2d28314..59458d151b 100644
>>> --- a/stdlib/qsort.c
>>> +++ b/stdlib/qsort.c
>>> @@ -24,20 +24,85 @@
>>>  #include <limits.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <stdbool.h>
>>>
>>> -/* Byte-wise swap two items of size SIZE.  */
>>> -#define SWAP(a, b, size)						\
>>> -  do									\
>>> -    {									\
>>> -      size_t __size = (size);						\
>>> -      char *__a = (a), *__b = (b);					\
>>> -      do								\
>>> -	{								\
>>> -	  char __tmp = *__a;						\
>>> -	  *__a++ = *__b;						\
>>> -	  *__b++ = __tmp;						\
>>> -	} while (--__size > 0);						\
>>> -    } while (0)
>>> +/* Swap SIZE bytes between addresses A and B.  These helpers are
>>> +   provided along the generic one as an optimization.  */
>>> +
>>> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
>>> +
>>> +/* Return true if elements can be copied using word loads and stores.
>>> +   The size must be a multiple of the alignment, and the base address
>>> +   must be as well when unaligned access is not supported.  */
>>> +static inline bool
>>> +is_aligned_to_copy (const void *base, size_t size, size_t align)
>>> +{
>>> +  unsigned char lsbits = size;
>>> +#if !_STRING_ARCH_unaligned
>>> +  lsbits |= (unsigned char)(uintptr_t) base;
>>> +#endif
>>> +  return (lsbits & (align - 1)) == 0;
>>> +}
>>> +
>>> +#define SWAP_WORDS_64 (swap_func_t)0
>>> +#define SWAP_WORDS_32 (swap_func_t)1
>>> +#define SWAP_BYTES    (swap_func_t)2
>>> +
>>> +static void
>>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +    {
>>> +      n -= 8;
>>> +      uint64_t t = *(uint64_t *)(a + n);
>>> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>>> +      *(uint64_t *)(b + n) = t;
>>> +    } while (n);
>>> +}
>>> +
>>> +static void
>>> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +    {
>>> +      n -= 4;
>>> +      uint32_t t = *(uint32_t *)(a + n);
>>> +      *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>>> +      *(uint32_t *)(b + n) = t;
>>> +    } while (n);
>>> +}
>
> I'm not certain swap_words_32 / swap_words_8 will be optimal for larger
> key sizes.  Looking at GCC's implementation of swap_generic on modern
> x86_64: https://godbolt.org/z/638h3Y9va
> It's able to optimize the temporary buffer out of the loop and use xmm
> registers, which will likely win out for larger sizes.

It is probably not the most optimized code the compiler can generate,
since I tried to go for generic code that should give somewhat better
results in an architecture-agnostic way.  I was trying to mimic some of
the optimizations Linux did in 37d0ec34d111acfdb, and for swap_bytes I
used the same strategy as d3496c9f4f27d (Linux did not optimize anything
for the byte version).

I think it would be possible to tune it for a specific architecture
and/or compiler, but I would prefer to use a good-enough algorithm that
works reasonably well on multiple architectures.
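For reference, the quoted hunk stops before the patch's third variant, the
generic swap with a temporary buffer that the discussion above is about.
A minimal sketch of that shape (the function name, the 128-byte chunk size,
and the plain memcpy calls are my own choices here, not taken from the
patch) looks like this:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reconstruction of the generic byte-swap variant: swap N
   bytes between A and B through a small on-stack buffer, a chunk at a
   time.  Copying whole chunks with memcpy lets the compiler vectorize
   the inner copies, which is the effect Noah's godbolt link shows.  */

enum { SWAP_CHUNK = 128 };  /* assumed buffer size, not from the patch */

static void
swap_bytes (void *restrict a, void *restrict b, size_t n)
{
  unsigned char tmp[SWAP_CHUNK];
  while (n > 0)
    {
      size_t chunk = n > sizeof tmp ? sizeof tmp : n;
      memcpy (tmp, a, chunk);       /* tmp <- a */
      memcpy (a, b, chunk);         /* a   <- b */
      memcpy (b, tmp, chunk);       /* b   <- tmp */
      a = (unsigned char *) a + chunk;
      b = (unsigned char *) b + chunk;
      n -= chunk;
    }
}
```

The actual patch mentions mempcpy as well, so the real glibc code likely
differs in detail; the point is only that a chunked memcpy loop, unlike
the word-at-a-time helpers, keeps working for arbitrary element sizes.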