From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
From: Adhemerval Zanella via Libc-alpha
Reply-To: Adhemerval Zanella
To: Noah Goldstein
Cc: GNU C Library
Date: Fri, 15 Oct 2021 14:34:34 -0300
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-4-adhemerval.zanella@linaro.org>
 <788ad382-68eb-cf90-0bf7-681ed54177da@linaro.org>
List-Id: Libc-alpha mailing list

On 15/10/2021 14:17, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 8:29 AM Adhemerval Zanella wrote:
>>
>> On 13/10/2021 00:39, Noah Goldstein wrote:
>>>
>>> On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein wrote:
>>>
>>> On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha wrote:
>>>
>>> The optimization takes into consideration that the most common element
>>> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
>>> word boundary.  This is similar to the optimization done on lib/sort.c
>>> from Linux.
>>>
>>> This patch adds an optimized swap operation to qsort based on the
>>> previous msort one.  Instead of a byte-wise operation, three variants
>>> are provided:
>>>
>>>   1. Using uint32_t loads and stores.
>>>   2. Using uint64_t loads and stores.
>>>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>>>
>>> Options 1 and 2 are selected only if the architecture defines
>>> _STRING_ARCH_unaligned or if the base pointer is aligned to the
>>> required type.
>>>
>>> It also fixes BZ#19305 by checking the input size against an element
>>> count of 1 in addition to 0.
>>>
>>> Checked on x86_64-linux-gnu.
>>>
>>> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
>>> ---
>>>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 91 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>> index 23f2d28314..59458d151b 100644
>>> --- a/stdlib/qsort.c
>>> +++ b/stdlib/qsort.c
>>> @@ -24,20 +24,85 @@
>>>  #include <limits.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <stdbool.h>
>>>
>>> -/* Byte-wise swap two items of size SIZE.  */
>>> -#define SWAP(a, b, size)                                \
>>> -  do                                                    \
>>> -    {                                                   \
>>> -      size_t __size = (size);                           \
>>> -      char *__a = (a), *__b = (b);                      \
>>> -      do                                                \
>>> -        {                                               \
>>> -          char __tmp = *__a;                            \
>>> -          *__a++ = *__b;                                \
>>> -          *__b++ = __tmp;                               \
>>> -        } while (--__size > 0);                         \
>>> -    } while (0)
>>> +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
>>> +   alongside the generic one as an optimization.  */
>>> +
>>> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
>>> +
>>> +/* Return true if elements can be copied using word loads and stores.
>>> +   The size must be a multiple of the alignment, and the base address
>>> +   must be aligned as well when the architecture does not support
>>> +   unaligned access.  */
>>> +static inline bool
>>> +is_aligned_to_copy (const void *base, size_t size, size_t align)
>>> +{
>>> +  unsigned char lsbits = size;
>>> +#if !_STRING_ARCH_unaligned
>>> +  lsbits |= (unsigned char)(uintptr_t) base;
>>> +#endif
>>> +  return (lsbits & (align - 1)) == 0;
>>> +}
>>> +
>>> +#define SWAP_WORDS_64 (swap_func_t)0
>>> +#define SWAP_WORDS_32 (swap_func_t)1
>>> +#define SWAP_BYTES (swap_func_t)2
>>> +
>>> +static void
>>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +   {
>>> +     n -= 8;
>>> +     uint64_t t = *(uint64_t *)(a + n);
>>> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>>> +     *(uint64_t *)(b + n) = t;
>>> +   } while (n);
>>> +}
>>> +
>>> +static void
>>> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>>> +{
>>> +  do
>>> +   {
>>> +     n -= 4;
>>> +     uint32_t t = *(uint32_t *)(a + n);
>>> +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>>> +     *(uint32_t *)(b + n) = t;
>>> +   } while (n);
>>> +}
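(The hunk quoted above ends before the call sites, so the code that picks
among the three variants is not shown.  Going by the commit message, the
selection presumably looks something like the sketch below; the names
select_swap_func and do_swap are illustrative, not necessarily what the
patch actually uses, and swap_bytes is the generic fallback defined
elsewhere in the patch.)

  /* Sketch: pick the widest swap that the element size and the base
     alignment allow, per the commit message's selection rule.  */
  static swap_func_t
  select_swap_func (const void *base, size_t size)
  {
    if (is_aligned_to_copy (base, size, 8))
      return SWAP_WORDS_64;
    if (is_aligned_to_copy (base, size, 4))
      return SWAP_WORDS_32;
    return SWAP_BYTES;
  }

  /* Sketch: dispatch on the sentinel values defined in the hunk.  */
  static void
  do_swap (void * restrict a, void * restrict b, size_t size,
           swap_func_t swap_func)
  {
    if (swap_func == SWAP_WORDS_64)
      swap_words_64 (a, b, size);
    else if (swap_func == SWAP_WORDS_32)
      swap_words_32 (a, b, size);
    else
      swap_bytes (a, b, size);  /* generic fallback, not in the quoted hunk */
  }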
>>>
>>> I'm not certain swap_words_32 / swap_words_8 will be optimal for larger
>>> key sizes.  Looking at GCC's code generation for swap_generic on modern
>>> x86_64: https://godbolt.org/z/638h3Y9va
>>> It's able to optimize the temporary buffer out of the loop and use xmm
>>> registers, which will likely win out for larger sizes.
>>
>> It is probably not the most optimized code the compiler can generate,
>> since I tried to go for more generic code that should give somewhat
>> better results in an architecture-agnostic way.  I tried to mimic some
>> of the optimizations Linux did in 37d0ec34d111acfdb, and for swap_bytes
>> I used the same strategy as d3496c9f4f27d (Linux did not optimize
>> anything for the byte version).
>>
>> I think it would be possible to tune it for a specific architecture
>> and/or compiler, but I would prefer to use a good-enough algorithm that
>> works reasonably well on multiple architectures.
>
> That's fair.  Although I still think there are some improvements to be had.
>
> Looking at the assembly for all three, in fact, it seems GCC optimizes
> all of them to larger register copies: https://godbolt.org/z/bd9nnnoEY
>
> The biggest difference seems to be the setup / register spills for the
> generic version, so for the common case of a relatively small key the
> special cases for 4/8 make sense.
>
> Have you checked that GCC is able to use the conditions for selecting
> the swap function to optimize the functions themselves?  In the godbolt
> link I got reasonable value out of adding the invariants to
> swap_64/swap_32.

Not yet, but I think the generated code for both x86_64 and aarch64 seems
simple enough and should cover the common case (keys of size 4 or 8) fast
enough [1].  And since for v4 my plan is not to remove mergesort anymore
but to limit it to a static buffer, it might be possible to use the same
swap routines for both mergesort and quicksort.

> It also may be worth it to write a custom memcpy implementation for
> size = 0..SWAP_GENERIC_SIZE so it can be inlined (and probably be more
> optimized than what a generic memcpy can get away with).

I *really* do not want to go down this path.  My hope is that for small
sizes the compiler can inline the memcpy (which GCC does for the most
common architectures).

[1] https://godbolt.org/z/v7e4xxqGa
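(For readers following the memcpy inlining discussion above: a generic swap
along the lines of variant 3 in the commit message might look like the
sketch below.  SWAP_GENERIC_SIZE is the name used in the thread; the value
32, the function name swap_generic, and the byte-wise tail loop are
assumptions, not the patch's verbatim code.  mempcpy is a GNU extension,
hence _GNU_SOURCE when building outside glibc.)

  #define _GNU_SOURCE
  #include <string.h>

  enum { SWAP_GENERIC_SIZE = 32 };  /* assumed chunk size */

  /* Swap N bytes between A and B through a small fixed-size buffer, so
     each memcpy/mempcpy call has a constant size and can be inlined.  */
  static void
  swap_generic (void * restrict a, void * restrict b, size_t n)
  {
    unsigned char tmp[SWAP_GENERIC_SIZE];
    while (n >= SWAP_GENERIC_SIZE)
      {
        memcpy (tmp, a, SWAP_GENERIC_SIZE);
        a = mempcpy (a, b, SWAP_GENERIC_SIZE);
        b = mempcpy (b, tmp, SWAP_GENERIC_SIZE);
        n -= SWAP_GENERIC_SIZE;
      }
    /* Swap the remaining tail byte by byte.  */
    while (n > 0)
      {
        n--;
        unsigned char t = ((unsigned char *) a)[n];
        ((unsigned char *) a)[n] = ((unsigned char *) b)[n];
        ((unsigned char *) b)[n] = t;
      }
  }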