From: Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Noah Goldstein
Date: Tue, 12 Oct 2021 23:29:40 -0400
Subject: Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
To: Adhemerval Zanella
Cc: GNU C Library
In-Reply-To: <20210903171144.952737-4-adhemerval.zanella@linaro.org>
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-4-adhemerval.zanella@linaro.org>
List-Id: Libc-alpha mailing list
On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:

> The optimization takes into consideration that the most common element
> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
> word boundary.  This is similar to the optimization done in lib/sort.c
> from Linux.
>
> This patch adds an optimized swap operation to qsort, based on the
> previous msort one.  Instead of a byte-wise operation, three variants
> are provided:
>
>   1. Using uint32_t loads and stores.
>   2. Using uint64_t loads and stores.
>   3. A generic one with a temporary buffer and memcpy/mempcpy.
>
> Options 1 and 2 are selected only if the architecture defines
> _STRING_ARCH_unaligned or if the base pointer is aligned to the
> required type.
>
> It also fixes BZ#19305 by checking the number of elements against 1
> in addition to 0.
>
> Checked on x86_64-linux-gnu.
>
> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
> ---
>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 91 insertions(+), 18 deletions(-)
>
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index 23f2d28314..59458d151b 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -24,20 +24,85 @@
>  #include <limits.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <stdbool.h>
>
> -/* Byte-wise swap two items of size SIZE. */
> -#define SWAP(a, b, size)                  \
> -  do                                      \
> -    {                                     \
> -      size_t __size = (size);             \
> -      char *__a = (a), *__b = (b);        \
> -      do                                  \
> -        {                                 \
> -          char __tmp = *__a;              \
> -          *__a++ = *__b;                  \
> -          *__b++ = __tmp;                 \
> -        } while (--__size > 0);           \
> -    } while (0)
> +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
> +   along with the generic one as an optimization.  */
> +
> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
> +
> +/* Return true if elements can be copied using word loads and stores.
> +   Both the size and the base address must be a multiple of the
> +   alignment.  */
> +static inline bool
> +is_aligned_to_copy (const void *base, size_t size, size_t align)
> +{
> +  unsigned char lsbits = size;
> +#if !_STRING_ARCH_unaligned
> +  lsbits |= (unsigned char)(uintptr_t) base;
> +#endif
> +  return (lsbits & (align - 1)) == 0;
> +}
> +
> +#define SWAP_WORDS_64 (swap_func_t)0
> +#define SWAP_WORDS_32 (swap_func_t)1
> +#define SWAP_BYTES (swap_func_t)2
> +
> +static void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +    {
> +      n -= 8;
> +      uint64_t t = *(uint64_t *)(a + n);
> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> +      *(uint64_t *)(b + n) = t;
> +    } while (n);
> +}
> +
> +static void
> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +    {
> +      n -= 4;
> +      uint32_t t = *(uint32_t *)(a + n);
> +      *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
> +      *(uint32_t *)(b + n) = t;
> +    } while (n);
> +}
> +
> +static void
> +swap_bytes (void * restrict a, void * restrict b, size_t n)
> +{
> +  /* Use multiple small memcpys with constant size to enable inlining
> +     on most targets.  */
> +  enum { SWAP_GENERIC_SIZE = 32 };
> +  unsigned char tmp[SWAP_GENERIC_SIZE];
> +  while (n > SWAP_GENERIC_SIZE)
> +    {
> +      memcpy (tmp, a, SWAP_GENERIC_SIZE);
> +      a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      n -= SWAP_GENERIC_SIZE;
> +    }
> +  memcpy (tmp, a, n);
> +  memcpy (a, b, n);
> +  memcpy (b, tmp, n);
> +}
> +
> +/* Replace the indirect call with a series of if statements.  It should
> +   help the branch predictor.  */

1) Really? On Intel at least an indirect call that is always going to the
same place is certainly going to be predicted as well as, if not better
than, 2/3 branches + a direct call.

2) If you're going to just test which swap function to use, why bother
initializing swap_func? Why not just use an int?
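Something along these lines is what I have in mind; a rough sketch only,
untested, and the swap_kind names are mine, not anything in the patch:

/* Hypothetical alternative: select the variant with a plain enum instead
   of casting small integers to a function-pointer type.  Relies on the
   swap_words_64 / swap_words_32 / swap_bytes helpers added above.  */
enum swap_kind { SWAP_KIND_BYTES, SWAP_KIND_WORDS_32, SWAP_KIND_WORDS_64 };

static void
do_swap (void * restrict a, void * restrict b, size_t size,
         enum swap_kind kind)
{
  /* Same two compares and branches the sentinel pointers compile down
     to, just without laundering an integer through swap_func_t.  */
  if (kind == SWAP_KIND_WORDS_64)
    swap_words_64 (a, b, size);
  else if (kind == SWAP_KIND_WORDS_32)
    swap_words_32 (a, b, size);
  else
    swap_bytes (a, b, size);
}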
> +static void
> +do_swap (void * restrict a, void * restrict b, size_t size,
> +         swap_func_t swap_func)
> +{
> +  if (swap_func == SWAP_WORDS_64)
> +    swap_words_64 (a, b, size);
> +  else if (swap_func == SWAP_WORDS_32)
> +    swap_words_32 (a, b, size);
> +  else
> +    swap_bytes (a, b, size);
> +}
>
>  /* Discontinue quicksort algorithm when partition gets below this size.
>     This particular magic number was chosen to work best on a Sun 4/260. */
> @@ -97,6 +162,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>      /* Avoid lossage with unsigned arithmetic below.  */
>      return;
>
> +  swap_func_t swap_func;
> +  if (is_aligned_to_copy (pbase, size, 8))
> +    swap_func = SWAP_WORDS_64;
> +  else if (is_aligned_to_copy (pbase, size, 4))
> +    swap_func = SWAP_WORDS_32;
> +  else
> +    swap_func = SWAP_BYTES;
> +
>    if (total_elems > MAX_THRESH)
>      {
>        char *lo = base_ptr;
> @@ -120,13 +193,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>            char *mid = lo + size * ((hi - lo) / size >> 1);
>
>            if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -            SWAP (mid, lo, size);
> +            do_swap (mid, lo, size, swap_func);
>            if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
> -            SWAP (mid, hi, size);
> +            do_swap (mid, hi, size, swap_func);
>            else
>              goto jump_over;
>            if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -            SWAP (mid, lo, size);
> +            do_swap (mid, lo, size, swap_func);
>          jump_over:;
>
>            left_ptr  = lo + size;
> @@ -145,7 +218,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>
>                if (left_ptr < right_ptr)
>                  {
> -                  SWAP (left_ptr, right_ptr, size);
> +                  do_swap (left_ptr, right_ptr, size, swap_func);
>                    if (mid == left_ptr)
>                      mid = right_ptr;
>                    else if (mid == right_ptr)
> @@ -217,7 +290,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>          tmp_ptr = run_ptr;
>
>      if (tmp_ptr != base_ptr)
> -      SWAP (tmp_ptr, base_ptr, size);
> +      do_swap (tmp_ptr, base_ptr, size, swap_func);
>
>    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>
> --
> 2.30.2
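As an aside, while reviewing I put together the toy program below to
sanity-check which variant the alignment test picks for a few element
sizes.  The check is copied from the patch's is_aligned_to_copy with the
_STRING_ARCH_unaligned conditional dropped (i.e. the strict-alignment
case); main() and its names are mine, purely illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Strict-alignment form of the patch's check: both the size and the
   base address must be multiples of ALIGN.  */
static bool
is_aligned_to_copy (const void *base, size_t size, size_t align)
{
  return (((uintptr_t) base | size) & (align - 1)) == 0;
}

int
main (void)
{
  static const struct { const char *name; size_t size; } cases[] = {
    { "uint64_t elements", 8 },
    { "uint32_t elements", 4 },
    { "3-byte elements", 3 },
  };

  for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
    {
      /* malloc returns memory suitably aligned for any fundamental
         type, so the choice here is driven by the element size.  */
      void *base = malloc (cases[i].size * 16);
      const char *variant = is_aligned_to_copy (base, cases[i].size, 8)
                            ? "swap_words_64"
                            : (is_aligned_to_copy (base, cases[i].size, 4)
                               ? "swap_words_32" : "swap_bytes");
      printf ("%-18s -> %s\n", cases[i].name, variant);
      free (base);
    }
  return 0;
}

That prints what I'd expect: 8-byte elements take the 64-bit path,
4-byte elements the 32-bit path, and odd sizes fall back to the
byte-wise loop.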