Subject: Re: [PATCH v1 1/2] x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S
From: Noah Goldstein via Libc-alpha
Reply-To: Noah Goldstein
To: "H.J. Lu"
Cc: GNU C Library
Date: Mon, 19 Apr 2021 12:27:16 -0400
References: <20210418220921.1868796-1-goldstein.w.n@gmail.com>
List-Id: Libc-alpha mailing list

On Mon, Apr 19, 2021 at 10:50 AM H.J. Lu wrote:
>
> On Sun, Apr 18, 2021 at 06:09:21PM -0400, Noah Goldstein wrote:
> > No bug.
> > This commit adds an optimized case for the less_vec memset case that
> > uses the avx512vl/avx512bw mask store, avoiding the excessive
> > branches. test-memset and test-wmemset are passing.
> >
> > Signed-off-by: Noah Goldstein
> > ---
> > Tests were run on the following CPUs:
> >
> > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
> >
> > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
> >
> > All times are the geometric mean of N=20. The unit of time is
> > seconds.
> >
> > "Cur" refers to the current implementation.
> > "New" refers to this patch's implementation.
> >
> > There are 3 cases that matter for performance:
> >
> > 1) ptr is not within VEC_SIZE of a page boundary
> > 2) ptr is within VEC_SIZE of a page boundary but the length is small
> >    enough that there is no page cross
> > 3) page cross
> >
> > In case 1 (which should be the most common) the new implementation has
> > a near universal improvement. The only exception is the avx512 case for
> > size = [0, 15], where I believe the downclocking from avx512 is causing
> > the slowdown. It's worth noting that because bench-memset.c repeats the
> > same size, the branch-heavy case should be favored, as the branches will
> > all be predicted correctly. In a setting with unpredictable lengths this
> > version should perform significantly better. For example, I implemented
> > something similar to this change for memmove/memcpy and saw a ~40%
> > speedup in bench-memcpy-random (but for other reasons that change isn't
> > good there).
> >
> > Case 2 has a slowdown with this patch (roughly equivalent to the
> > performance improvement for case 1). Though I think this is probably
> > less important than the improvements for case 1, as page crosses are
> > probably rarer than non-page crosses.
> >
> > Case 3 has a very slight slowdown with this patch. But for the same
> > reason as above I think this patch is still an improvement.
> >
> > It's worth noting that the page cross check could be removed and the
> > mask store implementation would still be correct, but I'm finding that
> > the fault suppression is incredibly expensive from a performance
> > perspective; without the branch I see a performance regression of two
> > orders of magnitude on the case 2 benchmarks.
> >
>
> ...
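For reference, here is a rough C-intrinsics sketch of the masked-store idea
described above, for the 32-byte EVEX case. It is illustrative only and not
part of the patch: the wrapper name is made up, and it assumes AVX512VL,
AVX512BW and BMI2 are available (compile with -mavx512vl -mavx512bw -mbmi2).

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical wrapper, not glibc code: set LEN (< 32) bytes at DST to C
   with a single masked store instead of a branch ladder over LEN.  */
void
memset_less_vec32 (void *dst, int c, size_t len)
{
  /* bzhi (-1, len) produces a mask with the low LEN bits set, the same
     result as the movq $-1 / bzhi / kmov sequence in the patch.  */
  __mmask32 k = (__mmask32) _bzhi_u32 (-1U, (unsigned int) len);
  __m256i v = _mm256_set1_epi8 ((char) c);
  /* Bytes whose mask bit is 0 are not written, so nothing is stored past
     DST + LEN and no branches on LEN are needed.  */
  _mm256_mask_storeu_epi8 (dst, k, v);
}

The 64-byte avx512 case has the same shape with __mmask64, _bzhi_u64 and a
zmm register.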
> >
> >  sysdeps/x86_64/multiarch/ifunc-memset.h       |  6 ++-
> >  .../multiarch/memset-avx512-unaligned-erms.S  |  2 +-
> >  .../multiarch/memset-evex-unaligned-erms.S    |  2 +-
> >  .../multiarch/memset-vec-unaligned-erms.S     | 52 +++++++++++++++----
> >  4 files changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-memset.h b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > index 502f946a84..eda5640541 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-memset.h
> > +++ b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > @@ -54,7 +54,8 @@ IFUNC_SELECTOR (void)
> >        && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > -         && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > +         && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > +         && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> >          return OPTIMIZE (avx512_unaligned_erms);
> > @@ -68,7 +69,8 @@ IFUNC_SELECTOR (void)
> >    if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > -         && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > +         && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > +         && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> >          return OPTIMIZE (evex_unaligned_erms);
>
> Please also update ifunc-impl-list.c.
>
> > diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > index 22e7b187c8..d03460be93 100644
> > --- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > @@ -19,6 +19,6 @@
> >  # define SECTION(p) p##.evex512
> >  # define MEMSET_SYMBOL(p,s) p##_avx512_##s
> >  # define WMEMSET_SYMBOL(p,s) p##_avx512_##s
> > -
> > +# define USE_LESS_VEC_MASKMOV 1
> >  # include "memset-vec-unaligned-erms.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > index ae0a4d6e46..eb3541ef60 100644
> > --- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > @@ -19,6 +19,6 @@
> >  # define SECTION(p) p##.evex
> >  # define MEMSET_SYMBOL(p,s) p##_evex_##s
> >  # define WMEMSET_SYMBOL(p,s) p##_evex_##s
> > -
> > +# define USE_LESS_VEC_MASKMOV 1
> >  # include "memset-vec-unaligned-erms.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 584747f1a1..6b02e87f48 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -63,6 +63,9 @@
> >  # endif
> >  #endif
> >
> > +#define PAGE_SIZE 4096
> > +#define LOG_PAGE_SIZE 12
> > +
> >  #ifndef SECTION
> >  # error SECTION is not defined!
> >  #endif
> > @@ -213,11 +216,38 @@ L(loop):
> >       cmpq    %rcx, %rdx
> >       jne     L(loop)
> >       VZEROUPPER_SHORT_RETURN
> > +
> > +     .p2align 4
> >  L(less_vec):
> >       /* Less than 1 VEC.  */
> >  # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> >  #  error Unsupported VEC_SIZE!
> >  # endif
> > +# ifdef USE_LESS_VEC_MASKMOV
> > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > +        cross check. Using sall instead of andl saves 3 bytes. Note
> > +        that we are using rax which is set in
> > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > +     sall    $(32 - LOG_PAGE_SIZE), %edi
>
> Why is SAL needed here?

To clear the upper bits from edi before checking for a page cross. We only
want to check edi[11:0] to see if one vec load will cross a page. AFAIK
this is the most efficient way to do that.

> > +     /* Check if VEC_SIZE load cross page. Mask loads suffer serious
> > +        performance degradation when it has to fault supress.  */
> > +     cmpl    $((PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE)), %edi
>
> Can you use only CMP, not SAL, for masked store?

What do you mean? This comparison is to check if there will be a page cross
on the store. It's not necessary for correctness, but it is for performance,
as fault suppression appears to be really expensive. The code would still be
correct with no SAL and no CMP.

> > +     ja      L(cross_page)
> > +# if VEC_SIZE > 32
> > +     movq    $-1, %rcx
> > +     bzhiq   %rdx, %rcx, %rcx
> > +     kmovq   %rcx, %k1
> > +# else
> > +     movl    $-1, %ecx
> > +     bzhil   %edx, %ecx, %ecx
> > +     kmovd   %ecx, %k1
> > +# endif
> > +     vmovdqu8 %VEC(0), (%rax) {%k1}
> > +     VZEROUPPER_RETURN
> > +
> > +     .p2align 4
> > +L(cross_page):
> > +# endif
> >  # if VEC_SIZE > 32
> >       cmpb    $32, %dl
> >       jae     L(between_32_63)
> > @@ -234,36 +264,36 @@ L(less_vec):
> >       cmpb    $1, %dl
> >       ja      L(between_2_3)
> >       jb      1f
> > -     movb    %cl, (%rdi)
> > +     movb    %cl, (%rax)
> >  1:
> >       VZEROUPPER_RETURN
> >  # if VEC_SIZE > 32
> >       /* From 32 to 63.  No branch when size == 32.  */
> >  L(between_32_63):
> > -     VMOVU   %YMM0, -32(%rdi,%rdx)
> > -     VMOVU   %YMM0, (%rdi)
> > +     VMOVU   %YMM0, -32(%rax,%rdx)
> > +     VMOVU   %YMM0, (%rax)
> >       VZEROUPPER_RETURN
> >  # endif
> >  # if VEC_SIZE > 16
> >       /* From 16 to 31.  No branch when size == 16.  */
> >  L(between_16_31):
> > -     VMOVU   %XMM0, -16(%rdi,%rdx)
> > -     VMOVU   %XMM0, (%rdi)
> > +     VMOVU   %XMM0, -16(%rax,%rdx)
> > +     VMOVU   %XMM0, (%rax)
> >       VZEROUPPER_RETURN
> >  # endif
> >       /* From 8 to 15.  No branch when size == 8.  */
> >  L(between_8_15):
> > -     movq    %rcx, -8(%rdi,%rdx)
> > -     movq    %rcx, (%rdi)
> > +     movq    %rcx, -8(%rax,%rdx)
> > +     movq    %rcx, (%rax)
> >       VZEROUPPER_RETURN
> >  L(between_4_7):
> >       /* From 4 to 7.  No branch when size == 4.  */
> > -     movl    %ecx, -4(%rdi,%rdx)
> > -     movl    %ecx, (%rdi)
> > +     movl    %ecx, -4(%rax,%rdx)
> > +     movl    %ecx, (%rax)
> >       VZEROUPPER_RETURN
> >  L(between_2_3):
> >       /* From 2 to 3.  No branch when size == 2.  */
> > -     movw    %cx, -2(%rdi,%rdx)
> > -     movw    %cx, (%rdi)
> > +     movw    %cx, -2(%rax,%rdx)
> > +     movw    %cx, (%rax)
> >       VZEROUPPER_RETURN
> >  END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > --
> > 2.29.2
> >
>
> H.J.
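To spell out the SAL/CMP exchange above in C (illustrative only; the helper
name is made up): the shift moves the page-offset bits edi[11:0] into the
top of the register, so one unsigned compare against
(PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE) performs the same test as
the more obvious andl/cmpl pair.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE     4096
#define LOG_PAGE_SIZE 12
#define VEC_SIZE      32   /* evex case; the avx512 case uses 64.  */

/* True if a VEC_SIZE-byte access starting at the pointer whose low 32 bits
   are PTR_LOW32 would cross a page boundary.  */
bool
vec_access_crosses_page (uint32_t ptr_low32)
{
  /* Obvious form: isolate the page offset with AND, then compare.  */
  bool and_form = (ptr_low32 & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;

  /* Form used in the patch: SAL moves the page-offset bits into the top
     of the register, so a single unsigned compare (ja) does the same test
     while saving a few bytes of code.  */
  uint32_t shifted = ptr_low32 << (32 - LOG_PAGE_SIZE);
  bool sal_form =
    shifted > ((uint32_t) (PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE));

  assert (and_form == sal_form);   /* the two forms always agree */
  return sal_form;
}

Only the low 32 bits of the pointer matter here, since the page offset lives
entirely in bits [11:0].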