From mboxrd@z Thu Jan  1 00:00:00 1970
From: Noah Goldstein via Libc-alpha
Reply-To: Noah Goldstein
To: "H.J. Lu"
Cc: GNU C Library
Date: Tue, 27 Jul 2021 15:50:09 -0400
Subject: Re: [PATCH v3] x86-64: Add Avoid_Short_Distance_REP_MOVSB
References: <20210726120055.1089971-1-hjl.tools@gmail.com>
List-Id: Libc-alpha mailing list

On Tue, Jul 27, 2021 at 3:23 PM H.J. Lu wrote:
> On Tue, Jul 27, 2021 at 12:12 PM Noah Goldstein wrote:
> >
> > On Tue, Jul 27, 2021 at 12:06 PM H.J. Lu wrote:
> >>
> >> On Mon, Jul 26, 2021 at 9:06 PM Noah Goldstein wrote:
> >> >
> >> > On Mon, Jul 26, 2021 at 11:11 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote:
> >> >>
> >> >> On Mon, Jul 26, 2021 at 7:15 PM Carlos O'Donell wrote:
> >> >> >
> >> >> > On 7/26/21 8:00 AM, H.J. Lu via Libc-alpha wrote:
> >> >> > > commit 3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5
> >> >> > > Author: H.J. Lu
> >> >> > > Date:   Sat Jan 25 14:19:40 2020 -0800
> >> >> > >
> >> >> > >     x86-64: Avoid rep movsb with short distance [BZ #27130]
> >> >> > >
> >> >> > > introduced some regressions on Intel processors without Fast Short REP
> >> >> > > MOV (FSRM).  Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
> >> >> > > short distance only on Intel processors with FSRM.  bench-memmove-large
> >> >> > > on Skylake server shows that cycles of __memmove_evex_unaligned_erms
> >> >> > > are improved for the following data sizes:
> >> >> > >
> >> >> > >                                    before    after    Improvement
> >> >> > > length=4127, align1=3, align2=0:   479.38    343.00       28%
> >> >> > > length=4223, align1=9, align2=5:   405.62    335.50       17%
> >> >> > > length=8223, align1=3, align2=0:   786.12    495.00       37%
> >> >> > > length=8319, align1=9, align2=5:   256.69    170.38       33%
> >> >> > > length=16415, align1=3, align2=0:  1436.88   839.50       41%
> >> >> > > length=16511, align1=9, align2=5:  1375.50   840.62       39%
> >> >> > > length=32799, align1=3, align2=0:  2890.00   1850.62      36%
> >> >> > > length=32895, align1=9, align2=5:  2891.38   1948.62      32%
> >> >> > >
> >> >> > > There are no regressions on Ice Lake server.
> >> >> >
> >> >> > At this point we're waiting on Noah to provide feedback on the
> >> >> > performance results given the alignment nop insertion you provided as a
> >> >> > follow-up patch
> >> >
> >> > The results with the padding look good!
> >> >
> >> >>
> >> >> We are testing 25 byte nop padding now:
> >> >>
> >> >> https://gitlab.com/x86-glibc/glibc/-/commit/de8985640a568786a59576716db54e0749d420e8
> >> >>
> >> > How did you come to the exact padding choice used?
> >>
> >> I first replaced the 9 byte instructions:
> >>
> >>    andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB,
> >>         __x86_string_control(%rip)
> >>    jz 3f
> >>
> >> with a 9-byte NOP and reproduced the regression on Tiger Lake.  It
> >> confirmed that the code layout caused the regression.  I first tried
> >> adding ".p2align 4" to branch targets and they made no difference.  Then
> >> I started adding different sizes of nops after
> >>
> >>    andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB,
> >>         __x86_string_control(%rip)
> >>    jz 3f
> >>    movq %rdi, %rcx
> >>    subq %rsi, %rcx
> >>    jmp 2f
> >>
> >> with ".nops N".  I started with N == 1 and doubled N in each step.  I
> >> noticed that the improvement started at N == 32, so I bisected between
> >> 16 and 32:
> >>
> >> 1. 24 and 32 are good.
> >> 2. 24 and 28 are good.
> >> 3. 25 is the best overall.
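A sketch of the experiment described above, in GAS syntax; the surrounding
instructions are copied from the quoted hunk, and the .nops size is the knob
that was bisected:

       andl    $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
       jz      3f
       movq    %rdi, %rcx              /* RCX = dst - src (forward copy).  */
       subq    %rsi, %rcx
       jmp     2f
       .nops   25                      /* Padding under test: N = 1, 2, 4, ..., 32,
                                          then bisected within [16, 32]; 25 won.  */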
> >> >> > (unless you can confirm this yourself).
> >> >> >
> >> >> > Looking forward to a v2 that incorporates the alignment fix (pending
> >> >> > Noah's comments), and my suggestions below.
> >> >>
> >> >> > > ---
> >> >> > >  sysdeps/x86/cacheinfo.h                                     | 7 +++++++
> >> >> > >  sysdeps/x86/cpu-features.c                                  | 5 +++++
> >> >> > >  .../x86/include/cpu-features-preferred_feature_index_1.def | 1 +
> >> >> > >  sysdeps/x86/sysdep.h                                        | 3 +++
> >> >> > >  sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S      | 5 +++++
> >> >> > >  5 files changed, 21 insertions(+)
> >> >> > >
> >> >> > > diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> >> >> > > index eba8dbc4a6..174ea38f5b 100644
> >> >> > > --- a/sysdeps/x86/cacheinfo.h
> >> >> > > +++ b/sysdeps/x86/cacheinfo.h
> >> >> > > @@ -49,6 +49,9 @@ long int __x86_rep_stosb_threshold attribute_hidden = 2048;
> >> >> > >  /* Threshold to stop using Enhanced REP MOVSB.  */
> >> >> > >  long int __x86_rep_movsb_stop_threshold attribute_hidden;
> >> >> > >
> >> >> > > +/* String/memory function control.  */
> >> >> > > +int __x86_string_control attribute_hidden;
> >> >> >
> >> >> > Please expand comment.
> >> >> >
> >> >> > Suggest:
> >> >> >
> >> >> > /* A bit-wise OR of string/memory requirements for optimal performance
> >> >> >    e.g. X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB.  These bits
> >> >> >    are used at runtime to tune implementation behavior.  */
> >> >> > int __x86_string_control attribute_hidden;
> >> >>
> >> >> I will fix it in the v2 patch.
> >> >>
> >> >> Thanks.
> >> >>
> >> >> > > +
> >> >> > >  static void
> >> >> > >  init_cacheinfo (void)
> >> >> > >  {
> >> >> > > @@ -71,5 +74,9 @@ init_cacheinfo (void)
> >> >> > >    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
> >> >> > >    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
> >> >> > >    __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
> >> >> > > +
> >> >> > > +  if (CPU_FEATURES_ARCH_P (cpu_features, Avoid_Short_Distance_REP_MOVSB))
> >> >> > > +    __x86_string_control
> >> >> > > +      |= X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB;
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > >  }
> >> >> > >  #endif
> >> >> > > diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> >> >> > > index 706a172ba9..645bba6314 100644
> >> >> > > --- a/sysdeps/x86/cpu-features.c
> >> >> > > +++ b/sysdeps/x86/cpu-features.c
> >> >> > > @@ -555,6 +555,11 @@ init_cpu_features (struct cpu_features *cpu_features)
> >> >> > >            cpu_features->preferred[index_arch_Prefer_AVX2_STRCMP]
> >> >> > >              |= bit_arch_Prefer_AVX2_STRCMP;
> >> >> > >          }
> >> >> > > +
> >> >> > > +      /* Avoid short distance REP MOVSB on processors with FSRM.  */
> >> >> > > +      if (CPU_FEATURES_CPU_P (cpu_features, FSRM))
> >> >> > > +        cpu_features->preferred[index_arch_Avoid_Short_Distance_REP_MOVSB]
> >> >> > > +          |= bit_arch_Avoid_Short_Distance_REP_MOVSB;
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > >      }
> >> >> > >    /* This spells out "AuthenticAMD" or "HygonGenuine".  */
> >> >> > >    else if ((ebx == 0x68747541 && ecx == 0x444d4163 && edx == 0x69746e65)
> >> >> > > diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> >> >> > > index 133aab19f1..d7c93f00c5 100644
> >> >> > > --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> >> >> > > +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
> >> >> > > @@ -33,3 +33,4 @@ BIT (Prefer_No_AVX512)
> >> >> > >  BIT (MathVec_Prefer_No_AVX512)
> >> >> > >  BIT (Prefer_FSRM)
> >> >> > >  BIT (Prefer_AVX2_STRCMP)
> >> >> > > +BIT (Avoid_Short_Distance_REP_MOVSB)
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > > diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h
> >> >> > > index 51c069bfe1..35cb90d507 100644
> >> >> > > --- a/sysdeps/x86/sysdep.h
> >> >> > > +++ b/sysdeps/x86/sysdep.h
> >> >> > > @@ -57,6 +57,9 @@ enum cf_protection_level
> >> >> > >  #define STATE_SAVE_MASK \
> >> >> > >    ((1 << 1) | (1 << 2) | (1 << 3) | (1 << 5) | (1 << 6) | (1 << 7))
> >> >> > >
> >> >> >
> >> >> > Suggest adding:
> >> >> >
> >> >> > /* Constants for bits in __x86_string_control:  */
> >> >> >
> >> >> > > +/* Avoid short distance REP MOVSB.  */
> >> >> > > +#define X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB (1 << 0)
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > > +
> >> >> > >  #ifdef __ASSEMBLER__
> >> >> > >
> >> >> > >  /* Syntactic details of assembler.  */
> >> >> > > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> >> >> > > index a783da5de2..9f02624375 100644
> >> >> > > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> >> >> > > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> >> >> > > @@ -325,12 +325,16 @@ L(movsb):
> >> >> > >  	/* Avoid slow backward REP MOVSB.  */
> >> >> > >  	jb	L(more_8x_vec_backward)
> >> >> > >  # if AVOID_SHORT_DISTANCE_REP_MOVSB
> >> >> > > +	andl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> >> >> > > +	jz	3f
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > >  	movq	%rdi, %rcx
> >> >> > >  	subq	%rsi, %rcx
> >> >> > >  	jmp	2f
> >> >> > >  # endif
> >> >> > >  1:
> >> >> > >  # if AVOID_SHORT_DISTANCE_REP_MOVSB
> >> >> > > +	andl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> >> >> > > +	jz	3f
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > >  	movq	%rsi, %rcx
> >> >> > >  	subq	%rdi, %rcx
> >> >> > >  2:
> >> >> > > @@ -338,6 +342,7 @@ L(movsb):
> >> >> > >  	   is N*4GB + [1..63] with N >= 0.  */
> >> >> > >  	cmpl	$63, %ecx
> >> >> > >  	jbe	L(more_2x_vec)	/* Avoid "rep movsb" if ECX <= 63.  */
> >> >> > > +3:
> >> >> >
> >> >> > OK.
> >> >> >
> >> >> > >  # endif
> >> >> > >  	mov	%RDX_LP, %RCX_LP
> >> >> > >  	rep movsb
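Taken together, the patched L(movsb) fast path reads roughly like the sketch
below (only the forward-copy distance computation of the two symmetric paths
is shown; labels and registers are from the quoted hunks).  One thing worth
noting: andl with a memory destination also writes the masked result back to
__x86_string_control in addition to setting ZF; a testl would perform the
same zero check without the store.

       # if AVOID_SHORT_DISTANCE_REP_MOVSB
       /* Runtime gate: skip the distance check entirely when the
          Avoid_Short_Distance_REP_MOVSB bit is not set.  */
       andl    $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
       jz      3f                      /* Bit clear: go straight to rep movsb.  */
       movq    %rdi, %rcx              /* Forward copy: RCX = dst - src.  */
       subq    %rsi, %rcx
2:
       /* The distance is N*4GB + [1..63] iff its low 32 bits are in [1..63];
          rep movsb is slow for such short distances, so fall back to the
          vector loop instead.  */
       cmpl    $63, %ecx
       jbe     L(more_2x_vec)
3:
       # endif
       mov     %RDX_LP, %RCX_LP        /* RCX = copy length.  */
       rep movsb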
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Cheers,
> >> >> > Carlos.
> >> >>
> >> >>
> >> >> --
> >> >> H.J.
> >>
> >> Here is the v2 patch:
> >>
> >> 1. Add a 25-byte NOP padding after JMP for Avoid_Short_Distance_REP_MOVSB,
> >>    which improves bench-memcpy-random performance on Tiger Lake by ~30%.
> >
> > I think this improvement may be due to unrelated factors.  I reran the
> > random benchmarks with the function on a fresh page and the entry of the
> > *_erms version at either +0, +16, +32, or +48 bytes, and 1) don't see a
> > universal improvement and 2) believe it's likely that the 30% you measured
> > is due to unrelated alignment changes.
> >
> > Columns: "__memcpy_avx_unaligned", "__memcpy_avx_unaligned_erms",
> > "__memcpy_evex_unaligned", "__memcpy_evex_unaligned_erms"
> >
> > +0 entry alignment
> > New: 91824.1, 95460.1, 95063.3, 97998.3
> > Old: 99973.7, 100127,  100370,  100049
> >
> > +16 entry alignment
> > New: 129558,  129916,  122373,  124056
> > Old: 125361,  96475.4, 97457.8, 124319
> >
> > +32 entry alignment
> > New: 95073.7, 92857.8, 90182.2, 92666.3
> > Old: 96702.1, 98558.9, 96797.1, 96887.1
> >
> > +48 entry alignment
> > New: 135161,  134010,  123023,  148589
> > Old: 128150,  139029,  98382.2, 122686
> >
> > So the 32-byte/64-byte entry alignment versions seem to favor this change,
> > but when entry alignment % 16 != 0 this change seems to perform worse.
> >
> > I generally think that until we understand why the byte padding is
> > necessary, it's probably a mistake to include it unless it's a benefit in
> > actual SPEC2017.
> >
> > My general intuition is that this is an issue with the benchmarks
> > themselves, so I support the change w/o the padding.
>
> Here is the v3 patch without the padding.  OK for master?

Ok with this change going out.  But I think we need to find the root cause of
this degradation before more changes are made to the file.  At the moment just
about any change that adds 25 bytes before L(less_vec) will look like a major
win.

> >
> >> 2. Update comments for __x86_string_control.
> >>
> >>
> >> --
> >> H.J.
> >
>
> --
> H.J.
>
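One way to reproduce the entry-offset experiment described above is to pad the
benchmarked routine's entry point.  A hypothetical sketch in GAS syntax
(memcpy_probe and the +16 offset are illustrative, not the benchmark suite's
actual mechanism):

       .text
       .p2align 12                     /* Start on a fresh 4 KiB page.  */
       .nops   16                      /* Shift the entry to page + 16 bytes;
                                          vary over 0/16/32/48 per run.  */
       .globl  memcpy_probe
       .type   memcpy_probe, @function
memcpy_probe:
       mov     %rdx, %rcx              /* RCX = length.  */
       rep movsb                       /* Copy RCX bytes from (%rsi) to (%rdi).  */
       ret
       .size   memcpy_probe, .-memcpy_probe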