unofficial mirror of libc-alpha@sourceware.org
* [PATCH] x86: Refactor and improve performance of strchr-avx2.S
@ 2021-01-20  9:29 noah via Libc-alpha
  2021-01-20 12:47 ` H.J. Lu via Libc-alpha
  0 siblings, 1 reply; 3+ messages in thread
From: noah via Libc-alpha @ 2021-01-20  9:29 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n

No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.

Author: noah <goldstein.w.n@gmail.com>
---
Possible improvements fall into 5 categories roughly ordered by
importance:

1 - The main loop that handles 4 vectors at a time has 4 uops removed
    from it. This comes at the cost of additional port pressure on
    ports 0/1, as vpminub and vpminud can only run on those ports
    whereas vpor can run on ports 0/1/5. But the 4 saved uops should
    more than make up for that. Analysis by latency, throughput, and
    benchmarks all show it is a performance improvement. (An
    illustrative C sketch of the new loop body follows this list.)

2 - As far as I can tell the cros_page_boundary logic was broken (or
    the jump label was especially confusing). The original code
    tested for this by checking whether the load would split cache
    lines, not pages. I don't think there are any x86_64
    architectures that support AVX2 (Haswell and newer) and don't
    have a minimum page size of 4kb. Given that the cost of splitting
    a cache line appears to be low on CPUs that support AVX2
    [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
    is a one-off, I don't see it as being critical to avoid. If it is
    critical to avoid a cache-line-split load, I think a branchless
    approach might be better. Thoughts? Note: given that the check
    was changed to only cover page crosses, I think it is
    significantly less likely to be taken than before, so I moved the
    branch target away from such prime real estate. I am also unsure
    whether there is a PAGE_SIZE define in glibc somewhere I can use
    instead of defining it here. (A short C sketch of the new check
    follows this list.)

3 - What was originally the more_4x_vec label was removed and the
    code only does 3 vector blocks now. The reasoning is as follows:
    there are two entry points to that code section, from a page
    cross or fall through (no page cross). The fall-through case is
    the more important one to optimize for, assuming the point above.
    In this case the incoming alignments (i.e. alignment of the ptr
    in rdi) mod 128 can be [0 - 31], [32 - 63], [64 - 95], or
    [96 - 127]. Doing 4 vector blocks optimizes for the [96 - 127]
    case so that when the main 4x loop is hit, a new 128 byte aligned
    segment can be started. Doing 3 vector blocks optimizes for the
    [0 - 31] case. I generally think the string is more likely to be
    aligned to the cache line size (or L2 prefetcher cache line
    pairs) than to [96 - 127] bytes. An alternative would be to make
    that code do 5x vector blocks. This would mean that at most 2
    vector blocks are wasted when realigning to 128 bytes (as opposed
    to 3x or 4x, which can allow for 3 vector blocks to be wasted).
    Thoughts?

4 - Replace the sarl that uses the %cl partial register with sarx.
    This assumes BMI2, which is Haswell and newer, but AVX2 implies
    Haswell and newer so I think it is safe.

5 - In the first_vec_xN return blocks, change the addq to use rdi as
    the destination instead of rax. This shaves one uop off the
    latency of computing the return value.
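
Here is a minimal C-intrinsics sketch of the xor + unsigned-min trick
the new loop body relies on (point 1). It is for illustration only,
not the glibc code; the helper name is made up:

#include <immintrin.h>
#include <stdint.h>

/* Return a 32-bit mask with one bit set for each byte of DATA that is
   either equal to the broadcast search byte or equal to 0.  */
static inline uint32_t
match_mask (__m256i data, __m256i broadcast_char)
{
  /* data ^ broadcast_char is 0 exactly where data == CHAR.  */
  __m256i xored = _mm256_xor_si256 (data, broadcast_char);
  /* min (data, xored) is 0 where data == 0 or data == CHAR, and
     nonzero everywhere else.  */
  __m256i min = _mm256_min_epu8 (data, xored);
  /* Compare the combined result against zero and move the byte mask
     into a general purpose register, as vpmovmskb does.  */
  __m256i cmp = _mm256_cmpeq_epi8 (min, _mm256_setzero_si256 ());
  return (uint32_t) _mm256_movemask_epi8 (cmp);
}

The asm loop applies this to four vectors per iteration and only
needs a single VPCMPEQ against zero before the vpmovmskb, instead of
the old tree of VPCMPEQ/vpor pairs.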

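For point 2, a minimal C sketch of the new page-cross entry check
(the constants mirror the patch; the helper itself is only
illustrative):

#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE  32

/* A VEC_SIZE-byte load starting at S crosses a page boundary iff
   fewer than VEC_SIZE bytes remain in the page containing S.  */
static inline int
load_crosses_page (const char *s)
{
  return ((uintptr_t) s & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE);
}

This matches the new "andl $(PAGE_SIZE - 1), %ecx" /
"cmpl $(PAGE_SIZE - VEC_SIZE), %ecx" / "ja" sequence, whereas the old
check branched whenever the load would merely straddle a 64-byte
boundary.
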
Benchmarks:
I can post my benchmarking code in the email thread if that is
appreciated. I benchmarked a variety of cases with different
alignments, sizes, and data hotness (in L1, L2, etc.), so I can't
just give a single "x percent improvement" number. The benchmarks
were run on my personal computer (icelake-client). Of my 2732 test
cases, 1985 saw an improvement with these changes and 747 performed
better with the original code. I should note, however, that my test
cases had a disproportionate number of cases with 4kb page crosses,
which as discussed I moved to a colder path.

In general the effects of this change are:

Large/medium sized strings (from any part of memory really) see a
10-30% performance boost.
Small strings that are not page crosses see a 10-20% performance boost.
Small strings that are cache line splits see a 20-30% performance boost.
Small strings that cross a page boundary (a 4kb page, that is) see a
10% performance regression.

No regressions in test suite. Both test-strchrnul and test-strchr
passed.

I would love to hear your feedback on all of these points (or any
that I missed).

FSF Documentation has been signed and returned (via pub key and email
respectively)
    
 sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------
 1 file changed, 87 insertions(+), 86 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..09c2df86d1 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,7 +41,8 @@
 # endif
 
 # define VEC_SIZE 32
-
+# define PAGE_SIZE 4096
+    
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
 	movl	%edi, %ecx
@@ -48,8 +51,8 @@ ENTRY (STRCHR)
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
 	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
 	ja	L(cros_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
@@ -63,45 +66,11 @@ ENTRY (STRCHR)
 	jnz	L(first_vec_x0)
 
 	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
-	cmovne	%rdx, %rax
-# endif
-	VZEROUPPER
-	ret
-
-	.p2align 4
+    andq	$-VEC_SIZE, %rdi
 L(aligned_more):
 	addq	$VEC_SIZE, %rdi
 
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	/* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
 	   since data is only aligned to VEC_SIZE.  */
 	vmovdqa	(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -127,19 +96,9 @@ L(more_4x_vec):
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+    
 	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	addq	$(VEC_SIZE * 3), %rdi
 	andq	$-(4 * VEC_SIZE), %rdi
 
 	.p2align 4
@@ -150,34 +109,61 @@ L(loop_4x_vec):
 	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
+    /* Leaves only CHARS matching esi as 0.  */
+    vpxor %ymm5, %ymm0, %ymm1
+    vpxor %ymm6, %ymm0, %ymm2
+    vpxor %ymm7, %ymm0, %ymm3
+    vpxor %ymm8, %ymm0, %ymm4
 
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
 
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
 
-	vpor	%ymm5, %ymm6, %ymm5
+	VPMINU	%ymm5, %ymm6, %ymm5
 
+    VPCMPEQ %ymm5, %ymm9, %ymm5
 	vpmovmskb %ymm5, %eax
+    addq	$(VEC_SIZE * 4), %rdi
+    
 	testl	%eax, %eax
-	jnz	L(4x_vec_end)
+	jz	L(loop_4x_vec)
 
-	addq	$(VEC_SIZE * 4), %rdi
+    subq	$(VEC_SIZE * 4), %rdi
+
+L(4x_vec_end):
+    VPCMPEQ %ymm1, %ymm9, %ymm1
+	vpmovmskb %ymm1, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x0)
+    VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x1)
+    VPCMPEQ %ymm3, %ymm9, %ymm3
+	vpmovmskb %ymm3, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x2)
+    VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm4, %eax
 
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_STRCHRNUL
+	addq	$(VEC_SIZE * 3), %rdi
+	addq	%rdi, %rax
+# else
+	xorl	%edx, %edx
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
 
+    
 	.p2align 4
 L(first_vec_x0):
 	/* Found CHAR or the null byte.  */
@@ -197,7 +183,7 @@ L(first_vec_x0):
 L(first_vec_x1):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
+	addq	$VEC_SIZE, %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -212,7 +198,7 @@ L(first_vec_x1):
 L(first_vec_x2):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
+	addq	$(VEC_SIZE * 2), %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -223,32 +209,47 @@ L(first_vec_x2):
 	VZEROUPPER
 	ret
 
+    /* Cold case for crossing page with first load.  */
 	.p2align 4
-L(4x_vec_end):
+L(cros_page_boundary):
+	andl	$(VEC_SIZE - 1), %ecx
+	andq	$-VEC_SIZE, %rdi
+	vmovdqu	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	vpmovmskb %ymm4, %eax
+	jnz	L(cros_page_return)
+	
+    /* Second block so that the 3 other blocks from L(aligned_more)
+       will get to next 4 * VEC_SIZE alignment.  */
+    andq	$-VEC_SIZE, %rdi
+    addq	$VEC_SIZE, %rdi
+    xorl    %ecx, %ecx
+    vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
+
+L(cros_page_return):
 	tzcntl	%eax, %eax
+	addq	%rcx, %rax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	leaq	(%rdi, %rax), %rax
 	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+    
 END (STRCHR)
-#endif
+# endif
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] x86: Refactor and improve performance of strchr-avx2.S
  2021-01-20  9:29 [PATCH] x86: Refactor and improve performance of strchr-avx2.S noah via Libc-alpha
@ 2021-01-20 12:47 ` H.J. Lu via Libc-alpha
  2021-01-21  0:12   ` [PATCH v2] " noah via Libc-alpha
  0 siblings, 1 reply; 3+ messages in thread
From: H.J. Lu via Libc-alpha @ 2021-01-20 12:47 UTC (permalink / raw)
  To: noah; +Cc: GNU C Library

On Wed, Jan 20, 2021 at 1:34 AM noah via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> No bug. Just seemed the performance could be improved a bit. Observed
> and expected behavior are unchanged. Optimized body of main
> loop. Updated page cross logic and optimized accordingly. Made a few
> minor instruction selection modifications. No regressions in test
> suite. Both test-strchrnul and test-strchr passed.

Thank you very much.

> Author: noah <goldstein.w.n@gmail.com>
> ---
> Possible improvements fall into 5 categories roughly ordered by
> importance:
>
> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>     from it. This is at the cost of additional port pressure on ports
>     0/1 as vpminub and vpminud can only run on those ports whereas
>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>     make up for that. Analysis either by latency, throughput, or
>     benchmarks shows it is a performance improvement.
>
> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>     the jump label was especially confusing). The original code
>     tested for this by checking if the load would split cache lines
>     not pages. I don't think there are any x86_64 architectures that

The original code has

        /* Check if we may cross page boundary with one vector load.  */
        andl    $(2 * VEC_SIZE - 1), %ecx
        cmpl    $VEC_SIZE, %ecx
        ja      L(cros_page_boundary)

It is just very conservative.  It never fails to catch a page boundary cross.

>     support both AVX2 (Haswell and newer) and don't have a minimum
>     page size of 4kb. Given that the cost of splitting a cacheline
>     appears to be low on CPUs that support AVX2
>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>     is a one-off I don't see it as being critical to avoid. If it is
>     critical to avoid a cacheline split load I think a branchless
>     approach might be better. Thoughts? Note: Given that the check was
>     changed to only be for page crosses, I think it is significantly
>     less likely than before so I moved the branch target from such
>     prime real estate. I am also unsure if there is a PAGE_SIZE define
>     in glibc somewhere I can use instead of defining it here.

Defining PAGE_SIZE as 4096 is fine.

> 3 - What was originally the more_4x_vec label was removed and the
>     code only does 3 vector blocks now. The reasoning is as follows:
>     there are two entry points to that code section, from a page cross
>     or fall through (no page cross). The fall through case is more
>     important to optimize for assuming the point above. In this case
>     the incoming alignments (i.e. alignment of ptr in rdi) mod 128 can
>     be [0 - 31], [32 - 63], [64 - 95], [96 - 127]. Doing 4 vector
>     blocks optimizes for the [96 - 127] so that when the main 4x loop
>     is hit, a new 128 byte aligned segment can be started. Doing 3
>     vector blocks optimizes for the [0 - 31] case. I generally think
>     the string is more likely to be aligned to cache line size (or L2
>     prefetcher cache line pairs) than at [96 - 127] bytes. An
>     alternative would be to make that code do 5x vector blocks. This
>     would mean that at most 2 vector blocks were wasted when
>     realigning to 128 bytes (as opposed to 3x or 4x which can allow
>     for 3 vector blocks to be wasted). Thoughts?

I picked 4x because it was how I unrolled in other string functions.
3x is fine if it is faster than 4x.

> 4 - Replace the sarl using the %cl partial register with sarx. This
>     assumes BMI2 which is Haswell and newer but AVX2 implies Haswell
>     and newer so I think it is safe.

Need to add a BMI2 check in strchr.c:

&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)

>
> 5 - in the first_vec_xN return blocks change the addq from using rax
>     as a destination to rdi as a destination. This just allows for 1
>     uop shorter latency.

Sounds reasonable.

> Benchmarks:
> I can post my benchmarking code in the email thread if that is
> appreciated. I benchmarked a variety of cases with different
> alignments, sizes, and data hotness (in L1, L2, etc.), so I can't just
> give a single "x percent improvement" number. They were also run
> on my personal computer (icelake-client). Of my 2732 test cases 1985
> saw an improvement with these changes, 747 performed better with the
> original code. I should note, however, that my test cases had a
> disproportionate number of cases with 4kb page crosses, which as
> discussed I moved to a colder path.

Please submit a separate patch to add your workload to
benchtests/bench-strstr.c.

> In general the effects of this change are:
>
> Large/medium sized strings (from any part of memory really) see a
> 10-30% performance boost.
> Small strings that are not page crosses see a 10-20% performance boost.
> Small strings that are cache line splits see a 20-30% performance boost.
> Small strings that cross a page boundary (a 4kb page, that is) see a
> 10% performance regression.
>
> No regressions in test suite. Both test-strchrnul and test-strchr
> passed.

It is also used for wcschr and wcschrnul.  Do they work correctly with
your change?

> I would love to hear your feedback on all of these points (or any
> that I missed).
>
> FSF Documentation has been signed and returned (via pub key and email
> respectively)
>
>  sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------
>  1 file changed, 87 insertions(+), 86 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
> index d416558d04..09c2df86d1 100644
> --- a/sysdeps/x86_64/multiarch/strchr-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
> @@ -27,10 +27,12 @@
>  # ifdef USE_AS_WCSCHR
>  #  define VPBROADCAST  vpbroadcastd
>  #  define VPCMPEQ      vpcmpeqd
> +#  define VPMINU       vpminud
>  #  define CHAR_REG     esi
>  # else
>  #  define VPBROADCAST  vpbroadcastb
>  #  define VPCMPEQ      vpcmpeqb
> +#  define VPMINU       vpminub
>  #  define CHAR_REG     sil
>  # endif
>
> @@ -39,7 +41,8 @@
>  # endif
>
>  # define VEC_SIZE 32
> -
> +# define PAGE_SIZE 4096
> +
>         .section .text.avx,"ax",@progbits
>  ENTRY (STRCHR)
>         movl    %edi, %ecx
> @@ -48,8 +51,8 @@ ENTRY (STRCHR)
>         vpxor   %xmm9, %xmm9, %xmm9
>         VPBROADCAST %xmm0, %ymm0
>         /* Check if we may cross page boundary with one vector load.  */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  Please update comments.
> -       andl    $(2 * VEC_SIZE - 1), %ecx
> -       cmpl    $VEC_SIZE, %ecx
> +       andl    $(PAGE_SIZE - 1), %ecx
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %ecx
>         ja      L(cros_page_boundary)

Can you also replace cros_page_boundary with cross_page_boundary?

>         /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
> @@ -63,45 +66,11 @@ ENTRY (STRCHR)
>         jnz     L(first_vec_x0)
>
>         /* Align data for aligned loads in the loop.  */
> -       addq    $VEC_SIZE, %rdi
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -
> -       jmp     L(more_4x_vec)
> -
> -       .p2align 4
> -L(cros_page_boundary):
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -       vmovdqu (%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       /* Remove the leading bytes.  */
> -       sarl    %cl, %eax
> -       testl   %eax, %eax
> -       jz      L(aligned_more)
> -       /* Found CHAR or the null byte.  */
> -       tzcntl  %eax, %eax
> -       addq    %rcx, %rax
> -# ifdef USE_AS_STRCHRNUL
> -       addq    %rdi, %rax
> -# else
> -       xorl    %edx, %edx
> -       leaq    (%rdi, %rax), %rax
> -       cmp     (%rax), %CHAR_REG
> -       cmovne  %rdx, %rax
> -# endif
> -       VZEROUPPER
> -       ret
> -
> -       .p2align 4
> +    andq       $-VEC_SIZE, %rdi
>  L(aligned_more):
>         addq    $VEC_SIZE, %rdi
>
> -L(more_4x_vec):
> -       /* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> +       /* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
>            since data is only aligned to VEC_SIZE.  */
>         vmovdqa (%rdi), %ymm8
>         VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -127,19 +96,9 @@ L(more_4x_vec):
>         testl   %eax, %eax
>         jnz     L(first_vec_x2)
>
> -       vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x3)
> -
> -       addq    $(VEC_SIZE * 4), %rdi
> -
> +
>         /* Align data to 4 * VEC_SIZE.  */
> -       movq    %rdi, %rcx
> -       andl    $(4 * VEC_SIZE - 1), %ecx
> +       addq    $(VEC_SIZE * 3), %rdi
>         andq    $-(4 * VEC_SIZE), %rdi
>
>         .p2align 4
> @@ -150,34 +109,61 @@ L(loop_4x_vec):
>         vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7
>         vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
>
> -       VPCMPEQ %ymm5, %ymm0, %ymm1
> -       VPCMPEQ %ymm6, %ymm0, %ymm2
> -       VPCMPEQ %ymm7, %ymm0, %ymm3
> -       VPCMPEQ %ymm8, %ymm0, %ymm4
> -
> -       VPCMPEQ %ymm5, %ymm9, %ymm5
> -       VPCMPEQ %ymm6, %ymm9, %ymm6
> -       VPCMPEQ %ymm7, %ymm9, %ymm7
> -       VPCMPEQ %ymm8, %ymm9, %ymm8
> +    /* Leaves only CHARS matching esi as 0.  */
> +    vpxor %ymm5, %ymm0, %ymm1
> +    vpxor %ymm6, %ymm0, %ymm2
> +    vpxor %ymm7, %ymm0, %ymm3
> +    vpxor %ymm8, %ymm0, %ymm4
>
> -       vpor    %ymm1, %ymm5, %ymm1
> -       vpor    %ymm2, %ymm6, %ymm2
> -       vpor    %ymm3, %ymm7, %ymm3
> -       vpor    %ymm4, %ymm8, %ymm4
> +       VPMINU  %ymm1, %ymm5, %ymm1
> +       VPMINU  %ymm2, %ymm6, %ymm2
> +       VPMINU  %ymm3, %ymm7, %ymm3
> +       VPMINU  %ymm4, %ymm8, %ymm4
>
> -       vpor    %ymm1, %ymm2, %ymm5
> -       vpor    %ymm3, %ymm4, %ymm6
> +       VPMINU  %ymm1, %ymm2, %ymm5
> +       VPMINU  %ymm3, %ymm4, %ymm6
>
> -       vpor    %ymm5, %ymm6, %ymm5
> +       VPMINU  %ymm5, %ymm6, %ymm5
>
> +    VPCMPEQ %ymm5, %ymm9, %ymm5
>         vpmovmskb %ymm5, %eax
> +    addq       $(VEC_SIZE * 4), %rdi
> +
>         testl   %eax, %eax
> -       jnz     L(4x_vec_end)
> +       jz      L(loop_4x_vec)
>
> -       addq    $(VEC_SIZE * 4), %rdi
> +    subq       $(VEC_SIZE * 4), %rdi
> +
> +L(4x_vec_end):
> +    VPCMPEQ %ymm1, %ymm9, %ymm1
> +       vpmovmskb %ymm1, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x0)
> +    VPCMPEQ %ymm2, %ymm9, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x1)
> +    VPCMPEQ %ymm3, %ymm9, %ymm3
> +       vpmovmskb %ymm3, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x2)
> +    VPCMPEQ %ymm4, %ymm9, %ymm4
> +       vpmovmskb %ymm4, %eax
>
> -       jmp     L(loop_4x_vec)
> +       tzcntl  %eax, %eax
> +# ifdef USE_AS_STRCHRNUL
> +       addq    $(VEC_SIZE * 3), %rdi
> +       addq    %rdi, %rax
> +# else
> +       xorl    %edx, %edx
> +       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       cmp     (%rax), %CHAR_REG
> +       cmovne  %rdx, %rax
> +# endif
> +       VZEROUPPER
> +       ret
>
> +
>         .p2align 4
>  L(first_vec_x0):
>         /* Found CHAR or the null byte.  */
> @@ -197,7 +183,7 @@ L(first_vec_x0):
>  L(first_vec_x1):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $VEC_SIZE, %rax
> +       addq    $VEC_SIZE, %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -212,7 +198,7 @@ L(first_vec_x1):
>  L(first_vec_x2):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 2), %rax
> +       addq    $(VEC_SIZE * 2), %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -223,32 +209,47 @@ L(first_vec_x2):
>         VZEROUPPER
>         ret
>
> +    /* Cold case for crossing page with first load.  */
>         .p2align 4
> -L(4x_vec_end):
> +L(cros_page_boundary):
> +       andl    $(VEC_SIZE - 1), %ecx
> +       andq    $-VEC_SIZE, %rdi
> +       vmovdqu (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
>         vpmovmskb %ymm1, %eax
> +       /* Remove the leading bits.  */
> +       sarxl   %ecx, %eax, %eax
>         testl   %eax, %eax
> -       jnz     L(first_vec_x0)
> -       vpmovmskb %ymm2, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x1)
> -       vpmovmskb %ymm3, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x2)
> -       vpmovmskb %ymm4, %eax
> +       jnz     L(cros_page_return)
> +
> +    /* Second block so that the 3 other blocks from L(aligned_more)
> +       will get to next 4 * VEC_SIZE alignment.  */
> +    andq       $-VEC_SIZE, %rdi
> +    addq       $VEC_SIZE, %rdi
> +    xorl    %ecx, %ecx
> +    vmovdqa    (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %eax
>         testl   %eax, %eax
> -L(first_vec_x3):
> +       jz      L(aligned_more)
> +
> +L(cros_page_return):
>         tzcntl  %eax, %eax
> +       addq    %rcx, %rax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 3), %rax
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> -       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       leaq    (%rdi, %rax), %rax
>         cmp     (%rax), %CHAR_REG
>         cmovne  %rdx, %rax
>  # endif
>         VZEROUPPER
>         ret
> -
> +
>  END (STRCHR)
> -#endif
> +# endif
> --
> 2.29.2
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v2] x86: Refactor and improve performance of strchr-avx2.S
  2021-01-20 12:47 ` H.J. Lu via Libc-alpha
@ 2021-01-21  0:12   ` noah via Libc-alpha
  0 siblings, 0 replies; 3+ messages in thread
From: noah via Libc-alpha @ 2021-01-21  0:12 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n

No bug. Just seemed the performance could be improved a bit.  Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. test-strchrnul, test-strchr, test-wcschrnul, and test-wcschr
all passed.

Author: noah <goldstein.w.n@gmail.com>
---
>>
>> No bug. Just seemed the performance could be improved a bit. Observed
>> and expected behavior are unchanged. Optimized body of main
>> loop. Updated page cross logic and optimized accordingly. Made a few
>> minor instruction selection modifications. No regressions in test
>> suite. Both test-strchrnul and test-strchr passed.
>
>Thank you very much.
>

:) This is my first patch (hopefully). It's exciting!

>> Author: noah <goldstein.w.n@gmail.com>
>> ---
>> Possible improvements fall into 5 categories roughly ordered by
>> importance:
>>
>> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>>     from it. This is at the cost of additional port pressure on ports
>>     0/1 as vpminub and vpminud can only run on those ports whereas
>>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>>     make up for that. Analysis either by latency, throughput, or
>>     benchmarks shows it is a performance improvement.
>>
>> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>>     the jump label was especially confusing). The original code
>>     tested for this by checking if the load would split cache lines
>>     not pages. I don't think there are any x86_64 architectures that
>
>The original code has
>
>        /* Check if we may cross page boundary with one vector load.  */
>        andl    $(2 * VEC_SIZE - 1), %ecx
>        cmpl    $VEC_SIZE, %ecx
>        ja      L(cros_page_boundary)
>
>It is just very conservative.  It never fails to catch a page boundary cross.
>

I see. Got it.

>>     support both AVX2 (Haswell and newer) and don't have a minimum
>>     page size of 4kb. Given that the cost of splitting a cacheline
>>     appears to be low on CPUs that support AVX2
>>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>>     is a one-off I don't see it as being critical to avoid. If it is
>>     critical to avoid a cacheline split load I think a branchless
>>     approach might be better. Thoughts? Note: Given that the check was
>>     changed to only be for page crosses, I think it is significantly
>>     less likely than before so I moved the branch target from such
>>     prime real estate. I am also unsure if there is a PAGE_SIZE define
>>     in glibc somewhere I can use instead of defining it here.
>
>Defining PAGE_SIZE as 4096 is fine.
>

Got it.

>> 3 - What was originally the more_4x_vec label was removed and the
>>     code only does 3 vector blocks now. The reasoning is as follows:
>>     there are two entry points to that code section, from a page cross
>>     or fall through (no page cross). The fall through case is more
>>     important to optimize for assuming the point above. In this case
>>     the incoming alignments (i.e. alignment of ptr in rdi) mod 128 can
>>     be [0 - 31], [32 - 63], [64 - 95], [96 - 127]. Doing 4 vector
>>     blocks optimizes for the [96 - 127] so that when the main 4x loop
>>     is hit, a new 128 byte aligned segment can be started. Doing 3
>>     vector blocks optimizes for the [0 - 31] case. I generally think
>>     the string is more likely to be aligned to cache line size (or L2
>>     prefetcher cache line pairs) than at [96 - 127] bytes. An
>>     alternative would be to make that code do 5x vector blocks. This
>>     would mean that at most 2 vector blocks were wasted when
>>     realigning to 128 bytes (as opposed to 3x or 4x which can allow
>>     for 3 vector blocks to be wasted). Thoughts?
>
>I picked 4x because it was how I unrolled in other string functions.
>3x is fine if it is faster than 4x.
>

I see. What are your thoughts on doing 5x? With 5x the worst case is
that 2 vector blocks get wasted on realignment, though it will add
overhead for A) getting to the 4x loop and B) the [0 - 31], [32 - 63],
and [96 - 127] initial alignment cases.

>> 4 - Replace the sarl using the %cl partial register with sarx. This
>>     assumes BMI2 which is Haswell and newer but AVX2 implies Haswell
>>     and newer so I think it is safe.
>
>Need to add a BMI2 check in strchr.c:
>
>&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)
>

Done.

>>
>> 5 - in the first_vec_xN return blocks change the addq from using rax
>>     as a destination to rdi as a destination. This just allows for 1
>>     uop shorter latency.
>
>Sounds reasonable.
>

I forgot to mention in the original email: I also removed a few
unnecessary instructions on ecx (it was being realigned with rdi but
never used).

>> Benchmarks:
>> I can post my benchmarking code in the email thread if that is
>> appreciated. I benchmarked a variety of cases with different
>> alignments, sizes, and data hotness (in L1, L2, etc.), so I can't just
>> give a single "x percent improvement" number. They were also run
>> on my personal computer (icelake-client). Of my 2732 test cases 1985
>> saw an improvement with these changes, 747 performed better with the
>> original code. I should note, however, that my test cases had a
>> disproportionate number of cases with 4kb page crosses, which as
>> discussed I moved to a colder path.
>
>Please submit a separate patch to add your workload to
>benchtests/bench-strstr.c.
>

Got it. A few questions. Should the benchmark have a single return
value to indicate pass/failure (for a performance regression), or is
the idea just to output times so it can be used as a tool for future
patches? Second, is the coding standard for benchmarks the same as for
production code, or are the rules more relaxed? Last, should there be a
separate benchmark for strchr, strchrnul, wcschr, and wcschrnul, or is
having them all in one file with #defines the best approach?

>> In general the effects of this change are:
>>
>> Large/medium sized strings (from any part of memory really) see a
>> 10-30% performance boost.
>> Small strings that are not page crosses see a 10-20% performance boost.
>> Small strings that are cache line splits see a 20-30% performance boost.
>> Small strings that cross a page boundary (a 4kb page, that is) see a
>> 10% performance regression.
>>
>> No regressions in test suite. Both test-strchrnul and test-strchr
>> passed.
>
>It is also used for wcschr and wcschrnul.  Do they work correctly with
>your change?
>

Yes. Updated the commit message to indicate that those tests pass as
well. Updated my benchmark as well to test those for performance.

>> I would love to hear your feedback on all of these points (or any
>> that I missed).
>>
>> FSF Documentation has been signed and returned (via pub key and email
>> respectively)
 sysdeps/x86_64/multiarch/strchr-avx2.S | 177 +++++++++++++------------
 sysdeps/x86_64/multiarch/strchr.c      |   2 +
 2 files changed, 91 insertions(+), 88 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..628242c483 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,7 +41,8 @@
 # endif
 
 # define VEC_SIZE 32
-
+# define PAGE_SIZE 4096
+    
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
 	movl	%edi, %ecx
@@ -47,10 +50,10 @@ ENTRY (STRCHR)
 	vmovd	%esi, %xmm0
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
+	/* Check if we cross page boundary with one vector load.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
+	ja	L(cross_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
 	   null byte.  */
@@ -63,45 +66,11 @@ ENTRY (STRCHR)
 	jnz	L(first_vec_x0)
 
 	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
-	cmovne	%rdx, %rax
-# endif
-	VZEROUPPER
-	ret
-
-	.p2align 4
+    andq	$-VEC_SIZE, %rdi
 L(aligned_more):
 	addq	$VEC_SIZE, %rdi
 
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	/* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
 	   since data is only aligned to VEC_SIZE.  */
 	vmovdqa	(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -127,19 +96,9 @@ L(more_4x_vec):
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+    
 	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	addq	$(VEC_SIZE * 3), %rdi
 	andq	$-(4 * VEC_SIZE), %rdi
 
 	.p2align 4
@@ -150,34 +109,61 @@ L(loop_4x_vec):
 	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
+    /* Leaves only CHARS matching esi as 0.  */
+    vpxor %ymm5, %ymm0, %ymm1
+    vpxor %ymm6, %ymm0, %ymm2
+    vpxor %ymm7, %ymm0, %ymm3
+    vpxor %ymm8, %ymm0, %ymm4
 
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
 
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
 
-	vpor	%ymm5, %ymm6, %ymm5
+	VPMINU	%ymm5, %ymm6, %ymm5
 
+    VPCMPEQ %ymm5, %ymm9, %ymm5
 	vpmovmskb %ymm5, %eax
+    addq	$(VEC_SIZE * 4), %rdi
+    
 	testl	%eax, %eax
-	jnz	L(4x_vec_end)
+	jz	L(loop_4x_vec)
 
-	addq	$(VEC_SIZE * 4), %rdi
+    subq	$(VEC_SIZE * 4), %rdi
+
+L(4x_vec_end):
+    VPCMPEQ %ymm1, %ymm9, %ymm1
+	vpmovmskb %ymm1, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x0)
+    VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x1)
+    VPCMPEQ %ymm3, %ymm9, %ymm3
+	vpmovmskb %ymm3, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x2)
+    VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm4, %eax
 
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_STRCHRNUL
+	addq	$(VEC_SIZE * 3), %rdi
+	addq	%rdi, %rax
+# else
+	xorl	%edx, %edx
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
 
+    
 	.p2align 4
 L(first_vec_x0):
 	/* Found CHAR or the null byte.  */
@@ -197,7 +183,7 @@ L(first_vec_x0):
 L(first_vec_x1):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
+	addq	$VEC_SIZE, %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -212,7 +198,7 @@ L(first_vec_x1):
 L(first_vec_x2):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
+	addq	$(VEC_SIZE * 2), %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -223,32 +209,47 @@ L(first_vec_x2):
 	VZEROUPPER
 	ret
 
+    /* Cold case for crossing page with first load.  */
 	.p2align 4
-L(4x_vec_end):
+L(cross_page_boundary):
+	andl	$(VEC_SIZE - 1), %ecx
+	andq	$-VEC_SIZE, %rdi
+	vmovdqu	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	vpmovmskb %ymm4, %eax
+	jnz	L(cross_page_return)
+	
+    /* Second block so that the 3 other blocks from L(aligned_more)
+       will get to next 4 * VEC_SIZE alignment.  */
+    andq	$-VEC_SIZE, %rdi
+    addq	$VEC_SIZE, %rdi
+    xorl    %ecx, %ecx
+    vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
+
+L(cross_page_return):
 	tzcntl	%eax, %eax
+	addq	%rcx, %rax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	leaq	(%rdi, %rax), %rax
 	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+    
 END (STRCHR)
-#endif
+# endif
diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
index 583a152794..77ff44b8f3 100644
--- a/sysdeps/x86_64/multiarch/strchr.c
+++ b/sysdeps/x86_64/multiarch/strchr.c
@@ -37,6 +37,8 @@ IFUNC_SELECTOR (void)
 
   if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
       && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     return OPTIMIZE (avx2);
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-21  0:28 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20  9:29 [PATCH] x86: Refactor and improve performance of strchr-avx2.S noah via Libc-alpha
2021-01-20 12:47 ` H.J. Lu via Libc-alpha
2021-01-21  0:12   ` [PATCH v2] " noah via Libc-alpha
