unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v3 00/18] Improve generic string routines
@ 2018-01-10 12:47 Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 01/18] Parameterize op_t from memcopy.h Adhemerval Zanella
                   ` (18 more replies)
  0 siblings, 19 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha

It is an update of previous Richard's patchset [1] to provide generic 
string implementation for newer ports and make them only focus on 
just specific routines to get a better overall improvement.
It is done by:

  1. parametrizing the internal routines (for instance the find zero
     in a word) so each architecture can reimplement without the need
     to reimplement the whole routine.

  2. vectorizing more string implementations (for instance strcpy 
     and strcmp).

  3. Change some implementations to use already possible optimized
     ones (for instance strnlen).  It makes new ports to focus on
     only provide optimized implementation of a hardful symbols
     (for instance memchr) and make its improvement to be used in
     a larger set of routines.

For the rest of #5806 I think we can handle them later and if 
performance of generic implementation is closer I think it is better
to just remove old assembly implementations.

I also checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly 
implementation and disabling multiarch (it covers both LE and BE
for 64 and 32 bits). I also checked the string routines on alpha, hppa,
and sh.

Changes since v2:

  * Move string-fz{a,b,i} to its own patch.
 
  * Add a inline implementation for __builtin_c{l,t}z to avoid using
    compiler provided symbols.

  * Add a new header, string-maskoff.h, to handle unaligned accesses
    on some implementation.

  * Fixed strcmp on LE machines.

  * Added a unaligned strcpy variant for architecture that define
    _STRING_ARCH_unaligned.

  * Add SH string-fzb.h (which uses cmp/str instruction to find
    a zero in word).

Changes since v1:

  * Marked ChangeLog entries with [BZ #5806], as appropriate.

  * Reorganized the headers, so that armv6t2 and power6 need override
    as little as possible to use their (integer) zero detection insns.

  * Hopefully fixed all of the coding style issues.

  * Adjusted the memrchr algorithm as discussed.

  * Replaced the #ifdef STRRCHR etc that are used by the multiarch files.

  * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
    ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
    aarch64, alpha (qemu) and hppa (qemu).

PS: This patchset is aimed for 2.28.

[1] https://sourceware.org/ml/libc-alpha/2016-12/msg00830.html

Adhemerval Zanella (5):
  Add string-maskoff.h generic header
  Add string vectorized find and detection functions
  string: Improve generic strnlen
  string: Improve generic strcpy
  sh: Add string-fzb.h

Richard Henderson (13):
  Parameterize op_t from memcopy.h
  Parameterize OP_T_THRES from memcopy.h
  string: Improve generic strlen
  string: Improve generic memchr
  string: Improve generic memrchr
  string: Improve generic strchr
  string: Improve generic strchrnul
  string: Improve generic strcmp
  hppa: Add memcopy.h
  hppa: Add string-fzb.h and string-fzi.h
  alpha: Add string-fzb.h and string-fzi.h
  arm: Add string-fza.h
  powerpc: Add string-fza.h

 config.h.in                                        |   8 +
 configure                                          |  54 ++++++
 configure.ac                                       |  34 ++++
 string/memchr.c                                    | 157 ++++-----------
 string/memcmp.c                                    |   4 -
 string/memrchr.c                                   | 193 ++++--------------
 string/strchr.c                                    | 166 +++-------------
 string/strchrnul.c                                 | 146 +++-----------
 string/strcmp.c                                    |  97 +++++++++-
 string/strcpy.c                                    | 109 ++++++++++-
 string/strlen.c                                    |  83 ++------
 string/strnlen.c                                   | 139 +------------
 string/test-strcpy.c                               |  24 ++-
 sysdeps/alpha/string-fzb.h                         |  51 +++++
 sysdeps/alpha/string-fzi.h                         | 113 +++++++++++
 sysdeps/arm/armv6t2/string-fza.h                   |  69 +++++++
 sysdeps/generic/memcopy.h                          |  11 +-
 sysdeps/generic/string-extbyte.h                   |  35 ++++
 sysdeps/generic/string-fza.h                       | 117 +++++++++++
 sysdeps/generic/string-fzb.h                       |  49 +++++
 sysdeps/generic/string-fzi.h                       | 215 +++++++++++++++++++++
 sysdeps/generic/string-maskoff.h                   |  64 ++++++
 sysdeps/generic/string-opthr.h                     |  25 +++
 sysdeps/generic/string-optype.h                    |  31 +++
 sysdeps/hppa/memcopy.h                             |  44 +++++
 sysdeps/hppa/string-fzb.h                          |  69 +++++++
 sysdeps/hppa/string-fzi.h                          | 135 +++++++++++++
 sysdeps/i386/i686/multiarch/memrchr-c.c            |   2 +
 sysdeps/i386/i686/multiarch/strnlen-c.c            |  19 +-
 sysdeps/i386/memcopy.h                             |   3 -
 sysdeps/i386/string-opthr.h                        |  25 +++
 sysdeps/m68k/memcopy.h                             |   3 -
 sysdeps/powerpc/power6/string-fza.h                |  65 +++++++
 sysdeps/powerpc/powerpc32/power4/memcopy.h         |   5 -
 .../powerpc32/power4/multiarch/strnlen-ppc32.c     |  19 +-
 sysdeps/powerpc/powerpc32/power6/string-fza.h      |   1 +
 sysdeps/powerpc/powerpc64/power6/string-fza.h      |   1 +
 sysdeps/s390/multiarch/memrchr-c.c                 |   2 +
 sysdeps/s390/multiarch/strchr-c.c                  |   1 +
 sysdeps/s390/multiarch/strnlen-c.c                 |  18 +-
 sysdeps/sh/string-fzb.h                            |  53 +++++
 sysdeps/tile/memcmp.c                              |   1 -
 sysdeps/tile/memcopy.h                             |   7 -
 sysdeps/tile/tilegx32/gmp-mparam.h                 |  30 +++
 44 files changed, 1707 insertions(+), 790 deletions(-)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h
 create mode 100644 sysdeps/generic/string-maskoff.h
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h
 create mode 100644 sysdeps/i386/string-opthr.h
 create mode 100644 sysdeps/powerpc/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc32/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc64/power6/string-fza.h
 create mode 100644 sysdeps/sh/string-fzb.h
 create mode 100644 sysdeps/tile/tilegx32/gmp-mparam.h

-- 
2.7.4



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 01/18] Parameterize op_t from memcopy.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-11 13:27   ` Joseph Myers
  2018-01-10 12:47 ` [PATCH v3 02/18] Parameterize OP_T_THRES " Adhemerval Zanella
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

Basically moves op_t definition out to an specific header, adds
the attribute 'may-alias', and cleanup its duplicated definitions.
It lead to inclusion of tilegx32 gmp-mparam.h similar to x32 so
op_t can be define as a long long (from _LONG_LONG_LIMB).

Checked with a build and check with run-built-tests=no for all major
Linux ABIs (alpha, aarch64, arm, hppa, i686, ia64, m68k, microblaze,
mips, mips64, nios2, powerpc, powerpc64le, s390x, sh4, sparc64,
tilegx, and x86_64).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* sysdeps/generic/string-optype.h: New file.
	* sysdeps/generic/memcopy.h: Include it.
	* string/memcmp.c (op_t): Remove define.
	* sysdeps/tile/memcmp.c (op_t): Likewise.
	* sysdeps/tile/memcopy.h (op_t): Likewise.
	* sysdeps/tile/tilegx32/gmp-mparam.h: New file.
---
 string/memcmp.c                    |  1 -
 sysdeps/generic/memcopy.h          |  7 +++----
 sysdeps/generic/string-optype.h    | 31 +++++++++++++++++++++++++++++++
 sysdeps/tile/memcmp.c              |  1 -
 sysdeps/tile/memcopy.h             |  7 -------
 sysdeps/tile/tilegx32/gmp-mparam.h | 30 ++++++++++++++++++++++++++++++
 6 files changed, 64 insertions(+), 13 deletions(-)
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/tile/tilegx32/gmp-mparam.h

diff --git a/string/memcmp.c b/string/memcmp.c
index aea5129..4fd2f83 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -46,7 +46,6 @@
 /* Type to use for aligned memory operations.
    This should normally be the biggest type supported by a single load
    and store.  Must be an unsigned type.  */
-# define op_t	unsigned long int
 # define OPSIZ	(sizeof(op_t))
 
 /* Threshold value for when to enter the unrolled loops.  */
diff --git a/sysdeps/generic/memcopy.h b/sysdeps/generic/memcopy.h
index c0d8da3..c7e9cc9 100644
--- a/sysdeps/generic/memcopy.h
+++ b/sysdeps/generic/memcopy.h
@@ -56,10 +56,9 @@
      [I fail to understand.  I feel stupid.  --roland]
 */
 
-/* Type to use for aligned memory operations.
-   This should normally be the biggest type supported by a single load
-   and store.  */
-#define	op_t	unsigned long int
+/* Type to use for aligned memory operations.  */
+#include <string-optype.h>
+
 #define OPSIZ	(sizeof(op_t))
 
 /* Type to use for unaligned operations.  */
diff --git a/sysdeps/generic/string-optype.h b/sysdeps/generic/string-optype.h
new file mode 100644
index 0000000..1324070
--- /dev/null
+++ b/sysdeps/generic/string-optype.h
@@ -0,0 +1,31 @@
+/* Define a type to use for word access.  Generic version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_OPTYPE_H
+#define STRING_OPTYPE_H 1
+
+/* Use the existing parameterization from gmp as a default.  */
+#include <gmp-mparam.h>
+
+#ifdef _LONG_LONG_LIMB
+typedef unsigned long long int __attribute__((__may_alias__)) op_t;
+#else
+typedef unsigned long int __attribute__((__may_alias__)) op_t;
+#endif
+
+#endif /* string-optype.h */
diff --git a/sysdeps/tile/memcmp.c b/sysdeps/tile/memcmp.c
index b7cf00a..89fff57 100644
--- a/sysdeps/tile/memcmp.c
+++ b/sysdeps/tile/memcmp.c
@@ -45,7 +45,6 @@
 /* Type to use for aligned memory operations.
    This should normally be the biggest type supported by a single load
    and store.  Must be an unsigned type.  */
-# define op_t	unsigned long int
 # define OPSIZ	(sizeof(op_t))
 
 /* Threshold value for when to enter the unrolled loops.  */
diff --git a/sysdeps/tile/memcopy.h b/sysdeps/tile/memcopy.h
index 0c357c1..748f648 100644
--- a/sysdeps/tile/memcopy.h
+++ b/sysdeps/tile/memcopy.h
@@ -22,10 +22,3 @@
 /* The tilegx implementation of memcpy is safe to use for memmove.  */
 #undef MEMCPY_OK_FOR_FWD_MEMMOVE
 #define MEMCPY_OK_FOR_FWD_MEMMOVE 1
-
-/* Support more efficient copying on tilegx32, which supports
-   long long as a native 64-bit type.  */
-#if __WORDSIZE == 32
-# undef op_t
-# define op_t	unsigned long long int
-#endif
diff --git a/sysdeps/tile/tilegx32/gmp-mparam.h b/sysdeps/tile/tilegx32/gmp-mparam.h
new file mode 100644
index 0000000..7d1cb98
--- /dev/null
+++ b/sysdeps/tile/tilegx32/gmp-mparam.h
@@ -0,0 +1,30 @@
+/* Compiler/machine parameter header file.  TileGX32 version.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if defined __GMP_H__ && ! defined _LONG_LONG_LIMB
+#error "Included too late for _LONG_LONG_LIMB to take effect"
+#endif
+
+#define _LONG_LONG_LIMB
+#define BITS_PER_MP_LIMB 64
+#define BYTES_PER_MP_LIMB 8
+#define BITS_PER_LONGINT 32
+#define BITS_PER_INT 32
+#define BITS_PER_SHORTINT 16
+#define BITS_PER_CHAR 8
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 02/18] Parameterize OP_T_THRES from memcopy.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 01/18] Parameterize op_t from memcopy.h Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 03/18] Add string-maskoff.h generic header Adhemerval Zanella
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

Basically it moves OP_T_THRES out of memcopy.h to its own header
and adjust each architecture that redefines it.

Checked with a build and check with run-built-tests=no for all major
Linux ABIs (alpha, aarch64, arm, hppa, i686, ia64, m68k, microblaze,
mips, mips64, nios2, powerpc, powerpc64le, s390x, sh4, sparc64,
tilegx, and x86_64).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* sysdeps/generic/memcopy.h (OP_T_THRES): Move...
	* sysdeps/generic/string-opthr.h: ... here; new file.
	* sysdeps/i386/memcopy.h (OP_T_THRES): Move...
	* sysdeps/i386/string-opthr.h: ... here; new file.
	* sysdeps/m68k/memcopy.h (OP_T_THRES): Remove.
	* string/memcmp.c (OP_T_THRES): Remove definition.
	* sysdeps/powerpc/powerpc32/power4/memcopy.h (OP_T_THRES): Likewise.
---
 string/memcmp.c                            |  3 ---
 sysdeps/generic/memcopy.h                  |  4 +---
 sysdeps/generic/string-opthr.h             | 25 +++++++++++++++++++++++++
 sysdeps/i386/memcopy.h                     |  3 ---
 sysdeps/i386/string-opthr.h                | 25 +++++++++++++++++++++++++
 sysdeps/m68k/memcopy.h                     |  3 ---
 sysdeps/powerpc/powerpc32/power4/memcopy.h |  5 -----
 7 files changed, 51 insertions(+), 17 deletions(-)
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/i386/string-opthr.h

diff --git a/string/memcmp.c b/string/memcmp.c
index 4fd2f83..82ad082 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -48,9 +48,6 @@
    and store.  Must be an unsigned type.  */
 # define OPSIZ	(sizeof(op_t))
 
-/* Threshold value for when to enter the unrolled loops.  */
-# define OP_T_THRES	16
-
 /* Type to use for unaligned operations.  */
 typedef unsigned char byte;
 
diff --git a/sysdeps/generic/memcopy.h b/sysdeps/generic/memcopy.h
index c7e9cc9..1698379 100644
--- a/sysdeps/generic/memcopy.h
+++ b/sysdeps/generic/memcopy.h
@@ -58,6 +58,7 @@
 
 /* Type to use for aligned memory operations.  */
 #include <string-optype.h>
+#include <string-opthr.h>
 
 #define OPSIZ	(sizeof(op_t))
 
@@ -190,9 +191,6 @@ extern void _wordcopy_bwd_dest_aligned (long int, long int, size_t)
 
 #endif
 
-/* Threshold value for when to enter the unrolled loops.  */
-#define	OP_T_THRES	16
-
 /* Set to 1 if memcpy is safe to use for forward-copying memmove with
    overlapping addresses.  This is 0 by default because memcpy implementations
    are generally not safe for overlapping addresses.  */
diff --git a/sysdeps/generic/string-opthr.h b/sysdeps/generic/string-opthr.h
new file mode 100644
index 0000000..17fa627
--- /dev/null
+++ b/sysdeps/generic/string-opthr.h
@@ -0,0 +1,25 @@
+/* Define a threshold for word access.  Generic version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_OPTHR_H
+#define STRING_OPTHR_H 1
+
+/* Threshold value for when to enter the unrolled loops.  */
+#define OP_T_THRES	16
+
+#endif /* string-opthr.h */
diff --git a/sysdeps/i386/memcopy.h b/sysdeps/i386/memcopy.h
index 12bb39f..28cee47 100644
--- a/sysdeps/i386/memcopy.h
+++ b/sysdeps/i386/memcopy.h
@@ -19,9 +19,6 @@
 
 #include <sysdeps/generic/memcopy.h>
 
-#undef	OP_T_THRES
-#define	OP_T_THRES	8
-
 #undef	BYTE_COPY_FWD
 #define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)				      \
   do {									      \
diff --git a/sysdeps/i386/string-opthr.h b/sysdeps/i386/string-opthr.h
new file mode 100644
index 0000000..ed3e4b2
--- /dev/null
+++ b/sysdeps/i386/string-opthr.h
@@ -0,0 +1,25 @@
+/* Define a threshold for word access.  i386 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef I386_STRING_OPTHR_H
+#define I386_STRING_OPTHR_H 1
+
+/* Threshold value for when to enter the unrolled loops.  */
+#define OP_T_THRES	8
+
+#endif /* I386_STRING_OPTHR_H */
diff --git a/sysdeps/m68k/memcopy.h b/sysdeps/m68k/memcopy.h
index 58569c6..ee0c5fc 100644
--- a/sysdeps/m68k/memcopy.h
+++ b/sysdeps/m68k/memcopy.h
@@ -21,9 +21,6 @@
 
 #if	defined(__mc68020__) || defined(mc68020)
 
-#undef	OP_T_THRES
-#define	OP_T_THRES	16
-
 /* WORD_COPY_FWD and WORD_COPY_BWD are not symmetric on the 68020,
    because of its weird instruction overlap characteristics.  */
 
diff --git a/sysdeps/powerpc/powerpc32/power4/memcopy.h b/sysdeps/powerpc/powerpc32/power4/memcopy.h
index 8050abc..37ed40b 100644
--- a/sysdeps/powerpc/powerpc32/power4/memcopy.h
+++ b/sysdeps/powerpc/powerpc32/power4/memcopy.h
@@ -51,11 +51,6 @@
      [I fail to understand.  I feel stupid.  --roland]
 */
 
-
-/* Threshold value for when to enter the unrolled loops.  */
-#undef	OP_T_THRES
-#define OP_T_THRES 16
-
 /* Copy exactly NBYTES bytes from SRC_BP to DST_BP,
    without any assumptions about alignment of the pointers.  */
 #undef BYTE_COPY_FWD
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 03/18] Add string-maskoff.h generic header
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 01/18] Parameterize op_t from memcopy.h Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 02/18] Parameterize OP_T_THRES " Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 23:24   ` Paul Eggert
  2018-01-11 13:29   ` Joseph Myers
  2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha

Macros to operate on unaligned access for string operations:

  - create_mask: create a mask based on pointer alignment to sets up
    non-zero bytes before the beginning of the word so a following
    operation (such as find zero) might ignore these bytes.

  - highbit_mask: create a mask with high bit of each byte being 1,
    and the low 7 bits being all the opposite of the input.

These macros are meant to be used on optimized vectorized string
implementations.

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* sysdeps/generic/string-maskoff.h: New file.
---
 sysdeps/generic/string-maskoff.h | 64 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 sysdeps/generic/string-maskoff.h

diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
new file mode 100644
index 0000000..6231798
--- /dev/null
+++ b/sysdeps/generic/string-maskoff.h
@@ -0,0 +1,64 @@
+/* Mask off bits.  Generic C version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_MASKOFF_H
+#define STRING_MASKOFF_H 1
+
+#include <endian.h>
+#include <stdint.h>
+#include <string-optype.h>
+
+/* Provide a mask based on the pointer alignment that sets up non-zero
+   bytes before the beginning of the word.  It is used to mask off
+   undesirable bits from an aligned read from an unaligned pointer.
+   For instance, on a 64 bits machine with a pointer alignment of
+   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
+   (meaning to mask off the initial 3 bytes).  */
+static inline op_t
+create_mask (uintptr_t i)
+{
+  i = i % sizeof (op_t);
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return ~(((op_t)-1) << (i * CHAR_BIT));
+  else
+    return ~(((op_t)-1) >> (i * CHAR_BIT));
+}
+
+/* Setup an word with each byte being c_in.  For instance, on a 64 bits
+   machine with input as 0xce the functions returns 0xcececececececece.  */
+static inline op_t
+repeat_bytes (unsigned char c_in)
+{
+  return ((op_t)-1 / 0xff) * c_in;
+}
+
+/* Create a mask with high bit of each byte being 1, and the low 7 bits
+   being all the opposite of the input mask.  It is used to mask off
+   undesirable bits from an aligned read from an unaligned pointer,
+   and also taking care to avoid match possible bytes meant to be
+   matched.  For instance, on a 64 bits machine with a pointer alignment
+   of 3 the function returns 0x7f7f7f0000000000 (input meant to
+   be 0xffffff0000000000) for BE and 0x00000000007f7f7f for LE (input
+   meant to be 0x0000000000ffffff).  */
+static inline op_t
+highbit_mask (op_t m)
+{
+  return m & ~repeat_bytes (0x80);
+}
+
+#endif /* STRING_MASKOFF_H  */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 03/18] Add string-maskoff.h generic header Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-11 13:34   ` Joseph Myers
                     ` (2 more replies)
  2018-01-10 12:47 ` [PATCH v3 05/18] string: Improve generic strlen Adhemerval Zanella
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha

This patch adds generic string find and detection implementation meant
to be used in generic vectorized string implementation.  The idea is to
decompose the basic string operation so each architecture can reimplement
if it provides any specialized hardware instruction.

The 'string-fza.h' provides zero byte detection functions (find_zero_low,
find_zero_all, find_eq_low, find_eq_all, find_zero_eq_low, find_zero_eq_all,
find_zero_ne_low, and find_zero_ne_all).  They are used on both functions
provided by 'string-fzb.h' and 'string-fzi'.

The 'string-fzb.h' provides boolean zero byte detection with the
functions:

  - has_zero: determine if any byte within a word is zero.
  - has_eq: determine byte equality between two words.
  - has_zero_eq: determine if any byte within a word is zero along with
    byte equality between two words.

The 'string-fzi.h' provides zero byte detection along with its positions:

  - index_first_zero: return index of first zero byte within a word.
  - index_first_eq: return index of first byte different between two words.
  - index_first_zero_eq: return index of first zero byte within a word or
    first byte different between two words.
  - index_first_zero_ne: return index of first zero byte within a word or
    first byte equal between two words.
  - index_last_zero: return index of last zero byte within a word.
  - index_last_eq: return index of last byte different between two words.

Also, to avoid libcalls in the '__builtin_c{t,l}z{l}' calls (which may
add performance degradation), inline implementation based on De Bruijn
sequences are added (enabled by a configure check).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* config.h.in (HAVE_BUILTIN_CTZ, HAVE_BUILTIN_CLZ): New defines.
	* configure.ac: Check for __builtin_ctz{l} with no external
	dependencies
	* sysdeps/generic/string-extbyte.h: New file.
	* sysdeps/generic/string-fza.h: Likewise.
	* sysdeps/generic/string-fzb.h: Likewise.
	* sysdeps/generic/string-fzi.h: Likewise.
---
 config.h.in                      |   8 ++
 configure                        |  54 ++++++++++
 configure.ac                     |  34 +++++++
 sysdeps/generic/string-extbyte.h |  35 +++++++
 sysdeps/generic/string-fza.h     | 117 +++++++++++++++++++++
 sysdeps/generic/string-fzb.h     |  49 +++++++++
 sysdeps/generic/string-fzi.h     | 215 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 512 insertions(+)
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h

diff --git a/config.h.in b/config.h.in
index d928e7d..03bcfe6 100644
--- a/config.h.in
+++ b/config.h.in
@@ -245,4 +245,12 @@
    in i386 6 argument syscall issue).  */
 #define CAN_USE_REGISTER_ASM_EBP 0
 
+/* If compiler supports __builtin_ctz{l} without any external depedencies
+   (libgcc for instance).  */
+#define HAVE_BUILTIN_CTZ 0
+
+/* If compiler supports __builtin_clz{l} without any external depedencies
+   (libgcc for instance).  */
+#define HAVE_BUILTIN_CLZ 0
+
 #endif
diff --git a/configure b/configure
index 7a8bd3f..ff4464f 100755
--- a/configure
+++ b/configure
@@ -6592,6 +6592,60 @@ if test $libc_cv_builtin_trap = yes; then
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz{l} with no external dependencies" >&5
+$as_echo_n "checking for __builtin_ctz{l} with no external dependencies... " >&6; }
+if ${libc_cv_builtin_ctz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  libc_cv_builtin_ctz=yes
+echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
+if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_ctz=no
+  fi
+fi
+rm -f conftest.c conftest.s
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_ctz" >&5
+$as_echo "$libc_cv_builtin_ctz" >&6; }
+if test x$libc_cv_builtin_ctz = xyes; then
+  $as_echo "#define HAVE_BUILTIN_CTZ 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clz{l} with no external dependencies" >&5
+$as_echo_n "checking for __builtin_clz{l} with no external dependencies... " >&6; }
+if ${libc_cv_builtin_clz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  libc_cv_builtin_clz=yes
+echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
+if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_clz=no
+  fi
+fi
+rm -f conftest.c conftest.s
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_clz" >&5
+$as_echo "$libc_cv_builtin_clz" >&6; }
+if test x$libc_cv_builtin_clz = xyes; then
+  $as_echo "#define HAVE_BUILTIN_CLZ 1" >>confdefs.h
+
+fi
+
 ac_ext=cpp
 ac_cpp='$CXXCPP $CPPFLAGS'
 ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
diff --git a/configure.ac b/configure.ac
index ca1282a..7f9c9f8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1675,6 +1675,40 @@ if test $libc_cv_builtin_trap = yes; then
   AC_DEFINE([HAVE_BUILTIN_TRAP])
 fi
 
+AC_CACHE_CHECK(for __builtin_ctz{l} with no external dependencies,
+	       libc_cv_builtin_ctz, [dnl
+libc_cv_builtin_ctz=yes
+echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
+if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
+changequote(,)dnl
+  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_ctz=no
+  fi
+changequote([,])dnl
+fi
+rm -f conftest.c conftest.s
+])
+if test x$libc_cv_builtin_ctz = xyes; then
+  AC_DEFINE(HAVE_BUILTIN_CTZ)
+fi
+
+AC_CACHE_CHECK(for __builtin_clz{l} with no external dependencies,
+	       libc_cv_builtin_clz, [dnl
+libc_cv_builtin_clz=yes
+echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
+if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
+changequote(,)dnl
+  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_clz=no
+  fi
+changequote([,])dnl
+fi
+rm -f conftest.c conftest.s
+])
+if test x$libc_cv_builtin_clz = xyes; then
+  AC_DEFINE(HAVE_BUILTIN_CLZ)
+fi
+
 dnl C++ feature tests.
 AC_LANG_PUSH([C++])
 
diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
new file mode 100644
index 0000000..69a78ce
--- /dev/null
+++ b/sysdeps/generic/string-extbyte.h
@@ -0,0 +1,35 @@
+/* Extract by from memory word.  Generic C version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_EXTBYTE_H
+#define STRING_EXTBYTE_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-optype.h>
+
+static inline unsigned char
+extractbyte (op_t x, unsigned idx)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return x >> (idx * CHAR_BIT);
+  else
+    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
+}
+
+#endif /* STRING_EXTBYTE_H */
diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
new file mode 100644
index 0000000..ab208bf
--- /dev/null
+++ b/sysdeps/generic/string-fza.h
@@ -0,0 +1,117 @@
+/* Basic zero byte detection.  Generic C version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZA_H
+#define STRING_FZA_H 1
+
+#include <limits.h>
+#include <string-optype.h>
+
+/* This function returns non-zero if any byte in X is zero.
+   More specifically, at least one bit set within the least significant
+   byte that was zero; other bytes within the word are indeterminate.  */
+
+static inline op_t
+find_zero_low (op_t x)
+{
+  /* This expression comes from
+       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
+     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
+  op_t lsb = (op_t)-1 / 0xff;
+  op_t msb = lsb << (CHAR_BIT - 1);
+  return (x - lsb) & ~x & msb;
+}
+
+/* This function returns at least one bit set within every byte of X that
+   is zero.  The result is exact in that, unlike find_zero_low, all bytes
+   are determinate.  This is usually used for finding the index of the
+   most significant byte that was zero.  */
+
+static inline op_t
+find_zero_all (op_t x)
+{
+  /* For each byte, find not-zero by
+     (0) And 0x7f so that we cannot carry between bytes,
+     (1) Add 0x7f so that non-zero carries into 0x80,
+     (2) Or in the original byte (which might have had 0x80 set).
+     Then invert and mask such that 0x80 is set iff that byte was zero.  */
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  return ~(((x & m) + m) | x | m);
+}
+
+/* With similar caveats, identify bytes that are equal between X1 and X2.  */
+
+static inline op_t
+find_eq_low (op_t x1, op_t x2)
+{
+  return find_zero_low (x1 ^ x2);
+}
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1 ^ x2);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   equal between in X1 and X2.  */
+
+static inline op_t
+find_zero_eq_low (op_t x1, op_t x2)
+{
+  op_t lsb = (op_t)-1 / 0xff;
+  op_t msb = lsb << (CHAR_BIT - 1);
+  op_t eq = x1 ^ x2;
+  return (((x1 - lsb) & ~x1) | ((eq - lsb) & ~eq)) & msb;
+}
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  op_t eq = x1 ^ x2;
+  op_t c1 = ((x1 & m) + m) | x1;
+  op_t c2 = ((eq & m) + m) | eq;
+  return ~((c1 & c2) | m);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   not equal between in X1 and X2.  */
+
+static inline op_t
+find_zero_ne_low (op_t x1, op_t x2)
+{
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  op_t eq = x1 ^ x2;
+  op_t nz1 = (x1 + m) | x1;	/* msb set if byte not zero */
+  op_t ne2 = (eq + m) | eq;	/* msb set if byte not equal */
+  return (ne2 | ~nz1) & ~m;	/* msb set if x1 zero or x2 not equal */
+}
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  op_t eq = x1 ^ x2;
+  op_t nz1 = ((x1 & m) + m) | x1;
+  op_t ne2 = ((eq & m) + m) | eq;
+  return (ne2 | ~nz1) & ~m;
+}
+
+#endif /* STRING_FZA_H */
diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
new file mode 100644
index 0000000..d4ab59b
--- /dev/null
+++ b/sysdeps/generic/string-fzb.h
@@ -0,0 +1,49 @@
+/* Zero byte detection, boolean.  Generic C version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZB_H
+#define STRING_FZB_H 1
+
+#include <endian.h>
+#include <string-fza.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  return find_zero_low (x) != 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return find_eq_low (x1, x2) != 0;
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return find_zero_eq_low (x1, x2);
+}
+
+#endif /* STRING_FZB_H */
diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
new file mode 100644
index 0000000..57101f2
--- /dev/null
+++ b/sysdeps/generic/string-fzi.h
@@ -0,0 +1,215 @@
+/* Zero byte detection; indexes.  Generic C version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZI_H
+#define STRING_FZI_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-fza.h>
+
+/* An improved bitscan routine, multiplying the De Bruijn sequence with a
+   0-1 mask separated by the least significant one bit of a scanned integer
+   or bitboard [1].
+
+   [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
+
+static inline unsigned
+index_access (const op_t i)
+{
+  static const char index[] =
+  {
+# if __WORDSIZE == 64
+     0, 47,  1, 56, 48, 27,  2, 60,
+    57, 49, 41, 37, 28, 16,  3, 61,
+    54, 58, 35, 52, 50, 42, 21, 44,
+    38, 32, 29, 23, 17, 11,  4, 62,
+    46, 55, 26, 59, 40, 36, 15, 53,
+    34, 51, 20, 43, 31, 22, 10, 45,
+    25, 39, 14, 33, 19, 30,  9, 24,
+    13, 18,  8, 12,  7,  6,  5, 63
+# else
+     0,  9,  1, 10, 13, 21,  2, 29,
+    11, 14, 16, 18, 22, 25,  3, 30,
+     8, 12, 20, 28, 15, 17, 24,  7,
+    19, 27, 23,  6, 26,  5,  4, 31
+# endif
+  };
+  return index[i];
+}
+
+/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
+   and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) which uses external libcalls
+   (for intance __c{l,t}z{s,d}i2 from libgcc) the following wrapper provides
+   inline implementation for both count leading zeros and count trailing
+   zeros using branchless computation.  */
+
+static inline unsigned
+__ctz (op_t x)
+{
+#if !HAVE_BUILTIN_CTZ
+  op_t i;
+# if __WORDSIZE == 64
+  i = (x ^ (x - 1)) * 0x03F79D71B4CB0A89ull >> 58;
+# else
+  i = (x ^ (x - 1)) * 0x07C4ACDDU >> 27;
+# endif
+  return index_access (i);
+#else
+  if (sizeof (op_t) == sizeof (long))
+    return __builtin_ctzl (x);
+  else
+    return __builtin_ctzll (x);
+#endif
+};
+
+static inline unsigned
+__clz (op_t x)
+{
+#if !HAVE_BUILTIN_CLZ
+  unsigned r;
+  op_t i;
+
+  x |= x >> 1;
+  x |= x >> 2;
+  x |= x >> 4;
+  x |= x >> 8;
+  x |= x >> 16;
+# if __WORDSIZE == 64
+  x |= x >> 32;
+  i = x * 0x03F79D71B4CB0A89ull >> 58;
+# else
+  i = x * 0x07C4ACDDU >> 27;
+# endif
+  r = index_access (i);
+  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
+#else
+  if (sizeof (op_t) == sizeof (long))
+    return __builtin_clzl (x);
+  else
+    return __builtin_clzll (x);
+#endif
+}
+
+/* A subroutine for the index_zero functions.  Given a test word C, return
+   the (memory order) index of the first byte (in memory order) that is
+   non-zero.  */
+
+static inline unsigned int
+index_first_ (op_t c)
+{
+  _Static_assert (sizeof (op_t) == sizeof (long)
+		  || sizeof (op_t) == sizeof (long long),
+		  "Unhandled word size");
+
+  unsigned r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    r = __ctz (c);
+  else
+    r = __clz (c);
+  return r / CHAR_BIT;
+}
+
+/* Similarly, but return the (memory order) index of the last byte
+   that is non-zero.  */
+
+static inline unsigned int
+index_last_ (op_t c)
+{
+  _Static_assert (sizeof (op_t) == sizeof (long)
+		  || sizeof (op_t) == sizeof (long long),
+		  "Unhandled word size");
+
+  unsigned r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    r = __clz (c);
+  else
+    r = __ctz (c);
+  return sizeof (op_t) - 1 - (r / CHAR_BIT);
+}
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such within the word in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_low (x);
+  else
+    x = find_zero_all (x);
+  return index_first_ (x);
+}
+
+/* Similarly, but perform the search for byte equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_eq_low (x1, x2);
+  else
+    x1 = find_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_eq_low (x1, x2);
+  else
+    x1 = find_zero_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   inequality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_ne_low (x1, x2);
+  else
+    x1 = find_zero_ne_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but search for the last zero within X.  */
+
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_all (x);
+  else
+    x = find_zero_low (x);
+  return index_last_ (x);
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* STRING_FZI_H */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 05/18] string: Improve generic strlen
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (3 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-11 17:21   ` Paul Eggert
  2018-01-10 12:47 ` [PATCH v3 06/18] string: Improve generic memchr Adhemerval Zanella
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New algorithm have the following key differences:

  - Reads first word unaligned and use string-maskoff functions to
    remove unwanted data.  This strategy follow assemble optimized
    ones for powerpc, sparc, and SH.

  - Use of has_zero and index_first_zero parametrized functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	[BZ #5806]
    	* string/strlen.c: Use them.
---
 string/strlen.c | 83 +++++++++++----------------------------------------------
 1 file changed, 15 insertions(+), 68 deletions(-)

diff --git a/string/strlen.c b/string/strlen.c
index 8ce1318..6bd0ed9 100644
--- a/string/strlen.c
+++ b/string/strlen.c
@@ -20,6 +20,11 @@
 
 #include <string.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
 
 #undef strlen
 
@@ -32,78 +37,20 @@
 size_t
 STRLEN (const char *str)
 {
-  const char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
+  /* Align pointer to sizeof op_t.  */
+  const uintptr_t s_int = (uintptr_t) str;
+  const op_t *word_ptr = (const op_t*) (s_int & -sizeof (op_t));
 
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == '\0')
-      return char_ptr - str;
+  /* Read and MASK the first word. */
+  op_t word = *word_ptr | create_mask (s_int);
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
+  while (1)
     {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
+      if (has_zero (word))
+	break;
+      word = *++word_ptr;
     }
-  if (sizeof (longword) > 8)
-    abort ();
 
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
-    {
-      longword = *longword_ptr++;
-
-      if (((longword - lomagic) & ~longword & himagic) != 0)
-	{
-	  /* Which of the bytes was the zero?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
-
-	  if (cp[0] == 0)
-	    return cp - str;
-	  if (cp[1] == 0)
-	    return cp - str + 1;
-	  if (cp[2] == 0)
-	    return cp - str + 2;
-	  if (cp[3] == 0)
-	    return cp - str + 3;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (cp[4] == 0)
-		return cp - str + 4;
-	      if (cp[5] == 0)
-		return cp - str + 5;
-	      if (cp[6] == 0)
-		return cp - str + 6;
-	      if (cp[7] == 0)
-		return cp - str + 7;
-	    }
-	}
-    }
+  return ((const char *) word_ptr) + index_first_zero (word) - str;
 }
 libc_hidden_builtin_def (strlen)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 06/18] string: Improve generic memchr
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (4 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 05/18] string: Improve generic strlen Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 07/18] string: Improve generic memrchr Adhemerval Zanella
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New algorithm have the following key differences:

  - Reads first word unaligned and use string-maskoff function to
    remove unwanted data.  This strategy follow assemble optimized
    ones for aarch64, powerpc and tile.

  - Use string-fz{b,i} and string-opthr functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	[BZ #5806]
	* string/memchr.c: Use string-fzb.h, string-fzi.h, string-opthr.h.
---
 string/memchr.c | 157 +++++++++++++++-----------------------------------------
 1 file changed, 40 insertions(+), 117 deletions(-)

diff --git a/string/memchr.c b/string/memchr.c
index c4e21b8..ae3fd93 100644
--- a/string/memchr.c
+++ b/string/memchr.c
@@ -20,24 +20,16 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#ifndef _LIBC
-# include <config.h>
-#endif
-
 #include <string.h>
-
 #include <stddef.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
+#include <string-opthr.h>
 
-#include <limits.h>
-
-#undef __memchr
-#ifdef _LIBC
-# undef memchr
-#endif
-
-#ifndef weak_alias
-# define __memchr memchr
-#endif
+#undef memchr
 
 #ifndef MEMCHR
 # define MEMCHR __memchr
@@ -47,116 +39,47 @@
 void *
 MEMCHR (void const *s, int c_in, size_t n)
 {
-  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
-     long instead of a 64-bit uintmax_t tends to give better
-     performance.  On 64-bit hardware, unsigned long is generally 64
-     bits already.  Change this typedef to experiment with
-     performance.  */
-  typedef unsigned long int longword;
-
-  const unsigned char *char_ptr;
-  const longword *longword_ptr;
-  longword repeated_one;
-  longword repeated_c;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
-  /* Handle the first few bytes by reading one byte at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
-       --n, ++char_ptr)
-    if (*char_ptr == c)
-      return (void *) char_ptr;
-
-  longword_ptr = (const longword *) char_ptr;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to any size longwords.  */
-
-  /* Compute auxiliary longword values:
-     repeated_one is a value which has a 1 in every byte.
-     repeated_c has c in every byte.  */
-  repeated_one = 0x01010101;
-  repeated_c = c | (c << 8);
-  repeated_c |= repeated_c << 16;
-  if (0xffffffffU < (longword) -1)
-    {
-      repeated_one |= repeated_one << 31 << 1;
-      repeated_c |= repeated_c << 31 << 1;
-      if (8 < sizeof (longword))
-	{
-	  size_t i;
-
-	  for (i = 64; i < sizeof (longword) * 8; i *= 2)
-	    {
-	      repeated_one |= repeated_one << i;
-	      repeated_c |= repeated_c << i;
-	    }
-	}
-    }
+  const op_t *word_ptr, *lword;
+  op_t repeated_c, before_mask, word;
+  const char *lbyte;
+  char *ret;
+  uintptr_t s_int;
 
-  /* Instead of the traditional loop which tests each byte, we will test a
-     longword at a time.  The tricky part is testing if *any of the four*
-     bytes in the longword in question are equal to c.  We first use an xor
-     with repeated_c.  This reduces the task to testing whether *any of the
-     four* bytes in longword1 is zero.
-
-     We compute tmp =
-       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
-     That is, we perform the following operations:
-       1. Subtract repeated_one.
-       2. & ~longword1.
-       3. & a mask consisting of 0x80 in every byte.
-     Consider what happens in each byte:
-       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
-	 and step 3 transforms it into 0x80.  A carry can also be propagated
-	 to more significant bytes.
-       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
-	 position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
-	 the byte ends in a single bit of value 0 and k bits of value 1.
-	 After step 2, the result is just k bits of value 1: 2^k - 1.  After
-	 step 3, the result is 0.  And no carry is produced.
-     So, if longword1 has only non-zero bytes, tmp is zero.
-     Whereas if longword1 has a zero byte, call j the position of the least
-     significant zero byte.  Then the result has a zero at positions 0, ...,
-     j-1 and a 0x80 at position j.  We cannot predict the result at the more
-     significant bytes (positions j+1..3), but it does not matter since we
-     already have a non-zero bit at position 8*j+7.
-
-     So, the test whether any byte in longword1 is zero is equivalent to
-     testing whether tmp is nonzero.  */
-
-  while (n >= sizeof (longword))
-    {
-      longword longword1 = *longword_ptr ^ repeated_c;
 
-      if ((((longword1 - repeated_one) & ~longword1)
-	   & (repeated_one << 7)) != 0)
-	break;
-      longword_ptr++;
-      n -= sizeof (longword);
-    }
+  if (__glibc_unlikely (n == 0))
+    return NULL;
+
+  s_int = (uintptr_t) s;
+  word_ptr = (const op_t*) (s_int & -sizeof (op_t));
 
-  char_ptr = (const unsigned char *) longword_ptr;
+  /* Set up a word, each of whose bytes is C.  */
+  repeated_c = repeat_bytes (c_in);
+  before_mask = create_mask (s_int);
 
-  /* At this point, we know that either n < sizeof (longword), or one of the
-     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
-     machines, we could determine the first such byte without any further
-     memory accesses, just by looking at the tmp result from the last loop
-     iteration.  But this does not work on big-endian machines.  Choose code
-     that works in both cases.  */
+  /* Compute the address of the last byte taking in consideration possible
+     overflow.  */
+  uintptr_t lbyte_int = s_int + n - 1;
+  lbyte_int |= -(lbyte_int < s_int);
+  lbyte = (const char *) lbyte_int;
 
-  for (; n > 0; --n, ++char_ptr)
+  /* Compute the address of the word containing the last byte. */
+  lword = (const op_t *) ((uintptr_t) lbyte & -sizeof (op_t));
+
+  /* Read the first word, but munge it so that bytes before the array
+     will not match goal.  */
+  word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
+
+  while (has_eq (word, repeated_c) == 0)
     {
-      if (*char_ptr == c)
-	return (void *) char_ptr;
+      if (word_ptr == lword)
+	return NULL;
+      word = *++word_ptr;
     }
 
-  return NULL;
+  /* We found a match, but it might be in a byte past the end
+     of the array.  */
+  ret = (char *) word_ptr + index_first_eq (word, repeated_c);
+  return (ret <= lbyte) ? ret : NULL;
 }
-#ifdef weak_alias
 weak_alias (__memchr, memchr)
-#endif
 libc_hidden_builtin_def (memchr)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 07/18] string: Improve generic memrchr
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (5 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 06/18] string: Improve generic memchr Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 08/18] string: Improve generic strnlen Adhemerval Zanella
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New algorithm have the following key differences:

  - Use string-fz{b,i} functions.

It also cleanups the multiple inclusion by leaving the ifunc
implementation to undef the weak_alias.

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	[BZ #5806]
	* string/memrchr.c: Use string-fzb.h, string-fzi.h.
	* sysdeps/i386/i686/multiarch/memrchr-c.c: Redefined weak_alias.
	* sysdeps/s390/multiarch/memrchr-c.c: Likewise.
---
 string/memrchr.c                        | 193 +++++++-------------------------
 sysdeps/i386/i686/multiarch/memrchr-c.c |   2 +
 sysdeps/s390/multiarch/memrchr-c.c      |   2 +
 3 files changed, 44 insertions(+), 153 deletions(-)

diff --git a/string/memrchr.c b/string/memrchr.c
index 191b89a..5ae9c81 100644
--- a/string/memrchr.c
+++ b/string/memrchr.c
@@ -21,177 +21,64 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#include <stdlib.h>
-
-#ifdef HAVE_CONFIG_H
-# include <config.h>
-#endif
-
-#if defined _LIBC
-# include <string.h>
-# include <memcopy.h>
-#endif
-
-#if defined HAVE_LIMITS_H || defined _LIBC
-# include <limits.h>
-#endif
-
-#define LONG_MAX_32_BITS 2147483647
-
-#ifndef LONG_MAX
-# define LONG_MAX LONG_MAX_32_BITS
-#endif
-
-#include <sys/types.h>
+#include <string.h>
+#include <stdint.h>
+#include <limits.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-opthr.h>
+#include <string-maskoff.h>
 
 #undef __memrchr
 #undef memrchr
 
-#ifndef weak_alias
-# define __memrchr memrchr
+#ifndef MEMRCHR
+# define MEMRCHR __memrchr
 #endif
 
-/* Search no more than N bytes of S for C.  */
 void *
-#ifndef MEMRCHR
-__memrchr
-#else
-MEMRCHR
-#endif
-     (const void *s, int c_in, size_t n)
+MEMRCHR (const void *s, int c_in, size_t n)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
+  uintptr_t s_int = (uintptr_t) s;
+  uintptr_t lbyte_int = s_int + n;
 
   /* Handle the last few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s + n;
-       n > 0 && ((unsigned long int) char_ptr
-		 & (sizeof (longword) - 1)) != 0;
-       --n)
-    if (*--char_ptr == c)
+     Do this until CHAR_PTR is aligned on a word boundary, or
+     the entirety of small inputs.  */
+  const unsigned char *char_ptr = (const unsigned char *) lbyte_int;
+  size_t align = lbyte_int % sizeof (op_t);
+  if (n < OP_T_THRES || align > n)
+    align = n;
+  for (size_t i = 0; i < align; ++i)
+    if (*--char_ptr == c_in)
       return (void *) char_ptr;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (const unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
+  const op_t *word_ptr = (const op_t *) char_ptr;
+  n -= align;
+  if (__glibc_unlikely (n == 0))
+    return NULL;
 
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-#if LONG_MAX > LONG_MAX_32_BITS
-  charmask |= charmask << 32;
-#endif
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (n >= sizeof (longword))
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C, not zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *--longword_ptr ^ charmask;
+  /* Compute the address of the word containing the initial byte. */
+  const op_t *lword = (const op_t *) (s_int & -sizeof (op_t));
 
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
+  /* Set up a word, each of whose bytes is C.  */
+  op_t repeated_c = repeat_bytes (c_in);
 
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
+  char *ret;
+  op_t word;
 
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) longword_ptr;
-
-#if LONG_MAX > 2147483647
-	  if (cp[7] == c)
-	    return (void *) &cp[7];
-	  if (cp[6] == c)
-	    return (void *) &cp[6];
-	  if (cp[5] == c)
-	    return (void *) &cp[5];
-	  if (cp[4] == c)
-	    return (void *) &cp[4];
-#endif
-	  if (cp[3] == c)
-	    return (void *) &cp[3];
-	  if (cp[2] == c)
-	    return (void *) &cp[2];
-	  if (cp[1] == c)
-	    return (void *) &cp[1];
-	  if (cp[0] == c)
-	    return (void *) cp;
-	}
-
-      n -= sizeof (longword);
-    }
-
-  char_ptr = (const unsigned char *) longword_ptr;
-
-  while (n-- > 0)
+  while (word_ptr != lword)
     {
-      if (*--char_ptr == c)
-	return (void *) char_ptr;
+      word = *--word_ptr;
+      if (has_eq (word, repeated_c))
+	goto found;
     }
+  return NULL;
 
-  return 0;
+found:
+  /* We found a match, but it might be in a byte past the start 
+     of the array.  */
+  ret = (char *) word_ptr + index_last_eq (word, repeated_c);
+  return (ret >= (char*) s) ? ret : NULL;
 }
-#ifndef MEMRCHR
-# ifdef weak_alias
 weak_alias (__memrchr, memrchr)
-# endif
-#endif
diff --git a/sysdeps/i386/i686/multiarch/memrchr-c.c b/sysdeps/i386/i686/multiarch/memrchr-c.c
index ef7bbbe..23c937b 100644
--- a/sysdeps/i386/i686/multiarch/memrchr-c.c
+++ b/sysdeps/i386/i686/multiarch/memrchr-c.c
@@ -1,5 +1,7 @@
 #if IS_IN (libc)
 # define MEMRCHR  __memrchr_ia32
+# undef weak_alias
+# define weak_alias(a,b)
 # include <string.h>
 extern void *__memrchr_ia32 (const void *, int, size_t);
 #endif
diff --git a/sysdeps/s390/multiarch/memrchr-c.c b/sysdeps/s390/multiarch/memrchr-c.c
index 1e3c914..d7e59a4 100644
--- a/sysdeps/s390/multiarch/memrchr-c.c
+++ b/sysdeps/s390/multiarch/memrchr-c.c
@@ -18,6 +18,8 @@
 
 #if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
 # define MEMRCHR  __memrchr_c
+# undef weak_alias
+# define weak_alias(a,b)
 
 # include <string.h>
 extern __typeof (__memrchr) __memrchr_c;
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 08/18] string: Improve generic strnlen
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (6 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 07/18] string: Improve generic memrchr Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 09/18] string: Improve generic strchr Adhemerval Zanella
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha

With an optimized memchr, new strnlen implementation basically calls
memchr and adjust the result pointer value.

It also cleanups the multiple inclusion by leaving the ifunc
implementation to undef the weak_alias and libc_hidden_def.

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	[BZ #5806]
	* string/strnlen.c: Rewrite in terms of memchr.
	* sysdeps/i386/i686/multiarch/strnlen-c.c: Redefine weak_alias
	and libc_hidden_def.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c:
	Likewise.
	* sysdeps/s390/multiarch/strnlen-c.c: Likewise.
---
 string/strnlen.c                                   | 139 ++-------------------
 sysdeps/i386/i686/multiarch/strnlen-c.c            |  19 +--
 .../powerpc32/power4/multiarch/strnlen-ppc32.c     |  19 +--
 sysdeps/s390/multiarch/strnlen-c.c                 |  18 ++-
 4 files changed, 43 insertions(+), 152 deletions(-)

diff --git a/string/strnlen.c b/string/strnlen.c
index c2ce1eb..a3ec6af 100644
--- a/string/strnlen.c
+++ b/string/strnlen.c
@@ -21,146 +21,21 @@
    not, see <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <stdlib.h>
 
 /* Find the length of S, but scan at most MAXLEN characters.  If no
    '\0' terminator is found in that many characters, return MAXLEN.  */
 
-#ifdef STRNLEN
-# define __strnlen STRNLEN
+#ifndef STRNLEN
+# define STRNLEN __strnlen
 #endif
 
 size_t
-__strnlen (const char *str, size_t maxlen)
+STRNLEN (const char *str, size_t maxlen)
 {
-  const char *char_ptr, *end_ptr = str + maxlen;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
-
-  if (maxlen == 0)
-    return 0;
-
-  if (__glibc_unlikely (end_ptr < str))
-    end_ptr = (const char *) ~0UL;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == '\0')
-      {
-	if (char_ptr > end_ptr)
-	  char_ptr = end_ptr;
-	return char_ptr - str;
-      }
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
-    {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
-    }
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (longword_ptr < (unsigned long int *) end_ptr)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.  */
-
-      longword = *longword_ptr++;
-
-      if ((longword - lomagic) & himagic)
-	{
-	  /* Which of the bytes was the zero?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
-
-	  char_ptr = cp;
-	  if (cp[0] == 0)
-	    break;
-	  char_ptr = cp + 1;
-	  if (cp[1] == 0)
-	    break;
-	  char_ptr = cp + 2;
-	  if (cp[2] == 0)
-	    break;
-	  char_ptr = cp + 3;
-	  if (cp[3] == 0)
-	    break;
-	  if (sizeof (longword) > 4)
-	    {
-	      char_ptr = cp + 4;
-	      if (cp[4] == 0)
-		break;
-	      char_ptr = cp + 5;
-	      if (cp[5] == 0)
-		break;
-	      char_ptr = cp + 6;
-	      if (cp[6] == 0)
-		break;
-	      char_ptr = cp + 7;
-	      if (cp[7] == 0)
-		break;
-	    }
-	}
-      char_ptr = end_ptr;
-    }
-
-  if (char_ptr > end_ptr)
-    char_ptr = end_ptr;
-  return char_ptr - str;
+  const char *found = memchr (str, '\0', maxlen);
+  return found ? found - str : maxlen;
 }
-#ifndef STRNLEN
-libc_hidden_def (__strnlen)
+
 weak_alias (__strnlen, strnlen)
-#endif
+libc_hidden_def (__strnlen)
 libc_hidden_def (strnlen)
diff --git a/sysdeps/i386/i686/multiarch/strnlen-c.c b/sysdeps/i386/i686/multiarch/strnlen-c.c
index 351e939..bfbf811 100644
--- a/sysdeps/i386/i686/multiarch/strnlen-c.c
+++ b/sysdeps/i386/i686/multiarch/strnlen-c.c
@@ -1,10 +1,15 @@
 #define STRNLEN  __strnlen_ia32
+#undef weak_alias
+#define weak_alias(a,b)
+#undef libc_hidden_def
+#define libc_hidden_def(a)
+
+#include <string/strnlen.c>
+
 #ifdef SHARED
-# undef libc_hidden_def
-# define libc_hidden_def(name)  \
-    __hidden_ver1 (__strnlen_ia32, __GI_strnlen, __strnlen_ia32); \
-    strong_alias (__strnlen_ia32, __strnlen_ia32_1); \
-    __hidden_ver1 (__strnlen_ia32_1, __GI___strnlen, __strnlen_ia32_1);
+/* Alias for internal symbol to avoid PLT generation, it redirects the
+   libc_hidden_def (__strnlen/strlen) to default implementation.  */
+__hidden_ver1 (__strnlen_ia32, __GI_strnlen, __strnlen_ia32);
+strong_alias (__strnlen_ia32, __strnlen_ia32_1);
+__hidden_ver1 (__strnlen_ia32_1, __GI___strnlen, __strnlen_ia32_1);
 #endif
-
-#include "string/strnlen.c"
diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
index df940d3..e2ccd21 100644
--- a/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
+++ b/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
@@ -17,12 +17,17 @@
    <http://www.gnu.org/licenses/>.  */
 
 #define STRNLEN  __strnlen_ppc
-#ifdef SHARED
-# undef libc_hidden_def
-# define libc_hidden_def(name)  \
-    __hidden_ver1 (__strnlen_ppc, __GI_strnlen, __strnlen_ppc); \
-    strong_alias (__strnlen_ppc, __strnlen_ppc_1); \
-    __hidden_ver1 (__strnlen_ppc_1, __GI___strnlen, __strnlen_ppc_1);
-#endif
+#undef weak_alias
+#define weak_alias(a,b)
+#undef libc_hidden_def
+#define libc_hidden_def(a)
 
 #include <string/strnlen.c>
+
+#ifdef SHARED
+/* Alias for internal symbol to avoid PLT generation, it redirects the
+   libc_hidden_def (__strnlen/strlen) to default implementation.  */
+__hidden_ver1 (__strnlen_ppc, __GI_strnlen, __strnlen_ppc); \
+strong_alias (__strnlen_ppc, __strnlen_ppc_1); \
+__hidden_ver1 (__strnlen_ppc_1, __GI___strnlen, __strnlen_ppc_1);
+#endif
diff --git a/sysdeps/s390/multiarch/strnlen-c.c b/sysdeps/s390/multiarch/strnlen-c.c
index 353e83e..f77f59d 100644
--- a/sysdeps/s390/multiarch/strnlen-c.c
+++ b/sysdeps/s390/multiarch/strnlen-c.c
@@ -18,13 +18,19 @@
 
 #if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
 # define STRNLEN  __strnlen_c
+# undef weak_alias
+# define weak_alias(a,b)
+# undef libc_hidden_def
+# define libc_hidden_def(a)
+
+# include <string/strnlen.c>
+
 # ifdef SHARED
-#  undef libc_hidden_def
-#  define libc_hidden_def(name)					\
-  __hidden_ver1 (__strnlen_c, __GI_strnlen, __strnlen_c);	\
-  strong_alias (__strnlen_c, __strnlen_c_1);			\
-  __hidden_ver1 (__strnlen_c_1, __GI___strnlen, __strnlen_c_1);
+/* Alias for internal symbol to avoid PLT generation, it redirects the
+   libc_hidden_def (__strnlen/strlen) to default implementation.  */
+__hidden_ver1 (__strnlen_c, __GI_strnlen, __strnlen_c);
+strong_alias (__strnlen_c, __strnlen_c_1);
+__hidden_ver1 (__strnlen_c_1, __GI___strnlen, __strnlen_c_1);
 # endif /* SHARED */
 
-# include <string/strnlen.c>
 #endif /* HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc) */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 09/18] string: Improve generic strchr
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (7 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 08/18] string: Improve generic strnlen Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 10/18] string: Improve generic strchrnul Adhemerval Zanella
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New algorithm have the following key differences:

  - Reads first word unaligned and use string-maskoff function to
    remove unwanted data.  This strategy follow assemble optimized
    ones for aarch64 and powerpc.

  - Use string-fz{b,i} and string-extbyte function.

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	[BZ #5806]
	* string/strchr.c: Use string-fzb.h, string-fzi.h, string-extbyte.h.
	* sysdeps/s390/multiarch/strchr-c.c: Redefine weak_alias.
---
 string/strchr.c                   | 166 +++++++-------------------------------
 sysdeps/s390/multiarch/strchr-c.c |   1 +
 2 files changed, 29 insertions(+), 138 deletions(-)

diff --git a/string/strchr.c b/string/strchr.c
index a63fdfc..ee8ed5c 100644
--- a/string/strchr.c
+++ b/string/strchr.c
@@ -22,8 +22,15 @@
 
 #include <string.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-extbyte.h>
+#include <string-maskoff.h>
 
 #undef strchr
+#undef index
 
 #ifndef STRCHR
 # define STRCHR strchr
@@ -33,153 +40,36 @@
 char *
 STRCHR (const char *s, int c_in)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
+  const op_t *word_ptr;
+  op_t found, word;
 
-  c = (unsigned char) c_in;
+  /* Set up a word, each of whose bytes is C.  */
+  unsigned char c = (unsigned char) c_in;
+  op_t repeated_c = repeat_bytes (c_in);
 
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c)
-      return (void *) char_ptr;
-    else if (*char_ptr == '\0')
-      return NULL;
+  /* Align the input address to op_t.  */
+  uintptr_t s_int = (uintptr_t) s;
+  word_ptr = (op_t*) (s_int & -sizeof (op_t));
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  /* Read the first aligned word, but force bytes before the string to
+     match neither zero nor goal (we make sure the high bit of each byte
+     is 1, and the low 7 bits are all the opposite of the goal byte).  */
+  op_t bmask = create_mask (s_int);
+  word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
 
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  while (1)
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	    }
-	}
+      if (has_zero_eq (word, repeated_c))
+        break;
+      word = *++word_ptr;
     }
 
+  found = index_first_zero_eq (word, repeated_c);
+
+  if (extractbyte (word, found) == c)
+    return (char *) (word_ptr) + found;
   return NULL;
 }
 
-#ifdef weak_alias
-# undef index
 weak_alias (strchr, index)
-#endif
 libc_hidden_builtin_def (strchr)
diff --git a/sysdeps/s390/multiarch/strchr-c.c b/sysdeps/s390/multiarch/strchr-c.c
index 606cb56..e91ef94 100644
--- a/sysdeps/s390/multiarch/strchr-c.c
+++ b/sysdeps/s390/multiarch/strchr-c.c
@@ -19,6 +19,7 @@
 #if defined HAVE_S390_VX_ASM_SUPPORT && IS_IN (libc)
 # define STRCHR  __strchr_c
 # undef weak_alias
+# define weak_alias(a, b)
 # ifdef SHARED
 #  undef libc_hidden_builtin_def
 #  define libc_hidden_builtin_def(name)				\
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 10/18] string: Improve generic strchrnul
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (8 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 09/18] string: Improve generic strchr Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 11/18] string: Improve generic strcmp Adhemerval Zanella
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New algorithm have the following key differences:

  - Reads first word unaligned and use string-maskoff function to
    remove unwanted data.  This strategy follow assemble optimized
    ones for aarch64, powerpc and tile.

  - Use string-fz{b,i} functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	[BZ #5806]
	* string/strchrnul.c: Use string-fzb.h, string-fzi.h.
---
 string/strchrnul.c | 146 +++++++++--------------------------------------------
 1 file changed, 25 insertions(+), 121 deletions(-)

diff --git a/string/strchrnul.c b/string/strchrnul.c
index 5a17602..beeab88 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -21,8 +21,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <memcopy.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
 
 #undef __strchrnul
 #undef strchrnul
@@ -33,134 +37,34 @@
 
 /* Find the first occurrence of C in S or the final NUL byte.  */
 char *
-STRCHRNUL (const char *s, int c_in)
+STRCHRNUL (const char *str, int c_in)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
+  const op_t *word_ptr;
+  op_t found, word;
 
-  c = (unsigned char) c_in;
+  /* Set up a word, each of whose bytes is C.  */
+  op_t repeated_c = repeat_bytes (c_in);
 
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c || *char_ptr == '\0')
-      return (void *) char_ptr;
+  /* Align the input address to op_t.  */
+  uintptr_t s_int = (uintptr_t) str;
+  word_ptr = (op_t*) (s_int & -sizeof (op_t));
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  /* Read the first aligned word, but force bytes before the string to
+     match neither zero nor goal (we make sure the high bit of each byte
+     is 1, and the low 7 bits are all the opposite of the goal byte).  */
+  op_t bmask = create_mask (s_int);
+  word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
 
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  while (1)
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	    }
-	}
+      if (has_zero_eq (word, repeated_c))
+        break;
+      word = *++word_ptr;
     }
 
-  /* This should never happen.  */
-  return NULL;
+  found = index_first_zero_eq (word, repeated_c);
+
+  return (char *) (word_ptr) + found;
 }
 
 weak_alias (__strchrnul, strchrnul)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 11/18] string: Improve generic strcmp
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (9 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 10/18] string: Improve generic strchrnul Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 12/18] string: Improve generic strcpy Adhemerval Zanella
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

New generic implementation tries to use word operations along with
the new string-fz{b,i} functions even for inputs with different
alignments (with still uses aligned access plus merge operation
to get a correct word by word comparison).

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* string/strcmp.c: Rewrite using memcopy.h, string-fzb.h,
	string-fzi.h.
---
 string/strcmp.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 89 insertions(+), 8 deletions(-)

diff --git a/string/strcmp.c b/string/strcmp.c
index e198d19..c346ab9 100644
--- a/string/strcmp.c
+++ b/string/strcmp.c
@@ -16,6 +16,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
+#include <stdint.h>
+#include <limits.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-extbyte.h>
+#include <memcopy.h>
 
 #undef strcmp
 
@@ -29,19 +35,94 @@
 int
 STRCMP (const char *p1, const char *p2)
 {
-  const unsigned char *s1 = (const unsigned char *) p1;
-  const unsigned char *s2 = (const unsigned char *) p2;
+  const op_t *x1, *x2;
+  op_t w1, w2;
   unsigned char c1, c2;
+  uintptr_t i, n, ofs;
+  int diff;
 
-  do
+  /* Handle the unaligned bytes of p1 first.  */
+  n = -(uintptr_t)p1 % sizeof(op_t);
+  for (i = 0; i < n; ++i)
     {
-      c1 = (unsigned char) *s1++;
-      c2 = (unsigned char) *s2++;
-      if (c1 == '\0')
-	return c1 - c2;
+      c1 = *p1++;
+      c2 = *p2++;
+      diff = c1 - c2;
+      if (c1 == '\0' || diff)
+	return diff;
     }
-  while (c1 == c2);
 
+  /* P1 is now aligned to unsigned long.  P2 may or may not be.  */
+  x1 = (const op_t *)p1;
+  w1 = *x1++;
+  ofs = (uintptr_t)p2 % sizeof(op_t);
+  if (ofs == 0)
+    {
+      x2 = (const op_t *)p2;
+      w2 = *x2++;
+      /* Aligned loop.  If a difference is found, exit to compare the
+         bytes.  Else if a zero is found we have equal strings.  */
+      while (w1 == w2)
+	{
+	  if (has_zero (w1))
+	    return 0;
+          w1 = *x1++;
+          w2 = *x2++;
+	}
+    }
+  else
+    {
+      op_t w2a, w2b;
+      uintptr_t sh_1, sh_2;
+
+      x2 = (const op_t *)(p2 - ofs);
+      w2a = *x2++;
+      sh_1 = ofs * CHAR_BIT;
+      sh_2 = sizeof(op_t) * CHAR_BIT - sh_1;
+
+      /* Align the first partial of P2, with 0xff for the rest of the
+         bytes so that we can also apply the has_zero test to see if we
+         have already reached EOS.  If we have, then we can simply fall
+         through to the final comparison.  */
+      w2 = MERGE (w2a, sh_1, (op_t)-1, sh_2);
+      if (!has_zero (w2))
+	{
+	  /* Unaligned loop.  The invariant is that W2B, which is "ahead"
+             of W1, does not contain end-of-string.  Therefore it is safe
+             (and necessary) to read another word from each while we do
+             not have a difference.  */
+	  while (1)
+	    {
+	      w2b = *x2++;
+	      w2 = MERGE (w2a, sh_1, w2b, sh_2);
+	      if (w1 != w2)
+		goto final_cmp;
+	      if (has_zero (w2b))
+		break;
+	      w1 = *x1++;
+	      w2a = w2b;
+	    }
+
+	  /* Zero found in the second partial of P2.  If we had EOS
+	     in the aligned word, we have equality.  */
+	  if (has_zero (w1))
+	    return 0;
+
+          /* Load the final word of P1 and align the final partial of P2.  */
+	  w1 = *x1++;
+          w2 = MERGE (w2b, sh_1, 0, sh_2);
+	}
+    }
+
+ final_cmp:
+  for (i = 0; i < sizeof (op_t); i++)
+    {
+      c1 = extractbyte (w1, i);
+      c2 = extractbyte (w2, i);
+      if (c1 == '\0' || c1 != c2)
+        return c1 - c2;
+    }
   return c1 - c2;
 }
+
 libc_hidden_builtin_def (strcmp)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 12/18] string: Improve generic strcpy
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (10 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 11/18] string: Improve generic strcmp Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 13/18] hppa: Add memcopy.h Adhemerval Zanella
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Adhemerval Zanella

From: Adhemerval Zanella <adhemerval.zanella@linaro.com>

New generic implementation tries to use word operations along with
the new string-fz{b,i} functions even for inputs with different
alignments (with still uses aligned access plus merge operation
to get a correct word by word comparison).

Checked on x86_64-linux-gnu, i686-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

	Richard Henderson  <rth@twiddle.net>
	Adhemerval Zanella  <adhemerval.zanella@linaro.org>

	* string/strcpy.c: Rewrite using memcopy.h, string-fzb.h,
        string-fzi.h.
	* string/test-strcpy.c (test_main): Add move coverage.
---
 string/strcpy.c      | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 string/test-strcpy.c |  24 +++++++++++-
 2 files changed, 130 insertions(+), 3 deletions(-)

diff --git a/string/strcpy.c b/string/strcpy.c
index a4cce89..358b1b1 100644
--- a/string/strcpy.c
+++ b/string/strcpy.c
@@ -15,8 +15,13 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#include <stddef.h>
 #include <string.h>
+#include <stdint.h>
+#include <limits.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-extbyte.h>
+#include <memcopy.h>
 
 #undef strcpy
 
@@ -28,6 +33,106 @@
 char *
 STRCPY (char *dest, const char *src)
 {
-  return memcpy (dest, src, strlen (src) + 1);
+  char *dst = dest;
+  const op_t *xs;
+  op_t *xd;
+  op_t ws;
+
+#if _STRING_ARCH_unaligned
+  /* For architectures which supports unaligned memory operations, it first
+     aligns the source pointer, reads op_t bytes at time until a zero is
+     found, and writes unaligned to destination.  */
+  uintptr_t n = -(uintptr_t) src % sizeof (op_t);
+  for (uintptr_t i = 0; i < n; ++i)
+    {
+      unsigned c = *src++;
+      *dst++ = c;
+      if (c == '\0')
+	return dest;
+    }
+  xs = (const op_t *) src;
+  ws = *xs++;
+  xd = (op_t *) dst;
+  while (!has_zero (ws))
+    {
+      *xd++ = ws;
+      ws = *xs++;
+    }
+#else
+  /* For architectures which only supports aligned accesses, it first align
+     the destination pointer.  */
+  uintptr_t n = -(uintptr_t) dst % sizeof (op_t);
+  for (uintptr_t i = 0; i < n; ++i)
+    {
+      unsigned c = *src++;
+      *dst++ = c;
+      if (c == '\0')
+	return dest;
+    }
+  xd = (op_t *) dst;
+
+  /* Destination is aligned to op_t while source might be not.  */
+  uintptr_t ofs = (uintptr_t) src % sizeof (op_t);
+  if (ofs == 0)
+    {
+      /* Aligned loop.  If a zero is found, exit to copy the remaining
+	 bytes.  */
+      xs = (const op_t *) src;
+
+      ws = *xs++;
+      while (!has_zero (ws))
+	{
+	  *xd++ = ws;
+	  ws = *xs++;
+	}
+    }
+  else
+    {
+      /* Unaligned loop: align the source pointer and mask off the
+	 undesirable bytes which is not part of the string.  */
+      op_t wsa, wsb;
+      uintptr_t sh_1, sh_2;
+
+      xs = (const op_t *)(src - ofs);
+      wsa = *xs++;
+      sh_1 = ofs * CHAR_BIT;
+      sh_2 = sizeof(op_t) * CHAR_BIT - sh_1;
+
+      /* Align the first partial op_t from source, with 0xff for the rest
+	 of the bytes so that we can also apply the has_zero test to see if we
+         have already reached EOS.  If we have, then we can simply fall
+         through to the final byte copies.  */
+      ws = MERGE (wsa, sh_1, (op_t)-1, sh_2);
+      if (!has_zero (ws))
+	{
+	  while (1)
+	    {
+	      wsb = *xs++;
+	      ws = MERGE (wsa, sh_1, wsb, sh_2);
+	      if (has_zero (wsb))
+		break;
+	      *xd++ = ws;
+	      wsa = wsb;
+	    }
+
+	  /* WS may contain bytes that we not written yet in destination.
+	     Write them down and merge with the op_t containing the EOS
+	     byte. */
+	  if (!has_zero (ws))
+	    {
+	      *xd++ = ws;
+	      ws = MERGE (wsb, sh_1, ws, sh_2);
+	    }
+	}
+    }
+#endif
+
+  /* Just copy the final bytes from op_t.  */
+  dst = (char *) xd;
+  uintptr_t fz = index_first_zero (ws);
+  for (uintptr_t i = 0; i < fz + 1; i++)
+    *dst++ = extractbyte (ws, i);
+
+  return dest;
 }
 libc_hidden_builtin_def (strcpy)
diff --git a/string/test-strcpy.c b/string/test-strcpy.c
index 2a1bf93..fa03c73 100644
--- a/string/test-strcpy.c
+++ b/string/test-strcpy.c
@@ -207,7 +207,7 @@ do_random_tests (void)
 int
 test_main (void)
 {
-  size_t i;
+  size_t i, j;
 
   test_init ();
 
@@ -222,12 +222,26 @@ test_main (void)
       do_test (0, 0, i, BIG_CHAR);
       do_test (0, i, i, SMALL_CHAR);
       do_test (i, 0, i, BIG_CHAR);
+
+      for (j = 1; j < 16; ++j)
+	{
+	  do_test (0, 0, i + j, SMALL_CHAR);
+	  do_test (0, 0, i + j, BIG_CHAR);
+	  do_test (0, i, i + j, SMALL_CHAR);
+	  do_test (i, 0, i + j, BIG_CHAR);
+	}
     }
 
   for (i = 1; i < 8; ++i)
     {
       do_test (0, 0, 8 << i, SMALL_CHAR);
       do_test (8 - i, 2 * i, 8 << i, SMALL_CHAR);
+
+      for (j = 1; j < 8; ++j)
+	{
+	  do_test (0, 0, (8 << i) + j, SMALL_CHAR);
+	  do_test (8 - i, 2 * i, (8 << i) + j, SMALL_CHAR);
+	}
     }
 
   for (i = 1; i < 8; ++i)
@@ -236,6 +250,14 @@ test_main (void)
       do_test (2 * i, i, 8 << i, BIG_CHAR);
       do_test (i, i, 8 << i, SMALL_CHAR);
       do_test (i, i, 8 << i, BIG_CHAR);
+
+      for (j = 1; j < 8; ++j)
+	{
+	  do_test (i, 2 * i, (8 << i) + j, SMALL_CHAR);
+	  do_test (2 * i, i, (8 << i) + j, BIG_CHAR);
+	  do_test (i, i, (8 << i) + j, SMALL_CHAR);
+	  do_test (i, i, (8 << i) + j, BIG_CHAR);
+	}
     }
 
   do_random_tests ();
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 13/18] hppa: Add memcopy.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (11 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 12/18] string: Improve generic strcpy Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-11 13:36   ` Joseph Myers
  2018-01-10 12:47 ` [PATCH v3 14/18] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

GCC's combine pass cannot merge (x >> c | y << (32 - c)) into a
double-word shift unless (1) the subtract is in the same basic block
and (2) the result of the subtract is used exactly once.  Neither
condition is true for any use of MERGE.

By forcing the use of a double-word shift, we not only reduce
contention on SAR, but also allow the setting of SAR to be hoisted
outside of a loop.

Checked on hppa-linux-gnu.

	Richard Henderson  <rth@twiddle.net>

	* sysdeps/hppa/memcopy.h: New file.
---
 sysdeps/hppa/memcopy.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 sysdeps/hppa/memcopy.h

diff --git a/sysdeps/hppa/memcopy.h b/sysdeps/hppa/memcopy.h
new file mode 100644
index 0000000..4dcade7
--- /dev/null
+++ b/sysdeps/hppa/memcopy.h
@@ -0,0 +1,44 @@
+/* Definitions for memory copy functions, PA-RISC version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdeps/generic/memcopy.h>
+
+/* Use a single double-word shift instead of two shifts and an ior.
+   If the uses of MERGE were close to the computation of shl/shr,
+   the compiler might have been able to create this itself.
+   But instead that computation is well separated.
+
+   Using an inline function instead of a macro is the easiest way
+   to ensure that the types are correct.  */
+
+#undef MERGE
+
+extern void link_error(void);
+
+static inline op_t
+MERGE(op_t w0, int shl, op_t w1, int shr)
+{
+  op_t res;
+  if (OPSIZ == 4)
+    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  else if (OPSIZ == 8)
+    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  else
+    link_error(), res = 0;
+  return res;
+}
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 14/18] hppa: Add string-fzb.h and string-fzi.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (12 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 13/18] hppa: Add memcopy.h Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:47 ` [PATCH v3 15/18] alpha: " Adhemerval Zanella
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

Use UXOR,SBZ to test for a zero byte within a word.  While we can
get semi-decent code out of asm-goto, we would do slightly better
with a compiler builtin.

For index_zero et al, sequential testing of bytes is less expensive than
any tricks that involve a count-leading-zeros insn that we don't have.

Checked on hppa-linux-gnu.

	Richard Henderson  <rth@twiddle.net>

	* sysdeps/hppa/string-fzb.h: New file.
	* sysdeps/hppa/string-fzi.h: Likewise.
---
 sysdeps/hppa/string-fzb.h |  69 ++++++++++++++++++++++++
 sysdeps/hppa/string-fzi.h | 135 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 204 insertions(+)
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h

diff --git a/sysdeps/hppa/string-fzb.h b/sysdeps/hppa/string-fzb.h
new file mode 100644
index 0000000..0385d99
--- /dev/null
+++ b/sysdeps/hppa/string-fzb.h
@@ -0,0 +1,69 @@
+/* Zero byte detection, boolean.  HPPA version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZB_H
+#define STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* It's more useful to expose a control transfer to the compiler
+     than to expose a proper boolean result.  */
+  asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
+	    "b,n %l1" : : "r"(x) : : nbz);
+  return 1;
+ nbz:
+  return 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  asm goto ("uxor,sbz %0,%1,%%r0\n\t"
+	    "b,n %l2" : : "r"(x1), "r"(x2) : : nbz);
+  return 1;
+ nbz:
+  return 0;
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
+	    "uxor,nbz %0,%1,%%r0\n\t"
+	    "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
+  return 0;
+ sbz:
+  return 1;
+}
+
+#endif /* STRING_HASZERO_H */
diff --git a/sysdeps/hppa/string-fzi.h b/sysdeps/hppa/string-fzi.h
new file mode 100644
index 0000000..22bd8ac
--- /dev/null
+++ b/sysdeps/hppa/string-fzi.h
@@ -0,0 +1,135 @@
+/* string-fzi.h -- zero byte detection; indexes.  HPPA version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZI_H
+#define STRING_FZI_H 1
+
+#include <string-optype.h>
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such within the long in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %1,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but perform the search for byte equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  return index_first_zero (x1 ^ x2);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,= %1,23,8,%%r0\n\t"
+       "extrw,u,<> %2,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,= %1,15,8,%%r0\n\t"
+       "extrw,u,<> %2,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,= %1,7,8,%%r0\n\t"
+       "extrw,u,<> %2,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x1), "r"(x1 ^ x2), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but perform the search for zero within X1 or 
+   inequality between X1 and X2. */
+
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %2,23,8,%%r0\n\t"
+       "extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %2,15,8,%%r0\n\t"
+       "extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %2,7,8,%%r0\n\t"
+       "extrw,u,<> %1,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x1), "r"(x1 ^ x2), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but search for the last zero within X.  */
+
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no ctz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %1,31,8,%%r0\n\t"
+       "ldi 3,%0"
+       : "=r"(ret) : "r"(x), "0"(0));
+
+  return ret;
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* STRING_FZI_H */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 15/18] alpha: Add string-fzb.h and string-fzi.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (13 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 14/18] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
@ 2018-01-10 12:47 ` Adhemerval Zanella
  2018-01-10 12:48 ` [PATCH v3 16/18] arm: Add string-fza.h Adhemerval Zanella
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:47 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While alpha has the more important string functions in assembly,
there are still a few for find the generic routines are used.

Use the CMPBGE insn, via the builtin, for testing of zeros.  Use a
simplified expansion of __builtin_ctz when the insn isn't available.

Checked on alpha-linux-gnu.

	Richard Henderson  <rth@twiddle.net>

	* sysdeps/alpha/string-fzb.h: New file.
	* sysdeps/alpha/string-fzi.h: Likewise.
---
 sysdeps/alpha/string-fzb.h |  51 ++++++++++++++++++++
 sysdeps/alpha/string-fzi.h | 113 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 164 insertions(+)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h

diff --git a/sysdeps/alpha/string-fzb.h b/sysdeps/alpha/string-fzb.h
new file mode 100644
index 0000000..0e6a71c
--- /dev/null
+++ b/sysdeps/alpha/string-fzb.h
@@ -0,0 +1,51 @@
+/* Zero byte detection; boolean.  Alpha version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZB_H
+#define STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Note that since CMPBGE creates a bit mask rather than a byte mask,
+   we cannot simply provide a target-specific string-fza.h.  */
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  return __builtin_alpha_cmpbge (0, x) != 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1 ^ x2);
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1) | has_eq (x1, x2);
+}
+
+#endif /* STRING_FZB_H */
diff --git a/sysdeps/alpha/string-fzi.h b/sysdeps/alpha/string-fzi.h
new file mode 100644
index 0000000..243a9e5
--- /dev/null
+++ b/sysdeps/alpha/string-fzi.h
@@ -0,0 +1,113 @@
+/* string-fzi.h -- zero byte detection; indices.  Alpha version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZI_H
+#define STRING_FZI_H
+
+#include <limits.h>
+#include <string-optype.h>
+
+/* Note that since CMPBGE creates a bit mask rather than a byte mask,
+   we cannot simply provide a target-specific string-fza.h.  */
+
+/* A subroutine for the index_zero functions.  Given a bitmask C,
+   return the index of the first bit set in memory order.  */
+
+static inline unsigned int
+index_first_ (unsigned long int c)
+{
+#ifdef __alpha_cix__
+  return __builtin_ctzl (c);
+#else
+  c = c & -c;
+  return (c & 0xf0 ? 4 : 0) + (c & 0xcc ? 2 : 0) + (c & 0xaa ? 1 : 0);
+#endif
+}
+
+/* Similarly, but return the (memory order) index of the last bit
+   that is non-zero.  Note that only the least 8 bits may be nonzero.  */
+
+static inline unsigned int
+index_last_ (unsigned long int x)
+{
+#ifdef __alpha_cix__
+  return __builtin_clzl (x) ^ 63;
+#else
+  unsigned r = 0;
+  if (x & 0xf0)
+    r += 4;
+  if (x & (0xc << r))
+    r += 2;
+  if (x & (0x2 << r))
+    r += 1;
+  return r;
+#endif
+}
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such within the word in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x));
+}
+
+/* Similarly, but perform the test for byte equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  return index_first_zero (x1 ^ x2);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x1)
+		       | __builtin_alpha_cmpbge (0, x1 ^ x2));
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   inequality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x1)
+		       | (__builtin_alpha_cmpbge (0, x1 ^ x2) ^ 0xFF));
+}
+
+/* Similarly, but search for the last zero within X.  */
+
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  return index_last_ (__builtin_alpha_cmpbge (0, x));
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* STRING_FZI_H */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 16/18] arm: Add string-fza.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (14 preceding siblings ...)
  2018-01-10 12:47 ` [PATCH v3 15/18] alpha: " Adhemerval Zanella
@ 2018-01-10 12:48 ` Adhemerval Zanella
  2018-01-10 12:48 ` [PATCH v3 17/18] powerpc: " Adhemerval Zanella
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:48 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While arm has the more important string functions in assembly,
there are still a few generic routines used.

Use the UQSUB8 insn for testing of zeros.

Checked on armv7-linux-gnueabihf

	Richard Henderson  <rth@twiddle.net>

	* sysdeps/arm/armv6t2/string-fza.h: New file.
---
 sysdeps/arm/armv6t2/string-fza.h | 69 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h

diff --git a/sysdeps/arm/armv6t2/string-fza.h b/sysdeps/arm/armv6t2/string-fza.h
new file mode 100644
index 0000000..8c38f87
--- /dev/null
+++ b/sysdeps/arm/armv6t2/string-fza.h
@@ -0,0 +1,69 @@
+/* Zero byte detection; basics.  ARM version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZA_H
+#define STRING_FZA_H 1
+
+#include <string-optype.h>
+
+/* This function returns at least one bit set within every byte
+   of X that is zero.  */
+
+static inline op_t
+find_zero_all (op_t x)
+{
+  /* Use unsigned saturated subtraction from 1 in each byte.
+     That leaves 1 for every byte that was zero.  */
+  op_t ret, ones = (op_t)-1 / 0xff;
+  asm ("uqsub8 %0,%1,%2" : "=r"(ret) : "r"(ones), "r"(x));
+  return ret;
+}
+
+/* Identify bytes that are equal between X1 and X2.  */
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1 ^ x2);
+}
+
+/* Identify zero bytes in X1 or equality between X1 and X2.  */
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
+}
+
+/* Identify zero bytes in X1 or inequality between X1 and X2.  */
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  /* Make use of the fact that we'll already have ONES in a register.  */
+  op_t ones = (op_t)-1 / 0xff;
+  return find_zero_all (x1) | (find_zero_all (x1 ^ x2) ^ ones);
+}
+
+/* Define the "inexact" versions in terms of the exact versions.  */
+#define find_zero_low		find_zero_all
+#define find_eq_low		find_eq_all
+#define find_zero_eq_low	find_zero_eq_all
+#define find_zero_ne_low	find_zero_ne_all
+
+#endif /* STRING_FZA_H */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 17/18] powerpc: Add string-fza.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (15 preceding siblings ...)
  2018-01-10 12:48 ` [PATCH v3 16/18] arm: Add string-fza.h Adhemerval Zanella
@ 2018-01-10 12:48 ` Adhemerval Zanella
  2018-01-10 12:56   ` Tulio Magno Quites Machado Filho
  2018-01-10 12:48 ` [PATCH v3 18/18] sh: Add string-fzb.h Adhemerval Zanella
  2018-01-10 22:30 ` [PATCH v3 00/18] Improve generic string routines Ondřej Bílka
  18 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:48 UTC (permalink / raw
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While ppc has the more important string functions in assembly,
there are still a few generic routines used.

Use the Power 6 CMPB insn for testing of zeros.

Checked on powerpc64le-linux-gnu.

	Richard Henderson  <rth@twiddle.net>

	* sysdeps/powerpc/power6/string-fza.h: New file.
	* sysdeps/powerpc/powerpc32/power6/string-fza.h: Likewise.
	* sysdeps/powerpc/powerpc64/power6/string-fza.h: Likewise.
---
 sysdeps/powerpc/power6/string-fza.h           | 65 +++++++++++++++++++++++++++
 sysdeps/powerpc/powerpc32/power6/string-fza.h |  1 +
 sysdeps/powerpc/powerpc64/power6/string-fza.h |  1 +
 3 files changed, 67 insertions(+)
 create mode 100644 sysdeps/powerpc/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc32/power6/string-fza.h
 create mode 100644 sysdeps/powerpc/powerpc64/power6/string-fza.h

diff --git a/sysdeps/powerpc/power6/string-fza.h b/sysdeps/powerpc/power6/string-fza.h
new file mode 100644
index 0000000..4549dde
--- /dev/null
+++ b/sysdeps/powerpc/power6/string-fza.h
@@ -0,0 +1,65 @@
+/* Zero byte detection; basics.  Power6/ISA 2.03 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZA_H
+#define STRING_FZA_H 1
+
+#include <string-optype.h>
+
+/* This function returns 0xff for each byte that is
+   equal between X1 and X2.  */
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  op_t ret;
+  asm ("cmpb %0,%1,%2" : "=r"(ret) : "r"(x1), "r"(x2));
+  return ret;
+}
+
+/* This function returns 0xff for each byte that is zero in X.  */
+
+static inline op_t
+find_zero_all (op_t x)
+{
+  return find_eq_all (x, 0);
+}
+
+/* Identify zero bytes in X1 or equality between X1 and X2.  */
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_eq_all (x1, x2);
+}
+
+/* Identify zero bytes in X1 or inequality between X1 and X2.  */
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | ~find_eq_all (x1, x2);
+}
+
+/* Define the "inexact" versions in terms of the exact versions.  */
+#define find_zero_low		find_zero_all
+#define find_eq_low		find_eq_all
+#define find_zero_eq_low	find_zero_eq_all
+#define find_zero_ne_low	find_zero_ne_all
+
+#endif /* STRING_FZA_H */
diff --git a/sysdeps/powerpc/powerpc32/power6/string-fza.h b/sysdeps/powerpc/powerpc32/power6/string-fza.h
new file mode 100644
index 0000000..bb00d7c
--- /dev/null
+++ b/sysdeps/powerpc/powerpc32/power6/string-fza.h
@@ -0,0 +1 @@
+#include <sysdeps/powerpc/power6/string-fza.h>
diff --git a/sysdeps/powerpc/powerpc64/power6/string-fza.h b/sysdeps/powerpc/powerpc64/power6/string-fza.h
new file mode 100644
index 0000000..bb00d7c
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power6/string-fza.h
@@ -0,0 +1 @@
+#include <sysdeps/powerpc/power6/string-fza.h>
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 18/18] sh: Add string-fzb.h
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (16 preceding siblings ...)
  2018-01-10 12:48 ` [PATCH v3 17/18] powerpc: " Adhemerval Zanella
@ 2018-01-10 12:48 ` Adhemerval Zanella
  2018-01-10 22:30 ` [PATCH v3 00/18] Improve generic string routines Ondřej Bílka
  18 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-10 12:48 UTC (permalink / raw
  To: libc-alpha

Use the SH cmp/str on has_{zero,eq,zero_eq}.

Checked on sh4-linux-gnu.

	Adhemerval Zanella <adhemerval.zanella@linaro.org>

	* sysdeps/sh/string-fzb.h: New file.
---
 sysdeps/sh/string-fzb.h | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 sysdeps/sh/string-fzb.h

diff --git a/sysdeps/sh/string-fzb.h b/sysdeps/sh/string-fzb.h
new file mode 100644
index 0000000..70d8a8f
--- /dev/null
+++ b/sysdeps/sh/string-fzb.h
@@ -0,0 +1,53 @@
+/* Zero byte detection; boolean.  SH4 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZB_H
+#define STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  op_t zero = 0x0, ret;
+  asm volatile ("cmp/str %1,%2\n"
+		"movt %0\n"
+		: "=r" (ret)
+		: "r" (zero), "r" (x));
+  return ret;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1 ^ x2);
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1) | has_eq (x1, x2);
+}
+
+#endif /* STRING_FZB_H */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 17/18] powerpc: Add string-fza.h
  2018-01-10 12:48 ` [PATCH v3 17/18] powerpc: " Adhemerval Zanella
@ 2018-01-10 12:56   ` Tulio Magno Quites Machado Filho
  0 siblings, 0 replies; 45+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2018-01-10 12:56 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha; +Cc: Richard Henderson

Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> From: Richard Henderson <rth@twiddle.net>
>
> While ppc has the more important string functions in assembly,
> there are still a few generic routines used.
>
> Use the Power 6 CMPB insn for testing of zeros.
>
> Checked on powerpc64le-linux-gnu.
>
> 	Richard Henderson  <rth@twiddle.net>
>
> 	* sysdeps/powerpc/power6/string-fza.h: New file.
> 	* sysdeps/powerpc/powerpc32/power6/string-fza.h: Likewise.
> 	* sysdeps/powerpc/powerpc64/power6/string-fza.h: Likewise.

From my reply in Dec. 2016:   ;-)

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>

Thanks!

-- 
Tulio Magno



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 00/18] Improve generic string routines
  2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
                   ` (17 preceding siblings ...)
  2018-01-10 12:48 ` [PATCH v3 18/18] sh: Add string-fzb.h Adhemerval Zanella
@ 2018-01-10 22:30 ` Ondřej Bílka
  2018-01-11 10:54   ` Adhemerval Zanella
  18 siblings, 1 reply; 45+ messages in thread
From: Ondřej Bílka @ 2018-01-10 22:30 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha

On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:
> It is an update of previous Richard's patchset [1] to provide generic 
> string implementation for newer ports and make them only focus on 
> just specific routines to get a better overall improvement.
> It is done by:
> 
This is basically poorly reinvented version of my patch to make generic
string routines. By dusting it of one could get lot better performance
than this.

This contains lot of glaring issues, main ones are:
strnlen - quite ineffective as testing vs zero is faster than generic c.
Could be trivially improved by inlining memchr. 

strcmp - this

+  /* Handle the unaligned bytes of p1 first.  */
+  n = -(uintptr_t)p1 % sizeof(op_t);
+  for (i = 0; i < n; ++i)

is a the worst way how to write memcmp as you make loop with unpredictable
branch about alignment. Also in strcmp trying to be clever usually
causes performance regression because its likely that you get difference
in starting bytes.

strcpy - this was previously changed to strlen+memcpy because it with
platform specific strlen/memcpy it is faster. One should at least check
if some platforms are affected. I wouldn't be surprised that one could
get faster function just by adding some unrolling to memcpy. 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 03/18] Add string-maskoff.h generic header
  2018-01-10 12:47 ` [PATCH v3 03/18] Add string-maskoff.h generic header Adhemerval Zanella
@ 2018-01-10 23:24   ` Paul Eggert
  2018-01-11 10:54     ` Adhemerval Zanella
  2018-01-11 13:29   ` Joseph Myers
  1 sibling, 1 reply; 45+ messages in thread
From: Paul Eggert @ 2018-01-10 23:24 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha

On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
> +/* Create a mask with high bit of each byte being 1, and the low 7 bits
> +   being all the opposite of the input mask.  It is used to mask off
> +   undesirable bits from an aligned read from an unaligned pointer,
> +   and also taking care to avoid match possible bytes meant to be
> +   matched.  For instance, on a 64 bits machine with a pointer alignment
> +   of 3 the function returns 0x7f7f7f0000000000 (input meant to
> +   be 0xffffff0000000000) for BE and 0x00000000007f7f7f for LE (input
> +   meant to be 0x0000000000ffffff).  */
> +static inline op_t
> +highbit_mask (op_t m)
> +{
> +  return m & ~repeat_bytes (0x80);
> +}

The first line of the comment does not match the implementation, which 
would have to be something like "return (m ^ repeat_bytes (0x7f)) | 
repeat_bytes (0x80);" to match the comment. Also, I suggest replacing 
"~repeat_bytes (0x80)" with "repeat_bytes (0x7f)" as the latter is a bit 
simpler.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 00/18] Improve generic string routines
  2018-01-10 22:30 ` [PATCH v3 00/18] Improve generic string routines Ondřej Bílka
@ 2018-01-11 10:54   ` Adhemerval Zanella
  2018-01-11 13:49     ` Joseph Myers
  0 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 10:54 UTC (permalink / raw
  To: Ondřej Bílka; +Cc: libc-alpha



On 10/01/2018 20:30, Ondřej Bílka wrote:
> On Wed, Jan 10, 2018 at 10:47:44AM -0200, Adhemerval Zanella wrote:
>> It is an update of previous Richard's patchset [1] to provide generic 
>> string implementation for newer ports and make them only focus on 
>> just specific routines to get a better overall improvement.
>> It is done by:
>>
> This is basically poorly reinvented version of my patch to make generic
> string routines. By dusting it of one could get lot better performance
> than this.

Why not work together to get to get a better implementation instead of
bashing a patch proposal by just saying 'I know/did better'? I did take a 
look at your suggestion and my first impression what a more convoluted
with a lot of over-engineering. I used Richard's proposal because as a
start point mainly because I thought it is better decomposed and organized
over the internal string operations and a bit simpler for generic
implementations.

Not saying your previous proposal is out of value, and I would like if we
could improve this patch proposal with your input.

> 
> This contains lot of glaring issues, main ones are:
> strnlen - quite ineffective as testing vs zero is faster than generic c.
> Could be trivially improved by inlining memchr. 

The 08/08 is using memchr instead (just no inline it)...

> 
> strcmp - this
> 
> +  /* Handle the unaligned bytes of p1 first.  */
> +  n = -(uintptr_t)p1 % sizeof(op_t);
> +  for (i = 0; i < n; ++i)
> 
> is a the worst way how to write memcmp as you make loop with unpredictable
> branch about alignment. Also in strcmp trying to be clever usually
> causes performance regression because its likely that you get difference
> in starting bytes.

I am open to suggestions here, usually the other methods trade speed for
more complexity and hardware/abi idiosyncrasies. Also keep in mind the idea 
of implementation is to be used on a large sets of different hardware, 
where strategies like unaligned access within a page can not be guarantee
to work.


> 
> strcpy - this was previously changed to strlen+memcpy because it with
> platform specific strlen/memcpy it is faster. One should at least check
> if some platforms are affected. I wouldn't be surprised that one could
> get faster function just by adding some unrolling to memcpy. 
> 

It really depends, for instance aarch64 implementation sets a threshold
where strlen+memcpy is actually used where then it switches to load/store.
Also both strlen and memcpy is inlined, witch amortizes the function call
cost.

I though about adding such optimization, but it will need to rely if 
compiler can inline size bounded strlen/memcpy to actually make sense 
for small sizes (where it is not true for a lot of platforms).  And 
there is also the issue of which is the size to used as threshold. 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 03/18] Add string-maskoff.h generic header
  2018-01-10 23:24   ` Paul Eggert
@ 2018-01-11 10:54     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 10:54 UTC (permalink / raw
  To: Paul Eggert, libc-alpha



On 10/01/2018 21:24, Paul Eggert wrote:
> On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
>> +/* Create a mask with high bit of each byte being 1, and the low 7 bits
>> +   being all the opposite of the input mask.  It is used to mask off
>> +   undesirable bits from an aligned read from an unaligned pointer,
>> +   and also taking care to avoid match possible bytes meant to be
>> +   matched.  For instance, on a 64 bits machine with a pointer alignment
>> +   of 3 the function returns 0x7f7f7f0000000000 (input meant to
>> +   be 0xffffff0000000000) for BE and 0x00000000007f7f7f for LE (input
>> +   meant to be 0x0000000000ffffff).  */
>> +static inline op_t
>> +highbit_mask (op_t m)
>> +{
>> +  return m & ~repeat_bytes (0x80);
>> +}
> 
> The first line of the comment does not match the implementation, which would have to be something like "return (m ^ repeat_bytes (0x7f)) | repeat_bytes (0x80);" to match the comment. Also, I suggest replacing "~repeat_bytes (0x80)" with "repeat_bytes (0x7f)" as the latter is a bit simpler.

Thanks, I will change it locally.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 01/18] Parameterize op_t from memcopy.h
  2018-01-10 12:47 ` [PATCH v3 01/18] Parameterize op_t from memcopy.h Adhemerval Zanella
@ 2018-01-11 13:27   ` Joseph Myers
  2018-01-11 18:04     ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-11 13:27 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Wed, 10 Jan 2018, Adhemerval Zanella wrote:

> +#ifdef _LONG_LONG_LIMB
> +typedef unsigned long long int __attribute__((__may_alias__)) op_t;
> +#else
> +typedef unsigned long int __attribute__((__may_alias__)) op_t;
> +#endif

Missing space between __attribute__ and '('.

> +#if defined __GMP_H__ && ! defined _LONG_LONG_LIMB
> +#error "Included too late for _LONG_LONG_LIMB to take effect"
> +#endif

Missing preprocessor indentation, "# error".

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 03/18] Add string-maskoff.h generic header
  2018-01-10 12:47 ` [PATCH v3 03/18] Add string-maskoff.h generic header Adhemerval Zanella
  2018-01-10 23:24   ` Paul Eggert
@ 2018-01-11 13:29   ` Joseph Myers
  2018-01-11 17:57     ` Adhemerval Zanella
  1 sibling, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-11 13:29 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha

On Wed, 10 Jan 2018, Adhemerval Zanella wrote:

> +/* Provide a mask based on the pointer alignment that sets up non-zero
> +   bytes before the beginning of the word.  It is used to mask off
> +   undesirable bits from an aligned read from an unaligned pointer.
> +   For instance, on a 64 bits machine with a pointer alignment of
> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000

Is there a missing "for BE" at the end of this line?

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
@ 2018-01-11 13:34   ` Joseph Myers
  2018-01-11 18:25     ` Adhemerval Zanella
  2018-01-11 13:44   ` Luis Machado
  2018-01-11 16:47   ` Paul Eggert
  2 siblings, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-11 13:34 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha

On Wed, 10 Jan 2018, Adhemerval Zanella wrote:

> +
> +static inline unsigned char
> +extractbyte (op_t x, unsigned idx)

Missing comment on the semantics of this function.  "unsigned int".

> +/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
> +   and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) which uses external libcalls
> +   (for intance __c{l,t}z{s,d}i2 from libgcc) the following wrapper provides
> +   inline implementation for both count leading zeros and count trailing
> +   zeros using branchless computation.  */

I think the comments need to say a bit more about the semantics of these 
functions.  In particular, do they follow the same rule as the built-in 
functions that behavior is undefined if the argument is zero?  If they do, 
then I'd expect the comments on the functions that call them to specify 
that they must not be called with a zero argument (zero arguments in this 
case generally corresponding to words that are not at the end of the 
string etc., so the functions indeed don't get called in that case).

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 13/18] hppa: Add memcopy.h
  2018-01-10 12:47 ` [PATCH v3 13/18] hppa: Add memcopy.h Adhemerval Zanella
@ 2018-01-11 13:36   ` Joseph Myers
  2018-01-12 18:01     ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-11 13:36 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Wed, 10 Jan 2018, Adhemerval Zanella wrote:

> +extern void link_error(void);

Missing space before '('.

> +static inline op_t
> +MERGE(op_t w0, int shl, op_t w1, int shr)

Likewise.

> +  if (OPSIZ == 4)
> +    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
> +  else if (OPSIZ == 8)
> +    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
> +  else
> +    link_error(), res = 0;

Likewise, after asm, and after link_error, and after strings in asm.

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
  2018-01-11 13:34   ` Joseph Myers
@ 2018-01-11 13:44   ` Luis Machado
  2018-01-11 18:25     ` Adhemerval Zanella
  2018-01-11 16:47   ` Paul Eggert
  2 siblings, 1 reply; 45+ messages in thread
From: Luis Machado @ 2018-01-11 13:44 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha

Some quick typos i noticed while skimming through it.

On 01/10/2018 10:47 AM, Adhemerval Zanella wrote:
> +/* If compiler supports __builtin_ctz{l} without any external depedencies

typo... dependencies

More cases of this.

> diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
> new file mode 100644
> index 0000000..69a78ce
> --- /dev/null
> +++ b/sysdeps/generic/string-extbyte.h
> @@ -0,0 +1,35 @@
> +/* Extract by from memory word.  Generic C version.

Extract byte?

> diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
> new file mode 100644
> index 0000000..57101f2
> --- /dev/null
> +++ b/sysdeps/generic/string-fzi.h
> @@ -0,0 +1,215 @@
> +/* Zero byte detection; indexes.  Generic C version.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef STRING_FZI_H
> +#define STRING_FZI_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-fza.h>
> +
> +/* An improved bitscan routine, multiplying the De Bruijn sequence with a
> +   0-1 mask separated by the least significant one bit of a scanned integer
> +   or bitboard [1].
> +
> +   [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
> +
> +static inline unsigned
> +index_access (const op_t i)
> +{
> +  static const char index[] =
> +  {
> +# if __WORDSIZE == 64
> +     0, 47,  1, 56, 48, 27,  2, 60,
> +    57, 49, 41, 37, 28, 16,  3, 61,
> +    54, 58, 35, 52, 50, 42, 21, 44,
> +    38, 32, 29, 23, 17, 11,  4, 62,
> +    46, 55, 26, 59, 40, 36, 15, 53,
> +    34, 51, 20, 43, 31, 22, 10, 45,
> +    25, 39, 14, 33, 19, 30,  9, 24,
> +    13, 18,  8, 12,  7,  6,  5, 63
> +# else
> +     0,  9,  1, 10, 13, 21,  2, 29,
> +    11, 14, 16, 18, 22, 25,  3, 30,
> +     8, 12, 20, 28, 15, 17, 24,  7,
> +    19, 27, 23,  6, 26,  5,  4, 31
> +# endif
> +  };
> +  return index[i];
> +}
> +
> +/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)

For architectures?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 00/18] Improve generic string routines
  2018-01-11 10:54   ` Adhemerval Zanella
@ 2018-01-11 13:49     ` Joseph Myers
  0 siblings, 0 replies; 45+ messages in thread
From: Joseph Myers @ 2018-01-11 13:49 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: Ondřej Bílka, libc-alpha

On Thu, 11 Jan 2018, Adhemerval Zanella wrote:

> > This contains lot of glaring issues, main ones are:
> > strnlen - quite ineffective as testing vs zero is faster than generic c.
> > Could be trivially improved by inlining memchr. 
> 
> The 08/08 is using memchr instead (just no inline it)...

Which is a good idea on the general principles of architectures being able 
to add optimized versions of a few common functions, such as memchr, and 
have those used internally by the less common functions, such as strnlen.

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
  2018-01-11 13:34   ` Joseph Myers
  2018-01-11 13:44   ` Luis Machado
@ 2018-01-11 16:47   ` Paul Eggert
  2018-01-11 18:54     ` Adhemerval Zanella
  2 siblings, 1 reply; 45+ messages in thread
From: Paul Eggert @ 2018-01-11 16:47 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha

On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
> +  op_t lsb = (op_t)-1 / 0xff;
> +  op_t msb = lsb << (CHAR_BIT - 1);
This would be simpler and clearer if it were rewritten as:

     opt_t lsb = repeat_bytes (0x01);
     opt_t msb = repeat_bytes (0x80);

There are several other opportunities for this kind of simplification.

> +static inline op_t
> +find_zero_eq_low (op_t x1, op_t x2)
> +{
> +  op_t lsb = (op_t)-1 / 0xff;
> +  op_t msb = lsb << (CHAR_BIT - 1);
> +  op_t eq = x1 ^ x2;
> +  return (((x1 - lsb) & ~x1) | ((eq - lsb) & ~eq)) & msb;
> +}

How about the following simpler implementation instead? I expect it's 
just as fast:

    return find_zero_low (x1) | find_zero_low (x1 ^ x2);

Similarly for find_zero_eq_all, find_zero_ne_low, find_zero_ne_all.

> +static inline unsigned
> +__clz (op_t x)
> +{
> +#if !HAVE_BUILTIN_CLZ
> +  unsigned r;
> +  op_t i;
> +
> +  x |= x >> 1;
> +  x |= x >> 2;
> +  x |= x >> 4;
> +  x |= x >> 8;
> +  x |= x >> 16;
> +# if __WORDSIZE == 64
> +  x |= x >> 32;
> +  i = x * 0x03F79D71B4CB0A89ull >> 58;
> +# else
> +  i = x * 0x07C4ACDDU >> 27;
> +# endif
> +  r = index_access (i);
> +  return r ^ (sizeof (op_t) * CHAR_BIT - 1);

The Gnulib integer_length module has a faster implementation, at least 
for 32-bit platforms. Do we still care about 32-bit platforms? If so, 
you might want to take a look at it.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 05/18] string: Improve generic strlen
  2018-01-10 12:47 ` [PATCH v3 05/18] string: Improve generic strlen Adhemerval Zanella
@ 2018-01-11 17:21   ` Paul Eggert
  2018-01-12 17:59     ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Eggert @ 2018-01-11 17:21 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha; +Cc: Richard Henderson

On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
> +  /* Align pointer to sizeof op_t.  */
> +  const uintptr_t s_int = (uintptr_t) str;
> +  const op_t *word_ptr = (const op_t*) (s_int & -sizeof (op_t));

I see this sort of code used in multiple places (often with different 
implementations), and suggest packaging it up into an inline function. 
Also, you might consider implementing it this way, which is a bit 
simpler (it's the method used in your generic strcmp):

   op_t *
   word_containing (char const *p)
   {
      return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
   }

This generates the same code with gcc -O2, and minimizes the usage of 
pointers as integers.

Similarly, in other code I suggest using char * instead of uintptr_t, as 
much as possible. This works just as well with ordinary addition, and 
will simplify GCC's job if we ever build with -fcheck-pointer-bounds.

> +  while (1)
>       {
> -      /* 64-bit version of the magic.  */
> -      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> -      himagic = ((himagic << 16) << 16) | himagic;
> -      lomagic = ((lomagic << 16) << 16) | lomagic;
> +      if (has_zero (word))
> +	break;
> +      word = *++word_ptr;
>       }

This would be a bit simpler:

   while (! has_zero (word))
     word = *++word_ptr;



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 03/18] Add string-maskoff.h generic header
  2018-01-11 13:29   ` Joseph Myers
@ 2018-01-11 17:57     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 17:57 UTC (permalink / raw
  To: Joseph Myers; +Cc: libc-alpha



On 11/01/2018 11:29, Joseph Myers wrote:
> On Wed, 10 Jan 2018, Adhemerval Zanella wrote:
> 
>> +/* Provide a mask based on the pointer alignment that sets up non-zero
>> +   bytes before the beginning of the word.  It is used to mask off
>> +   undesirable bits from an aligned read from an unaligned pointer.
>> +   For instance, on a 64 bits machine with a pointer alignment of
>> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
> 
> Is there a missing "for BE" at the end of this line?
> 

I applied the following patch based on Paul's suggestion:

diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
index 6231798..98e7852 100644
--- a/sysdeps/generic/string-maskoff.h
+++ b/sysdeps/generic/string-maskoff.h
@@ -47,18 +47,17 @@ repeat_bytes (unsigned char c_in)
   return ((op_t)-1 / 0xff) * c_in;
 }
 
-/* Create a mask with high bit of each byte being 1, and the low 7 bits
-   being all the opposite of the input mask.  It is used to mask off
-   undesirable bits from an aligned read from an unaligned pointer,
-   and also taking care to avoid match possible bytes meant to be
-   matched.  For instance, on a 64 bits machine with a pointer alignment
-   of 3 the function returns 0x7f7f7f0000000000 (input meant to
-   be 0xffffff0000000000) for BE and 0x00000000007f7f7f for LE (input
-   meant to be 0x0000000000ffffff).  */
+/* Based on mask created by 'create_mask', mask off the high bit of each
+   byte in the mask.  It is used to mask off undesirable bits from an
+   aligned read from an unaligned pointer, and also taking care to avoid
+   match possible bytes meant to be matched.  For instance, on a 64 bits
+   machine with a mask created from a pointer with an alignment of 3
+   (0x0000000000ffffff) the function returns 0x7f7f7f0000000000 for BE
+   and 0x00000000007f7f7f for LE.  */
 static inline op_t
 highbit_mask (op_t m)
 {
-  return m & ~repeat_bytes (0x80);
+  return m & repeat_bytes (0x7f);
 }
 
 #endif /* STRING_MASKOFF_H  */


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 01/18] Parameterize op_t from memcopy.h
  2018-01-11 13:27   ` Joseph Myers
@ 2018-01-11 18:04     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 18:04 UTC (permalink / raw
  To: Joseph Myers; +Cc: libc-alpha, Richard Henderson



On 11/01/2018 11:27, Joseph Myers wrote:
> On Wed, 10 Jan 2018, Adhemerval Zanella wrote:
> 
>> +#ifdef _LONG_LONG_LIMB
>> +typedef unsigned long long int __attribute__((__may_alias__)) op_t;
>> +#else
>> +typedef unsigned long int __attribute__((__may_alias__)) op_t;
>> +#endif
> 
> Missing space between __attribute__ and '('.
> 
>> +#if defined __GMP_H__ && ! defined _LONG_LONG_LIMB
>> +#error "Included too late for _LONG_LONG_LIMB to take effect"
>> +#endif
> 
> Missing preprocessor indentation, "# error".
> 

Thanks, I fixed locally both observations.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-11 13:44   ` Luis Machado
@ 2018-01-11 18:25     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 18:25 UTC (permalink / raw
  To: Luis Machado, libc-alpha



On 11/01/2018 11:44, Luis Machado wrote:
> Some quick typos i noticed while skimming through it.
> 
> On 01/10/2018 10:47 AM, Adhemerval Zanella wrote:
>> +/* If compiler supports __builtin_ctz{l} without any external depedencies
> 
> typo... dependencies
> 
> More cases of this.
> 
>> diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
>> new file mode 100644
>> index 0000000..69a78ce
>> --- /dev/null
>> +++ b/sysdeps/generic/string-extbyte.h
>> @@ -0,0 +1,35 @@
>> +/* Extract by from memory word.  Generic C version.
> 
> Extract byte?
> 
>> diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
>> new file mode 100644
>> index 0000000..57101f2
>> --- /dev/null
>> +++ b/sysdeps/generic/string-fzi.h
>> @@ -0,0 +1,215 @@
>> +/* Zero byte detection; indexes.  Generic C version.
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef STRING_FZI_H
>> +#define STRING_FZI_H 1
>> +
>> +#include <limits.h>
>> +#include <endian.h>
>> +#include <string-fza.h>
>> +
>> +/* An improved bitscan routine, multiplying the De Bruijn sequence with a
>> +   0-1 mask separated by the least significant one bit of a scanned integer
>> +   or bitboard [1].
>> +
>> +   [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
>> +
>> +static inline unsigned
>> +index_access (const op_t i)
>> +{
>> +  static const char index[] =
>> +  {
>> +# if __WORDSIZE == 64
>> +     0, 47,  1, 56, 48, 27,  2, 60,
>> +    57, 49, 41, 37, 28, 16,  3, 61,
>> +    54, 58, 35, 52, 50, 42, 21, 44,
>> +    38, 32, 29, 23, 17, 11,  4, 62,
>> +    46, 55, 26, 59, 40, 36, 15, 53,
>> +    34, 51, 20, 43, 31, 22, 10, 45,
>> +    25, 39, 14, 33, 19, 30,  9, 24,
>> +    13, 18,  8, 12,  7,  6,  5, 63
>> +# else
>> +     0,  9,  1, 10, 13, 21,  2, 29,
>> +    11, 14, 16, 18, 22, 25,  3, 30,
>> +     8, 12, 20, 28, 15, 17, 24,  7,
>> +    19, 27, 23,  6, 26,  5,  4, 31
>> +# endif
>> +  };
>> +  return index[i];
>> +}
>> +
>> +/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
> 
> For architectures?

Thanks, I fixed all locally.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-11 13:34   ` Joseph Myers
@ 2018-01-11 18:25     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 18:25 UTC (permalink / raw
  To: Joseph Myers; +Cc: libc-alpha



On 11/01/2018 11:34, Joseph Myers wrote:
> On Wed, 10 Jan 2018, Adhemerval Zanella wrote:
> 
>> +
>> +static inline unsigned char
>> +extractbyte (op_t x, unsigned idx)
> 
> Missing comment on the semantics of this function.  "unsigned int".

I applied this change locally:

diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
index 69a78ce..c4693b2 100644
--- a/sysdeps/generic/string-extbyte.h
+++ b/sysdeps/generic/string-extbyte.h
@@ -23,8 +23,10 @@
 #include <endian.h>
 #include <string-optype.h>
 
+/* Extract the byte at index IDX from word X, with index 0 being the
+   least significant byte.  */
 static inline unsigned char
-extractbyte (op_t x, unsigned idx)
+extractbyte (op_t x, unsigned int idx)
 {
   if (__BYTE_ORDER == __LITTLE_ENDIAN)
     return x >> (idx * CHAR_BIT);


> 
>> +/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
>> +   and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) which uses external libcalls
>> +   (for intance __c{l,t}z{s,d}i2 from libgcc) the following wrapper provides
>> +   inline implementation for both count leading zeros and count trailing
>> +   zeros using branchless computation.  */
> 
> I think the comments need to say a bit more about the semantics of these 
> functions.  In particular, do they follow the same rule as the built-in 
> functions that behavior is undefined if the argument is zero?  If they do, 
> then I'd expect the comments on the functions that call them to specify 
> that they must not be called with a zero argument (zero arguments in this 
> case generally corresponding to words that are not at the end of the 
> string etc., so the functions indeed don't get called in that case).
> 

I think it is reasonable to set the same semantic as the builtins and I
did not pay attention to outline this mainly because they are not indeed
used called in such cases (which would be UB for the builtin case in
fact).  I applied this patch locally:

diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
index 57101f2..534babb 100644
--- a/sysdeps/generic/string-fzi.h
+++ b/sysdeps/generic/string-fzi.h
@@ -29,7 +29,7 @@
 
    [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
 
-static inline unsigned
+static inline unsigned int
 index_access (const op_t i)
 {
   static const char index[] =
@@ -53,13 +53,14 @@ index_access (const op_t i)
   return index[i];
 }
 
-/* For architecture which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
+/* For architectures which only provides __builtin_clz{l} (HAVE_BUILTIN_CLZ)
    and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) which uses external libcalls
    (for intance __c{l,t}z{s,d}i2 from libgcc) the following wrapper provides
    inline implementation for both count leading zeros and count trailing
-   zeros using branchless computation.  */
+   zeros using branchless computation.  As for builtin, if x is 0 the
+   result is undefined.*/
 
-static inline unsigned
+static inline unsigned int
 __ctz (op_t x)
 {
 #if !HAVE_BUILTIN_CTZ
@@ -78,7 +79,7 @@ __ctz (op_t x)
 #endif
 };
 
-static inline unsigned
+static inline unsigned int
 __clz (op_t x)
 {
 #if !HAVE_BUILTIN_CLZ


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-11 16:47   ` Paul Eggert
@ 2018-01-11 18:54     ` Adhemerval Zanella
  2018-01-12  1:08       ` Paul Eggert
  2018-01-12 13:30       ` Adhemerval Zanella
  0 siblings, 2 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-11 18:54 UTC (permalink / raw
  To: Paul Eggert, libc-alpha



On 11/01/2018 14:47, Paul Eggert wrote:
> On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
>> +  op_t lsb = (op_t)-1 / 0xff;
>> +  op_t msb = lsb << (CHAR_BIT - 1);
> This would be simpler and clearer if it were rewritten as:
> 
>     opt_t lsb = repeat_bytes (0x01);
>     opt_t msb = repeat_bytes (0x80);
> 
> There are several other opportunities for this kind of simplification.

Indeed, I changed it locally

> 
>> +static inline op_t
>> +find_zero_eq_low (op_t x1, op_t x2)
>> +{
>> +  op_t lsb = (op_t)-1 / 0xff;
>> +  op_t msb = lsb << (CHAR_BIT - 1);
>> +  op_t eq = x1 ^ x2;
>> +  return (((x1 - lsb) & ~x1) | ((eq - lsb) & ~eq)) & msb;
>> +}
> 
> How about the following simpler implementation instead? I expect it's just as fast:
> 
>    return find_zero_low (x1) | find_zero_low (x1 ^ x2);
> 
> Similarly for find_zero_eq_all, find_zero_ne_low, find_zero_ne_all.

I think this seems ok and code generation for at least aarch64, powerpc64le,
sparc64, and x86_64 seems similar.

> 
>> +static inline unsigned
>> +__clz (op_t x)
>> +{
>> +#if !HAVE_BUILTIN_CLZ
>> +  unsigned r;
>> +  op_t i;
>> +
>> +  x |= x >> 1;
>> +  x |= x >> 2;
>> +  x |= x >> 4;
>> +  x |= x >> 8;
>> +  x |= x >> 16;
>> +# if __WORDSIZE == 64
>> +  x |= x >> 32;
>> +  i = x * 0x03F79D71B4CB0A89ull >> 58;
>> +# else
>> +  i = x * 0x07C4ACDDU >> 27;
>> +# endif
>> +  r = index_access (i);
>> +  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
> 
> The Gnulib integer_length module has a faster implementation, at least for 32-bit platforms. Do we still care about 32-bit platforms? If so, you might want to take a look at it.

Do you mean the version which uses double to integer, the one with 6 comparisons
or the naive one? Indeed I think for 32 bits is issuing a lot of instruction,
the only advantage I see it is branchless (which might be a gain in some kind
of architectures).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-11 18:54     ` Adhemerval Zanella
@ 2018-01-12  1:08       ` Paul Eggert
  2018-01-12 17:08         ` Joseph Myers
  2018-01-12 13:30       ` Adhemerval Zanella
  1 sibling, 1 reply; 45+ messages in thread
From: Paul Eggert @ 2018-01-12  1:08 UTC (permalink / raw
  To: Adhemerval Zanella, libc-alpha

On 01/11/2018 10:54 AM, Adhemerval Zanella wrote:
>> The Gnulib integer_length module has a faster implementation, at least for 32-bit platforms. Do we still care about 32-bit platforms? If so, you might want to take a look at it.
> Do you mean the version which uses double to integer, the one with 6 comparisons
> or the naive one?

I meant the one that converts int to double. It can be branchless since 
we assume the int is nonzero.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-11 18:54     ` Adhemerval Zanella
  2018-01-12  1:08       ` Paul Eggert
@ 2018-01-12 13:30       ` Adhemerval Zanella
  1 sibling, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-12 13:30 UTC (permalink / raw
  To: Paul Eggert, libc-alpha



On 11/01/2018 16:54, Adhemerval Zanella wrote:
> 
> 
> On 11/01/2018 14:47, Paul Eggert wrote:
>> On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
>>> +  op_t lsb = (op_t)-1 / 0xff;
>>> +  op_t msb = lsb << (CHAR_BIT - 1);
>> This would be simpler and clearer if it were rewritten as:
>>
>>     opt_t lsb = repeat_bytes (0x01);
>>     opt_t msb = repeat_bytes (0x80);
>>
>> There are several other opportunities for this kind of simplification.
> 
> Indeed, I changed it locally
> 
>>
>>> +static inline op_t
>>> +find_zero_eq_low (op_t x1, op_t x2)
>>> +{
>>> +  op_t lsb = (op_t)-1 / 0xff;
>>> +  op_t msb = lsb << (CHAR_BIT - 1);
>>> +  op_t eq = x1 ^ x2;
>>> +  return (((x1 - lsb) & ~x1) | ((eq - lsb) & ~eq)) & msb;
>>> +}
>>
>> How about the following simpler implementation instead? I expect it's just as fast:
>>
>>    return find_zero_low (x1) | find_zero_low (x1 ^ x2);
>>
>> Similarly for find_zero_eq_all, find_zero_ne_low, find_zero_ne_all.
> 
> I think this seems ok and code generation for at least aarch64, powerpc64le,
> sparc64, and x86_64 seems similar.

While trying to compose find_zero_new_{low,all} with find_zero_{low,all}
made me not so sure if it would be gain. To accomplish we will need to add
another operation, such as:

---
static inline op_t
find_zero_ne_low (op_t x1, op_t x2)
{
  op_t x = repeat_bytes (0x80);
  return find_zero_low (x1) | (find_zero_low (x1 ^ x2) ^ x);
}
---

Which seems slight worse than current regarding generated instructions.
Using GCC 7.2.1 for x86_64 I see:

* Patch version:

find_zero_ne_low.constprop.0:
.LFB28: 
        .cfi_startproc
        movabsq $1229782938247303441, %rdx
        movq    %rdx, %rcx
        movabsq $9187201950435737471, %rdx
        leaq    (%rdi,%rdx), %rax
        xorq    %rdi, %rcx
        addq    %rcx, %rdx
        orq     %rdi, %rax
        orq     %rcx, %rdx
        movabsq $-9187201950435737472, %rdi
        notq    %rax
        orq     %rdx, %rax
        andq    %rdi, %rax
        ret


* find_zero_low version above:

find_zero_ne_low.constprop.0:
.LFB28: 
        .cfi_startproc
        movabsq $1229782938247303441, %rax
        movabsq $-72340172838076673, %rdx
        movabsq $-1229782938247303442, %rcx
        xorq    %rdi, %rax
        xorq    %rdi, %rcx
        addq    %rdx, %rax
        addq    %rdi, %rdx
        notq    %rdi
        andq    %rcx, %rax
        andq    %rdi, %rdx
        movabsq $-9187201950435737472, %rdi
        notq    %rax
        orq     %rdx, %rax
        andq    %rdi, %rax
        ret



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-12  1:08       ` Paul Eggert
@ 2018-01-12 17:08         ` Joseph Myers
  2018-01-12 17:58           ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-12 17:08 UTC (permalink / raw
  To: Paul Eggert; +Cc: Adhemerval Zanella, libc-alpha

On Thu, 11 Jan 2018, Paul Eggert wrote:

> On 01/11/2018 10:54 AM, Adhemerval Zanella wrote:
> > > The Gnulib integer_length module has a faster implementation, at least for
> > > 32-bit platforms. Do we still care about 32-bit platforms? If so, you
> > > might want to take a look at it.
> > Do you mean the version which uses double to integer, the one with 6
> > comparisons
> > or the naive one?
> 
> I meant the one that converts int to double. It can be branchless since we
> assume the int is nonzero.

Looking at glibc architectures (and architectures with recently proposed 
ports):

* The following have clz patterns in GCC, unconditionally, meaning this 
glibc patch will always use __builtin_clz functions and any fallback code 
is irrelevant: aarch64 i386 ia64 powerpc tilegx x86_64.  (On ia64 the 
pattern uses conversion to floating point.)

* The following have clz patterns in GCC, conditionally: alpha arm m68k 
microblaze mips s390 sparc (and arc).  I have not checked whether in some 
of those cases the conditions might in fact be true for every 
configuration for which glibc can be built.

* The following lack clz patterns in GCC: hppa nios2 sh (and riscv).

If the configuration lacking clz is also soft-float, converting int to 
double is an extremely inefficient way ending up calling the libgcc clz 
implementation (both soft-fp and fp-bit use __builtin_clz).  I think 
that's sufficient reason to avoid an approach involving conversion to 
double unless an architecture has opted in to using it as an efficient 
approach on that architecture.

(For arm, for example, clz is supported if "TARGET_32BIT && arm_arch5", so 
the only configurations without __builtin_clz expanded inline by the 
compiler are armv4t ones - which are also all soft-float, so the expansion 
using double can never make sense for arm.)

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 04/18] Add string vectorized find and detection functions
  2018-01-12 17:08         ` Joseph Myers
@ 2018-01-12 17:58           ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-12 17:58 UTC (permalink / raw
  To: Joseph Myers, Paul Eggert; +Cc: libc-alpha



On 12/01/2018 15:08, Joseph Myers wrote:
> On Thu, 11 Jan 2018, Paul Eggert wrote:
> 
>> On 01/11/2018 10:54 AM, Adhemerval Zanella wrote:
>>>> The Gnulib integer_length module has a faster implementation, at least for
>>>> 32-bit platforms. Do we still care about 32-bit platforms? If so, you
>>>> might want to take a look at it.
>>> Do you mean the version which uses double to integer, the one with 6
>>> comparisons
>>> or the naive one?
>>
>> I meant the one that converts int to double. It can be branchless since we
>> assume the int is nonzero.
> 
> Looking at glibc architectures (and architectures with recently proposed 
> ports):
> 
> * The following have clz patterns in GCC, unconditionally, meaning this 
> glibc patch will always use __builtin_clz functions and any fallback code 
> is irrelevant: aarch64 i386 ia64 powerpc tilegx x86_64.  (On ia64 the 
> pattern uses conversion to floating point.)
> 
> * The following have clz patterns in GCC, conditionally: alpha arm m68k 
> microblaze mips s390 sparc (and arc).  I have not checked whether in some 
> of those cases the conditions might in fact be true for every 
> configuration for which glibc can be built.
> 
> * The following lack clz patterns in GCC: hppa nios2 sh (and riscv).
> 
> If the configuration lacking clz is also soft-float, converting int to 
> double is an extremely inefficient way ending up calling the libgcc clz 
> implementation (both soft-fp and fp-bit use __builtin_clz).  I think 
> that's sufficient reason to avoid an approach involving conversion to 
> double unless an architecture has opted in to using it as an efficient 
> approach on that architecture.

Thanks for remind about soft-float, also for some architectures that does
have hardware floating pointer units the int to/from float is also a costly
operation.

Regarding index_{first,last}_ fallback implementation, maybe simpler 
implementation which just check the mask bits instead of fallback ones for
leading/trailing zero bit should better, I am open to suggestions here.

> 
> (For arm, for example, clz is supported if "TARGET_32BIT && arm_arch5", so 
> the only configurations without __builtin_clz expanded inline by the 
> compiler are armv4t ones - which are also all soft-float, so the expansion 
> using double can never make sense for arm.)
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 05/18] string: Improve generic strlen
  2018-01-11 17:21   ` Paul Eggert
@ 2018-01-12 17:59     ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-12 17:59 UTC (permalink / raw
  To: Paul Eggert, libc-alpha; +Cc: Richard Henderson



On 11/01/2018 15:21, Paul Eggert wrote:
> On 01/10/2018 04:47 AM, Adhemerval Zanella wrote:
>> +  /* Align pointer to sizeof op_t.  */
>> +  const uintptr_t s_int = (uintptr_t) str;
>> +  const op_t *word_ptr = (const op_t*) (s_int & -sizeof (op_t));
> 
> I see this sort of code used in multiple places (often with different implementations), and suggest packaging it up into an inline function. Also, you might consider implementing it this way, which is a bit simpler (it's the method used in your generic strcmp):
> 
>   op_t *
>   word_containing (char const *p)
>   {
>      return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
>   }
> 
> This generates the same code with gcc -O2, and minimizes the usage of pointers as integers.

Thanks, I have added this function suggestion to string-maskoff.h and used on
the generic implementations.

> 
> Similarly, in other code I suggest using char * instead of uintptr_t, as much as possible. This works just as well with ordinary addition, and will simplify GCC's job if we ever build with -fcheck-pointer-bounds.
> 
>> +  while (1)
>>       {
>> -      /* 64-bit version of the magic.  */
>> -      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>> -      himagic = ((himagic << 16) << 16) | himagic;
>> -      lomagic = ((lomagic << 16) << 16) | lomagic;
>> +      if (has_zero (word))
>> +    break;
>> +      word = *++word_ptr;
>>       }
> 
> This would be a bit simpler:
> 
>   while (! has_zero (word))
>     word = *++word_ptr;
> 

Thanks, I have applied it locally as well.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 13/18] hppa: Add memcopy.h
  2018-01-11 13:36   ` Joseph Myers
@ 2018-01-12 18:01     ` Adhemerval Zanella
  2018-01-12 18:17       ` Joseph Myers
  0 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-12 18:01 UTC (permalink / raw
  To: Joseph Myers; +Cc: libc-alpha, Richard Henderson



On 11/01/2018 11:36, Joseph Myers wrote:
> On Wed, 10 Jan 2018, Adhemerval Zanella wrote:
> 
>> +extern void link_error(void);
> 
> Missing space before '('.
> 
>> +static inline op_t
>> +MERGE(op_t w0, int shl, op_t w1, int shr)
> 
> Likewise.
> 
>> +  if (OPSIZ == 4)
>> +    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
>> +  else if (OPSIZ == 8)
>> +    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
>> +  else
>> +    link_error(), res = 0;
> 
> Likewise, after asm, and after link_error, and after strings in asm.
> 

I fixed all locally.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 13/18] hppa: Add memcopy.h
  2018-01-12 18:01     ` Adhemerval Zanella
@ 2018-01-12 18:17       ` Joseph Myers
  2018-01-12 18:37         ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Joseph Myers @ 2018-01-12 18:17 UTC (permalink / raw
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Fri, 12 Jan 2018, Adhemerval Zanella wrote:

> >> +  if (OPSIZ == 4)
> >> +    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
> >> +  else if (OPSIZ == 8)
> >> +    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
> >> +  else
> >> +    link_error(), res = 0;
> > 
> > Likewise, after asm, and after link_error, and after strings in asm.
> > 
> 
> I fixed all locally.

(Some of the other patches were also similarly missing spaces after asm 
constraint strings.)

-- 
Joseph S. Myers
joseph@codesourcery.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 13/18] hppa: Add memcopy.h
  2018-01-12 18:17       ` Joseph Myers
@ 2018-01-12 18:37         ` Adhemerval Zanella
  0 siblings, 0 replies; 45+ messages in thread
From: Adhemerval Zanella @ 2018-01-12 18:37 UTC (permalink / raw
  To: Joseph Myers; +Cc: libc-alpha, Richard Henderson



On 12/01/2018 16:17, Joseph Myers wrote:
> On Fri, 12 Jan 2018, Adhemerval Zanella wrote:
> 
>>>> +  if (OPSIZ == 4)
>>>> +    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
>>>> +  else if (OPSIZ == 8)
>>>> +    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
>>>> +  else
>>>> +    link_error(), res = 0;
>>>
>>> Likewise, after asm, and after link_error, and after strings in asm.
>>>
>>
>> I fixed all locally.
> 
> (Some of the other patches were also similarly missing spaces after asm 
> constraint strings.)
> 

Thanks, I will double check it (I used Richard's version unchanged).


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2018-01-12 18:35 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-01-10 12:47 [PATCH v3 00/18] Improve generic string routines Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 01/18] Parameterize op_t from memcopy.h Adhemerval Zanella
2018-01-11 13:27   ` Joseph Myers
2018-01-11 18:04     ` Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 02/18] Parameterize OP_T_THRES " Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 03/18] Add string-maskoff.h generic header Adhemerval Zanella
2018-01-10 23:24   ` Paul Eggert
2018-01-11 10:54     ` Adhemerval Zanella
2018-01-11 13:29   ` Joseph Myers
2018-01-11 17:57     ` Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 04/18] Add string vectorized find and detection functions Adhemerval Zanella
2018-01-11 13:34   ` Joseph Myers
2018-01-11 18:25     ` Adhemerval Zanella
2018-01-11 13:44   ` Luis Machado
2018-01-11 18:25     ` Adhemerval Zanella
2018-01-11 16:47   ` Paul Eggert
2018-01-11 18:54     ` Adhemerval Zanella
2018-01-12  1:08       ` Paul Eggert
2018-01-12 17:08         ` Joseph Myers
2018-01-12 17:58           ` Adhemerval Zanella
2018-01-12 13:30       ` Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 05/18] string: Improve generic strlen Adhemerval Zanella
2018-01-11 17:21   ` Paul Eggert
2018-01-12 17:59     ` Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 06/18] string: Improve generic memchr Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 07/18] string: Improve generic memrchr Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 08/18] string: Improve generic strnlen Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 09/18] string: Improve generic strchr Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 10/18] string: Improve generic strchrnul Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 11/18] string: Improve generic strcmp Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 12/18] string: Improve generic strcpy Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 13/18] hppa: Add memcopy.h Adhemerval Zanella
2018-01-11 13:36   ` Joseph Myers
2018-01-12 18:01     ` Adhemerval Zanella
2018-01-12 18:17       ` Joseph Myers
2018-01-12 18:37         ` Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 14/18] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
2018-01-10 12:47 ` [PATCH v3 15/18] alpha: " Adhemerval Zanella
2018-01-10 12:48 ` [PATCH v3 16/18] arm: Add string-fza.h Adhemerval Zanella
2018-01-10 12:48 ` [PATCH v3 17/18] powerpc: " Adhemerval Zanella
2018-01-10 12:56   ` Tulio Magno Quites Machado Filho
2018-01-10 12:48 ` [PATCH v3 18/18] sh: Add string-fzb.h Adhemerval Zanella
2018-01-10 22:30 ` [PATCH v3 00/18] Improve generic string routines Ondřej Bílka
2018-01-11 10:54   ` Adhemerval Zanella
2018-01-11 13:49     ` Joseph Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).