* [PATCH v3 0/7] Use introsort for qsort
@ 2021-09-03 17:11 Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

This is a respin of my previous attempt to fix and improve qsort() [1].
The main motivation for using introsort is to make qsort fully AS-safe
and AC-safe, with a bounded stack size requirement, to remove the use
of malloc (along with the system memory checks), and to keep the
worst-case performance at O(n*ln(n)) (instead of the potentially
quadratic behavior of quicksort).

The implementation does not aim to be a state-of-the-art sort
algorithm; rather, I used a well-understood and simple one (introsort,
as used in libstdc++) and leveraged the current quicksort
implementation along with a heapsort adapted from the Linux kernel.

The resulting implementation has a bounded stack requirement (about
1.8k on x86_64) and results in simpler code and a smaller combined
binary:

# master
$ size stdlib/qsort.o stdlib/msort.o
   text    data     bss     dec     hex filename
   1316       0       0    1316     524 stdlib/qsort.o
   2045       0      12    2057     809 stdlib/msort.o

# patched
$ size stdlib/qsort.os
   text    data     bss     dec     hex filename
   2373       0       0    2373     945 stdlib/qsort.os

The performance difference with the provided benchmark clearly shows
the tradeoff of using an introsort over a mergesort.  The results are
from a Ryzen 9 with GCC 11, run with --nelem 49152:786432:8388608 (to
simulate the L1/L2/L3 cache sizes of this CPU):

SORTED
  elements=    49152 (size= 4): %52.47% (master=3654381.0 patches=5571774.0)
  elements=   786432 (size= 4): %57.39% (master=73067211.0 patches=115003964.0)
  elements=  8388608 (size= 4): %49.71% (master=1005357374.0 patches=1505144332.0)
  elements=    49152 (size= 8): %48.98% (master=3796489.0 patches=5656023.0)
  elements=   786432 (size= 8): %52.34% (master=75809561.0 patches=115491584.0)
  elements=  8388608 (size= 8): %49.97% (master=1004201712.0 patches=1506017512.0)
  elements=    49152 (size=32): %-2.69% (master=11191649.0 patches=10891084.0)
  elements=   786432 (size=32): %-12.16% (master=238870746.0 patches=209822358.0)
  elements=  8388608 (size=32): %-23.33% (master=3323248933.0 patches=2547848399.0)
MOSTLYSORTED
  elements=    49152 (size= 4): %66.47% (master=6915979.0 patches=11512945.0)
  elements=   786432 (size= 4): %66.48% (master=143188170.0 patches=238379284.0)
  elements=  8388608 (size= 4): %70.90% (master=1805703842.0 patches=3086011963.0)
  elements=    49152 (size= 8): %78.11% (master=6836231.0 patches=12176330.0)
  elements=   786432 (size= 8): %77.79% (master=138851879.0 patches=246867806.0)
  elements=  8388608 (size= 8): %77.70% (master=1773884617.0 patches=3152112348.0)
  elements=    49152 (size=32): %-6.44% (master=13766992.0 patches=12880888.0)
  elements=   786432 (size=32): %-15.19% (master=287590899.0 patches=243919593.0)
  elements=  8388608 (size=32): %-25.63% (master=3852042869.0 patches=2864641676.0)
RANDOM
  elements=    49152 (size= 4): %9.54% (master=15078260.0 patches=16516457.0)
  elements=   786432 (size= 4): %5.64% (master=305805587.0 patches=323043450.0)
  elements=  8388608 (size= 4): %2.78% (master=3838340458.0 patches=3945202718.0)
  elements=    49152 (size= 8): %13.75% (master=14676586.0 patches=16694499.0)
  elements=   786432 (size= 8): %10.71% (master=295583560.0 patches=327252646.0)
  elements=  8388608 (size= 8): %7.67% (master=3722946692.0 patches=4008563091.0)
  elements=    49152 (size=32): %-8.34% (master=20944462.0 patches=19196696.0)
  elements=   786432 (size=32): %-14.31% (master=425822838.0 patches=364896701.0)
  elements=  8388608 (size=32): %-19.86% (master=5534217212.0 patches=4435276362.0)
REPEATED
  elements=    49152 (size= 4): %9.18% (master=14051610.0 patches=15341610.0)
  elements=   786432 (size= 4): %6.01% (master=282787457.0 patches=299777700.0)
  elements=  8388608 (size= 4): %4.19% (master=3539279291.0 patches=3687517777.0)
  elements=    49152 (size= 8): %12.37% (master=13659340.0 patches=15349336.0)
  elements=   786432 (size= 8): %11.11% (master=273070595.0 patches=303413116.0)
  elements=  8388608 (size= 8): %8.46% (master=3424739041.0 patches=3714521094.0)
  elements=    49152 (size=32): %-11.21% (master=19826230.0 patches=17603795.0)
  elements=   786432 (size=32): %-16.73% (master=405322605.0 patches=337501658.0)
  elements=  8388608 (size=32): %-21.93% (master=5261140329.0 patches=4107591456.0)
BITONIC
  elements=    49152 (size= 4): %405.43% (master=4632011.0 patches=23411611.0)
  elements=   786432 (size= 4): %493.28% (master=92649410.0 patches=549674361.0)
  elements=  8388608 (size= 4): %609.03% (master=1095591416.0 patches=7768102251.0)
  elements=    49152 (size= 8): %402.05% (master=4672695.0 patches=23459484.0)
  elements=   786432 (size= 8): %480.60% (master=94419035.0 patches=548197066.0)
  elements=  8388608 (size= 8): %596.41% (master=1116631167.0 patches=7776333333.0)
  elements=    49152 (size=32): %-6.92% (master=11545720.0 patches=10747100.0)
  elements=   786432 (size=32): %-14.97% (master=245037969.0 patches=208357373.0)
  elements=  8388608 (size=32): %-24.14% (master=3357712042.0 patches=2547221807.0)
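
(Each percentage is the relative time difference, i.e.
(patches - master) / master: for instance, for SORTED with 49152
4-byte elements, (5571774 - 3654381) / 3654381 is roughly 52.5%, so
the patched qsort is about 52% slower there; negative values mean the
patched version is faster.)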

Bitonic and sorted inputs take the biggest hit.  Even so, I think the
tradeoff is acceptable.  The improvement for element sizes larger
than 8 is mostly due to the optimization done on the swap operations.

[1] https://sourceware.org/pipermail/libc-alpha/2018-August/096981.html

Adhemerval Zanella (7):
  benchtests: Add bench-qsort
  support: Fix getopt_long with CMDLINE_OPTIONS
  stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  stdlib: Move insertion sort out qsort
  stdlib: qsort: Move some macros to inline function
  stdlib: Implement introsort with qsort
  stdlib: Remove use of mergesort on qsort (BZ #21719)

 benchtests/Makefile         |   2 +-
 benchtests/bench-qsort.c    | 195 +++++++++++++++++++++
 manual/argp.texi            |   2 +-
 manual/locale.texi          |   3 +-
 manual/search.texi          |   7 +-
 stdlib/Makefile             |   3 +-
 stdlib/msort.c              | 310 ---------------------------------
 stdlib/qsort.c              | 333 ++++++++++++++++++++++++++++--------
 support/support_test_main.c |   1 -
 support/test-driver.h       |   1 +
 10 files changed, 460 insertions(+), 397 deletions(-)
 create mode 100644 benchtests/bench-qsort.c
 delete mode 100644 stdlib/msort.c

-- 
2.30.2



* [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
  2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 2/7] support: Fix getopt_long with CMDLINE_OPTIONS Adhemerval Zanella via Libc-alpha
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

This patch adds a qsort benchmark.  I tried to model the benchmark
taking into consideration the possible input variation in internal
element size, element count, and internal state, for 1. real-world
cases and 2. possible scenarios based on hardware characteristics.

For 1. I tracked qsort usage (using a simple preload library to dump
its usage and a script to post-process it) on both a GCC bootstrap and
Firefox.  A GCC 8 bootstrap build shows 51786641 calls to qsort with
the following characteristics:

Key: number of elements:
key=2 : 39.74
key=3 : 19.23
key=4 : 9.77
key=1 : 8.44
key=0 : 6.60
key=5 : 4.40
key=7 : 2.37
key=6 : 2.25
key=9 : 1.48
key=8 : 0.97

Key: element size in bytes:
key=8 : 91.74
key=32 : 3.62
key=4 : 2.42
key=40 : 1.20
key=16 : 0.67
key=24 : 0.30
key=48 : 0.05
key=56 : 0.00
key=1 : 0.00

Key: total size (number of elements * element size):
key=16 : 35.98
key=24 : 18.67
key=32 : 9.79
key=8 : 8.28
key=0 : 6.60
key=40 : 4.21
key=64 : 3.15
key=48 : 2.24
key=56 : 2.15
key=80 : 1.45

So for GCC:

  - 80% of total qsort calls are done with 10 elements or less.
  - All calls use an element size of at most 56 bytes.
  - 90% of calls sort arrays with a total size of 80 bytes or less.

(Recent GCC versions use an internal qsort implementation instead of
the libc one.)
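
(The preload library itself is not part of this series; a minimal
sketch of the idea, with hypothetical file and symbol names, could
look like:

  /* qsort-trace.c: interpose qsort and log each call's shape.
     Build: gcc -shared -fPIC qsort-trace.c -o qsort-trace.so -ldl
     Run:   LD_PRELOAD=./qsort-trace.so <program>  */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>
  #include <stdlib.h>

  void
  qsort (void *base, size_t nmemb, size_t size,
         int (*cmp) (const void *, const void *))
  {
    static void (*real_qsort) (void *, size_t, size_t,
                               int (*) (const void *, const void *));
    if (real_qsort == NULL)
      real_qsort = (__typeof__ (real_qsort)) dlsym (RTLD_NEXT, "qsort");
    fprintf (stderr, "qsort: nmemb=%zu size=%zu\n", nmemb, size);
    real_qsort (base, nmemb, size, cmp);
  }

The dumped nmemb/size pairs are then aggregated by the post-processing
script into key histograms like the ones shown above.)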

The Firefox data covers 2 hours of usage, with the first 10 minutes
spent actively opening and closing different types of sites.  It
resulted in 21042 calls with the following characteristics:

Key: number of elements:
key=7 : 24.40
key=1 : 10.44
key=3 : 6.33
key=4 : 5.81
key=2 : 5.46
key=6 : 4.80
key=17 : 4.54
key=0 : 3.07
key=5 : 3.05
key=9 : 2.51
key=12 : 2.06

Key: element size in bytes:
key=8 : 94.49
key=28 : 4.40
key=2 : 0.70
key=16 : 0.19
key=36 : 0.07
key=12 : 0.07
key=40 : 0.07
key=24 : 0.03

Key: total size (number of elements * element size):
key=56 : 24.20
key=8 : 10.27
key=24 : 6.36
key=32 : 5.86
key=16 : 5.46
key=48 : 4.80
key=476 : 3.75
key=0 : 3.07
key=40 : 3.05
key=72 : 2.50

So for Firefox:

  - 72% of total qsort calls are done with 18 elements or less.
  - All calls use an element size of at most 40 bytes.
  - 70% of calls sort arrays with a total size of 476 bytes or less.

For 2. I used the idea of a machine with 3 levels of cache, with sizes
of 32 KB for L1, 256 KB for L2, and 4096 KB for L3.

It resulted in a benchmark with following traits:

  * It checks five types of input arrays: sorted, mostly sorted,
    random, repeated, and bitonic:

    - 'sorted': the array is already sorted.
    - 'mostly sorted': the array will have a certain number of random
       positions with random values (the current ratio used is 20%).
    - 'random': the array will contain random elements.
    - 'repeated': the array will have random elements with a certain
       number (current ratio is 20%) of a repeated element distributed
       randomly.
    - 'bitonic': elements will be strictly increasing up to the middle
       of the array, then strictly decreasing.  This is meant to
       trigger the pattern-defeating input for the median-of-3 pivot
       selection used in the quicksort implementation (see the sketch
       after this list).

  * Three element sizes are checked: uint32_t, uint64_t, and an
    element of 32 bytes (but using the uint32_t comparison check).
    These element sizes are used 1. to avoid including the comparison
    function itself and/or the memory copy in the sort benchmark
    itself, and 2. because keys of size_t are the most common for both
    GCC and Firefox.

  * Four different element counts: 64 (which covers most of the used
    element counts for both GCC and Firefox), 4096/8192 (which covers
    a 32 KB L1 for 32- and 64-bit elements), 32768/65536 (a 256 KB
    L2), and 524288/1048576 (a 4 MB L3).  The counts are configurable
    with the --nelem option.
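
As an illustration, the bitonic input could be generated along these
lines (a sketch only; the actual input generation lives in the shared
stdlib/tst-qsort-common.c helpers the benchmark includes):

  #include <stddef.h>
  #include <stdint.h>

  /* Fill BUF with a bitonic pattern: strictly increasing up to the
     middle of the array, then strictly decreasing.  */
  static void
  fill_bitonic (uint32_t *buf, size_t n)
  {
    size_t i;
    for (i = 0; i < n / 2; i++)
      buf[i] = i;                /* Ascending half.  */
    for (; i < n; i++)
      buf[i] = n - i;            /* Descending half.  */
  }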

Checked on x86_64-linux-gnu.
---
 benchtests/Makefile      |   2 +-
 benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-qsort.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 1530939a8c..781cbca8af 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
 include ../gen-locales.mk
 endif
 
-stdlib-benchset := strtod
+stdlib-benchset := strtod qsort
 
 stdio-common-benchset := sprintf
 
diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
new file mode 100644
index 0000000000..905aa6d7b6
--- /dev/null
+++ b/benchtests/bench-qsort.c
@@ -0,0 +1,195 @@
+/* Measure qsort function.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <array_length.h>
+#include <bench-timing.h>
+#include <errno.h>
+#include <getopt.h>
+#include <json-lib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <stdlib/tst-qsort-common.c>
+
+/* Default numbers of elements, per element type, to be measured.  */
+static const size_t default_nelems[] =
+{
+  256/sizeof(size_t),       /* 64/128, which covers the most common
+			       element counts in a GCC build.  */
+  32768/sizeof(size_t),	    /* 4096/8192 to fit on a L1 with 32 KB.  */
+  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
+  4194304/sizeof(size_t),   /* 524288/1048576 to fit on a L3 with 4 MB.  */
+};
+
+#define OPT_NELEM 10000
+#define OPT_SEED  10001
+#define CMDLINE_OPTIONS \
+  { "nelem", required_argument, NULL, OPT_NELEM }, \
+  { "seed", required_argument, NULL, OPT_SEED },
+
+static const size_t max_nelem = 16;
+static size_t *elems = NULL;
+static size_t nelem = 0;
+static uint64_t seed = 0;
+static bool seed_set = false;
+
+static void
+cmdline_process_function (int c)
+{
+  switch (c)
+    {
+      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
+	 The element counts are passed with ':' as the delimiter; for
+	 instance --nelem 32:128:1024 will run 32, 128, and 1024 elements.  */
+      case OPT_NELEM:
+        {
+	  elems = xmalloc (max_nelem * sizeof (size_t));
+	  nelem = 0;
+
+	  char *saveptr;
+	  char *token;
+	  token = strtok_r (optarg, ":", &saveptr);
+	  if (token == NULL)
+	    {
+	      printf ("error: invalid --nelem value\n");
+	      exit (EXIT_FAILURE);
+	    }
+	  do
+	    {
+	      if (nelem == max_nelem)
+		{
+		  printf ("error: invalid --nelem value (max elem)\n");
+		  exit (EXIT_FAILURE);
+		}
+	      elems[nelem++] = atol (token);
+	      token = strtok_r (saveptr, ":", &saveptr);
+	    } while (token != NULL);
+        }
+      break;
+
+      /* handle the --seed option to use a different seed than a random one.
+	 The SEED used should be a uint64_t number.  */
+      case OPT_SEED:
+	{
+	  uint64_t value = strtoull (optarg, NULL, 0);
+	  if (errno == ERANGE || value > UINT64_MAX)
+	    {
+	      printf ("error: seed should be a value in range of "
+		      "[0, UINT64_MAX]\n");
+	      exit (EXIT_FAILURE);
+	    }
+	  seed = value;
+	  seed_set = true;
+	}
+    }
+}
+
+#define CMDLINE_PROCESS cmdline_process_function
+
+static int
+do_test (void)
+{
+  random_init ();
+
+
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+  json_attr_object_begin (&json_ctx, "qsort");
+  json_attr_uint (&json_ctx, "seed", seed);
+
+  json_array_begin (&json_ctx, "results");
+
+  const size_t *welem = elems == NULL ? default_nelems : elems;
+  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
+
+  struct test_t
+  {
+    size_t type_size;
+    cmpfunc_t cmpfunc;
+  };
+  static const struct test_t tests[] =
+  {
+    { sizeof (uint32_t), uint32_t_cmp },
+    { sizeof (uint64_t), uint64_t_cmp },
+    /* Test swap with large elements.  */
+    { 32,                uint32_t_cmp },
+  };
+
+  const size_t inner_loop_iters = 16;
+  for (const struct test_t *test = tests; test < array_end (tests); ++test)
+    {
+      for (const struct array_t *arraytype = arraytypes;
+	   arraytype < array_end (arraytypes);
+	   ++arraytype)
+	{
+	  for (int k = 0; k < wnelem; k++)
+	    {
+	      json_element_object_begin (&json_ctx);
+
+	      size_t arraysize = welem[k] * test->type_size;
+
+	      json_attr_uint (&json_ctx, "nmemb", welem[k]);
+	      json_attr_uint (&json_ctx, "type_size", test->type_size);
+	      json_attr_string (&json_ctx, "property", arraytype->name);
+
+	      void *base = create_array (welem[k], test->type_size,
+					 arraytype->type);
+	      void *work = xmalloc (arraysize);
+
+	      timing_t total = 0;
+
+	      for (int n = 0; n < inner_loop_iters; n++)
+	        {
+		  memcpy (work, base, arraysize);
+
+	          timing_t start, end, diff;
+	          TIMING_NOW (start);
+	          qsort (work, welem[k], test->type_size, test->cmpfunc);
+	          TIMING_NOW (end);
+
+	          TIMING_DIFF (diff, start, end);
+	          TIMING_ACCUM (total, diff);
+	        }
+
+	     json_attr_uint (&json_ctx, "timings",
+			     (double) total / (double) inner_loop_iters);
+	     json_element_object_end (&json_ctx);
+
+	     free (base);
+	     free (work);
+	   }
+        }
+    }
+
+  json_array_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+  json_document_end (&json_ctx);
+
+  return 0;
+}
+
+#define TIMEOUT 600
+#include <support/test-driver.c>
-- 
2.30.2



* [PATCH v3 2/7] support: Fix getopt_long with CMDLINE_OPTIONS
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305) Adhemerval Zanella via Libc-alpha
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha
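
Presumably the motivation: getopt_long requires the long-options array
to be terminated by a zero-filled entry, and moving the terminator into
TEST_DEFAULT_OPTIONS itself keeps option arrays that combine
CMDLINE_OPTIONS with the default options properly terminated.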

---
 support/support_test_main.c | 1 -
 support/test-driver.h       | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/support/support_test_main.c b/support/support_test_main.c
index 07e3cdd173..ab1a475192 100644
--- a/support/support_test_main.c
+++ b/support/support_test_main.c
@@ -42,7 +42,6 @@
 static const struct option default_options[] =
 {
   TEST_DEFAULT_OPTIONS
-  { NULL, 0, NULL, 0 }
 };
 
 /* Show people how to run the program.  */
diff --git a/support/test-driver.h b/support/test-driver.h
index 8d4f38275d..3e54814a03 100644
--- a/support/test-driver.h
+++ b/support/test-driver.h
@@ -60,6 +60,7 @@ enum
   { "verbose", no_argument, NULL, 'v' },                \
   { "direct", no_argument, NULL, OPT_DIRECT },          \
   { "test-dir", required_argument, NULL, OPT_TESTDIR }, \
+  { NULL, 0, NULL, 0 }
 
 /* The directory the test should use for temporary files.  */
 extern const char *test_dir;
-- 
2.30.2



* [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 2/7] support: Fix getopt_long with CMDLINE_OPTIONS Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 4/7] stdlib: Move insertion sort out qsort Adhemerval Zanella via Libc-alpha
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

The optimization takes into consideration both that the most common
element sizes are either 32 or 64 bits [1] and that inputs are aligned
to the word boundary.  This is similar to the optimization done in
lib/sort.c from Linux.

This patch adds an optimized swap operation to qsort based on the
previous msort one.  Instead of a byte-wise operation, three variants
are provided:

  1. Using uint32_t loads and stores.
  2. Using uint64_t loads and stores.
  3. A generic one using a temporary buffer and memcpy/mempcpy.

Variants 1. and 2. are selected only if the architecture defines
_STRING_ARCH_unaligned or if the base pointer is aligned to the
required type.

It also fixes BZ#19305 by checking the element count against 1 in
addition to 0.

Checked on x86_64-linux-gnu.

[1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
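
For illustration, given the selection logic below (the declarations
here are made up):

  uint64_t a[16];                   /* size 8, 8-byte aligned -> SWAP_WORDS_64 */
  struct { uint32_t x[3]; } b[16];  /* size 12, 4-byte aligned -> SWAP_WORDS_32 */
  char c[16][7];                    /* size 7 -> SWAP_BYTES fallback */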
---
 stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 91 insertions(+), 18 deletions(-)

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 23f2d28314..59458d151b 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -24,20 +24,85 @@
 #include <limits.h>
 #include <stdlib.h>
 #include <string.h>
+#include <stdbool.h>
 
-/* Byte-wise swap two items of size SIZE. */
-#define SWAP(a, b, size)						      \
-  do									      \
-    {									      \
-      size_t __size = (size);						      \
-      char *__a = (a), *__b = (b);					      \
-      do								      \
-	{								      \
-	  char __tmp = *__a;						      \
-	  *__a++ = *__b;						      \
-	  *__b++ = __tmp;						      \
-	} while (--__size > 0);						      \
-    } while (0)
+/* Swap SIZE bytes between addresses A and B.  These helpers are provided
+   alongside the generic one as an optimization.  */
+
+typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
+
+/* Return true if elements can be copied using word loads and stores.
+   The size must be a multiple of the alignment, as must the base address.  */
+static inline bool
+is_aligned_to_copy (const void *base, size_t size, size_t align)
+{
+  unsigned char lsbits = size;
+#if !_STRING_ARCH_unaligned
+  lsbits |= (unsigned char)(uintptr_t) base;
+#endif
+  return (lsbits & (align - 1)) == 0;
+}
+
+#define SWAP_WORDS_64 (swap_func_t)0
+#define SWAP_WORDS_32 (swap_func_t)1
+#define SWAP_BYTES    (swap_func_t)2
+
+static void
+swap_words_64 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+   {
+     n -= 8;
+     uint64_t t = *(uint64_t *)(a + n);
+     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
+     *(uint64_t *)(b + n) = t;
+   } while (n);
+}
+
+static void
+swap_words_32 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+   {
+     n -= 4;
+     uint32_t t = *(uint32_t *)(a + n);
+     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
+     *(uint32_t *)(b + n) = t;
+   } while (n);
+}
+
+static void
+swap_bytes (void * restrict a, void * restrict b, size_t n)
+{
+  /* Use multiple small memcpys with constant size to enable inlining
+     on most targets.  */
+  enum { SWAP_GENERIC_SIZE = 32 };
+  unsigned char tmp[SWAP_GENERIC_SIZE];
+  while (n > SWAP_GENERIC_SIZE)
+    {
+      memcpy (tmp, a, SWAP_GENERIC_SIZE);
+      a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
+      b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
+      n -= SWAP_GENERIC_SIZE;
+    }
+  memcpy (tmp, a, n);
+  memcpy (a, b, n);
+  memcpy (b, tmp, n);
+}
+
+/* Replace the indirect call with a series of if statements.  It should help
+   the branch predictor.  */
+static void
+do_swap (void * restrict a, void * restrict b, size_t size,
+	 swap_func_t swap_func)
+{
+  if (swap_func == SWAP_WORDS_64)
+    swap_words_64 (a, b, size);
+  else if (swap_func == SWAP_WORDS_32)
+    swap_words_32 (a, b, size);
+  else
+    swap_bytes (a, b, size);
+}
 
 /* Discontinue quicksort algorithm when partition gets below this size.
    This particular magic number was chosen to work best on a Sun 4/260. */
@@ -97,6 +162,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
     /* Avoid lossage with unsigned arithmetic below.  */
     return;
 
+  swap_func_t swap_func;
+  if (is_aligned_to_copy (pbase, size, 8))
+    swap_func = SWAP_WORDS_64;
+  else if (is_aligned_to_copy (pbase, size, 4))
+    swap_func = SWAP_WORDS_32;
+  else
+    swap_func = SWAP_BYTES;
+
   if (total_elems > MAX_THRESH)
     {
       char *lo = base_ptr;
@@ -120,13 +193,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 	  char *mid = lo + size * ((hi - lo) / size >> 1);
 
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_func);
 	  if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
-	    SWAP (mid, hi, size);
+	    do_swap (mid, hi, size, swap_func);
 	  else
 	    goto jump_over;
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_func);
 	jump_over:;
 
 	  left_ptr  = lo + size;
@@ -145,7 +218,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 
 	      if (left_ptr < right_ptr)
 		{
-		  SWAP (left_ptr, right_ptr, size);
+		  do_swap (left_ptr, right_ptr, size, swap_func);
 		  if (mid == left_ptr)
 		    mid = right_ptr;
 		  else if (mid == right_ptr)
@@ -217,7 +290,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
         tmp_ptr = run_ptr;
 
     if (tmp_ptr != base_ptr)
-      SWAP (tmp_ptr, base_ptr, size);
+      do_swap (tmp_ptr, base_ptr, size, swap_func);
 
     /* Insertion sort, running from left-hand-side up to right-hand-side.  */
 
-- 
2.30.2



* [PATCH v3 4/7] stdlib: Move insertion sort out qsort
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
                   ` (2 preceding siblings ...)
  2021-09-03 17:11 ` [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305) Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-06 20:35   ` Fangrui Song via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 5/7] stdlib: qsort: Move some macros to inline function Adhemerval Zanella via Libc-alpha
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha
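
No functional change intended: this factors the final insertion sort
pass of _quicksort out into a separate insertion_sort function.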

---
 stdlib/qsort.c | 100 ++++++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 47 deletions(-)

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 59458d151b..b69417dedd 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -150,6 +150,58 @@ typedef struct
       smaller partition.  This *guarantees* no more than log (total_elems)
       stack size is needed (actually O(1) in this case)!  */
 
+static void
+insertion_sort (void *const pbase, size_t total_elems, size_t size,
+                swap_func_t swap_func,
+	        __compar_d_fn_t cmp, void *arg)
+{
+  char *base_ptr = (char *) pbase;
+  char *const end_ptr = &base_ptr[size * (total_elems - 1)];
+  char *tmp_ptr = base_ptr;
+#define min(x, y) ((x) < (y) ? (x) : (y))
+  const size_t max_thresh = MAX_THRESH * size;
+  char *thresh = min(end_ptr, base_ptr + max_thresh);
+  char *run_ptr;
+
+  /* Find smallest element in first threshold and place it at the
+     array's beginning.  This is the smallest array element,
+     and the operation speeds up insertion sort's inner loop. */
+
+  for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
+    if (cmp (run_ptr, tmp_ptr, arg) < 0)
+      tmp_ptr = run_ptr;
+
+  if (tmp_ptr != base_ptr)
+    do_swap (tmp_ptr, base_ptr, size, swap_func);
+
+  /* Insertion sort, running from left-hand-side up to right-hand-side.  */
+
+  run_ptr = base_ptr + size;
+  while ((run_ptr += size) <= end_ptr)
+    {
+      tmp_ptr = run_ptr - size;
+      while (cmp (run_ptr, tmp_ptr, arg) < 0)
+        tmp_ptr -= size;
+
+      tmp_ptr += size;
+      if (tmp_ptr != run_ptr)
+        {
+          char *trav;
+
+          trav = run_ptr + size;
+          while (--trav >= run_ptr)
+            {
+              char c = *trav;
+              char *hi, *lo;
+
+              for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
+                *hi = *lo;
+              *hi = c;
+            }
+        }
+    }
+}
+
 void
 _quicksort (void *const pbase, size_t total_elems, size_t size,
 	    __compar_d_fn_t cmp, void *arg)
@@ -272,51 +324,5 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
      for partitions below MAX_THRESH size. BASE_PTR points to the beginning
      of the array to sort, and END_PTR points at the very last element in
      the array (*not* one beyond it!). */
-
-#define min(x, y) ((x) < (y) ? (x) : (y))
-
-  {
-    char *const end_ptr = &base_ptr[size * (total_elems - 1)];
-    char *tmp_ptr = base_ptr;
-    char *thresh = min(end_ptr, base_ptr + max_thresh);
-    char *run_ptr;
-
-    /* Find smallest element in first threshold and place it at the
-       array's beginning.  This is the smallest array element,
-       and the operation speeds up insertion sort's inner loop. */
-
-    for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
-      if ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
-        tmp_ptr = run_ptr;
-
-    if (tmp_ptr != base_ptr)
-      do_swap (tmp_ptr, base_ptr, size, swap_func);
-
-    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
-
-    run_ptr = base_ptr + size;
-    while ((run_ptr += size) <= end_ptr)
-      {
-	tmp_ptr = run_ptr - size;
-	while ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
-	  tmp_ptr -= size;
-
-	tmp_ptr += size;
-        if (tmp_ptr != run_ptr)
-          {
-            char *trav;
-
-	    trav = run_ptr + size;
-	    while (--trav >= run_ptr)
-              {
-                char c = *trav;
-                char *hi, *lo;
-
-                for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
-                  *hi = *lo;
-                *hi = c;
-              }
-          }
-      }
-  }
+  insertion_sort (pbase, total_elems, size, swap_func, cmp, arg);
 }
-- 
2.30.2



* [PATCH v3 5/7] stdlib: qsort: Move some macros to inline function
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
                   ` (3 preceding siblings ...)
  2021-09-03 17:11 ` [PATCH v3 4/7] stdlib: Move insertion sort out qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

Change the macros PUSH and POP into inline functions and remove the
macro STACK_NOT_EMPTY.
---
 stdlib/qsort.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index b69417dedd..5df640362d 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -115,15 +115,28 @@ typedef struct
     char *hi;
   } stack_node;
 
-/* The next 4 #defines implement a very fast in-line stack abstraction. */
 /* The stack needs log (total_elements) entries (we could even subtract
    log(MAX_THRESH)).  Since total_elements has type size_t, we get as
    upper bound for log (total_elements):
    bits per byte (CHAR_BIT) * sizeof(size_t).  */
-#define STACK_SIZE	(CHAR_BIT * sizeof (size_t))
-#define PUSH(low, high)	((void) ((top->lo = (low)), (top->hi = (high)), ++top))
-#define	POP(low, high)	((void) (--top, (low = top->lo), (high = top->hi)))
-#define	STACK_NOT_EMPTY	(stack < top)
+enum { STACK_SIZE = CHAR_BIT * sizeof (size_t) };
+
+static inline stack_node *
+push (stack_node *top, char *lo, char *hi)
+{
+  top->lo = lo;
+  top->hi = hi;
+  return ++top;
+}
+
+static inline stack_node *
+pop (stack_node *top, char **lo, char **hi)
+{
+  --top;
+  *lo = top->lo;
+  *hi = top->hi;
+  return top;
+}
 
 
 /* Order size using quicksort.  This implementation incorporates
@@ -229,9 +242,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
       stack_node stack[STACK_SIZE];
       stack_node *top = stack;
 
-      PUSH (NULL, NULL);
+      top = push (top, NULL, NULL);
 
-      while (STACK_NOT_EMPTY)
+      while (stack < top)
         {
           char *left_ptr;
           char *right_ptr;
@@ -296,7 +309,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
             {
               if ((size_t) (hi - left_ptr) <= max_thresh)
 		/* Ignore both small partitions. */
-                POP (lo, hi);
+		top = pop (top, &lo, &hi);
               else
 		/* Ignore small left partition. */
                 lo = left_ptr;
@@ -307,13 +320,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
           else if ((right_ptr - lo) > (hi - left_ptr))
             {
 	      /* Push larger left partition indices. */
-              PUSH (lo, right_ptr);
+	      top = push (top, lo, right_ptr);
               lo = left_ptr;
             }
           else
             {
 	      /* Push larger right partition indices. */
-              PUSH (left_ptr, hi);
+	      top = push (top, left_ptr, hi);
               hi = right_ptr;
             }
         }
-- 
2.30.2



* [PATCH v3 6/7] stdlib: Implement introsort with qsort
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
                   ` (4 preceding siblings ...)
  2021-09-03 17:11 ` [PATCH v3 5/7] stdlib: qsort: Move some macros to inline function Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
                     ` (2 more replies)
  2021-09-03 17:11 ` [PATCH v3 7/7] stdlib: Remove use of mergesort on qsort (BZ #21719) Adhemerval Zanella via Libc-alpha
  2021-09-03 19:18 ` [PATCH v3 0/7] Use introsort for qsort Paul Eggert
  7 siblings, 3 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

This patch adds an introsort implementation to qsort to bound the
worst-case performance of quicksort at O(n log n).  The fallback used
is a heapsort based on the Linux implementation (commit
22a241ccb2c19962a).  As a side note, the introsort implementation is
similar to the one used in libstdc++ for std::sort.
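
(The depth limit is twice the floor of log2 of the element count: for
instance, for 1024 elements quicksort may partition through at most
2 * 10 = 20 levels before heapsort takes over that partition.)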

Checked on x86_64-linux-gnu.
---
 stdlib/qsort.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 7 deletions(-)

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 5df640362d..8368576aae 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -113,6 +113,7 @@ typedef struct
   {
     char *lo;
     char *hi;
+    size_t depth;
   } stack_node;
 
 /* The stack needs log (total_elements) entries (we could even subtract
@@ -122,23 +123,92 @@ typedef struct
 enum { STACK_SIZE = CHAR_BIT * sizeof (size_t) };
 
 static inline stack_node *
-push (stack_node *top, char *lo, char *hi)
+push (stack_node *top, char *lo, char *hi, size_t depth)
 {
   top->lo = lo;
   top->hi = hi;
+  top->depth = depth;
   return ++top;
 }
 
 static inline stack_node *
-pop (stack_node *top, char **lo, char **hi)
+pop (stack_node *top, char **lo, char **hi, size_t *depth)
 {
   --top;
   *lo = top->lo;
   *hi = top->hi;
+  *depth = top->depth;
   return top;
 }
 
 
+/* A fast, small, non-recursive O(n log n) heapsort, adapted from Linux
+   lib/sort.c.  Used in the introsort implementation as a fallback routine,
+   with worst-case time complexity of O(n log n) and worst-case space
+   complexity of O(1).  */
+
+static inline size_t
+parent (size_t i, unsigned int lsbit, size_t size)
+{
+  i -= size;
+  i -= size & -(i & lsbit);
+  return i / 2;
+}
+
+static void
+heapsort_r (void *base, void *end, size_t size, swap_func_t swap_func,
+	    __compar_d_fn_t cmp, void *arg)
+{
+  size_t num = ((uintptr_t) end - (uintptr_t) base) / size;
+  size_t n = num * size, a = (num/2) * size;
+  /* Used to find parent  */
+  const unsigned int lsbit = size & -size;
+
+  /* num < 2 || size == 0.  */
+  if (a == 0)
+    return;
+
+  for (;;)
+    {
+      size_t b, c, d;
+
+      if (a != 0)
+	/* Building heap: sift down --a */
+	a -= size;
+      else if (n -= size)
+	/* Sorting: Extract root to --n */
+	do_swap (base, base + n, size, swap_func);
+      else
+	break;
+
+      /* Sift element at "a" down into heap.  This is the "bottom-up" variant,
+	 which significantly reduces calls to cmp_func(): we find the sift-down
+	 path all the way to the leaves (one compare per level), then backtrack
+	 to find where to insert the target element.
+
+	 Because elements tend to sift down close to the leaves, this uses fewer
+	 compares than doing two per level on the way down.  (A bit more than
+	 half as many on average, 3/4 worst-case.)  */
+      for (b = a; c = 2 * b + size, (d = c + size) < n;)
+	b = cmp (base + c, base + d, arg) >= 0 ? c : d;
+      if (d == n)
+	/* Special case last leaf with no sibling.  */
+	b = c;
+
+      /* Now backtrack from "b" to the correct location for "a".  */
+      while (b != a && cmp (base + a, base + b, arg) >= 0)
+	b = parent (b, lsbit, size);
+      /* Where "a" belongs.  */
+      c = b;
+      while (b != a)
+	{
+	  /* Shift it into place.  */
+	  b = parent (b, lsbit, size);
+          do_swap (base + b, base + c, size, swap_func);
+        }
+    }
+}
+
 /* Order size using quicksort.  This implementation incorporates
    four optimizations discussed in Sedgewick:
 
@@ -223,7 +293,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 
   const size_t max_thresh = MAX_THRESH * size;
 
-  if (total_elems == 0)
+  if (total_elems <= 1)
     /* Avoid lossage with unsigned arithmetic below.  */
     return;
 
@@ -235,6 +305,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
   else
     swap_func = SWAP_BYTES;
 
+  /* Maximum depth before quicksort switches to heapsort.  */
+  size_t depth = 2 * (CHAR_BIT - 1 - __builtin_clzl (total_elems));
+
   if (total_elems > MAX_THRESH)
     {
       char *lo = base_ptr;
@@ -242,10 +315,17 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
       stack_node stack[STACK_SIZE];
       stack_node *top = stack;
 
-      top = push (top, NULL, NULL);
+      top = push (top, NULL, NULL, depth);
 
       while (stack < top)
         {
+	  if (depth == 0)
+	    {
+	      heapsort_r (lo, hi, size, swap_func, cmp, arg);
+              top = pop (top, &lo, &hi, &depth);
+	      continue;
+	    }
+
           char *left_ptr;
           char *right_ptr;
 
@@ -309,7 +389,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
             {
               if ((size_t) (hi - left_ptr) <= max_thresh)
 		/* Ignore both small partitions. */
-		top = pop (top, &lo, &hi);
+		top = pop (top, &lo, &hi, &depth);
               else
 		/* Ignore small left partition. */
                 lo = left_ptr;
@@ -320,13 +400,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
           else if ((right_ptr - lo) > (hi - left_ptr))
             {
 	      /* Push larger left partition indices. */
-	      top = push (top, lo, right_ptr);
+	      top = push (top, lo, right_ptr, depth - 1);
               lo = left_ptr;
             }
           else
             {
 	      /* Push larger right partition indices. */
-	      top = push (top, left_ptr, hi);
+	      top = push (top, left_ptr, hi, depth - 1);
               hi = right_ptr;
             }
         }
-- 
2.30.2



* [PATCH v3 7/7] stdlib: Remove use of mergesort on qsort (BZ #21719)
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
                   ` (5 preceding siblings ...)
  2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-03 17:11 ` Adhemerval Zanella via Libc-alpha
  2021-09-03 19:18 ` [PATCH v3 0/7] Use introsort for qsort Paul Eggert
  7 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-03 17:11 UTC (permalink / raw)
  To: libc-alpha

This patch removes the mergesort optimization from the qsort
implementation and uses the quicksort instead.  The mergesort
implementation has some issues:

  - It is AS-safe only for certain input sizes (a total size of less
    than 1 KB; large element sizes also force memory allocation),
    which contradicts the function documentation.  Although not
    required by the C standard, it is preferable and doable to have an
    O(1) space implementation.

  - The malloc for certain element sizes and element counts adds
    arbitrary latency (which might be even worse if malloc is
    interposed).

  - To avoid triggering swap from its memory allocation, the
    implementation relies on system information that might be
    virtualized (for instance on VMs with memory overcommit), which
    might lead to use of swap even if the system advertises more
    memory than it actually has.  The check also has the downside of
    issuing syscalls where none are expected (although only once per
    execution).

  - The mergesort is suboptimal on an already sorted array (BZ#21719).

The quicksort implementation is already optimized to use constant
extra space (due to the limit on the total number of elements imposed
by the maximum VM size) and thus can be used to avoid the malloc usage
issues.
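
(Concretely, the quicksort stack is CHAR_BIT * sizeof (size_t) = 64
entries on 64-bit targets; with three 8-byte fields per stack_node
that is 64 * 24 = 1536 bytes, in line with the roughly 1.8k figure
quoted in the cover letter.)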

The resulting performance is slower due to the use of quicksort,
especially in the worst-case scenarios (partially sorted or fully
sorted arrays), and because the mergesort used slightly better swap
operations.

This change also renders the BZ#21719 fix unnecessary (since that fix
was meant to address the sorted-input performance degradation of
mergesort).  The manual is also updated to indicate the function is
now async-cancel safe.

Checked on x86_64-linux-gnu.
---
 manual/argp.texi   |   2 +-
 manual/locale.texi |   3 +-
 manual/search.texi |   7 +-
 stdlib/Makefile    |   3 +-
 stdlib/msort.c     | 310 ---------------------------------------------
 stdlib/qsort.c     |  15 ++-
 6 files changed, 18 insertions(+), 322 deletions(-)
 delete mode 100644 stdlib/msort.c

diff --git a/manual/argp.texi b/manual/argp.texi
index 0023441812..b77ad68285 100644
--- a/manual/argp.texi
+++ b/manual/argp.texi
@@ -735,7 +735,7 @@ for options, bad phase of the moon, etc.
 @c  hol_set_group ok
 @c   hol_find_entry ok
 @c  hol_sort @mtslocale @acucorrupt
-@c   qsort dup @acucorrupt
+@c   qsort dup
 @c    hol_entry_qcmp @mtslocale
 @c     hol_entry_cmp @mtslocale
 @c      group_cmp ok
diff --git a/manual/locale.texi b/manual/locale.texi
index 720e0ca952..f6afa5dc44 100644
--- a/manual/locale.texi
+++ b/manual/locale.texi
@@ -253,7 +253,7 @@ The symbols in this section are defined in the header file @file{locale.h}.
 @c    calculate_head_size ok
 @c    __munmap ok
 @c    compute_hashval ok
-@c    qsort dup @acucorrupt
+@c    qsort dup
 @c     rangecmp ok
 @c    malloc @ascuheap @acsmem
 @c    strdup @ascuheap @acsmem
@@ -275,7 +275,6 @@ The symbols in this section are defined in the header file @file{locale.h}.
 @c      realloc @ascuheap @acsmem
 @c     realloc @ascuheap @acsmem
 @c     fclose @ascuheap @asulock @acsmem @acsfd @aculock
-@c     qsort @ascuheap @acsmem
 @c      alias_compare dup
 @c    libc_lock_unlock @aculock
 @c   _nl_explode_name @ascuheap @acsmem
diff --git a/manual/search.texi b/manual/search.texi
index 5691bf2f2b..a550858478 100644
--- a/manual/search.texi
+++ b/manual/search.texi
@@ -159,7 +159,7 @@ To sort an array using an arbitrary comparison function, use the
 
 @deftypefun void qsort (void *@var{array}, size_t @var{count}, size_t @var{size}, comparison_fn_t @var{compare})
 @standards{ISO, stdlib.h}
-@safety{@prelim{}@mtsafe{}@assafe{}@acunsafe{@acucorrupt{}}}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
 The @code{qsort} function sorts the array @var{array}.  The array
 contains @var{count} elements, each of which is of size @var{size}.
 
@@ -199,9 +199,8 @@ Functions}):
 The @code{qsort} function derives its name from the fact that it was
 originally implemented using the ``quick sort'' algorithm.
 
-The implementation of @code{qsort} in this library might not be an
-in-place sort and might thereby use an extra amount of memory to store
-the array.
+The implementation of @code{qsort} in this library is an in-place sort
+and uses a constant extra space (allocated on the stack).
 @end deftypefun
 
 @node Search/Sort Example
diff --git a/stdlib/Makefile b/stdlib/Makefile
index ef8a9007d8..27aa36eaff 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -34,7 +34,7 @@ headers	:= stdlib.h bits/stdlib.h bits/stdlib-ldbl.h bits/stdlib-float.h      \
 routines	:=							      \
 	atof atoi atol atoll						      \
 	abort								      \
-	bsearch qsort msort						      \
+	bsearch qsort							      \
 	getenv putenv setenv secure-getenv				      \
 	exit on_exit atexit cxa_atexit cxa_finalize old_atexit		      \
 	quick_exit at_quick_exit cxa_at_quick_exit cxa_thread_atexit_impl     \
@@ -143,7 +143,6 @@ extra-test-objs += tst-putenvmod.os
 generated += isomac isomac.out tst-putenvmod.so
 
 CFLAGS-bsearch.c += $(uses-callbacks)
-CFLAGS-msort.c += $(uses-callbacks)
 CFLAGS-qsort.c += $(uses-callbacks)
 CFLAGS-system.c += -fexceptions
 CFLAGS-system.os = -fomit-frame-pointer
diff --git a/stdlib/msort.c b/stdlib/msort.c
deleted file mode 100644
index 8750cc59db..0000000000
--- a/stdlib/msort.c
+++ /dev/null
@@ -1,310 +0,0 @@
-/* An alternative to qsort, with an identical interface.
-   This file is part of the GNU C Library.
-   Copyright (C) 1992-2021 Free Software Foundation, Inc.
-   Written by Mike Haertel, September 1988.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <https://www.gnu.org/licenses/>.  */
-
-#include <alloca.h>
-#include <stdint.h>
-#include <stdlib.h>
-#include <string.h>
-#include <unistd.h>
-#include <memcopy.h>
-#include <errno.h>
-#include <atomic.h>
-
-struct msort_param
-{
-  size_t s;
-  size_t var;
-  __compar_d_fn_t cmp;
-  void *arg;
-  char *t;
-};
-static void msort_with_tmp (const struct msort_param *p, void *b, size_t n);
-
-static void
-msort_with_tmp (const struct msort_param *p, void *b, size_t n)
-{
-  char *b1, *b2;
-  size_t n1, n2;
-
-  if (n <= 1)
-    return;
-
-  n1 = n / 2;
-  n2 = n - n1;
-  b1 = b;
-  b2 = (char *) b + (n1 * p->s);
-
-  msort_with_tmp (p, b1, n1);
-  msort_with_tmp (p, b2, n2);
-
-  char *tmp = p->t;
-  const size_t s = p->s;
-  __compar_d_fn_t cmp = p->cmp;
-  void *arg = p->arg;
-  switch (p->var)
-    {
-    case 0:
-      while (n1 > 0 && n2 > 0)
-	{
-	  if ((*cmp) (b1, b2, arg) <= 0)
-	    {
-	      *(uint32_t *) tmp = *(uint32_t *) b1;
-	      b1 += sizeof (uint32_t);
-	      --n1;
-	    }
-	  else
-	    {
-	      *(uint32_t *) tmp = *(uint32_t *) b2;
-	      b2 += sizeof (uint32_t);
-	      --n2;
-	    }
-	  tmp += sizeof (uint32_t);
-	}
-      break;
-    case 1:
-      while (n1 > 0 && n2 > 0)
-	{
-	  if ((*cmp) (b1, b2, arg) <= 0)
-	    {
-	      *(uint64_t *) tmp = *(uint64_t *) b1;
-	      b1 += sizeof (uint64_t);
-	      --n1;
-	    }
-	  else
-	    {
-	      *(uint64_t *) tmp = *(uint64_t *) b2;
-	      b2 += sizeof (uint64_t);
-	      --n2;
-	    }
-	  tmp += sizeof (uint64_t);
-	}
-      break;
-    case 2:
-      while (n1 > 0 && n2 > 0)
-	{
-	  unsigned long *tmpl = (unsigned long *) tmp;
-	  unsigned long *bl;
-
-	  tmp += s;
-	  if ((*cmp) (b1, b2, arg) <= 0)
-	    {
-	      bl = (unsigned long *) b1;
-	      b1 += s;
-	      --n1;
-	    }
-	  else
-	    {
-	      bl = (unsigned long *) b2;
-	      b2 += s;
-	      --n2;
-	    }
-	  while (tmpl < (unsigned long *) tmp)
-	    *tmpl++ = *bl++;
-	}
-      break;
-    case 3:
-      while (n1 > 0 && n2 > 0)
-	{
-	  if ((*cmp) (*(const void **) b1, *(const void **) b2, arg) <= 0)
-	    {
-	      *(void **) tmp = *(void **) b1;
-	      b1 += sizeof (void *);
-	      --n1;
-	    }
-	  else
-	    {
-	      *(void **) tmp = *(void **) b2;
-	      b2 += sizeof (void *);
-	      --n2;
-	    }
-	  tmp += sizeof (void *);
-	}
-      break;
-    default:
-      while (n1 > 0 && n2 > 0)
-	{
-	  if ((*cmp) (b1, b2, arg) <= 0)
-	    {
-	      tmp = (char *) __mempcpy (tmp, b1, s);
-	      b1 += s;
-	      --n1;
-	    }
-	  else
-	    {
-	      tmp = (char *) __mempcpy (tmp, b2, s);
-	      b2 += s;
-	      --n2;
-	    }
-	}
-      break;
-    }
-
-  if (n1 > 0)
-    memcpy (tmp, b1, n1 * s);
-  memcpy (b, p->t, (n - n2) * s);
-}
-
-
-void
-__qsort_r (void *b, size_t n, size_t s, __compar_d_fn_t cmp, void *arg)
-{
-  size_t size = n * s;
-  char *tmp = NULL;
-  struct msort_param p;
-
-  /* For large object sizes use indirect sorting.  */
-  if (s > 32)
-    size = 2 * n * sizeof (void *) + s;
-
-  if (size < 1024)
-    /* The temporary array is small, so put it on the stack.  */
-    p.t = __alloca (size);
-  else
-    {
-      /* We should avoid allocating too much memory since this might
-	 have to be backed up by swap space.  */
-      static long int phys_pages;
-      static int pagesize;
-
-      if (pagesize == 0)
-	{
-	  phys_pages = __sysconf (_SC_PHYS_PAGES);
-
-	  if (phys_pages == -1)
-	    /* Error while determining the memory size.  So let's
-	       assume there is enough memory.  Otherwise the
-	       implementer should provide a complete implementation of
-	       the `sysconf' function.  */
-	    phys_pages = (long int) (~0ul >> 1);
-
-	  /* The following determines that we will never use more than
-	     a quarter of the physical memory.  */
-	  phys_pages /= 4;
-
-	  /* Make sure phys_pages is written to memory.  */
-	  atomic_write_barrier ();
-
-	  pagesize = __sysconf (_SC_PAGESIZE);
-	}
-
-      /* Just a comment here.  We cannot compute
-	   phys_pages * pagesize
-	   and compare the needed amount of memory against this value.
-	   The problem is that some systems might have more physical
-	   memory then can be represented with a `size_t' value (when
-	   measured in bytes.  */
-
-      /* If the memory requirements are too high don't allocate memory.  */
-      if (size / pagesize > (size_t) phys_pages)
-	{
-	  _quicksort (b, n, s, cmp, arg);
-	  return;
-	}
-
-      /* It's somewhat large, so malloc it.  */
-      int save = errno;
-      tmp = malloc (size);
-      __set_errno (save);
-      if (tmp == NULL)
-	{
-	  /* Couldn't get space, so use the slower algorithm
-	     that doesn't need a temporary array.  */
-	  _quicksort (b, n, s, cmp, arg);
-	  return;
-	}
-      p.t = tmp;
-    }
-
-  p.s = s;
-  p.var = 4;
-  p.cmp = cmp;
-  p.arg = arg;
-
-  if (s > 32)
-    {
-      /* Indirect sorting.  */
-      char *ip = (char *) b;
-      void **tp = (void **) (p.t + n * sizeof (void *));
-      void **t = tp;
-      void *tmp_storage = (void *) (tp + n);
-
-      while ((void *) t < tmp_storage)
-	{
-	  *t++ = ip;
-	  ip += s;
-	}
-      p.s = sizeof (void *);
-      p.var = 3;
-      msort_with_tmp (&p, p.t + n * sizeof (void *), n);
-
-      /* tp[0] .. tp[n - 1] is now sorted, copy around entries of
-	 the original array.  Knuth vol. 3 (2nd ed.) exercise 5.2-10.  */
-      char *kp;
-      size_t i;
-      for (i = 0, ip = (char *) b; i < n; i++, ip += s)
-	if ((kp = tp[i]) != ip)
-	  {
-	    size_t j = i;
-	    char *jp = ip;
-	    memcpy (tmp_storage, ip, s);
-
-	    do
-	      {
-		size_t k = (kp - (char *) b) / s;
-		tp[j] = jp;
-		memcpy (jp, kp, s);
-		j = k;
-		jp = kp;
-		kp = tp[k];
-	      }
-	    while (kp != ip);
-
-	    tp[j] = jp;
-	    memcpy (jp, tmp_storage, s);
-	  }
-    }
-  else
-    {
-      if ((s & (sizeof (uint32_t) - 1)) == 0
-	  && ((char *) b - (char *) 0) % __alignof__ (uint32_t) == 0)
-	{
-	  if (s == sizeof (uint32_t))
-	    p.var = 0;
-	  else if (s == sizeof (uint64_t)
-		   && ((char *) b - (char *) 0) % __alignof__ (uint64_t) == 0)
-	    p.var = 1;
-	  else if ((s & (sizeof (unsigned long) - 1)) == 0
-		   && ((char *) b - (char *) 0)
-		      % __alignof__ (unsigned long) == 0)
-	    p.var = 2;
-	}
-      msort_with_tmp (&p, b, n);
-    }
-  free (tmp);
-}
-libc_hidden_def (__qsort_r)
-weak_alias (__qsort_r, qsort_r)
-
-
-void
-qsort (void *b, size_t n, size_t s, __compar_fn_t cmp)
-{
-  return __qsort_r (b, n, s, (__compar_d_fn_t) cmp, NULL);
-}
-libc_hidden_def (qsort)
diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 8368576aae..028210dc8f 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -20,7 +20,6 @@
    Engineering a sort function; Jon Bentley and M. Douglas McIlroy;
    Software - Practice and Experience; Vol. 23 (11), 1249-1265, 1993.  */
 
-#include <alloca.h>
 #include <limits.h>
 #include <stdlib.h>
 #include <string.h>
@@ -286,8 +285,8 @@ insertion_sort (void *const pbase, size_t total_elems, size_t size,
 }
 
 void
-_quicksort (void *const pbase, size_t total_elems, size_t size,
-	    __compar_d_fn_t cmp, void *arg)
+__qsort_r (void *const pbase, size_t total_elems, size_t size,
+	   __compar_d_fn_t cmp, void *arg)
 {
   char *base_ptr = (char *) pbase;
 
@@ -419,3 +418,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
      the array (*not* one beyond it!). */
   insertion_sort (pbase, total_elems, size, swap_func, cmp, arg);
 }
+
+libc_hidden_def (__qsort_r)
+weak_alias (__qsort_r, qsort_r)
+
+void
+qsort (void *b, size_t n, size_t s, __compar_fn_t cmp)
+{
+  return __qsort_r (b, n, s, (__compar_d_fn_t) cmp, NULL);
+}
+libc_hidden_def (qsort)
-- 
2.30.2



* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
                   ` (6 preceding siblings ...)
  2021-09-03 17:11 ` [PATCH v3 7/7] stdlib: Remove use of mergesort on qsort (BZ #21719) Adhemerval Zanella via Libc-alpha
@ 2021-09-03 19:18 ` Paul Eggert
  2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
  7 siblings, 1 reply; 41+ messages in thread
From: Paul Eggert @ 2021-09-03 19:18 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On 9/3/21 10:11 AM, Adhemerval Zanella wrote:

> The performance difference with the provided benchmark clearly shows the
> tradeoff of using an introsort over a mergesort.

Zhang, Meng and Liang <https://arxiv.org/pdf/1609.04471> report that 
introsort is particularly bad with reverse-sorted input; that case 
should be added to the benchmarks.

The proposed change would appear to hurt performance significantly in 
the common case of sorting pointers. My guess is that typical glibc 
users would prefer speed to hardening against rare signal-handling or 
low-memory cases, since portable code will have to assume that qsort 
isn't hardened anyway.

If hardening is a goal, perhaps we should consider an alternate API 
instead of changing qsort's implementation. If we do that, we can 
improve the API as well as changing the function's name. The new API 
could promise to not call malloc, or more generally to be 
async-signal-safe if the comparison function is. The comparison function 
could take a third void* argument so it can in effect be a closure (this 
is qsort's greatest failing), the sorting function could take a flag 
argument for options, and it could return an error indication (in case 
it detects a comparison function gone haywire, say). That sort of thing.
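
As a rough sketch of such an API (the name, flag argument, and error
convention here are hypothetical):

  /* Sort BASE, passing CLOSURE as the third argument to CMP.
     Return 0 on success, or a nonzero error code, e.g. when a
     misbehaving comparison function is detected.  */
  int qsort_x (void *base, size_t nmemb, size_t size,
               int (*cmp) (const void *, const void *, void *),
               void *closure, unsigned int flags);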

> The improvement for element sizes larger
> than 8 is mostly due to the optimization done on the swap operations.

A similar optimization could be done to the existing code independently, 
no? I expect that an apples-to-apples performance comparison would be 
even less favorable to the proposed change.


* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
  2021-09-06 18:30     ` Adhemerval Zanella via Libc-alpha
  2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Alexander Monakov via Libc-alpha @ 2021-09-04  9:09 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:

> So for GCC:
> 
>   - 80% of total qsort usage are done with 10 elements of less.

Please keep in mind that calls with a larger number of elements run
slower.  If you look at cycle count rather than call count, you'll find
that it is more evenly split: about 50% of the total time in qsort is
spent on 10 elements or fewer, the other 50% on 11 elements or more.
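
For reference, a minimal sketch of the kind of LD_PRELOAD interposer
that can split time (rather than just calls) by element count;
illustrative only, with clock_gettime standing in for a cycle counter:

  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static unsigned long long calls[2], nsecs[2];

  void
  qsort (void *base, size_t nmemb, size_t size,
         int (*cmp) (const void *, const void *))
  {
    static void (*real) (void *, size_t, size_t,
                         int (*) (const void *, const void *));
    if (real == NULL)
      real = dlsym (RTLD_NEXT, "qsort");

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    real (base, nmemb, size, cmp);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    /* Bucket 0: 10 elements or fewer; bucket 1: 11 or more.  */
    int large = nmemb > 10;
    calls[large]++;
    nsecs[large] += (t1.tv_sec - t0.tv_sec) * 1000000000ull
                    + (t1.tv_nsec - t0.tv_nsec);
  }

  static void __attribute__ ((destructor))
  report (void)
  {
    fprintf (stderr, "qsort <=10: %llu calls %llu ns; >10: %llu calls %llu ns\n",
             calls[0], nsecs[0], calls[1], nsecs[1]);
  }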

Alexander


* Re: [PATCH v3 6/7] stdlib: Implement introsort with qsort
  2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
  2021-09-06 18:43     ` Adhemerval Zanella via Libc-alpha
  2021-09-06 20:23   ` Fangrui Song via Libc-alpha
  2021-10-13  3:53   ` Noah Goldstein via Libc-alpha
  2 siblings, 1 reply; 41+ messages in thread
From: Alexander Monakov via Libc-alpha @ 2021-09-04  9:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:

> @@ -235,6 +305,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>    else
>      swap_func = SWAP_BYTES;
>  
> +  /* Maximum depth before quicksort switches to heapsort.  */
> +  size_t depth = 2 * (CHAR_BIT - 1 - __builtin_clzl (total_elems));
> +

This doesn't look right: the expression in parentheses is usually
negative, the conversion to size_t wraps around, and 'depth' becomes a
huge positive value.  The first term probably should have been
'CHAR_BIT * sizeof (size_t)'.

Does this mean that heapsort fallback was not exercised in testing?

Also, is it certain at that point that total_elems is nonzero? __builtin_clzl
is undefined for a zero argument.

Alexander


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-03 19:18 ` [PATCH v3 0/7] Use introsort for qsort Paul Eggert
@ 2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
  2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
  2021-09-07  0:14     ` Paul Eggert
  0 siblings, 2 replies; 41+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-06 14:13 UTC (permalink / raw)
  To: Paul Eggert, Adhemerval Zanella; +Cc: libc-alpha

On 9/3/21 3:18 PM, Paul Eggert wrote:
> On 9/3/21 10:11 AM, Adhemerval Zanella wrote:
> 
>> The performance difference with the provided benchmark clearly show
>> the tradeoff of using a instrosort over a mergesort.
> 
> Zhang, Meng and Liang <https://arxiv.org/pdf/1609.04471> report that
> introsort is particularly bad with reverse-sorted input; that case
> should be added to the benchmarks.

It should, and it would be useful.  However, I think that Adhemerval's
changes to make qsort AS-safe are actually relevant and important.

All existing ruby implementations I can see use qsort in an
AS-safe-requiring context (process.c) and this would help those scenarios.

In general I think glibc should provide a qsort that is:

- Safe.

- Does not have quadratic performance behaviour.

If we want better performance we need to:

- Add new APIs mirroring what BSD is doing.

- New APIs explicitly call out the *type* of sort algorithm.

- Encourage developers and application authors to contribute those
  new implementations and use them in their applications.

There is never going to be a qsort that meets all of the workload demands
and so we should aim for a good implementation that is as safe as we
can make it (limit memory usage, avoid malloc, avoid quadratic performance
behaviour).

> The proposed change would appear to hurt performance significantly in
> the common case of sorting pointers. My guess is that typical glibc
> users would prefer speed to hardening against rare signal-handling or
> low-memory cases, since portable code will have to assume that qsort
> isn't hardened anyway.

I disagree. I think users do not expect qsort to be high performance;
they expect it to have OK performance, OK memory usage, and, lastly, to
be safe.

My experience with core python developers has been that they expect
glibc to provide functions that are safe in as many contexts as possible
in order to have them avoid rolling their own for the implementation
of the language runtime.

Application developers may not care about this AS-safety, but they also
have the flexibility to bring in additional dependencies. We can satisfy
those requirements by adding more algorithms under new symbols.

> If hardening is a goal, perhaps we should consider an alternate API
> instead of changing qsort's implementation. If we do that, we can
> improve the API as well as changing the function's name. The new API
> could promise to not call malloc, or more generally to be
> async-signal-safe if the comparison function is. The comparison
> function could take a third void* argument so it can in effect be a
> closure (this is qsort's greatest failing), the sorting function
> could take a flag argument for options, and it could return an error
> indication (in case it detects a comparison function gone haywire,
> say). That sort of thing.

I'd argue the opposite. Make the current functions safe. Add new high
performance versions with explicit algorithms and explicit tradeoffs.

>> The improvements over sizes larger than 8 is mostly due the
>> optimization done on swap operations.
> 
> A similar optimization could be done to the existing code
> independently, no? I expect that an apples-to-apples performance
> comparison would be even less favorable to the proposed change.

Size is nice, but safety is nicer IMO.

-- 
Cheers,
Carlos.



* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
@ 2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
  2021-09-06 18:19       ` Adhemerval Zanella via Libc-alpha
  2021-09-07  0:14     ` Paul Eggert
  1 sibling, 1 reply; 41+ messages in thread
From: Zack Weinberg via Libc-alpha @ 2021-09-06 17:03 UTC (permalink / raw)
  To: libc-alpha

On Mon, Sep 6, 2021, at 10:13 AM, Carlos O'Donell via Libc-alpha wrote:
> 
> In general I think glibc should provide a qsort that is:
> 
> - Safe.
> 
> - Does not have quadratic performance behaviour.
> 
> If we want better performance we need to:
> 
> - Add new APIs mirroring what BSD is doing.
> 
> - New APIs explicitly call out the *type* of sort algorithm.
> 
> - Encourage developers and application authors to contribute those
>   new implementations and use them in their applications.

Do you have a link or other reference for these new APIs being developed by the BSDs?

(asking out of curiosity; I haven't been following this discussion closely enough to have an opinion)

zw


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
@ 2021-09-06 18:19       ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-06 18:19 UTC (permalink / raw)
  To: libc-alpha, Zack Weinberg



On 06/09/2021 14:03, Zack Weinberg via Libc-alpha wrote:
> On Mon, Sep 6, 2021, at 10:13 AM, Carlos O'Donell via Libc-alpha wrote:
>>
>> In general I think glibc should provide a qsort that is:
>>
>> - Safe.
>>
>> - Does not have quadratic performance behaviour.
>>
>> If we want better performance we need to:
>>
>> - Add new APIs mirroring what BSD is doing.
>>
>> - New APIs explicitly call out the *type* of sort algorithm.
>>
>> - Encourage developers and application authors to contribute those
>>   new implementations and use them in their applications.
> 
> Do you have a link or other reference for these new APIs being developed by the BSDs?
> 
> (asking out of curiosity; I haven't been following this discussion closely enough to have an opinion)
> 

FreeBSD provides qsort (which implements a quicksort), along with
heapsort and mergesort [1].

[1] https://www.freebsd.org/cgi/man.cgi?query=qsort&sektion=3&manpath=freebsd-release
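
For reference, the (non-blocks) prototypes documented there look
roughly like this; unlike qsort, heapsort and mergesort report failure
through their return value (0 on success, -1 with errno set, e.g.
ENOMEM for mergesort's temporary buffer):

  void qsort (void *base, size_t nmemb, size_t size,
              int (*compar) (const void *, const void *));
  int heapsort (void *base, size_t nmemb, size_t size,
                int (*compar) (const void *, const void *));
  int mergesort (void *base, size_t nmemb, size_t size,
                 int (*compar) (const void *, const void *));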


* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
@ 2021-09-06 18:30     ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-06 18:30 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: libc-alpha



On 04/09/2021 06:09, Alexander Monakov wrote:
> On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:
> 
>> So for GCC:
>>
>>   - 80% of total qsort usage are done with 10 elements of less.
> 
> Please keep in mind that calls with larger number of elements run
> slower. If you look at cycle count rather than call count, you'll find
> that it is more evenly split: about 50% of total time in qsort is with
> 10 elements or fewer, the other 50% are with 11 elements or more.

Thanks for the info; indeed, I only took the call count into
consideration instead of the cycles spent.


* Re: [PATCH v3 6/7] stdlib: Implement introsort with qsort
  2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
@ 2021-09-06 18:43     ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-06 18:43 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: libc-alpha



On 04/09/2021 06:17, Alexander Monakov wrote:
> On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:
> 
>> @@ -235,6 +305,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>>    else
>>      swap_func = SWAP_BYTES;
>>  
>> +  /* Maximum depth before quicksort switches to heapsort.  */
>> +  size_t depth = 2 * (CHAR_BIT - 1 - __builtin_clzl (total_elems));
>> +
> 
> This doesn't look right, the expression in parenthesis is usually negative,
> conversion to size_t wraps around and you make 'depth' a huge positive value.
> The first term probably should have been 'CHAR_BIT * sizeof(size_t)'.

Indeed it is not; it should be:

  size_t depth = 2 * (sizeof (size_t) * CHAR_BIT - 1
                      - __builtin_clzl (total_elems));

(I think I did something wrong during a rebase, since I had the above
originally.)
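
As a quick sanity check of the corrected expression (an illustrative
standalone snippet, not part of the patch, assuming an LP64 target
where long and size_t are both 64 bits):

  #include <assert.h>
  #include <limits.h>
  #include <stddef.h>

  int
  main (void)
  {
    size_t total_elems = 8388608;       /* 2^23 elements.  */
    size_t depth = 2 * (sizeof (size_t) * CHAR_BIT - 1
                        - __builtin_clzl (total_elems));
    /* floor(log2(2^23)) == 23, so the depth limit is 2 * 23 == 46.  */
    assert (depth == 46);
    return 0;
  }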

> 
> Does this mean that heapsort fallback was not exercised in testing?

The current tests do not trigger the quadratic behavior of quicksort;
that's why I have added tst-qsort3.  heapsort_r is called for both the
'SORTED' and 'BITONIC' array distributions (which should be expected,
since they tend to be pattern-defeating for quicksort implementations).

> 
> Also, is it certain at that point that total_elems is nonzero? __builtin_clzl
> is undefined for a zero argument.

Yes, the patch also takes care of that:

@@ -222,7 +292,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,

   const size_t max_thresh = MAX_THRESH * size;

-  if (total_elems == 0)
+  if (total_elems <= 1)
     /* Avoid lossage with unsigned arithmetic below.  */
     return;


* Re: [PATCH v3 6/7] stdlib: Implement introsort with qsort
  2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
  2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
@ 2021-09-06 20:23   ` Fangrui Song via Libc-alpha
  2021-10-13  3:53   ` Noah Goldstein via Libc-alpha
  2 siblings, 0 replies; 41+ messages in thread
From: Fangrui Song via Libc-alpha @ 2021-09-06 20:23 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha


On 2021-09-03, Adhemerval Zanella via Libc-alpha wrote:
>This patch adds a introsort implementation on qsort to avoid worse-case
>performance of quicksort to O(nlog n).  The heapsort fallback used is a
>heapsort based on Linux implementation (commit 22a241ccb2c19962a).  As a
>side note the introsort implementation is similar the one used on
>libstdc++ for std::sort.
>
>Checked on x86_64-linux-gnu.
>---
> stdlib/qsort.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 87 insertions(+), 7 deletions(-)
>
>diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>index 5df640362d..8368576aae 100644
>--- a/stdlib/qsort.c
>+++ b/stdlib/qsort.c
>@@ -113,6 +113,7 @@ typedef struct
>   {
>     char *lo;
>     char *hi;
>+    size_t depth;
>   } stack_node;
>
> /* The stack needs log (total_elements) entries (we could even subtract
>@@ -122,23 +123,92 @@ typedef struct
> enum { STACK_SIZE = CHAR_BIT * sizeof (size_t) };
>
> static inline stack_node *
>-push (stack_node *top, char *lo, char *hi)
>+push (stack_node *top, char *lo, char *hi, size_t depth)
> {
>   top->lo = lo;
>   top->hi = hi;
>+  top->depth = depth;
>   return ++top;
> }
>
> static inline stack_node *
>-pop (stack_node *top, char **lo, char **hi)
>+pop (stack_node *top, char **lo, char **hi, size_t *depth)
> {
>   --top;
>   *lo = top->lo;
>   *hi = top->hi;
>+  *depth = top->depth;
>   return top;
> }
>
>
>+/* A fast, small, non-recursive O(nlog n) heapsort, adapted from Linux
>+   lib/sort.c.  Used on introsort implementation as a fallback routine with
>+   worst-case performance of O(nlog n) and worst-case space complexity of
>+   O(1).  */
>+
>+static inline size_t
>+parent (size_t i, unsigned int lsbit, size_t size)
>+{
>+  i -= size;
>+  i -= size & -(i & lsbit);

i -= size & -(i & lsbit); may need a comment that it is logically 

if (i & lsbit) i -= size;

to make i/2 a multiple of size.
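
(A brute-force check of that equivalence, illustrative only and not
part of the patch: since lsbit is the lowest set bit of size,
size & -lsbit == size, so the mask selects either size or 0.)

  #include <assert.h>

  int
  main (void)
  {
    unsigned int size = 12;             /* lsbit of 12 is 4.  */
    unsigned int lsbit = size & -size;
    for (unsigned int i = 0; i < 4096; i++)
      {
        unsigned int branchless = i - (size & -(i & lsbit));
        unsigned int branchy = (i & lsbit) ? i - size : i;
        assert (branchless == branchy);
      }
    return 0;
  }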


>+  return i / 2;
>+}
>+
>+static void
>+heapsort_r (void *base, void *end, size_t size, swap_func_t swap_func,
>+	    __compar_d_fn_t cmp, void *arg)
>+{
>+  size_t num = ((uintptr_t) end - (uintptr_t) base) / size;
>+  size_t n = num * size, a = (num/2) * size;
>+  /* Used to find parent  */
>+  const unsigned int lsbit = size & -size;
>+
>+  /* num < 2 || size == 0.  */
>+  if (a == 0)
>+    return;
>+
>+  for (;;)
>+    {
>+      size_t b, c, d;
>+
>+      if (a != 0)
>+	/* Building heap: sift down --a */
>+	a -= size;
>+      else if (n -= size)
>+	/* Sorting: Extract root to --n */
>+	do_swap (base, base + n, size, swap_func);
>+      else
>+	break;
>+
>+      /* Sift element at "a" down into heap.  This is the "bottom-up" variant,
>+	 which significantly reduces calls to cmp_func(): we find the sift-down
>+	 path all the way to the leaves (one compare per level), then backtrack
>+	 to find where to insert the target element.
>+
>+	 Because elements tend to sift down close to the leaves, this uses fewer
>+	 compares than doing two per level on the way down.  (A bit more than
>+	 half as many on average, 3/4 worst-case.).  */
>+      for (b = a; c = 2 * b + size, (d = c + size) < n;)
>+	b = cmp (base + c, base + d, arg) >= 0 ? c : d;
>+      if (d == n)
>+	/* Special case last leaf with no sibling.  */
>+	b = c;
>+
>+      /* Now backtrack from "b" to the correct location for "a".  */
>+      while (b != a && cmp (base + a, base + b, arg) >= 0)
>+	b = parent (b, lsbit, size);
>+      /* Where "a" belongs.  */
>+      c = b;
>+      while (b != a)
>+	{
>+	  /* Shift it into place.  */
>+	  b = parent (b, lsbit, size);
>+          do_swap (base + b, base + c, size, swap_func);
>+        }

A tab should be used here.

>+    }
>+}
>+
> /* Order size using quicksort.  This implementation incorporates
>    four optimizations discussed in Sedgewick:
>
>@@ -223,7 +293,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>
>   const size_t max_thresh = MAX_THRESH * size;
>
>-  if (total_elems == 0)
>+  if (total_elems <= 1)
>     /* Avoid lossage with unsigned arithmetic below.  */
>     return;
>
>@@ -235,6 +305,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>   else
>     swap_func = SWAP_BYTES;
>
>+  /* Maximum depth before quicksort switches to heapsort.  */
>+  size_t depth = 2 * (CHAR_BIT - 1 - __builtin_clzl (total_elems));

With this fixed, the logic LGTM.

>   if (total_elems > MAX_THRESH)
>     {
>       char *lo = base_ptr;
>@@ -242,10 +315,17 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>       stack_node stack[STACK_SIZE];
>       stack_node *top = stack;
>
>-      top = push (top, NULL, NULL);
>+      top = push (top, NULL, NULL, depth);
>
>       while (stack < top)
>         {
>+	  if (depth == 0)
>+	    {
>+	      heapsort_r (lo, hi, size, swap_func, cmp, arg);
>+              top = pop (top, &lo, &hi, &depth);
>+	      continue;
>+	    }
>+
>           char *left_ptr;
>           char *right_ptr;
>
>@@ -309,7 +389,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>             {
>               if ((size_t) (hi - left_ptr) <= max_thresh)
> 		/* Ignore both small partitions. */
>-		top = pop (top, &lo, &hi);
>+		top = pop (top, &lo, &hi, &depth);
>               else
> 		/* Ignore small left partition. */
>                 lo = left_ptr;
>@@ -320,13 +400,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>           else if ((right_ptr - lo) > (hi - left_ptr))
>             {
> 	      /* Push larger left partition indices. */
>-	      top = push (top, lo, right_ptr);
>+	      top = push (top, lo, right_ptr, depth - 1);
>               lo = left_ptr;
>             }
>           else
>             {
> 	      /* Push larger right partition indices. */
>-	      top = push (top, left_ptr, hi);
>+	      top = push (top, left_ptr, hi, depth - 1);
>               hi = right_ptr;
>             }
>         }
>-- 
>2.30.2
>


* Re: [PATCH v3 4/7] stdlib: Move insertion sort out qsort
  2021-09-03 17:11 ` [PATCH v3 4/7] stdlib: Move insertion sort out qsort Adhemerval Zanella via Libc-alpha
@ 2021-09-06 20:35   ` Fangrui Song via Libc-alpha
  2021-09-06 20:48     ` Fangrui Song via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Fangrui Song via Libc-alpha @ 2021-09-06 20:35 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On 2021-09-03, Adhemerval Zanella via Libc-alpha wrote:
>---
> stdlib/qsort.c | 100 ++++++++++++++++++++++++++-----------------------
> 1 file changed, 53 insertions(+), 47 deletions(-)
>
>diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>index 59458d151b..b69417dedd 100644
>--- a/stdlib/qsort.c
>+++ b/stdlib/qsort.c
>@@ -150,6 +150,58 @@ typedef struct
>       smaller partition.  This *guarantees* no more than log (total_elems)
>       stack size is needed (actually O(1) in this case)!  */
>
>+static void
>+insertion_sort (void *const pbase, size_t total_elems, size_t size,
>+                swap_func_t swap_func,
>+	        __compar_d_fn_t cmp, void *arg)
>+{
>+  char *base_ptr = (char *) pbase;
>+  char *const end_ptr = &base_ptr[size * (total_elems - 1)];
>+  char *tmp_ptr = base_ptr;
>+#define min(x, y) ((x) < (y) ? (x) : (y))
>+  const size_t max_thresh = MAX_THRESH * size;

But I think MAX_THRESH being 4 is unfortunate.
All modern architectures want a value larger than 4 :)

Reviewed-by: Fangrui Song <maskray@google.com>

>+  char *thresh = min(end_ptr, base_ptr + max_thresh);
>+  char *run_ptr;
>+
>+  /* Find smallest element in first threshold and place it at the
>+     array's beginning.  This is the smallest array element,
>+     and the operation speeds up insertion sort's inner loop. */
>+
>+  for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
>+    if (cmp (run_ptr, tmp_ptr, arg) < 0)
>+      tmp_ptr = run_ptr;
>+
>+  if (tmp_ptr != base_ptr)
>+    do_swap (tmp_ptr, base_ptr, size, swap_func);
>+
>+  /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>+
>+  run_ptr = base_ptr + size;
>+  while ((run_ptr += size) <= end_ptr)
>+    {
>+      tmp_ptr = run_ptr - size;
>+      while (cmp (run_ptr, tmp_ptr, arg) < 0)
>+        tmp_ptr -= size;
>+
>+      tmp_ptr += size;
>+      if (tmp_ptr != run_ptr)
>+        {
>+          char *trav;
>+
>+          trav = run_ptr + size;
>+          while (--trav >= run_ptr)
>+            {
>+              char c = *trav;
>+              char *hi, *lo;
>+
>+              for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
>+                *hi = *lo;
>+              *hi = c;
>+            }

The bytewise move is a bit unfortunate and may slow down the insertion sort
quite a bit... But without allocation or code duplication I don't know a
better approach...

>+        }
>+    }
>+}
>+
> void
> _quicksort (void *const pbase, size_t total_elems, size_t size,
> 	    __compar_d_fn_t cmp, void *arg)
>@@ -272,51 +324,5 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>      for partitions below MAX_THRESH size. BASE_PTR points to the beginning
>      of the array to sort, and END_PTR points at the very last element in
>      the array (*not* one beyond it!). */
>-
>-#define min(x, y) ((x) < (y) ? (x) : (y))
>-
>-  {
>-    char *const end_ptr = &base_ptr[size * (total_elems - 1)];
>-    char *tmp_ptr = base_ptr;
>-    char *thresh = min(end_ptr, base_ptr + max_thresh);
>-    char *run_ptr;
>-
>-    /* Find smallest element in first threshold and place it at the
>-       array's beginning.  This is the smallest array element,
>-       and the operation speeds up insertion sort's inner loop. */
>-
>-    for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
>-      if ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
>-        tmp_ptr = run_ptr;
>-
>-    if (tmp_ptr != base_ptr)
>-      do_swap (tmp_ptr, base_ptr, size, swap_func);
>-
>-    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>-
>-    run_ptr = base_ptr + size;
>-    while ((run_ptr += size) <= end_ptr)
>-      {
>-	tmp_ptr = run_ptr - size;
>-	while ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
>-	  tmp_ptr -= size;
>-
>-	tmp_ptr += size;
>-        if (tmp_ptr != run_ptr)
>-          {
>-            char *trav;
>-
>-	    trav = run_ptr + size;
>-	    while (--trav >= run_ptr)
>-              {
>-                char c = *trav;
>-                char *hi, *lo;
>-
>-                for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
>-                  *hi = *lo;
>-                *hi = c;
>-              }
>-          }
>-      }
>-  }
>+  insertion_sort (pbase, total_elems, size, swap_func, cmp, arg);
> }
>-- 
>2.30.2
>


* Re: [PATCH v3 4/7] stdlib: Move insertion sort out qsort
  2021-09-06 20:35   ` Fangrui Song via Libc-alpha
@ 2021-09-06 20:48     ` Fangrui Song via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Fangrui Song via Libc-alpha @ 2021-09-06 20:48 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha


On 2021-09-06, Fangrui Song wrote:
>On 2021-09-03, Adhemerval Zanella via Libc-alpha wrote:
>>---
>>stdlib/qsort.c | 100 ++++++++++++++++++++++++++-----------------------
>>1 file changed, 53 insertions(+), 47 deletions(-)
>>
>>diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>index 59458d151b..b69417dedd 100644
>>--- a/stdlib/qsort.c
>>+++ b/stdlib/qsort.c
>>@@ -150,6 +150,58 @@ typedef struct
>>      smaller partition.  This *guarantees* no more than log (total_elems)
>>      stack size is needed (actually O(1) in this case)!  */
>>
>>+static void
>>+insertion_sort (void *const pbase, size_t total_elems, size_t size,
>>+                swap_func_t swap_func,
>>+	        __compar_d_fn_t cmp, void *arg)
>>+{
>>+  char *base_ptr = (char *) pbase;
>>+  char *const end_ptr = &base_ptr[size * (total_elems - 1)];
>>+  char *tmp_ptr = base_ptr;
>>+#define min(x, y) ((x) < (y) ? (x) : (y))
>>+  const size_t max_thresh = MAX_THRESH * size;
>
>But I think MAX_THRESH being 4 is unfortunate.
>All modern architectures want a value larger than 4 :)
>
>Reviewed-by: Fangrui Song <maskray@google.com>
>
>>+  char *thresh = min(end_ptr, base_ptr + max_thresh);
>>+  char *run_ptr;
>>+
>>+  /* Find smallest element in first threshold and place it at the
>>+     array's beginning.  This is the smallest array element,
>>+     and the operation speeds up insertion sort's inner loop. */
>>+
>>+  for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
>>+    if (cmp (run_ptr, tmp_ptr, arg) < 0)
>>+      tmp_ptr = run_ptr;
>>+
>>+  if (tmp_ptr != base_ptr)
>>+    do_swap (tmp_ptr, base_ptr, size, swap_func);
>>+
>>+  /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>>+
>>+  run_ptr = base_ptr + size;
>>+  while ((run_ptr += size) <= end_ptr)
>>+    {
>>+      tmp_ptr = run_ptr - size;
>>+      while (cmp (run_ptr, tmp_ptr, arg) < 0)
>>+        tmp_ptr -= size;
>>+
>>+      tmp_ptr += size;
>>+      if (tmp_ptr != run_ptr)
>>+        {
>>+          char *trav;
>>+
>>+          trav = run_ptr + size;
>>+          while (--trav >= run_ptr)
>>+            {
>>+              char c = *trav;
>>+              char *hi, *lo;
>>+
>>+              for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
>>+                *hi = *lo;
>>+              *hi = c;
>>+            }
>
>The bytewise move is a bit unfortunate and may slow down the insertion sort
>quite a bit... But without allocation or code duplication I don't know a
>better approach...

If we want to optimize insertion sort for the common case, perhaps we
could also optimize the cases where the element size is <=
SWAP_GENERIC_SIZE.

Use an unsigned char tmp[SWAP_GENERIC_SIZE]; as you do in another patch.

There will be a bit of code bloat, though...
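
A rough sketch of that idea, reusing the tmp_ptr/run_ptr names from the
patch (illustrative only, not tested): when the element fits in the
buffer, the displaced range can be shifted with one memmove instead of
rotating byte by byte.

  if (size <= SWAP_GENERIC_SIZE)
    {
      unsigned char tmp[SWAP_GENERIC_SIZE];
      /* Save the out-of-place element, shift the sorted run right by
         one slot, and drop the element into its position.  */
      memcpy (tmp, run_ptr, size);
      memmove (tmp_ptr + size, tmp_ptr, run_ptr - tmp_ptr);
      memcpy (tmp_ptr, tmp, size);
    }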

>
>>+        }
>>+    }
>>+}
>>+
>>void
>>_quicksort (void *const pbase, size_t total_elems, size_t size,
>>	    __compar_d_fn_t cmp, void *arg)
>>@@ -272,51 +324,5 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>>     for partitions below MAX_THRESH size. BASE_PTR points to the beginning
>>     of the array to sort, and END_PTR points at the very last element in
>>     the array (*not* one beyond it!). */
>>-
>>-#define min(x, y) ((x) < (y) ? (x) : (y))
>>-
>>-  {
>>-    char *const end_ptr = &base_ptr[size * (total_elems - 1)];
>>-    char *tmp_ptr = base_ptr;
>>-    char *thresh = min(end_ptr, base_ptr + max_thresh);
>>-    char *run_ptr;
>>-
>>-    /* Find smallest element in first threshold and place it at the
>>-       array's beginning.  This is the smallest array element,
>>-       and the operation speeds up insertion sort's inner loop. */
>>-
>>-    for (run_ptr = tmp_ptr + size; run_ptr <= thresh; run_ptr += size)
>>-      if ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
>>-        tmp_ptr = run_ptr;
>>-
>>-    if (tmp_ptr != base_ptr)
>>-      do_swap (tmp_ptr, base_ptr, size, swap_func);
>>-
>>-    /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>>-
>>-    run_ptr = base_ptr + size;
>>-    while ((run_ptr += size) <= end_ptr)
>>-      {
>>-	tmp_ptr = run_ptr - size;
>>-	while ((*cmp) ((void *) run_ptr, (void *) tmp_ptr, arg) < 0)
>>-	  tmp_ptr -= size;
>>-
>>-	tmp_ptr += size;
>>-        if (tmp_ptr != run_ptr)
>>-          {
>>-            char *trav;
>>-
>>-	    trav = run_ptr + size;
>>-	    while (--trav >= run_ptr)
>>-              {
>>-                char c = *trav;
>>-                char *hi, *lo;
>>-
>>-                for (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)
>>-                  *hi = *lo;
>>-                *hi = c;
>>-              }
>>-          }
>>-      }
>>-  }
>>+  insertion_sort (pbase, total_elems, size, swap_func, cmp, arg);
>>}
>>-- 
>>2.30.2
>>


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
  2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
@ 2021-09-07  0:14     ` Paul Eggert
  2021-09-07 14:32       ` Adhemerval Zanella via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Paul Eggert @ 2021-09-07  0:14 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha

On 9/6/21 7:13 AM, Carlos O'Donell wrote:

> All existing ruby implementations I can see use qsort in an
> AS-safe-requiring context (process.c) and this would help those scenarios.

I don't see how those scenarios are practical. Assuming glibc and the 
standard Ruby implementation, any async signal problems could occur only 
if Ruby's spawn method has more than 32 redirects, as this is the 
minimum number that would cause qsort to call malloc on practical 
platforms. In practice Ruby spawns don't have that many redirects, so 
Ruby programs won't run into a problem unless they're contrived to 
illustrate this portability bug in the Ruby implementation.

> If we want better performance we need to:
> 
> - Add new APIs mirroring what BSD is doing.

BSD's APIs are not good, partly because they rely on nonportable C 
extensions (blocks) that I doubt whether glibc can insist on, partly for 
other reasons discussed below. An API like what I suggested in my 
previous email would be better, if we want to add new APIs.

> - New APIs explicitly call out the *type* of sort algorithm.

"Type" should not mean "heapsort" vs "quicksort" vs "mergesort" as in 
BSD. Users need performance, async-signal-safety, and stability; the BSD 
"type" approach is merely a means to that end and any new API should 
focus on their needs rather than on "type".

> we should aim for a good implementation that is as safe as we
> can make it (limit memory usage, avoid malloc, avoid quadratic performance
> behaviour).

If such an approach doesn't significantly hurt typical performance, I'm 
all for it. I'm skeptical that it's doable, though.

Let's not underestimate ordinary users' desires for performance here. 
qsort is used in other core apps (e.g., Emacs) without running afoul of 
async-signal-safety bugs.

Also, this is not merely about core apps like Emacs and Ruby; it's also 
about user one-off apps.  I often see Glibc qsort in benchmarks, and if 
the benchmarks results become significantly worse we'll be more likely 
to scare users away from Glibc and toward other platforms. Unless we can 
fix the significant performance issues in the proposed patch, the 
benefit of working around an only-theoretical bug in Ruby's 
implementation doesn't seem worth this cost.


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-07  0:14     ` Paul Eggert
@ 2021-09-07 14:32       ` Adhemerval Zanella via Libc-alpha
  2021-09-07 17:39         ` Paul Eggert
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-07 14:32 UTC (permalink / raw)
  To: Paul Eggert, Carlos O'Donell; +Cc: libc-alpha



On 06/09/2021 21:14, Paul Eggert wrote:
> On 9/6/21 7:13 AM, Carlos O'Donell wrote:
> 
>> All existing ruby implementations I can see use qsort in an
>> AS-safe-requiring context (process.c) and this would help those scenarios.
> 
> I don't see how those scenarios are practical. Assuming glibc and the standard Ruby implementation, any async signal problems could occur only if Ruby's spawn method has more than 32 redirects, as this is the minimum number that would cause qsort to call malloc on practical platforms. In practice Ruby spawns don't have that many redirects, so Ruby programs won't run into a problem unless they're contrived to illustrate this portability bug in the Ruby implementation.

I consider this a QoI move toward providing memory-bounded and async-signal-safe
interfaces, even though the interface contract does not really specify such
a constraint.  I grant that it might lead to non-portable code, as the
ruby case shows, but I prefer to actually make the interface async-signal-safe
instead of either breaking the callers' expectations (which only hold due
to an optimization) or changing the implementation and letting the users
handle any potential breakage.

Eventually, when some breakage occurs, what might happen is that users like
ruby roll out their own qsort() implementation to fulfill their needs.

We already make some concessions about symbols that are not supposed to work
in some scenarios (such as malloc() after fork()).

> 
>> If we want better performance we need to:
>>
>> - Add new APIs mirroring what BSD is doing.
> 
> BSD's APIs are not good, partly because they rely on nonportable C extensions (blocks) that I doubt whether glibc can insist on, partly for other reasons discussed below. An API like what I suggested in my previous email would be better, if we want to add new APIs.

The blocks usage is only for the *_b symbols; the BSDs do provide standard
C compliant mergesort and heapsort.  Also, FreeBSD actually specifies the
worst-case space complexity, and its quicksort has a constant-size space
requirement (although it uses recursion).

> 
>> - New APIs explicitly call out the *type* of sort algorithm.
> 
> "Type" should not mean "heapsort" vs "quicksort" vs "mergesort" as in BSD. Users need performance, async-signal-safety, and stability; the BSD "type" approach is merely a means to that end and any new API should focus on their needs rather than on "type".
> 
>> we should aim for a good implementation that is as safe as we
>> can make it (limit memory usage, avoid malloc, avoid quadratic performance
>> behaviour).
> 
> If such an approach doesn't significantly hurt typical performance, I'm all for it. I'm skeptical that it's doable, though.

As I said, it is always a tradeoff.  Making qsort async-signal-safe means
that we need to either use a different algorithm with constant worst-case
space, use a hybrid approach, or use a different mechanism to allocate
memory.

> 
> Let's not underestimate ordinary users' desires for performance here. qsort is used in other core apps (e.g., Emacs) without running afoul of async-signal-safety bugs.
> 
> Also, this is not merely about core apps like Emacs and Ruby; it's also about user one-off apps.  I often see Glibc qsort in benchmarks, and if the benchmarks results become significantly worse we'll be more likely to scare users away from Glibc and toward other platforms. Unless we can fix the significant performance issues in the proposed patch, the benefit of working around an only-theoretical bug in Ruby's implementation doesn't seem worth this cost.

I really don't think providing an async-signal-safe qsort with constant
worst-case space, and being *clear* about it, will 'scare away' users,
especially because we are making the interface behave in a more constrained
way, which imho is a good way forward.  Users will roll out their own
implementations due to different needs and tradeoffs, as gcc and python do.

What I think really upsets users is finding out that if they use a different
element size or a large number of elements, qsort starts to call malloc, or
starts to open system files, or, when memory is low, falls back to a
completely different implementation (which might have quadratic performance
in some cases).

One possibility is to keep the mergesort, using the static array when it
fits, and fall back to another sort (possibly heapsort, or at least the
introsort I am suggesting).  I still prefer to keep it simple and with
predictable performance.


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-07 14:32       ` Adhemerval Zanella via Libc-alpha
@ 2021-09-07 17:39         ` Paul Eggert
  2021-09-07 18:07           ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Paul Eggert @ 2021-09-07 17:39 UTC (permalink / raw)
  To: Adhemerval Zanella, Carlos O'Donell; +Cc: libc-alpha

On 9/7/21 7:32 AM, Adhemerval Zanella wrote:

> I consider this a QoI move toward providing memory-bounded and async-signal-safe
> interfaces

This is of course fine for new APIs. However, qsort users are accustomed 
to the longstanding behavior, and good performance is part of their 
expectations.


> Eventually, when some breakage occurs, what might happen is that users like
> ruby roll out their own qsort() implementation to fulfill their needs.

Ruby already has its own qsort implementation. Ruby's portability bug 
(which occurs only in theory) is that at a delicate point during a fork 
Ruby uses the system qsort rather than its own qsort.

For what it's worth I submitted a patch for this bug 
<https://bugs.ruby-lang.org/issues/18152>. Regardless of whether that 
patch is accepted, this Ruby example is not a good argument for changing 
glibc qsort behavior, since the bug is not triggered in practical programs.


> We already make some concessions about symbols that are not supposed to work
> in some scenarios (such as malloc() after fork()).

Yes, it often makes sense to go beyond what the standard requires.
However, each case is different and needs to be decided on its own
merits.


> The blocks usage is only for the *_b symbols; the BSDs do provide standard
> C compliant mergesort and heapsort.

The blocks usage in macOS is essential for doing closures (the 
equivalent of qsort_r). Since we don't want to require blocks, we need a 
better API than what macOS supplies for non-qsort.

Would you like to work together to come up with a new API that addresses 
both of our concerns?


> I really don't think providing an async-signal-safe qsort with constant
> worst-case space, and being *clear* about it, will 'scare away' users,
> especially because we are making the interface behave in a more constrained
> way, which imho is a good way forward.  Users will roll out their own
> implementations due to different needs and tradeoffs, as gcc and python do.

If we were creating the qsort implementation for the first time, the 
async-signal-safety arguments could be compelling. However, it is an 
entirely different thing to ask a significant number of users to change 
longstanding uses; this would be more work for them simply because the 
libc maintainers changed their mind about a particular 
performance-safety tradeoff.


> What I think really upsets users is finding out that if they use a different
> element size or a large number of elements, qsort starts to call malloc

Users can be upset by many things. They can be upset if qsort calls 
malloc. They can be upset if qsort is slow. They can be especially upset 
if we make changes to qsort that force them to rewrite their code. No 
matter what we do, users can be upset. It's our job to manage these 
tradeoffs, and to do so in a stable and responsible way.


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-07 17:39         ` Paul Eggert
@ 2021-09-07 18:07           ` Adhemerval Zanella via Libc-alpha
  2021-09-07 19:28             ` Paul Eggert
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-07 18:07 UTC (permalink / raw)
  To: Paul Eggert, Carlos O'Donell; +Cc: libc-alpha



On 07/09/2021 14:39, Paul Eggert wrote:
> On 9/7/21 7:32 AM, Adhemerval Zanella wrote:
> 
>> I consider this a QoI move toward providing memory-bounded and async-signal-safe
>> interfaces
> 
> This is of course fine for new APIs. However, qsort users are accustomed to the longstanding behavior, and good performance is part of their expectations.
> 
> 
>> Eventually, when some breakage occurs, what might happen is that users like
>> ruby roll out their own qsort() implementation to fulfill their needs.
> 
> Ruby already has its own qsort implementation. Ruby's portability bug (which occurs only in theory) is that at a delicate point during a fork Ruby uses the system qsort rather than its own qsort.
> 
> For what it's worth I submitted a patch for this bug <https://bugs.ruby-lang.org/issues/18152>. Regardless of whether that patch is accepted, this Ruby example is not a good argument for changing glibc qsort behavior, since the bug is not triggered in practical programs.

I disagree, and stating:

"This is not a practical problem with glibc, since glibc qsort is 
async-signal-safe with small sorts and in practice Ruby's use of qsort 
is invariably small enough"

is even more problematic, because it means that we are providing
async-signal-safety depending on the input arguments (which is really a
bad interface).

And your fix is exactly what is not intended: users need to roll out an
ad-hoc sorting implementation to work around a libc limitation.

> 
> 
>> We already make some concessions about symbols that are not supposed to work
>> in some scenarios (such as malloc() after fork()).
> 
> Yes, often it make sense to go beyond what the standard requires. However, each case is different needs to be decided on its own merits.
> 
> 
>> The blocks usage is only for the *_b symbols; the BSDs do provide standard
>> C compliant mergesort and heapsort.
> 
> The blocks usage in macOS is essential for doing closures (the equivalent of qsort_r). Since we don't want to require blocks, we need a better API than what macOS supplies for non-qsort.
> 

My point is that neither blocks nor closures are essential to providing
different sort algorithms with different strategies and constraints.

> Would you like to work together to come up with a new API that addresses both of our concerns?

I am not against your suggestion, but I think this is orthogonal to
providing better semantics and guarantees for the current qsort.

> 
> 
>> I really don't think providing an async-signal-safe qsort with constant
>> worst-case space, and being *clear* about it, will 'scare away' users,
>> especially because we are making the interface behave in a more constrained
>> way, which imho is a good way forward.  Users will roll out their own
>> implementations due to different needs and tradeoffs, as gcc and python do.
> 
> If we were creating the qsort implementation for the first time, the async-signal-safety arguments could be compelling. However, it is an entirely different thing to ask a significant number of users to change longstanding uses; this would be more work for them simply because the libc maintainers changed their mind about a particular performance-safety tradeoff.
> 
> 
>> What I think really upsets users is finding out that if they use a different
>> element size or a large number of elements, qsort starts to call malloc
> 
> Users can be upset by many things. They can be upset if qsort calls malloc. They can be upset if qsort is slow. They can be especially upset if we make changes to qsort that force them to rewrite their code. No matter what we do, users can be upset. It's our job to manage these tradeoffs, and to do so in a stable and responsible way.

I agree that performance degradation is really a potential problem;
that's why I used a known algorithm that is the default sort algorithm
of another library (libstdc++).

And I tend to see async-signal-safety, along with bounded memory
utilization and a simple algorithm, as a better move forward for
*the system C library* than providing the state-of-the-art sort algorithm.
From my experience we get far more bug reports in the area of broken
semantics than of missing performance.  As before, users will most likely
move to another sorting algorithm if qsort is a hotspot.

And if performance is paramount we can provide it either as an alternative
API, as you suggest, or through a tunable.


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-07 18:07           ` Adhemerval Zanella via Libc-alpha
@ 2021-09-07 19:28             ` Paul Eggert
  2021-09-08 11:56               ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Paul Eggert @ 2021-09-07 19:28 UTC (permalink / raw)
  To: Adhemerval Zanella, Carlos O'Donell; +Cc: libc-alpha

On 9/7/21 11:07 AM, Adhemerval Zanella wrote:
> we are providing
> async-signal-safety depending on the input arguments (which is really a
> bad interface).

Many functions are async-signal-safe for some arguments but not others. 
The 'free' function is just one example. qsort is not at all unusual in 
this department. This sort of interface is "really bad" only in the 
sense that every function that is not async-signal-safe is "really bad".

> My point is that neither blocks nor closures are essential to providing
> different sort algorithms with different strategies and constraints.

If that's the point, then I disagree. qsort_r is there for a reason, and 
this reason applies to any sorting algorithm. Some callers can limp 
along without qsort_r, by writing code that is not thread-safe or 
whatever; but many callers really need qsort_r and a similar need would 
apply to any sorting algorithm.

>> > Would you like to work together to come up with a new API that addresses both of our concerns?
> I am not against your suggestion

That's a good way forward, then. Do you have an API in mind? If not, I 
can propose one.


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-07 19:28             ` Paul Eggert
@ 2021-09-08 11:56               ` Adhemerval Zanella via Libc-alpha
  2021-09-09  0:39                 ` Paul Eggert
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-09-08 11:56 UTC (permalink / raw)
  To: Paul Eggert, Carlos O'Donell; +Cc: libc-alpha



On 07/09/2021 16:28, Paul Eggert wrote:
> On 9/7/21 11:07 AM, Adhemerval Zanella wrote:
>> we are providing
>> async-signal-safety depending on the input arguments (which is really a
>> bad interface).
> 
> Many functions are async-signal-safe for some arguments but not others. The 'free' function is just one example. qsort is not at all unusual in this department. This sort of interface is "really bad" only in the sense that every function that is not async-signal-safe is "really bad".

My point is that it is confusing and leads to bad assumptions from users,
as ruby shows.

> 
>> My point is that neither blocks nor closures are essential to providing
>> different sort algorithms with different strategies and constraints.
> 
> If that's the point, then I disagree. qsort_r is there for a reason, and this reason applies to any sorting algorithm. Some callers can limp along without qsort_r, by writing code that is not thread-safe or whatever; but many callers really need qsort_r and a similar need would apply to any sorting algorithm.

I think you missed my point here: I was not advocating against providing
a qsort_r-like interface, but rather noting that the blocks interface
from BSD is just an extension.

> 
>>> > Would you like to work together to come up with a new API that addresses both of our concerns?
>> I am not against your suggestion
> 
> That's a good way forward, then. Do you have an API in mind? If not, I can propose one.

Sure, but this is orthogonal to the change I am proposing here. May
we move forward in this direction first?


* Re: [PATCH v3 0/7] Use introsort for qsort
  2021-09-08 11:56               ` Adhemerval Zanella via Libc-alpha
@ 2021-09-09  0:39                 ` Paul Eggert
  0 siblings, 0 replies; 41+ messages in thread
From: Paul Eggert @ 2021-09-09  0:39 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On 9/8/21 4:56 AM, Adhemerval Zanella wrote:

> it is confusing and leads to bad assumptions from users, as ruby
> shows.

Again, Ruby is not a good example of why a change is needed, as Ruby 
works fine in practice.

>>>>> Would you like to work together to come up with a new API that addresses both of our concerns?
>>> I am not against your suggestion
>>
>> That's a good way forward, then. Do you have an API in mind? If not, I can propose one.
> 
> Sure, but this is orthogonal to the change I am proposing here.

It's not orthogonal, because if we provide a better API, that will be a
solution to this (so far only-theoretical) problem.  Although a better
API should also address other problems, that doesn't mean it's
orthogonal to this one.


* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
  2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
@ 2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
  2021-10-15 12:52     ` Adhemerval Zanella via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-13  3:19 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> This patch adds a qsort benchmark.  I tried to model the benchmark
> taking in consideration the possible input variation in both internal
> element size, element numbers, and internal state for 1. real word cases
> and 2. possible scenarios based on hardware characteristics.
>
> For 1. I tracked qsort usage (using a simple preload library to dump its
> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
> GCC 8 bootstrap build shows 51786641 call to qsort with the following
> characterics:
>
> Key: number of elements:
> key=2 : 39.74
> key=3 : 19.23
> key=4 : 9.77
> key=1 : 8.44
> key=0 : 6.60
> key=5 : 4.40
> key=7 : 2.37
> key=6 : 2.25
> key=9 : 1.48
> key=8 : 0.97
>
> Key: element size in bytes:
> key=8 : 91.74
> key=32 : 3.62
> key=4 : 2.42
> key=40 : 1.20
> key=16 : 0.67
> key=24 : 0.30
> key=48 : 0.05
> key=56 : 0.00
> key=1 : 0.00
>
> Key: total size (number of elements * element size):
> key=16 : 35.98
> key=24 : 18.67
> key=32 : 9.79
> key=8 : 8.28
> key=0 : 6.60
> key=40 : 4.21
> key=64 : 3.15
> key=48 : 2.24
> key=56 : 2.15
> key=80 : 1.45
>
> So for GCC:
>
>   - 80% of total qsort usage are done with 10 elements of less.
>   - All usages are done element size of maximum of 56 bytes.
>   - 90% of calls are done with array of maximum size of 80 bytes or
>     less.
>
> (GCC on recent version now uses a internal qsort implementation instead
> of the libc one).
>
> The Firefox usage was done with 2 hours of usage, with first 10 minutes
> activelly openning and closing different types of sites.  It resulted in
> 21042 calls with following characteristics:
>
> Key: number of elements:
> key=7 : 24.40
> key=1 : 10.44
> key=3 : 6.33
> key=4 : 5.81
> key=2 : 5.46
> key=6 : 4.80
> key=17 : 4.54
> key=0 : 3.07
> key=5 : 3.05
> key=9 : 2.51
> key=12 : 2.06
>
> Key: element size in bytes:
> key=8 : 94.49
> key=28 : 4.40
> key=2 : 0.70
> key=16 : 0.19
> key=36 : 0.07
> key=12 : 0.07
> key=40 : 0.07
> key=24 : 0.03
>
> Key: total size (number of elements * element size):
> key=56 : 24.20
> key=8 : 10.27
> key=24 : 6.36
> key=32 : 5.86
> key=16 : 5.46
> key=48 : 4.80
> key=476 : 3.75
> key=0 : 3.07
> key=40 : 3.05
> key=72 : 2.50
>
> So for Firefox:
>
>   - 72% of total qsort usage are done with 18 elements of less.
>   - All usages are done element size of maximum of 40 bytes.
>   - 70% of calls are done with array of maximum size of 476 bytes or
>     less.
>
> For 2. I used the idea of a machine with 3 levels of cache with sizes
> L1 32kb, L2 256kb, and L3 4096Kb.
>
> It resulted in a benchmark with following traits:
>
>   * It checks four types of input arrays: sorted, mostly sorted, random,
>     repeated, and bitonic:
>
>     - 'sorted': the array is already sorted.
>     - 'mostly sorted': the array will have a certain number of random
>        position with random values (current ratio used is 20%).
>     - 'random' the array will contain random elements.
>     - 'repeated' the array will have random elements with a certain
>        number (current ratio is 20%) of a repeated element distributed
>        randomly.
>     - 'bitonic': elements will be strictly increasing to up middle of the
>        array then strictly decreasing.  This is to trigger the
>        pattern-defeting input for the median of 3 pivot selection used
>        in quicksort implementation.
>
>   * Three elements sizes are checked: uint32_t, uint64_t, and an
>     element with 32 bytes (but using the uint32_t comparison checks).
>     These element sizes are used to 1. to avoid include the comparison
>     function itself and/or memory copy in sort benchmark itself, and 2.
>     because key of size_t are the most used for both GCC and Firefox.
>
>   * Five different element numbers: 64 (which cover mostly of used
>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
>     --nelem option.
>
> Checked on x86_64-linux-gnu
> ---
>  benchtests/Makefile      |   2 +-
>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 196 insertions(+), 1 deletion(-)
>  create mode 100644 benchtests/bench-qsort.c
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 1530939a8c..781cbca8af 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>  include ../gen-locales.mk
>  endif
>
> -stdlib-benchset := strtod
> +stdlib-benchset := strtod qsort
>
>  stdio-common-benchset := sprintf
>
> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
> new file mode 100644
> index 0000000000..905aa6d7b6
> --- /dev/null
> +++ b/benchtests/bench-qsort.c
> @@ -0,0 +1,195 @@
> +/* Measure qsort function.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <array_length.h>
> +#include <bench-timing.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <json-lib.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <stdlib/tst-qsort-common.c>
> +
> +/* Number of elements of determined type to be measured.  */
> +static const size_t default_nelems[] =
> +{
> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
> +                              on GCC build.  */
> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
> +};
> +
> +#define OPT_NELEM 10000
> +#define OPT_SEED  10001
> +#define CMDLINE_OPTIONS \
> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
> +  { "seed", required_argument, NULL, OPT_SEED },
> +
> +static const size_t max_nelem = 16;
> +static size_t *elems = NULL;
> +static size_t nelem = 0;
> +static uint64_t seed = 0;
> +static bool seed_set = false;
> +
> +static void
> +cmdline_process_function (int c)
> +{
> +  switch (c)
> +    {
> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
> +        The elements sizes as passed with a ':' as the delimiter, for
> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
> +      case OPT_NELEM:
> +        {
> +         elems = xmalloc (max_nelem * sizeof (size_t));
> +         nelem = 0;
> +
> +         char *saveptr;
> +         char *token;
> +         token = strtok_r (optarg, ":", &saveptr);
> +         if (token == NULL)
> +           {
> +             printf ("error: invalid --nelem value\n");
> +             exit (EXIT_FAILURE);
> +           }
> +         do
> +           {
> +             if (nelem == max_nelem)
> +               {
> +                 printf ("error: invalid --nelem value (max elem)\n");
> +                 exit (EXIT_FAILURE);
> +               }
> +             elems[nelem++] = atol (token);
> +             token = strtok_r (saveptr, ":", &saveptr);
> +           } while (token != NULL);
> +        }
> +      break;
> +
> +      /* handle the --seed option to use a different seed than a random one.
> +        The SEED used should be a uint64_t number.  */
> +      case OPT_SEED:
> +       {
> +         uint64_t value = strtoull (optarg, NULL, 0);
> +         if (errno == ERANGE || value > UINT64_MAX)
> +           {
> +             printf ("error: seed should be a value in range of "
> +                     "[0, UINT64_MAX]\n");
> +             exit (EXIT_FAILURE);
> +           }
> +         seed = value;
> +         seed_set = true;
> +       }
> +    }
> +}
> +
> +#define CMDLINE_PROCESS cmdline_process_function
> +
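As an aside, with the options defined above a manual run of the benchmark
could look like this (illustrative only; the binary is produced under the
build tree's benchtests directory):

  ./bench-qsort --nelem 32:128:1024 --seed 42

i.e. run the 32-, 128-, and 1024-element variants with a fixed seed so the
generated inputs are reproducible.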
> +static int
> +do_test (void)
> +{
> +  random_init ();
> +
> +
> +  json_ctx_t json_ctx;
> +
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  json_attr_object_begin (&json_ctx, "functions");
> +  json_attr_object_begin (&json_ctx, "qsort");
> +  json_attr_uint (&json_ctx, "seed", seed);
> +
> +  json_array_begin (&json_ctx, "results");
> +
> +  const size_t *welem = elems == NULL ? default_nelems : elems;
> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
> +
> +  struct test_t
> +  {
> +    size_t type_size;
> +    cmpfunc_t cmpfunc;
> +  };
> +  static const struct test_t tests[] =
> +  {
> +    { sizeof (uint32_t), uint32_t_cmp },
> +    { sizeof (uint64_t), uint64_t_cmp },
> +    /* Test swap with large elements.  */
> +    { 32,                uint32_t_cmp },
> +  };
> +
> +  const size_t inner_loop_iters = 16;
> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
> +    {
> +      for (const struct array_t *arraytype = arraytypes;
> +          arraytype < array_end (arraytypes);
> +          ++arraytype)
> +       {
> +         for (int k = 0; k < wnelem; k++)
> +           {
> +             json_element_object_begin (&json_ctx);
> +
> +             size_t arraysize = welem[k] * test->type_size;
> +
> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
> +             json_attr_string (&json_ctx, "property", arraytype->name);
> +
> +             void *base = create_array (welem[k], test->type_size,
> +                                        arraytype->type);
> +             void *work = xmalloc (arraysize);
> +
> +             timing_t total = 0;
> +
> +             for (int n = 0; n < inner_loop_iters; n++)
> +               {
> +                 memcpy (work, base, arraysize);
> +
> +                 timing_t start, end, diff;
> +                 TIMING_NOW (start);
> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
> +                 TIMING_NOW (end);
> +
> +                 TIMING_DIFF (diff, start, end);
> +                 TIMING_ACCUM (total, diff);
> +               }
> +
To avoid overhead from timing, maybe you could allocate / initialize
multiple buffers to test a given sort on. Then instead of having to
re-memcpy / time you just loop through the N initialized buffers
with one timing call. For the smaller sorts this may help decrease
the timing overhead.

> +            json_attr_uint (&json_ctx, "timings",
> +                            (double) total / (double) inner_loop_iters);
> +            json_element_object_end (&json_ctx);
> +
> +            free (base);
> +            free (work);
> +          }
> +        }
> +    }
> +
> +  json_array_end (&json_ctx);
> +
> +  json_attr_object_end (&json_ctx);
> +  json_attr_object_end (&json_ctx);
> +  json_document_end (&json_ctx);
> +
> +  return 0;
> +}
> +
> +#define TIMEOUT 600
> +#include <support/test-driver.c>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-09-03 17:11 ` [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305) Adhemerval Zanella via Libc-alpha
@ 2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
  2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
  2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
  0 siblings, 2 replies; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-13  3:29 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha <
libc-alpha@sourceware.org> wrote:

> The optimization takes into consideration that the most common element
> sizes are either 32 or 64 bits [1] and that inputs are aligned to the
> word boundary.  This is similar to the optimization done on lib/sort.c
> from Linux.
>
> This patch adds an optimized swap operation to qsort based on the
> previous msort one.  Instead of a byte-wise operation, three variants
> are provided:
>
>   1. Using uint32_t loads and stores.
>   2. Using uint64_t loads and stores.
>   3. Generic one with a temporary buffer and memcpy/mempcpy.
>
> Options 1. and 2. are selected only if the architecture defines
> _STRING_ARCH_unaligned or if the base pointer is aligned to the
> required type.
>
> It also fixes BZ#19305 by checking the number of elements against 1 as
> well as 0.
>
> Checked on x86_64-linux-gnu.
>
> [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
> ---
>  stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 91 insertions(+), 18 deletions(-)
>
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index 23f2d28314..59458d151b 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -24,20 +24,85 @@
>  #include <limits.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <stdbool.h>
>
> -/* Byte-wise swap two items of size SIZE. */
> -#define SWAP(a, b, size)                                                     \
> -  do                                                                         \
> -    {                                                                        \
> -      size_t __size = (size);                                                \
> -      char *__a = (a), *__b = (b);                                           \
> -      do                                                                     \
> -       {                                                                     \
> -         char __tmp = *__a;                                                  \
> -         *__a++ = *__b;                                                      \
> -         *__b++ = __tmp;                                                     \
> -       } while (--__size > 0);                                               \
> -    } while (0)
> +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
> +   alongside the generic one as an optimization.  */
> +
> +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
> +
> +/* Return true if elements can be copied using word loads and stores.
> +   The size must be a multiple of the alignment, as must the base
> +   address.  */
> +static inline bool
> +is_aligned_to_copy (const void *base, size_t size, size_t align)
> +{
> +  unsigned char lsbits = size;
> +#if !_STRING_ARCH_unaligned
> +  lsbits |= (unsigned char)(uintptr_t) base;
> +#endif
> +  return (lsbits & (align - 1)) == 0;
> +}
> +
> +#define SWAP_WORDS_64 (swap_func_t)0
> +#define SWAP_WORDS_32 (swap_func_t)1
> +#define SWAP_BYTES    (swap_func_t)2
> +
> +static void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +   {
> +     n -= 8;
> +     uint64_t t = *(uint64_t *)(a + n);
> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> +     *(uint64_t *)(b + n) = t;
> +   } while (n);
> +}
> +
> +static void
> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +   {
> +     n -= 4;
> +     uint32_t t = *(uint32_t *)(a + n);
> +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
> +     *(uint32_t *)(b + n) = t;
> +   } while (n);
> +}
> +
> +static void
> +swap_bytes (void * restrict a, void * restrict b, size_t n)
> +{
> +  /* Use multiple small memcpys with constant size to enable inlining
> +     on most targets.  */
> +  enum { SWAP_GENERIC_SIZE = 32 };
> +  unsigned char tmp[SWAP_GENERIC_SIZE];
> +  while (n > SWAP_GENERIC_SIZE)
> +    {
> +      memcpy (tmp, a, SWAP_GENERIC_SIZE);
> +      a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> +      n -= SWAP_GENERIC_SIZE;
> +    }
> +  memcpy (tmp, a, n);
> +  memcpy (a, b, n);
> +  memcpy (b, tmp, n);
> +}
> +
> +/* Replace the indirect call with a series of if statements.  It should
> +   help the branch predictor.  */
>

1) Really? On Intel at least an indirect call that is always going to
the same place is certainly going to be predicted as well if not better
than 2/3 branches + direct call.

2) If you're going to just test which swap function to use, why bother
initializing swap_func? Why not just use an int?



> +static void
> +do_swap (void * restrict a, void * restrict b, size_t size,
> +        swap_func_t swap_func)
> +{
> +  if (swap_func == SWAP_WORDS_64)
> +    swap_words_64 (a, b, size);
> +  else if (swap_func == SWAP_WORDS_32)
> +    swap_words_32 (a, b, size);
> +  else
> +    swap_bytes (a, b, size);
> +}
>
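As a concrete reading of point 2, the selector could be a plain enum
rather than small integers cast to swap_func_t.  A sketch, not part of
the patch:

---
/* Replaces the SWAP_* function-pointer #defines above; the dispatch
   branches are unchanged.  */
enum swap_type_t { SWAP_WORDS_64, SWAP_WORDS_32, SWAP_BYTES };

static void
do_swap (void * restrict a, void * restrict b, size_t size,
         enum swap_type_t swap_type)
{
  if (swap_type == SWAP_WORDS_64)
    swap_words_64 (a, b, size);
  else if (swap_type == SWAP_WORDS_32)
    swap_words_32 (a, b, size);
  else
    swap_bytes (a, b, size);
}
---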
>  /* Discontinue quicksort algorithm when partition gets below this size.
>     This particular magic number was chosen to work best on a Sun 4/260. */
> @@ -97,6 +162,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>      /* Avoid lossage with unsigned arithmetic below.  */
>      return;
>
> +  swap_func_t swap_func;
> +  if (is_aligned_to_copy (pbase, size, 8))
> +    swap_func = SWAP_WORDS_64;
> +  else if (is_aligned_to_copy (pbase, size, 4))
> +    swap_func = SWAP_WORDS_32;
> +  else
> +    swap_func = SWAP_BYTES;
> +
>    if (total_elems > MAX_THRESH)
>      {
>        char *lo = base_ptr;
> @@ -120,13 +193,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>           char *mid = lo + size * ((hi - lo) / size >> 1);
>
>           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -           SWAP (mid, lo, size);
> +           do_swap (mid, lo, size, swap_func);
>           if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
> -           SWAP (mid, hi, size);
> +           do_swap (mid, hi, size, swap_func);
>           else
>             goto jump_over;
>           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -           SWAP (mid, lo, size);
> +           do_swap (mid, lo, size, swap_func);
>         jump_over:;
>
>           left_ptr  = lo + size;
> @@ -145,7 +218,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>
>               if (left_ptr < right_ptr)
>                 {
> -                 SWAP (left_ptr, right_ptr, size);
> +                 do_swap (left_ptr, right_ptr, size, swap_func);
>                   if (mid == left_ptr)
>                     mid = right_ptr;
>                   else if (mid == right_ptr)
> @@ -217,7 +290,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>          tmp_ptr = run_ptr;
>
>      if (tmp_ptr != base_ptr)
> -      SWAP (tmp_ptr, base_ptr, size);
> +      do_swap (tmp_ptr, base_ptr, size, swap_func);
>
>      /* Insertion sort, running from left-hand-side up to right-hand-side.  */
>
> --
> 2.30.2
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
@ 2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
  2021-10-15 13:29       ` Adhemerval Zanella via Libc-alpha
  2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-13  3:39 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

>
>
> On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha <
> libc-alpha@sourceware.org> wrote:
>
>> [...]
>> +static void
>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>> +{
>> +  do
>> +   {
>> +     n -= 8;
>> +     uint64_t t = *(uint64_t *)(a + n);
>> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>> +     *(uint64_t *)(b + n) = t;
>> +   } while (n);
>> +}
>> +
>> +static void
>> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>> +{
>> +  do
>> +   {
>> +     n -= 4;
>> +     uint32_t t = *(uint32_t *)(a + n);
>> +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>> +     *(uint32_t *)(b + n) = t;
>> +   } while (n);
>> +}
>>
>
I'm not certain swap_words_32 / swap_words_64 will be optimal for larger
key sizes. Looking at GCC's implementation of swap_generic on modern
x86_64: https://godbolt.org/z/638h3Y9va
It's able to optimize the temporary buffer out of the loop and use xmm
registers, which will likely win out for larger sizes.

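For reference, a swap_generic of roughly the shape the godbolt link
examines (a reconstruction, since the exact snippet is not quoted in the
thread):

---
/* Byte-wise exchange; simple enough that GCC can hoist the temporary
   and vectorize the loop with SIMD registers at -O2.  */
static void
swap_generic (void * restrict a, void * restrict b, size_t n)
{
  unsigned char *ua = a, *ub = b;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char t = ua[i];
      ua[i] = ub[i];
      ub[i] = t;
    }
}
---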
>> [...]
>>
>> +  swap_func_t swap_func;
>> +  if (is_aligned_to_copy (pbase, size, 8))
>> +    swap_func = SWAP_WORDS_64;
>> +  else if (is_aligned_to_copy (pbase, size, 4))
>>
>
For many modern architectures that support fast unaligned loads/stores
(for example x86_64 SnB and newer) I don't think this check really makes
sense.

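A sketch of what dropping the check might allow (an assumption, not part
of the patch; reuses qsort.c's existing includes): routing the word swap
through fixed-size memcpy keeps the accesses well-defined while letting
such targets emit plain, possibly unaligned, loads and stores.  The size
being a multiple of 8 is then the only remaining requirement:

---
static void
swap_words_64_any_align (void * restrict a, void * restrict b, size_t n)
{
  do
    {
      n -= 8;
      uint64_t ta, tb;
      /* Fixed-size memcpy compiles down to single load/store
         instructions on targets with cheap unaligned access.  */
      memcpy (&ta, a + n, sizeof ta);
      memcpy (&tb, b + n, sizeof tb);
      memcpy (a + n, &tb, sizeof tb);
      memcpy (b + n, &ta, sizeof ta);
    }
  while (n != 0);
}
---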

>> +    swap_func = SWAP_WORDS_32;
>> +  else
>> +    swap_func = SWAP_BYTES;
>> +
>> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 6/7] stdlib: Implement introsort with qsort
  2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
  2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
  2021-09-06 20:23   ` Fangrui Song via Libc-alpha
@ 2021-10-13  3:53   ` Noah Goldstein via Libc-alpha
  2 siblings, 0 replies; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-13  3:53 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Sep 3, 2021 at 1:16 PM Adhemerval Zanella via Libc-alpha <
libc-alpha@sourceware.org> wrote:

> This patch adds an introsort implementation to qsort to bound the
> worst-case performance of quicksort to O(n log n).  The heapsort
> fallback used is based on the Linux implementation (commit
> 22a241ccb2c19962a).  As a side note, the introsort implementation is
> similar to the one used on libstdc++ for std::sort.
>
> Checked on x86_64-linux-gnu.
> ---
>  stdlib/qsort.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 87 insertions(+), 7 deletions(-)
>
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index 5df640362d..8368576aae 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -113,6 +113,7 @@ typedef struct
>    {
>      char *lo;
>      char *hi;
> +    size_t depth;
>

Why do a depth tracker per stack_node as opposed to one for
the entire call?


>    } stack_node;
>
>  /* The stack needs log (total_elements) entries (we could even subtract
> @@ -122,23 +123,92 @@ typedef struct
>  enum { STACK_SIZE = CHAR_BIT * sizeof (size_t) };
>
>  static inline stack_node *
> -push (stack_node *top, char *lo, char *hi)
> +push (stack_node *top, char *lo, char *hi, size_t depth)
>  {
>    top->lo = lo;
>    top->hi = hi;
> +  top->depth = depth;
>    return ++top;
>  }
>
>  static inline stack_node *
> -pop (stack_node *top, char **lo, char **hi)
> +pop (stack_node *top, char **lo, char **hi, size_t *depth)
>  {
>    --top;
>    *lo = top->lo;
>    *hi = top->hi;
> +  *depth = top->depth;
>    return top;
>  }
>
>
> +/* A fast, small, non-recursive O(n log n) heapsort, adapted from Linux
> +   lib/sort.c.  Used by the introsort implementation as a fallback routine
> +   with worst-case performance of O(n log n) and worst-case space
> +   complexity of O(1).  */
> +
> +static inline size_t
> +parent (size_t i, unsigned int lsbit, size_t size)
> +{
> +  i -= size;
> +  i -= size & -(i & lsbit);
> +  return i / 2;
> +}
> +
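A worked example of parent (), using the definitions above: offsets are
multiples of SIZE, and the children of the node at offset p are at
2*p + size and 2*p + 2*size.  Take size = 12, so lsbit = 4, and the node
at offset 24, with children at offsets 60 and 72.  For the left child,
parent (60, 4, 12) computes i = 60 - 12 = 48; 48/12 = 4 is even, so
(i & lsbit) == 0, nothing more is subtracted, and 48/2 = 24.  For the
right child, i = 72 - 12 = 60 is an odd multiple of 12, so
(i & lsbit) != 0, another size is subtracted, and again 48/2 = 24.  The
lsbit trick works because size = m * 2^t with m odd, so i & lsbit tests
whether i/size is odd without any division.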
> +static void
> +heapsort_r (void *base, void *end, size_t size, swap_func_t swap_func,
> +           __compar_d_fn_t cmp, void *arg)
> +{
> +  size_t num = ((uintptr_t) end - (uintptr_t) base) / size;
> +  size_t n = num * size, a = (num/2) * size;
> +  /* Used to find parent  */
> +  const unsigned int lsbit = size & -size;
> +
> +  /* num < 2 || size == 0.  */
> +  if (a == 0)
> +    return;
> +
> +  for (;;)
> +    {
> +      size_t b, c, d;
> +
> +      if (a != 0)
> +       /* Building heap: sift down --a */
> +       a -= size;
> +      else if (n -= size)
> +       /* Sorting: Extract root to --n */
> +       do_swap (base, base + n, size, swap_func);
> +      else
> +       break;
> +
> +      /* Sift element at "a" down into heap.  This is the "bottom-up"
> +        variant, which significantly reduces calls to cmp_func(): we find
> +        the sift-down path all the way to the leaves (one compare per
> +        level), then backtrack to find where to insert the target element.
> +
> +        Because elements tend to sift down close to the leaves, this uses
> +        fewer compares than doing two per level on the way down.  (A bit
> +        more than half as many on average, 3/4 worst-case.)  */
> +      for (b = a; c = 2 * b + size, (d = c + size) < n;)
> +       b = cmp (base + c, base + d, arg) >= 0 ? c : d;
> +      if (d == n)
> +       /* Special case last leaf with no sibling.  */
> +       b = c;
> +
> +      /* Now backtrack from "b" to the correct location for "a".  */
> +      while (b != a && cmp (base + a, base + b, arg) >= 0)
> +       b = parent (b, lsbit, size);
> +      /* Where "a" belongs.  */
> +      c = b;
> +      while (b != a)
> +       {
> +         /* Shift it into place.  */
> +         b = parent (b, lsbit, size);
> +         do_swap (base + b, base + c, size, swap_func);
> +       }
> +    }
> +}
> +
>  /* Order size using quicksort.  This implementation incorporates
>     four optimizations discussed in Sedgewick:
>
> @@ -223,7 +293,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>
>    const size_t max_thresh = MAX_THRESH * size;
>
> -  if (total_elems == 0)
> +  if (total_elems <= 1)
>      /* Avoid lossage with unsigned arithmetic below.  */
>      return;
>
> @@ -235,6 +305,9 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>    else
>      swap_func = SWAP_BYTES;
>
> +  /* Maximum depth before quicksort switches to heapsort.  */
> +  size_t depth = 2 * (CHAR_BIT - 1 - __builtin_clzl (total_elems));
> +
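For reference, the intended bound here is the 2 * floor(log2 (n)) limit
libstdc++'s introsort uses for std::sort.  Since __builtin_clzl counts
leading zeros of an unsigned long, floor(log2 (n)) for the full type is
(a sketch of the arithmetic, assuming a 64-bit long):

---
/* floor(log2 (n)) for nonzero n: bit width minus one minus the number
   of leading zeros.  */
size_t log2n = (CHAR_BIT * sizeof (long)) - 1 - __builtin_clzl (n);
---

e.g. total_elems = 1000 gives 63 - 54 = 9, so quicksort may split at
most 18 levels deep before heapsort_r takes over.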
>    if (total_elems > MAX_THRESH)
>      {
>        char *lo = base_ptr;
> @@ -242,10 +315,17 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>        stack_node stack[STACK_SIZE];
>        stack_node *top = stack;
>
> -      top = push (top, NULL, NULL);
> +      top = push (top, NULL, NULL, depth);
>
>        while (stack < top)
>          {
> +         if (depth == 0)
> +           {
> +             heapsort_r (lo, hi, size, swap_func, cmp, arg);
> +             top = pop (top, &lo, &hi, &depth);
> +             continue;
> +           }
> +
>            char *left_ptr;
>            char *right_ptr;
>
> @@ -309,7 +389,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>              {
>                if ((size_t) (hi - left_ptr) <= max_thresh)
>                 /* Ignore both small partitions. */
> -               top = pop (top, &lo, &hi);
> +               top = pop (top, &lo, &hi, &depth);
>                else
>                 /* Ignore small left partition. */
>                  lo = left_ptr;
> @@ -320,13 +400,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
>            else if ((right_ptr - lo) > (hi - left_ptr))
>

Since we now have a depth counter, is it faster to still
always select the bigger region for the stack or
to remove this branch and just choose a direction?

We should be able to bound the size of the stack structure
with depth.

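Back-of-the-envelope for that bound: with the 2 * floor(log2 (n)) depth
limit, every stack entry sits at a distinct depth of the active path, so
even an unconditional "always push the right partition" strategy needs at
most 2 * 63 = 126 entries for a 64-bit size_t — a fixed bound, at the
cost of roughly doubling the current 64-entry STACK_SIZE.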

>              {
>               /* Push larger left partition indices. */
> -             top = push (top, lo, right_ptr);
> +             top = push (top, lo, right_ptr, depth - 1);
>                lo = left_ptr;
>              }
>            else
>              {
>               /* Push larger right partition indices. */
> -             top = push (top, left_ptr, hi);
> +             top = push (top, left_ptr, hi, depth - 1);
>                hi = right_ptr;
>              }
>          }
> --
> 2.30.2
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
@ 2021-10-15 12:52     ` Adhemerval Zanella via Libc-alpha
  2021-10-15 16:39       ` Noah Goldstein via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 12:52 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 13/10/2021 00:19, Noah Goldstein wrote:
> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
> >> This patch adds a qsort benchmark.  I tried to model the benchmark
> >> taking into consideration the possible input variation in internal
> >> element size, element number, and internal state, for 1. real-world
> >> cases and 2. possible scenarios based on hardware characteristics.
>>
> >> For 1. I tracked qsort usage (using a simple preload library to dump its
> >> usage and a script to post-process it) on both GCC bootstrap and Firefox.
> >> A GCC 8 bootstrap build shows 51786641 calls to qsort with the following
> >> characteristics:
>>
>> Key: number of elements:
>> key=2 : 39.74
>> key=3 : 19.23
>> key=4 : 9.77
>> key=1 : 8.44
>> key=0 : 6.60
>> key=5 : 4.40
>> key=7 : 2.37
>> key=6 : 2.25
>> key=9 : 1.48
>> key=8 : 0.97
>>
>> Key: element size in bytes:
>> key=8 : 91.74
>> key=32 : 3.62
>> key=4 : 2.42
>> key=40 : 1.20
>> key=16 : 0.67
>> key=24 : 0.30
>> key=48 : 0.05
>> key=56 : 0.00
>> key=1 : 0.00
>>
>> Key: total size (number of elements * element size):
>> key=16 : 35.98
>> key=24 : 18.67
>> key=32 : 9.79
>> key=8 : 8.28
>> key=0 : 6.60
>> key=40 : 4.21
>> key=64 : 3.15
>> key=48 : 2.24
>> key=56 : 2.15
>> key=80 : 1.45
>>
>> So for GCC:
>>
> >>   - 80% of total qsort usage is done with 10 elements or less.
> >>   - All usages are done with an element size of at most 56 bytes.
> >>   - 90% of calls are done with arrays of 80 bytes or less in total
> >>     size.
>>
> >> (Recent GCC versions now use an internal qsort implementation instead
> >> of the libc one.)
>>
> >> The Firefox usage was tracked over 2 hours, with the first 10 minutes
> >> actively opening and closing different types of sites.  It resulted in
> >> 21042 calls with the following characteristics:
>>
>> Key: number of elements:
>> key=7 : 24.40
>> key=1 : 10.44
>> key=3 : 6.33
>> key=4 : 5.81
>> key=2 : 5.46
>> key=6 : 4.80
>> key=17 : 4.54
>> key=0 : 3.07
>> key=5 : 3.05
>> key=9 : 2.51
>> key=12 : 2.06
>>
>> Key: element size in bytes:
>> key=8 : 94.49
>> key=28 : 4.40
>> key=2 : 0.70
>> key=16 : 0.19
>> key=36 : 0.07
>> key=12 : 0.07
>> key=40 : 0.07
>> key=24 : 0.03
>>
>> Key: total size (number of elements * element size):
>> key=56 : 24.20
>> key=8 : 10.27
>> key=24 : 6.36
>> key=32 : 5.86
>> key=16 : 5.46
>> key=48 : 4.80
>> key=476 : 3.75
>> key=0 : 3.07
>> key=40 : 3.05
>> key=72 : 2.50
>>
>> So for Firefox:
>>
> >>   - 72% of total qsort usage is done with 18 elements or less.
> >>   - All usages are done with an element size of at most 40 bytes.
> >>   - 70% of calls are done with arrays of 476 bytes or less in total
> >>     size.
>>
> >> For 2. I used the idea of a machine with 3 levels of cache with sizes
> >> L1 32 KB, L2 256 KB, and L3 4096 KB.
>>
>> It resulted in a benchmark with following traits:
>>
> >>   * It checks five types of input arrays: sorted, mostly sorted, random,
> >>     repeated, and bitonic:
>>
>>     - 'sorted': the array is already sorted.
> >>     - 'mostly sorted': the array will have a certain number of random
> >>        positions with random values (current ratio used is 20%).
>>     - 'random' the array will contain random elements.
>>     - 'repeated' the array will have random elements with a certain
>>        number (current ratio is 20%) of a repeated element distributed
>>        randomly.
> >>     - 'bitonic': elements will be strictly increasing up to the middle of
> >>        the array, then strictly decreasing.  This is to trigger the
> >>        pattern-defeating input for the median-of-3 pivot selection used
> >>        in the quicksort implementation.
>>
> >>   * Three element sizes are checked: uint32_t, uint64_t, and an
> >>     element with 32 bytes (but using the uint32_t comparison checks).
> >>     These element sizes are used 1. to avoid including the comparison
> >>     function itself and/or memory copy in the sort benchmark itself, and
> >>     2. because keys of size_t are the most used for both GCC and Firefox.
>>
> >>   * Five different element numbers: 64 (which covers most of the used
> >>     element numbers for both GCC and Firefox), 4096/8192 (which cover 32 KB
> >>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
> >>     524288/1048576 (L3 with 4 MB).  The sizes are configurable by the
> >>     --nelem option.
>>
>> Checked on x86_64-linux-gnu
>> ---
>>  benchtests/Makefile      |   2 +-
>>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 196 insertions(+), 1 deletion(-)
>>  create mode 100644 benchtests/bench-qsort.c
>>
>> diff --git a/benchtests/Makefile b/benchtests/Makefile
>> index 1530939a8c..781cbca8af 100644
>> --- a/benchtests/Makefile
>> +++ b/benchtests/Makefile
>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>>  include ../gen-locales.mk
>>  endif
>>
>> -stdlib-benchset := strtod
>> +stdlib-benchset := strtod qsort
>>
>>  stdio-common-benchset := sprintf
>>
> >> [...]
>> +             void *base = create_array (welem[k], test->type_size,
>> +                                        arraytype->type);
>> +             void *work = xmalloc (arraysize);
>> +
>> +             timing_t total = 0;
>> +
>> +             for (int n = 0; n < inner_loop_iters; n++)
>> +               {
>> +                 memcpy (work, base, arraysize);
>> +
>> +                 timing_t start, end, diff;
>> +                 TIMING_NOW (start);
>> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
>> +                 TIMING_NOW (end);
>> +
>> +                 TIMING_DIFF (diff, start, end);
>> +                 TIMING_ACCUM (total, diff);
>> +               }
>> +
> To avoid overhead from timing, maybe you could allocate / initialize
> multiple buffers to test a given sort on. Then instead of having to
> re-memcpy / time you just loop through the N initialized buffers
> with one timing call. For the smaller sorts this may help decrease
> the timing overhead.

I think this makes sense, although it would require more memory. I changed
it to:

---
  for (const struct test_t *test = tests; test < array_end (tests); ++test)
    {
      for (size_t at = 0; at < array_length (arraytypes); at++)
        {
          for (int ne = 0; ne < wnelem; ne++)
            {
              void *inputs[INNER_LOOP_ITERS];
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                inputs[i] = create_array (welem[ne], test->type_size,
                                          arraytypes[at].type);

              timing_t start, end, diff;
              TIMING_NOW (start);
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
              TIMING_NOW (end);

              TIMING_DIFF (diff, start, end);

              json_element_object_begin (&json_ctx);
              json_attr_uint (&json_ctx, "nmemb", welem[ne]);
              json_attr_uint (&json_ctx, "type_size", test->type_size);
              json_attr_string (&json_ctx, "property", arraytypes[at].name);
              uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
              json_attr_uint (&json_ctx, "timings", timing);
              json_element_object_end (&json_ctx);

              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                free (inputs[i]);
            }
        }
    }
---

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
  2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
@ 2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
  2021-10-15 16:45       ` Noah Goldstein via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 13:12 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 13/10/2021 00:29, Noah Goldstein wrote:
>     [...]
>     +
>     +/* Replace the indirect call with a series of if statements.  It should help
>     +   the branch predictor.  */
> 
>  
> 1) Really? On Intel at least an indirect call that is always going to the same place
> is certainly going to be predicted as well if not better than 2/3 branches + direct call.
> 

I shamelessly copied the same strategy the Linux kernel used on its
lib/sort.c (8fb583c4258d08f0).  Maybe Linux's internal usage of its
sort() leads to more predictable branches, and for this change I would
prefer something that works well on different architectures rather than
assuming a specific one.

> 2) If you're going to just test which swap function to use, why bother initializing
> swap_func? Why not just use an int?

Indeed there is not much gain for glibc usage.  The kernel provides an
API that takes a user-defined swap_func, which is not our case.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
@ 2021-10-15 13:29       ` Adhemerval Zanella via Libc-alpha
  2021-10-15 17:17         ` Noah Goldstein via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 13:29 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 13/10/2021 00:39, Noah Goldstein wrote:
> 
> 
> On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> 
> 
> 
>     On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> wrote:
> 
>         [...]
>         +
>         +static void
>         +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>         +{
>         +  do
>         +   {
>         +     n -= 8;
>         +     uint64_t t = *(uint64_t *)(a + n);
>         +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>         +     *(uint64_t *)(b + n) = t;
>         +   } while (n);
>         +}
>         +
>         +static void
>         +swap_words_32 (void * restrict a, void * restrict b, size_t n)
>         +{
>         +  do
>         +   {
>         +     n -= 4;
>         +     uint32_t t = *(uint32_t *)(a + n);
>         +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
>         +     *(uint32_t *)(b + n) = t;
>         +   } while (n);
>         +}
> 
> 
> I'm not certain swap_words_32 / swap_words_64 will be optimal for larger
> key sizes. Looking at GCC's implementation of swap_generic on modern x86_64:
> https://godbolt.org/z/638h3Y9va
> It's able to optimize the temporary buffer out of the loop and use xmm
> registers, which will likely win out for larger sizes.

It is probably not the most optimized code the compiler can generate,
since I tried to go for more generic code that should result in somewhat
better code across architectures.  I was trying to mimic some of the
optimizations Linux did on 37d0ec34d111acfdb, and for swap_bytes I used
the same strategy used on d3496c9f4f27d (Linux did not optimize anything
for the byte version).

I think it would be possible to tune it for a specific architecture
and/or compiler, but I would prefer to use a good enough algorithm that
works reasonably on multiple architectures.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-10-15 12:52     ` Adhemerval Zanella via Libc-alpha
@ 2021-10-15 16:39       ` Noah Goldstein via Libc-alpha
  2021-10-15 17:19         ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-15 16:39 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/10/2021 00:19, Noah Goldstein wrote:
> > On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> This patch adds a qsort benchmark.  I tried to model the benchmark
> >> taking in consideration the possible input variation in both internal
> >> element size, element numbers, and internal state for 1. real word cases
> >> and 2. possible scenarios based on hardware characteristics.
> >>
> >> For 1. I tracked qsort usage (using a simple preload library to dump its
> >> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
> >> GCC 8 bootstrap build shows 51786641 call to qsort with the following
> >> characterics:
> >>
> >> Key: number of elements:
> >> key=2 : 39.74
> >> key=3 : 19.23
> >> key=4 : 9.77
> >> key=1 : 8.44
> >> key=0 : 6.60
> >> key=5 : 4.40
> >> key=7 : 2.37
> >> key=6 : 2.25
> >> key=9 : 1.48
> >> key=8 : 0.97
> >>
> >> Key: element size in bytes:
> >> key=8 : 91.74
> >> key=32 : 3.62
> >> key=4 : 2.42
> >> key=40 : 1.20
> >> key=16 : 0.67
> >> key=24 : 0.30
> >> key=48 : 0.05
> >> key=56 : 0.00
> >> key=1 : 0.00
> >>
> >> Key: total size (number of elements * element size):
> >> key=16 : 35.98
> >> key=24 : 18.67
> >> key=32 : 9.79
> >> key=8 : 8.28
> >> key=0 : 6.60
> >> key=40 : 4.21
> >> key=64 : 3.15
> >> key=48 : 2.24
> >> key=56 : 2.15
> >> key=80 : 1.45
> >>
> >> So for GCC:
> >>
> >>   - 80% of total qsort usage are done with 10 elements of less.
> >>   - All usages are done element size of maximum of 56 bytes.
> >>   - 90% of calls are done with array of maximum size of 80 bytes or
> >>     less.
> >>
> >> (GCC on recent version now uses a internal qsort implementation instead
> >> of the libc one).
> >>
> >> The Firefox usage was done with 2 hours of usage, with first 10 minutes
> >> activelly openning and closing different types of sites.  It resulted in
> >> 21042 calls with following characteristics:
> >>
> >> Key: number of elements:
> >> key=7 : 24.40
> >> key=1 : 10.44
> >> key=3 : 6.33
> >> key=4 : 5.81
> >> key=2 : 5.46
> >> key=6 : 4.80
> >> key=17 : 4.54
> >> key=0 : 3.07
> >> key=5 : 3.05
> >> key=9 : 2.51
> >> key=12 : 2.06
> >>
> >> Key: element size in bytes:
> >> key=8 : 94.49
> >> key=28 : 4.40
> >> key=2 : 0.70
> >> key=16 : 0.19
> >> key=36 : 0.07
> >> key=12 : 0.07
> >> key=40 : 0.07
> >> key=24 : 0.03
> >>
> >> Key: total size (number of elements * element size):
> >> key=56 : 24.20
> >> key=8 : 10.27
> >> key=24 : 6.36
> >> key=32 : 5.86
> >> key=16 : 5.46
> >> key=48 : 4.80
> >> key=476 : 3.75
> >> key=0 : 3.07
> >> key=40 : 3.05
> >> key=72 : 2.50
> >>
> >> So for Firefox:
> >>
> >>   - 72% of total qsort usage are done with 18 elements of less.
> >>   - All usages are done element size of maximum of 40 bytes.
> >>   - 70% of calls are done with array of maximum size of 476 bytes or
> >>     less.
> >>
> >> For 2. I used the idea of a machine with 3 levels of cache with sizes
> >> L1 32kb, L2 256kb, and L3 4096Kb.
> >>
> >> It resulted in a benchmark with following traits:
> >>
> >>   * It checks four types of input arrays: sorted, mostly sorted, random,
> >>     repeated, and bitonic:
> >>
> >>     - 'sorted': the array is already sorted.
> >>     - 'mostly sorted': the array will have a certain number of random
> >>        position with random values (current ratio used is 20%).
> >>     - 'random' the array will contain random elements.
> >>     - 'repeated' the array will have random elements with a certain
> >>        number (current ratio is 20%) of a repeated element distributed
> >>        randomly.
> >>     - 'bitonic': elements will be strictly increasing to up middle of the
> >>        array then strictly decreasing.  This is to trigger the
> >>        pattern-defeting input for the median of 3 pivot selection used
> >>        in quicksort implementation.
> >>
> >>   * Three elements sizes are checked: uint32_t, uint64_t, and an
> >>     element with 32 bytes (but using the uint32_t comparison checks).
> >>     These element sizes are used to 1. to avoid include the comparison
> >>     function itself and/or memory copy in sort benchmark itself, and 2.
> >>     because key of size_t are the most used for both GCC and Firefox.
> >>
> >>   * Five different element numbers: 64 (which cover mostly of used
> >>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
> >>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
> >>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
> >>     --nelem option.
> >>
> >> Checked on x86_64-linux-gnu
> >> ---
> >>  benchtests/Makefile      |   2 +-
> >>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 196 insertions(+), 1 deletion(-)
> >>  create mode 100644 benchtests/bench-qsort.c
> >>
> >> diff --git a/benchtests/Makefile b/benchtests/Makefile
> >> index 1530939a8c..781cbca8af 100644
> >> --- a/benchtests/Makefile
> >> +++ b/benchtests/Makefile
> >> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
> >>  include ../gen-locales.mk
> >>  endif
> >>
> >> -stdlib-benchset := strtod
> >> +stdlib-benchset := strtod qsort
> >>
> >>  stdio-common-benchset := sprintf
> >>
> >> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
> >> new file mode 100644
> >> index 0000000000..905aa6d7b6
> >> --- /dev/null
> >> +++ b/benchtests/bench-qsort.c
> >> @@ -0,0 +1,195 @@
> >> +/* Measure qsort function.
> >> +   Copyright (C) 2018 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include <array_length.h>
> >> +#include <bench-timing.h>
> >> +#include <errno.h>
> >> +#include <getopt.h>
> >> +#include <json-lib.h>
> >> +#include <stdio.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <stdlib/tst-qsort-common.c>
> >> +
> >> +/* Number of elements of determined type to be measured.  */
> >> +static const size_t default_nelems[] =
> >> +{
> >> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
> >> +                              on GCC build.  */
> >> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
> >> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
> >> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
> >> +};
> >> +
> >> +#define OPT_NELEM 10000
> >> +#define OPT_SEED  10001
> >> +#define CMDLINE_OPTIONS \
> >> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
> >> +  { "seed", required_argument, NULL, OPT_SEED },
> >> +
> >> +static const size_t max_nelem = 16;
> >> +static size_t *elems = NULL;
> >> +static size_t nelem = 0;
> >> +static uint64_t seed = 0;
> >> +static bool seed_set = false;
> >> +
> >> +static void
> >> +cmdline_process_function (int c)
> >> +{
> >> +  switch (c)
> >> +    {
> >> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
> >> +        The elements sizes as passed with a ':' as the delimiter, for
> >> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
> >> +      case OPT_NELEM:
> >> +        {
> >> +         elems = xmalloc (max_nelem * sizeof (size_t));
> >> +         nelem = 0;
> >> +
> >> +         char *saveptr;
> >> +         char *token;
> >> +         token = strtok_r (optarg, ":", &saveptr);
> >> +         if (token == NULL)
> >> +           {
> >> +             printf ("error: invalid --nelem value\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         do
> >> +           {
> >> +             if (nelem == max_nelem)
> >> +               {
> >> +                 printf ("error: invalid --nelem value (max elem)\n");
> >> +                 exit (EXIT_FAILURE);
> >> +               }
> >> +             elems[nelem++] = atol (token);
> >> +             token = strtok_r (saveptr, ":", &saveptr);
> >> +           } while (token != NULL);
> >> +        }
> >> +      break;
> >> +
> >> +      /* handle the --seed option to use a different seed than a random one.
> >> +        The SEED used should be a uint64_t number.  */
> >> +      case OPT_SEED:
> >> +       {
> >> +         uint64_t value = strtoull (optarg, NULL, 0);
> >> +         if (errno == ERANGE || value > UINT64_MAX)
> >> +           {
> >> +             printf ("error: seed should be a value in range of "
> >> +                     "[0, UINT64_MAX]\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         seed = value;
> >> +         seed_set = true;
> >> +       }
> >> +    }
> >> +}
> >> +
> >> +#define CMDLINE_PROCESS cmdline_process_function
> >> +
> >> +static int
> >> +do_test (void)
> >> +{
> >> +  random_init ();
> >> +
> >> +  json_ctx_t json_ctx;
> >> +
> >> +  json_init (&json_ctx, 0, stdout);
> >> +
> >> +  json_document_begin (&json_ctx);
> >> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> >> +
> >> +  json_attr_object_begin (&json_ctx, "functions");
> >> +  json_attr_object_begin (&json_ctx, "qsort");
> >> +  json_attr_uint (&json_ctx, "seed", seed);
> >> +
> >> +  json_array_begin (&json_ctx, "results");
> >> +
> >> +  const size_t *welem = elems == NULL ? default_nelems : elems;
> >> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
> >> +
> >> +  struct test_t
> >> +  {
> >> +    size_t type_size;
> >> +    cmpfunc_t cmpfunc;
> >> +  };
> >> +  static const struct test_t tests[] =
> >> +  {
> >> +    { sizeof (uint32_t), uint32_t_cmp },
> >> +    { sizeof (uint64_t), uint64_t_cmp },
> >> +    /* Test swap with large elements.  */
> >> +    { 32,                uint32_t_cmp },
> >> +  };
> >> +
> >> +  const size_t inner_loop_iters = 16;
> >> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
> >> +    {
> >> +      for (const struct array_t *arraytype = arraytypes;
> >> +          arraytype < array_end (arraytypes);
> >> +          ++arraytype)
> >> +       {
> >> +         for (int k = 0; k < wnelem; k++)
> >> +           {
> >> +             json_element_object_begin (&json_ctx);
> >> +
> >> +             size_t arraysize = welem[k] * test->type_size;
> >> +
> >> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
> >> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
> >> +             json_attr_string (&json_ctx, "property", arraytype->name);
> >> +
> >> +             void *base = create_array (welem[k], test->type_size,
> >> +                                        arraytype->type);
> >> +             void *work = xmalloc (arraysize);
> >> +
> >> +             timing_t total = 0;
> >> +
> >> +             for (int n = 0; n < inner_loop_iters; n++)
> >> +               {
> >> +                 memcpy (work, base, arraysize);
> >> +
> >> +                 timing_t start, end, diff;
> >> +                 TIMING_NOW (start);
> >> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
> >> +                 TIMING_NOW (end);
> >> +
> >> +                 TIMING_DIFF (diff, start, end);
> >> +                 TIMING_ACCUM (total, diff);
> >> +               }
> >> +
> > To avoid overhead from timing, maybe you could allocate / initialize
> > multiple buffers to test a given sort on. Then instead of having to
> > re-memcpy / time, you just loop through the N initialized buffers
> > with one timing call. For the smaller sorts this may help decrease
> > the timing overhead.
>
> I think this makes sense, although it would require more memory.  I changed
> it to:
>
> ---
>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>     {
>       for (size_t at = 0; at < array_length (arraytypes); at++)
>         {
>           for (int ne = 0; ne < wnelem; ne++)
>             {
>               void *inputs[INNER_LOOP_ITERS];
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 inputs[i] = create_array (welem[ne], test->type_size,
>                                           arraytypes[at].type);

I think it may pay to use a 1-dimensional array, i.e.

size_t qsort_bytes = welem[ne] * test->type_size;
char *inputs = malloc (qsort_bytes * NTESTS);
// initialize [inputs, inputs + qsort_bytes)
for (int i = 1; i < NTESTS; ++i)
    memcpy (inputs + qsort_bytes * i, inputs, qsort_bytes);

then increment inputs by 'qsort_bytes' in the test loop.

NTESTS may be fine being INNER_LOOP_ITERS, but
if we are concerned about using too much memory
we could add another loop through INNER_LOOP_ITERS / NTESTS
and accumulate the time diffs.
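
Roughly, as a sketch (reusing the benchmark's timing macros; `base' here is
my name for a pristine template copy of the array, not something from the
patch):

  timing_t total = 0;
  for (int chunk = 0; chunk < INNER_LOOP_ITERS / NTESTS; chunk++)
    {
      /* Refill the NTESTS copies from the template array, outside the
         timed region.  */
      for (int i = 0; i < NTESTS; i++)
        memcpy (inputs + qsort_bytes * i, base, qsort_bytes);

      timing_t start, end, diff;
      TIMING_NOW (start);
      for (int i = 0; i < NTESTS; i++)
        qsort (inputs + qsort_bytes * i, welem[ne], test->type_size,
               test->cmpfunc);
      TIMING_NOW (end);
      TIMING_DIFF (diff, start, end);
      TIMING_ACCUM (total, diff);
    }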



>
>               timing_t start, end, diff;
>               TIMING_NOW (start);
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
>               TIMING_NOW (end);
>
>               TIMING_DIFF (diff, start, end);
>
>               json_element_object_begin (&json_ctx);
>               json_attr_uint (&json_ctx, "nmemb", welem[ne]);
>               json_attr_uint (&json_ctx, "type_size", test->type_size);
>               json_attr_string (&json_ctx, "property", arraytypes[at].name);
>               uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
>               json_attr_uint (&json_ctx, "timings", timing);
>               json_element_object_end (&json_ctx);
>
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 free (inputs[i]);
>             }
>         }
>     }
> ---

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
@ 2021-10-15 16:45       ` Noah Goldstein via Libc-alpha
  2021-10-15 17:21         ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-15 16:45 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Oct 15, 2021 at 8:12 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/10/2021 00:29, Noah Goldstein wrote:
> >     +static void
> >     +swap_bytes (void * restrict a, void * restrict b, size_t n)
> >     +{
> >     +  /* Use multiple small memcpys with constant size to enable inlining
> >     +     on most targets.  */
> >     +  enum { SWAP_GENERIC_SIZE = 32 };
> >     +  unsigned char tmp[SWAP_GENERIC_SIZE];
> >     +  while (n > SWAP_GENERIC_SIZE)
> >     +    {
> >     +      memcpy (tmp, a, SWAP_GENERIC_SIZE);
> >     +      a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> >     +      b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
> >     +      n -= SWAP_GENERIC_SIZE;
> >     +    }
> >     +  memcpy (tmp, a, n);
> >     +  memcpy (a, b, n);
> >     +  memcpy (b, tmp, n);
> >     +}
> >     +
> >     +/* Replace the indirect call with a series of if statements.  It should help
> >     +   the branch predictor.  */
> >
> >
> > 1) Really? On Intel at least, an indirect call that is always going to the same
> > place is certainly going to be predicted as well as, if not better than, 2/3
> > branches + a direct call.
> >
>
> I shamelessly copied the same strategy the Linux kernel used on its lib/sort.c
> (8fb583c4258d08f0).  Maybe Linux's internal usage of its sort() leads to
> more predictable branches, and for this change I would prefer to work
> well on different architectures rather than assume a specific one.

If it's running in the kernel, Spectre mitigations can make indirect jumps
particularly expensive. I don't think we have the same issue targeting
userland.
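
For concreteness, the two dispatch shapes being compared look roughly like
this (a sketch, not the patch's exact code):

  /* Branch dispatch, as in the patch: swap_func is a sentinel value
     tested with direct branches at each swap site.  */
  if (swap_func == SWAP_WORDS_64)
    swap_words_64 (a, b, size);
  else if (swap_func == SWAP_WORDS_32)
    swap_words_32 (a, b, size);
  else
    swap_bytes (a, b, size);

  /* Indirect dispatch: swap_func is a real function pointer and every
     swap goes through it, relying on the indirect branch predictor.  */
  swap_func (a, b, size);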

>
> > 2) If you're going to just test which swap function to use, why bother initializing
> > swap_func? Why not just use an int?
>
> Indeed there is not much gain for glibc usage.  The kernel provides an API
> with a user-defined swap_func, which is not our case.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 13:29       ` Adhemerval Zanella via Libc-alpha
@ 2021-10-15 17:17         ` Noah Goldstein via Libc-alpha
  2021-10-15 17:34           ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-15 17:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Oct 15, 2021 at 8:29 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/10/2021 00:39, Noah Goldstein wrote:
> >
> >
> > On Tue, Oct 12, 2021 at 11:29 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> >
> >
> >     On Fri, Sep 3, 2021 at 1:14 PM Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> wrote:
> >
> >         The optimization takes into consideration both that the most common
> >         elements are either 32 or 64 bits in size [1] and that inputs are
> >         aligned to the word boundary.  This is similar to the optimization
> >         done on lib/sort.c from Linux.
> >
> >         This patch adds an optimized swap operation to qsort based on the
> >         previous msort one.  Instead of a byte operation, three variants are
> >         provided:
> >
> >           1. Using uint32_t loads and stores.
> >           2. Using uint64_t loads and stores.
> >           3. Generic one with a temporary buffer and memcpy/mempcpy.
> >
> >         Options 1. and 2. are selected only if the architecture defines
> >         _STRING_ARCH_unaligned or if the base pointer is aligned to the
> >         required type.
> >
> >         It also fixes BZ#19305 by checking the number of elements against 1
> >         in addition to 0.
> >
> >         Checked on x86_64-linux-gnu.
> >
> >         [1] https://sourceware.org/pipermail/libc-alpha/2018-August/096984.html
> >         ---
> >          stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
> >          1 file changed, 91 insertions(+), 18 deletions(-)
> >
> >         diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> >         index 23f2d28314..59458d151b 100644
> >         --- a/stdlib/qsort.c
> >         +++ b/stdlib/qsort.c
> >         @@ -24,20 +24,85 @@
> >          #include <limits.h>
> >          #include <stdlib.h>
> >          #include <string.h>
> >         +#include <stdbool.h>
> >
> >         -/* Byte-wise swap two items of size SIZE. */
> >         -#define SWAP(a, b, size)                                                     \
> >         -  do                                                                         \
> >         -    {                                                                        \
> >         -      size_t __size = (size);                                                \
> >         -      char *__a = (a), *__b = (b);                                           \
> >         -      do                                                                     \
> >         -       {                                                                     \
> >         -         char __tmp = *__a;                                                  \
> >         -         *__a++ = *__b;                                                      \
> >         -         *__b++ = __tmp;                                                     \
> >         -       } while (--__size > 0);                                               \
> >         -    } while (0)
> >         +/* Swap SIZE bytes between addresses A and B.  These helpers are provided
> >         +   along with the generic one as an optimization.  */
> >         +
> >         +typedef void (*swap_func_t)(void * restrict, void * restrict, size_t);
> >         +
> >         +/* Return true if elements can be copied using word loads and stores.
> >         +   The size must be a multiple of the alignment, as must the base
> >         +   address.  */
> >         +static inline bool
> >         +is_aligned_to_copy (const void *base, size_t size, size_t align)
> >         +{
> >         +  unsigned char lsbits = size;
> >         +#if !_STRING_ARCH_unaligned
> >         +  lsbits |= (unsigned char)(uintptr_t) base;
> >         +#endif
> >         +  return (lsbits & (align - 1)) == 0;
> >         +}
> >         +
> >         +#define SWAP_WORDS_64 (swap_func_t)0
> >         +#define SWAP_WORDS_32 (swap_func_t)1
> >         +#define SWAP_BYTES    (swap_func_t)2
> >         +
> >         +static void
> >         +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> >         +{
> >         +  do
> >         +   {
> >         +     n -= 8;
> >         +     uint64_t t = *(uint64_t *)(a + n);
> >         +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> >         +     *(uint64_t *)(b + n) = t;
> >         +   } while (n);
> >         +}
> >         +
> >         +static void
> >         +swap_words_32 (void * restrict a, void * restrict b, size_t n)
> >         +{
> >         +  do
> >         +   {
> >         +     n -= 4;
> >         +     uint32_t t = *(uint32_t *)(a + n);
> >         +     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
> >         +     *(uint32_t *)(b + n) = t;
> >         +   } while (n);
> >         +}
> >
> >
> > I'm not certain swap_words_32 / swap_words_64 will be optimal for larger
> > key sizes. Looking at GCC's implementation of swap_generic on modern x86_64:
> > https://godbolt.org/z/638h3Y9va
> > It's able to optimize the temporary buffer out of the loop and use xmm
> > registers, which will likely win out for larger sizes.
>
> It is probably not the most optimized code the compiler can generate,
> since I tried to go for more generic code that results in somewhat
> better code in an architecture-agnostic way.  I was trying to mimic some
> of the optimizations Linux did on 37d0ec34d111acfdb, and for swap_bytes I
> used the same strategy used on d3496c9f4f27d (Linux did not optimize
> anything for the byte version).
>
> I think it would be possible to tune it for a specific architecture
> and/or compiler, but I would prefer to use a good enough algorithm that
> works reasonably on multiple architectures.

That's fair, although I still think there are some improvements to be had.

Looking at the assembly for all three, in fact, it seems GCC optimizes all of
them to larger register copies: https://godbolt.org/z/bd9nnnoEY

The biggest difference seems to be the setup / register spills for
the generic version, so for the common case of a relatively small key
the special cases for 4/8 make sense.

Have you checked that GCC is able to use the conditions for selecting
the swap function to optimize the functions themselves? In the godbolt
link I got reasonable value out of adding the invariants to swap_64/swap_32.
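
For illustration, one form such an invariant can take (a sketch of the idea;
the exact annotations in the godbolt link may differ) is asserting the
caller's precondition so GCC can drop the residual-size handling:

  static void
  swap_words_64 (void * restrict a, void * restrict b, size_t n)
  {
    /* Invariant guaranteed by the callers: N is a non-zero multiple of 8.  */
    if (n == 0 || n % 8 != 0)
      __builtin_unreachable ();
    do
      {
        n -= 8;
        uint64_t t = *(uint64_t *)(a + n);
        *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
        *(uint64_t *)(b + n) = t;
      } while (n);
  }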

It also may be worth it to write a custom memcpy implementation for
size = 0..SWAP_GENERIC_SIZE so it can be inlined (and probably
more optimized than what generic memcpy can get away with).

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 1/7] benchtests: Add bench-qsort
  2021-10-15 16:39       ` Noah Goldstein via Libc-alpha
@ 2021-10-15 17:19         ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 17:19 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 15/10/2021 13:39, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/10/2021 00:19, Noah Goldstein wrote:
>>> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> This patch adds a qsort benchmark.  I tried to model the benchmark
>>>> taking into consideration the possible input variation in both internal
>>>> element size, element numbers, and internal state for 1. real-world cases
>>>> and 2. possible scenarios based on hardware characteristics.
>>>>
>>>> For 1. I tracked qsort usage (using a simple preload library to dump its
>>>> usage and a script to post-process it) on both GCC bootstrap and Firefox.
>>>> A GCC 8 bootstrap build shows 51786641 calls to qsort with the following
>>>> characteristics:
>>>>
>>>> Key: number of elements:
>>>> key=2 : 39.74
>>>> key=3 : 19.23
>>>> key=4 : 9.77
>>>> key=1 : 8.44
>>>> key=0 : 6.60
>>>> key=5 : 4.40
>>>> key=7 : 2.37
>>>> key=6 : 2.25
>>>> key=9 : 1.48
>>>> key=8 : 0.97
>>>>
>>>> Key: element size in bytes:
>>>> key=8 : 91.74
>>>> key=32 : 3.62
>>>> key=4 : 2.42
>>>> key=40 : 1.20
>>>> key=16 : 0.67
>>>> key=24 : 0.30
>>>> key=48 : 0.05
>>>> key=56 : 0.00
>>>> key=1 : 0.00
>>>>
>>>> Key: total size (number of elements * element size):
>>>> key=16 : 35.98
>>>> key=24 : 18.67
>>>> key=32 : 9.79
>>>> key=8 : 8.28
>>>> key=0 : 6.60
>>>> key=40 : 4.21
>>>> key=64 : 3.15
>>>> key=48 : 2.24
>>>> key=56 : 2.15
>>>> key=80 : 1.45
>>>>
>>>> So for GCC:
>>>>
>>>>   - 80% of total qsort usage is done with 10 elements or less.
>>>>   - All usages are done with an element size of at most 56 bytes.
>>>>   - 90% of calls are done with an array of at most 80 bytes.
>>>>
>>>> (GCC in recent versions now uses an internal qsort implementation instead
>>>> of the libc one).
>>>>
>>>> The Firefox usage was tracked over 2 hours, with the first 10 minutes
>>>> actively opening and closing different types of sites.  It resulted in
>>>> 21042 calls with the following characteristics:
>>>>
>>>> Key: number of elements:
>>>> key=7 : 24.40
>>>> key=1 : 10.44
>>>> key=3 : 6.33
>>>> key=4 : 5.81
>>>> key=2 : 5.46
>>>> key=6 : 4.80
>>>> key=17 : 4.54
>>>> key=0 : 3.07
>>>> key=5 : 3.05
>>>> key=9 : 2.51
>>>> key=12 : 2.06
>>>>
>>>> Key: element size in bytes:
>>>> key=8 : 94.49
>>>> key=28 : 4.40
>>>> key=2 : 0.70
>>>> key=16 : 0.19
>>>> key=36 : 0.07
>>>> key=12 : 0.07
>>>> key=40 : 0.07
>>>> key=24 : 0.03
>>>>
>>>> Key: total size (number of elements * element size):
>>>> key=56 : 24.20
>>>> key=8 : 10.27
>>>> key=24 : 6.36
>>>> key=32 : 5.86
>>>> key=16 : 5.46
>>>> key=48 : 4.80
>>>> key=476 : 3.75
>>>> key=0 : 3.07
>>>> key=40 : 3.05
>>>> key=72 : 2.50
>>>>
>>>> So for Firefox:
>>>>
>>>>   - 72% of total qsort usage is done with 18 elements or less.
>>>>   - All usages are done with an element size of at most 40 bytes.
>>>>   - 70% of calls are done with an array of at most 476 bytes.
>>>>
>>>> For 2. I used the idea of a machine with 3 levels of cache with sizes
>>>> L1 32 KB, L2 256 KB, and L3 4096 KB.
>>>>
>>>> It resulted in a benchmark with the following traits:
>>>>
>>>>   * It checks five types of input arrays: sorted, mostly sorted, random,
>>>>     repeated, and bitonic:
>>>>
>>>>     - 'sorted': the array is already sorted.
>>>>     - 'mostly sorted': the array will have a certain number of random
>>>>        positions with random values (current ratio used is 20%).
>>>>     - 'random': the array will contain random elements.
>>>>     - 'repeated': the array will have random elements with a certain
>>>>        number (current ratio is 20%) of a repeated element distributed
>>>>        randomly.
>>>>     - 'bitonic': elements will be strictly increasing up to the middle of
>>>>        the array, then strictly decreasing.  This is to trigger the
>>>>        pattern-defeating input for the median-of-3 pivot selection used
>>>>        in the quicksort implementation.
>>>>
>>>>   * Three element sizes are checked: uint32_t, uint64_t, and an
>>>>     element of 32 bytes (but using the uint32_t comparison checks).
>>>>     These element sizes are used 1. to avoid including the comparison
>>>>     function itself and/or memory copies in the sort benchmark itself,
>>>>     and 2. because keys of size size_t are the most used for both GCC
>>>>     and Firefox.
>>>>
>>>>   * Four different element numbers: 64 (which covers most of the used
>>>>     element numbers for both GCC and Firefox), 4096/8192 (which cover 32 KB
>>>>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>>>>     524288/1048576 (L3 with 4 MB).  The sizes are configurable by the
>>>>     --nelem option.
>>>>
>>>> Checked on x86_64-linux-gnu
>>>> ---
>>>>  benchtests/Makefile      |   2 +-
>>>>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 196 insertions(+), 1 deletion(-)
>>>>  create mode 100644 benchtests/bench-qsort.c
>>>>
>>>> diff --git a/benchtests/Makefile b/benchtests/Makefile
>>>> index 1530939a8c..781cbca8af 100644
>>>> --- a/benchtests/Makefile
>>>> +++ b/benchtests/Makefile
>>>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>>>>  include ../gen-locales.mk
>>>>  endif
>>>>
>>>> -stdlib-benchset := strtod
>>>> +stdlib-benchset := strtod qsort
>>>>
>>>>  stdio-common-benchset := sprintf
>>>>
>>>> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
>>>> new file mode 100644
>>>> index 0000000000..905aa6d7b6
>>>> --- /dev/null
>>>> +++ b/benchtests/bench-qsort.c
>>>> @@ -0,0 +1,195 @@
>>>> [...]
>>>> +             timing_t total = 0;
>>>> +
>>>> +             for (int n = 0; n < inner_loop_iters; n++)
>>>> +               {
>>>> +                 memcpy (work, base, arraysize);
>>>> +
>>>> +                 timing_t start, end, diff;
>>>> +                 TIMING_NOW (start);
>>>> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
>>>> +                 TIMING_NOW (end);
>>>> +
>>>> +                 TIMING_DIFF (diff, start, end);
>>>> +                 TIMING_ACCUM (total, diff);
>>>> +               }
>>>> +
>>> To avoid overhead from timing, maybe you could allocate / initialize
>>> multiple buffers to test a given sort on. Then instead of having to
>>> re-memcpy / time, you just loop through the N initialized buffers
>>> with one timing call. For the smaller sorts this may help decrease
>>> the timing overhead.
>>
>> I think this makes sense, although it would require more memory.  I changed
>> it to:
>>
>> ---
>>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>>     {
>>       for (size_t at = 0; at < array_length (arraytypes); at++)
>>         {
>>           for (int ne = 0; ne < wnelem; ne++)
>>             {
>>               void *inputs[INNER_LOOP_ITERS];
>>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>>                 inputs[i] = create_array (welem[ne], test->type_size,
>>                                           arraytypes[at].type);
> 
> I think it may pay to use a 1-dimensional array, i.e.
> 
> size_t qsort_bytes = welem[ne] * test->type_size;
> char *inputs = malloc (qsort_bytes * NTESTS);
> // initialize [inputs, inputs + qsort_bytes)
> for (int i = 1; i < NTESTS; ++i)
>     memcpy (inputs + qsort_bytes * i, inputs, qsort_bytes);
> 
> then increment inputs by 'qsort_bytes' in the test loop.
> 
> NTESTS may be fine being INNER_LOOP_ITERS, but
> if we are concerned about using too much memory
> we could add another loop through INNER_LOOP_ITERS / NTESTS
> and accumulate the time diffs.

Right, I added a fill_array() (which is similar to create_array,
but does not allocate memory) and changed it to:

  size_t qsort_size = welem[ne] * test->type_size;
  char *inputs = xmalloc (qsort_size * INNER_LOOP_ITERS);
  for (int i = 0; i < INNER_LOOP_ITERS; i++)
    fill_array (inputs + qsort_size * i, welem[ne],
                test->type_size, arraytypes[at].type);
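
The timed loop then walks the flat buffer; a sketch following the snippet
above (with the same timing macros as before):

  timing_t start, end, diff;
  TIMING_NOW (start);
  for (int i = 0; i < INNER_LOOP_ITERS; i++)
    qsort (inputs + qsort_size * i, welem[ne], test->type_size,
           test->cmpfunc);
  TIMING_NOW (end);
  TIMING_DIFF (diff, start, end);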

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 16:45       ` Noah Goldstein via Libc-alpha
@ 2021-10-15 17:21         ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 17:21 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 15/10/2021 13:45, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 8:12 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/10/2021 00:29, Noah Goldstein wrote:
>>>     [...]
>>>     +/* Replace the indirect call with a series of if statements.  It should help
>>>     +   the branch predictor.  */
>>>
>>>
>>> 1) Really? On Intel at least, an indirect call that is always going to the same
>>> place is certainly going to be predicted as well as, if not better than, 2/3
>>> branches + a direct call.
>>>
>>
>> I shamelessly copied the same strategy the Linux kernel used on its lib/sort.c
>> (8fb583c4258d08f0).  Maybe Linux's internal usage of its sort() leads to
>> more predictable branches, and for this change I would prefer to work
>> well on different architectures rather than assume a specific one.
> 
> If it's running in the kernel, Spectre mitigations can make indirect jumps
> particularly expensive. I don't think we have the same issue targeting
> userland.

Indeed I didn't take this into consideration, and it does not seem to
help much in the glibc case.  I will use indirect calls.
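
So the selection happens once before sorting and every swap goes through the
pointer; roughly (a sketch reusing the helpers from the patch, with LEFT and
RIGHT as illustrative names for the partition pointers):

  swap_func_t swap_func;
  if (is_aligned_to_copy (base, size, 8))
    swap_func = swap_words_64;
  else if (is_aligned_to_copy (base, size, 4))
    swap_func = swap_words_32;
  else
    swap_func = swap_bytes;

  /* ... later, in the partition loop ... */
  swap_func (left, right, size);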

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 17:17         ` Noah Goldstein via Libc-alpha
@ 2021-10-15 17:34           ` Adhemerval Zanella via Libc-alpha
  2021-10-15 17:45             ` Noah Goldstein via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 17:34 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 15/10/2021 14:17, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 8:29 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> [...]
> 
> That's fair, although I still think there are some improvements to be had.
>
> Looking at the assembly for all three, in fact, it seems GCC optimizes all of
> them to larger register copies: https://godbolt.org/z/bd9nnnoEY
>
> The biggest difference seems to be the setup / register spills for
> the generic version, so for the common case of a relatively small key
> the special cases for 4/8 make sense.
> 
> Have you checked that GCC is able to use the conditions for selecting
> the swap function to optimize the functions themselves? In the godbolt
> link I got reasonable value out of adding the invariants to swap_64/swap_32.

Not yet, but I think the generated code for both x86_64 and aarch64 seems
simple enough and should cover the common case (keys with size 4 or 8)
fast enough [1]. 

And since for v4 my plan is not to remove mergesort anymore but to limit it
to a static buffer, it might be possible to use the same swap routines
for both mergesort and quicksort.

> 
> It also may be worth it to write a custom memcpy implementation for
> size = 0..SWAP_GENERIC_SIZE so it can be inlined (and probably
> more optimized than what generic memcpy can get away with).
> 

I *really* do not want to go this way.  My hope is that with small sizes
the compiler can inline the memcpy (which GCC does for most common
architectures).


[1] https://godbolt.org/z/v7e4xxqGa

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 17:34           ` Adhemerval Zanella via Libc-alpha
@ 2021-10-15 17:45             ` Noah Goldstein via Libc-alpha
  2021-10-15 17:56               ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Noah Goldstein via Libc-alpha @ 2021-10-15 17:45 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Fri, Oct 15, 2021 at 12:34 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 15/10/2021 14:17, Noah Goldstein wrote:
> > On Fri, Oct 15, 2021 at 8:29 AM Adhemerval Zanella
> > <adhemerval.zanella@linaro.org> wrote:
> >> [...]
> >
> > That's fair, although I still think there are some improvements to be had.
> >
> > Looking at the assembly for all three, in fact, it seems GCC optimizes all
> > of them to larger register copies: https://godbolt.org/z/bd9nnnoEY
> >
> > The biggest difference seems to be the setup / register spills for
> > the generic version, so for the common case of a relatively small key
> > the special cases for 4/8 make sense.
> >
> > Have you checked that GCC is able to use the conditions for selecting
> > the swap function to optimize the functions themselves? In the godbolt
> > link I got reasonable value out of adding the invariants to swap_64/swap_32.
>
> Not yet, but I think the generated code for both x86_64 and aarch64 seems
> simple enough and should cover the common case (keys with size 4 or 8)
> fast enough [1].

swap is in the inner loop, so it seems like a pretty critical component to
have fully optimized. The aarch64 version looks good, but the x86_64 version
seems to be lacking. I am not arguing for an arch-specific version, but if
the directives can add value to x86_64 without detracting from aarch64, it
seems like a zero-cost improvement.
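
For example, one such directive as a sketch (whether it actually pays off on
x86_64 is exactly what would need measuring):

  /* Hypothetical hint inside swap_words_64: promise GCC the 8-byte
     alignment that is_aligned_to_copy already established.  */
  a = __builtin_assume_aligned (a, 8);
  b = __builtin_assume_aligned (b, 8);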

>
> And since for v4 my plan is not to remove mergesort anymore but to limit it
> to a static buffer, it might be possible to use the same swap routines
> for both mergesort and quicksort.
>
> >
> > It also may be worth it to write a custom memcpy implementation for
> > size = 0..SWAP_GENERIC_SIZE so it can be inlined (and probably
> > more optimized than what generic memcpy can get away with).
> >
>
> I *really* do not want to go this way.  My hope is that with small sizes
> the compiler can inline the memcpy (which GCC does for most common
> architectures).

Since size is non-constant for the tail, I don't see how we are going to
avoid 3x memcpy calls.  Although that can be another patch if it
adds value.

>
>
> [1] https://godbolt.org/z/v7e4xxqGa

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305)
  2021-10-15 17:45             ` Noah Goldstein via Libc-alpha
@ 2021-10-15 17:56               ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-10-15 17:56 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 15/10/2021 14:45, Noah Goldstein wrote:
> swap is in the inner loop, so it seems like a pretty critical component to
> have fully optimized. The aarch64 version looks good, but the x86_64 version
> seems to be lacking. I am not arguing for an arch-specific version, but if
> the directives can add value to x86_64 without detracting from aarch64, it
> seems like a zero-cost improvement.
> 
> Since size is non-constant for the tail, I don't see how we are going to
> avoid 3x memcpy calls.  Although that can be another patch if it
> adds value.
> 
>>
>>
>> [1] https://godbolt.org/z/v7e4xxqGa

Maybe use a byte copy in the tail to avoid memcpy [1]; another option
might be to tune SWAP_GENERIC_SIZE to make the tail less costly (16 should
be ok for most architectures, although some might do better with larger
values).

[1] https://godbolt.org/z/G76dcej16
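
A sketch of that byte-copy tail (my illustration of the idea, not necessarily
the exact code in [1]):

  static void
  swap_bytes (void * restrict a, void * restrict b, size_t n)
  {
    enum { SWAP_GENERIC_SIZE = 32 };
    unsigned char tmp[SWAP_GENERIC_SIZE];
    while (n > SWAP_GENERIC_SIZE)
      {
        /* Constant-size memcpys, which the compiler can inline.  */
        memcpy (tmp, a, SWAP_GENERIC_SIZE);
        a = memcpy (a, b, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
        b = memcpy (b, tmp, SWAP_GENERIC_SIZE) + SWAP_GENERIC_SIZE;
        n -= SWAP_GENERIC_SIZE;
      }
    /* Swap the residual byte by byte so the variable-sized tail never
       calls an out-of-line memcpy.  */
    unsigned char *ua = a, *ub = b;
    for (size_t i = 0; i < n; i++)
      {
        unsigned char t = ua[i];
        ua[i] = ub[i];
        ub[i] = t;
      }
  }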

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-10-15 17:56 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-03 17:11 [PATCH v3 0/7] Use introsort for qsort Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 1/7] benchtests: Add bench-qsort Adhemerval Zanella via Libc-alpha
2021-09-04  9:09   ` Alexander Monakov via Libc-alpha
2021-09-06 18:30     ` Adhemerval Zanella via Libc-alpha
2021-10-13  3:19   ` Noah Goldstein via Libc-alpha
2021-10-15 12:52     ` Adhemerval Zanella via Libc-alpha
2021-10-15 16:39       ` Noah Goldstein via Libc-alpha
2021-10-15 17:19         ` Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 2/7] support: Fix getopt_long with CMDLINE_OPTIONS Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 3/7] stdlib: Optimization qsort{_r} swap implementation (BZ #19305) Adhemerval Zanella via Libc-alpha
2021-10-13  3:29   ` Noah Goldstein via Libc-alpha
2021-10-13  3:39     ` Noah Goldstein via Libc-alpha
2021-10-15 13:29       ` Adhemerval Zanella via Libc-alpha
2021-10-15 17:17         ` Noah Goldstein via Libc-alpha
2021-10-15 17:34           ` Adhemerval Zanella via Libc-alpha
2021-10-15 17:45             ` Noah Goldstein via Libc-alpha
2021-10-15 17:56               ` Adhemerval Zanella via Libc-alpha
2021-10-15 13:12     ` Adhemerval Zanella via Libc-alpha
2021-10-15 16:45       ` Noah Goldstein via Libc-alpha
2021-10-15 17:21         ` Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 4/7] stdlib: Move insertion sort out qsort Adhemerval Zanella via Libc-alpha
2021-09-06 20:35   ` Fangrui Song via Libc-alpha
2021-09-06 20:48     ` Fangrui Song via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 5/7] stdlib: qsort: Move some macros to inline function Adhemerval Zanella via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 6/7] stdlib: Implement introsort with qsort Adhemerval Zanella via Libc-alpha
2021-09-04  9:17   ` Alexander Monakov via Libc-alpha
2021-09-06 18:43     ` Adhemerval Zanella via Libc-alpha
2021-09-06 20:23   ` Fangrui Song via Libc-alpha
2021-10-13  3:53   ` Noah Goldstein via Libc-alpha
2021-09-03 17:11 ` [PATCH v3 7/7] stdlib: Remove use of mergesort on qsort (BZ #21719) Adhemerval Zanella via Libc-alpha
2021-09-03 19:18 ` [PATCH v3 0/7] Use introsort for qsort Paul Eggert
2021-09-06 14:13   ` Carlos O'Donell via Libc-alpha
2021-09-06 17:03     ` Zack Weinberg via Libc-alpha
2021-09-06 18:19       ` Adhemerval Zanella via Libc-alpha
2021-09-07  0:14     ` Paul Eggert
2021-09-07 14:32       ` Adhemerval Zanella via Libc-alpha
2021-09-07 17:39         ` Paul Eggert
2021-09-07 18:07           ` Adhemerval Zanella via Libc-alpha
2021-09-07 19:28             ` Paul Eggert
2021-09-08 11:56               ` Adhemerval Zanella via Libc-alpha
2021-09-09  0:39                 ` Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).