From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: Carlos O'Donell <carlos@redhat.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Cc: nd <nd@arm.com>
Subject: [PATCH v2] Add malloc micro benchmark
Date: Tue, 2 Jan 2018 18:20:04 +0000	[thread overview]
Message-ID: <DB6PR0801MB2053641333453CE91496266E83190@DB6PR0801MB2053.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <6ad98d83-d49b-25a3-ef01-e93e18f4740b@redhat.com>

Carlos O'Donell wrote:

> If you have a pattern of malloc/free of *similar* sized blocks, then
> it overflows the sized bin in the tcache, with other size bins remaining
> empty. The cache itself does not dynamically reconfigure itself to consume
> X MiB or Y % of RSS, instead it uses a simple data structure to contain
> a fixed number of fixed size blocks.
>
> Therefore I agree, that enhancing the core data structure in tcache may
> result in better overall performance, particularly if we got rid of the
> fixed bin sizes and instead found a way to be performant *and* keep a
> running total of consumption.

Well, it could keep track of the sum of all cached block sizes and limit that.
That would be better than a fixed limit on the number of blocks in each bin.
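
Something along these lines (a rough sketch only - the structure and names
below are made up for illustration, not actual glibc code):

#include <stdbool.h>
#include <stddef.h>

#define TCACHE_BINS 64

typedef struct tcache_entry
{
  struct tcache_entry *next;
} tcache_entry;

typedef struct
{
  tcache_entry *bins[TCACHE_BINS];
  size_t cached_bytes;   /* Running total of all cached block sizes.  */
  size_t bytes_limit;    /* Byte budget for the whole cache, e.g. a tunable.  */
} tcache_sketch;

/* Cache BLOCK of SIZE bytes in BIN if the byte budget allows it.
   Return false if the block should go back to the arena instead.  */
static bool
tcache_put_sketch (tcache_sketch *tc, void *block, size_t size, int bin)
{
  if (tc->cached_bytes + size > tc->bytes_limit)
    return false;
  tcache_entry *e = block;
  e->next = tc->bins[bin];
  tc->bins[bin] = e;
  tc->cached_bytes += size;
  return true;
}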

> Likewise *all* of malloc needs to be moved to a better data structure than
> just linked lists. I would like to see glibc's malloc offer a cacheing
> footprint of no more than Y % of RSS available, and let the user tweak that.
> Currently we just consume RSS without much regard for overhead. Though this
> is a different case than than what you are talking about, the changes are
> related via data-structure enhancements that would benefit both cases IMO.

What kind of data structure do you have in mind? Small blocks could be
allocated together in pages. This would avoid the per-block overhead and
change the linked-list walk into a bitmap scan. However there aren't many
alternatives. I've written first-fit allocators using self-balancing trees, but
walking the tree is expensive because it has poor locality.
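
For small blocks I mean something like this (illustration only, names made
up): one page holds same-sized blocks, and a bitmap replaces the free list,
so allocation becomes a find-first-set instead of a pointer chase:

#include <stddef.h>
#include <stdint.h>

/* One page of same-sized small blocks; bit i of FREE_MAP set means
   block i is free (up to 64 blocks per page in this sketch).  */
typedef struct
{
  uint64_t free_map;
  uint32_t block_size;
  char blocks[];
} small_page;

static void *
page_alloc_sketch (small_page *p)
{
  if (p->free_map == 0)
    return NULL;                          /* Page is full.  */
  int i = __builtin_ctzll (p->free_map);  /* Index of first free block.  */
  p->free_map &= p->free_map - 1;         /* Mark it allocated.  */
  return p->blocks + (size_t) i * p->block_size;
}

static void
page_free_sketch (small_page *p, void *ptr)
{
  size_t i = ((char *) ptr - p->blocks) / p->block_size;
  p->free_map |= (uint64_t) 1 << i;
}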

>> I *wish* we could test main_arena vs. threaded arena, since they
>> have different code and behave differently e.g. sbrk vs. mmap'd
>> heap.

I've added tests of the main and thread arenas in the latest version, as well
as a test with a larger tcache.

> As I suggested in bug 15321:
> https://sourceware.org/bugzilla/show_bug.cgi?id=15321
>
> We need to merge the main_arena and threaded code together, and stop
> treating them as different things. Right now the main_arena, if you
> look at the code, is a *pretend* heap with a partial data structure
> layered in place. This needs to go away. We need to treat all heaps
> as identical, with identical code paths, with just different backing
> storage.

Yes that sounds like a good plan.

> I think people still expect that thread 0 allocates from the sbrk
> heap in a single-threaded application, and we can do that by ensuring
> sbrk is used to provide the backing store for the main thread. This way
> we can jump the pointer 64MB like we normally do for mmap'd heaps, but
> then on page touch there the kernel just extends the heap normally.

A quick benchmark shows sbrk is about 25% faster than mmap, so it
appears useful. Looking at the addresses, sbrk grows up the address
space, while mmap grows down. I think other arenas could use sbrk as
well, as long as you can free sbrk memory via MADV_FREE or
MADV_DONTNEED.
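
A comparison of that kind can be as simple as the following standalone
sketch (illustration only, not part of this patch; error handling omitted):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (64 * 1024)
#define ITERS 1024

static long
elapsed_ns (struct timespec t0, struct timespec t1)
{
  return (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
}

int
main (void)
{
  struct timespec t0, t1;

  /* Grow the heap with sbrk and touch every page.  */
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++)
    memset (sbrk (CHUNK), 1, CHUNK);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("sbrk: %ld ns\n", elapsed_ns (t0, t1));

  /* Map anonymous memory with mmap and touch every page.  */
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++)
    memset (mmap (NULL, CHUNK, PROT_READ | PROT_WRITE,
		  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0), 1, CHUNK);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("mmap: %ld ns\n", elapsed_ns (t0, t1));
  return 0;
}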


>> Implementation:
>>
>> You need to make this robust against env vars changing malloc
>> behaviour. You should use mallopt to change some parameters.

I've added support to change the tcache count in mallopt, so we can
benchmark different settings.

> We need to move to MADV_FREE, which was designed for memory allocators.
>
> The semantics of MADV_DONTNEED have the problem that one has to consider:
> * Is the data destructively lost in that page?
> * Is the data flushed to the underlying store before being not-needed?
> All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure
> that users don't corrupt the data in their backing stores.
>
> I think that detection of MADV_FREE, and usage, would help performance,
> but only on > Linux 4.5, and that might be OK for you.

Well, we should detect when MADV_FREE is supported and use that.
If not, tweak the MADV_DONTNEED threshold - perhaps base it on a
percentage of RSS so we limit the overhead.
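
Roughly like this (illustration only, names made up): try MADV_FREE once
and remember whether the kernel accepted it; kernels before 4.5 return
EINVAL, in which case we fall back to MADV_DONTNEED:

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* -1 unknown, 0 unsupported, 1 supported.  MADV_FREE needs recent
   kernel headers to be defined at all.  */
static int madv_free_works = -1;

static void
return_pages_sketch (void *addr, size_t len)
{
  if (madv_free_works != 0)
    {
      if (madvise (addr, len, MADV_FREE) == 0)
	{
	  madv_free_works = 1;
	  return;
	}
      if (errno == EINVAL)
	madv_free_works = 0;   /* Kernel too old; don't try again.  */
    }
  madvise (addr, len, MADV_DONTNEED);
}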


Anyway, here is the new version:


Add a malloc micro benchmark to enable accurate testing of the
various paths in malloc and free.  The benchmark does a varying
number of allocations of a given block size, then frees them again.

It tests 4 different scenarios: single-threaded using the main arena,
multi-threaded using a thread arena, the main arena with SINGLE_THREAD_P
false, and the main arena with the tcache count set larger than the
default.  To enable this, add support for M_TCACHE_COUNT in mallopt.

OK for commit?

ChangeLog:
2018-01-02  Wilco Dijkstra  <wdijkstr@arm.com>

	* benchtests/Makefile: Add malloc-simple benchmark.
	* benchtests/bench-malloc-simple.c: New benchmark.
	* malloc/malloc.h (M_TCACHE_COUNT): Add new define.
	* malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT.

--
diff --git a/benchtests/Makefile b/benchtests/Makefile
index 74b3821ccfea6912e68578ad2598d68a9e38223c..5052bbbfe79f6d5a0b16c427dfc4807271805e61 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -90,7 +90,7 @@ CFLAGS-bench-trunc.c += -fno-builtin
 CFLAGS-bench-truncf.c += -fno-builtin
 
 ifeq (${BENCHSET},)
-bench-malloc := malloc-thread
+bench-malloc := malloc-thread malloc-simple
 else
 bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
@@ -98,7 +98,7 @@ endif
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
-$(objpfx)bench-malloc-thread: $(shared-thread-library)
+$(addprefix $(objpfx)bench-,$(bench-malloc)): $(shared-thread-library)
 
 \f
 
@@ -165,7 +165,7 @@ bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
    wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-   malloc-thread
+   malloc-thread malloc-simple
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
@@ -201,10 +201,18 @@ bench-set: $(binaries-benchset)
 
 bench-malloc: $(binaries-bench-malloc)
 	for run in $^; do \
+	  echo "$${run}"; \
+	  if [ `basename $${run}` = "bench-malloc-thread" ]; then \
 		for thr in 1 8 16 32; do \
 			echo "Running $${run} $${thr}"; \
-	  $(run-bench) $${thr} > $${run}-$${thr}.out; \
-	  done;\
+			$(run-bench) $${thr} > $${run}-$${thr}.out; \
+		done;\
+	  else \
+		for thr in 8 16 32 64 128 256 512 1024 2048 4096; do \
+		  echo "Running $${run} $${thr}"; \
+		  $(run-bench) $${thr} > $${run}-$${thr}.out; \
+		done;\
+	  fi;\
 	done
 
 # Build and execute the benchmark functions.  This target generates JSON
diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c
new file mode 100644
index 0000000000000000000000000000000000000000..151c38de50c5e747e05d69c717452241a47d7d22
--- /dev/null
+++ b/benchtests/bench-malloc-simple.c
@@ -0,0 +1,201 @@
+/* Benchmark malloc and free functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <malloc.h>
+#include <sys/resource.h>
+#include "bench-timing.h"
+#include "json-lib.h"
+
+#define NUM_ITERS 1000000
+#define NUM_ALLOCS 4
+#define MAX_ALLOCS 1000
+
+typedef struct
+{
+  size_t iters;
+  size_t size;
+  int n;
+  timing_t elapsed;
+} malloc_args;
+
+static void
+do_benchmark (malloc_args *args, int **arr)
+{
+  timing_t start, stop;
+  size_t iters = args->iters;
+  size_t size = args->size;
+  int n = args->n;
+
+  TIMING_NOW (start);
+
+  for (int j = 0; j < iters; j++)
+    {
+      for (int i = 0; i < n; i++)
+	arr[i] = malloc (size);
+
+      for (int i = 0; i < n; i++)
+	free (arr[i]);
+    }
+
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (args->elapsed, start, stop);
+}
+
+static malloc_args tests[4][NUM_ALLOCS];
+static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS };
+
+static void *
+thread_test (void *p)
+{
+  int **arr = (int**)p;
+
+  /* Run benchmark multi-threaded.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[2][i], arr);
+
+  return p;
+}
+
+void
+bench (unsigned long size)
+{
+  size_t iters = NUM_ITERS;
+  int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*));
+  unsigned long res;
+
+  TIMING_INIT (res);
+  (void) res;
+
+  /* Set tcache count to default.  */
+  mallopt (M_TCACHE_COUNT, -1);
+
+  for (int t = 0; t <= 3; t++)
+    for (int i = 0; i < NUM_ALLOCS; i++)
+      {
+	tests[t][i].n = allocs[i];
+	tests[t][i].size = size;
+	tests[t][i].iters = iters / allocs[i];
+
+	/* Do a quick warmup run.  */
+	if (t == 0)
+	  do_benchmark (&tests[0][i], arr);
+      }
+
+  /* Run benchmark single threaded in main_arena.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[0][i], arr);
+
+  /* Run benchmark in a thread_arena.  */
+  pthread_t t;
+  pthread_create (&t, NULL, thread_test, (void*)arr);
+  pthread_join (t, NULL);
+
+  /* Repeat benchmark in main_arena with SINGLE_THREAD_P == false.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[1][i], arr);
+
+  /* Increase size of tcache.  */
+  mallopt (M_TCACHE_COUNT, 100);
+
+  /* Run again but with larger tcache.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[3][i], arr);
+
+  mallopt (M_TCACHE_COUNT, -1);
+
+  free (arr);
+
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+
+  json_attr_object_begin (&json_ctx, "malloc");
+
+  char s[100];
+  double iters2 = iters;
+
+  json_attr_object_begin (&json_ctx, "");
+  json_attr_double (&json_ctx, "malloc_block_size", size);
+
+  struct rusage usage;
+  getrusage (RUSAGE_SELF, &usage);
+  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_st_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[0][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[1][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "big_tcache_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[3][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "thread_arena__allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[2][i].elapsed / iters2);
+    }
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_document_end (&json_ctx);
+}
+
+static void usage (const char *name)
+{
+  fprintf (stderr, "%s: <alloc_size>\n", name);
+  exit (1);
+}
+
+int
+main (int argc, char **argv)
+{
+  long val = 16;
+  if (argc == 2)
+    val = strtol (argv[1], NULL, 0);
+
+  if (argc > 2 || val <= 0)
+    usage (argv[0]);
+
+  bench (val);
+
+  return 0;
+}
diff --git a/malloc/malloc.h b/malloc/malloc.h
index 339ab64c7d336873211a9057a923d87e8c1e025d..a047385a4fc8d7b3bb3e120a94440193dba306ed 100644
--- a/malloc/malloc.h
+++ b/malloc/malloc.h
@@ -121,6 +121,7 @@ extern struct mallinfo mallinfo (void) __THROW;
 #define M_PERTURB           -6
 #define M_ARENA_TEST        -7
 #define M_ARENA_MAX         -8
+#define M_TCACHE_COUNT	    -9
 
 /* General SVID/XPG interface to tunable parameters. */
 extern int mallopt (int __param, int __val) __THROW;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 0c9e0748b4c10988f6fe99ac2e5b21b8b7b603c3..a07438d276ff4c8177552e1c0d186ee7c8bd7692 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -5177,6 +5177,11 @@ __libc_mallopt (int param_number, int value)
       if (value > 0)
 	do_set_arena_max (value);
       break;
+#if USE_TCACHE
+    case M_TCACHE_COUNT:
+      do_set_tcache_count (value >= 0 ? value : TCACHE_FILL_COUNT);
+      break;
+#endif
     }
   __libc_lock_unlock (av->mutex);
   return res;

    

Thread overview: 32+ messages
2017-12-01 13:51 [PATCH] Add malloc micro benchmark Wilco Dijkstra
2017-12-01 16:13 ` Carlos O'Donell
2017-12-18 15:18   ` Wilco Dijkstra
2017-12-18 16:32     ` Carlos O'Donell
2018-01-02 18:20       ` Wilco Dijkstra [this message]
2018-01-02 18:45         ` [PATCH v2] " DJ Delorie
2018-01-03 12:12           ` Wilco Dijkstra
2018-01-03 15:07             ` Carlos O'Donell
2018-01-04 13:48               ` Wilco Dijkstra
2018-01-04 16:37                 ` Adhemerval Zanella
2018-01-05 14:32                 ` Carlos O'Donell
2018-01-05 15:50                   ` Adhemerval Zanella
2018-01-05 16:17                     ` Carlos O'Donell
2018-01-05 16:46                       ` Adhemerval Zanella
2018-01-05 17:27                         ` Carlos O'Donell
2018-01-05 14:33         ` Carlos O'Donell
2018-01-05 16:28           ` Joseph Myers
2018-01-05 17:26             ` Carlos O'Donell
2018-02-28 12:40               ` Florian Weimer
2018-02-28 14:11                 ` Ondřej Bílka
2018-02-28 14:16                   ` Florian Weimer
2018-02-28 16:16                     ` Carlos O'Donell
2018-02-28 20:17                       ` Ondřej Bílka
2018-02-28 16:46                     ` Ondřej Bílka
2018-02-28 17:01                       ` Wilco Dijkstra
2018-02-28 18:21                         ` Carlos O'Donell
2018-02-28 19:56                         ` Ondřej Bílka
2018-02-28 21:56                           ` DJ Delorie
2018-03-01 11:24                             ` Ondřej Bílka
2017-12-18 23:02     ` [PATCH] " DJ Delorie
2017-12-28 14:09       ` Wilco Dijkstra
2017-12-28 19:01         ` DJ Delorie
