From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: Carlos O'Donell <carlos@redhat.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Cc: nd <nd@arm.com>
Subject: Re: [PATCH] Add malloc micro benchmark
Date: Mon, 18 Dec 2017 15:18:47 +0000	[thread overview]
Message-ID: <DB6PR0801MB20538A414F1659748E703918833C0@DB6PR0801MB2053.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <bcc3af78-c1ac-2d24-6a6a-c14d6ca7dc96@redhat.com>

Carlos O'Donell wrote:

Thanks for the review!

> This test is a long time coming and is a great idea.
>
> My "big" question here is: What are we trying to model?
>
> Do we want to prove that the single threaded optimizations you
> added are helping a given size class of allocations?

Yes, that is the main goal of the benchmark. It models the allocation
pattern of a few benchmarks which were reported as being slow
despite the new tcache (which didn't show any gains).

When the tcache was configured to be larger there was a major
speedup, suggesting that the tcache doesn't work on patterns with
a high number of (de)allocations of similar sized blocks. Since DJ
didn't seem keen on increasing the tcache size despite it showing
major gains across a wide range of benchmarks, I decided to fix
the performance for the single-threaded case at least. It's now 2.5x
faster on a few server benchmarks (of course the next question is
whether tcache is actually useful in its current form).
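For reference, the tcache depth can be raised without rebuilding glibc via
tunables (tunable name as in the glibc manual; the benchmark binary name
here is just a placeholder):

```shell
# Raise the per-bin tcache depth from the default of 7 entries
# (glibc.malloc.tcache_count, available since glibc 2.26):
GLIBC_TUNABLES=glibc.malloc.tcache_count=200 ./bench-malloc-simple
```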

> You are currently modeling a workload that has increasing
> memory size requests and in some ways this is an odd workload
> that has high external fragmentation characteristics. For example
> after allocating lots of 256 byte blocks we move on to 1024 byte
> blocks, with the latter being unusable unless we coalesce.

I'm assuming coalescing works as expected. If it doesn't, it would
be a nasty bug.

> I *wish* we could test main_arena vs. threaded arena, since they
> have different code and behave differently e.g. sbrk vs. mmap'd
> heap.

I'd have to check how easy it is to force it to use the thread arena.
The whole thing is just crazily weird, with too many different code
paths and possibilities. It seems much easier just to always use
thread arenas, and perhaps use sbrk only if there is some serious
advantage over mmap. Also it appears all the values are set to
what was perhaps reasonable 10-20 years ago, not today. When
a small server has 128GB, there is absolutely no reason to worry
about returning 128KB to the OS as quickly as possible...

> Implementation:
>
> You need to make this robust against env vars changing malloc
> behaviour. You should use mallopt to change some parameters.

You mean setting the tcache size explicitly (maybe even switching it off)?

>> Note something very bad happens for the larger allocations, there
>> is a 25x slowdown from 25 to 400 allocations of 4KB blocks...
>
> Keep in mind you are testing the performance of sbrk here. In a threaded
> arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then
> draws allocations from it. So in some ways main_arena is expensive,
> but both have to pay a page-touch cost...
>
> For each 4KiB block you touch the block to write the co-located metadata
> and that forces the kernel to give you a blank page, which you then do 
> nothing with. Then you repeat the above again.
>
> For all other sizes you amortize the cost of the new page among
> several allocations.
> 
> Do you have any other explanation?

Well that looks like a reasonable explanation, but it shows a serious
performance bug - I think we use MADV_DONTNEED, which on Linux
discards the pages immediately, so they all get deallocated and must
be refaulted and zero-filled on the next access... This is the sort of
case where you need to be very careful to amortize over many allocations
or a long elapsed time, if at all (many other allocators never give
pages back).

> At some point you will hit the mmap threshold and the cost of the
> allocation will skyrocket as you have to call mmap.

That only happens on huge allocations (much larger than 4KB), or when
you run out of sbrk space (unlikely).

> In glibc we have:
>
> tcache -> fastbins -> smallbins -> largebins -> unsorted -> mmap
>
> If you proceed through from small allocations to larger allocations
> you will create chunks that cannot be used by future allocations.
> In many cases this is a worst case performance bottleneck. The
> heap will contain many 256 byte allocations but these cannot service
> the 1024 bytes, that is unless consolidation has been run. So this
> tests the consolidation as much as anything else, which might not
> trigger because of the free thresholds required.

If consolidation doesn't work, that's a serious bug. However, allocation
performance should not be affected either way - in a real application
those small blocks might still be allocated. As long as consolidation
runs quickly (generally it's a small percentage in profiles), it won't
affect the results.

> So what are we trying to model here?
>
> If we want to look at the cost of independent size class allocations
> then we need a clean process and allocate only a given size, and look
> at performance across the number of allocations.

That's certainly feasible if we keep the number of sizes small (less
than the list below). It should be easy to reuse the bench-malloc-thread.c
makefile magic to run the same binary with multiple sizes.

> I would also have much finer grained allocations by powers of 2.
> 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 etc. You want
> to see what happens for the allocations which are:
..
> Would this serve better to show that your single threaded malloc
> changes were helpful for a given size class?

Well I can easily add some of the above sizes, it's highly configurable.
I don't think there will be much difference with the existing sizes though.

> You need to use mallopt to make sure the user's environment
> did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your
> maximum allocation size.

I don't think that is possible given the largest allocation size is 4KB.

Wilco
    
