From mboxrd@z Thu Jan 1 00:00:00 1970
From: Carlos O'Donell
Newsgroups: gmane.comp.lib.glibc.alpha
Subject: Re: [PATCH] Add malloc micro benchmark
Date: Mon, 18 Dec 2017 08:32:51 -0800
Message-ID: <6ad98d83-d49b-25a3-ef01-e93e18f4740b@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0
To: Wilco Dijkstra, libc-alpha@sourceware.org
Cc: nd

On 12/18/2017 07:18 AM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
>
> Thanks for the review!

Thank you for the detailed follow up.

>> This test is a long time coming and is a great idea.
>>
>> My "big" question here is: What are we trying to model?
>>
>> Do we want to prove that the single threaded optimizations you
>> added are helping a given size class of allocations?
>
> Yes that is the main goal of the benchmark. It models the allocation
> pattern of a few benchmarks which were reported as being slow
> despite the new tcache (which didn't show any gains).

OK.

> When the tcache was configured to be larger there was a major
> speedup, suggesting that the tcache doesn't work on patterns with
> a high number of (de)allocations of similar sized blocks. Since DJ
> didn't seem keen on increasing the tcache size despite it showing
> major gains across a wide range of benchmarks, I decided to fix
> the performance for the single-threaded case at least. It's now 2.5x
> faster on a few server benchmarks (of course the next question is
> whether tcache is actually useful in its current form).

If you have a pattern of malloc/free of *similar* sized blocks, then
it overflows the bin for that size in the tcache, with the other size
bins remaining empty. The cache does not dynamically reconfigure itself
to consume X MiB or Y % of RSS; instead it uses a simple data structure
that holds a fixed number of fixed size blocks.

Therefore I agree that enhancing the core data structure in tcache may
result in better overall performance, particularly if we got rid of the
fixed bin sizes and instead found a way to be performant *and* keep a
running total of consumption.

This is not a trivial goal though.

Likewise *all* of malloc needs to be moved to a better data structure
than just linked lists. I would like to see glibc's malloc offer a
caching footprint of no more than Y % of RSS available, and let the user
tweak that. Currently we just consume RSS without much regard for
overhead. Though this is a different case than what you are talking
about, the changes are related via data-structure enhancements that
would benefit both cases IMO.
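
To make the difference between "a fixed number of fixed size blocks" and
"a running total of consumption" concrete, here is a minimal toy model.
It is not glibc code: the class names, the per-bin limit of 7 blocks and
the 64 KiB budget are all assumptions picked for illustration, and a
"spill" just stands for a free that cannot stay in the cache.

```python
# Toy model only: not glibc code. BIN_COUNT_LIMIT and BYTE_BUDGET are
# illustrative values assumed for this sketch.
from collections import defaultdict

BIN_COUNT_LIMIT = 7      # hypothetical cap on cached blocks per size bin
BYTE_BUDGET = 64 * 1024  # hypothetical cap on total cached bytes


class FixedBinCache:
    """Each size bin keeps at most `limit` freed blocks."""

    def __init__(self, limit=BIN_COUNT_LIMIT):
        self.limit = limit
        self.bins = defaultdict(int)  # block size -> cached block count
        self.spilled = 0              # frees that did not fit in the cache

    def free(self, size):
        if self.bins[size] < self.limit:
            self.bins[size] += 1
        else:
            self.spilled += 1


class BudgetCache:
    """Any size is cached while the running byte total fits the budget."""

    def __init__(self, budget=BYTE_BUDGET):
        self.budget = budget
        self.cached_bytes = 0
        self.spilled = 0

    def free(self, size):
        if self.cached_bytes + size <= self.budget:
            self.cached_bytes += size
        else:
            self.spilled += 1


def replay(cache, frees):
    """Feed a sequence of freed block sizes to a cache, return its spills."""
    for size in frees:
        cache.free(size)
    return cache.spilled


# A burst of frees of one size: one bin overflows, the others stay empty.
burst = [256] * 100
fixed_spills = replay(FixedBinCache(), burst)
budget_spills = replay(BudgetCache(), burst)
print("fixed bins spilled:", fixed_spills)
print("byte budget spilled:", budget_spills)
```

With these assumed numbers the fixed-bin model keeps only 7 of the 100
blocks, while the budget model keeps all of them because 100 * 256 bytes
is well under the budget. The hard part, keeping that running total
without slowing down the fast path, is exactly what the toy ignores.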

>> You are currently modeling a workload that has increasing
>> memory size requests and in some ways this is an odd workload
>> that has high external fragmentation characteristics. For example
>> after allocating lots of 256 byte blocks we move on to 1024 byte
>> blocks, with the latter being unusable unless we coalesce.
>
> I'm assuming coalescing works as expected. If it doesn't, it would
> be a nasty bug.

You are probably right.

>> I *wish* we could test main_arena vs. threaded arena, since they
>> have different code and behave differently e.g. sbrk vs. mmap'd
>> heap.
>
> I'd have to check how easy it is to force it to use the thread arena.
> The whole thing is just crazily weird, with too many different code
> paths and possibilities. It seems much easier just to always use
> thread arenas, and perhaps use sbrk only if there is some serious
> advantage over mmap. Also it appears all the values are set to
> what was perhaps reasonable 10-20 years ago, not today. When
> a small server has 128GB, there is absolutely no reason to worry
> about returning 128KB to the OS as quickly as possible...

(a) Returning memory based on a limit of memory cached.

The decision to return memory to the operating system should be based
on a desire to run within the bounds of a certain amount of cached
memory in the user process.

This should be the goal IMO. We should not return 128KB to the OS unless
we are within our bounds of Y % of RSS cache, or X MiB of RSS cache.

This bounded behaviour is more and more important for (b).

So I argue that this has nothing to do with how much memory the server
has but how much the user wants as cache in the process. This gets back
to your point about tcache size needing to be bigger; if you had Y % RSS
allocated to tcache it would solve your needs.

(b) Packing density matters, or rather consistent RSS usage matters.

Yes, and no. We are facing a lot of downstream requests for container
and VM packing efficiency.
This means that your 128GB is split into 32 servers each with 4GB, or
64 servers each with 2GB running smaller services. In these cases we
*do* care a lot about packing density.

(c) Maintenance costs of the existing weird cases and harmonizing
    threaded and main_arena paths.

As I suggested in bug 15321:
https://sourceware.org/bugzilla/show_bug.cgi?id=15321

We need to merge the main_arena and threaded code together, and stop
treating them as different things. Right now the main_arena, if you
look at the code, is a *pretend* heap with a partial data structure
layered in place. This needs to go away. We need to treat all heaps
as identical, with identical code paths, with just different backing
storage.

I think people still expect that thread 0 allocates from the sbrk heap
in a single-threaded application, and we can do that by ensuring sbrk
is used to provide the backing store for the main thread. This way we
can jump the pointer 64MB like we normally do for mmap'd heaps, but
then on page touch there the kernel just extends the heap normally.

No difference (except VMA usage).

Once that is in place we can experiment with other strategies like
never using sbrk.

>> Implementation:
>>
>> You need to make this robust against env vars changing malloc
>> behaviour. You should use mallopt to change some parameters.
>
> You mean setting the tcache size explicitly (maybe even switching off)?

You have several options:

* Add a wrapper script that clears all mallopt related env vars.

* Adjust the Makefile to clear all mallopt related env vars before
  starting the test.

* Set tcache sizes explicitly *if* that is what you want, but likely
  you don't want this and want to run the test with just the defaults
  to see how the defaults are performing.

>>> Note something very bad happens for the larger allocations, there
>>> is a 25x slowdown from 25 to 400 allocations of 4KB blocks...
>>
>> Keep in mind you are testing the performance of sbrk here.
>> In a threaded arena, the non-main_arena mmap's a 64MiB heap (on
>> 64-bit) and then draws allocations from it. So in some ways
>> main_arena is expensive, but both have to pay a page-touch cost...
>>
>> For each 4KiB block you touch the block to write the co-located metadata
>> and that forces the kernel to give you a blank page, which you then do
>> nothing with. Then you repeat the above again.
>>
>> For all other sizes you amortize the cost of the new page among
>> several allocations.
>>
>> Do you have any other explanation?
>
> Well that looks like a reasonable explanation, but it shows a serious
> performance bug - I think we use MADV_DONTNEED which doesn't
> work on Linux and will cause all pages to be deallocated, reallocated
> and zero-filled... This is the sort of case where you need to be very
> careful to amortize over many allocations or long elapsed time, if at
> all (many other allocators never give pages back).

We need to move to MADV_FREE, which was designed for memory allocators.

The semantics of MADV_DONTNEED have the problem that one has to consider:
* Is the data destructively lost in that page?
* Is the data flushed to the underlying store before being not-needed?
All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure
that users don't corrupt the data in their backing stores.

I think that detecting and using MADV_FREE would help performance, but
only on Linux 4.5 and later, and that might be OK for you.

>> At some point you will hit the mmap threshold and the cost of the
>> allocation will skyrocket as you have to call mmap.
>
> That only happens on huge allocations (much larger than 4KB), or when
> you run out of sbrk space (unlikely).

It happens at the mmap threshold, which is variable :-)

Please consider the implementation as a fluid set of parameters that
model application behaviour.

We can run out of sbrk space *immediately* if you have an interposing
low-address mmap that means sbrk can't grow (again see swbz#15321).

Right now the mmap threshold is 128KiB though, so you're right, for the
default. I don't know if that size is a good idea or not.

>> In glibc we have:
>>
>> tcache -> fastbins -> smallbins -> largebins -> unordered -> mmap
>>
>> If you proceed through from small allocations to larger allocations
>> you will create chunks that cannot be used by future allocations.
>> In many cases this is a worst case performance bottleneck. The
>> heap will contain many 256 byte allocations but these cannot service
>> the 1024 bytes, that is unless consolidation has been run. So this
>> tests the consolidation as much as anything else, which might not
>> trigger because of the free thresholds required.
>
> If consolidation doesn't work that's a serious bug. However allocation
> performance should not be affected either way - in a real application
> those small blocks might still be allocated. As long as consolidation
> runs quickly (generally it's a small percentage in profiles), it won't
> affect the results.

OK.

>> So what are we trying to model here?
>>
>> If we want to look at the cost of independent size class allocations
>> then we need a clean process and allocate only a given size, and look
>> at performance across the number of allocations.
>
> That's certainly feasible if we keep the number of sizes small (less
> than the list below). It should be easy to reuse the bench-malloc-thread.c
> makefile magic to run the same binary with multiple sizes.

OK.

>> I would also have much finer grained allocations by powers of 2.
>> 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 etc. You want
>> to see what happens for the allocations which are:
> ..
>> Would this serve better to show that your single threaded malloc
>> changes were helpful for a given size class?
>
> Well I can easily add some of the above sizes, it's highly configurable.
> I don't think there will be much difference with the existing sizes though.

Perhaps, but I don't know the answer to that.
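
The "clean process per size class" idea, combined with clearing the
malloc-related environment, could be sketched as a small driver like the
one below. This is only a sketch under assumptions: the helper names are
made up, the benchmark is assumed to take the block size as its last
argument, and dropping every MALLOC_* variable plus the whole of
GLIBC_TUNABLES is deliberately coarse (it also throws away non-malloc
tunables).

```python
import subprocess


def scrubbed_env(env):
    """Return a copy of env without variables that tune glibc malloc.

    Coarse on purpose: every MALLOC_* name (for example
    MALLOC_MMAP_THRESHOLD_ or MALLOC_ARENA_MAX) and GLIBC_TUNABLES go.
    """
    return {
        name: value
        for name, value in env.items()
        if not (name.startswith("MALLOC_") or name == "GLIBC_TUNABLES")
    }


def size_classes(max_size=4096):
    """Powers of two from 2 up to max_size inclusive."""
    sizes = []
    size = 2
    while size <= max_size:
        sizes.append(size)
        size *= 2
    return sizes


def run_per_size(benchmark_argv, sizes, env):
    """Start one fresh process per size so no heap state is shared.

    benchmark_argv is a hypothetical command line; the block size is
    appended as its last argument.
    """
    clean = scrubbed_env(env)
    return [
        subprocess.run(benchmark_argv + [str(size)], env=clean, check=True)
        for size in sizes
    ]


# Show what the driver would iterate over and what it would strip.
sample_env = {
    "PATH": "/usr/bin",
    "MALLOC_MMAP_THRESHOLD_": "1024",
    "GLIBC_TUNABLES": "glibc.malloc.tcache_count=0",
}
print(size_classes())
print(scrubbed_env(sample_env))
```

A real run would look something like
run_per_size(["./bench-malloc-simple"], size_classes(), dict(os.environ))
after an "import os", where the binary name is a placeholder for
whatever the benchmark ends up being called.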

>> You need to use mallopt to make sure the user's environment
>> did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your
>> maximum allocation size.
>
> I don't think that is possible given the largest allocation size is 4KB.

If the threshold is set below the request size, we carry out the
allocation with mmap regardless, rounding up the size to that of a page.

-- 
Cheers,
Carlos.