Message-ID: <218bd8f1-d382-4024-a90f-59b5fef5184a@efficios.com>
Date: Wed, 20 Mar 2024 12:26:47 -0400
From: Mathieu Desnoyers
To: "carlos@redhat.com", DJ Delorie, Florian Weimer
Cc: Olivier Dion, Michael Jeanson, libc-alpha, paulmck, Peter Zijlstra, Boqun Feng, linux-kernel, Linus Torvalds, Dennis Zhou, Tejun Heo, Christoph Lameter, linux-mm
Subject: [RFC] A new per-cpu memory allocator for userspace in librseq
List-Id: Libc-alpha mailing list

Hi!

When looking at what is missing to make librseq a generally usable project to support per-cpu data structures in user-space, I noticed that what we miss is a per-cpu memory allocator conceptually similar to what the Linux kernel internally provides [1].

The per-CPU memory allocator is analogous to TLS (Thread-Local Storage) memory: where TLS provides thread-local storage, the per-CPU memory allocator provides CPU-local storage.

My goal is to improve locality and remove the need to waste precious cache lines with padding when indexing per-cpu data as an array of items.

So we decided to go ahead and implement a per-cpu allocator for userspace in the librseq project [2,3] with the following characteristics:

* Allocations are performed in memory pools (mempool). Item sizes are fixed, power-of-2 sizes configured at pool creation.

* Memory pools can be added to a pool set to allow allocation of variable size records.

* Allocating "items" from a memory pool allocates memory for all CPUs.

* The "stride" to index per-cpu data is user-configurable. Indexing per-cpu data from an allocated pointer is as simple as:

      (uintptr_t) ptr + (cpu * stride)

  Where the multiplication is actually a shift because stride is a power of 2 constant.

* Pools consist of a linked list of "ranges" (a stride worth of item allocation), thus making the pool extensible when running out of space, up to a user-configurable limit.

* Freeing a pointer only requires the pointer to free as input (and the pool stride constant). Finding the range and pool associated with the pointer is done by applying a mask to the pointer. The memory mappings of the ranges are aligned to make this mask find the range base, and thus allow accessing the range structure placed in a header page immediately before.

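To make the pointer arithmetic in the list above concrete, here is a minimal sketch of the indexing and masking steps. It is not the librseq implementation: the `example_*` names, the 64 kB stride, and the assumption that each range is aligned on the stride and that the header sits exactly `header_len` bytes before the range base are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stride for this sketch; the real stride is user-configurable. */
#define EXAMPLE_STRIDE_SHIFT	16
#define EXAMPLE_STRIDE		((uintptr_t)1 << EXAMPLE_STRIDE_SHIFT)

/* The shift and mask tricks below only work for a power-of-2 stride. */
static_assert((EXAMPLE_STRIDE & (EXAMPLE_STRIDE - 1)) == 0,
	      "stride must be a power of 2");

/* Address of the copy of an item owned by `cpu`: ptr + (cpu * stride). */
static inline void *example_percpu_ptr(void *ptr, int cpu)
{
	return (void *)((uintptr_t)ptr + ((uintptr_t)cpu << EXAMPLE_STRIDE_SHIFT));
}

/*
 * Base address of the range containing `ptr`, assuming (for this sketch)
 * that ranges are aligned on the stride and `ptr` lies in the first stride.
 */
static inline uintptr_t example_range_base(const void *ptr)
{
	return (uintptr_t)ptr & ~(EXAMPLE_STRIDE - 1);
}

/*
 * Address of the range header, assuming it occupies `header_len` bytes
 * immediately before the range base.
 */
static inline uintptr_t example_range_header(const void *ptr, size_t header_len)
{
	return example_range_base(ptr) - header_len;
}
```

With these helpers, freeing an item needs nothing but the pointer itself: the mask recovers the range base, and the header in front of it can then lead back to the owning pool.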
One interesting problem we faced is how to avoid wasting memory on useless pages in a system where there are lots of configured CPUs, but very few are actually used by the application due to a combination of cpu affinity, cpusets, and cpu hotplug. Minimizing the amount of page allocation while offering the ability to allocate zeroed (or pre-initialized) items is the crux of this issue.

We thus came up with two approaches based on copy-on-write (COW) to tackle this, which we call the "pool populate policy":

* RSEQ_MEMPOOL_POPULATE_COW_INIT (default):

  Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu pages from the initial values pages on first write.

  The COW_INIT approach maps an extra "initial values" stride with each pool range as MAP_SHARED from a memfd. All per-cpu strides map these initial values as MAP_PRIVATE, so the first write access from an active CPU will trigger a COW page allocation.

  The downside of this scheme is that its use of MAP_SHARED is not compatible with using the pool from children processes after fork, and its use of COW is not compatible with shared memory use-cases.

* RSEQ_MEMPOOL_POPULATE_COW_ZERO:

  Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu pages from the zero page on first write. As long as the user only uses malloc, zmalloc, or malloc_init with zeroed content to allocate items, it does not trigger COW of all per-cpu pages, leaving in place the zero page until an active CPU writes to its per-cpu item.

  The COW_ZERO approach maps the per-cpu strides as private anonymous memory, and therefore only triggers COW page allocation when a CPU writes over those zero pages.

  As a downside, this scheme will trigger COW page allocation for all possible CPUs when using malloc_init() to populate non-zeroed initial values for an item. Its upsides are that this scheme can be used across fork and eventually can be used over shared memory.

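The mapping trick behind COW_INIT can be sketched with plain mmap() calls. The following is a simplified, hypothetical demo rather than the librseq code: it uses a single page instead of a full stride, lets the kernel pick separate addresses for each per-cpu view instead of laying the strides out contiguously, and all `example_*` / `EXAMPLE_*` names are made up for this illustration. It assumes Linux with memfd_create() available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define EXAMPLE_NR_CPUS	4	/* hypothetical CPU count for the demo */

/*
 * Map one shared "initial values" page from a memfd plus one private view
 * of the same page per CPU, then write from `writer_cpu` only.
 * Returns 0 if only the writer's view changed, -1 on error or mismatch.
 */
static int example_cow_init_demo(int writer_cpu)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	uint64_t *init, *percpu[EXAMPLE_NR_CPUS];
	int fd, cpu, nr_mapped = 0, ret = -1;

	assert(writer_cpu >= 0 && writer_cpu < EXAMPLE_NR_CPUS);

	fd = memfd_create("example-init-values", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto close_fd;

	/* Shared mapping: stores land in the memfd's own pages. */
	init = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (init == MAP_FAILED)
		goto close_fd;
	init[0] = 42;	/* the pre-initialized item content */

	/* Private views: reads hit the memfd pages until the first write. */
	for (cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		percpu[cpu] = mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE, fd, 0);
		if (percpu[cpu] == MAP_FAILED)
			goto unmap;
		nr_mapped++;
	}

	/* The first write from one CPU copies that CPU's page only. */
	percpu[writer_cpu][0] = 1000;

	ret = 0;
	for (cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		uint64_t expect = (cpu == writer_cpu) ? 1000 : 42;

		if (percpu[cpu][0] != expect)
			ret = -1;
	}
	if (init[0] != 42)	/* initial values must stay untouched */
		ret = -1;

unmap:
	while (nr_mapped > 0)
		munmap(percpu[--nr_mapped], len);
	munmap(init, len);
close_fd:
	close(fd);
	return ret;
}
```

Calling example_cow_init_demo(1), for instance, writes through the view of "CPU 1" only; the other views keep reading the initial value from the shared memfd page, which is why inactive CPUs cost no extra page allocation under this policy.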
Another noteworthy feature is that this mempool allocator can be used as a global allocator as well. It also has an optional "robust" attribute which enables checks for memory corruption and double-free.

Users with more custom use-cases can register an "init" callback to be called after each new range/cpu is allocated.

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/percpu.h
[2] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/mempool.h
[3] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq-mempool.c

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com