From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-4.2 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 4ED6F1F8C6 for ; Tue, 24 Aug 2021 19:32:56 +0000 (UTC) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 8BBDE3858421 for ; Tue, 24 Aug 2021 19:32:55 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8BBDE3858421 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1629833575; bh=tdAS+n5LfcRVzFmbjL7UsiWyqw6+8nrFTPl2GbSsyco=; h=References:In-Reply-To:Date:Subject:To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=kBaaKTj/HOcUmWD3lcjd949ruEpt7pUZFClvNej8ywgSw122zDW3BR9YfJE0em6cO 2wcULRXaqfm6gOBReyshhQIDkwmmIPdF1CFJ8Ax5crCBFa9GygasKxxFezvEr8IRZ6 azF2AEOb5vEflgkytCSUaHcUnVWlOIvNq+4KlDlA= Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by sourceware.org (Postfix) with ESMTPS id 3A59E3858401 for ; Tue, 24 Aug 2021 19:32:34 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3A59E3858401 Received: by mail-pl1-x62f.google.com with SMTP id w6so12870957plg.9 for ; Tue, 24 Aug 2021 12:32:34 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=wFuPLvoG+sGSeP26JGo5JCyipXTwaP8JiD4pd0zD0vk=; b=jfGZPLJ4dqmon3+qOW8JkzNBOf08xkSHvSH31m5biY4mvwzfpDIo04/sXstgWMKqiS VDkxlU7sFZUdq69LqR9YePmx1C3ye+/vzpZHUfo7FsR5zdx0QmE23nbJAY/E3WOq1iB1 APC71n2ltZvVpOU5wtKQrTJcgCqQ9Nf0jLiV9hynvVAepz+1axFiqGvVVs7MnGHG/NGA Hxe1Ai8IRSOwQ7cvTL82MYwwx3zUxiLr3YsfmpQF82havII9hk44YzOeHyst/8dkfz/O pKR0FZPUpz1++D+jjScl4UinVDftvzOVUlxHSTuFjtmWXCmpKkPFRiG0yrFvlkpfbGEF 4Hmg== X-Gm-Message-State: AOAM530dpuDorV6+PnhrIr2+GN/9mQ8R+Cp5DrBdQe+wQGlxUOxTSbIi Umz1bP+EuftxCkvSn/ex7Rq65H2hvHMehsWkvj8= X-Google-Smtp-Source: ABdhPJx6/lw6/E8UKpJiVX+C4EfXhshbIhAcsHYWzWACnoJkJ5Uc02AfEOlmBXiQoPaW8/dEETk8K9xBbXbKI+uFYH0= X-Received: by 2002:a17:902:9009:b0:12d:8de4:bc2d with SMTP id a9-20020a170902900900b0012d8de4bc2dmr34873023plp.44.1629833553369; Tue, 24 Aug 2021 12:32:33 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: Date: Tue, 24 Aug 2021 15:32:22 -0400 Message-ID: Subject: Re: [PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c To: Wilco Dijkstra Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Noah Goldstein via Libc-alpha Reply-To: Noah Goldstein Cc: GNU C Library Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org Sender: "Libc-alpha" On Tue, Aug 24, 2021 at 1:09 PM Wilco Dijkstra wrote: > Hi Noah, > > > This commit adds three new benchmarks for the SPEC2017 > > distribution. One randomized if dst > src and the other two set it > > either 1/0. > > What is the goal of this? Have you ever seen memcpy do something > radically differently depending on whether src < dst (assuming no > overlaps)? > Generally it's the same basic loop that gets used, so I don't see the > usefulness > of adding this. > Since on x86_64 at least we use memmove to implement memcpy there are a few points where the logic revolves around src < dst. Particularly in L(more_8x_vec) which branch on the condition w.o checking for overlap. I don't think its particularly fair for the random tests to essentially ignore that branch and imo a better reflection of real-world usage. An implementation could add some really branch heavy code in the case ignored by the current implementation that looked good on the fixed benchmarks but would perform worse in the randomized tests. For example in cases where src/dst 4k alias forward/backward looping can make a big difference. On fixed tests it may be worth it to add branches that check for that, but in a randomized environment where the branches can miss could say otherwise. > Some comments on the implementation details are below, there are > several issues that need to be resolved. > > > As well add some tests for fixed sizes with randomize alignment and > > value of dst > src. This can be useful for testing different alignment > > configurations. > > This tests fixed size copies from 1KB to 128KB. Large copies are typically > aligned, so I'm not sure it makes sense to use the random alignment > distribution for small copies here. > I generally agree. But on the flip side there is essentially no way to tiebreak different alignment methods. For example in x86_64 aligning to 1x vs 2x VECs will improve about 50% of the fixed tests and harm 50%. I think this can be useful for determining "which method is expected to yield better results". There are other more branch heavy possible alignment configurations, which I don't think fixed tests would be able to accurately benchmark as they would be 100% predicted. > > -#define MIN_PAGE_SIZE (512*1024+getpagesize()) > +#define MAX_TEST_SIZE (512*1024) > > This makes sense. Please also use it in the loops. Also it may be > worthwhile to increase this value a bit if larger sizes are what you > are after (and the minimum size of 4K could be increased a bit as > well as L1/L2 caches have increased a lot in recent years). > > +#define MIN_PAGE_SIZE (3*MAX_TEST_SIZE+3*getpagesize()) > > I don't get this at all. The goal is to copy within the MAX_TEST_SIZE > region, not to copy outside it. And why 3 times the size? > > typedef struct > { > - uint64_t src : 24; > - uint64_t dst : 24; > - uint64_t len : 16; > +/* 26 bits for src and dst so we have extra bit for alternating dst > > + src without a branch. */ > + uint64_t src : 26; > + uint64_t dst : 26; > + /* For size < 4096 12 bits is enough. */ > + uint64_t len : 12; > } copy_t; > > This doesn't make any sense, where do we need offsets larger than 2^24? > We need the extra bits so store max_size offsets to alternate direction. > > +static size_t > +init_copy(size_t max_size, int dst_gt_src) > +{ > + size_t i, dst_offset, src_offset; > + if (dst_gt_src <= 0) > + { > + dst_offset = 0; > + src_offset = MAX_TEST_SIZE + getpagesize(); > + } > + else > + { > + dst_offset = MAX_TEST_SIZE + getpagesize(); > + src_offset = 0; > + } > > This doesn't make sense since this is now copying outside of the region, > effectively enlarging the region but also always reading from one part and > writing to another. The data outside the region is uninitialized which > means > it is using the zero page which gives unrepresentative results. > I see. I didn't quite grasp the purpose of the max_size. (As seen by the fact that I renamed "length" to "max-alignment", now "region-size"). Changed it to use max_size instead of MAX_TEST_SIZE and initialize 3 * max_size. I think its better to expand region size to 3x max_size as opposed to 2x max_size and have noise for store-forwarding (on x86 at least) > > + /* Create a random set of copies with the given size and alignment > distributions. */ > for (i = 0; i < MAX_COPIES; i++) > { > + dst_offset = dst_gt_src == -1 > + ? (rand() & 1) ? MAX_TEST_SIZE + getpagesize() : 0 > + : dst_offset; > > Same here. If you want to force src > dst then just skip cases where > src < dst. Similarly for avoiding accidental overlaps (which are likely > to happen for small values of max_size). > I don't quite understand what you mean? I change it to be 2 * max_size in v2 and src_offset always has a value of max_size if src > dst or randomized. This should prevent overlaps as essentially src has its own max_size buffer and dst has two possible max_size buffers, one below src and one above. The current implementation was extremely buggy. Sorry! > Given the much larger sizes used for some tests, it is also worth > ensuring the copy stays within the region. > Fixed in v2 > > copy[i].dst = (rand () & (max_size - 1)); > copy[i].dst &= ~dst_align_arr[rand () & ALIGN_MASK]; > + copy[i].dst += dst_offset; > copy[i].src = (rand () & (max_size - 1)); > copy[i].src &= ~src_align_arr[rand () & ALIGN_MASK]; > + copy[i].src += src_offset; > copy[i].len = size_arr[rand () & SIZE_MASK]; > } > + return i; > +} > > +static void > +do_test (json_ctx_t *json_ctx, size_t max_size, int dst_gt_src) > +{ > + size_t n; > + memset (buf1, 1, max_size); > > Note this doesn't initialize the extra data in the buffer which creates > odd performance behaviour due to the zero page. > Fixed in v2. > > + n = init_copy(max_size, dst_gt_src); > json_element_object_begin (json_ctx); > - json_attr_uint (json_ctx, "length", (double) max_size); > + json_attr_uint (json_ctx, "max-alignment", (double) max_size); > > max-alignment, where does that come from??? > > I misunderstood the purpose of "length" which is essentially total region size. Imo "length" in context of memcpy generally refers to copy size. Renamed "region-size" and set to full region size used which previously was 2 * max_size, now 3 * max_size. > json_array_begin (&json_ctx, "results"); > for (int i = 4; i <= 512; i = i * 2) > - do_test (&json_ctx, i * 1024); > + { > + if (i * 1024 > MAX_TEST_SIZE) > + continue; > > It's much easier to change the for loop to i = 4096 to MAX_TEST_SIZE > and remove the various i * 1024. Then changing MAX_TEST_SIZE just works. > Fixed in v2. > > Cheers, > Wilco