From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-bounces+e=80x24.org@sourceware.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS3215 2.6.0.0/16
X-Spam-Status: No, score=-4.2 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham
	autolearn_force=no version=3.4.2
Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id 4ED6F1F8C6
	for <e@80x24.org>; Tue, 24 Aug 2021 19:32:56 +0000 (UTC)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 8BBDE3858421
	for <e@80x24.org>; Tue, 24 Aug 2021 19:32:55 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8BBDE3858421
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1629833575;
	bh=tdAS+n5LfcRVzFmbjL7UsiWyqw6+8nrFTPl2GbSsyco=;
	h=References:In-Reply-To:Date:Subject:To:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=kBaaKTj/HOcUmWD3lcjd949ruEpt7pUZFClvNej8ywgSw122zDW3BR9YfJE0em6cO
	 2wcULRXaqfm6gOBReyshhQIDkwmmIPdF1CFJ8Ax5crCBFa9GygasKxxFezvEr8IRZ6
	 azF2AEOb5vEflgkytCSUaHcUnVWlOIvNq+4KlDlA=
Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com
 [IPv6:2607:f8b0:4864:20::62f])
 by sourceware.org (Postfix) with ESMTPS id 3A59E3858401
 for <libc-alpha@sourceware.org>; Tue, 24 Aug 2021 19:32:34 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3A59E3858401
Received: by mail-pl1-x62f.google.com with SMTP id w6so12870957plg.9
 for <libc-alpha@sourceware.org>; Tue, 24 Aug 2021 12:32:34 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=wFuPLvoG+sGSeP26JGo5JCyipXTwaP8JiD4pd0zD0vk=;
 b=jfGZPLJ4dqmon3+qOW8JkzNBOf08xkSHvSH31m5biY4mvwzfpDIo04/sXstgWMKqiS
 VDkxlU7sFZUdq69LqR9YePmx1C3ye+/vzpZHUfo7FsR5zdx0QmE23nbJAY/E3WOq1iB1
 APC71n2ltZvVpOU5wtKQrTJcgCqQ9Nf0jLiV9hynvVAepz+1axFiqGvVVs7MnGHG/NGA
 Hxe1Ai8IRSOwQ7cvTL82MYwwx3zUxiLr3YsfmpQF82havII9hk44YzOeHyst/8dkfz/O
 pKR0FZPUpz1++D+jjScl4UinVDftvzOVUlxHSTuFjtmWXCmpKkPFRiG0yrFvlkpfbGEF
 4Hmg==
X-Gm-Message-State: AOAM530dpuDorV6+PnhrIr2+GN/9mQ8R+Cp5DrBdQe+wQGlxUOxTSbIi
 Umz1bP+EuftxCkvSn/ex7Rq65H2hvHMehsWkvj8=
X-Google-Smtp-Source: ABdhPJx6/lw6/E8UKpJiVX+C4EfXhshbIhAcsHYWzWACnoJkJ5Uc02AfEOlmBXiQoPaW8/dEETk8K9xBbXbKI+uFYH0=
X-Received: by 2002:a17:902:9009:b0:12d:8de4:bc2d with SMTP id
 a9-20020a170902900900b0012d8de4bc2dmr34873023plp.44.1629833553369; Tue, 24
 Aug 2021 12:32:33 -0700 (PDT)
MIME-Version: 1.0
References: <VE1PR08MB5599DE2F8A6B91510CB3EFCD83C59@VE1PR08MB5599.eurprd08.prod.outlook.com>
In-Reply-To: <VE1PR08MB5599DE2F8A6B91510CB3EFCD83C59@VE1PR08MB5599.eurprd08.prod.outlook.com>
Date: Tue, 24 Aug 2021 15:32:22 -0400
Message-ID: <CAFUsyfK0EPP5q7zA84K7NrDbsM9NY1tSo_+QKNK=C_cZWA=KHA@mail.gmail.com>
Subject: Re: [PATCH 2/5] benchtests: Add new random cases to
 bench-memcpy-random.c
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Content-Type: text/plain; charset="UTF-8"
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
From: Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Noah Goldstein <goldstein.w.n@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces+e=80x24.org@sourceware.org>

On Tue, Aug 24, 2021 at 1:09 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Noah,
>
> > This commit adds three new benchmarks for the SPEC2017
> > distribution. One randomized if dst > src and the other two set it
> > either 1/0.
>
> What is the goal of this? Have you ever seen memcpy do something
> radically differently depending on whether src < dst (assuming no
> overlaps)?
> Generally it's the same basic loop that gets used, so I don't see the
> usefulness
> of adding this.
>

Since on x86_64 at least we use memmove to implement memcpy there are a
few points where the logic revolves around src < dst. Particularly in
L(more_8x_vec)
which branch on the condition w.o checking for overlap. I don't think its
particularly
fair for the random tests to essentially ignore that branch and imo a
better reflection
of real-world usage. An implementation could add some really branch heavy
code
in the case ignored by the current implementation that looked good on the
fixed
benchmarks but would perform worse in the randomized tests.

For example in cases where src/dst 4k alias forward/backward looping can
make a big difference.
On fixed tests it may be worth it to add branches that check for that, but
in a randomized
environment where the branches can miss could say otherwise.


> Some comments on the implementation details are below, there are
> several issues that need to be resolved.
>
> > As well add some tests for fixed sizes with randomize alignment and
> > value of dst > src. This can be useful for testing different alignment
> > configurations.
>
> This tests fixed size copies from 1KB to 128KB. Large copies are typically
> aligned, so I'm not sure it makes sense to use the random alignment
> distribution for small copies here.
>

I generally agree. But on the flip side there is essentially no way to
tiebreak different
alignment methods. For example in x86_64 aligning to 1x vs 2x VECs will
improve about
50% of the fixed tests and harm 50%. I think this can be useful for
determining "which method
is expected to yield better results". There are other more branch heavy
possible alignment
configurations, which I don't think fixed tests would be able to accurately
benchmark as they
would be 100% predicted.


>
> -#define MIN_PAGE_SIZE (512*1024+getpagesize())
> +#define MAX_TEST_SIZE (512*1024)
>
> This makes sense. Please also use it in the loops. Also it may be
> worthwhile to increase this value a bit if larger sizes are what you
> are after (and the minimum size of 4K could be increased a bit as
> well as L1/L2 caches have increased a lot in recent years).
>


> +#define MIN_PAGE_SIZE (3*MAX_TEST_SIZE+3*getpagesize())
>
> I don't get this at all. The goal is to copy within the MAX_TEST_SIZE
> region, not to copy outside it. And why 3 times the size?
>
>  typedef struct
>  {
> -  uint64_t src : 24;
> -  uint64_t dst : 24;
> -  uint64_t len : 16;
> +/* 26 bits for src and dst so we have extra bit for alternating dst >
> +   src without a branch.  */
> +  uint64_t src : 26;
> +  uint64_t dst : 26;
> +  /* For size < 4096 12 bits is enough.  */
> +  uint64_t len : 12;
>  } copy_t;
>
> This doesn't make any sense, where do we need offsets larger than 2^24?
>
We need the extra bits so store max_size offsets to alternate direction.


>
> +static size_t
> +init_copy(size_t max_size, int dst_gt_src)
> +{
> +  size_t i, dst_offset, src_offset;
> +  if (dst_gt_src <= 0)
> +    {
> +      dst_offset = 0;
> +      src_offset = MAX_TEST_SIZE + getpagesize();
> +    }
> +  else
> +    {
> +      dst_offset = MAX_TEST_SIZE + getpagesize();
> +      src_offset = 0;
> +    }
>
> This doesn't make sense since this is now copying outside of the region,
> effectively enlarging the region but also always reading from one part and
> writing to another. The data outside the region is uninitialized which
> means
> it is using the zero page which gives unrepresentative results.
>

I see. I didn't quite grasp the purpose of the max_size. (As seen by the
fact that I renamed "length" to "max-alignment", now "region-size").

Changed it to use max_size instead of MAX_TEST_SIZE and initialize
3 * max_size.

I think its better to expand region size to 3x max_size as opposed to 2x
max_size and have noise for store-forwarding (on x86 at least)

>
> +    /* Create a random set of copies with the given size and alignment
>       distributions.  */
>    for (i = 0; i < MAX_COPIES; i++)
>      {
> +      dst_offset  = dst_gt_src == -1
> +                        ? (rand() & 1) ? MAX_TEST_SIZE + getpagesize() : 0
> +                        : dst_offset;
>
> Same here. If you want to force src > dst then just skip cases where
> src < dst. Similarly for avoiding accidental overlaps (which are likely
> to happen for small values of max_size).
>

I don't quite understand what you mean? I change it to be
2 * max_size in v2 and src_offset always has a value of
max_size if src > dst or randomized. This should prevent
overlaps as essentially src has its own max_size buffer
and dst has two possible max_size buffers, one below src
and one above.

The current implementation was extremely buggy. Sorry!


> Given the much larger sizes used for some tests, it is also worth
> ensuring the copy stays within the region.
>

Fixed in v2


>
>        copy[i].dst = (rand () & (max_size - 1));
>        copy[i].dst &= ~dst_align_arr[rand () & ALIGN_MASK];
> +      copy[i].dst += dst_offset;
>        copy[i].src = (rand () & (max_size - 1));
>        copy[i].src &= ~src_align_arr[rand () & ALIGN_MASK];
> +      copy[i].src += src_offset;
>        copy[i].len = size_arr[rand () & SIZE_MASK];
>      }
> +  return i;
> +}
>
> +static void
> +do_test (json_ctx_t *json_ctx, size_t max_size, int dst_gt_src)
> +{
> +  size_t n;
> +  memset (buf1, 1, max_size);
>
> Note this doesn't initialize the extra data in the buffer which creates
> odd performance behaviour due to the zero page.
>

Fixed in v2.

>
> +  n = init_copy(max_size, dst_gt_src);
>    json_element_object_begin (json_ctx);
> -  json_attr_uint (json_ctx, "length", (double) max_size);
> +  json_attr_uint (json_ctx, "max-alignment", (double) max_size);
>
> max-alignment, where does that come from???
>
> I misunderstood the purpose of "length" which is essentially total
region size. Imo "length" in context of memcpy generally refers to
copy size. Renamed "region-size" and set to full region size used
which previously was 2 * max_size, now 3 * max_size.


>    json_array_begin (&json_ctx, "results");
>    for (int i = 4; i <= 512; i = i * 2)
> -    do_test (&json_ctx, i * 1024);
> +    {
> +      if (i * 1024 > MAX_TEST_SIZE)
> +          continue;
>
> It's much easier to change the for loop to i = 4096 to MAX_TEST_SIZE
> and remove the various i * 1024. Then changing MAX_TEST_SIZE just works.
>
Fixed in v2.


>
> Cheers,
> Wilco