Re: [PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c
Date: Tue, 24 Aug 2021 15:32:22 -0400	[thread overview]
Message-ID: <CAFUsyfK0EPP5q7zA84K7NrDbsM9NY1tSo_+QKNK=C_cZWA=KHA@mail.gmail.com> (raw)
In-Reply-To: <VE1PR08MB5599DE2F8A6B91510CB3EFCD83C59@VE1PR08MB5599.eurprd08.prod.outlook.com>

On Tue, Aug 24, 2021 at 1:09 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Noah,
>
> > This commit adds three new benchmarks for the SPEC2017
> > distribution. One randomized if dst > src and the other two set it
> > either 1/0.
>
> What is the goal of this? Have you ever seen memcpy do something
> radically differently depending on whether src < dst (assuming no
> overlaps)?
> Generally it's the same basic loop that gets used, so I don't see the
> usefulness
> of adding this.
>

Since on x86_64 at least we use memmove to implement memcpy there are a
few points where the logic revolves around src < dst. Particularly in
L(more_8x_vec)
which branch on the condition w.o checking for overlap. I don't think its
particularly
fair for the random tests to essentially ignore that branch and imo a
better reflection
of real-world usage. An implementation could add some really branch heavy
code
in the case ignored by the current implementation that looked good on the
fixed
benchmarks but would perform worse in the randomized tests.

For example in cases where src/dst 4k alias forward/backward looping can
make a big difference.
On fixed tests it may be worth it to add branches that check for that, but
in a randomized
environment where the branches can miss could say otherwise.


> Some comments on the implementation details are below, there are
> several issues that need to be resolved.
>
> > As well add some tests for fixed sizes with randomize alignment and
> > value of dst > src. This can be useful for testing different alignment
> > configurations.
>
> This tests fixed size copies from 1KB to 128KB. Large copies are typically
> aligned, so I'm not sure it makes sense to use the random alignment
> distribution for small copies here.
>

I generally agree. But on the flip side there is essentially no way to
tiebreak different
alignment methods. For example in x86_64 aligning to 1x vs 2x VECs will
improve about
50% of the fixed tests and harm 50%. I think this can be useful for
determining "which method
is expected to yield better results". There are other more branch heavy
possible alignment
configurations, which I don't think fixed tests would be able to accurately
benchmark as they
would be 100% predicted.


>
> -#define MIN_PAGE_SIZE (512*1024+getpagesize())
> +#define MAX_TEST_SIZE (512*1024)
>
> This makes sense. Please also use it in the loops. Also it may be
> worthwhile to increase this value a bit if larger sizes are what you
> are after (and the minimum size of 4K could be increased a bit as
> well as L1/L2 caches have increased a lot in recent years).
>


> +#define MIN_PAGE_SIZE (3*MAX_TEST_SIZE+3*getpagesize())
>
> I don't get this at all. The goal is to copy within the MAX_TEST_SIZE
> region, not to copy outside it. And why 3 times the size?
>
>  typedef struct
>  {
> -  uint64_t src : 24;
> -  uint64_t dst : 24;
> -  uint64_t len : 16;
> +/* 26 bits for src and dst so we have extra bit for alternating dst >
> +   src without a branch.  */
> +  uint64_t src : 26;
> +  uint64_t dst : 26;
> +  /* For size < 4096 12 bits is enough.  */
> +  uint64_t len : 12;
>  } copy_t;
>
> This doesn't make any sense, where do we need offsets larger than 2^24?
>
We need the extra bits so store max_size offsets to alternate direction.


>
> +static size_t
> +init_copy(size_t max_size, int dst_gt_src)
> +{
> +  size_t i, dst_offset, src_offset;
> +  if (dst_gt_src <= 0)
> +    {
> +      dst_offset = 0;
> +      src_offset = MAX_TEST_SIZE + getpagesize();
> +    }
> +  else
> +    {
> +      dst_offset = MAX_TEST_SIZE + getpagesize();
> +      src_offset = 0;
> +    }
>
> This doesn't make sense since this is now copying outside of the region,
> effectively enlarging the region but also always reading from one part and
> writing to another. The data outside the region is uninitialized which
> means
> it is using the zero page which gives unrepresentative results.
>

I see. I didn't quite grasp the purpose of the max_size. (As seen by the
fact that I renamed "length" to "max-alignment", now "region-size").

Changed it to use max_size instead of MAX_TEST_SIZE and initialize
3 * max_size.

I think its better to expand region size to 3x max_size as opposed to 2x
max_size and have noise for store-forwarding (on x86 at least)

>
> +    /* Create a random set of copies with the given size and alignment
>       distributions.  */
>    for (i = 0; i < MAX_COPIES; i++)
>      {
> +      dst_offset  = dst_gt_src == -1
> +                        ? (rand() & 1) ? MAX_TEST_SIZE + getpagesize() : 0
> +                        : dst_offset;
>
> Same here. If you want to force src > dst then just skip cases where
> src < dst. Similarly for avoiding accidental overlaps (which are likely
> to happen for small values of max_size).
>

I don't quite understand what you mean? I change it to be
2 * max_size in v2 and src_offset always has a value of
max_size if src > dst or randomized. This should prevent
overlaps as essentially src has its own max_size buffer
and dst has two possible max_size buffers, one below src
and one above.

The current implementation was extremely buggy. Sorry!


> Given the much larger sizes used for some tests, it is also worth
> ensuring the copy stays within the region.
>

Fixed in v2


>
>        copy[i].dst = (rand () & (max_size - 1));
>        copy[i].dst &= ~dst_align_arr[rand () & ALIGN_MASK];
> +      copy[i].dst += dst_offset;
>        copy[i].src = (rand () & (max_size - 1));
>        copy[i].src &= ~src_align_arr[rand () & ALIGN_MASK];
> +      copy[i].src += src_offset;
>        copy[i].len = size_arr[rand () & SIZE_MASK];
>      }
> +  return i;
> +}
>
> +static void
> +do_test (json_ctx_t *json_ctx, size_t max_size, int dst_gt_src)
> +{
> +  size_t n;
> +  memset (buf1, 1, max_size);
>
> Note this doesn't initialize the extra data in the buffer which creates
> odd performance behaviour due to the zero page.
>

Fixed in v2.

>
> +  n = init_copy(max_size, dst_gt_src);
>    json_element_object_begin (json_ctx);
> -  json_attr_uint (json_ctx, "length", (double) max_size);
> +  json_attr_uint (json_ctx, "max-alignment", (double) max_size);
>
> max-alignment, where does that come from???
>
> I misunderstood the purpose of "length" which is essentially total
region size. Imo "length" in context of memcpy generally refers to
copy size. Renamed "region-size" and set to full region size used
which previously was 2 * max_size, now 3 * max_size.


>    json_array_begin (&json_ctx, "results");
>    for (int i = 4; i <= 512; i = i * 2)
> -    do_test (&json_ctx, i * 1024);
> +    {
> +      if (i * 1024 > MAX_TEST_SIZE)
> +          continue;
>
> It's much easier to change the for loop to i = 4096 to MAX_TEST_SIZE
> and remove the various i * 1024. Then changing MAX_TEST_SIZE just works.
>
Fixed in v2.


>
> Cheers,
> Wilco

next prev parent reply	other threads:[~2021-08-24 19:32 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24 17:09 [PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c Wilco Dijkstra via Libc-alpha
2021-08-24 19:32 ` Noah Goldstein via Libc-alpha [this message]
2021-08-26 17:06   ` Wilco Dijkstra via Libc-alpha
  -- strict thread matches above, loose matches on Subject: below --
2021-08-24  8:27 [PATCH 1/5] string: Make tests birdirectional test-memcpy.c Noah Goldstein via Libc-alpha
2021-08-24  8:27 ` [PATCH 2/5] benchtests: Add new random cases to bench-memcpy-random.c Noah Goldstein via Libc-alpha
2021-08-24 15:18   ` H.J. Lu via Libc-alpha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAFUsyfK0EPP5q7zA84K7NrDbsM9NY1tSo_+QKNK=C_cZWA=KHA@mail.gmail.com' \
    --to=libc-alpha@sourceware.org \
    --cc=Wilco.Dijkstra@arm.com \
    --cc=goldstein.w.n@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).