Re: [PATCH 1/2] bswap.h: drop unaligned loads

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "René Scharfe" <l.s.r@web.de>
To: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>, git <git@vger.kernel.org>,
	Han-Wen Nienhuys <hanwenn@gmail.com>
Subject: Re: [PATCH 1/2] bswap.h: drop unaligned loads
Date: Fri, 25 Sep 2020 00:02:38 +0200	[thread overview]
Message-ID: <b4b9475f-9c84-1889-835d-9f6e81198e5b@web.de> (raw)
In-Reply-To: <20200924192111.GA2528225@coredump.intra.peff.net>

Am 24.09.20 um 21:21 schrieb Jeff King:
> Our put_be32() routine and its variants (get_be32(), put_be64(), etc)
> has two implementations: on some platforms we cast memory in place and
> use nothl()/htonl(), which can cause unaligned memory access. And on
> others, we pick out the individual bytes using bitshifts.
>
> This introduces extra complexity, and sometimes causes compilers to
> generate warnings about type-punning. And it's not clear there's any
> performance advantage.
>
> This split goes back to 660231aa97 (block-sha1: support for
> architectures with memory alignment restrictions, 2009-08-12). The
> unaligned versions were part of the original block-sha1 code in
> d7c208a92e (Add new optimized C 'block-sha1' routines, 2009-08-05),
> which says it is:
>
>    Based on the mozilla SHA1 routine, but doing the input data accesses a
>    word at a time and with 'htonl()' instead of loading bytes and shifting.
>
> Back then, Linus provided timings versus the mozilla code which showed a
> 27% improvement:
>
>   https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/
>
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
>
> And here's the same test run on an AMD A8-7600, using gcc 8.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

Yay, benchmarks!  GCC 10.2 with -O2 on an i5-9600K without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.547 s ±  0.015 s    [User: 6.127 s, System: 0.395 s]
    Range (min … max):    6.531 s …  6.583 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.496 s ±  0.011 s    [User: 6.135 s, System: 0.360 s]
    Range (min … max):    6.486 s …  6.519 s    10 runs

clang 10 without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.697 s ±  0.028 s    [User: 6.343 s, System: 0.354 s]
    Range (min … max):    6.675 s …  6.754 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.714 s ±  0.049 s    [User: 6.320 s, System: 0.375 s]
    Range (min … max):    6.651 s …  6.791 s    10 runs

> +cc René because I know he is going to feed the two of them into
>     godbolt; I could do that, too, but he will provide much better analysis
>     on top ;)

Weeell, I don't know about that, but I couldn't resist taking a quick
look at what some compilers do with the 32-bit functions, which are the
ones used in block-sha1: https://www.godbolt.org/z/rhKMTM.

Older versions of gcc and clang didn't see through the shifting
put_be32() implementation.  If you go back further there are also
versions that didn't optimize the shifting get_be32().  And the latest
icc still can't do that.

gcc 10.2 just optimizes all functions to a bswap and a mov.  Can't do
any better than that, can you?

But why do we then see a difference in our benchmark results?  Not sure,
but https://www.godbolt.org/z/7xh8ao shows that gcc is shuffling some
instructions around depending on the implementation.  Switch to clang if
you want to see more vigorous shuffling.

The performance of bigger pieces of code seems to be a matter of luck
to some extent. :-/

René

next prev parent reply	other threads:[~2020-09-24 22:03 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-24 19:16 [PATCH 0/2] drop unaligned loads Jeff King
2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
2020-09-24 22:02   ` René Scharfe [this message]
2020-09-25  4:56     ` Jeff King
2020-09-25  1:13   ` brian m. carlson
2020-09-25  9:05     ` Carlo Arenas
2020-09-25  9:09       ` Jeff King
2020-09-25 20:48       ` Thomas Guyot
2020-09-24 19:22 ` [PATCH 2/2] Revert "fast-export: use local array to store anonymized oid" Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b4b9475f-9c84-1889-835d-9f6e81198e5b@web.de \
    --to=l.s.r@web.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=hanwen@google.com \
    --cc=hanwenn@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).