git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/2] drop unaligned loads
@ 2020-09-24 19:16 Jeff King
  2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
  2020-09-24 19:22 ` [PATCH 2/2] Revert "fast-export: use local array to store anonymized oid" Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Jeff King @ 2020-09-24 19:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Han-Wen Nienhuys, git, Han-Wen Nienhuys

On Thu, Sep 24, 2020 at 10:22:20AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Then I did the same, but building with -DNO_UNALIGNED_LOADS. The latter
> > actually ran faster, by a small margin. Here are the hyperfine results:
> >
> >   [stock]
> >   Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
> >   Range (min … max):    6.550 s …  6.841 s    10 runs
> >
> >   [-DNO_UNALIGNED_LOADS]
> >   Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
> >   Range (min … max):    6.394 s …  6.447 s    10 runs
> >
> > For casual use as in reftables I doubt the difference is even
> > measurable. But this result implies that perhaps we ought to just be
> > using the fallback version all the time.
> 
> I like that one.  One less configurable knob that makes us execute
> different codepaths is one less thing to be worried about.

Here it is with a little more research, then, and a cleanup we can do on
top.

  [1/2]: bswap.h: drop unaligned loads
  [2/2]: Revert "fast-export: use local array to store anonymized oid"

 Makefile              |  1 -
 builtin/fast-export.c |  8 ++++----
 compat/bswap.h        | 24 ------------------------
 3 files changed, 4 insertions(+), 29 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-24 19:16 [PATCH 0/2] drop unaligned loads Jeff King
@ 2020-09-24 19:21 ` Jeff King
  2020-09-24 22:02   ` René Scharfe
  2020-09-25  1:13   ` brian m. carlson
  2020-09-24 19:22 ` [PATCH 2/2] Revert "fast-export: use local array to store anonymized oid" Jeff King
  1 sibling, 2 replies; 9+ messages in thread
From: Jeff King @ 2020-09-24 19:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, René Scharfe, Han-Wen Nienhuys, git,
	Han-Wen Nienhuys

Our put_be32() routine and its variants (get_be32(), put_be64(), etc)
has two implementations: on some platforms we cast memory in place and
use nothl()/htonl(), which can cause unaligned memory access. And on
others, we pick out the individual bytes using bitshifts.

This introduces extra complexity, and sometimes causes compilers to
generate warnings about type-punning. And it's not clear there's any
performance advantage.

This split goes back to 660231aa97 (block-sha1: support for
architectures with memory alignment restrictions, 2009-08-12). The
unaligned versions were part of the original block-sha1 code in
d7c208a92e (Add new optimized C 'block-sha1' routines, 2009-08-05),
which says it is:

   Based on the mozilla SHA1 routine, but doing the input data accesses a
   word at a time and with 'htonl()' instead of loading bytes and shifting.

Back then, Linus provided timings versus the mozilla code which showed a
27% improvement:

  https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/

However, the unaligned loads were either not the useful part of that
speedup, or perhaps compilers and processors have changed since then.
Here are times for computing the sha1 of 4GB of random data, with and
without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
gcc 10, -O2, and the processor is a Core i9-9880H.

  [stock]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
    Range (min … max):    6.550 s …  6.841 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
    Range (min … max):    6.394 s …  6.447 s    10 runs

And here's the same test run on an AMD A8-7600, using gcc 8.

  [stock]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
    Range (min … max):   11.509 s … 11.861 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
    Range (min … max):   11.637 s … 11.863 s    10 runs

So the unaligned loads don't seem to help much, and actually make things
worse. It's possible there are platforms where they provide more
benefit, but:

  - the non-x86 platforms for which we use this code are old and obscure
    (powerpc and s390).

  - the main caller that cares about performance is block-sha1. But
    these days it is rarely used anyway, in favor of sha1dc (which is
    already much slower, and nobody seems to have cared that much).

Let's just drop unaligned versions entirely in the name of simplicity.

Signed-off-by: Jeff King <peff@peff.net>
---
+cc Linus for any thoughts or performance wisdom, and in particular if
    there might be other platforms worth benchmarking (though unless
    the improvement is dramatic on some system, avoiding the complexity
    and pointer aliasing seems worthwhile to me).

+cc René because I know he is going to feed the two of them into
    godbolt; I could do that, too, but he will provide much better analysis
    on top ;)

 Makefile       |  1 -
 compat/bswap.h | 24 ------------------------
 2 files changed, 25 deletions(-)

diff --git a/Makefile b/Makefile
index 92d188fd37..d46765aa9e 100644
--- a/Makefile
+++ b/Makefile
@@ -1218,7 +1218,6 @@ SANITIZERS := $(foreach flag,$(subst $(comma),$(space),$(SANITIZE)),$(flag))
 BASIC_CFLAGS += -fsanitize=$(SANITIZE) -fno-sanitize-recover=$(SANITIZE)
 BASIC_CFLAGS += -fno-omit-frame-pointer
 ifneq ($(filter undefined,$(SANITIZERS)),)
-BASIC_CFLAGS += -DNO_UNALIGNED_LOADS
 BASIC_CFLAGS += -DSHA1DC_FORCE_ALIGNED_ACCESS
 endif
 ifneq ($(filter leak,$(SANITIZERS)),)
diff --git a/compat/bswap.h b/compat/bswap.h
index e4e25735ce..c0bb744adc 100644
--- a/compat/bswap.h
+++ b/compat/bswap.h
@@ -145,28 +145,6 @@ static inline uint64_t git_bswap64(uint64_t x)
 
 #endif
 
-/*
- * Performance might be improved if the CPU architecture is OK with
- * unaligned 32-bit loads and a fast ntohl() is available.
- * Otherwise fall back to byte loads and shifts which is portable,
- * and is faster on architectures with memory alignment issues.
- */
-
-#if !defined(NO_UNALIGNED_LOADS) && ( \
-    defined(__i386__) || defined(__x86_64__) || \
-    defined(_M_IX86) || defined(_M_X64) || \
-    defined(__ppc__) || defined(__ppc64__) || \
-    defined(__powerpc__) || defined(__powerpc64__) || \
-    defined(__s390__) || defined(__s390x__))
-
-#define get_be16(p)	ntohs(*(unsigned short *)(p))
-#define get_be32(p)	ntohl(*(unsigned int *)(p))
-#define get_be64(p)	ntohll(*(uint64_t *)(p))
-#define put_be32(p, v)	do { *(unsigned int *)(p) = htonl(v); } while (0)
-#define put_be64(p, v)	do { *(uint64_t *)(p) = htonll(v); } while (0)
-
-#else
-
 static inline uint16_t get_be16(const void *ptr)
 {
 	const unsigned char *p = ptr;
@@ -212,6 +190,4 @@ static inline void put_be64(void *ptr, uint64_t value)
 	p[7] = value >>  0;
 }
 
-#endif
-
 #endif /* COMPAT_BSWAP_H */
-- 
2.28.0.1082.g195fe9fe3f


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/2] Revert "fast-export: use local array to store anonymized oid"
  2020-09-24 19:16 [PATCH 0/2] drop unaligned loads Jeff King
  2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
@ 2020-09-24 19:22 ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2020-09-24 19:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Han-Wen Nienhuys, git, Han-Wen Nienhuys

This reverts commit f39ad38410da554af54966bf74fa0402355852ac.

That commit was trying to silence a type-punning warning on older
versions of gcc. However, its analysis was all wrong. I didn't notice
that we _were_ in fact type-punning because there are two versions of
put_be32(): one that uses casts and unaligned loads, and another that
uses bitshifts. I looked at the latter, but on my platform we were
defaulting to the former.

However, as of the previous commit, we'll always use the bitshift
version. So we can drop this hackery to avoid the warning, making the
code slightly cleaner.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fast-export.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 1b8fca3ee0..e266d90a81 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -405,12 +405,12 @@ static char *generate_fake_oid(void *data)
 {
 	static uint32_t counter = 1; /* avoid null oid */
 	const unsigned hashsz = the_hash_algo->rawsz;
-	unsigned char out[GIT_MAX_RAWSZ];
+	struct object_id oid;
 	char *hex = xmallocz(GIT_MAX_HEXSZ);
 
-	hashclr(out);
-	put_be32(out + hashsz - 4, counter++);
-	return hash_to_hex_algop_r(hex, out, the_hash_algo);
+	oidclr(&oid);
+	put_be32(oid.hash + hashsz - 4, counter++);
+	return oid_to_hex_r(hex, &oid);
 }
 
 static const char *anonymize_oid(const char *oid_hex)
-- 
2.28.0.1082.g195fe9fe3f

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
@ 2020-09-24 22:02   ` René Scharfe
  2020-09-25  4:56     ` Jeff King
  2020-09-25  1:13   ` brian m. carlson
  1 sibling, 1 reply; 9+ messages in thread
From: René Scharfe @ 2020-09-24 22:02 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Han-Wen Nienhuys, git, Han-Wen Nienhuys

Am 24.09.20 um 21:21 schrieb Jeff King:
> Our put_be32() routine and its variants (get_be32(), put_be64(), etc)
> has two implementations: on some platforms we cast memory in place and
> use nothl()/htonl(), which can cause unaligned memory access. And on
> others, we pick out the individual bytes using bitshifts.
>
> This introduces extra complexity, and sometimes causes compilers to
> generate warnings about type-punning. And it's not clear there's any
> performance advantage.
>
> This split goes back to 660231aa97 (block-sha1: support for
> architectures with memory alignment restrictions, 2009-08-12). The
> unaligned versions were part of the original block-sha1 code in
> d7c208a92e (Add new optimized C 'block-sha1' routines, 2009-08-05),
> which says it is:
>
>    Based on the mozilla SHA1 routine, but doing the input data accesses a
>    word at a time and with 'htonl()' instead of loading bytes and shifting.
>
> Back then, Linus provided timings versus the mozilla code which showed a
> 27% improvement:
>
>   https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/
>
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
>
> And here's the same test run on an AMD A8-7600, using gcc 8.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

Yay, benchmarks!  GCC 10.2 with -O2 on an i5-9600K without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.547 s ±  0.015 s    [User: 6.127 s, System: 0.395 s]
    Range (min … max):    6.531 s …  6.583 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.496 s ±  0.011 s    [User: 6.135 s, System: 0.360 s]
    Range (min … max):    6.486 s …  6.519 s    10 runs

clang 10 without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.697 s ±  0.028 s    [User: 6.343 s, System: 0.354 s]
    Range (min … max):    6.675 s …  6.754 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.714 s ±  0.049 s    [User: 6.320 s, System: 0.375 s]
    Range (min … max):    6.651 s …  6.791 s    10 runs

> +cc René because I know he is going to feed the two of them into
>     godbolt; I could do that, too, but he will provide much better analysis
>     on top ;)

Weeell, I don't know about that, but I couldn't resist taking a quick
look at what some compilers do with the 32-bit functions, which are the
ones used in block-sha1: https://www.godbolt.org/z/rhKMTM.

Older versions of gcc and clang didn't see through the shifting
put_be32() implementation.  If you go back further there are also
versions that didn't optimize the shifting get_be32().  And the latest
icc still can't do that.

gcc 10.2 just optimizes all functions to a bswap and a mov.  Can't do
any better than that, can you?

But why do we then see a difference in our benchmark results?  Not sure,
but https://www.godbolt.org/z/7xh8ao shows that gcc is shuffling some
instructions around depending on the implementation.  Switch to clang if
you want to see more vigorous shuffling.

The performance of bigger pieces of code seems to be a matter of luck
to some extent. :-/

René

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
  2020-09-24 22:02   ` René Scharfe
@ 2020-09-25  1:13   ` brian m. carlson
  2020-09-25  9:05     ` Carlo Arenas
  1 sibling, 1 reply; 9+ messages in thread
From: brian m. carlson @ 2020-09-25  1:13 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Linus Torvalds, René Scharfe,
	Han-Wen Nienhuys, git, Han-Wen Nienhuys

[-- Attachment #1: Type: text/plain, Size: 2829 bytes --]

On 2020-09-24 at 19:21:11, Jeff King wrote:
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
> 
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
> 
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
> 
> And here's the same test run on an AMD A8-7600, using gcc 8.
> 
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
> 
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

I think this is a fine and desirable change, both for performance and
correctness.  It is, as usual, well explained.

> So the unaligned loads don't seem to help much, and actually make things
> worse. It's possible there are platforms where they provide more
> benefit, but:
> 
>   - the non-x86 platforms for which we use this code are old and obscure
>     (powerpc and s390).

I cannot speak for s390, since I have never owned one, but my
understanding on unaligned access is that typically there is a tiny
penalty on x86 (about a cycle) and a more significant penalty on
PowerPC, although that may have changed with newer POWER chips.  So my
gut tells me this is an improvement either way, although I no longer own
any such bootable hardware to measure for certain.

Anyway, as René found, the latest versions of GCC already use the
peephole optimizer to recognize and optimize this on x86, so I expect
they'll do so on other architectures as well.  Byte swapping is a pretty
common operation.

>   - the main caller that cares about performance is block-sha1. But
>     these days it is rarely used anyway, in favor of sha1dc (which is
>     already much slower, and nobody seems to have cared that much).

I think block-sha256 uses it as well, but in any case, it's still faster
than sha1dc and people who care desperately about performance will use a
crypto library instead.
-- 
brian m. carlson: Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-24 22:02   ` René Scharfe
@ 2020-09-25  4:56     ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2020-09-25  4:56 UTC (permalink / raw)
  To: René Scharfe; +Cc: Junio C Hamano, Han-Wen Nienhuys, git, Han-Wen Nienhuys

On Fri, Sep 25, 2020 at 12:02:38AM +0200, René Scharfe wrote:

> Older versions of gcc and clang didn't see through the shifting
> put_be32() implementation.  If you go back further there are also
> versions that didn't optimize the shifting get_be32().  And the latest
> icc still can't do that.
> 
> gcc 10.2 just optimizes all functions to a bswap and a mov.  Can't do
> any better than that, can you?
> 
> But why do we then see a difference in our benchmark results?  Not sure,
> but https://www.godbolt.org/z/7xh8ao shows that gcc is shuffling some
> instructions around depending on the implementation.  Switch to clang if
> you want to see more vigorous shuffling.

We do redefine ntohl(), etc in compat/bswap.h. Looking at them, I'd
think the result would end up as a bswap instruction, though. And
indeed, trying to feed that to godbolt results in the same output you
got.

It does sound like older compilers were benefiting from the unaligned
versions. Building with gcc-4.8 (from debian jessie in a docker
container on the same machine), I get ~6.25s with the unaligned load
versus ~6.6s with the bit-shifting code. So that's the opposite of how
the newer compiler behaves.

Benchmarking clang-8 (which your results showed doesn't handle the
shifting version well). It likewise is just _slightly_ slower after my
patch (11.47s versus 11.57s).

Given that newer compilers behave the opposite way, and the overall
small magnitude of the impact, I'm still comfortable with the change.
It's nice to have a better understanding of how the compiler is
impacting it (even if I am still confused how anything at all changes on
the newer compilers). Thanks for digging.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-25  1:13   ` brian m. carlson
@ 2020-09-25  9:05     ` Carlo Arenas
  2020-09-25  9:09       ` Jeff King
  2020-09-25 20:48       ` Thomas Guyot
  0 siblings, 2 replies; 9+ messages in thread
From: Carlo Arenas @ 2020-09-25  9:05 UTC (permalink / raw)
  To: brian m. carlson, Jeff King, Junio C Hamano, Linus Torvalds,
	René Scharfe, Han-Wen Nienhuys, git, Han-Wen Nienhuys

On Thu, Sep 24, 2020 at 6:15 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On 2020-09-24 at 19:21:11, Jeff King wrote:
> >   [stock]
> >   Benchmark #1: t/helper/test-tool sha1 <foo.rand
> >     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
> >     Range (min … max):    6.550 s …  6.841 s    10 runs

slightly offtopic but what generates this nicely formatted output?

> I cannot speak for s390, since I have never owned one

I happen to be lucky enough to have access to one (RHEL 8.2/z15, gcc
8.3.1) and seems (third consecutive run):

stock: user: 7.555s, system: 1.191s
-DNO_UNALIGNED_LOADS: user: 7.561s, system: 1.189s

Carlo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-25  9:05     ` Carlo Arenas
@ 2020-09-25  9:09       ` Jeff King
  2020-09-25 20:48       ` Thomas Guyot
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2020-09-25  9:09 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: brian m. carlson, Junio C Hamano, Linus Torvalds,
	René Scharfe, Han-Wen Nienhuys, git, Han-Wen Nienhuys

On Fri, Sep 25, 2020 at 02:05:09AM -0700, Carlo Arenas wrote:

> > >   [stock]
> > >   Benchmark #1: t/helper/test-tool sha1 <foo.rand
> > >     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
> > >     Range (min … max):    6.550 s …  6.841 s    10 runs
> 
> slightly offtopic but what generates this nicely formatted output?

It's this:

  https://github.com/sharkdp/hyperfine

It will actually run both versions and compare them, but it's a little
more involved to set up (since you have to do a build step in between).

> > I cannot speak for s390, since I have never owned one
> 
> I happen to be lucky enough to have access to one (RHEL 8.2/z15, gcc
> 8.3.1) and seems (third consecutive run):
> 
> stock: user: 7.555s, system: 1.191s
> -DNO_UNALIGNED_LOADS: user: 7.561s, system: 1.189s

Thanks. That's not too surprising. gcc 8 seems to be able to optimize
both versions to the same thing (though I have no idea if s390 has a
bswap instruction).

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] bswap.h: drop unaligned loads
  2020-09-25  9:05     ` Carlo Arenas
  2020-09-25  9:09       ` Jeff King
@ 2020-09-25 20:48       ` Thomas Guyot
  1 sibling, 0 replies; 9+ messages in thread
From: Thomas Guyot @ 2020-09-25 20:48 UTC (permalink / raw)
  To: Carlo Arenas, brian m. carlson, Jeff King, Junio C Hamano,
	Linus Torvalds, René Scharfe, Han-Wen Nienhuys, git,
	Han-Wen Nienhuys

On 2020-09-25 05:05, Carlo Arenas wrote:
> On Thu, Sep 24, 2020 at 6:15 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
>> On 2020-09-24 at 19:21:11, Jeff King wrote:
>>>   [stock]
>>>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>>>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>>>     Range (min … max):    6.550 s …  6.841 s    10 runs
> 
> slightly offtopic but what generates this nicely formatted output?

It turns out I may have finally reproduced my diff speed regression at
home - just today - and I read this :) I may use it to test, nice timing!

About my unrelated regression, TL;DR, have anyone benchmarked under a
VM? (more context below...)

>> I cannot speak for s390, since I have never owned one
> 
> I happen to be lucky enough to have access to one (RHEL 8.2/z15, gcc
> 8.3.1) and seems (third consecutive run):
> 
> stock: user: 7.555s, system: 1.191s
> -DNO_UNALIGNED_LOADS: user: 7.561s, system: 1.189s

So also somewhat offtopic too, but I've been trying lately to reproduce
a speed regression with "git diff --no-ext-diff --quiet" (used by git
prompt in contrib) on some large repos between 2.9.5 and 2.16.2. Overall
the diff was over 3x as slow where I tested, and I measured a couple
timings between mmap of the same two large files between the two
version, the delta was approx 10x! (no other syscalls so it's all
computing and - if my theory proves right - I guess a lot of pagefaults
or similar handling control back to the host).

I prepped a VM where I will do more testing (I ran one preliminary test
and there seemed to be a visible difference, but I didn't have a proper
repo to test with yet). The point being it might be worth comparing the
two on VM's as well.

Regards,

Thomas

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-09-25 20:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-24 19:16 [PATCH 0/2] drop unaligned loads Jeff King
2020-09-24 19:21 ` [PATCH 1/2] bswap.h: " Jeff King
2020-09-24 22:02   ` René Scharfe
2020-09-25  4:56     ` Jeff King
2020-09-25  1:13   ` brian m. carlson
2020-09-25  9:05     ` Carlo Arenas
2020-09-25  9:09       ` Jeff King
2020-09-25 20:48       ` Thomas Guyot
2020-09-24 19:22 ` [PATCH 2/2] Revert "fast-export: use local array to store anonymized oid" Jeff King

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git