* [PATCH] git gc: Speed it up by 18% via faster hash comparisons
@ 2011-04-27 22:51 Ingo Molnar
  2011-04-27 23:10 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-27 22:51 UTC (permalink / raw)
  To: git
  Cc: H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


Looking at the stalled-cycles profile of a cached 'git gc' run i noticed the 
find_pack_entry_one() entry:

 $ perf record -e stalled-cycles -F 10000 ./git gc
 $ perf report

# Events: 26K stalled-cycles
#
# Overhead     Command          Shared Object                                     Symbol
# ........  ..........  .....................  .........................................
#
    26.07%         git  git                    [.] lookup_object
    10.22%         git  libz.so.1.2.5          [.] 0xc43a          
     7.08%         git  libz.so.1.2.5          [.] inflate
     6.63%         git  git                    [.] find_pack_entry_one
     5.37%         git  [kernel.kallsyms]      [k] do_raw_spin_lock
     4.03%         git  git                    [.] lookup_blob
     3.09%         git  libc-2.13.90.so        [.] __strlen_sse42
     2.81%         git  libc-2.13.90.so        [.] __memcpy_ssse3_back

Annotation shows:

 $ perf annotate find_pack_entry_one

 Percent |      Source code & Disassembly of git
------------------------------------------------
         :
     ...
         :                      int cmp = hashcmp(index + mi * stride, sha1);
    0.90 :        4b9264:       89 ee                   mov    %ebp,%esi
    0.45 :        4b9266:       41 0f af f2             imul   %r10d,%esi
    2.86 :        4b926a:       4c 01 de                add    %r11,%rsi
   53.34 :        4b926d:       f3 a6                   repz cmpsb %es:(%rdi),%ds:(%rsi)
   14.37 :        4b926f:       0f 92 c0                setb   %al
    5.78 :        4b9272:       41 0f 97 c4             seta   %r12b
    1.52 :        4b9276:       41 28 c4                sub    %al,%r12b

Most overhead is in hashcmp(), which uses memcmp(), which falls back to 
assembly string operations.

But we know that hashcmp() compares hashes, and if two hashes do not match,
the first byte will already differ in 99% of the cases (255 out of 256, for
uniformly distributed hash values).

So i tried the patch below: instead of relying on GCC emitting the string 
ops, i used an open-coded loop for this relatively short comparison, which does 
not need to go beyond the first byte in 99% of the cases.

While at it i also open-coded the is_null_sha1() comparison: instead of 
comparing against null_sha1[] byte by byte, we can read the sha1 value directly 
and use much cheaper 64-bit and 32-bit comparisons.

The results were rather surprising:

 #
 # Before:
 #

 $ perf stat --sync --repeat 10 ./git gc

 Performance counter stats for './git gc' (10 runs):

       2771.119892 task-clock               #    0.863 CPUs utilized            ( +-  0.16% )
             1,813 context-switches         #    0.001 M/sec                    ( +-  3.06% )
               167 CPU-migrations           #    0.000 M/sec                    ( +-  2.92% )
            39,210 page-faults              #    0.014 M/sec                    ( +-  0.26% )
     8,828,405,654 cycles                   #    3.186 GHz                      ( +-  0.13% )
     2,102,083,909 stalled-cycles           #   23.81% of all cycles are idle   ( +-  0.52% )
     8,821,931,740 instructions             #    1.00  insns per cycle        
                                            #    0.24  stalled cycles per insn  ( +-  0.04% )
     1,750,408,175 branches                 #  631.661 M/sec                    ( +-  0.04% )
        74,612,120 branch-misses            #    4.26% of all branches          ( +-  0.07% )

        3.211098537  seconds time elapsed  ( +-  1.52% )

[ Note: the --sync option to perf stat issues a sync() before each 'git gc' 
  test run; this reduces the noise of the 'elapsed time' numbers enormously. ]

 #
 # After:
 #

 $ perf stat --sync --repeat 10 ./git gc

 Performance counter stats for './git gc' (10 runs):

       2349.498022 task-clock               #    0.807 CPUs utilized            ( +-  0.15% )
             1,842 context-switches         #    0.001 M/sec                    ( +-  2.50% )
               164 CPU-migrations           #    0.000 M/sec                    ( +-  3.67% )
            39,350 page-faults              #    0.017 M/sec                    ( +-  0.06% )
     7,484,317,230 cycles                   #    3.185 GHz                      ( +-  0.15% )
     1,577,673,341 stalled-cycles           #   21.08% of all cycles are idle   ( +-  0.67% )
    11,067,826,786 instructions             #    1.48  insns per cycle        
                                            #    0.14  stalled cycles per insn  ( +-  0.02% )
     2,489,157,909 branches                 # 1059.442 M/sec                    ( +-  0.02% )
        59,384,019 branch-misses            #    2.39% of all branches          ( +-  0.22% )

        2.910829134  seconds time elapsed  ( +-  1.39% )

'git gc' got faster by 18%! Interestingly, about a quarter of all prior stalled 
cycles disappeared (2.10 billion -> 1.58 billion): most of them turned into 
actual cycle count savings and speedups.

So it's rather clear that a memcmp() based on string assembly instructions is 
suboptimal for short comparisons like this: there's quite a bit of setup 
latency in 'repz cmpsb' and the CPU is idling around during that time.

(I ran this on a Nehalem system, so it's a rather fast and modern Intel CPU.)
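
The effect is easy to demonstrate outside of Git as well - a toy microbenchmark 
along these lines (a rough illustrative sketch only, not a rigorous 
measurement; numbers will vary by CPU, compiler and compiler flags):

	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	static unsigned char a[20], b[20];

	/* open-coded byte loop, same shape as the patched hashcmp(): */
	static int loopcmp(const unsigned char *x, const unsigned char *y)
	{
		int i;

		for (i = 0; i < 20; i++)
			if (x[i] != y[i])
				return x[i] < y[i] ? -1 : +1;
		return 0;
	}

	int main(void)
	{
		volatile int sink = 0;
		clock_t t0;
		long i;

		t0 = clock();
		for (i = 0; i < 100000000L; i++) {
			a[0] = (unsigned char)(i | 1); /* first byte always differs */
			sink += memcmp(a, b, 20);
		}
		printf("memcmp: %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

		t0 = clock();
		for (i = 0; i < 100000000L; i++) {
			a[0] = (unsigned char)(i | 1);
			sink += loopcmp(a, b);
		}
		printf("loop:   %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

		return 0;
	}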

Also note another very interesting detail, the number of branch misses went way 
down, well beyond the measurement noise:

 before:   74,612,120 branch-misses            #    4.26% of all branches          ( +-  0.07% )
  after:   59,384,019 branch-misses            #    2.39% of all branches          ( +-  0.22% )

One theory would be that the open-coded loop is easier for the CPU to speculate 
along, so it produces fewer branch misses.

The number of instructions and branches increased - this is mostly because the 
PMU counts a complex 'repz cmpsb' as a single instruction that issues many 
uops, while the open-coded loop consists of many separate instructions - but 
roughly the same number of uops is issued in both cases.
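
(One way to sanity-check the uop theory would be to count uops directly, e.g. 
with something like:

  $ perf stat -e cycles,instructions,uops_issued.any ./git gc

though the symbolic uop event name is CPU-model and perf-version dependent - 
older setups need the raw event code for the specific CPU instead.)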

I suspect 'git fsck' got faster as well, but i have not measured that.

There's more string-op use in the Git sha1 code, but this was the 
lowest-hanging fruit.

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/cache.h b/cache.h
index 2674f4c..c5a54fb 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	int i;
+
+	for (i = 0; i < 20; i++, sha1++, sha2++) {
+		if (*sha1 != *sha2) {
+			if (*sha1 < *sha2)
+				return -1;
+			return +1;
+		}
+	}
+
+	return 0;
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	const unsigned long long *sha1_64 = (void *)sha1;
+	const unsigned int *sha1_32 = (void *)sha1;
+
+	if (sha1_64[0] || sha1_64[1] || sha1_32[4])
+		return 0;
+
+	return 1;
 }
+
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {
 	memcpy(sha_dst, sha_src, 20);


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 22:51 [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
@ 2011-04-27 23:10 ` Ingo Molnar
  2011-04-27 23:18 ` Jonathan Nieder
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-27 23:10 UTC (permalink / raw)
  To: git
  Cc: H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Ingo Molnar <mingo@elte.hu> wrote:

> I suspect 'git fsck' got faster as well, but i have not measured that.

It got a bit faster:

 #
 # Before:
 #

 $ perf stat --sync --repeat 5 ./git fsck >/dev/null

 Performance counter stats for './git fsck' (5 runs):

      32011.163574 task-clock               #    0.998 CPUs utilized            ( +-  0.08% )
                46 context-switches         #    0.000 M/sec                    ( +-  2.77% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
            60,279 page-faults              #    0.002 M/sec                    ( +- 12.21% )
   102,597,312,322 cycles                   #    3.205 GHz                      ( +-  0.08% )
    27,303,254,781 stalled-cycles           #   26.61% of all cycles are idle   ( +-  2.51% )
   152,359,589,474 instructions             #    1.49  insns per cycle        
                                            #    0.18  stalled cycles per insn  ( +-  0.03% )
    13,225,673,730 branches                 #  413.158 M/sec                    ( +-  0.06% )
     1,226,749,384 branch-misses            #    9.28% of all branches          ( +-  0.08% )

       32.083499222  seconds time elapsed  ( +-  0.07% )

 #
 # After:
 #

 Performance counter stats for './git fsck' (5 runs):

      31605.868825 task-clock               #    0.998 CPUs utilized            ( +-  0.08% )
                42 context-switches         #    0.000 M/sec                    ( +-  3.92% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
            62,979 page-faults              #    0.002 M/sec                    ( +- 14.72% )
   101,297,181,916 cycles                   #    3.205 GHz                      ( +-  0.08% )
    27,173,614,721 stalled-cycles           #   26.83% of all cycles are idle   ( +-  0.49% )
   155,074,859,385 instructions             #    1.53  insns per cycle        
                                            #    0.18  stalled cycles per insn  ( +-  0.01% )
    14,132,018,558 branches                 #  447.133 M/sec                    ( +-  0.02% )
     1,207,054,592 branch-misses            #    8.54% of all branches          ( +-  0.03% )

       31.675135938  seconds time elapsed  ( +-  0.08% )

so there's a +1.3% speedup.

But git fsck stalls are dominated by libz and other external libraries:

# Events: 30K stalled-cycles
#
# Overhead  Command          Shared Object                        Symbol
# ........  .......  .....................  ............................
#
    36.13%      git  libz.so.1.2.5          [.] 0x90be          
    18.27%      git  libc-2.13.90.so        [.] __memcpy_ssse3_back
    13.68%      git  libcrypto.so.1.0.0d    [.] sha1_block_data_order
     5.85%      git  libz.so.1.2.5          [.] inflate
     4.69%      git  git                    [.] lookup_object
     4.58%      git  libz.so.1.2.5          [.] adler32
     4.30%      git  libz.so.1.2.5          [.] 0xc280          
     2.19%      git  libc-2.13.90.so        [.] _int_malloc

So those dominate execution time.

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 22:51 [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
  2011-04-27 23:10 ` Ingo Molnar
@ 2011-04-27 23:18 ` Jonathan Nieder
  2011-04-28  6:36   ` Ingo Molnar
  2011-04-28  9:32   ` Dmitry Potapov
  2011-04-27 23:32 ` Junio C Hamano
  2011-04-28  8:52 ` Dmitry Potapov
  3 siblings, 2 replies; 45+ messages in thread
From: Jonathan Nieder @ 2011-04-27 23:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

Ingo Molnar wrote:

> Most overhead is in hashcmp(), which uses memcmp(), which falls back to 
> assembly string operations.
>
> But we know that hashcmp() compares hashes, and if two hashes do not match,
> the first byte will already differ in 99% of the cases (255 out of 256, for
> uniformly distributed hash values).
>
> So i tried the patch below: instead of relying on GCC emitting the string 
> ops, i used an open-coded loop for this relatively short comparison, which does 
> not need to go beyond the first byte in 99% of the cases.
[...]
> --- a/cache.h
> +++ b/cache.h
> @@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1);
[...]
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -	return memcmp(sha1, sha2, 20);
> +	int i;
> +
> +	for (i = 0; i < 20; i++, sha1++, sha2++) {

Hm.  This would be very sensitive to the compiler, since a too-smart
optimizer could take this loop and rewrite it back to memcmp!  So I
wonder if it's possible to convey this to the compiler more precisely:

	return memcmp_probably_differs_early(sha1, sha2, 20);

E.g., how would something like

	const unsigned int *start1 = (const unsigned int *) sha1;
	const unsigned int *start2 = (const unsigned int *) sha2;

	if (likely(*start1 != *start2)) {
		if (*start1 < *start2)
			return -1;
		return +1;
	}
	return memcmp(sha1 + 4, sha2 + 4, 16);

perform?

I suspect we don't have to worry about endianness as long as hashcmp
yields a consistent ordering, but I haven't checked.

Thanks, that was interesting.

Regards,
Jonathan


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 22:51 [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
  2011-04-27 23:10 ` Ingo Molnar
  2011-04-27 23:18 ` Jonathan Nieder
@ 2011-04-27 23:32 ` Junio C Hamano
  2011-04-28  0:35   ` Ralf Baechle
  2011-04-28  6:27   ` Ingo Molnar
  2011-04-28  8:52 ` Dmitry Potapov
  3 siblings, 2 replies; 45+ messages in thread
From: Junio C Hamano @ 2011-04-27 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

Ingo Molnar <mingo@elte.hu> writes:

> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -	return !memcmp(sha1, null_sha1, 20);
> +	int i;
> +
> +	for (i = 0; i < 20; i++, sha1++, sha2++) {
> +		if (*sha1 != *sha2) {
> +			if (*sha1 < *sha2)
> +				return -1;
> +			return +1;
> +		}
> +	}
> +
> +	return 0;

This is very unfortunate, as it is so trivially correct and we shouldn't
have to do it.  If the compiler does not use a good inlined memcmp(), this
patch may fly, but I fear it may hurt other compilers, no?

> +static inline int is_null_sha1(const unsigned char *sha1)
>  {
> -	return memcmp(sha1, sha2, 20);
> +	const unsigned long long *sha1_64 = (void *)sha1;
> +	const unsigned int *sha1_32 = (void *)sha1;

Can everybody do unaligned accesses just fine?


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 23:32 ` Junio C Hamano
@ 2011-04-28  0:35   ` Ralf Baechle
  2011-04-28  8:18     ` Bernhard R. Link
  2011-04-28  6:27   ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Ralf Baechle @ 2011-04-28  0:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ingo Molnar, git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

On Wed, Apr 27, 2011 at 04:32:12PM -0700, Junio C Hamano wrote:

> > +static inline int is_null_sha1(const unsigned char *sha1)
> >  {
> > -	return memcmp(sha1, sha2, 20);
> > +	const unsigned long long *sha1_64 = (void *)sha1;
> > +	const unsigned int *sha1_32 = (void *)sha1;
> 
> Can everybody do unaligned accesses just fine?

Misaligned accesses cause exceptions on some architectures, which then
get fixed up in software, making these accesses _very_ slow.  You can
use __attribute__((packed)) to work around that, but on the affected
architectures it will make gcc generate pessimistic code that is
slower than code without __attribute__((packed)) when the data is in
fact properly aligned.  And __attribute__((packed)) only works with GCC.

  Ralf


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 23:32 ` Junio C Hamano
  2011-04-28  0:35   ` Ralf Baechle
@ 2011-04-28  6:27   ` Ingo Molnar
  2011-04-28  9:17     ` Erik Faye-Lund
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  6:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Junio C Hamano <gitster@pobox.com> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> 
> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >  {
> > -	return !memcmp(sha1, null_sha1, 20);
> > +	int i;
> > +
> > +	for (i = 0; i < 20; i++, sha1++, sha2++) {
> > +		if (*sha1 != *sha2) {
> > +			if (*sha1 < *sha2)
> > +				return -1;
> > +			return +1;
> > +		}
> > +	}
> > +
> > +	return 0;
> 
> This is very unfortunate, as it is so trivially correct and we shouldn't
> have to do it.  If the compiler does not use a good inlined memcmp(), this
> patch may fly, but I fear it may hurt other compilers, no?

Well, i used a very fresh GCC version:

  gcc version 4.6.0 20110419 (Red Hat 4.6.0-5) (GCC)

And used a relatively fresh CPU as well. So given how slowly compiler and CPU 
versions trickle down to users, and how long they live there, Git will be 
living with this combination for years to come.

Secondly, the combined speedup of the cached case with my two patches appears 
to be more than 30% on my testbox so it's a very nifty win from two relatively 
simple changes.

Should a compiler ever turn this into suboptimal code again we can revisit the 
issue once more - it's not like we *can* keep the compiler from messing up the 
assembly output! :-) ...

> > +static inline int is_null_sha1(const unsigned char *sha1)
> >  {
> > -	return memcmp(sha1, sha2, 20);
> > +	const unsigned long long *sha1_64 = (void *)sha1;
> > +	const unsigned int *sha1_32 = (void *)sha1;
> 
> Can everybody do unaligned accesses just fine?

I have added some quick debug code and none of the sha1 pointers (in my 
admittedly very limited testing) were misaligned on 64-bit systems.

On 32-bit systems the pointer might be 32-bit aligned only - the patch below 
implements the function using 32-bit comparisons.

But is_null_sha1() is not called that often in the tests i've done, so we could 
keep it untouched as well.
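
(For the record, the debug check was nothing more sophisticated than something 
like this - a hypothetical reconstruction, not the actual debug patch:

	#include <stdint.h>
	#include <stdio.h>

	static void check_sha1_alignment(const unsigned char *sha1)
	{
		/* warn about pointers that are not 8-byte aligned: */
		if ((uintptr_t)sha1 & 7)
			fprintf(stderr, "unaligned sha1: %p\n", (const void *)sha1);
	}
)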

Thanks,

	Ingo

diff --git a/cache.h b/cache.h
index 2674f4c..427ad5a 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,32 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	int i;
+
+	for (i = 0; i < 20; i++, sha1++, sha2++) {
+		if (*sha1 != *sha2) {
+			if (*sha1 < *sha2)
+				return -1;
+			return +1;
+		}
+	}
+
+	return 0;
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	const unsigned int *sha1_32 = (void *)sha1;
+
+	if (sha1_32[0] || sha1_32[1] || sha1_32[2] || sha1_32[3] || sha1_32[4])
+		return 0;
+
+	return 1;
 }
+
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {
 	memcpy(sha_dst, sha_src, 20);


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 23:18 ` Jonathan Nieder
@ 2011-04-28  6:36   ` Ingo Molnar
  2011-04-28  9:31     ` Jonathan Nieder
  2011-04-28 10:36     ` Ingo Molnar
  2011-04-28  9:32   ` Dmitry Potapov
  1 sibling, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  6:36 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Jonathan Nieder <jrnieder@gmail.com> wrote:

> Ingo Molnar wrote:
> 
> > Most overhead is in hashcmp(), which uses memcmp(), which falls back to 
> > assembly string operations.
> >
> > But we know that hashcmp() compares hashes, and if two hashes do not match,
> > the first byte will already differ in 99% of the cases (255 out of 256, for
> > uniformly distributed hash values).
> >
> > So i tried the patch below: instead of relying on GCC emitting the string 
> > ops, i used an open-coded loop for this relatively short comparison, which does 
> > not need to go beyond the first byte in 99% of the cases.
> [...]
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1);
> [...]
> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >  {
> > -	return memcmp(sha1, sha2, 20);
> > +	int i;
> > +
> > +	for (i = 0; i < 20; i++, sha1++, sha2++) {
> 
> Hm.  This would be very sensitive to the compiler, since a too-smart 
> optimizer could take this loop and rewrite it back to memcmp! So I wonder if 
> it's possible to convey this to the compiler more precisely:
> 
> 	return memcmp_probably_differs_early(sha1, sha2, 20);
> 
> E.g., how would something like
> 
> 	const unsigned int *start1 = (const unsigned int *) sha1;
> 	const unsigned int *start2 = (const unsigned int *) sha2;
> 
> 	if (likely(*start1 != *start2)) {
> 		if (*start1 < *start2)
> 			return -1;
> 		return +1;
> 	}
> 	return memcmp(sha1 + 4, sha2 + 4, 16);
>
> perform?

Note that this function won't work on something like 99% of the systems out 
there due to endianness assumptions in Git.

Also, your hypothetical smart compiler would recognize the above as equivalent 
to memcmp(sha1, sha2, 20) and could rewrite it again - so we'd be back to 
square one.

As i argued in my other mail in this thread, it's hard to keep a compiler from 
messing up its assembly output if it really wants to mess up - we'll deal with 
it when that happens. I used a very fresh compiler and a modern CPU for my 
testing - so even if very, very new compilers improve this problem somehow it 
will stay with us for a long, long time.

Having said that, it would be nice if someone could test these two patches on a 
modern AMD box, using the perf stat from here:

  http://people.redhat.com/mingo/tip.git/README

  cd tools/perf/
  make -j install

and do something like this to test git gc's performance:

  $ perf stat --sync --repeat 10 ./git gc

... to see whether these speedups are generic, or somehow Intel CPU specific.

> I suspect we don't have to worry about endianness as long as hashcmp
> yields a consistent ordering, but I haven't checked.

Well i messed up endianness in an early version of this patch and 'git gc' was 
eminently unhappy about it! I have not figured out which part of Git relies on 
the comparison result though - most places seem to use the result as a boolean.

Thanks,

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  0:35   ` Ralf Baechle
@ 2011-04-28  8:18     ` Bernhard R. Link
  2011-04-28  9:42       ` Andreas Ericsson
  0 siblings, 1 reply; 45+ messages in thread
From: Bernhard R. Link @ 2011-04-28  8:18 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Junio C Hamano, Ingo Molnar, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

* Ralf Baechle <ralf@linux-mips.org> [110428 02:35]:
> On Wed, Apr 27, 2011 at 04:32:12PM -0700, Junio C Hamano wrote:
> 
> > > +static inline int is_null_sha1(const unsigned char *sha1)
> > >  {
> > > -	return memcmp(sha1, sha2, 20);
> > > +	const unsigned long long *sha1_64 = (void *)sha1;
> > > +	const unsigned int *sha1_32 = (void *)sha1;
> > 
> > Can everybody do unaligned accesses just fine?
> 
> Misaligned accesses cause exceptions on some architectures which then
> are fixed up in software making these accesses _very_ slow.  You can
> use __attribute__((packed)) to work around that but that will on the
> affected architectures make gcc generate code pessimistically that is
> slower than not using __attribute__((packed)) in case of proper
> alignment.  And __attribute__((packed)) only works with GCC.

Even __attribute__((packed)) usually does not allow arbitrarily aligned
data; it instructs the compiler to generate code that accesses misaligned
data in a special way. (I have already seen code where accessing a long
that would normally have been properly aligned caused a SIGBUS, because it
ended up misaligned by sitting inside a misaligned packed struct.)

In short: misaligning stuff works on x86; everywhere else it is a disaster
waiting to happen. (And then people claim compiler bugs or broken
architectures, just because they do not know the basic rules of C.)
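
To make the point concrete, consider a sketch like this (hypothetical layout, 
purely for illustration):

	struct entry {
		unsigned char flag;
		unsigned char sha1[20];	/* starts at offset 1: misaligned */
	};

	static unsigned int first_word(const struct entry *e)
	{
		/* undefined behaviour; may SIGBUS on strict-alignment targets: */
		return *(const unsigned int *)e->sha1;
	}

On x86 the misaligned load merely costs some extra cycles; on a 
strict-alignment architecture it can trap.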

	Bernhard R. Link


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 22:51 [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
                   ` (2 preceding siblings ...)
  2011-04-27 23:32 ` Junio C Hamano
@ 2011-04-28  8:52 ` Dmitry Potapov
  2011-04-28  9:11   ` Ingo Molnar
  3 siblings, 1 reply; 45+ messages in thread
From: Dmitry Potapov @ 2011-04-28  8:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -       return !memcmp(sha1, null_sha1, 20);
> +       int i;
> +
> +       for (i = 0; i < 20; i++, sha1++, sha2++) {
> +               if (*sha1 != *sha2) {

At the very least, you may want to put 'likely' in this 'if'
condition, otherwise the compiler may optimize this loop in
the same way as with memcmp. So, it may work well now, but
it may become much slower with future versions or different
levels of optimization. (AFAIK, -O3 is far more aggressive
in optimizing loops.)

Another thing is that so far this optimization was tested only
with GCC, and we do not know whether it helps or harms other
compilers. So, maybe placing this code under 'ifdef __GNUC__'
makes more sense than pushing this change on other compilers too.


Dmitry


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  8:52 ` Dmitry Potapov
@ 2011-04-28  9:11   ` Ingo Molnar
  2011-04-28  9:31     ` Dmitry Potapov
  2011-04-28  9:38     ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  9:11 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Dmitry Potapov <dpotapov@gmail.com> wrote:

> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >  {
> > -       return !memcmp(sha1, null_sha1, 20);
> > +       int i;
> > +
> > +       for (i = 0; i < 20; i++, sha1++, sha2++) {
> > +               if (*sha1 != *sha2) {
> 
> At the very least, you may want to put 'likely' in this 'if'
> condition, otherwise the compiler may optimize this loop in
> the same way as with memcmp. So, it may work well now, but
> it may become much slower with future versions or different
> levels of optimization. (AFAIK, -O3 is far more aggressive
> in optimizing loops.)

the main difference is between the string assembly instructions and the loop. 
Modern CPUs will hardly notice this loop being emitted with slight variations 
by the compiler. So i do not share this concern.

> Another thing is that so far this optimization was only with
> GCC, and we do not know whether it helps or harms to compilers.
> So, maybe placing this code under 'ifdef __GNUC__' makes more
> sense than pushing this change on other compilers too.

Well, given that 90%+ of Git development is done with GCC (and probably the 
user share is similarly high as well):

 aldebaran:~/git> git log --pretty=oneline -i --grep gcc | wc -l
 127
 aldebaran:~/git> git log --pretty=oneline -i --grep llvm | wc -l
 1

and given that these two patches speed up 'git gc' by 30%+, i doubt we'd want 
to litter core Git code with #ifdef __GNUC__ uglinesses.

Thanks,

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  6:27   ` Ingo Molnar
@ 2011-04-28  9:17     ` Erik Faye-Lund
  2011-04-28  9:33       ` Ingo Molnar
  2011-04-28  9:37       ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28  9:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Junio C Hamano <gitster@pobox.com> wrote:
>
>> Ingo Molnar <mingo@elte.hu> writes:
>>
>> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>> >  {
>> > -   return !memcmp(sha1, null_sha1, 20);
>> > +   int i;
>> > +
>> > +   for (i = 0; i < 20; i++, sha1++, sha2++) {
>> > +           if (*sha1 != *sha2) {
>> > +                   if (*sha1 < *sha2)
>> > +                           return -1;
>> > +                   return +1;
>> > +           }

Why not just:

if (*sha1 != *sha2)
        return *sha2 - *sha1;

memcmp isn't guaranteed to return only the values -1, 0, +1; it can
return any value, as long as the sign of a non-zero return value
expresses the relationship between the first mismatching bytes.

>> > +   }
>> > +
>> > +   return 0;
>>
>> This is very unfortunate, as it is so trivially correct and we shouldn't
>> have to do it.  If the compiler does not use a good inlined memcmp(), this
>> patch may fly, but I fear it may hurt other compilers, no?

If the common case is that the hashes are random (as this patch
assumes), then this patch should give very close to ideal
performance, no? A good memcmp might be faster when there's a match;
but how often do we really compare SHA-1's that are identical?

I see your worry, but if the assumption is correct, I doubt it'd turn
out to be a real problem. ~99.6% of the time we'd early-out on the
first byte, which is ideal.

If comparing identical SHA-1's is important, perhaps an early-out on
the first byte followed by a memcmp is a good solution (similar to
what Jonathan Nieder proposed, but without the alignment problems)?

> Secondly, the combined speedup of the cached case with my two patches appears
> to be more than 30% on my testbox so it's a very nifty win from two relatively
> simple changes.

That speed-up was on ONE test vector, no? There are a lot of other
uses of hash-comparisons in Git, did you measure those?

>> > +static inline int is_null_sha1(const unsigned char *sha1)
>> >  {
>> > -   return memcmp(sha1, sha2, 20);
>> > +   const unsigned long long *sha1_64 = (void *)sha1;
>> > +   const unsigned int *sha1_32 = (void *)sha1;
>>
>> Can everybody do unaligned accesses just fine?
>
> I have added some quick debug code and none of the sha1 pointers (in my
> admittedly very limited testing) were misaligned on 64-bit systems.
>
> On 32-bit systems the pointer might be 32-bit aligned only - the patch below
> implements the function using 32-bit comparisons.

That's simply wrong. Unsigned char arrays can and will be unaligned,
and this causes exceptions on most architectures (x86 is pretty much
the exception here). While some systems for these architectures
support unaligned reads from the exception handler, others don't. So
this patch is pretty much guaranteed to cause a crash in some setups.


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  6:36   ` Ingo Molnar
@ 2011-04-28  9:31     ` Jonathan Nieder
  2011-04-28 10:36     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Jonathan Nieder @ 2011-04-28  9:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

Ingo Molnar wrote:
> * Jonathan Nieder <jrnieder@gmail.com> wrote:

>> E.g., how would something like
>>
>> 	const unsigned int *start1 = (const unsigned int *) sha1;
>> 	const unsigned int *start2 = (const unsigned int *) sha2;
>>
>> 	if (likely(*start1 != *start2)) {
>> 		if (*start1 < *start2)
>> 			return -1;
>> 		return +1;
>> 	}
>> 	return memcmp(sha1 + 4, sha2 + 4, 16);
>>
>> perform?
>
> Note that this function wont work on like 99% of the systems out there due to 
> endianness assumptions in Git.

Yes, I was greedy and broke the semantics, and my suggestion was
nonsensical for other reasons (e.g., alignment), too.  I should have
written something like:

	if (likely(*sha1 != *sha2)) {
		if (*sha1 < *sha2)
			return -1;
		return +1;
	}
	return memcmp(sha1, sha2, 20);

since an early-out in 255 out of 256 cases seems good enough already.

> Also, your hypothetical smart compiler would recognize the above as equivalent 
> to memcmp(sha1, sha2, 20) and could rewrite it again - so we'd be back to 
> square 1.

True.  The real point is a "likely" to explain to human readers what
is happening.

> Having said that, it would be nice if someone could test these two patches on a 
> modern AMD box, using the perf stat from here:
>
>   http://people.redhat.com/mingo/tip.git/README
>
>   cd tools/perf/
>   make -j install
>
> and do something like this to test git gc's performance:
>
>   $ perf stat --sync --repeat 10 ./git gc
>
> ... to see whether these speedups are generic, or somehow Intel CPU specific.

Sounds like fun.  Will try to find time to play around with this in
the next few days.

> Well i messed up endianness in an early version of this patch and 'git gc' was
> eminently unhappy about it! I have not figured out which part of Git relies on
> the comparison result though - most places seem to use the result as a boolean.

I think hashcmp is used to run binary searches within a packfile
index.  Thanks for explaining.
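
A simplified sketch of that search, loosely following find_pack_entry_one() 
(names and details approximate, not the actual Git code):

	static const unsigned char *find_entry(const unsigned char *index,
					       unsigned nr, size_t stride,
					       const unsigned char *sha1)
	{
		unsigned lo = 0, hi = nr;

		while (lo < hi) {
			unsigned mi = lo + (hi - lo) / 2;
			int cmp = hashcmp(index + mi * stride, sha1);

			if (!cmp)
				return index + mi * stride;	/* found */
			if (cmp > 0)
				hi = mi;	/* probe sorts after the target */
			else
				lo = mi + 1;
		}
		return NULL;	/* not present */
	}

This is why the sign of hashcmp() must define a consistent total order over 
the sorted .idx entries, even though the particular order chosen does not 
matter.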

Regards,
Jonathan


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:11   ` Ingo Molnar
@ 2011-04-28  9:31     ` Dmitry Potapov
  2011-04-28  9:44       ` Ingo Molnar
  2011-04-28  9:38     ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Dmitry Potapov @ 2011-04-28  9:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Dmitry Potapov <dpotapov@gmail.com> wrote:
>
>> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
>> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>> >  {
>> > -       return !memcmp(sha1, null_sha1, 20);
>> > +       int i;
>> > +
>> > +       for (i = 0; i < 20; i++, sha1++, sha2++) {
>> > +               if (*sha1 != *sha2) {
>>
>> At the very least, you may want to put 'likely' in this 'if'
>> condition, otherwise the compiler may optimize this loop in
>> the same way as with memcmp. So, it may work well now, but
>> it may become much slower with future versions or different
>> levels of optimization. (AFAIK, -O3 is far more aggressive
>> in optimizing loops.)
>
> the main difference is between the string assembly instructions and the loop.
> Modern CPUs will hardly notice this loop being emitted with slight variations
> by the compiler. So i do not share this concern.

Here you make an assumption about what kinds of optimization the compiler
can do. As Jonathan noticed above, theoretically a smart compiler
can turn this loop into memcmp (or code very similar to memcmp).

The reason why memcmp does not work well is that it is optimized
for the worst-case scenario (where the beginnings of the two strings
are the same), while _we_ know that with a hash this is very unlikely,
and we want to convey this knowledge to the compiler in some
way. Just re-writing memcmp as an explicit loop does not convey
this knowledge.

Therefore, I believe it makes sense to add 'likely'. I have not
tested this code, but in the past I had very similar code
compiled with -O3, and just adding 'likely' resulted in a
40% speed-up for that comparison function.


Dmitry


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-27 23:18 ` Jonathan Nieder
  2011-04-28  6:36   ` Ingo Molnar
@ 2011-04-28  9:32   ` Dmitry Potapov
  1 sibling, 0 replies; 45+ messages in thread
From: Dmitry Potapov @ 2011-04-28  9:32 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ingo Molnar, git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

2011/4/28 Jonathan Nieder <jrnieder@gmail.com>:
>
> Hm.  This would be very sensitive to the compiler, since a too-smart
> optimizer could take this loop and rewrite it back to memcmp!  So I
> wonder if it's possible to convey this to the compiler more precisely:
>
>        return memcmp_probably_differs_early(sha1, sha2, 20);
>
> E.g., how would something like
>
>        const unsigned int *start1 = (const unsigned int *) sha1;
>        const unsigned int *start2 = (const unsigned int *) sha2;
>
>        if (likely(*start1 != *start2)) {
>                if (*start1 < *start2)

It can be a problem with unaligned access. So, IMHO, it is
better to use get_be32 here:

        unsigned start1 = get_be32(sha1);
        unsigned start2 = get_be32(sha2);

        if (likely(start1 != start2)) {
                if (start1 < start2)

...
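
(For reference, a byte-wise get_be32 that makes no alignment assumptions can 
be as simple as this sketch - Git's own definition may differ:

	static inline unsigned int get_be32(const unsigned char *p)
	{
		return (unsigned int)p[0] << 24 |
		       (unsigned int)p[1] << 16 |
		       (unsigned int)p[2] <<  8 |
		       (unsigned int)p[3];
	}

and a good compiler can often turn this into a single load plus byte-swap on 
architectures where that is safe.)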

Dmitry


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:17     ` Erik Faye-Lund
@ 2011-04-28  9:33       ` Ingo Molnar
  2011-04-28  9:37       ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  9:33 UTC (permalink / raw)
  To: Erik Faye-Lund
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg


* Erik Faye-Lund <kusmabite@gmail.com> wrote:

> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
> >
> > * Junio C Hamano <gitster@pobox.com> wrote:
> >
> >> Ingo Molnar <mingo@elte.hu> writes:
> >>
> >> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >> >  {
> >> > -   return !memcmp(sha1, null_sha1, 20);
> >> > +   int i;
> >> > +
> >> > +   for (i = 0; i < 20; i++, sha1++, sha2++) {
> >> > +           if (*sha1 != *sha2) {
> >> > +                   if (*sha1 < *sha2)
> >> > +                           return -1;
> >> > +                   return +1;
> >> > +           }
> 
> Why not just:
> 
> if (*sha1 != *sha2)
>         return *sha2 - *sha1;

You mean "*sha1 - *sha2", right?

> memcmp isn't guaranteed to return only the values -1, 0, +1; it can
> return any value, as long as the sign of a non-zero return value
> expresses the relationship between the first mismatching bytes.

Yeah, agreed, updated patch below. Seems to work fine here.

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/cache.h b/cache.h
index 2674f4c..574c948 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,30 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	int i;
+
+	for (i = 0; i < 20; i++, sha1++, sha2++) {
+		if (*sha1 != *sha2)
+			return *sha1 - *sha2;
+	}
+
+	return 0;
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	const unsigned long long *sha1_64 = (void *)sha1;
+	const unsigned int *sha1_32 = (void *)sha1;
+
+	if (sha1_64[0] || sha1_64[1] || sha1_32[4])
+		return 0;
+
+	return 1;
 }
+
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {
 	memcpy(sha_dst, sha_src, 20);


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:17     ` Erik Faye-Lund
  2011-04-28  9:33       ` Ingo Molnar
@ 2011-04-28  9:37       ` Ingo Molnar
  2011-04-28  9:50         ` Erik Faye-Lund
  2011-04-28 16:36         ` Dmitry Potapov
  1 sibling, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  9:37 UTC (permalink / raw)
  To: Erik Faye-Lund
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg


* Erik Faye-Lund <kusmabite@gmail.com> wrote:

> > Secondly, the combined speedup of the cached case with my two patches 
> > appears to be more than 30% on my testbox so it's a very nifty win from two 
> > relatively simple changes.
> 
> That speed-up was on ONE test vector, no? There are a lot of other uses of 
> hash-comparisons in Git, did you measure those?

I picked this hash function because it showed up in the profile (see the 
profile i posted). There's one other hash that mattered as well in the profile, 
see the lookup_object() patch i sent yesterday.

> > I have added some quick debug code and none of the sha1 pointers (in my 
> > admittedly very limited testing) were misaligned on 64-bit systems.
> >
> > On 32-bit systems the pointer might be 32-bit aligned only - the patch 
> > below implements the function using 32-bit comparisons.
> 
> That's simply wrong. Unsigned char arrays can and will be unaligned, and this 
> causes exceptions on most architectures (x86 is pretty much the exception 
> here). While some systems for these architectures support unaligned reads 
> from the exception handler, others don't. So this patch is pretty much 
> guaranteed to cause a crash in some setups.

If unsigned char arrays are allocated unaligned then i suspect that's another 
bug that should be fixed. Unaligned access on x86 is not free either - there 
are cycle penalties.

Alas, i have not seen these sha1 hash buffers being allocated unaligned (in my 
very limited testing). In which spots are they allocated unaligned?

Thanks,

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:11   ` Ingo Molnar
  2011-04-28  9:31     ` Dmitry Potapov
@ 2011-04-28  9:38     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  9:38 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Ingo Molnar <mingo@elte.hu> wrote:

> > Another thing is that so far this optimization was only with
> > GCC, and we do not know whether it helps or harms to compilers.
> > So, maybe placing this code under 'ifdef __GNUC__' makes more
> > sense than pushing this change on other compilers too.
> 
> Well, given that 90%+ of Git development is done with GCC (and probably the 
> user share is similarly high as well):
> 
>  aldebaran:~/git> git log --pretty=oneline -i --grep gcc | wc -l
>  127
>  aldebaran:~/git> git log --pretty=oneline -i --grep llvm | wc -l
>  1

I left out:

   aldebaran:~/git> git log --pretty=oneline -i --grep clang | wc -l
   5

Still well within the 90%+ figure.

Thanks,

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  8:18     ` Bernhard R. Link
@ 2011-04-28  9:42       ` Andreas Ericsson
  2011-04-28  9:55         ` Erik Faye-Lund
  0 siblings, 1 reply; 45+ messages in thread
From: Andreas Ericsson @ 2011-04-28  9:42 UTC (permalink / raw)
  To: Bernhard R. Link
  Cc: Ralf Baechle, Junio C Hamano, Ingo Molnar, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

On 04/28/2011 10:18 AM, Bernhard R. Link wrote:
> * Ralf Baechle<ralf@linux-mips.org>  [110428 02:35]:
>> On Wed, Apr 27, 2011 at 04:32:12PM -0700, Junio C Hamano wrote:
>>
>>>> +static inline int is_null_sha1(const unsigned char *sha1)
>>>>   {
>>>> -	return memcmp(sha1, sha2, 20);
>>>> +	const unsigned long long *sha1_64 = (void *)sha1;
>>>> +	const unsigned int *sha1_32 = (void *)sha1;
>>>
>>> Can everybody do unaligned accesses just fine?
>>
>> Misaligned accesses cause exceptions on some architectures, which then
>> get fixed up in software, making these accesses _very_ slow.  You can
>> use __attribute__((packed)) to work around that, but on the affected
>> architectures it will make gcc generate pessimistic code that is
>> slower than code without __attribute__((packed)) when the data is in
>> fact properly aligned.  And __attribute__((packed)) only works with GCC.
> 
> Even __attribute__((packed)) usually does not allow arbitrarily aligned
> data; it instructs the compiler to generate code that accesses misaligned
> data in a special way. (I have already seen code where accessing a long
> that would normally have been properly aligned caused a SIGBUS, because it
> ended up misaligned by sitting inside a misaligned packed struct.)
> 
> In short: misaligning stuff works on x86; everywhere else it is a disaster
> waiting to happen. (And then people claim compiler bugs or broken
> architectures, just because they do not know the basic rules of C.)
> 

Given that the vast majority of user systems are x86-style ones, it's
probably worth using this patch on such systems and sticking to a
partially unrolled byte-by-byte comparison that finishes early on
the rest of them. Properly pipelined, it will just mean that the early
return undoes the fetch steps for the 3-4 unrolled bytes that it
computes in advance, so if the difference comes in the first 10-12 bytes,
it will still be a win.

For bonus points, check if both bytestrings are equally (un)aligned
first and, if they are, half-Duff it out with a fallthrough switch
statement (without the while() loop) to compare byte-by-byte first
and then word-for-word on the rest of it. The setup and complexity
is probably not worth it for our meager 20-byte strings though.
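
For illustration, a loop-based variant of that idea might look like the 
following sketch (untested; assumes both pointers share the same skew and 
that stdint.h and string.h are available):

	static int hashcmp_words(const unsigned char *a, const unsigned char *b)
	{
		size_t skew = (uintptr_t)a & 3;
		size_t i, head = skew ? 4 - skew : 0;

		/* byte-compare up to the next 4-byte boundary: */
		for (i = 0; i < head; i++)
			if (a[i] != b[i])
				return a[i] < b[i] ? -1 : +1;

		/* compare whole aligned 32-bit words: */
		for (; i + 4 <= 20; i += 4) {
			unsigned int wa = *(const unsigned int *)(a + i);
			unsigned int wb = *(const unsigned int *)(b + i);

			/* defer to memcmp for the correct byte order: */
			if (wa != wb)
				return memcmp(a + i, b + i, 4);
		}

		/* tail bytes, if any: */
		for (; i < 20; i++)
			if (a[i] != b[i])
				return a[i] < b[i] ? -1 : +1;

		return 0;
	}

As said though, the setup cost probably dwarfs the win on 20-byte strings.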

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:31     ` Dmitry Potapov
@ 2011-04-28  9:44       ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28  9:44 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg


* Dmitry Potapov <dpotapov@gmail.com> wrote:

> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
> >
> > * Dmitry Potapov <dpotapov@gmail.com> wrote:
> >
> >> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
> >> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >> >  {
> >> > -       return !memcmp(sha1, null_sha1, 20);
> >> > +       int i;
> >> > +
> >> > +       for (i = 0; i < 20; i++, sha1++, sha2++) {
> >> > +               if (*sha1 != *sha2) {
> >>
> >> At the very least, you may want to put 'likely' in this 'if'
> >> condition, otherwise the compiler may optimize this loop in
> >> the same way as with memcmp. So, it may work well now, but
> >> it may become much slower with future versions or different
> >> levels of optimization. (AFAIK, -O3 is far more aggressive
> >> in optimizing loops.)
> >
> > the main difference is between the string assembly instructions and the loop.
> > Modern CPUs will hardly notice this loop being emitted with slight variations
> > by the compiler. So i do not share this concern.
> 
> Here you make an assumption about what kinds of optimization the compiler
> can do. [...]

I make no assumption there, because rule #1 is that the compiler can pretty 
much do what it wants and we have little control over that.

> [...] As Jonathan noticed above, theoretically a smart compiler can turn this 
> loop into memcmp (or code very similar to memcmp).

Yes, and in practice it does not, and in practice we can speed up git gc 
measurably.

> The reason why memcmp does not work well is that it is optimized
> for the worst-case scenario (where the beginnings of the two strings
> are the same), while _we_ know that with a hash this is very unlikely,
> and we want to convey this knowledge to the compiler in some
> way. Just re-writing memcmp as an explicit loop does not convey
> this knowledge.
> 
> Therefore, I believe it makes sense to add 'likely'. I have not
> tested this code, but in the past I had very similar code
> compiled with -O3, and just adding 'likely' resulted in a
> 40% speed-up for that comparison function.

You guys can certainly add the 'likely()' if you want to (it likely won't hurt) 
- but note that the compiler can *still* turn it into a memcmp() - see rule #1 
above.

Note that Git does not have a likely() facility at the moment and 
__builtin_expect() is a GNU extension. Should be a separate patch.
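
(Such a facility would presumably look much like the kernel's - a sketch with 
the usual macro names, assuming it would go into git-compat-util.h:

	#if defined(__GNUC__)
	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)
	#else
	#define likely(x)	(x)
	#define unlikely(x)	(x)
	#endif
)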

Thanks,

	Ingo


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:37       ` Ingo Molnar
@ 2011-04-28  9:50         ` Erik Faye-Lund
  2011-04-28 10:10           ` Pekka Enberg
  2011-04-28 10:19           ` Ingo Molnar
  2011-04-28 16:36         ` Dmitry Potapov
  1 sibling, 2 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28  9:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Erik Faye-Lund <kusmabite@gmail.com> wrote:
>
>> > Secondly, the combined speedup of the cached case with my two patches
>> > appears to be more than 30% on my testbox so it's a very nifty win from two
>> > relatively simple changes.
>>
>> That speed-up was on ONE test vector, no? There are a lot of other uses of
>> hash-comparisons in Git, did you measure those?
>
> I picked this hash function because it showed up in the profile (see the
> profile i posted). There's one other hash that mattered as well in the profile,
> see the lookup_object() patch i sent yesterday.
>

My point was that the 30% improvement was in "git gc", which is not
the only important use-case. How does this affect other git commands?

My suspicion is that it's generally an improvement, but I don't know
for sure. I think something like this might be an acceptable
trade-off, though:

diff --git a/cache.h b/cache.h
index c730c58..50b6f55 100644
--- a/cache.h
+++ b/cache.h
@@ -687,6 +687,10 @@ static inline int is_null_sha1(const unsigned char *sha1)
 }
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
+	/* early out for fast mis-match */
+	if (*sha1 != *sha2)
+		return *sha1 - *sha2;
+
 	return memcmp(sha1, sha2, 20);
 }
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)

>> > I have added some quick debug code and none of the sha1 pointers (in my
>> > admittedly very limited testing) were misaligned on 64-bit
>> > systems.
>> >
>> > On 32-bit systems the pointer might be 32-bit aligned only - the patch
>> > below implements the function using 32-bit comparisons.
>>
>> That's simply wrong. Unsigned char arrays can and will be unaligned, and this
>> causes exceptions on most architectures (x86 is pretty much the exception
>> here). While some systems for these architectures support unaligned reads
>> from the exception handler, others don't. So this patch is pretty much
>> guaranteed to cause a crash in some setups.
>
> If unsigned char arrays are allocated unaligned then i suspect that's another
> bug that should be fixed.

We can't. The compiler decides the alignment of variables on the
stack. Some compilers / compiler-setting pairs might align
char-arrays, while others might not.

> Unaligned access on x86 is not free either -
> there are cycle penalties.
>
> Alas, i have not seen these sha1 hash buffers being allocated unaligned (in my
> very limited testing). In which spots are they allocated unaligned?

Like I said above, it can happen when allocated on the stack. But it
can also happen in malloc'ed structs, or in global variables. An array
is aligned to the size of its base member type. But malloc does
worst-case alignment, because it happens at run-time without
type information.


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:42       ` Andreas Ericsson
@ 2011-04-28  9:55         ` Erik Faye-Lund
  2011-04-28 20:19           ` H. Peter Anvin
  0 siblings, 1 reply; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28  9:55 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Bernhard R. Link, Ralf Baechle, Junio C Hamano, Ingo Molnar, git,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

On Thu, Apr 28, 2011 at 11:42 AM, Andreas Ericsson <ae@op5.se> wrote:
> On 04/28/2011 10:18 AM, Bernhard R. Link wrote:
>> * Ralf Baechle<ralf@linux-mips.org>  [110428 02:35]:
>>> On Wed, Apr 27, 2011 at 04:32:12PM -0700, Junio C Hamano wrote:
>>>
>>>>> +static inline int is_null_sha1(const unsigned char *sha1)
>>>>>   {
>>>>> -  return memcmp(sha1, sha2, 20);
>>>>> +  const unsigned long long *sha1_64 = (void *)sha1;
>>>>> +  const unsigned int *sha1_32 = (void *)sha1;
>>>>
>>>> Can everybody do unaligned accesses just fine?
>>>
>>> Misaligned accesses cause exceptions on some architectures, which then
>>> get fixed up in software, making these accesses _very_ slow.  You can
>>> use __attribute__((packed)) to work around that, but on the affected
>>> architectures it will make gcc generate pessimistic code that is
>>> slower than code without __attribute__((packed)) when the data is in
>>> fact properly aligned.  And __attribute__((packed)) only works with GCC.
>>
>> Even __attribute__((packed)) usually does not allow arbitrarily aligned
>> data; it instructs the compiler to generate code that accesses misaligned
>> data in a special way. (I have already seen code where accessing a long
>> that would normally have been properly aligned caused a SIGBUS, because it
>> ended up misaligned by sitting inside a misaligned packed struct.)
>>
>> In short: misaligning stuff works on x86; everywhere else it is a disaster
>> waiting to happen. (And then people claim compiler bugs or broken
>> architectures, just because they do not know the basic rules of C.)
>>
>
> Given that the vast majority of user systems are x86-style ones, it's
> probably worth using this patch on such systems and sticking to a
> partially unrolled byte-by-byte comparison that finishes early on
> the rest of them.

I disagree. We have no guarantee that the SHA-1s are aligned on x86
either, and unaligned accesses are slow on x86.

I think it's much much cleaner to add an early-out on the first byte,
and hope that memcmp is optimized properly. If it's not, those
platforms can add an override to memcmp in git-compat-util and/or
compat/*.
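
(Such an override could follow the usual compat pattern - a hypothetical 
sketch, with invented names:

	/* in git-compat-util.h: */
	#ifdef USE_GIT_MEMCMP
	#define memcmp git_memcmp
	extern int git_memcmp(const void *a, const void *b, size_t n);
	#endif

with the actual implementation living in compat/.)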


* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:50         ` Erik Faye-Lund
@ 2011-04-28 10:10           ` Pekka Enberg
  2011-04-28 10:19             ` Erik Faye-Lund
  2011-04-28 10:19           ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Pekka Enberg @ 2011-04-28 10:10 UTC (permalink / raw)
  To: kusmabite
  Cc: Ingo Molnar, Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On 4/28/11 12:50 PM, Erik Faye-Lund wrote:
>> Alas, i have not seen these sha1 hash buffers being allocated unaligned (in my
>> very limited testing). In which spots are they allocated unaligned?
>
> Like I said above, it can happen when allocated on the stack. But it
> can also happen in malloc'ed structs, or in global variables. An array
> is aligned to the size of its base member type. But malloc does
> worst-case alignment, because it happens at run-time without
> type information.

I'd be very surprised if malloc() did "worst case alignment" - that'd 
suck pretty badly from a performance point of view. However, if you want 
*guarantees* about the alignment, there's memalign() for heap allocations.
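
For instance, a minimal sketch of a guaranteed-aligned heap allocation via
the standard posix_memalign() interface (the helper name is made up):

#include <stdlib.h>

static unsigned char *alloc_sha1_aligned(void)
{
	void *p;

	/* 8-byte alignment covers both 32-bit and 64-bit word accesses */
	if (posix_memalign(&p, 8, 20))
		return NULL;
	return p;
}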

Stack allocation alignment is a harder issue but I doubt it's as bad as 
you make it out to be. On x86, for example, the stack pointer is almost 
always 8- or 16-byte aligned with compilers whose writers have spent any 
time reading the Intel optimization manuals.

So yes, your statements are absolutely correct, but I strongly doubt it 
matters that much in practice unless you're using a really crappy 
compiler...

			Pekka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:50         ` Erik Faye-Lund
  2011-04-28 10:10           ` Pekka Enberg
@ 2011-04-28 10:19           ` Ingo Molnar
  2011-04-28 12:02             ` Nguyen Thai Ngoc Duy
                               ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28 10:19 UTC (permalink / raw)
  To: Erik Faye-Lund
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg


* Erik Faye-Lund <kusmabite@gmail.com> wrote:

> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
> >
> > * Erik Faye-Lund <kusmabite@gmail.com> wrote:
> >
> >> > Secondly, the combined speedup of the cached case with my two patches
> >> > appears to be more than 30% on my testbox so it's a very nifty win from two
> >> > relatively simple changes.
> >>
> >> That speed-up was on ONE test vector, no? There are a lot of other uses of
> >> hash-comparisons in Git, did you measure those?
> >
> > I picked this hash function because it showed up in the profile (see the
> > profile i posted). There's one other hash that mattered as well in the profile,
> > see the lookup_object() patch i sent yesterday.
> 
> My point was that the 30% improvement was in "git gc", which is not
> the only important use-case. How does this affect other git commands?

In a follow-up mail i measured git fsck, which showed a speedup too 
(despite depending mostly on external libraries for most of the processing).

If you'd like to see other things tested please suggest a testcase that you 
think uses these hashes extensively; i don't really know what the slowest (and 
affected) Git commands are - git gc is the one *i* notice as being pretty slow 
(for good reasons).

>> from the exception handler, others don't. So this patch is pretty much
> >> guaranteed to cause a crash in some setups.
> >
> > If unsigned char arrays are allocated unaligned then that's another bug, i
> > suspect, that should be fixed.
> 
> We can't. The compiler decides the alignment of variables on the stack. Some 
> compilers / compiler-setting pairs might align char-arrays, while others 
> might not.

Even if that were true it can be solved: you'd need to declare the sha1 not as 
a char array but as a u32 array or so. We do have control over the alignment 
of data structures, obviously.
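
A minimal sketch of that idea (an illustrative type, not what git actually
uses - the union forces 4-byte alignment while keeping the byte view):

#include <stdint.h>

union sha1_buf {
	uint32_t words[5];        /* forces uint32_t alignment */
	unsigned char bytes[20];  /* byte view for the existing code */
};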

> > Unaligned access on x86 is not free either - there are cycle penalties.
> >
> > Alas, i have not seen these sha1 hash buffers being allocated unaligned (in 
> > my very limited testing). In which spots are they allocated unaligned?
> 
> Like I said above, it can happen when allocated on the stack. But it can also 
> happen in malloc'ed structs, or in global variables. An array is aligned to 
> the size of it's base member type. But malloc does worst-case-allignment, 
> because it happens at run-time without type-information.

Well, should we really be ready to throw up our hands as if we didn't have 
control over the alignment of objects and accept suboptimal code as a 
result? We do have control over that.

In any case, i'll retract the null case as it really isn't called that often in 
the tests i've done - updated patch below - it simply falls back to hashcmp.

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/cache.h b/cache.h
index 2674f4c..39fa9cd 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,24 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	int i;
+
+	for (i = 0; i < 20; i++, sha1++, sha2++) {
+		if (*sha1 != *sha2)
+			return *sha1 - *sha2;
+	}
+
+	return 0;
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	return !hashcmp(sha1, null_sha1);
 }
+
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {
 	memcpy(sha_dst, sha_src, 20);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:10           ` Pekka Enberg
@ 2011-04-28 10:19             ` Erik Faye-Lund
  2011-04-28 10:30               ` Pekka Enberg
  0 siblings, 1 reply; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 10:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 12:10 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> On 4/28/11 12:50 PM, Erik Faye-Lund wrote:
>>>
>>> Alas, i have not seen these sha1 hash buffers being allocated unaligned
>>> (in my
>>> very limited testing). In which spots are they allocated unaligned?
>>
>> Like I said above, it can happen when allocated on the stack. But it
>> can also happen in malloc'ed structs, or in global variables. An array
>> is aligned to the size of its base member type. But malloc does
>> worst-case alignment, because it happens at run-time without
>> type information.
>
> I'd be very surprised if malloc() did "worst case alignment" - that'd suck
> pretty badly from a performance point of view.

From POSIX (I don't have K&R at hand, but it's also specified there):
"The pointer returned if the allocation succeeds shall be suitably
aligned so that it may be assigned to a pointer to any type of object
and then used to access such an object in the space allocated (until
the space is explicitly freed or reallocated)."

I put it in quotes because it's not the worst-case alignment you can
ever think of, but rather the strictest of your CPU's
alignment requirements. This is 4 bytes for most CPUs.

> Stack allocation alignment is a harder issue but I doubt it's as bad as you
> make it out to be. On x86, for example, stack pointer is almost always 8 or
> 16 byte aligned with compilers whose writers have spent any time reading the
> Intel optimization manuals.
>
> So yes, your statements are absolutely correct but I strongly doubt it
> matters that much in practice unless you're using a really crappy
> compiler...

I'm sorry, but the fact of the matter is that we don't write code
for one compiler, we try to please many. Crappy compilers are very
much out there in the wild, and we have to deal with it. So, we can't
depend on char-arrays being aligned to 32 bits. This code WILL break
on GCC for ARM, so it's not a theoretical issue at all. It will also
most likely break on GCC for x86 when optimizations are disabled.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:19             ` Erik Faye-Lund
@ 2011-04-28 10:30               ` Pekka Enberg
  2011-04-28 11:59                 ` Erik Faye-Lund
                                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Pekka Enberg @ 2011-04-28 10:30 UTC (permalink / raw)
  To: kusmabite
  Cc: Ingo Molnar, Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

Hi,

On 4/28/11 1:19 PM, Erik Faye-Lund wrote:
> On Thu, Apr 28, 2011 at 12:10 PM, Pekka Enberg<penberg@cs.helsinki.fi>  wrote:
>> On 4/28/11 12:50 PM, Erik Faye-Lund wrote:
>>>>
>>>> Alas, i have not seen these sha1 hash buffers being allocated unaligned
>>>> (in my
>>>> very limited testing). In which spots are they allocated unaligned?
>>>
>>> Like I said above, it can happen when allocated on the stack. But it
>>> can also happen in malloc'ed structs, or in global variables. An array
>>> is aligned to the size of its base member type. But malloc does
>>> worst-case alignment, because it happens at run-time without
>>> type information.
>>
>> I'd be very surprised if malloc() did "worst case alignment" - that'd suck
>> pretty badly from a performance point of view.
>
>  From POSIX (I don't have K&R at hand, but it's also specified there):
> "The pointer returned if the allocation succeeds shall be suitably
> aligned so that it may be assigned to a pointer to any type of object
> and then used to access such an object in the space allocated (until
> the space is explicitly freed or reallocated)."
>
> I put it in quotes because it's not the worst-case alignment you can
> ever think of, but rather the strictest of your CPU's
> alignment requirements. This is 4 bytes for most CPUs.

That's just the minimum guarantee! Why do you think modern malloc() 
implementations don't try *very* hard to provide the best possible alignment?

>> Stack allocation alignment is a harder issue but I doubt it's as bad as you
>> make it out to be. On x86, for example, stack pointer is almost always 8 or
>> 16 byte aligned with compilers whose writers have spent any time reading the
>> Intel optimization manuals.
>>
>> So yes, your statements are absolutely correct but I strongly doubt it
>> matters that much in practice unless you're using a really crappy
>> compiler...
>
> I'm sorry, but the fact of the matter is that we don't write code
> for one compiler, we try to please many. Crappy compilers are very
> much out there in the wild, and we have to deal with it. So, we can't
> depend on char-arrays being aligned to 32 bits. This code WILL break
> on GCC for ARM, so it's not a theoretical issue at all. It will also
> most likely break on GCC for x86 when optimizations are disabled.

Yes, ARM is a problem and I didn't try to claim otherwise. However, it's 
not "impossible to fix" as you say; memalign() can fix it.

But my comment was mostly about your claim that "we have no guarantee 
that the SHA-1s are aligned on x86 either, and unaligned accesses are 
slow on x86" which only matters in practice if you have a crappy 
compiler. And arguing for performance if you don't have a reasonable 
compiler is pretty uninteresting.

			Pekka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  6:36   ` Ingo Molnar
  2011-04-28  9:31     ` Jonathan Nieder
@ 2011-04-28 10:36     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28 10:36 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker, Pekka Enberg


* Ingo Molnar <mingo@elte.hu> wrote:

> Having said that, it would be nice if someone could test these two patches on a 
> modern AMD box, using the perf stat from here:
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
>   cd tools/perf/
>   make -j install
> 
> and do something like this to test git gc's performance:
> 
>   $ perf stat --sync --repeat 10 ./git gc
> 
> ... to see whether these speedups are generic, or somehow Intel CPU specific.

To follow up on this angle, i've now tested both 'git gc' performance 
improvement patches on an Opteron box:

 #
 # Before:
 #

 Performance counter stats for './git gc' (10 runs):

       6400.903957 task-clock               #    0.977 CPUs utilized            ( +-  0.09% )
             1,537 context-switches         #    0.000 M/sec                    ( +-  3.12% )
               118 CPU-migrations           #    0.000 M/sec                    ( +-  7.08% )
            40,879 page-faults              #    0.006 M/sec                    ( +-  0.03% )
    14,768,724,649 cycles                   #    2.307 GHz                      ( +-  0.09% )
     9,185,135,676 instructions             #    0.62  insns per cycle          ( +-  0.08% )
     1,883,860,739 branches                 #  294.312 M/sec                    ( +-  0.16% )
        99,800,253 branch-misses            #    5.30% of all branches          ( +-  0.34% )

        6.548348691  seconds time elapsed  ( +-  1.01% )

 #
 # After:
 #

 Performance counter stats for './git gc' (10 runs):

       6078.191492 task-clock               #    0.979 CPUs utilized            ( +-  0.09% )
             1,511 context-switches         #    0.000 M/sec                    ( +-  2.85% )
               104 CPU-migrations           #    0.000 M/sec                    ( +-  6.29% )
            41,410 page-faults              #    0.007 M/sec                    ( +-  0.04% )
    14,024,602,415 cycles                   #    2.307 GHz                      ( +-  0.09% )
     <not counted> stalled-cycles          
    11,364,757,635 instructions             #    0.81  insns per cycle          ( +-  0.06% )
     2,612,173,056 branches                 #  429.762 M/sec                    ( +-  0.09% )
       109,257,789 branch-misses            #    4.18% of all branches          ( +-  0.74% )

        6.209749362  seconds time elapsed  ( +-  0.51% )

So they bring about a +5.3% improvement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:30               ` Pekka Enberg
@ 2011-04-28 11:59                 ` Erik Faye-Lund
  2011-04-28 12:12                   ` Pekka Enberg
                                     ` (2 more replies)
  2011-04-28 12:16                 ` Tor Arntsen
  2011-04-28 12:17                 ` Andreas Ericsson
  2 siblings, 3 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 11:59 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 12:30 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hi,
>
> On 4/28/11 1:19 PM, Erik Faye-Lund wrote:
>>
>> On Thu, Apr 28, 2011 at 12:10 PM, Pekka Enberg<penberg@cs.helsinki.fi>
>>  wrote:
>>>
>>> On 4/28/11 12:50 PM, Erik Faye-Lund wrote:
>>>>>
>>>>> Alas, i have not seen these sha1 hash buffers being allocated unaligned
>>>>> (in my
>>>>> very limited testing). In which spots are they allocated unaligned?
>>>>
>>>> Like I said above, it can happen when allocated on the stack. But it
>>>> can also happen in malloc'ed structs, or in global variables. An array
>>>> is aligned to the size of its base member type. But malloc does
>>>> worst-case alignment, because it happens at run-time without
>>>> type information.
>>>
>>> I'd be very surprised if malloc() did "worst case alignment" - that'd
>>> suck
>>> pretty badly from a performance point of view.
>>
>>  From POSIX (I don't have K&R at hand, but it's also specified there):
>> "The pointer returned if the allocation succeeds shall be suitably
>> aligned so that it may be assigned to a pointer to any type of object
>> and then used to access such an object in the space allocated (until
>> the space is explicitly freed or reallocated)."
>>
>> I put it in quotes because it's not the worst-case alignment you can
>> ever think of, but rather the strictest of your CPU's
>> alignment requirements. This is 4 bytes for most CPUs.
>
> That's just the minimum guarantee! Why do you think modern malloc()
> implementations don't try *very* hard to provide best possible alignment?
>

Yes, it's the minimum alignment requirement. And yes, malloc
implementations try to keep the alignment. I don't think there's any
contradiction between what you and I said.

>>> Stack allocation alignment is a harder issue but I doubt it's as bad as
>>> you
>>> make it out to be. On x86, for example, stack pointer is almost always 8
>>> or
>>> 16 byte aligned with compilers whose writers have spent any time reading
>>> the
>>> Intel optimization manuals.
>>>
>>> So yes, your statements are absolutely correct but I strongly doubt it
>>> matters that much in practice unless you're using a really crappy
>>> compiler...
>>
>> I'm sorry, but the fact of the matter is that we don't write code
>> for one compiler, we try to please many. Crappy compilers are very
>> much out there in the wild, and we have to deal with it. So, we can't
>> depend on char-arrays being aligned to 32 bits. This code WILL break
>> on GCC for ARM, so it's not a theoretical issue at all. It will also
>> most likely break on GCC for x86 when optimizations are disabled.
>
> Yes, ARM is a problem and I didn't try to claim otherwise. However, it's not
> "impossible to fix" as you say; memalign() can fix it.

True, it's not impossible. It's just an insane thing to try to do, for
a very small gain. The important change was the early-out, and we can
get that while still using a platform-optimized memcmp.

> But my comment was mostly about your claim that "we have no guarantee that
> the SHA-1s are aligned on x86 either, and unaligned accesses are slow on
> x86" which only matters in practice if you have a crappy compiler. And
> arguing for performance if you don't have a reasonable compiler is pretty
> uninteresting.

I agree that not aligning arrays when optimizations are disabled isn't
a big problem on x86. But I don't think it makes sense to assume that
every reasonable compiler/compiler-setting pair for x86 aligns all
char-arrays. Aligning short arrays on the stack can lead to
sub-optimal caching for local variables, for instance. Alignment isn't
the only thing that matters.

But that point aside, we need an implementation that is both fast and
correct on all platforms; type-punning arrays is not the way to do it.

So my preference is still something like this. Call me conservative ;)

diff --git a/cache.h b/cache.h
index c730c58..8bc03c6 100644
--- a/cache.h
+++ b/cache.h
@@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	/* early out for fast mis-match */
+	if (*sha1 != *sha2)
+		return *sha1 - *sha2;
+
+	return memcmp(sha1 + 1, sha2 + 1, 19);
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	return !hashcmp(sha1, null_sha1);
 }
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:19           ` Ingo Molnar
@ 2011-04-28 12:02             ` Nguyen Thai Ngoc Duy
  2011-04-28 12:18             ` Erik Faye-Lund
  2011-04-28 20:20             ` Junio C Hamano
  2 siblings, 0 replies; 45+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-04-28 12:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Erik Faye-Lund, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
> If you'd like to see other things tested please suggest a testcase that you
> think uses these hashes extensively, i don't really know what the slowest (and
> affected) Git commands are - git gc is the one *i* notice as being pretty slow
> (for good reasons).

git clone
-- 
Duy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 11:59                 ` Erik Faye-Lund
@ 2011-04-28 12:12                   ` Pekka Enberg
  2011-04-28 12:36                   ` Jonathan Nieder
  2011-04-29  7:05                   ` Alex Riesen
  2 siblings, 0 replies; 45+ messages in thread
From: Pekka Enberg @ 2011-04-28 12:12 UTC (permalink / raw)
  To: kusmabite
  Cc: Ingo Molnar, Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On 4/28/11 2:59 PM, Erik Faye-Lund wrote:
> So my preference is still something like this. Call me conservative ;)
>
> diff --git a/cache.h b/cache.h
> index c730c58..8bc03c6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>   extern char *sha1_pack_index_name(const unsigned char *sha1);
>   extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>   extern const unsigned char null_sha1[20];
> -static inline int is_null_sha1(const unsigned char *sha1)
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>   {
> -	return !memcmp(sha1, null_sha1, 20);
> +	/* early out for fast mis-match */
> +	if (*sha1 != *sha2)
> +		return *sha1 - *sha2;
> +
> +	return memcmp(sha1 + 1, sha2 + 1, 19);
>   }
> -static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> +static inline int is_null_sha1(const unsigned char *sha1)
>   {
> -	return memcmp(sha1, sha2, 20);
> +	return !hashcmp(sha1, null_sha1);
>   }
>   static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
>   {

Yup, might be the most reasonable thing to do if it still speeds things up.

			Pekka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:30               ` Pekka Enberg
  2011-04-28 11:59                 ` Erik Faye-Lund
@ 2011-04-28 12:16                 ` Tor Arntsen
  2011-04-28 20:23                   ` H. Peter Anvin
  2011-04-28 12:17                 ` Andreas Ericsson
  2 siblings, 1 reply; 45+ messages in thread
From: Tor Arntsen @ 2011-04-28 12:16 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kusmabite, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 12:30, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hi,
>
> On 4/28/11 1:19 PM, Erik Faye-Lund wrote:
[alignment issues]
>
> Yes, ARM is a problem and I didn't try to claim otherwise. However, it's not
> "impossible to fix" as you say with memalign().

MIPS (e.g. SGI machines) also bus-errors on non-aligned data.

-Tor

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:30               ` Pekka Enberg
  2011-04-28 11:59                 ` Erik Faye-Lund
  2011-04-28 12:16                 ` Tor Arntsen
@ 2011-04-28 12:17                 ` Andreas Ericsson
  2011-04-28 12:28                   ` Erik Faye-Lund
  2 siblings, 1 reply; 45+ messages in thread
From: Andreas Ericsson @ 2011-04-28 12:17 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kusmabite, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On 04/28/2011 12:30 PM, Pekka Enberg wrote:
> Hi,
> 
> On 4/28/11 1:19 PM, Erik Faye-Lund wrote:
>> On Thu, Apr 28, 2011 at 12:10 PM, Pekka Enberg<penberg@cs.helsinki.fi> wrote:
>>> On 4/28/11 12:50 PM, Erik Faye-Lund wrote:
>>>>>
>>>>> Alas, i have not seen these sha1 hash buffers being allocated unaligned
>>>>> (in my
>>>>> very limited testing). In which spots are they allocated unaligned?
>>>>
>>>> Like I said above, it can happen when allocated on the stack. But it
>>>> can also happen in malloc'ed structs, or in global variables. An array
>>>> is aligned to the size of its base member type. But malloc does
>>>> worst-case alignment, because it happens at run-time without
>>>> type information.
>>>
>>> I'd be very surprised if malloc() did "worst case alignment" - that'd suck
>>> pretty badly from a performance point of view.
>>
>> From POSIX (I don't have K&R at hand, but it's also specified there):
>> "The pointer returned if the allocation succeeds shall be suitably
>> aligned so that it may be assigned to a pointer to any type of object
>> and then used to access such an object in the space allocated (until
>> the space is explicitly freed or reallocated)."
>>
>> I put it in quotes because it's not the worst-case alignment you can
>> ever think of, but rather the strictest of your CPU's
>> alignment requirements. This is 4 bytes for most CPUs.
> 
> That's just the minimum guarantee! Why do you think modern malloc() implementations don't try *very* hard to provide best possible alignment?
> 
>>> Stack allocation alignment is a harder issue but I doubt it's as bad as you
>>> make it out to be. On x86, for example, stack pointer is almost always 8 or
>>> 16 byte aligned with compilers whose writers have spent any time reading the
>>> Intel optimization manuals.
>>>
>>> So yes, your statements are absolutely correct but I strongly doubt it
>>> matters that much in practice unless you're using a really crappy
>>> compiler...
>>
>> I'm sorry, but the fact of the matter is that we don't write code
>> for one compiler, we try to please many. Crappy compilers are very
>> much out there in the wild, and we have to deal with it. So, we can't
>> depend on char-arrays being aligned to 32 bits. This code WILL break
>> on GCC for ARM, so it's not a theoretical issue at all. It will also
>> most likely break on GCC for x86 when optimizations are disabled.
> 
> Yes, ARM is a problem and I didn't try to claim otherwise. However, it's not "impossible to fix" as you say; memalign() can fix it.
> 

/* true when ptr sits on a word boundary */
#define is_aligned(ptr) (!((uintptr_t)(ptr) & (sizeof(void *) - 1)))
if (is_aligned(sha1) && is_aligned(sha2))
	return aligned_and_fast_hashcmp(sha1, sha2);

return memcmp(sha1, sha2, 20);

Problem solved for all architectures. Not as fast as the original
patch when we're lucky with alignment, but we cater to sucky
compilers and make the good ones go a lot faster. The really good
compilers that recognize "is it aligned?" checks will optimize the
is_aligned() checks away, or at least hint to the branch predictor
which path it should prefer.

Once again: bear in mind that an x86-style architecture with gcc is
almost certainly the most common combo for git users by a very wide
margin, so a 25-30% speedup for those users is pretty worthwhile.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:19           ` Ingo Molnar
  2011-04-28 12:02             ` Nguyen Thai Ngoc Duy
@ 2011-04-28 12:18             ` Erik Faye-Lund
  2011-04-28 20:20             ` Junio C Hamano
  2 siblings, 0 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Junio C Hamano, git, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
>
> * Erik Faye-Lund <kusmabite@gmail.com> wrote:
>
>> 2011/4/28 Ingo Molnar <mingo@elte.hu>:
>> >
>> > * Erik Faye-Lund <kusmabite@gmail.com> wrote:
>> >
>> >> > Secondly, the combined speedup of the cached case with my two patches
>> >> > appears to be more than 30% on my testbox so it's a very nifty win from two
>> >> > relatively simple changes.
>> >>
>> >> That speed-up was on ONE test vector, no? There are a lot of other uses of
>> >> hash-comparisons in Git, did you measure those?
>> >
>> > I picked this hash function because it showed up in the profile (see the
>> > profile i posted). There's one other hash that mattered as well in the profile,
>> > see the lookup_object() patch i sent yesterday.
>>
>> My point was that the 30% improvement was in "git gc", which is not
>> the only important use-case. How does this affect other git commands?
>
> In a followup mail i measured git fsck, which showed a speedup too. (despite
> being mostly dependent on external libraries to do most of the processing)
>
> If you'd like to see other things tested please suggest a testcase that you
> think uses these hashes extensively, i don't really know what the slowest (and
> affected) Git commands are - git gc is the one *i* notice as being pretty slow
> (for good reasons).
>

You only seem to test cases that iterate through the entire repo, and
I suspect that they might not be representative of all affected
use-cases.

So I'd love to see just a timing of something like "git
diff > /dev/null" (and some other goodies) in a hot-cache repo with
and without your patch. Perhaps even a timing of a test-suite run,
as it touches most git-commands...

>> We can't. The compiler decides the alignment of variables on the stack. Some
>> compilers / compiler-setting pairs might align char-arrays, while others
>> might not.
>
> Even if that were true it can be solved: you'd need to declare the sha1 not as
> a char array but as a u32 array or so. We do have control over the alignment
> of data structures, obviously.

True, but that's a very intrusive change. And it's not a bug-fix as
you indicated :)

>> Like I said above, it can happen when allocated on the stack. But it can also
>> happen in malloc'ed structs, or in global variables. An array is aligned to
>> the size of its base member type. But malloc does worst-case alignment,
>> because it happens at run-time without type information.
>
> Well, should we really be ready to throw up our hands as if we didn't have
> control over the alignment of objects and accept suboptimal code as a
> result? We do have control over that.

Yes, but it's better to pick the low-hanging fruit and see if we can get
99% of the performance increase without having to change all of the
code. See my previous e-mail (Message-ID:
<BANLkTik-uk-mpdHZxcz8Nem=nEzED_tuJg@mail.gmail.com>) for what I
suspect will do the trick without causing problems.

> In any case, i'll retract the null case as it really isn't called that often in
> the tests i've done - updated patch below - it simply falls back to hashcmp.

Nice, I think this makes sense. I already stole that hunk and
incorporated it in the diff I posted ;)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 12:17                 ` Andreas Ericsson
@ 2011-04-28 12:28                   ` Erik Faye-Lund
  0 siblings, 0 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 12:28 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Pekka Enberg, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 2:17 PM, Andreas Ericsson <ae@op5.se> wrote:
>>>> Stack allocation alignment is a harder issue but I doubt it's as bad as you
>>>> make it out to be. On x86, for example, stack pointer is almost always 8 or
>>>> 16 byte aligned with compilers whose writers have spent any time reading the
>>>> Intel optimization manuals.
>>>>
>>>> So yes, your statements are absolutely correct but I strongly doubt it
>>>> matters that much in practice unless you're using a really crappy
>>>> compiler...
>>>
>>> I'm sorry, but the fact of the matter is that we don't write code
>>> for one compiler, we try to please many. Crappy compilers are very
>>> much out there in the wild, and we have to deal with it. So, we can't
>>> depend on char-arrays being aligned to 32 bits. This code WILL break
>>> on GCC for ARM, so it's not a theoretical issue at all. It will also
>>> most likely break on GCC for x86 when optimizations are disabled.
>>
>> Yes, ARM is a problem and I didn't try to claim otherwise. However, it's not "impossible to fix" as you say; memalign() can fix it.
>>
>
> /* true when ptr sits on a word boundary */
> #define is_aligned(ptr) (!((uintptr_t)(ptr) & (sizeof(void *) - 1)))
> if (is_aligned(sha1) && is_aligned(sha2))
>        return aligned_and_fast_hashcmp(sha1, sha2);
>
> return memcmp(sha1, sha2, 20);
>
> Problem solved for all architectures. Not as fast as the original
> patch when we're lucky with alignment, but we cater to sucky
> compilers and make the good ones go a lot faster. The really good
> compilers that recognize "is it aligned?" checks will optimize the
> is_aligned() checks away, or at least hint to the branch predictor
> which path it should prefer.

I'd rather go with the do-not-introduce-the-problem-in-the-first-place
approach. As I've pointed out many times already, the vast majority of
the performance increase comes from the early-out in the first
iteration. Why not just special-case that ONE check, and do memcmp as
usual for the rest? The first iteration should catch 99.6% of all
mismatches (255 out of 256 mismatching hashes already differ in the
first byte), so it should have nice performance even for the unaligned
case. This gives us both speed and portability.

> Once again: bear in mind that an x86-style architecture with gcc is
> almost certainly the most common combo for git users by a very wide
> margin, so a 25-30% speedup for those users is pretty worthwhile.

Again, I never argued against speed. I argued against going a route
that is tricky to get right. Very reasonable alternatives were posted,
including Ingo's last patch.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 11:59                 ` Erik Faye-Lund
  2011-04-28 12:12                   ` Pekka Enberg
@ 2011-04-28 12:36                   ` Jonathan Nieder
  2011-04-28 12:40                     ` Erik Faye-Lund
  2011-04-28 13:37                     ` Ingo Molnar
  2011-04-29  7:05                   ` Alex Riesen
  2 siblings, 2 replies; 45+ messages in thread
From: Jonathan Nieder @ 2011-04-28 12:36 UTC (permalink / raw)
  To: Erik Faye-Lund
  Cc: Pekka Enberg, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

Hi,

A side note for amusement.

Erik Faye-Lund wrote:

> --- a/cache.h
> +++ b/cache.h
> @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>  extern char *sha1_pack_index_name(const unsigned char *sha1);
>  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>  extern const unsigned char null_sha1[20];
> -static inline int is_null_sha1(const unsigned char *sha1)
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -	return !memcmp(sha1, null_sha1, 20);
> +	/* early out for fast mis-match */
> +	if (*sha1 != *sha2)
> +		return *sha1 - *sha2;
> +
> +	return memcmp(sha1 + 1, sha2 + 1, 19);
>  }

On the off-chance that sha1 and sha2 are nicely aligned, a more
redundant

	if (*sha1 != *sha2)
		return *sha1 - *sha2;

	return memcmp(sha1, sha2, 20);

would take advantage of that (yes, this is just superstition, but it
somehow seems comforting anyway).

Anyway, assuming it does not kill performance for some reason, the
above sounds good to me.  Thanks for spelling it out.

Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 12:36                   ` Jonathan Nieder
@ 2011-04-28 12:40                     ` Erik Faye-Lund
  2011-04-28 13:37                     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 12:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Pekka Enberg, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 2:36 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Hi,
>
> A side note for amusement.
>
> Erik Faye-Lund wrote:
>
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>>  extern char *sha1_pack_index_name(const unsigned char *sha1);
>>  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>>  extern const unsigned char null_sha1[20];
>> -static inline int is_null_sha1(const unsigned char *sha1)
>> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>>  {
>> -     return !memcmp(sha1, null_sha1, 20);
>> +     /* early out for fast mis-match */
>> +     if (*sha1 != *sha2)
>> +             return *sha1 - *sha2;
>> +
>> +     return memcmp(sha1 + 1, sha2 + 1, 19);
>>  }
>
> On the off-chance that sha1 and sha2 are nicely aligned, a more
> redundant
>
>        if (*sha1 != *sha2)
>                return *sha1 - *sha2;
>
>        return memcmp(sha1, sha2, 20);
>
> would take advantage of that (yes, this is just superstition, but it
> somehow seems comforting anyway).

Good point, I think that's an improvement.

> Anyway, assuming it does not kill performance for some reason, the
> above sounds good to me.  Thanks for spelling it out.

If it does, then we haven't fully understood where it came from in the
first place, no? :P

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 12:36                   ` Jonathan Nieder
  2011-04-28 12:40                     ` Erik Faye-Lund
@ 2011-04-28 13:37                     ` Ingo Molnar
  2011-04-28 15:14                       ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28 13:37 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Erik Faye-Lund, Pekka Enberg, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker


* Jonathan Nieder <jrnieder@gmail.com> wrote:

> Hi,
> 
> A side note for amusement.
> 
> Erik Faye-Lund wrote:
> 
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
> >  extern char *sha1_pack_index_name(const unsigned char *sha1);
> >  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
> >  extern const unsigned char null_sha1[20];
> > -static inline int is_null_sha1(const unsigned char *sha1)
> > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> >  {
> > -	return !memcmp(sha1, null_sha1, 20);
> > +	/* early out for fast mis-match */
> > +	if (*sha1 != *sha2)
> > +		return *sha1 - *sha2;
> > +
> > +	return memcmp(sha1 + 1, sha2 + 1, 19);
> >  }
> 
> On the off-chance that sha1 and sha2 are nicely aligned, a more
> redundant
> 
> 	if (*sha1 != *sha2)
> 		return *sha1 - *sha2;
> 
> 	return memcmp(sha1, sha2, 20);
> 
> would take advantage of that (yes, this is just superstition, but it
> somehow seems comforting anyway).

Your variant also makes the code slightly more compact as the sha1+1 and sha2+1 
addresses do not have to be computed. I'll re-test and resend this variant.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 13:37                     ` Ingo Molnar
@ 2011-04-28 15:14                       ` Ingo Molnar
  2011-04-28 16:00                         ` Erik Faye-Lund
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28 15:14 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Erik Faye-Lund, Pekka Enberg, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker


* Ingo Molnar <mingo@elte.hu> wrote:

> * Jonathan Nieder <jrnieder@gmail.com> wrote:
> 
> > Hi,
> > 
> > A side note for amusement.
> > 
> > Erik Faye-Lund wrote:
> > 
> > > --- a/cache.h
> > > +++ b/cache.h
> > > @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
> > >  extern char *sha1_pack_index_name(const unsigned char *sha1);
> > >  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
> > >  extern const unsigned char null_sha1[20];
> > > -static inline int is_null_sha1(const unsigned char *sha1)
> > > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> > >  {
> > > -	return !memcmp(sha1, null_sha1, 20);
> > > +	/* early out for fast mis-match */
> > > +	if (*sha1 != *sha2)
> > > +		return *sha1 - *sha2;
> > > +
> > > +	return memcmp(sha1 + 1, sha2 + 1, 19);
> > >  }
> > 
> > On the off-chance that sha1 and sha2 are nicely aligned, a more
> > redundant
> > 
> > 	if (*sha1 != *sha2)
> > 		return *sha1 - *sha2;
> > 
> > 	return memcmp(sha1, sha2, 20);
> > 
> > would take advantage of that (yes, this is just superstition, but it
> > somehow seems comforting anyway).
> 
> Your variant also makes the code slightly more compact as the sha1+1 and sha2+1 
> addresses do not have to be computed. I'll re-test and resend this variant.

Seems to perform measurably worse:

 #
 # Open-coded loop:
 #
 Performance counter stats for './git gc' (10 runs):

       2358.560100 task-clock               #    0.763 CPUs utilized            ( +-  0.06% )
             1,870 context-switches         #    0.001 M/sec                    ( +-  3.09% )
               170 CPU-migrations           #    0.000 M/sec                    ( +-  3.54% )
            38,230 page-faults              #    0.016 M/sec                    ( +-  0.03% )
     7,513,529,543 cycles                   #    3.186 GHz                      ( +-  0.06% )
     1,634,103,128 stalled-cycles           #   21.75% of all cycles are idle   ( +-  0.28% )
    11,068,971,207 instructions             #    1.47  insns per cycle        
                                            #    0.15  stalled cycles per insn  ( +-  0.04% )
     2,487,656,519 branches                 # 1054.735 M/sec                    ( +-  0.03% )
        59,233,604 branch-misses            #    2.38% of all branches          ( +-  0.09% )

        3.092183093  seconds time elapsed  ( +-  3.49% )

 #
 # Front test + memcmp:
 #
 Performance counter stats for './git gc' (10 runs):

       2723.468639 task-clock               #    0.833 CPUs utilized            ( +-  0.22% )
             1,751 context-switches         #    0.001 M/sec                    ( +-  2.02% )
               167 CPU-migrations           #    0.000 M/sec                    ( +-  1.23% )
            38,230 page-faults              #    0.014 M/sec                    ( +-  0.03% )
     8,684,682,538 cycles                   #    3.189 GHz                      ( +-  0.21% )
     2,062,906,208 stalled-cycles           #   23.75% of all cycles are idle   ( +-  0.60% )
     9,019,624,641 instructions             #    1.04  insns per cycle        
                                            #    0.23  stalled cycles per insn  ( +-  0.04% )
     1,771,179,402 branches                 #  650.340 M/sec                    ( +-  0.04% )
        75,026,810 branch-misses            #    4.24% of all branches          ( +-  0.04% )

        3.271415104  seconds time elapsed  ( +-  1.97% )

So i think the open-coded loop variant i posted is faster.

The key observation is that there are two cases that matter to performance:

 - the hashes are different: in this case the front test catches 99% of the cases
 - the hashes are *equal*: in this case the open-coded loop performs better than the memcmp

My patch addresses both cases.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 15:14                       ` Ingo Molnar
@ 2011-04-28 16:00                         ` Erik Faye-Lund
  2011-04-28 20:32                           ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Erik Faye-Lund @ 2011-04-28 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jonathan Nieder, Pekka Enberg, Junio C Hamano, git,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker

On Thu, Apr 28, 2011 at 5:14 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> * Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > A side note for amusement.
>> >
>> > Erik Faye-Lund wrote:
>> >
>> > > --- a/cache.h
>> > > +++ b/cache.h
>> > > @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>> > >  extern char *sha1_pack_index_name(const unsigned char *sha1);
>> > >  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>> > >  extern const unsigned char null_sha1[20];
>> > > -static inline int is_null_sha1(const unsigned char *sha1)
>> > > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>> > >  {
>> > > - return !memcmp(sha1, null_sha1, 20);
>> > > + /* early out for fast mis-match */
>> > > + if (*sha1 != *sha2)
>> > > +         return *sha1 - *sha2;
>> > > +
>> > > + return memcmp(sha1 + 1, sha2 + 1, 19);
>> > >  }
>> >
>> > On the off-chance that sha1 and sha2 are nicely aligned, a more
>> > redundant
>> >
>> >     if (*sha1 != *sha2)
>> >             return *sha1 - *sha2;
>> >
>> >     return memcmp(sha1, sha2, 20);
>> >
>> > would take advantage of that (yes, this is just superstition, but it
>> > somehow seems comforting anyway).
>>
>> Your variant also makes the code slightly more compact as the sha1+1 and sha2+1
>> addresses do not have to be computed. I'll re-test and resend this variant.
>
> Seems to perform measurably worse:
>
>  #
>  # Open-coded loop:
>  #
>  Performance counter stats for './git gc' (10 runs):
>
>       2358.560100 task-clock               #    0.763 CPUs utilized            ( +-  0.06% )
>             1,870 context-switches         #    0.001 M/sec                    ( +-  3.09% )
>               170 CPU-migrations           #    0.000 M/sec                    ( +-  3.54% )
>            38,230 page-faults              #    0.016 M/sec                    ( +-  0.03% )
>     7,513,529,543 cycles                   #    3.186 GHz                      ( +-  0.06% )
>     1,634,103,128 stalled-cycles           #   21.75% of all cycles are idle   ( +-  0.28% )
>    11,068,971,207 instructions             #    1.47  insns per cycle
>                                            #    0.15  stalled cycles per insn  ( +-  0.04% )
>     2,487,656,519 branches                 # 1054.735 M/sec                    ( +-  0.03% )
>        59,233,604 branch-misses            #    2.38% of all branches          ( +-  0.09% )
>
>        3.092183093  seconds time elapsed  ( +-  3.49% )
>
>  #
>  # Front test + memcmp:
>  #
>  Performance counter stats for './git gc' (10 runs):
>
>       2723.468639 task-clock               #    0.833 CPUs utilized            ( +-  0.22% )
>             1,751 context-switches         #    0.001 M/sec                    ( +-  2.02% )
>               167 CPU-migrations           #    0.000 M/sec                    ( +-  1.23% )
>            38,230 page-faults              #    0.014 M/sec                    ( +-  0.03% )
>     8,684,682,538 cycles                   #    3.189 GHz                      ( +-  0.21% )
>     2,062,906,208 stalled-cycles           #   23.75% of all cycles are idle   ( +-  0.60% )
>     9,019,624,641 instructions             #    1.04  insns per cycle
>                                            #    0.23  stalled cycles per insn  ( +-  0.04% )
>     1,771,179,402 branches                 #  650.340 M/sec                    ( +-  0.04% )
>        75,026,810 branch-misses            #    4.24% of all branches          ( +-  0.04% )
>
>        3.271415104  seconds time elapsed  ( +-  1.97% )
>
> So i think the open-coded loop variant i posted is faster.
>
> The key observation is that there are two cases that matter to performance:
>
>  - the hashes are different: in this case the front test catches 99% of the cases
>  - the hashes are *equal*: in this case the open-coded loop performs better than the memcmp
>
> My patch addresses both cases.
>

Thanks. I also timed on my end (on Windows), and I came to the same
conclusion (but the improvement of your original was somewhat smaller
on my end; could be due to the test-case). It seems like the early-out
wasn't the only reason your original patch performed faster. It could
be that memcmp (probably) didn't get inlined, and the extra function
call outweighs the benefit. Or there's something else going on that
affects both glibc and msvcrt.dll.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:37       ` Ingo Molnar
  2011-04-28  9:50         ` Erik Faye-Lund
@ 2011-04-28 16:36         ` Dmitry Potapov
  1 sibling, 0 replies; 45+ messages in thread
From: Dmitry Potapov @ 2011-04-28 16:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Erik Faye-Lund, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

2011/4/28 Ingo Molnar <mingo@elte.hu>:
>
> If unsigned char arrays are allocated unaligned then that's another bug, i
> suspect, that should be fixed. Unaligned access on x86 is not free either -
> there are cycle penalties.

Unsigned char arrays can be stored unaligned. Basically, it depends on
the context in which they were declared. If a preceding field in some
structure ends unaligned then the byte array will start unaligned.
For example:

struct foo
{
   char ch;                /* offset 0 */
   unsigned char sha1[20]; /* offset 1: char needs no padding, so this is word-unaligned */
};

The same on the stack, except the compiler may pack them as it wishes.
So, you have no guarantee here. If you want to make sure all SHA-1s are
aligned properly, sha1 should be declared as uint32_t: 'uint32_t sha1[5]'.
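
For contrast, a sketch of the aligned variant (an illustrative struct,
not one from git):

#include <stdint.h>

struct foo_aligned
{
   char ch;          /* offset 0 */
   uint32_t sha1[5]; /* offset 4: the compiler inserts 3 bytes of padding */
};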


Dmitry

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28  9:55         ` Erik Faye-Lund
@ 2011-04-28 20:19           ` H. Peter Anvin
  0 siblings, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2011-04-28 20:19 UTC (permalink / raw)
  To: kusmabite
  Cc: Andreas Ericsson, Bernhard R. Link, Ralf Baechle, Junio C Hamano,
	Ingo Molnar, git, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker,
	Pekka Enberg

On 04/28/2011 02:55 AM, Erik Faye-Lund wrote:
> 
> I disagree. We have no guarantee that the SHA-1s are aligned on x86
> either, and unaligned accesses are slow on x86.
> 

Not particularly, especially not statistically.  Furthermore, for a
sizable chunk like a SHA-1, not all accesses will have the cross-grain
penalties that you can sometimes have.

> I think it's much much cleaner to add an early-out on the first byte,
> and hope that memcmp is optimized properly. If it's not, those
> platforms can add an override to memcmp in git-compat-util and/or
> compat/*.

Overall, doing an architecture-specific optimization library, especially for
widely used architectures like x86, is not a bad idea.

	-hpa

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 10:19           ` Ingo Molnar
  2011-04-28 12:02             ` Nguyen Thai Ngoc Duy
  2011-04-28 12:18             ` Erik Faye-Lund
@ 2011-04-28 20:20             ` Junio C Hamano
  2 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2011-04-28 20:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Erik Faye-Lund, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker, Pekka Enberg

Ingo Molnar <mingo@elte.hu> writes:

> In any case, i'll retract the null case as it really isnt called that often in 
> the tests i've done - updated patch below - it simply falls back on to hashcmp.
>
> Thanks,
>
> 	Ingo
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>

Thanks, will queue this version.

> diff --git a/cache.h b/cache.h
> index 2674f4c..39fa9cd 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -675,14 +675,24 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>  extern char *sha1_pack_index_name(const unsigned char *sha1);
>  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>  extern const unsigned char null_sha1[20];
> -static inline int is_null_sha1(const unsigned char *sha1)
> +
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -	return !memcmp(sha1, null_sha1, 20);
> +	int i;
> +
> +	for (i = 0; i < 20; i++, sha1++, sha2++) {
> +		if (*sha1 != *sha2)
> +			return *sha1 - *sha2;
> +	}
> +
> +	return 0;
>  }
> -static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
> +
> +static inline int is_null_sha1(const unsigned char *sha1)
>  {
> -	return memcmp(sha1, sha2, 20);
> +	return !hashcmp(sha1, null_sha1);
>  }
> +
>  static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
>  {
>  	memcpy(sha_dst, sha_src, 20);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 12:16                 ` Tor Arntsen
@ 2011-04-28 20:23                   ` H. Peter Anvin
  0 siblings, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2011-04-28 20:23 UTC (permalink / raw)
  To: Tor Arntsen
  Cc: Pekka Enberg, kusmabite, Ingo Molnar, Junio C Hamano, git,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On 04/28/2011 05:16 AM, Tor Arntsen wrote:
> On Thu, Apr 28, 2011 at 12:30, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>> Hi,
>>
>> On 4/28/11 1:19 PM, Erik Faye-Lund wrote:
> [alignment issues]
>>
>> Yes, ARM is a problem and I didn't try to claim otherwise. However, it's not
>> "impossible to fix" as you say with memalign().
> 
> MIPS (e.g. SGI machines) also bus-errors on non-aligned data.

On the other hand, MIPS has efficient instructions for accessing
known-unaligned data.

	-hpa

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 16:00                         ` Erik Faye-Lund
@ 2011-04-28 20:32                           ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2011-04-28 20:32 UTC (permalink / raw)
  To: Erik Faye-Lund
  Cc: Jonathan Nieder, Pekka Enberg, Junio C Hamano, git,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Frédéric Weisbecker


* Erik Faye-Lund <kusmabite@gmail.com> wrote:

> Thanks. I also timed on my end (on Windows), and I came to the same 
> conclusion (but the improvement of your original was somewhat smaller on my 
> end; could be due to the test-case). It seems like the early-out wasn't the 
> only reason your original patch performed faster. It could be that memcmp 
> (probably) didn't get inlined, and the extra function call outweighs the 
> benefit. [...]

Function calls aren't that heavy really. My measurements identified the 
following effects:

 - profiling of stalled cycles clearly pinpointed the REP CMPS string 
   instruction (the 'repz cmpsb' in the annotated profile).

 - the patched code had fewer branch-misses - the clearer and inlined open-coded 
   loop is probably easier for the CPU to speculate along - while REP CMPS 
   string ops are 'opaque' and the result might be harder to speculate.

So i think the main benefit of my patch is that it avoids the REP CMPS 
instruction.

Thanks,

	Ingo
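
One way to check the branch-miss effect directly is to count branch events 
for both builds, along the lines of the earlier measurements (a sketch; the 
exact event names depend on the perf version in use):

 $ perf stat -e branches,branch-misses --sync --repeat 10 ./git gc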

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-28 11:59                 ` Erik Faye-Lund
  2011-04-28 12:12                   ` Pekka Enberg
  2011-04-28 12:36                   ` Jonathan Nieder
@ 2011-04-29  7:05                   ` Alex Riesen
  2011-04-29 16:24                     ` H. Peter Anvin
  2 siblings, 1 reply; 45+ messages in thread
From: Alex Riesen @ 2011-04-29  7:05 UTC (permalink / raw)
  To: kusmabite
  Cc: Pekka Enberg, Ingo Molnar, Junio C Hamano, git, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On Thu, Apr 28, 2011 at 13:59, Erik Faye-Lund <kusmabite@gmail.com> wrote:
> diff --git a/cache.h b/cache.h
> index c730c58..8bc03c6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>  extern char *sha1_pack_index_name(const unsigned char *sha1);
>  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>  extern const unsigned char null_sha1[20];
> -static inline int is_null_sha1(const unsigned char *sha1)
> +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>  {
> -       return !memcmp(sha1, null_sha1, 20);
> +       /* early out for fast mismatch */
> +       if (*sha1 != *sha2)
> +               return *sha1 - *sha2;

Can one take advantage of common subexpression elimination here?
Like this:

+       if (*sha1 - *sha2)
+               return *sha1 - *sha2;

* Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
  2011-04-29  7:05                   ` Alex Riesen
@ 2011-04-29 16:24                     ` H. Peter Anvin
  0 siblings, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2011-04-29 16:24 UTC (permalink / raw)
  To: Alex Riesen
  Cc: kusmabite, Pekka Enberg, Ingo Molnar, Junio C Hamano, git,
	Thomas Gleixner, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Frédéric Weisbecker

On 04/29/2011 12:05 AM, Alex Riesen wrote:
> 
> Can one take advantage of common subexpression elimination here?
> Like this:
> 
> +       if (*sha1 - *sha2)
> +               return *sha1 - *sha2;
> 

Or even clue the compiler in explicitly:

	int delta = *sha1 - *sha2;
	if (delta)
		return delta;

For x86-specific optimization there are newer SSE instructions that can do 
bulk compares; those may be faster when specialized for this fixed-length 
compare, though of course they require a new enough processor.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
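
As an illustration of the SSE direction: SSE2 only offers byte-wise equality 
compares, so it can answer "equal or not" quickly, but callers that need the 
sign of hashcmp() for ordering (e.g. the binary search) still need a 
fallback. A sketch using standard SSE2 intrinsics - the function name is 
illustrative and untested here:

	#include <emmintrin.h>	/* SSE2 intrinsics */
	#include <string.h>

	/* _mm_cmpeq_epi8() sets each byte to 0xff where the inputs match;
	 * if all 16 lanes match, the movemask is 0xffff, and only the
	 * trailing 4 bytes of the 20-byte hash remain to be checked. */
	static inline int hasheq_sse2(const unsigned char *sha1,
				      const unsigned char *sha2)
	{
		__m128i a = _mm_loadu_si128((const __m128i *)sha1);
		__m128i b = _mm_loadu_si128((const __m128i *)sha2);

		if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xffff)
			return 0;
		return !memcmp(sha1 + 16, sha2 + 16, 4);
	}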

Thread overview: 45+ messages
2011-04-27 22:51 [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
2011-04-27 23:10 ` Ingo Molnar
2011-04-27 23:18 ` Jonathan Nieder
2011-04-28  6:36   ` Ingo Molnar
2011-04-28  9:31     ` Jonathan Nieder
2011-04-28 10:36     ` Ingo Molnar
2011-04-28  9:32   ` Dmitry Potapov
2011-04-27 23:32 ` Junio C Hamano
2011-04-28  0:35   ` Ralf Baechle
2011-04-28  8:18     ` Bernhard R. Link
2011-04-28  9:42       ` Andreas Ericsson
2011-04-28  9:55         ` Erik Faye-Lund
2011-04-28 20:19           ` H. Peter Anvin
2011-04-28  6:27   ` Ingo Molnar
2011-04-28  9:17     ` Erik Faye-Lund
2011-04-28  9:33       ` Ingo Molnar
2011-04-28  9:37       ` Ingo Molnar
2011-04-28  9:50         ` Erik Faye-Lund
2011-04-28 10:10           ` Pekka Enberg
2011-04-28 10:19             ` Erik Faye-Lund
2011-04-28 10:30               ` Pekka Enberg
2011-04-28 11:59                 ` Erik Faye-Lund
2011-04-28 12:12                   ` Pekka Enberg
2011-04-28 12:36                   ` Jonathan Nieder
2011-04-28 12:40                     ` Erik Faye-Lund
2011-04-28 13:37                     ` Ingo Molnar
2011-04-28 15:14                       ` Ingo Molnar
2011-04-28 16:00                         ` Erik Faye-Lund
2011-04-28 20:32                           ` Ingo Molnar
2011-04-29  7:05                   ` Alex Riesen
2011-04-29 16:24                     ` H. Peter Anvin
2011-04-28 12:16                 ` Tor Arntsen
2011-04-28 20:23                   ` H. Peter Anvin
2011-04-28 12:17                 ` Andreas Ericsson
2011-04-28 12:28                   ` Erik Faye-Lund
2011-04-28 10:19           ` Ingo Molnar
2011-04-28 12:02             ` Nguyen Thai Ngoc Duy
2011-04-28 12:18             ` Erik Faye-Lund
2011-04-28 20:20             ` Junio C Hamano
2011-04-28 16:36         ` Dmitry Potapov
2011-04-28  8:52 ` Dmitry Potapov
2011-04-28  9:11   ` Ingo Molnar
2011-04-28  9:31     ` Dmitry Potapov
2011-04-28  9:44       ` Ingo Molnar
2011-04-28  9:38     ` Ingo Molnar

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).