From: Ingo Molnar <mingo@elte.hu>
To: git@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Peter Zijlstra" <a.p.zijlstra@chello.nl>,
	"Arnaldo Carvalho de Melo" <acme@redhat.com>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Pekka Enberg" <penberg@cs.helsinki.fi>
Subject: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
Date: Thu, 28 Apr 2011 00:51:14 +0200
Message-ID: <20110427225114.GA16765@elte.hu>


Looking at the stalled-cycles profile of a cached 'git gc' i noticed the 
find_pack_entry_one() entry:

 $ perf record -e stalled-cycles -F 10000 ./git gc
 $ perf report

# Events: 26K stalled-cycles
#
# Overhead     Command          Shared Object                                     Symbol
# ........  ..........  .....................  .........................................
#
    26.07%         git  git                    [.] lookup_object
    10.22%         git  libz.so.1.2.5          [.] 0xc43a          
     7.08%         git  libz.so.1.2.5          [.] inflate
     6.63%         git  git                    [.] find_pack_entry_one
     5.37%         git  [kernel.kallsyms]      [k] do_raw_spin_lock
     4.03%         git  git                    [.] lookup_blob
     3.09%         git  libc-2.13.90.so        [.] __strlen_sse42
     2.81%         git  libc-2.13.90.so        [.] __memcpy_ssse3_back

Annotation shows:

 $ perf annotate find_pack_entry_one

 Percent |      Source code & Disassembly of git
------------------------------------------------
         :
     ...
         :                      int cmp = hashcmp(index + mi * stride, sha1);
    0.90 :        4b9264:       89 ee                   mov    %ebp,%esi
    0.45 :        4b9266:       41 0f af f2             imul   %r10d,%esi
    2.86 :        4b926a:       4c 01 de                add    %r11,%rsi
   53.34 :        4b926d:       f3 a6                   repz cmpsb %es:(%rdi),%ds:(%rsi)
   14.37 :        4b926f:       0f 92 c0                setb   %al
    5.78 :        4b9272:       41 0f 97 c4             seta   %r12b
    1.52 :        4b9276:       41 28 c4                sub    %al,%r12b

Most of the overhead is in hashcmp(), which uses memcmp(), which GCC compiles 
down to the assembly string operations seen above.

But we know that hashcmp() compares hashes, and if two hashes do not match, 
the first byte alone will already differ in over 99% of the cases (255/256 
for effectively random hashes).

So i tried the patch below: instead of relying on GCC emitting the string 
ops, i used an open-coded loop for this relatively short comparison, which in 
99% of the cases does not need to go beyond the first byte.

While at it i also open-coded the is_null_sha1() check: instead of comparing 
the hash to null_sha1[] byte by byte, we can use the sha1 value directly and 
do much cheaper 64-bit and 32-bit comparisons.

The results were rather surprising:

 #
 # Before:
 #

 $ perf stat --sync --repeat 10 ./git gc

 Performance counter stats for './git gc' (10 runs):

       2771.119892 task-clock               #    0.863 CPUs utilized            ( +-  0.16% )
             1,813 context-switches         #    0.001 M/sec                    ( +-  3.06% )
               167 CPU-migrations           #    0.000 M/sec                    ( +-  2.92% )
            39,210 page-faults              #    0.014 M/sec                    ( +-  0.26% )
     8,828,405,654 cycles                   #    3.186 GHz                      ( +-  0.13% )
     2,102,083,909 stalled-cycles           #   23.81% of all cycles are idle   ( +-  0.52% )
     8,821,931,740 instructions             #    1.00  insns per cycle        
                                            #    0.24  stalled cycles per insn  ( +-  0.04% )
     1,750,408,175 branches                 #  631.661 M/sec                    ( +-  0.04% )
        74,612,120 branch-misses            #    4.26% of all branches          ( +-  0.07% )

        3.211098537  seconds time elapsed  ( +-  1.52% )

[ Note: the --sync option to perf stat runs a sync() before each 'git gc' 
  test-run; this reduces the noise in the 'elapsed time' numbers enormously. ]

 #
 # After:
 #

 $ perf stat --sync --repeat 10 ./git gc

 Performance counter stats for './git gc' (10 runs):

       2349.498022 task-clock               #    0.807 CPUs utilized            ( +-  0.15% )
             1,842 context-switches         #    0.001 M/sec                    ( +-  2.50% )
               164 CPU-migrations           #    0.000 M/sec                    ( +-  3.67% )
            39,350 page-faults              #    0.017 M/sec                    ( +-  0.06% )
     7,484,317,230 cycles                   #    3.185 GHz                      ( +-  0.15% )
     1,577,673,341 stalled-cycles           #   21.08% of all cycles are idle   ( +-  0.67% )
    11,067,826,786 instructions             #    1.48  insns per cycle        
                                            #    0.14  stalled cycles per insn  ( +-  0.02% )
     2,489,157,909 branches                 # 1059.442 M/sec                    ( +-  0.02% )
        59,384,019 branch-misses            #    2.39% of all branches          ( +-  0.22% )

        2.910829134  seconds time elapsed  ( +-  1.39% )

'git gc' got faster by 18%! Interestingly, about a quarter of all prior 
stalled cycles disappeared (2.10G -> 1.58G): most of them turned into actual 
cycle count savings and speedups.

So it's rather clear that a memcmp() based on string assembly instructions is 
suboptimal for short comparisons like this: 'repz cmpsb' has quite a bit of 
setup latency and the CPU is idling around during that time.

(I ran this on a Nehalem system, so it's a rather fast and modern Intel CPU.)
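
[ For completeness, below is a standalone micro-benchmark sketch - not part of 
  the patch and untested - that could be used to reproduce the effect in 
  isolation. Whether memcmp() gets expanded into 'repz cmpsb' at all depends 
  on the compiler version and options, and the file name, iteration count and 
  timing method are arbitrary choices, so the numbers are only illustrative: ]

 /*
  * bench-hashcmp.c: illustrative sketch only. Times a 20-byte memcmp()
  * against an open-coded byte loop, on two buffers that differ in byte 0.
  *
  * Build: gcc -O2 -o bench-hashcmp bench-hashcmp.c  (older glibc: add -lrt)
  */
 #include <stdio.h>
 #include <string.h>
 #include <time.h>

 static int open_coded_cmp(const unsigned char *a, const unsigned char *b)
 {
 	int i;

 	for (i = 0; i < 20; i++, a++, b++) {
 		if (*a != *b)
 			return *a < *b ? -1 : +1;
 	}
 	return 0;
 }

 static double now_sec(void)
 {
 	struct timespec ts;

 	clock_gettime(CLOCK_MONOTONIC, &ts);
 	return ts.tv_sec + ts.tv_nsec / 1e9;
 }

 int main(void)
 {
 	unsigned char a[20] = { 0x01 }, b[20] = { 0x02 };
 	long i, loops = 100000000;
 	volatile long sink = 0;
 	double t0, t1, t2;

 	t0 = now_sec();
 	for (i = 0; i < loops; i++) {
 		/* compiler barrier: keep the comparison from being hoisted */
 		asm volatile("" : : : "memory");
 		sink += memcmp(a, b, 20);
 	}
 	t1 = now_sec();
 	for (i = 0; i < loops; i++) {
 		asm volatile("" : : : "memory");
 		sink += open_coded_cmp(a, b);
 	}
 	t2 = now_sec();

 	printf("memcmp():   %6.2f ns/call\n", (t1 - t0) * 1e9 / loops);
 	printf("open-coded: %6.2f ns/call\n", (t2 - t1) * 1e9 / loops);

 	return 0;
 }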

Also note another very interesting detail: the number of branch misses went 
way down, well beyond the measurement noise:

 before:   74,612,120 branch-misses            #    4.26% of all branches          ( +-  0.07% )
  after:   59,384,019 branch-misses            #    2.39% of all branches          ( +-  0.22% )

One theory would be that the open-coded loop is easier for the CPU to 
speculate along, so it produces fewer branch misses.

The number of instructions and branches increased. This is mostly because the 
PMU counts a complex 'repz cmpsb' as a single instruction that issues many 
uops, while the open-coded loop consists of many separate instructions - yet 
both versions execute roughly the same number of uops.
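
[ One way to cross-check the uop behaviour would be to also count issued and 
  retired uops with perf stat. The exact event names depend on the CPU and on 
  the perf version (raw event codes via '-e rNNNN' are the fallback), so the 
  line below is only a sketch, assuming the symbolic aliases are available: ]

 $ perf stat -e instructions -e uops_issued.any -e uops_retired.any ./git gc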

I suspect 'git fsck' got faster as well, but i have not measured that.

There's more string op use in the Git sha1 code, but this was the lowest 
hanging fruit.

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/cache.h b/cache.h
index 2674f4c..c5a54fb 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1);
 extern char *sha1_pack_index_name(const unsigned char *sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
-	return !memcmp(sha1, null_sha1, 20);
+	int i;
+
+	for (i = 0; i < 20; i++, sha1++, sha2++) {
+		if (*sha1 != *sha2) {
+			if (*sha1 < *sha2)
+				return -1;
+			return +1;
+		}
+	}
+
+	return 0;
 }
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
 {
-	return memcmp(sha1, sha2, 20);
+	const unsigned long long *sha1_64 = (void *)sha1;
+	const unsigned int *sha1_32 = (void *)sha1;
+
+	if (sha1_64[0] || sha1_64[1] || sha1_32[4])
+		return 0;
+
+	return 1;
 }
+
 static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
 {
 	memcpy(sha_dst, sha_src, 20);
