From: Ingo Molnar <mingo@elte.hu>
To: git@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Peter Zijlstra" <a.p.zijlstra@chello.nl>,
"Arnaldo Carvalho de Melo" <acme@redhat.com>,
"Frédéric Weisbecker" <fweisbec@gmail.com>,
"Pekka Enberg" <penberg@cs.helsinki.fi>
Subject: [PATCH] git gc: Speed it up by 18% via faster hash comparisons
Date: Thu, 28 Apr 2011 00:51:14 +0200 [thread overview]
Message-ID: <20110427225114.GA16765@elte.hu> (raw)
Looking at the stalled-cycled profile of a cached 'git gc' i noticed the
find_pack_entry_one() entry:
$ perf record -e stalled-cycles -F 10000 ./git gc
$ perf report
# Events: 26K stalled-cycles
#
# Overhead Command Shared Object Symbol
# ........ .......... ..................... .........................................
#
26.07% git git [.] lookup_object
10.22% git libz.so.1.2.5 [.] 0xc43a
7.08% git libz.so.1.2.5 [.] inflate
6.63% git git [.] find_pack_entry_one
5.37% git [kernel.kallsyms] [k] do_raw_spin_lock
4.03% git git [.] lookup_blob
3.09% git libc-2.13.90.so [.] __strlen_sse42
2.81% git libc-2.13.90.so [.] __memcpy_ssse3_back
Annotation shows:
$ perf annotate find_pack_entry_one
Percent | Source code & Disassembly of git
------------------------------------------------
:
...
: int cmp = hashcmp(index + mi * stride, sha1);
0.90 : 4b9264: 89 ee mov %ebp,%esi
0.45 : 4b9266: 41 0f af f2 imul %r10d,%esi
2.86 : 4b926a: 4c 01 de add %r11,%rsi
53.34 : 4b926d: f3 a6 repz cmpsb %es:(%rdi),%ds:(%rsi)
14.37 : 4b926f: 0f 92 c0 setb %al
5.78 : 4b9272: 41 0f 97 c4 seta %r12b
1.52 : 4b9276: 41 28 c4 sub %al,%r12b
Most overhead is in hashcmp(), which uses memcmp(), which falls back to
assembly string operations.
But we know that hashcmp() compares hashes, which if they do not match, the first byte
will differ in 99% of the cases.
So i tried the patch below: instead of relying on GCC putting in the string
ops, i used an open-coded loop for this relatively short comparison, which does
not go beyond the first byte in 99% of the cases.
While it at i also open-coded the is_null_sha1() comparison: instead of
comparing it to null_sha1[] byte by byte, we can use the sha1 value directly
and use much more optimal 64-bit and 32-bit comparisons.
The results were rather surprising:
#
# Before:
#
$ perf stat --sync --repeat 10 ./git gc
Performance counter stats for './git gc' (10 runs):
2771.119892 task-clock # 0.863 CPUs utilized ( +- 0.16% )
1,813 context-switches # 0.001 M/sec ( +- 3.06% )
167 CPU-migrations # 0.000 M/sec ( +- 2.92% )
39,210 page-faults # 0.014 M/sec ( +- 0.26% )
8,828,405,654 cycles # 3.186 GHz ( +- 0.13% )
2,102,083,909 stalled-cycles # 23.81% of all cycles are idle ( +- 0.52% )
8,821,931,740 instructions # 1.00 insns per cycle
# 0.24 stalled cycles per insn ( +- 0.04% )
1,750,408,175 branches # 631.661 M/sec ( +- 0.04% )
74,612,120 branch-misses # 4.26% of all branches ( +- 0.07% )
3.211098537 seconds time elapsed ( +- 1.52% )
[ Note: the --sync option to perf stat emits a sync() before each 'git gc'
test-run, this reduces the noise of 'elapsed time' numbers enormously. ]
#
# After:
#
$ perf stat --sync --repeat 10 ./git gc
Performance counter stats for './git gc' (10 runs):
2349.498022 task-clock # 0.807 CPUs utilized ( +- 0.15% )
1,842 context-switches # 0.001 M/sec ( +- 2.50% )
164 CPU-migrations # 0.000 M/sec ( +- 3.67% )
39,350 page-faults # 0.017 M/sec ( +- 0.06% )
7,484,317,230 cycles # 3.185 GHz ( +- 0.15% )
1,577,673,341 stalled-cycles # 21.08% of all cycles are idle ( +- 0.67% )
11,067,826,786 instructions # 1.48 insns per cycle
# 0.14 stalled cycles per insn ( +- 0.02% )
2,489,157,909 branches # 1059.442 M/sec ( +- 0.02% )
59,384,019 branch-misses # 2.39% of all branches ( +- 0.22% )
2.910829134 seconds time elapsed ( +- 1.39% )
'git gc' got faster by 18%! Interestingly, 33% of all prior stalled cycles
disappeared: most of them turned into actual cycle count savings and speedups.
So it's rather clear that the string assembly instructions based memcmp is
suboptimal for short comparisons like this: there's quite a bit of setup
latency in repz cmpsb and the CPU is idling around during that time.
(I ran this on a Nehalem system, so it's a rather fast and modern Intel CPU.)
Also note another very interesting detail, the number of branch misses went way
down, well beyond the measurement noise:
before: 74,612,120 branch-misses # 4.26% of all branches ( +- 0.07% )
after: 59,384,019 branch-misses # 2.39% of all branches ( +- 0.22% )
One theory would be that the open-coded loop is easier for the CPU to speculate
along, so it produces less branch misses.
The number of instructions and branches increased - this is mostly because the
PMU counts a complex 'repz cmpsb' as a single instruction issuing many uops,
while the open-coded loop consists of separate instructions - but roughly the
same amount of uops.
I suspect 'git fsck' got faster as well, but i have not measured that.
There's more string op use in the Git sha1 code, but this was the lowest
hanging fruit.
Thanks,
Ingo
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/cache.h b/cache.h
index 2674f4c..c5a54fb 100644
--- a/cache.h
+++ b/cache.h
@@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1);
extern char *sha1_pack_index_name(const unsigned char *sha1);
extern const char *find_unique_abbrev(const unsigned char *sha1, int);
extern const unsigned char null_sha1[20];
-static inline int is_null_sha1(const unsigned char *sha1)
+
+static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
{
- return !memcmp(sha1, null_sha1, 20);
+ int i;
+
+ for (i = 0; i < 20; i++, sha1++, sha2++) {
+ if (*sha1 != *sha2) {
+ if (*sha1 < *sha2)
+ return -1;
+ return +1;
+ }
+ }
+
+ return 0;
}
-static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
+
+static inline int is_null_sha1(const unsigned char *sha1)
{
- return memcmp(sha1, sha2, 20);
+ const unsigned long long *sha1_64 = (void *)sha1;
+ const unsigned int *sha1_32 = (void *)sha1;
+
+ if (sha1_64[0] || sha1_64[1] || sha1_32[4])
+ return 0;
+
+ return 1;
}
+
static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
{
memcpy(sha_dst, sha_src, 20);
next reply other threads:[~2011-04-27 22:52 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-04-27 22:51 Ingo Molnar [this message]
2011-04-27 23:10 ` [PATCH] git gc: Speed it up by 18% via faster hash comparisons Ingo Molnar
2011-04-27 23:18 ` Jonathan Nieder
2011-04-28 6:36 ` Ingo Molnar
2011-04-28 9:31 ` Jonathan Nieder
2011-04-28 10:36 ` Ingo Molnar
2011-04-28 9:32 ` Dmitry Potapov
2011-04-27 23:32 ` Junio C Hamano
2011-04-28 0:35 ` Ralf Baechle
2011-04-28 8:18 ` Bernhard R. Link
2011-04-28 9:42 ` Andreas Ericsson
2011-04-28 9:55 ` Erik Faye-Lund
2011-04-28 20:19 ` H. Peter Anvin
2011-04-28 6:27 ` Ingo Molnar
2011-04-28 9:17 ` Erik Faye-Lund
2011-04-28 9:33 ` Ingo Molnar
2011-04-28 9:37 ` Ingo Molnar
2011-04-28 9:50 ` Erik Faye-Lund
2011-04-28 10:10 ` Pekka Enberg
2011-04-28 10:19 ` Erik Faye-Lund
2011-04-28 10:30 ` Pekka Enberg
2011-04-28 11:59 ` Erik Faye-Lund
2011-04-28 12:12 ` Pekka Enberg
2011-04-28 12:36 ` Jonathan Nieder
2011-04-28 12:40 ` Erik Faye-Lund
2011-04-28 13:37 ` Ingo Molnar
2011-04-28 15:14 ` Ingo Molnar
2011-04-28 16:00 ` Erik Faye-Lund
2011-04-28 20:32 ` Ingo Molnar
2011-04-29 7:05 ` Alex Riesen
2011-04-29 16:24 ` H. Peter Anvin
2011-04-28 12:16 ` Tor Arntsen
2011-04-28 20:23 ` H. Peter Anvin
2011-04-28 12:17 ` Andreas Ericsson
2011-04-28 12:28 ` Erik Faye-Lund
2011-04-28 10:19 ` Ingo Molnar
2011-04-28 12:02 ` Nguyen Thai Ngoc Duy
2011-04-28 12:18 ` Erik Faye-Lund
2011-04-28 20:20 ` Junio C Hamano
2011-04-28 16:36 ` Dmitry Potapov
2011-04-28 8:52 ` Dmitry Potapov
2011-04-28 9:11 ` Ingo Molnar
2011-04-28 9:31 ` Dmitry Potapov
2011-04-28 9:44 ` Ingo Molnar
2011-04-28 9:38 ` Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110427225114.GA16765@elte.hu \
--to=mingo@elte.hu \
--cc=a.p.zijlstra@chello.nl \
--cc=acme@redhat.com \
--cc=fweisbec@gmail.com \
--cc=git@vger.kernel.org \
--cc=hpa@zytor.com \
--cc=penberg@cs.helsinki.fi \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).