* Linus' sha1 is much faster!
@ 2009-08-14 23:25 Pádraig Brady
2009-08-15 20:02 ` Bryan Donlan
2009-08-16 19:25 ` Giuseppe Scrivano
0 siblings, 2 replies; 28+ messages in thread
From: Pádraig Brady @ 2009-08-14 23:25 UTC (permalink / raw)
To: Bug-coreutils, Linus Torvalds; +Cc: Git Mailing List
[-- Attachment #1: Type: text/plain, Size: 1110 bytes --]
I've noticed before that coreutils hashing utils
were a little behind in performance, but was prompted
to look at it again when I noticed the recently
updated sha1 implementation in git:
http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu
Testing that with the attached program which I wrote
in a couple of mins to try and match sha1sum's system calls
shows that it's around 33% faster, as shown below:
$ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1
$ time ./linus-sha1 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file
real 0m2.742s
user 0m2.516s
sys 0m0.206s
$ time ~/git/coreutils/src/sha1sum 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file
real 0m4.166s
user 0m3.846s
sys 0m0.298s
So, could we use that code in coreutils?
Think of all the dead fish it would save.
I've also attached a trivial block-sha1 patch which doesn't
affect performance, but does suppress a signed unsigned
comparison warning which occurs with -Wextra for example.
cheers,
Pádraig.
[-- Attachment #2: linus-sha1.c --]
[-- Type: text/x-csrc, Size: 681 bytes --]
/* gcc -O2 -Wall linus-sha1.c sha1.c -o linus-sha1 */
#include <stdio.h>
#include <stdlib.h>
#include "sha1.h"
int main(int argc, char** argv)
{
if (argc != 2) return 1;
const char* filename = argv[1];
FILE *fp = fopen (filename, "r");
if (!fp) return 1;
#define BS 4096 /* match coreutils */
blk_SHA_CTX ctx;
blk_SHA1_Init(&ctx);
size_t nr;
char buf[BS];
while ((nr=fread_unlocked(buf, 1, sizeof(buf), fp)))
blk_SHA1_Update(&ctx, buf, nr);
unsigned char hash[20];
blk_SHA1_Final(hash, &ctx);
int i;
for (i=0; i<sizeof(hash); i++)
printf("%02x",*(hash+i));
printf(" %s\n", filename);
return 0;
}
[-- Attachment #3: block-sha1-signed-warning.diff --]
[-- Type: text/x-patch, Size: 998 bytes --]
>From fa75e818836f763357ff9b7bbde3327e1aabbe47 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?P=C3=A1draig=20Brady?= <P@draigBrady.com>
Date: Sat, 15 Aug 2009 00:17:30 +0100
Subject: [PATCH] block-sha1: suppress signed unsigned comparison warning
* block-sha1/sha1.c: Use unsigned ints as the values
will never go negative.
---
block-sha1/sha1.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..be763d8 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -231,13 +231,13 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
{
- int lenW = ctx->size & 63;
+ unsigned int lenW = ctx->size & 63;
ctx->size += len;
/* Read the data into W and process blocks as they get full */
if (lenW) {
- int left = 64 - lenW;
+ unsigned int left = 64 - lenW;
if (len < left)
left = len;
memcpy(lenW + (char *)ctx->W, data, left);
--
1.6.2.5
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
@ 2009-08-15 20:02 ` Bryan Donlan
2009-08-15 20:12 ` John Tapsell
2009-08-16 19:25 ` Giuseppe Scrivano
1 sibling, 1 reply; 28+ messages in thread
From: Bryan Donlan @ 2009-08-15 20:02 UTC (permalink / raw)
To: Pádraig Brady
Cc: Bug-coreutils, Linus Torvalds, Git Mailing List, Brandon Casey,
Junio C Hamano, Nicolas Pitre
2009/8/14 Pádraig Brady <P@draigbrady.com>:
> I've noticed before that coreutils hashing utils
> were a little behind in performance, but was prompted
> to look at it again when I noticed the recently
> updated sha1 implementation in git:
> http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu
>
> Testing that with the attached program which I wrote
> in a couple of mins to try and match sha1sum's system calls
> shows that it's around 33% faster, as shown below:
>
> $ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1
>
> $ time ./linus-sha1 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file
> real 0m2.742s
> user 0m2.516s
> sys 0m0.206s
>
> $ time ~/git/coreutils/src/sha1sum 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file
>
> real 0m4.166s
> user 0m3.846s
> sys 0m0.298s
>
> So, could we use that code in coreutils?
> Think of all the dead fish it would save.
coreutils is licensed under GPLv3, and git under GPLv2 (only), so
you'd need permission from all contributors to the implementation in
order to relicense under GPLv3. A quick grep of the history suggests
these contributors to be:
Brandon Casey <drafnel@gmail.com>
Junio C Hamano <gitster@pobox.com>
Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Pitre <nico@cam.org>
(adding these people to the CC list)
Additionally, it was originally based on the code in
mozilla-sha1/sha1.c, but that contains a license grant allowing it to
be used under GPLv2 /or later/, so if GPLv3 relicensing is enough it
shouldn't be necessary to get in contact with the original author.
However if the FSF requires copyright assignment to accept the new
implementation, it will be necessary to track down contributors to the
original mozilla-sha1/sha1.c as well.
Note that I'm not a lawyer, so there might be other roadblocks etc to
this as well, etc :)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:02 ` Bryan Donlan
@ 2009-08-15 20:12 ` John Tapsell
2009-08-15 20:23 ` Linus Torvalds
2009-08-16 0:06 ` Theodore Tso
0 siblings, 2 replies; 28+ messages in thread
From: John Tapsell @ 2009-08-15 20:12 UTC (permalink / raw)
To: Bryan Donlan
Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre
2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> you'd need permission from all contributors to the implementation in
> order to relicense under GPLv3. A quick grep of the history suggests
> these contributors to be:
X11 also requires a fast SHA1 implementation. It uses this to check
if two pixmaps are the same. So it would be really nice to relicense
under a liberal enough license that xorg can use it.
John
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:12 ` John Tapsell
@ 2009-08-15 20:23 ` Linus Torvalds
2009-08-15 20:54 ` Linus Torvalds
2009-08-16 0:06 ` Theodore Tso
1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-15 20:23 UTC (permalink / raw)
To: John Tapsell
Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Git Mailing List,
Brandon Casey, Junio C Hamano, Nicolas Pitre
On Sat, 15 Aug 2009, John Tapsell wrote:
> 2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
>
> X11 also requires a fast SHA1 implementation. It uses this to check
> if two pixmaps are the same. So it would be really nice to relicense
> under a liberal enough license that xorg can use it.
I'm personally ok with retaining the mozilla-sha1 license.
There's not really anything _remaining_ of the mozilla code, but hey, I
started from it. In retrospect I probably should have started from the PPC
asm code that already did the blocking sanely - but that's a "20/20
hindsight" kind of thing.
Plus hey, the mozilla code being a horrid pile of crud was why I was so
convinced that I could improve on things. So that's a kind of source for
it, even if it's more about the motivational side than any actual
remaining code ;)
That said, I don't know if the MPL is ok for X11. I've not looked at
compatibility issues with MPL. For git, we could just ignore the MPL,
since the GPLv2 was acceptable regardless of it.
Linus
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:23 ` Linus Torvalds
@ 2009-08-15 20:54 ` Linus Torvalds
2009-08-17 1:55 ` Nicolas Pitre
2009-08-17 8:22 ` Andreas Ericsson
0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2009-08-15 20:54 UTC (permalink / raw)
To: John Tapsell
Cc: Paul Kocher, Bryan Donlan, Bug-coreutils, Brandon Casey,
Junio C Hamano, Nicolas Pitre, Pádraig Brady,
Git Mailing List
On Sat, 15 Aug 2009, Linus Torvalds wrote:
>
> That said, I don't know if the MPL is ok for X11. I've not looked at
> compatibility issues with MPL. For git, we could just ignore the MPL,
> since the GPLv2 was acceptable regardless of it.
If MPL isn't ok for X11, then we'd need to make sure that even the
silliest Mozilla crud has been rewritten. There really isn't much, but
hey, the _history_ is based on the mozilla code, and who knows - the
'blk_SHA_CTX' struct has things like the fields in the same order as the
Mozilla equivalent, for all those historical reasons.
(Heh. Looking at that, I probably should move the 'size' field first,
since that would have different alignment rules, and the struct would be
more tightly packed that way, and initialize better).
Afaik, none of the actual code remains (the mozilla SHA1 thing did the
wrong thing for performance even for just the final bytes, and did those a
byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
Of course, maybe the Mozilla people would be interested in taking my
faster version, and say that the new-BSD license is ok, and make everybody
happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added
him to the Cc.
Paul, for your information, we're talking about a faster rewritten "mostly
portable" SHA1 routines that you can find at
http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu
(follow the "blob" pointers to see sha1.c and sha1.h). I don't know if
you're active with Mozilla/Firefox or whether you even care, but you seem
to be the logical choice of person to ask.
Linus
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:12 ` John Tapsell
2009-08-15 20:23 ` Linus Torvalds
@ 2009-08-16 0:06 ` Theodore Tso
1 sibling, 0 replies; 28+ messages in thread
From: Theodore Tso @ 2009-08-16 0:06 UTC (permalink / raw)
To: John Tapsell
Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Linus Torvalds,
Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre
On Sat, Aug 15, 2009 at 09:12:58PM +0100, John Tapsell wrote:
> 2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
>
> X11 also requires a fast SHA1 implementation. It uses this to check
> if two pixmaps are the same. So it would be really nice to relicense
> under a liberal enough license that xorg can use it.
If the checksum isn't being exposed in the protocol (i.e., it's just
internal to the X server), one possibility for X11 is to consider to
use the SHA-3 candidate Skein instead. After receiving a large amount
of evaluation by cryptographic experts, it was one of the 18
algorithms (our of an original 64 entries) that have made it the 2nd
round of the NIST competition. It's also *substantially* faster than
SHA:
One exception to this is Skein, created by several well-known
cryptographers and noted pundit Bruce Schneier. It was designed
specifically to exploit all three of the Core 2 execution units
and to run at a full 64-bits. This gives it roughly four to 10
times the logic density of competing submissions.
This is what I meant by the Matrix quote above. They didn't bend
the spoon; they bent the crypto algorithm. They moved the logic
operations around in a way that wouldn't weaken the crypto, but
would strengthen its speed on the Intel Core 2.
In their paper (PDF), the authors of Skein express surprise that a
custom silicon ASIC implementation is not any faster than the
software implementation. They shouldn't be surprised. Every time
you can redefine a problem to run optimally in software, you will
reach the same speeds you get with optimized ASIC hardware. The
reason software has a reputation of being slow is because people
don't redefine the original problem.
http://www.darkreading.com/blog/archives/2008/11/bending_skein_c.html
For more information and some optimized implementation, see:
http://www.skein-hash.info/
- Ted
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
2009-08-15 20:02 ` Bryan Donlan
@ 2009-08-16 19:25 ` Giuseppe Scrivano
2009-08-16 20:10 ` Linus Torvalds
2009-08-17 1:53 ` Pádraig Brady
1 sibling, 2 replies; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-16 19:25 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Linus Torvalds, Bug-coreutils, Git Mailing List
[-- Attachment #1: Type: text/plain, Size: 575 bytes --]
Hi Pádraig,
I tried to reproduce your results but I wasn't able to do it. The
biggest difference on a 300MB file I noticed was approximately 15% using
on both implementations -O2, and 5% using -O3.
My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
Pentium(R) D CPU 3.20GHz.
I also spent some time trying to improve the gnulib SHA1 implementation
and it seems a lookup table can improve things a bit.
Can you please try the patch that I have attached and tell me which
performance difference (if any) you get?
Thanks,
Giuseppe
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-SHA1-use-a-lookup-table-for-faster-hashing.patch --]
[-- Type: text/x-diff, Size: 8582 bytes --]
>From b975a5e0849eaa46e5cf410c5bf6e2308f044d61 Mon Sep 17 00:00:00 2001
From: Giuseppe Scrivano <gscrivano@gnu.org>
Date: Sun, 16 Aug 2009 20:53:54 +0200
Subject: [PATCH] SHA1: use a lookup table for faster hashing
* lib/sha1.c (struct sha1_pre): New member.
* lib/sha1.c (sha1_process_block): Use the lookup table to quickly find
indices to use in the current round.
---
lib/sha1.c | 160 ++++++++++++++++++++++++++++++++++-------------------------
1 files changed, 92 insertions(+), 68 deletions(-)
diff --git a/lib/sha1.c b/lib/sha1.c
index 9c6c7ae..ec18ba7 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -283,6 +283,32 @@ sha1_process_bytes (const void *buffer, size_t len, struct sha1_ctx *ctx)
#define F3(B,C,D) ( ( B & C ) | ( D & ( B | C ) ) )
#define F4(B,C,D) (B ^ C ^ D)
+struct lookup_t
+{
+ unsigned char l1 : 4;
+ unsigned char l2 : 4;
+ unsigned char l3 : 4;
+ unsigned char l4 : 4;
+};
+
+const static struct lookup_t
+sha1_pre[16] = {{(0 - 3) & 0x0f, (0 - 8) & 0x0f, (0 - 14) & 0x0f},
+ {(1 - 3) & 0x0f, (1 - 8) & 0x0f, (1 - 14) & 0x0f},
+ {(2 - 3) & 0x0f, (2 - 8) & 0x0f, (2 - 14) & 0x0f},
+ {(3 - 3) & 0x0f, (3 - 8) & 0x0f, (3 - 14) & 0x0f},
+ {(4 - 3) & 0x0f, (4 - 8) & 0x0f, (4 - 14) & 0x0f},
+ {(5 - 3) & 0x0f, (5 - 8) & 0x0f, (5 - 14) & 0x0f},
+ {(6 - 3) & 0x0f, (6 - 8) & 0x0f, (6 - 14) & 0x0f},
+ {(7 - 3) & 0x0f, (7 - 8) & 0x0f, (7 - 14) & 0x0f},
+ {(8 - 3) & 0x0f, (8 - 8) & 0x0f, (8 - 14) & 0x0f},
+ {(9 - 3) & 0x0f, (9 - 8) & 0x0f, (9 - 14) & 0x0f},
+ {(10 - 3) & 0x0f, (10 - 8) & 0x0f, (10 - 14) & 0x0f},
+ {(11 - 3) & 0x0f, (11 - 8) & 0x0f, (11 - 14) & 0x0f},
+ {(12 - 3) & 0x0f, (12 - 8) & 0x0f, (12 - 14) & 0x0f},
+ {(13 - 3) & 0x0f, (13 - 8) & 0x0f, (13 - 14) & 0x0f},
+ {(14 - 3) & 0x0f, (14 - 8) & 0x0f, (14 - 14) & 0x0f},
+ {(15 - 3) & 0x0f, (15 - 8) & 0x0f, (15 - 14) & 0x0f}};
+
/* Process LEN bytes of BUFFER, accumulating context into CTX.
It is assumed that LEN % 64 == 0.
Most of this code comes from GnuPG's cipher/sha1.c. */
@@ -309,9 +335,8 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
#define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n))))
-#define M(I) ( tm = x[I&0x0f] ^ x[(I-14)&0x0f] \
- ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \
- , (x[I&0x0f] = rol(tm, 1)) )
+#define M(I) (x[I] = rol (x[sha1_pre[I].l1] ^ x[sha1_pre[I].l2] \
+ ^ x[sha1_pre[I].l3] ^ x[I], 1))
#define R(A,B,C,D,E,F,K,M) do { E += rol( A, 5 ) \
+ F( B, C, D ) \
@@ -322,7 +347,6 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
while (words < endp)
{
- uint32_t tm;
int t;
for (t = 0; t < 16; t++)
{
@@ -346,70 +370,70 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
R( c, d, e, a, b, F1, K1, x[13] );
R( b, c, d, e, a, F1, K1, x[14] );
R( a, b, c, d, e, F1, K1, x[15] );
- R( e, a, b, c, d, F1, K1, M(16) );
- R( d, e, a, b, c, F1, K1, M(17) );
- R( c, d, e, a, b, F1, K1, M(18) );
- R( b, c, d, e, a, F1, K1, M(19) );
- R( a, b, c, d, e, F2, K2, M(20) );
- R( e, a, b, c, d, F2, K2, M(21) );
- R( d, e, a, b, c, F2, K2, M(22) );
- R( c, d, e, a, b, F2, K2, M(23) );
- R( b, c, d, e, a, F2, K2, M(24) );
- R( a, b, c, d, e, F2, K2, M(25) );
- R( e, a, b, c, d, F2, K2, M(26) );
- R( d, e, a, b, c, F2, K2, M(27) );
- R( c, d, e, a, b, F2, K2, M(28) );
- R( b, c, d, e, a, F2, K2, M(29) );
- R( a, b, c, d, e, F2, K2, M(30) );
- R( e, a, b, c, d, F2, K2, M(31) );
- R( d, e, a, b, c, F2, K2, M(32) );
- R( c, d, e, a, b, F2, K2, M(33) );
- R( b, c, d, e, a, F2, K2, M(34) );
- R( a, b, c, d, e, F2, K2, M(35) );
- R( e, a, b, c, d, F2, K2, M(36) );
- R( d, e, a, b, c, F2, K2, M(37) );
- R( c, d, e, a, b, F2, K2, M(38) );
- R( b, c, d, e, a, F2, K2, M(39) );
- R( a, b, c, d, e, F3, K3, M(40) );
- R( e, a, b, c, d, F3, K3, M(41) );
- R( d, e, a, b, c, F3, K3, M(42) );
- R( c, d, e, a, b, F3, K3, M(43) );
- R( b, c, d, e, a, F3, K3, M(44) );
- R( a, b, c, d, e, F3, K3, M(45) );
- R( e, a, b, c, d, F3, K3, M(46) );
- R( d, e, a, b, c, F3, K3, M(47) );
- R( c, d, e, a, b, F3, K3, M(48) );
- R( b, c, d, e, a, F3, K3, M(49) );
- R( a, b, c, d, e, F3, K3, M(50) );
- R( e, a, b, c, d, F3, K3, M(51) );
- R( d, e, a, b, c, F3, K3, M(52) );
- R( c, d, e, a, b, F3, K3, M(53) );
- R( b, c, d, e, a, F3, K3, M(54) );
- R( a, b, c, d, e, F3, K3, M(55) );
- R( e, a, b, c, d, F3, K3, M(56) );
- R( d, e, a, b, c, F3, K3, M(57) );
- R( c, d, e, a, b, F3, K3, M(58) );
- R( b, c, d, e, a, F3, K3, M(59) );
- R( a, b, c, d, e, F4, K4, M(60) );
- R( e, a, b, c, d, F4, K4, M(61) );
- R( d, e, a, b, c, F4, K4, M(62) );
- R( c, d, e, a, b, F4, K4, M(63) );
- R( b, c, d, e, a, F4, K4, M(64) );
- R( a, b, c, d, e, F4, K4, M(65) );
- R( e, a, b, c, d, F4, K4, M(66) );
- R( d, e, a, b, c, F4, K4, M(67) );
- R( c, d, e, a, b, F4, K4, M(68) );
- R( b, c, d, e, a, F4, K4, M(69) );
- R( a, b, c, d, e, F4, K4, M(70) );
- R( e, a, b, c, d, F4, K4, M(71) );
- R( d, e, a, b, c, F4, K4, M(72) );
- R( c, d, e, a, b, F4, K4, M(73) );
- R( b, c, d, e, a, F4, K4, M(74) );
- R( a, b, c, d, e, F4, K4, M(75) );
- R( e, a, b, c, d, F4, K4, M(76) );
- R( d, e, a, b, c, F4, K4, M(77) );
- R( c, d, e, a, b, F4, K4, M(78) );
- R( b, c, d, e, a, F4, K4, M(79) );
+ R( e, a, b, c, d, F1, K1, M( 0) );
+ R( d, e, a, b, c, F1, K1, M( 1) );
+ R( c, d, e, a, b, F1, K1, M( 2) );
+ R( b, c, d, e, a, F1, K1, M( 3) );
+ R( a, b, c, d, e, F2, K2, M( 4) );
+ R( e, a, b, c, d, F2, K2, M( 5) );
+ R( d, e, a, b, c, F2, K2, M( 6) );
+ R( c, d, e, a, b, F2, K2, M( 7) );
+ R( b, c, d, e, a, F2, K2, M( 8) );
+ R( a, b, c, d, e, F2, K2, M( 9) );
+ R( e, a, b, c, d, F2, K2, M(10) );
+ R( d, e, a, b, c, F2, K2, M(11) );
+ R( c, d, e, a, b, F2, K2, M(12) );
+ R( b, c, d, e, a, F2, K2, M(13) );
+ R( a, b, c, d, e, F2, K2, M(14) );
+ R( e, a, b, c, d, F2, K2, M(15) );
+ R( d, e, a, b, c, F2, K2, M( 0) );
+ R( c, d, e, a, b, F2, K2, M( 1) );
+ R( b, c, d, e, a, F2, K2, M( 2) );
+ R( a, b, c, d, e, F2, K2, M( 3) );
+ R( e, a, b, c, d, F2, K2, M( 4) );
+ R( d, e, a, b, c, F2, K2, M( 5) );
+ R( c, d, e, a, b, F2, K2, M( 6) );
+ R( b, c, d, e, a, F2, K2, M( 7) );
+ R( a, b, c, d, e, F3, K3, M( 8) );
+ R( e, a, b, c, d, F3, K3, M( 9) );
+ R( d, e, a, b, c, F3, K3, M(10) );
+ R( c, d, e, a, b, F3, K3, M(11) );
+ R( b, c, d, e, a, F3, K3, M(12) );
+ R( a, b, c, d, e, F3, K3, M(13) );
+ R( e, a, b, c, d, F3, K3, M(14) );
+ R( d, e, a, b, c, F3, K3, M(15) );
+ R( c, d, e, a, b, F3, K3, M( 0) );
+ R( b, c, d, e, a, F3, K3, M( 1) );
+ R( a, b, c, d, e, F3, K3, M( 2) );
+ R( e, a, b, c, d, F3, K3, M( 3) );
+ R( d, e, a, b, c, F3, K3, M( 4) );
+ R( c, d, e, a, b, F3, K3, M( 5) );
+ R( b, c, d, e, a, F3, K3, M( 6) );
+ R( a, b, c, d, e, F3, K3, M( 7) );
+ R( e, a, b, c, d, F3, K3, M( 8) );
+ R( d, e, a, b, c, F3, K3, M( 9) );
+ R( c, d, e, a, b, F3, K3, M(10) );
+ R( b, c, d, e, a, F3, K3, M(11) );
+ R( a, b, c, d, e, F4, K4, M(12) );
+ R( e, a, b, c, d, F4, K4, M(13) );
+ R( d, e, a, b, c, F4, K4, M(14) );
+ R( c, d, e, a, b, F4, K4, M(15) );
+ R( b, c, d, e, a, F4, K4, M( 0) );
+ R( a, b, c, d, e, F4, K4, M( 1) );
+ R( e, a, b, c, d, F4, K4, M( 2) );
+ R( d, e, a, b, c, F4, K4, M( 3) );
+ R( c, d, e, a, b, F4, K4, M( 4) );
+ R( b, c, d, e, a, F4, K4, M( 5) );
+ R( a, b, c, d, e, F4, K4, M( 6) );
+ R( e, a, b, c, d, F4, K4, M( 7) );
+ R( d, e, a, b, c, F4, K4, M( 8) );
+ R( c, d, e, a, b, F4, K4, M( 9) );
+ R( b, c, d, e, a, F4, K4, M(10) );
+ R( a, b, c, d, e, F4, K4, M(11) );
+ R( e, a, b, c, d, F4, K4, M(12) );
+ R( d, e, a, b, c, F4, K4, M(13) );
+ R( c, d, e, a, b, F4, K4, M(14) );
+ R( b, c, d, e, a, F4, K4, M(15) );
a = ctx->A += a;
b = ctx->B += b;
--
1.6.3.3
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-16 19:25 ` Giuseppe Scrivano
@ 2009-08-16 20:10 ` Linus Torvalds
2009-08-16 22:15 ` Giuseppe Scrivano
2009-08-17 1:53 ` Pádraig Brady
1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-16 20:10 UTC (permalink / raw)
To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List
On Sun, 16 Aug 2009, Giuseppe Scrivano wrote:
>
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.
Netburst is very sensitive to random spill effects, and you can basically
tune things by just code shuffling that just has random effects on the
generated asm code.
> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.
I pretty much can guarantee you that it improves things only because it
makes gcc generate crap code, which then hides some of the P4 issues.
I'd also suggest you try gcc-4.4, since that apparently fixes some of the
oddest spill issues.
Linus
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-16 20:10 ` Linus Torvalds
@ 2009-08-16 22:15 ` Giuseppe Scrivano
2009-08-16 22:47 ` Linus Torvalds
0 siblings, 1 reply; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-16 22:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Pádraig Brady, Bug-coreutils, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> I pretty much can guarantee you that it improves things only because it
> makes gcc generate crap code, which then hides some of the P4 issues.
>
> I'd also suggest you try gcc-4.4, since that apparently fixes some of the
> oddest spill issues.
Thanks for the hint. I tried gcc-4.4 and it produces slower code than
4.3 on the gnulib SHA1 implementation and my patch makes it even more!
I noticed that on my machine your implementation is ~30-40% faster using
SHA_ROT for rol/ror instructions than inline assembly, at least with the
test-case Pádraig wrote. Am I the only one reporting it?
Cheers,
Giuseppe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-16 22:15 ` Giuseppe Scrivano
@ 2009-08-16 22:47 ` Linus Torvalds
0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2009-08-16 22:47 UTC (permalink / raw)
To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List
On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
>
> Thanks for the hint. I tried gcc-4.4 and it produces slower code than
> 4.3 on the gnulib SHA1 implementation and my patch makes it even more!
Check out the asm, see if you can see why. One of the most common problems
with P4's is literally that you end up loading from the same stack slot
that you just stored to (gcc can do some really crazy spills), and that
causes a store buffer hazard replay.
My personal opinion is that Netburst is useless for trying to optimize C
code for. It's just too random.
> I noticed that on my machine your implementation is ~30-40% faster using
> SHA_ROT for rol/ror instructions than inline assembly, at least with the
> test-case Pádraig wrote. Am I the only one reporting it?
I bet it's the same thing. Small perturbations of the source causing small
changes to register allocation and thus spilling, and then Netburst goes
crazy one way or another. It's interestign trying to fix it, and very
frustrating.
My workstation is a Nehalem (but Core 2 will have pretty much the same
behavior), and it doesn't have the crazy netburst behavior. Shorter and
simpler code generally performs better (which is _not_ true on Netburst).
On my machine, for example, forcing gcc to do those rotates on registers
is the difference between ~381MB/s and 415MB/s. And that's mainly because
it makes gcc keep A-E in registers, rather than trying to cache the
array[] references.
Linus
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-16 19:25 ` Giuseppe Scrivano
2009-08-16 20:10 ` Linus Torvalds
@ 2009-08-17 1:53 ` Pádraig Brady
2009-08-17 10:51 ` Giuseppe Scrivano
1 sibling, 1 reply; 28+ messages in thread
From: Pádraig Brady @ 2009-08-17 1:53 UTC (permalink / raw)
To: Giuseppe Scrivano; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List
Giuseppe Scrivano wrote:
> Hi Pádraig,
>
> I tried to reproduce your results but I wasn't able to do it. The
> biggest difference on a 300MB file I noticed was approximately 15% using
> on both implementations -O2, and 5% using -O3.
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.
>
> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.
>
> Can you please try the patch that I have attached and tell me which
> performance difference (if any) you get?
Thanks for looking at this Giuseppe
and sorry for not mentioning my GCC and CPU.
Note the binaries below is compiled with
$(rpm -q --qf="%{OPTFLAGS}\n" coreutils)
for consistency, which on my F11 machines is:
-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
-mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1
Testing on 2 machines I have here:
$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Pentium(R) M processor 1.70GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real 0m3.540s
$ time linus-sha1 sha1.test
real 0m2.319s (-34%)
$ time giuseppe-sha1sum sha1.test
real 0m3.513s (-.8%)
$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real 0m1.857s
$ time linus-sha1 sha1.test
real 0m1.102s (-40%)
$ time giuseppe-sha1sum sha1.test
real 0m1.932s (+ 4%)
cheers,
Pádraig.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:54 ` Linus Torvalds
@ 2009-08-17 1:55 ` Nicolas Pitre
2009-08-26 11:39 ` Pádraig Brady
2009-08-17 8:22 ` Andreas Ericsson
1 sibling, 1 reply; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 1:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils,
Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher
On Sat, 15 Aug 2009, Linus Torvalds wrote:
> (Heh. Looking at that, I probably should move the 'size' field first,
> since that would have different alignment rules, and the struct would be
> more tightly packed that way, and initialize better).
I was about to suggest (i.e. post) a patch for that. This is indeed a
good idea.
> Afaik, none of the actual code remains (the mozilla SHA1 thing did the
> wrong thing for performance even for just the final bytes, and did those a
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
Maybe a patch adding a proper header with the actual license would be a
good idea too.
Nicolas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
@ 2009-08-17 7:23 George Spelvin
2009-08-17 14:20 ` Nicolas Pitre
2009-08-17 17:06 ` Nicolas Pitre
0 siblings, 2 replies; 28+ messages in thread
From: George Spelvin @ 2009-08-17 7:23 UTC (permalink / raw)
To: bdonlan, johnflux, P; +Cc: art.08.09, git, linux, nico, torvalds
If it helps anyone resolve license issues, here's a from-FIPS-180-2
implementation that's placed in the public domain. That should be
compatible with any license.
It uses Linus's and Artur's performance ideas, and some of Linus' macro
ideas (in the rotate implementation), but tries to be textually different.
Is there anything recognizable that anyone cares to clam copyright to?
It's not quite 100% finished, as I haven't benchmarked it against Linus's
code yet, but it's functionally correct.
It's also clean with -W -Wall -Wextra.
TODO: Check if an initial copy to w[] is faster on i386 (less register
pressure).
/*
* Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
*
* This implementation is in the public domain. Copyright abandoned.
* You may do anything you like with it, including evil things.
*
* This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
* from the git mailing list (August, 2009). Additional optimization
* ideas cribbed from
* - Artur Skawina (x86, particularly P4, and much benchmarking)
* - Nicilas Pitre (ARM)
*/
/* Cut here for sha1.h */
#include <stddef.h> /* For size_t */
#include <stdint.h> /* For uint32_t, uint64_t */
typedef struct SHA_context {
uint64_t len; /* May be shrunk to uint32_t */
uint32_t iv[5];
unsigned char buf[64]; /* Must be 32-bit aligned */
} SHA_CTX;
void SHA1_Init(struct SHA_context *c);
void SHA1_Update(struct SHA_context *c, void const *p, size_t n);
void SHA1_Final(unsigned char hash[20], struct SHA_context *c);
/* End of sha1.h */
#include <string.h> /* For memcpy() */
#include <arpa/inet.h> /* For ntohl() */
static void sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks);
/* Machine specific hacks */
#if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \
defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \
defined(__s390__) || defined(__s390x__)
/* Unaligned access is okay */
static inline uint32_t get_be32(unsigned char const *p)
{
return ntohl(*(uint32_t const *)p);
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
*(uint32_t *)p = htonl(v);
}
#else
/* Unaligned access is not okay; do conversion as byte fetches */
static inline uint32_t get_be32(unsigned char const *p)
{
return p[0] << 24 || p[1] << 16 | p[8] << 8 | p[3];
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
p[0] = v >> 24;
p[1] = v >> 16;
p[2] = v >> 8;
p[3] = v;
}
#endif
void SHA1_Init(struct SHA_context *c)
{
/* This is a prefix of the SHA_context structure */
static struct {
uint64_t len;
uint32_t iv[5];
} const iv = { 0,
{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }
};
memcpy(c, &iv, sizeof iv);
}
void SHA1_Update(struct SHA_context *c, void const *p, size_t n)
{
size_t pos = c->len & 63; /* Offset into current input block */
c->len += n;
/* Initial partial block (if any) */
if (pos) {
size_t space = 63 - pos;
if (n < space)
goto end;
memcpy(c->buf + pos, p, space);
sha1_core(c->iv, c->buf, 1);
n -= space;
p = (char const *)p + space;
}
/* The large middle piece */
if (n >> 6) {
sha1_core(c->iv, p, n >> 6);
p = (char const *)p + (n & -(size_t)64);
n &= 63;
}
pos = 0;
end:
/* Final partial block (may be zero size) */
memcpy(c->buf + pos, p, n);
}
void SHA1_Final(unsigned char hash[20], struct SHA_context *c)
{
size_t pos = c->len & 63;
unsigned i;
/* Append a single 1 bit */
c->buf[pos++] = 0x80;
/* Append 0 bits until 64 bits remain in a block */
if (pos > 56) {
memset(c->buf + pos, 0, 64 - pos);
sha1_core(c->iv, c->buf, 1);
pos = 0;
}
memset(c->buf + pos, 0, 56 - pos);
/* Append total input length in bits */
((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29));
((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3);
/* Final hash round */
sha1_core(c->iv, c->buf, 1);
/* Copy hash result out */
for (i = 0; i < 5; i++)
put_be32(hash + 4*i, c->iv[i]);
}
/*
* Helper macros for sha1_core function. To avoid clutter, these macros
* are NOT fully parenthesized if it doesn't matter to the actual use.
*/
#if __GNUC__ && (defined(__i386__) || defined(__x86_64__))
/*
* GCC by itself only generates left rotates. Use right rotates if
* possible to be kinder to dinky implementations with iterative rotate
* instructions.
*/
#define ROT(op, x, k) \
({ uint32_t y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define ROL(x,k) ROT("roll", x, k)
#define ROR(x,k) ROT("rorl", x, k)
#else
/* Generic C equivalent */
#define ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define ROL(x,k) ROT(x,k,32-(k))
#define ROR(x,k) ROT(x,32-(k),k)
#endif
/*
* An important temporary array in SHA-1 is the working array W[],
* which holds a scheduled key word per round. Only the last 16 words
* are relevant, so only use 16 words...
*/
#define W(i) w[(i) & 15]
/*
* If you have a small register set, it helps GCC to force stores to
* the w[] array to memory. Given 22 or more registers (e.g. PowerPC),
* GCC can map the entire w[] array to registers and this becomes
* counterproductive.
*
* The optimal kludge seems to differ between x86 and ARM. On the latter,
* forcing a full memory barrier is required to stop GCC from splitting
* live ranges for each round and generating a separate stack slot for
* each. (Which produces a huge stack frame and kills performance.)
*/
#if defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) *(volatile uint32_t *)&W(i) = x
#elif __GNUC__ && defined(__arm__)
#define STORE(i, x) W(i) = x; __asm__("":::"memory")
#else
#define STORE(i, x) W(i) = x
#endif
/* The three round functions. F2 is also used as F4 */
#define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */
#define F2(b,c,d) (d ^ c ^ b) /* Even parity */
#define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */
/* The four round constants */
#define K1 0x5a827999 /* 2^30 * sqrt(2) */
#define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */
#define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */
#define K4 0xca62c1d6 /* 2^30 * sqrt(10) */
/* Rounds 0..15 fetch a word from the input */
#define FETCH(t,i) t = get_be32(p + 4*(i)); STORE(i,t)
/* Rounds 16..79 mix previous words to get a new one */
#define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1)
/* Rounds 16..76 have to store back the result */
#define CALC(t,i) MIX(t,i); STORE(i,t)
/* The basic SHA-1 round */
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \
src(t,i); \
e += t + f(b,c,d) + k + ROL(a,5); \
b = ROR(b,2)
/* An aligned group of 5 rounds */
#define SHA_ROUND5(f,k,src,t,i) \
SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \
SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \
SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \
SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \
SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4)
/*
* The core SHA-1 transform.
*
* iv[5] = Current SHA-1 hash state.
* p = Pointer to source data. Not necessarily aligned.
* nblocks = Number of 64-byte blocks at p. Guaranteed non-zero.
*/
static void
sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks)
{
uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0];
uint32_t w[16];
do {
uint32_t t;
SHA_ROUND5(F1, K1, FETCH, t, 0);
SHA_ROUND5(F1, K1, FETCH, t, 5);
SHA_ROUND5(F1, K1, FETCH, t, 10);
SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15);
SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16);
SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17);
SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18);
SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19);
SHA_ROUND5(F2, K2, CALC, t, 20);
SHA_ROUND5(F2, K2, CALC, t, 25);
SHA_ROUND5(F2, K2, CALC, t, 30);
SHA_ROUND5(F2, K2, CALC, t, 35);
SHA_ROUND5(F3, K3, CALC, t, 40);
SHA_ROUND5(F3, K3, CALC, t, 45);
SHA_ROUND5(F3, K3, CALC, t, 50);
SHA_ROUND5(F3, K3, CALC, t, 55);
SHA_ROUND5(F2, K4, CALC, t, 60);
SHA_ROUND5(F2, K4, CALC, t, 65);
SHA_ROUND5(F2, K4, CALC, t, 70);
SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75);
SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76);
/* Last 3 rounds don't need to store mixed value */
SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77);
SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78);
SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79);
iv[4] = e += iv[4];
iv[3] = d += iv[3];
iv[2] = c += iv[2];
iv[1] = b += iv[1];
iv[0] = a += iv[0];
} while (--nblocks);
}
/* Compile with -DUNITTEST for self-tests */
#if UNITTEST
#include <stdio.h>
/* Known answer test */
static void katest(char const *text, size_t len, unsigned char const hash[20])
{
SHA_CTX c;
unsigned char hash2[20];
int i;
SHA1_Init(&c);
SHA1_Update(&c, text, len);
SHA1_Final(hash2, &c);
for (i = 0; i < 20; i++)
if (hash[i] != hash2[i])
goto mismatch;
printf("%u-byte known answer test PASSED\n", len);
return;
mismatch:
printf("%u-byte known answer test FAILED:\n", len);
printf("Input: \"%.*s\"\n", len, text);
printf("Computed:");
for (i = 0; i < 20; i++)
printf(" %02x", hash2[i]);
printf("\nExpected:");
for (i = 0; i < 20; i++)
printf(" %02x", hash[i]);
putchar('\n');
}
int
main(void)
{
/* FIPS PUB 180.1 example A.1 */
static char const text1[3] = "abc";
static unsigned char const hash1[20] = {
0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e,
0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d };
/* FIPS PUB 180.1 example A.2 */
static char const text2[56] =
"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
static unsigned char const hash2[20] = {
0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae,
0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 };
katest(text1, sizeof text1, hash1);
katest(text2, sizeof text2, hash2);
return 0;
}
#endif
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-15 20:54 ` Linus Torvalds
2009-08-17 1:55 ` Nicolas Pitre
@ 2009-08-17 8:22 ` Andreas Ericsson
1 sibling, 0 replies; 28+ messages in thread
From: Andreas Ericsson @ 2009-08-17 8:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils,
Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre,
Paul Kocher, Paul Kocher
Linus Torvalds wrote:
>
> On Sat, 15 Aug 2009, Linus Torvalds wrote:
>> That said, I don't know if the MPL is ok for X11. I've not looked at
>> compatibility issues with MPL. For git, we could just ignore the MPL,
>> since the GPLv2 was acceptable regardless of it.
>
> If MPL isn't ok for X11, then we'd need to make sure that even the
> silliest Mozilla crud has been rewritten. There really isn't much, but
> hey, the _history_ is based on the mozilla code, and who knows - the
> 'blk_SHA_CTX' struct has things like the fields in the same order as the
> Mozilla equivalent, for all those historical reasons.
>
> (Heh. Looking at that, I probably should move the 'size' field first,
> since that would have different alignment rules, and the struct would be
> more tightly packed that way, and initialize better).
>
> Afaik, none of the actual code remains (the mozilla SHA1 thing did the
> wrong thing for performance even for just the final bytes, and did those a
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
>
> Of course, maybe the Mozilla people would be interested in taking my
> faster version, and say that the new-BSD license is ok, and make everybody
> happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added
> him to the Cc.
>
> Paul, for your information, we're talking about a faster rewritten "mostly
> portable" SHA1 routines that you can find at
>
> http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu
>
> (follow the "blob" pointers to see sha1.c and sha1.h). I don't know if
> you're active with Mozilla/Firefox or whether you even care, but you seem
> to be the logical choice of person to ask.
>
I contacted Paul in february this year to get permission to use the mozilla
sha1 code for libgit2. His reply then was:
"I'm not sure which version the diffs are relative to, so I haven't reviewed them.
It's fine to distribute under BSD, GPL, or LGPL, however."
I also got explicit permission to relicense it under GPLv2 with the gcc exception.
I added the mail-address I used to contact him to CC as well. Sorry if you get
this twice, Paul.
Naturally, I'd like to use the faster version for libgit2 as well. The people
who Linus listed as contributors earlier (Brandon Casy, Linus, Junio and Nicolas
Pitre) have already consented to relicense their git contributions for libgit2
use. If anyone would like to revoke that consent for this code, speak now please,
or I'll patch it into libgit2 as well.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 1:53 ` Pádraig Brady
@ 2009-08-17 10:51 ` Giuseppe Scrivano
2009-08-17 15:44 ` Steven Noonan
0 siblings, 1 reply; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-17 10:51 UTC (permalink / raw)
To: Pádraig Brady; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List
Pádraig Brady <P@draigBrady.com> writes:
> -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
> -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1
thanks. I did again all tests on my machine using these same options.
I repeated each test 6 times and I took the median without consider the
first result. Except the first run that it is not considered, I didn't
report a big variance on results of the same test.
gcc 4.3.3
gnulib sha1: real 0m2.543s
gnulib sha1 lookup: real 0m1.906s (-25%)
linus's sha1: real 0m2.468s (-3%)
linus's sha1 no asm: real 0m2.289s (-9%)
gcc 4.4.1
gnulib sha1: real 0m3.386s
gnulib sha1 lookup: real 0m3.110s (-8%)
linus's sha1: real 0m1.701s (-49%)
linus's sha1 no asm: real 0m1.284s (-62%)
I don't see such big differences in asm generated by gcc 4.4.1 and gcc
4.3.3 to explain this performance difference, what I noticed immediately
is that in the gcc-4.4 generated asm there are more "lea" instructions
(+30%), but I doubt this is the reason of these poor results. Anyway, I
haven't yet looked much in details.
Cheers,
Giuseppe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 7:23 George Spelvin
@ 2009-08-17 14:20 ` Nicolas Pitre
2009-08-17 17:06 ` Nicolas Pitre
1 sibling, 0 replies; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 14:20 UTC (permalink / raw)
To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, Linus Torvalds
On Mon, 17 Aug 2009, George Spelvin wrote:
> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain. That should be
> compatible with any license.
>
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?
>
> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.
>
> TODO: Check if an initial copy to w[] is faster on i386 (less register
> pressure).
>
> /*
> * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
> *
> * This implementation is in the public domain. Copyright abandoned.
> * You may do anything you like with it, including evil things.
> *
> * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
> * from the git mailing list (August, 2009). Additional optimization
> * ideas cribbed from
> * - Artur Skawina (x86, particularly P4, and much benchmarking)
> * - Nicilas Pitre (ARM)
Please be careful to spell my name correctly.
Nicolas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 10:51 ` Giuseppe Scrivano
@ 2009-08-17 15:44 ` Steven Noonan
2009-08-17 16:22 ` Linus Torvalds
2009-08-17 17:32 ` Giuseppe Scrivano
0 siblings, 2 replies; 28+ messages in thread
From: Steven Noonan @ 2009-08-17 15:44 UTC (permalink / raw)
To: Giuseppe Scrivano
Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
Git Mailing List
On Mon, Aug 17, 2009 at 3:51 AM, Giuseppe Scrivano<gscrivano@gnu.org> wrote:
> Pádraig Brady <P@draigBrady.com> writes:
>
>> -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
>> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
>> -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1
>
> thanks. I did again all tests on my machine using these same options.
> I repeated each test 6 times and I took the median without consider the
> first result. Except the first run that it is not considered, I didn't
> report a big variance on results of the same test.
>
>
> gcc 4.3.3
>
> gnulib sha1: real 0m2.543s
> gnulib sha1 lookup: real 0m1.906s (-25%)
> linus's sha1: real 0m2.468s (-3%)
> linus's sha1 no asm: real 0m2.289s (-9%)
>
>
> gcc 4.4.1
>
> gnulib sha1: real 0m3.386s
> gnulib sha1 lookup: real 0m3.110s (-8%)
> linus's sha1: real 0m1.701s (-49%)
> linus's sha1 no asm: real 0m1.284s (-62%)
>
>
> I don't see such big differences in asm generated by gcc 4.4.1 and gcc
> 4.3.3 to explain this performance difference, what I noticed immediately
> is that in the gcc-4.4 generated asm there are more "lea" instructions
> (+30%), but I doubt this is the reason of these poor results. Anyway, I
> haven't yet looked much in details.
>
> Cheers,
> Giuseppe
Interesting. I compared Linus' implementation to the public domain one
by Steve Reid[1], which is used in OpenLDAP and a few other projects.
Anyone with some experience testing these kinds of things in a
statistically sound manner want to try it out? In my tests, I got
this:
(average of 5 runs)
Linus' sha1: 283MB/s
Steve Reid's sha1: 305MB/s
- Steven
[1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 15:44 ` Steven Noonan
@ 2009-08-17 16:22 ` Linus Torvalds
2009-08-17 21:43 ` Steven Noonan
2009-08-17 17:32 ` Giuseppe Scrivano
1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-17 16:22 UTC (permalink / raw)
To: Steven Noonan
Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils,
Git Mailing List
On Mon, 17 Aug 2009, Steven Noonan wrote:
>
> Interesting. I compared Linus' implementation to the public domain one
> by Steve Reid[1]
You _really_ need to talk about what kind of environment you have.
There are three major issues:
- Netburst vs non-netburst
- 32-bit vs 64-bit
- compiler version
Steve Reid's code looks great, but the way it is coded, gcc makes a mess
of it, which is exactly what my SHA1 tries to avoid.
[ In contrast, gcc does very well on just about _any_ straightforward
unrolled SHA1 C code if the target architecture is something like PPC or
ia64 that has enough registers to keep it all in registers.
I haven't really tested other compilers - a less aggressive compiler
would actually do _better_ on SHA1, because the problem with gcc is that
it turns the whole temporary 16-entry word array into register accesses,
and tries to do register allocation on that _array_.
That is wonderful for the above-mentioned PPC and IA64, but it makes gcc
create totally crazy code when there aren't enough registers, and then
gcc starts spilling randomly (ie it starts spilling a-e etc). This is
why the compiler and version matters so much. ]
> (average of 5 runs)
> Linus' sha1: 283MB/s
> Steve Reid's sha1: 305MB/s
So I get very different results:
# TIME[s] SPEED[MB/s]
Reid 2.742 222.6
linus 1.464 417
this is Intel Nehalem, but compiled for 32-bit mode (which is the more
challenging one because x86-32 only has 7 general-purpose registers), and
with gcc-4.4.0.
Linus
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 7:23 George Spelvin
2009-08-17 14:20 ` Nicolas Pitre
@ 2009-08-17 17:06 ` Nicolas Pitre
2009-08-17 17:20 ` Paolo Bonzini
2009-08-17 18:54 ` George Spelvin
1 sibling, 2 replies; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 17:06 UTC (permalink / raw)
To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, torvalds
On Mon, 17 Aug 2009, George Spelvin wrote:
> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain. That should be
> compatible with any license.
>
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?
I don't think this trick of making source code textually different from
another work while still intimately mimicking the same structure entitles
you to any copyright (or non copyright) claims over that other work. I
certainly wouldn't bet any dime for this standing up in court.
Otherwise anyone could grab any copyrighted source code and perform a
bunch of search-and-replace ops on it, and maybe some code reordering
for good measure, to be able to claim own copyright on it. It is
probably much safer to simply ask the people involved to agree with your
relicensing. And so far I don't see anyone with a stake in this
fiercely wanting to stick to a particular license.
> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.
Not if you try with the unaligned put_be32() as the destination pointer
is marked const.
As to the actual result on ARM... Well, the assembly _looks_ much worse
than Linus' version. It uses a stack frame of 152 bytes instead of 64
bytes. The resulting binary is also 6868 bytes large compared to 6180
bytes. Surprisingly, the performance is not that bad (the reason for
the underlined "looks" above) albeit still a bit worse, like 5% slower.
I was expecting much worse than that.
One possible reason for the bad assembly is probably due to the fact
that gcc is not smart enough to propagate constant address offsets
across different pointer types. For example, my first version of
get_be32() was a macro that did this:
#define SHA_SRC(t) \
({ unsigned char *__d = (unsigned char *)&data[t]; \
(__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
With such a construct, gcc would always allocate a register to hold __d
and then dereference that with an offset from 0 to 3. Whereas:
#define SHA_SRC(t) \
({ unsigned char *__d = (unsigned char *)data; \
(__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
(__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); })
does produce optimal assembly as only the register holding the data
pointer is dereferenced with the absolute byte offset. I suspect your
usage of inline functions has the same effect as the first SHA_SRC
definition above.
Also, wrt skipping the last 3 write back to the 16 word array... For
all the (limited) attempts I've made so far to do that, it always ended
up making things worse. I've yet to investigate why though.
Nicolas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 17:06 ` Nicolas Pitre
@ 2009-08-17 17:20 ` Paolo Bonzini
2009-08-17 18:54 ` George Spelvin
1 sibling, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2009-08-17 17:20 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: George Spelvin, bdonlan, johnflux, P, art.08.09, git
> my first version of
> get_be32() was a macro that did this:
>
> #define SHA_SRC(t) \
> ({ unsigned char *__d = (unsigned char *)&data[t]; \
> (__d[0]<< 24) | (__d[1]<< 16) | (__d[2]<< 8) | (__d[3]<< 0); })
>
> With such a construct, gcc would always allocate a register to hold __d
> and then dereference that with an offset from 0 to 3. Whereas:
>
> #define SHA_SRC(t) \
> ({ unsigned char *__d = (unsigned char *)data; \
> (__d[(t)*4 + 0]<< 24) | (__d[(t)*4 + 1]<< 16) | \
> (__d[(t)*4 + 2]<< 8) | (__d[(t)*4 + 3]<< 0); })
>
> does produce optimal assembly as only the register holding the data
> pointer is dereferenced with the absolute byte offset. I suspect your
> usage of inline functions has the same effect as the first SHA_SRC
> definition above.
Yes, that's what happens.
Paolo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 15:44 ` Steven Noonan
2009-08-17 16:22 ` Linus Torvalds
@ 2009-08-17 17:32 ` Giuseppe Scrivano
1 sibling, 0 replies; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-17 17:32 UTC (permalink / raw)
To: Steven Noonan
Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
Git Mailing List
Hi,
These are the results I reported (median of 5 plus an additional not
considered first run) on the Steve Reid's SHA1 implementation using the
same flags to the compiler that I used for previous tests.
GCC 4.3.3: real 0m2.627s
GCC 4.4.1: real 0m3.742s
In both cases it showed to be slower than other implementations I have
already tried.
Additional note: as for gnulib SHA1, GCC 4.4.1 produced slower code than
GCC 4.3.3.
Cheers,
Giuseppe
Steven Noonan <steven@uplinklabs.net> writes:
>
> Interesting. I compared Linus' implementation to the public domain one
> by Steve Reid[1], which is used in OpenLDAP and a few other projects.
> Anyone with some experience testing these kinds of things in a
> statistically sound manner want to try it out? In my tests, I got
> this:
>
> (average of 5 runs)
> Linus' sha1: 283MB/s
> Steve Reid's sha1: 305MB/s
>
> - Steven
>
> [1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 17:06 ` Nicolas Pitre
2009-08-17 17:20 ` Paolo Bonzini
@ 2009-08-17 18:54 ` George Spelvin
2009-08-17 19:34 ` Nicolas Pitre
1 sibling, 1 reply; 28+ messages in thread
From: George Spelvin @ 2009-08-17 18:54 UTC (permalink / raw)
To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds
> I don't think this trick of making source code textually different from
> another work while still intimately mimicking the same structure entitles
> you to any copyright (or non copyright) claims over that other work. I
> certainly wouldn't bet any dime for this standing up in court.
<div type="legal digression">
Actually, I would. I did a lot more than text search and replace;
I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright)
and then merged in the *ideas* from the mailing list.
(And from elsewhere; the idea of a five-round macro is from Brian Gladman.)
Remember, to the extent that something is *functional*, it is not
copyrightable; copyright only covers the non-functional expressive bits.
The vast majority of that code is simply required by the standard,
or the desired calling interface.
For a large portion of the rest, remember that standard programming
conventions (e.g. brace style, macro names IN CAPS, etc.) that's also
non-copyrightable "scene a faire" material.
It's well established that paraphrasing a recipe avoids copyright;
the proportions and treatment of the ingredients is not copyrightable.
For more details, see the extensive coverage of the NEC v. Intel decision
(1989) regarding the firmware for NEC's 8086-clone V20 microprocessor.
It was found non-infringing despite non-clean-room implementation and
substantial similarities.
</div>
> It is probably much safer to simply ask the people involved to agree
> with your relicensing. And so far I don't see anyone with a stake in
> this fiercely wanting to stick to a particular license.
As for politeness, that's exactly why I did post it and solicit
objections. The purpose of the rewrite is to avoid having to make
pessimistic assumptions about people who don't respond.
I suppose I should have made that request clearer:
Is there anyone who claims copyright on anything here?
Or would just like credit?
If so, are you willing to donate it to the public domain?
I like the GPL, but some code is generally applicable enough, and unlikely
to be improved on by others enough, that I'd rather just let everyone
use the same version.
> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.
> Not if you try with the unaligned put_be32() as the destination pointer
> is marked const.
Oops. And the "||" in get_be32() doesn't help either.
There was another major bug I found, too.
And I'm still trying to match "linusv" performance.
> As to the actual result on ARM... Well, the assembly _looks_ much worse
> than Linus' version. It uses a stack frame of 152 bytes instead of 64
> bytes. The resulting binary is also 6868 bytes large compared to 6180
> bytes. Surprisingly, the performance is not that bad (the reason for
> the underlined "looks" above) albeit still a bit worse, like 5% slower.
> I was expecting much worse than that.
Thanks, I'll work on it.
> Also, wrt skipping the last 3 write back to the 16 word array... For
> all the (limited) attempts I've made so far to do that, it always ended
> up making things worse. I've yet to investigate why though.
Thanks. That's certainly surprising!
If anyone cares, my current code (adapted to fit sha1bench) is appended.
/*
* Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
*
* This implementation is in the public domain. Copyright abandoned.
* You may do anything you like with it, including evil things.
*
* This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
* from the git mailing list (August, 2009). Additional optimization
* ideas cribbed from
* - Artur Skawina (x86, particularly P4, and much benchmarking)
* - Nicolas Pitre (ARM)
*
* Contributors to this code are requested to donate their changes to
* the public domain to avoid version skew. If you don't want to, be
* sure to change the copyright waiver above.
*/
#include <string.h> /* For memcpy() */
#include <arpa/inet.h> /* For ntohl() */
#include "pdsha1.h"
#define SHA_CTX pd_SHA_CTX
#define SHA_context pd_SHA_context
#define SHA1_Init pd_SHA1_Init
#define SHA1_Update pd_SHA1_Update
#define SHA1_Final pd_SHA1_Final
static void sha1_core(uint32_t iv[5], unsigned char const *p);
void SHA1_Init(struct SHA_context *c)
{
/* This is a prefix of the SHA_context structure */
static struct {
uint64_t len;
uint32_t iv[5];
} const iv = { 0,
{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }
};
memcpy(c, &iv, sizeof iv);
}
void SHA1_Update(struct SHA_context *c, void const *p, size_t n)
{
size_t pos = c->len & 63; /* Offset into current input block */
size_t nblocks;
c->len += n;
/* Initial partial block (if any) */
if (pos) {
size_t space = 63 - pos;
if (n < space)
goto end;
memcpy(c->buf + pos, p, space);
sha1_core(c->iv, c->buf);
n -= space;
p = (char const *)p + space;
}
/* The large middle piece */
nblocks = n >> 6;
while (nblocks--) {
sha1_core(c->iv, p);
p += 64;
}
n &= 63;
pos = 0;
end:
/* Final partial block (may be zero size) */
memcpy(c->buf + pos, p, n);
}
/* Machine specific hacks */
#if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \
defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \
defined(__s390__) || defined(__s390x__)
/* Unaligned access is okay */
#define GET_BE32(p) ntohl(*(uint32_t const *)(p))
#define PUT_BE32(p,v) (*(uint32_t *)(p) = htonl(v))
#else
/*
* Unaligned access is not okay; do conversion as byte fetches
* These macros are IN_CAPS, as they are not side-effect safe
* The argument p must be an "unsigned char *" or "uint8_t *"
*/
#define GET_BE32(p) \
((uint32_t)(p)[0]<<24 | (p)[1]<<16 | (p)[2]<<8 | (p)[3])
#define PUT_BE32(p,v) \
( (p)[0] = (v)>>24, (p)[1] = (v)>>16, (p)[2] = (v)>>8, (p)[3] = (v) )
#endif
void SHA1_Final(unsigned char hash[20], struct SHA_context *c)
{
size_t pos = c->len & 63;
unsigned i;
/* Append a single 1 bit */
c->buf[pos++] = 0x80;
/* Append 0 bits until 64 bits remain in a block */
if (pos > 56) {
memset(c->buf + pos, 0, 64 - pos);
sha1_core(c->iv, c->buf);
pos = 0;
}
memset(c->buf + pos, 0, 56 - pos);
/* Append total input length in bits */
((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29));
((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3);
/* Final hash round */
sha1_core(c->iv, c->buf);
/* Copy hash result out */
for (i = 0; i < 5; i++)
PUT_BE32(hash + 4*i, c->iv[i]);
}
/*
* Helper macros for sha1_core function. To avoid clutter, these macros
* are NOT fully parenthesized if it doesn't matter to the actual use.
*/
#if __GNUC__ && (defined(__i386__) || defined(__x86_64__))
/*
* GCC by itself only generates left rotates. Use right rotates if
* possible to be kinder to dinky implementations with iterative rotate
* instructions.
*/
#define ROT(op, x, k) \
({ uint32_t y; __asm__(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define ROL(x,k) ROT("roll", x, k)
#define ROR(x,k) ROT("rorl", x, k)
#else
/* Generic C equivalent */
#define ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define ROL(x,k) ROT(x,k,32-(k))
#define ROR(x,k) ROT(x,32-(k),k)
#endif
/*
* An important temporary array in SHA-1 is the working array W[],
* which holds a scheduled key word per round. Only the last 16 words
* are relevant, so only use 16 words...
*/
#define W(i) w[(i) & 15]
/*
* If you have a small register set, it helps GCC to force stores to
* the w[] array to memory. Given 22 or more registers (e.g. PowerPC),
* GCC can map the entire w[] array to registers and this becomes
* counterproductive.
*
* The optimal kludge seems to differ between x86 and ARM. On the latter,
* forcing a full memory barrier is required to stop GCC from splitting
* live ranges for each round and generating a separate stack slot for
* each. (Which produces a huge stack frame and kills performance.)
*/
#if __GNUC__ && defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) W(i) = x; __asm__ volatile("" : "+m" (W(x)))
#elif defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) *(volatile uint32_t *)&W(i) = x
#elif __GNUC__ && defined(__arm__)
#define STORE(i, x) W(i) = x; __asm__("":::"memory")
#else
#define STORE(i, x) W(i) = x
#endif
/* The three round functions. F2 is also used as F4 */
#define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */
#define F2(b,c,d) (d ^ c ^ b) /* Even parity */
#define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */
/* The four round constants */
#define K1 0x5a827999 /* 2^30 * sqrt(2) */
#define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */
#define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */
#define K4 0xca62c1d6 /* 2^30 * sqrt(10) */
/* Rounds 0..15 fetch a word from the input */
#define FETCH(t,i) t = GET_BE32(p + 4*(i)); STORE(i,t)
//#define FETCH(t,i) t = W(i)
/* Rounds 16..79 mix previous words to get a new one */
#define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1)
/* Rounds 16..76 have to store back the result */
#define CALC(t,i) MIX(t,i); STORE(i,t)
/* The basic SHA-1 round */
#if 0
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \
src(t,i); \
e += t + f(b,c,d) + k + ROL(a,5); \
b = ROR(b,2)
#elif 0
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \
uint32_t t; \
src(t,i); \
e += f(b,c,d); \
e += t + k; \
b = ROR(b,2); \
e += ROL(a,5); } while (0)
#else
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \
uint32_t t = ROL(a,5); \
e += f(b,c,d); \
e += k + t; \
src(t,i); \
b = ROR(b,2); \
e += t; } while(0)
#endif
/* An aligned group of 5 rounds */
#define SHA_ROUND5(f,k,src,t,i) \
SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \
SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \
SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \
SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \
SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4)
/*
* The core SHA-1 transform.
*
* iv[5] = Current SHA-1 hash state.
* p = Pointer to 64 bytes of source data. Not necessarily aligned.
*/
static void
sha1_core(uint32_t iv[5], unsigned char const *p)
{
//uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0];
uint32_t a = iv[0], b = iv[1], c = iv[2], d = iv[3], e = iv[4];
uint32_t w[16];
#if 0
{
int i;
for (i = 0; i < 16; i++) {
STORE(i, get_be32(p + 4*i));
}
}
#endif
SHA_ROUND5(F1, K1, FETCH, t, 0);
SHA_ROUND5(F1, K1, FETCH, t, 5);
SHA_ROUND5(F1, K1, FETCH, t, 10);
SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15);
SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16);
SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17);
SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18);
SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19);
SHA_ROUND5(F2, K2, CALC, t, 20);
SHA_ROUND5(F2, K2, CALC, t, 25);
SHA_ROUND5(F2, K2, CALC, t, 30);
SHA_ROUND5(F2, K2, CALC, t, 35);
SHA_ROUND5(F3, K3, CALC, t, 40);
SHA_ROUND5(F3, K3, CALC, t, 45);
SHA_ROUND5(F3, K3, CALC, t, 50);
SHA_ROUND5(F3, K3, CALC, t, 55);
SHA_ROUND5(F2, K4, CALC, t, 60);
SHA_ROUND5(F2, K4, CALC, t, 65);
SHA_ROUND5(F2, K4, CALC, t, 70);
SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75);
SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76);
/* Last 3 rounds don't need to store mixed value */
SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77);
SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78);
SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79);
iv[0] += a;
iv[1] += b;
iv[2] += c;
iv[3] += d;
iv[4] += e;
}
/* Compile with -DUNITTEST for self-tests */
#if UNITTEST
#include <stdio.h>
#include <stdlib.h>
/* Known answer test */
static void katest(char const *text, size_t len, unsigned char const hash[20])
{
SHA_CTX c;
unsigned char hash2[20];
int i;
SHA1_Init(&c);
SHA1_Update(&c, text, len);
SHA1_Final(hash2, &c);
for (i = 0; i < 20; i++)
if (hash[i] != hash2[i])
goto mismatch;
printf("%u-byte known answer test PASSED\n", len);
return;
mismatch:
printf("%u-byte known answer test FAILED:\n", len);
if (len < 64)
printf("Input: \"%.*s\"\n", len, text);
else
printf("Input: \"%.64s\"...\n", text);
printf("Computed:");
for (i = 0; i < 20; i++)
printf(" %02x", hash2[i]);
printf("\nExpected:");
for (i = 0; i < 20; i++)
printf(" %02x", hash[i]);
putchar('\n');
}
int
main(void)
{
/* FIPS PUB 180.1 example A.1 */
static char const text1[3] = "abc";
static unsigned char const hash1[20] = {
0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e,
0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d };
/* FIPS PUB 180.1 example A.2 */
static char const text2[56] =
"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
static unsigned char const hash2[20] = {
0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae,
0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 };
/* FIPS PUB 180.1 example A.3 */
char *text3;
static unsigned char const hash3[20] = {
0x34, 0xaa, 0x97, 0x3c, 0xd4, 0xc4, 0xda, 0xa4, 0xf6, 0x1e,
0xeb, 0x2b, 0xdb, 0xad, 0x27, 0x31, 0x65, 0x34, 0x01, 0x6f };
katest(text1, sizeof text1, hash1);
katest(text2, sizeof text2, hash2);
text3 = malloc(1000000);
memset(text3, 'a', 1000000);
katest(text3, 1000000, hash3);
return 0;
}
#endif
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 18:54 ` George Spelvin
@ 2009-08-17 19:34 ` Nicolas Pitre
2009-08-17 23:12 ` George Spelvin
0 siblings, 1 reply; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 19:34 UTC (permalink / raw)
To: George Spelvin; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds
On Mon, 17 Aug 2009, George Spelvin wrote:
> > I don't think this trick of making source code textually different from
> > another work while still intimately mimicking the same structure entitles
> > you to any copyright (or non copyright) claims over that other work. I
> > certainly wouldn't bet any dime for this standing up in court.
>
> <div type="legal digression">
> Actually, I would. I did a lot more than text search and replace;
> I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright)
> and then merged in the *ideas* from the mailing list.
>
> (And from elsewhere; the idea of a five-round macro is from Brian Gladman.)
>
> Remember, to the extent that something is *functional*, it is not
> copyrightable; copyright only covers the non-functional expressive bits.
> The vast majority of that code is simply required by the standard,
> or the desired calling interface.
>
> For a large portion of the rest, remember that standard programming
> conventions (e.g. brace style, macro names IN CAPS, etc.) that's also
> non-copyrightable "scene a faire" material.
>
> It's well established that paraphrasing a recipe avoids copyright;
> the proportions and treatment of the ingredients is not copyrightable.
>
> For more details, see the extensive coverage of the NEC v. Intel decision
> (1989) regarding the firmware for NEC's 8086-clone V20 microprocessor.
> It was found non-infringing despite non-clean-room implementation and
> substantial similarities.
> </div>
Whatever. NEC and Intel were certainly commercial competitors. They
were far from being friends. So if you feel like having too many
friends then just go ahead with that stance.
> As for politeness, that's exactly why I did post it and solicit
> objections.
You said:
|It uses Linus's and Artur's performance ideas, and some of Linus' macro
|ideas (in the rotate implementation), but tries to be textually
|different. Is there anything recognizable that anyone cares to clam
|copyright to?
the "try to be textually different" in order to ask for "anything
recognizable that anyone cares to clam copyright to" is what I find
dubious.
> The purpose of the rewrite is to avoid having to make
> pessimistic assumptions about people who don't respond.
>
> I suppose I should have made that request clearer:
> Is there anyone who claims copyright on anything here?
> Or would just like credit?
> If so, are you willing to donate it to the public domain?
I think this is much nicer to everyone involved.
As far as I'm concerned, I'm OK with giving any small copyright I might
have in this SHA1 implementation, if any, to the public domain.
Credits are always nice.
Nicolas
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 16:22 ` Linus Torvalds
@ 2009-08-17 21:43 ` Steven Noonan
0 siblings, 0 replies; 28+ messages in thread
From: Steven Noonan @ 2009-08-17 21:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils,
Git Mailing List
On Mon, Aug 17, 2009 at 9:22 AM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 17 Aug 2009, Steven Noonan wrote:
>>
>> Interesting. I compared Linus' implementation to the public domain one
>> by Steve Reid[1]
>
> You _really_ need to talk about what kind of environment you have.
>
> There are three major issues:
> - Netburst vs non-netburst
> - 32-bit vs 64-bit
> - compiler version
Right. I'm running a Core 2 "Merom" 2.33GHz. The code was compiled for
x86_64 with GCC 4.2.1. I didn't _expect_ it to compile for x86_64, but
apparently the version of GCC that ships with Xcode 3.2 defaults to
compiling 64-bit code on machines that are capable of running it.
>
> Steve Reid's code looks great, but the way it is coded, gcc makes a mess
> of it, which is exactly what my SHA1 tries to avoid.
>
> [ In contrast, gcc does very well on just about _any_ straightforward
> unrolled SHA1 C code if the target architecture is something like PPC or
> ia64 that has enough registers to keep it all in registers.
>
> I haven't really tested other compilers - a less aggressive compiler
> would actually do _better_ on SHA1, because the problem with gcc is that
> it turns the whole temporary 16-entry word array into register accesses,
> and tries to do register allocation on that _array_.
>
> That is wonderful for the above-mentioned PPC and IA64, but it makes gcc
> create totally crazy code when there aren't enough registers, and then
> gcc starts spilling randomly (ie it starts spilling a-e etc). This is
> why the compiler and version matters so much. ]
>
>> (average of 5 runs)
>> Linus' sha1: 283MB/s
>> Steve Reid's sha1: 305MB/s
>
> So I get very different results:
>
> # TIME[s] SPEED[MB/s]
> Reid 2.742 222.6
> linus 1.464 417
Added -m32:
Steve Reid: 156MB/s
Linus: 209MB/s
So on x86, your code really kicks butt.
> this is Intel Nehalem, but compiled for 32-bit mode (which is the more
> challenging one because x86-32 only has 7 general-purpose registers), and
> with gcc-4.4.0.
>
> Linus
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 19:34 ` Nicolas Pitre
@ 2009-08-17 23:12 ` George Spelvin
0 siblings, 0 replies; 28+ messages in thread
From: George Spelvin @ 2009-08-17 23:12 UTC (permalink / raw)
To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds
>> The purpose of the rewrite is to avoid having to make
>> pessimistic assumptions about people who don't respond.
>>
>> I suppose I should have made that request clearer:
>> Is there anyone who claims copyright on anything here?
>> Or would just like credit?
>> If so, are you willing to donate it to the public domain?
> I think this is much nicer to everyone involved.
>
> As far as I'm concerned, I'm OK with giving any small copyright I might
> have in this SHA1 implementation, if any, to the public domain.
> Credits are always nice.
My apologies. I read a lot of people talking about wanting the code
under different licenses, and thought I'd just cut through it by
providing some PD code.
I didn't turn around and look at it from the point of view of the
people who'd put the work into developing it. I don't mean to deny
anyone credit for their work. In fact, providing more detail is on
the to-do list, but I haven't waded through the mail archives and
tracked down who contributed what yet.
I'll work on those polish details once I have it producing the same
assembly code as Linus'.
There are a lot of possible highly-permissive licenses if one is wanted
(zlib, MIT, CC-by), but public domain seems simpler.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-17 1:55 ` Nicolas Pitre
@ 2009-08-26 11:39 ` Pádraig Brady
2017-04-20 21:35 ` galt
2017-04-20 21:38 ` galt
0 siblings, 2 replies; 28+ messages in thread
From: Pádraig Brady @ 2009-08-26 11:39 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Linus Torvalds, John Tapsell, Bryan Donlan, Bug-coreutils,
Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher
Nicolas Pitre wrote:
> On Sat, 15 Aug 2009, Linus Torvalds wrote:
>
>> (Heh. Looking at that, I probably should move the 'size' field first,
>> since that would have different alignment rules, and the struct would be
>> more tightly packed that way, and initialize better).
>
> I was about to suggest (i.e. post) a patch for that. This is indeed a
> good idea.
>
>> Afaik, none of the actual code remains (the mozilla SHA1 thing did the
>> wrong thing for performance even for just the final bytes, and did those a
>> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
>
> Maybe a patch adding a proper header with the actual license would be a
> good idea too.
So have you decided on a final licence,
and if so update the headers accordingly?
cheers!
Pádraig.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-26 11:39 ` Pádraig Brady
@ 2017-04-20 21:35 ` galt
2017-04-20 21:38 ` galt
1 sibling, 0 replies; 28+ messages in thread
From: galt @ 2017-04-20 21:35 UTC (permalink / raw)
To: git
I also wanted to include Linus' sha1 in our software at work.
But the GPLv2 license was incompatible.
Too bad it is just just in the public domain.
I grabbed Steve Reid's public domain code from 1999
and ran it. It produced the same output.
I ran it on a 3GB input file, and Linus' code from 2009 takes 37 to 40
seconds.
(Just reading the file in the same 4k buffers only takes 3 seconds
so disk reading does not dominate the time.)
When I ran Steve's old version on the same input it was taking just 36 or 37
seconds.
So it is slightly faster.
Have compilers improved?
I am using gcc 4.4.7-17.
--
View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657473.html
Sent from the git mailing list archive at Nabble.com.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster!
2009-08-26 11:39 ` Pádraig Brady
2017-04-20 21:35 ` galt
@ 2017-04-20 21:38 ` galt
1 sibling, 0 replies; 28+ messages in thread
From: galt @ 2017-04-20 21:38 UTC (permalink / raw)
To: git
A Phádraig, cá bhfuil tú i do chónaí?
Tá mé i gCalafoirne.
--
View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657474.html
Sent from the git mailing list archive at Nabble.com.
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2017-04-20 21:38 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
2009-08-15 20:02 ` Bryan Donlan
2009-08-15 20:12 ` John Tapsell
2009-08-15 20:23 ` Linus Torvalds
2009-08-15 20:54 ` Linus Torvalds
2009-08-17 1:55 ` Nicolas Pitre
2009-08-26 11:39 ` Pádraig Brady
2017-04-20 21:35 ` galt
2017-04-20 21:38 ` galt
2009-08-17 8:22 ` Andreas Ericsson
2009-08-16 0:06 ` Theodore Tso
2009-08-16 19:25 ` Giuseppe Scrivano
2009-08-16 20:10 ` Linus Torvalds
2009-08-16 22:15 ` Giuseppe Scrivano
2009-08-16 22:47 ` Linus Torvalds
2009-08-17 1:53 ` Pádraig Brady
2009-08-17 10:51 ` Giuseppe Scrivano
2009-08-17 15:44 ` Steven Noonan
2009-08-17 16:22 ` Linus Torvalds
2009-08-17 21:43 ` Steven Noonan
2009-08-17 17:32 ` Giuseppe Scrivano
-- strict thread matches above, loose matches on Subject: below --
2009-08-17 7:23 George Spelvin
2009-08-17 14:20 ` Nicolas Pitre
2009-08-17 17:06 ` Nicolas Pitre
2009-08-17 17:20 ` Paolo Bonzini
2009-08-17 18:54 ` George Spelvin
2009-08-17 19:34 ` Nicolas Pitre
2009-08-17 23:12 ` George Spelvin
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).