* Linus' sha1 is much faster! @ 2009-08-14 23:25 Pádraig Brady 2009-08-15 20:02 ` Bryan Donlan 2009-08-16 19:25 ` Giuseppe Scrivano 0 siblings, 2 replies; 28+ messages in thread From: Pádraig Brady @ 2009-08-14 23:25 UTC (permalink / raw) To: Bug-coreutils, Linus Torvalds; +Cc: Git Mailing List [-- Attachment #1: Type: text/plain, Size: 1110 bytes --] I've noticed before that coreutils hashing utils were a little behind in performance, but was prompted to look at it again when I noticed the recently updated sha1 implementation in git: http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu Testing that with the attached program which I wrote in a couple of mins to try and match sha1sum's system calls shows that it's around 33% faster, as shown below: $ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1 $ time ./linus-sha1 300MB_file df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file real 0m2.742s user 0m2.516s sys 0m0.206s $ time ~/git/coreutils/src/sha1sum 300MB_file df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file real 0m4.166s user 0m3.846s sys 0m0.298s So, could we use that code in coreutils? Think of all the dead fish it would save. I've also attached a trivial block-sha1 patch which doesn't affect performance, but does suppress a signed unsigned comparison warning which occurs with -Wextra for example. cheers, Pádraig. [-- Attachment #2: linus-sha1.c --] [-- Type: text/x-csrc, Size: 681 bytes --] /* gcc -O2 -Wall linus-sha1.c sha1.c -o linus-sha1 */ #include <stdio.h> #include <stdlib.h> #include "sha1.h" int main(int argc, char** argv) { if (argc != 2) return 1; const char* filename = argv[1]; FILE *fp = fopen (filename, "r"); if (!fp) return 1; #define BS 4096 /* match coreutils */ blk_SHA_CTX ctx; blk_SHA1_Init(&ctx); size_t nr; char buf[BS]; while ((nr=fread_unlocked(buf, 1, sizeof(buf), fp))) blk_SHA1_Update(&ctx, buf, nr); unsigned char hash[20]; blk_SHA1_Final(hash, &ctx); int i; for (i=0; i<sizeof(hash); i++) printf("%02x",*(hash+i)); printf(" %s\n", filename); return 0; } [-- Attachment #3: block-sha1-signed-warning.diff --] [-- Type: text/x-patch, Size: 998 bytes --] >From fa75e818836f763357ff9b7bbde3327e1aabbe47 Mon Sep 17 00:00:00 2001 From: =?utf-8?q?P=C3=A1draig=20Brady?= <P@draigBrady.com> Date: Sat, 15 Aug 2009 00:17:30 +0100 Subject: [PATCH] block-sha1: suppress signed unsigned comparison warning * block-sha1/sha1.c: Use unsigned ints as the values will never go negative. --- block-sha1/sha1.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index d3121f7..be763d8 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -231,13 +231,13 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx) void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) { - int lenW = ctx->size & 63; + unsigned int lenW = ctx->size & 63; ctx->size += len; /* Read the data into W and process blocks as they get full */ if (lenW) { - int left = 64 - lenW; + unsigned int left = 64 - lenW; if (len < left) left = len; memcpy(lenW + (char *)ctx->W, data, left); -- 1.6.2.5 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady @ 2009-08-15 20:02 ` Bryan Donlan 2009-08-15 20:12 ` John Tapsell 2009-08-16 19:25 ` Giuseppe Scrivano 1 sibling, 1 reply; 28+ messages in thread From: Bryan Donlan @ 2009-08-15 20:02 UTC (permalink / raw) To: Pádraig Brady Cc: Bug-coreutils, Linus Torvalds, Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre 2009/8/14 Pádraig Brady <P@draigbrady.com>: > I've noticed before that coreutils hashing utils > were a little behind in performance, but was prompted > to look at it again when I noticed the recently > updated sha1 implementation in git: > http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu > > Testing that with the attached program which I wrote > in a couple of mins to try and match sha1sum's system calls > shows that it's around 33% faster, as shown below: > > $ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1 > > $ time ./linus-sha1 300MB_file > df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file > real 0m2.742s > user 0m2.516s > sys 0m0.206s > > $ time ~/git/coreutils/src/sha1sum 300MB_file > df1e19e245fee4f53087b50ef953ca2c8d1644d7 300MB_file > > real 0m4.166s > user 0m3.846s > sys 0m0.298s > > So, could we use that code in coreutils? > Think of all the dead fish it would save. coreutils is licensed under GPLv3, and git under GPLv2 (only), so you'd need permission from all contributors to the implementation in order to relicense under GPLv3. A quick grep of the history suggests these contributors to be: Brandon Casey <drafnel@gmail.com> Junio C Hamano <gitster@pobox.com> Linus Torvalds <torvalds@linux-foundation.org> Nicolas Pitre <nico@cam.org> (adding these people to the CC list) Additionally, it was originally based on the code in mozilla-sha1/sha1.c, but that contains a license grant allowing it to be used under GPLv2 /or later/, so if GPLv3 relicensing is enough it shouldn't be necessary to get in contact with the original author. However if the FSF requires copyright assignment to accept the new implementation, it will be necessary to track down contributors to the original mozilla-sha1/sha1.c as well. Note that I'm not a lawyer, so there might be other roadblocks etc to this as well, etc :) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:02 ` Bryan Donlan @ 2009-08-15 20:12 ` John Tapsell 2009-08-15 20:23 ` Linus Torvalds 2009-08-16 0:06 ` Theodore Tso 0 siblings, 2 replies; 28+ messages in thread From: John Tapsell @ 2009-08-15 20:12 UTC (permalink / raw) To: Bryan Donlan Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds, Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre 2009/8/15 Bryan Donlan <bdonlan@gmail.com>: > coreutils is licensed under GPLv3, and git under GPLv2 (only), so > you'd need permission from all contributors to the implementation in > order to relicense under GPLv3. A quick grep of the history suggests > these contributors to be: X11 also requires a fast SHA1 implementation. It uses this to check if two pixmaps are the same. So it would be really nice to relicense under a liberal enough license that xorg can use it. John ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:12 ` John Tapsell @ 2009-08-15 20:23 ` Linus Torvalds 2009-08-15 20:54 ` Linus Torvalds 2009-08-16 0:06 ` Theodore Tso 1 sibling, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2009-08-15 20:23 UTC (permalink / raw) To: John Tapsell Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre On Sat, 15 Aug 2009, John Tapsell wrote: > 2009/8/15 Bryan Donlan <bdonlan@gmail.com>: > > coreutils is licensed under GPLv3, and git under GPLv2 (only), so > > you'd need permission from all contributors to the implementation in > > order to relicense under GPLv3. A quick grep of the history suggests > > these contributors to be: > > X11 also requires a fast SHA1 implementation. It uses this to check > if two pixmaps are the same. So it would be really nice to relicense > under a liberal enough license that xorg can use it. I'm personally ok with retaining the mozilla-sha1 license. There's not really anything _remaining_ of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing. Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;) That said, I don't know if the MPL is ok for X11. I've not looked at compatibility issues with MPL. For git, we could just ignore the MPL, since the GPLv2 was acceptable regardless of it. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:23 ` Linus Torvalds @ 2009-08-15 20:54 ` Linus Torvalds 2009-08-17 1:55 ` Nicolas Pitre 2009-08-17 8:22 ` Andreas Ericsson 0 siblings, 2 replies; 28+ messages in thread From: Linus Torvalds @ 2009-08-15 20:54 UTC (permalink / raw) To: John Tapsell Cc: Paul Kocher, Bryan Donlan, Bug-coreutils, Brandon Casey, Junio C Hamano, Nicolas Pitre, Pádraig Brady, Git Mailing List On Sat, 15 Aug 2009, Linus Torvalds wrote: > > That said, I don't know if the MPL is ok for X11. I've not looked at > compatibility issues with MPL. For git, we could just ignore the MPL, > since the GPLv2 was acceptable regardless of it. If MPL isn't ok for X11, then we'd need to make sure that even the silliest Mozilla crud has been rewritten. There really isn't much, but hey, the _history_ is based on the mozilla code, and who knows - the 'blk_SHA_CTX' struct has things like the fields in the same order as the Mozilla equivalent, for all those historical reasons. (Heh. Looking at that, I probably should move the 'size' field first, since that would have different alignment rules, and the struct would be more tightly packed that way, and initialize better). Afaik, none of the actual code remains (the mozilla SHA1 thing did the wrong thing for performance even for just the final bytes, and did those a byte at a time etc, so I rewrote even the trivial SHA1_Final parts). Of course, maybe the Mozilla people would be interested in taking my faster version, and say that the new-BSD license is ok, and make everybody happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added him to the Cc. Paul, for your information, we're talking about a faster rewritten "mostly portable" SHA1 routines that you can find at http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu (follow the "blob" pointers to see sha1.c and sha1.h). I don't know if you're active with Mozilla/Firefox or whether you even care, but you seem to be the logical choice of person to ask. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:54 ` Linus Torvalds @ 2009-08-17 1:55 ` Nicolas Pitre 2009-08-26 11:39 ` Pádraig Brady 2009-08-17 8:22 ` Andreas Ericsson 1 sibling, 1 reply; 28+ messages in thread From: Nicolas Pitre @ 2009-08-17 1:55 UTC (permalink / raw) To: Linus Torvalds Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils, Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher On Sat, 15 Aug 2009, Linus Torvalds wrote: > (Heh. Looking at that, I probably should move the 'size' field first, > since that would have different alignment rules, and the struct would be > more tightly packed that way, and initialize better). I was about to suggest (i.e. post) a patch for that. This is indeed a good idea. > Afaik, none of the actual code remains (the mozilla SHA1 thing did the > wrong thing for performance even for just the final bytes, and did those a > byte at a time etc, so I rewrote even the trivial SHA1_Final parts). Maybe a patch adding a proper header with the actual license would be a good idea too. Nicolas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 1:55 ` Nicolas Pitre @ 2009-08-26 11:39 ` Pádraig Brady 2017-04-20 21:35 ` galt 2017-04-20 21:38 ` galt 0 siblings, 2 replies; 28+ messages in thread From: Pádraig Brady @ 2009-08-26 11:39 UTC (permalink / raw) To: Nicolas Pitre Cc: Linus Torvalds, John Tapsell, Bryan Donlan, Bug-coreutils, Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher Nicolas Pitre wrote: > On Sat, 15 Aug 2009, Linus Torvalds wrote: > >> (Heh. Looking at that, I probably should move the 'size' field first, >> since that would have different alignment rules, and the struct would be >> more tightly packed that way, and initialize better). > > I was about to suggest (i.e. post) a patch for that. This is indeed a > good idea. > >> Afaik, none of the actual code remains (the mozilla SHA1 thing did the >> wrong thing for performance even for just the final bytes, and did those a >> byte at a time etc, so I rewrote even the trivial SHA1_Final parts). > > Maybe a patch adding a proper header with the actual license would be a > good idea too. So have you decided on a final licence, and if so update the headers accordingly? cheers! Pádraig. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-26 11:39 ` Pádraig Brady @ 2017-04-20 21:35 ` galt 2017-04-20 21:38 ` galt 1 sibling, 0 replies; 28+ messages in thread From: galt @ 2017-04-20 21:35 UTC (permalink / raw) To: git I also wanted to include Linus' sha1 in our software at work. But the GPLv2 license was incompatible. Too bad it is just just in the public domain. I grabbed Steve Reid's public domain code from 1999 and ran it. It produced the same output. I ran it on a 3GB input file, and Linus' code from 2009 takes 37 to 40 seconds. (Just reading the file in the same 4k buffers only takes 3 seconds so disk reading does not dominate the time.) When I ran Steve's old version on the same input it was taking just 36 or 37 seconds. So it is slightly faster. Have compilers improved? I am using gcc 4.4.7-17. -- View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657473.html Sent from the git mailing list archive at Nabble.com. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-26 11:39 ` Pádraig Brady 2017-04-20 21:35 ` galt @ 2017-04-20 21:38 ` galt 1 sibling, 0 replies; 28+ messages in thread From: galt @ 2017-04-20 21:38 UTC (permalink / raw) To: git A Phádraig, cá bhfuil tú i do chónaí? Tá mé i gCalafoirne. -- View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657474.html Sent from the git mailing list archive at Nabble.com. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:54 ` Linus Torvalds 2009-08-17 1:55 ` Nicolas Pitre @ 2009-08-17 8:22 ` Andreas Ericsson 1 sibling, 0 replies; 28+ messages in thread From: Andreas Ericsson @ 2009-08-17 8:22 UTC (permalink / raw) To: Linus Torvalds Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils, Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre, Paul Kocher, Paul Kocher Linus Torvalds wrote: > > On Sat, 15 Aug 2009, Linus Torvalds wrote: >> That said, I don't know if the MPL is ok for X11. I've not looked at >> compatibility issues with MPL. For git, we could just ignore the MPL, >> since the GPLv2 was acceptable regardless of it. > > If MPL isn't ok for X11, then we'd need to make sure that even the > silliest Mozilla crud has been rewritten. There really isn't much, but > hey, the _history_ is based on the mozilla code, and who knows - the > 'blk_SHA_CTX' struct has things like the fields in the same order as the > Mozilla equivalent, for all those historical reasons. > > (Heh. Looking at that, I probably should move the 'size' field first, > since that would have different alignment rules, and the struct would be > more tightly packed that way, and initialize better). > > Afaik, none of the actual code remains (the mozilla SHA1 thing did the > wrong thing for performance even for just the final bytes, and did those a > byte at a time etc, so I rewrote even the trivial SHA1_Final parts). > > Of course, maybe the Mozilla people would be interested in taking my > faster version, and say that the new-BSD license is ok, and make everybody > happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added > him to the Cc. > > Paul, for your information, we're talking about a faster rewritten "mostly > portable" SHA1 routines that you can find at > > http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu > > (follow the "blob" pointers to see sha1.c and sha1.h). I don't know if > you're active with Mozilla/Firefox or whether you even care, but you seem > to be the logical choice of person to ask. > I contacted Paul in february this year to get permission to use the mozilla sha1 code for libgit2. His reply then was: "I'm not sure which version the diffs are relative to, so I haven't reviewed them. It's fine to distribute under BSD, GPL, or LGPL, however." I also got explicit permission to relicense it under GPLv2 with the gcc exception. I added the mail-address I used to contact him to CC as well. Sorry if you get this twice, Paul. Naturally, I'd like to use the faster version for libgit2 as well. The people who Linus listed as contributors earlier (Brandon Casy, Linus, Junio and Nicolas Pitre) have already consented to relicense their git contributions for libgit2 use. If anyone would like to revoke that consent for this code, speak now please, or I'll patch it into libgit2 as well. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-15 20:12 ` John Tapsell 2009-08-15 20:23 ` Linus Torvalds @ 2009-08-16 0:06 ` Theodore Tso 1 sibling, 0 replies; 28+ messages in thread From: Theodore Tso @ 2009-08-16 0:06 UTC (permalink / raw) To: John Tapsell Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Linus Torvalds, Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre On Sat, Aug 15, 2009 at 09:12:58PM +0100, John Tapsell wrote: > 2009/8/15 Bryan Donlan <bdonlan@gmail.com>: > > coreutils is licensed under GPLv3, and git under GPLv2 (only), so > > you'd need permission from all contributors to the implementation in > > order to relicense under GPLv3. A quick grep of the history suggests > > these contributors to be: > > X11 also requires a fast SHA1 implementation. It uses this to check > if two pixmaps are the same. So it would be really nice to relicense > under a liberal enough license that xorg can use it. If the checksum isn't being exposed in the protocol (i.e., it's just internal to the X server), one possibility for X11 is to consider to use the SHA-3 candidate Skein instead. After receiving a large amount of evaluation by cryptographic experts, it was one of the 18 algorithms (our of an original 64 entries) that have made it the 2nd round of the NIST competition. It's also *substantially* faster than SHA: One exception to this is Skein, created by several well-known cryptographers and noted pundit Bruce Schneier. It was designed specifically to exploit all three of the Core 2 execution units and to run at a full 64-bits. This gives it roughly four to 10 times the logic density of competing submissions. This is what I meant by the Matrix quote above. They didn't bend the spoon; they bent the crypto algorithm. They moved the logic operations around in a way that wouldn't weaken the crypto, but would strengthen its speed on the Intel Core 2. In their paper (PDF), the authors of Skein express surprise that a custom silicon ASIC implementation is not any faster than the software implementation. They shouldn't be surprised. Every time you can redefine a problem to run optimally in software, you will reach the same speeds you get with optimized ASIC hardware. The reason software has a reputation of being slow is because people don't redefine the original problem. http://www.darkreading.com/blog/archives/2008/11/bending_skein_c.html For more information and some optimized implementation, see: http://www.skein-hash.info/ - Ted ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady 2009-08-15 20:02 ` Bryan Donlan @ 2009-08-16 19:25 ` Giuseppe Scrivano 2009-08-16 20:10 ` Linus Torvalds 2009-08-17 1:53 ` Pádraig Brady 1 sibling, 2 replies; 28+ messages in thread From: Giuseppe Scrivano @ 2009-08-16 19:25 UTC (permalink / raw) To: Pádraig Brady; +Cc: Linus Torvalds, Bug-coreutils, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 575 bytes --] Hi Pádraig, I tried to reproduce your results but I wasn't able to do it. The biggest difference on a 300MB file I noticed was approximately 15% using on both implementations -O2, and 5% using -O3. My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R) Pentium(R) D CPU 3.20GHz. I also spent some time trying to improve the gnulib SHA1 implementation and it seems a lookup table can improve things a bit. Can you please try the patch that I have attached and tell me which performance difference (if any) you get? Thanks, Giuseppe [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-SHA1-use-a-lookup-table-for-faster-hashing.patch --] [-- Type: text/x-diff, Size: 8582 bytes --] >From b975a5e0849eaa46e5cf410c5bf6e2308f044d61 Mon Sep 17 00:00:00 2001 From: Giuseppe Scrivano <gscrivano@gnu.org> Date: Sun, 16 Aug 2009 20:53:54 +0200 Subject: [PATCH] SHA1: use a lookup table for faster hashing * lib/sha1.c (struct sha1_pre): New member. * lib/sha1.c (sha1_process_block): Use the lookup table to quickly find indices to use in the current round. --- lib/sha1.c | 160 ++++++++++++++++++++++++++++++++++------------------------- 1 files changed, 92 insertions(+), 68 deletions(-) diff --git a/lib/sha1.c b/lib/sha1.c index 9c6c7ae..ec18ba7 100644 --- a/lib/sha1.c +++ b/lib/sha1.c @@ -283,6 +283,32 @@ sha1_process_bytes (const void *buffer, size_t len, struct sha1_ctx *ctx) #define F3(B,C,D) ( ( B & C ) | ( D & ( B | C ) ) ) #define F4(B,C,D) (B ^ C ^ D) +struct lookup_t +{ + unsigned char l1 : 4; + unsigned char l2 : 4; + unsigned char l3 : 4; + unsigned char l4 : 4; +}; + +const static struct lookup_t +sha1_pre[16] = {{(0 - 3) & 0x0f, (0 - 8) & 0x0f, (0 - 14) & 0x0f}, + {(1 - 3) & 0x0f, (1 - 8) & 0x0f, (1 - 14) & 0x0f}, + {(2 - 3) & 0x0f, (2 - 8) & 0x0f, (2 - 14) & 0x0f}, + {(3 - 3) & 0x0f, (3 - 8) & 0x0f, (3 - 14) & 0x0f}, + {(4 - 3) & 0x0f, (4 - 8) & 0x0f, (4 - 14) & 0x0f}, + {(5 - 3) & 0x0f, (5 - 8) & 0x0f, (5 - 14) & 0x0f}, + {(6 - 3) & 0x0f, (6 - 8) & 0x0f, (6 - 14) & 0x0f}, + {(7 - 3) & 0x0f, (7 - 8) & 0x0f, (7 - 14) & 0x0f}, + {(8 - 3) & 0x0f, (8 - 8) & 0x0f, (8 - 14) & 0x0f}, + {(9 - 3) & 0x0f, (9 - 8) & 0x0f, (9 - 14) & 0x0f}, + {(10 - 3) & 0x0f, (10 - 8) & 0x0f, (10 - 14) & 0x0f}, + {(11 - 3) & 0x0f, (11 - 8) & 0x0f, (11 - 14) & 0x0f}, + {(12 - 3) & 0x0f, (12 - 8) & 0x0f, (12 - 14) & 0x0f}, + {(13 - 3) & 0x0f, (13 - 8) & 0x0f, (13 - 14) & 0x0f}, + {(14 - 3) & 0x0f, (14 - 8) & 0x0f, (14 - 14) & 0x0f}, + {(15 - 3) & 0x0f, (15 - 8) & 0x0f, (15 - 14) & 0x0f}}; + /* Process LEN bytes of BUFFER, accumulating context into CTX. It is assumed that LEN % 64 == 0. Most of this code comes from GnuPG's cipher/sha1.c. */ @@ -309,9 +335,8 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx) #define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n)))) -#define M(I) ( tm = x[I&0x0f] ^ x[(I-14)&0x0f] \ - ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \ - , (x[I&0x0f] = rol(tm, 1)) ) +#define M(I) (x[I] = rol (x[sha1_pre[I].l1] ^ x[sha1_pre[I].l2] \ + ^ x[sha1_pre[I].l3] ^ x[I], 1)) #define R(A,B,C,D,E,F,K,M) do { E += rol( A, 5 ) \ + F( B, C, D ) \ @@ -322,7 +347,6 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx) while (words < endp) { - uint32_t tm; int t; for (t = 0; t < 16; t++) { @@ -346,70 +370,70 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx) R( c, d, e, a, b, F1, K1, x[13] ); R( b, c, d, e, a, F1, K1, x[14] ); R( a, b, c, d, e, F1, K1, x[15] ); - R( e, a, b, c, d, F1, K1, M(16) ); - R( d, e, a, b, c, F1, K1, M(17) ); - R( c, d, e, a, b, F1, K1, M(18) ); - R( b, c, d, e, a, F1, K1, M(19) ); - R( a, b, c, d, e, F2, K2, M(20) ); - R( e, a, b, c, d, F2, K2, M(21) ); - R( d, e, a, b, c, F2, K2, M(22) ); - R( c, d, e, a, b, F2, K2, M(23) ); - R( b, c, d, e, a, F2, K2, M(24) ); - R( a, b, c, d, e, F2, K2, M(25) ); - R( e, a, b, c, d, F2, K2, M(26) ); - R( d, e, a, b, c, F2, K2, M(27) ); - R( c, d, e, a, b, F2, K2, M(28) ); - R( b, c, d, e, a, F2, K2, M(29) ); - R( a, b, c, d, e, F2, K2, M(30) ); - R( e, a, b, c, d, F2, K2, M(31) ); - R( d, e, a, b, c, F2, K2, M(32) ); - R( c, d, e, a, b, F2, K2, M(33) ); - R( b, c, d, e, a, F2, K2, M(34) ); - R( a, b, c, d, e, F2, K2, M(35) ); - R( e, a, b, c, d, F2, K2, M(36) ); - R( d, e, a, b, c, F2, K2, M(37) ); - R( c, d, e, a, b, F2, K2, M(38) ); - R( b, c, d, e, a, F2, K2, M(39) ); - R( a, b, c, d, e, F3, K3, M(40) ); - R( e, a, b, c, d, F3, K3, M(41) ); - R( d, e, a, b, c, F3, K3, M(42) ); - R( c, d, e, a, b, F3, K3, M(43) ); - R( b, c, d, e, a, F3, K3, M(44) ); - R( a, b, c, d, e, F3, K3, M(45) ); - R( e, a, b, c, d, F3, K3, M(46) ); - R( d, e, a, b, c, F3, K3, M(47) ); - R( c, d, e, a, b, F3, K3, M(48) ); - R( b, c, d, e, a, F3, K3, M(49) ); - R( a, b, c, d, e, F3, K3, M(50) ); - R( e, a, b, c, d, F3, K3, M(51) ); - R( d, e, a, b, c, F3, K3, M(52) ); - R( c, d, e, a, b, F3, K3, M(53) ); - R( b, c, d, e, a, F3, K3, M(54) ); - R( a, b, c, d, e, F3, K3, M(55) ); - R( e, a, b, c, d, F3, K3, M(56) ); - R( d, e, a, b, c, F3, K3, M(57) ); - R( c, d, e, a, b, F3, K3, M(58) ); - R( b, c, d, e, a, F3, K3, M(59) ); - R( a, b, c, d, e, F4, K4, M(60) ); - R( e, a, b, c, d, F4, K4, M(61) ); - R( d, e, a, b, c, F4, K4, M(62) ); - R( c, d, e, a, b, F4, K4, M(63) ); - R( b, c, d, e, a, F4, K4, M(64) ); - R( a, b, c, d, e, F4, K4, M(65) ); - R( e, a, b, c, d, F4, K4, M(66) ); - R( d, e, a, b, c, F4, K4, M(67) ); - R( c, d, e, a, b, F4, K4, M(68) ); - R( b, c, d, e, a, F4, K4, M(69) ); - R( a, b, c, d, e, F4, K4, M(70) ); - R( e, a, b, c, d, F4, K4, M(71) ); - R( d, e, a, b, c, F4, K4, M(72) ); - R( c, d, e, a, b, F4, K4, M(73) ); - R( b, c, d, e, a, F4, K4, M(74) ); - R( a, b, c, d, e, F4, K4, M(75) ); - R( e, a, b, c, d, F4, K4, M(76) ); - R( d, e, a, b, c, F4, K4, M(77) ); - R( c, d, e, a, b, F4, K4, M(78) ); - R( b, c, d, e, a, F4, K4, M(79) ); + R( e, a, b, c, d, F1, K1, M( 0) ); + R( d, e, a, b, c, F1, K1, M( 1) ); + R( c, d, e, a, b, F1, K1, M( 2) ); + R( b, c, d, e, a, F1, K1, M( 3) ); + R( a, b, c, d, e, F2, K2, M( 4) ); + R( e, a, b, c, d, F2, K2, M( 5) ); + R( d, e, a, b, c, F2, K2, M( 6) ); + R( c, d, e, a, b, F2, K2, M( 7) ); + R( b, c, d, e, a, F2, K2, M( 8) ); + R( a, b, c, d, e, F2, K2, M( 9) ); + R( e, a, b, c, d, F2, K2, M(10) ); + R( d, e, a, b, c, F2, K2, M(11) ); + R( c, d, e, a, b, F2, K2, M(12) ); + R( b, c, d, e, a, F2, K2, M(13) ); + R( a, b, c, d, e, F2, K2, M(14) ); + R( e, a, b, c, d, F2, K2, M(15) ); + R( d, e, a, b, c, F2, K2, M( 0) ); + R( c, d, e, a, b, F2, K2, M( 1) ); + R( b, c, d, e, a, F2, K2, M( 2) ); + R( a, b, c, d, e, F2, K2, M( 3) ); + R( e, a, b, c, d, F2, K2, M( 4) ); + R( d, e, a, b, c, F2, K2, M( 5) ); + R( c, d, e, a, b, F2, K2, M( 6) ); + R( b, c, d, e, a, F2, K2, M( 7) ); + R( a, b, c, d, e, F3, K3, M( 8) ); + R( e, a, b, c, d, F3, K3, M( 9) ); + R( d, e, a, b, c, F3, K3, M(10) ); + R( c, d, e, a, b, F3, K3, M(11) ); + R( b, c, d, e, a, F3, K3, M(12) ); + R( a, b, c, d, e, F3, K3, M(13) ); + R( e, a, b, c, d, F3, K3, M(14) ); + R( d, e, a, b, c, F3, K3, M(15) ); + R( c, d, e, a, b, F3, K3, M( 0) ); + R( b, c, d, e, a, F3, K3, M( 1) ); + R( a, b, c, d, e, F3, K3, M( 2) ); + R( e, a, b, c, d, F3, K3, M( 3) ); + R( d, e, a, b, c, F3, K3, M( 4) ); + R( c, d, e, a, b, F3, K3, M( 5) ); + R( b, c, d, e, a, F3, K3, M( 6) ); + R( a, b, c, d, e, F3, K3, M( 7) ); + R( e, a, b, c, d, F3, K3, M( 8) ); + R( d, e, a, b, c, F3, K3, M( 9) ); + R( c, d, e, a, b, F3, K3, M(10) ); + R( b, c, d, e, a, F3, K3, M(11) ); + R( a, b, c, d, e, F4, K4, M(12) ); + R( e, a, b, c, d, F4, K4, M(13) ); + R( d, e, a, b, c, F4, K4, M(14) ); + R( c, d, e, a, b, F4, K4, M(15) ); + R( b, c, d, e, a, F4, K4, M( 0) ); + R( a, b, c, d, e, F4, K4, M( 1) ); + R( e, a, b, c, d, F4, K4, M( 2) ); + R( d, e, a, b, c, F4, K4, M( 3) ); + R( c, d, e, a, b, F4, K4, M( 4) ); + R( b, c, d, e, a, F4, K4, M( 5) ); + R( a, b, c, d, e, F4, K4, M( 6) ); + R( e, a, b, c, d, F4, K4, M( 7) ); + R( d, e, a, b, c, F4, K4, M( 8) ); + R( c, d, e, a, b, F4, K4, M( 9) ); + R( b, c, d, e, a, F4, K4, M(10) ); + R( a, b, c, d, e, F4, K4, M(11) ); + R( e, a, b, c, d, F4, K4, M(12) ); + R( d, e, a, b, c, F4, K4, M(13) ); + R( c, d, e, a, b, F4, K4, M(14) ); + R( b, c, d, e, a, F4, K4, M(15) ); a = ctx->A += a; b = ctx->B += b; -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-16 19:25 ` Giuseppe Scrivano @ 2009-08-16 20:10 ` Linus Torvalds 2009-08-16 22:15 ` Giuseppe Scrivano 2009-08-17 1:53 ` Pádraig Brady 1 sibling, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2009-08-16 20:10 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List On Sun, 16 Aug 2009, Giuseppe Scrivano wrote: > > My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R) > Pentium(R) D CPU 3.20GHz. Netburst is very sensitive to random spill effects, and you can basically tune things by just code shuffling that just has random effects on the generated asm code. > I also spent some time trying to improve the gnulib SHA1 implementation > and it seems a lookup table can improve things a bit. I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues. I'd also suggest you try gcc-4.4, since that apparently fixes some of the oddest spill issues. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-16 20:10 ` Linus Torvalds @ 2009-08-16 22:15 ` Giuseppe Scrivano 2009-08-16 22:47 ` Linus Torvalds 0 siblings, 1 reply; 28+ messages in thread From: Giuseppe Scrivano @ 2009-08-16 22:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Pádraig Brady, Bug-coreutils, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > I pretty much can guarantee you that it improves things only because it > makes gcc generate crap code, which then hides some of the P4 issues. > > I'd also suggest you try gcc-4.4, since that apparently fixes some of the > oddest spill issues. Thanks for the hint. I tried gcc-4.4 and it produces slower code than 4.3 on the gnulib SHA1 implementation and my patch makes it even more! I noticed that on my machine your implementation is ~30-40% faster using SHA_ROT for rol/ror instructions than inline assembly, at least with the test-case Pádraig wrote. Am I the only one reporting it? Cheers, Giuseppe ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-16 22:15 ` Giuseppe Scrivano @ 2009-08-16 22:47 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2009-08-16 22:47 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List On Mon, 17 Aug 2009, Giuseppe Scrivano wrote: > > Thanks for the hint. I tried gcc-4.4 and it produces slower code than > 4.3 on the gnulib SHA1 implementation and my patch makes it even more! Check out the asm, see if you can see why. One of the most common problems with P4's is literally that you end up loading from the same stack slot that you just stored to (gcc can do some really crazy spills), and that causes a store buffer hazard replay. My personal opinion is that Netburst is useless for trying to optimize C code for. It's just too random. > I noticed that on my machine your implementation is ~30-40% faster using > SHA_ROT for rol/ror instructions than inline assembly, at least with the > test-case Pádraig wrote. Am I the only one reporting it? I bet it's the same thing. Small perturbations of the source causing small changes to register allocation and thus spilling, and then Netburst goes crazy one way or another. It's interestign trying to fix it, and very frustrating. My workstation is a Nehalem (but Core 2 will have pretty much the same behavior), and it doesn't have the crazy netburst behavior. Shorter and simpler code generally performs better (which is _not_ true on Netburst). On my machine, for example, forcing gcc to do those rotates on registers is the difference between ~381MB/s and 415MB/s. And that's mainly because it makes gcc keep A-E in registers, rather than trying to cache the array[] references. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-16 19:25 ` Giuseppe Scrivano 2009-08-16 20:10 ` Linus Torvalds @ 2009-08-17 1:53 ` Pádraig Brady 2009-08-17 10:51 ` Giuseppe Scrivano 1 sibling, 1 reply; 28+ messages in thread From: Pádraig Brady @ 2009-08-17 1:53 UTC (permalink / raw) To: Giuseppe Scrivano; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List Giuseppe Scrivano wrote: > Hi Pádraig, > > I tried to reproduce your results but I wasn't able to do it. The > biggest difference on a 300MB file I noticed was approximately 15% using > on both implementations -O2, and 5% using -O3. > My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R) > Pentium(R) D CPU 3.20GHz. > > I also spent some time trying to improve the gnulib SHA1 implementation > and it seems a lookup table can improve things a bit. > > Can you please try the patch that I have attached and tell me which > performance difference (if any) you get? Thanks for looking at this Giuseppe and sorry for not mentioning my GCC and CPU. Note the binaries below is compiled with $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) for consistency, which on my F11 machines is: -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586 -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1 Testing on 2 machines I have here: $ rpm -q gcc gcc-4.4.1-2.fc11.i586 $ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' ' model name : Intel(R) Pentium(R) M processor 1.70GHz $ truncate -s300MB sha1.test $ time sha1sum sha1.test real 0m3.540s $ time linus-sha1 sha1.test real 0m2.319s (-34%) $ time giuseppe-sha1sum sha1.test real 0m3.513s (-.8%) $ rpm -q gcc gcc-4.4.1-2.fc11.i586 $ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' ' model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz $ truncate -s300MB sha1.test $ time sha1sum sha1.test real 0m1.857s $ time linus-sha1 sha1.test real 0m1.102s (-40%) $ time giuseppe-sha1sum sha1.test real 0m1.932s (+ 4%) cheers, Pádraig. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 1:53 ` Pádraig Brady @ 2009-08-17 10:51 ` Giuseppe Scrivano 2009-08-17 15:44 ` Steven Noonan 0 siblings, 1 reply; 28+ messages in thread From: Giuseppe Scrivano @ 2009-08-17 10:51 UTC (permalink / raw) To: Pádraig Brady; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List Pádraig Brady <P@draigBrady.com> writes: > -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions > -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586 > -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1 thanks. I did again all tests on my machine using these same options. I repeated each test 6 times and I took the median without consider the first result. Except the first run that it is not considered, I didn't report a big variance on results of the same test. gcc 4.3.3 gnulib sha1: real 0m2.543s gnulib sha1 lookup: real 0m1.906s (-25%) linus's sha1: real 0m2.468s (-3%) linus's sha1 no asm: real 0m2.289s (-9%) gcc 4.4.1 gnulib sha1: real 0m3.386s gnulib sha1 lookup: real 0m3.110s (-8%) linus's sha1: real 0m1.701s (-49%) linus's sha1 no asm: real 0m1.284s (-62%) I don't see such big differences in asm generated by gcc 4.4.1 and gcc 4.3.3 to explain this performance difference, what I noticed immediately is that in the gcc-4.4 generated asm there are more "lea" instructions (+30%), but I doubt this is the reason of these poor results. Anyway, I haven't yet looked much in details. Cheers, Giuseppe ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 10:51 ` Giuseppe Scrivano @ 2009-08-17 15:44 ` Steven Noonan 2009-08-17 16:22 ` Linus Torvalds 2009-08-17 17:32 ` Giuseppe Scrivano 0 siblings, 2 replies; 28+ messages in thread From: Steven Noonan @ 2009-08-17 15:44 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds, Git Mailing List On Mon, Aug 17, 2009 at 3:51 AM, Giuseppe Scrivano<gscrivano@gnu.org> wrote: > Pádraig Brady <P@draigBrady.com> writes: > >> -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions >> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586 >> -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1 > > thanks. I did again all tests on my machine using these same options. > I repeated each test 6 times and I took the median without consider the > first result. Except the first run that it is not considered, I didn't > report a big variance on results of the same test. > > > gcc 4.3.3 > > gnulib sha1: real 0m2.543s > gnulib sha1 lookup: real 0m1.906s (-25%) > linus's sha1: real 0m2.468s (-3%) > linus's sha1 no asm: real 0m2.289s (-9%) > > > gcc 4.4.1 > > gnulib sha1: real 0m3.386s > gnulib sha1 lookup: real 0m3.110s (-8%) > linus's sha1: real 0m1.701s (-49%) > linus's sha1 no asm: real 0m1.284s (-62%) > > > I don't see such big differences in asm generated by gcc 4.4.1 and gcc > 4.3.3 to explain this performance difference, what I noticed immediately > is that in the gcc-4.4 generated asm there are more "lea" instructions > (+30%), but I doubt this is the reason of these poor results. Anyway, I > haven't yet looked much in details. > > Cheers, > Giuseppe Interesting. I compared Linus' implementation to the public domain one by Steve Reid[1], which is used in OpenLDAP and a few other projects. Anyone with some experience testing these kinds of things in a statistically sound manner want to try it out? In my tests, I got this: (average of 5 runs) Linus' sha1: 283MB/s Steve Reid's sha1: 305MB/s - Steven [1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 15:44 ` Steven Noonan @ 2009-08-17 16:22 ` Linus Torvalds 2009-08-17 21:43 ` Steven Noonan 2009-08-17 17:32 ` Giuseppe Scrivano 1 sibling, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2009-08-17 16:22 UTC (permalink / raw) To: Steven Noonan Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils, Git Mailing List On Mon, 17 Aug 2009, Steven Noonan wrote: > > Interesting. I compared Linus' implementation to the public domain one > by Steve Reid[1] You _really_ need to talk about what kind of environment you have. There are three major issues: - Netburst vs non-netburst - 32-bit vs 64-bit - compiler version Steve Reid's code looks great, but the way it is coded, gcc makes a mess of it, which is exactly what my SHA1 tries to avoid. [ In contrast, gcc does very well on just about _any_ straightforward unrolled SHA1 C code if the target architecture is something like PPC or ia64 that has enough registers to keep it all in registers. I haven't really tested other compilers - a less aggressive compiler would actually do _better_ on SHA1, because the problem with gcc is that it turns the whole temporary 16-entry word array into register accesses, and tries to do register allocation on that _array_. That is wonderful for the above-mentioned PPC and IA64, but it makes gcc create totally crazy code when there aren't enough registers, and then gcc starts spilling randomly (ie it starts spilling a-e etc). This is why the compiler and version matters so much. ] > (average of 5 runs) > Linus' sha1: 283MB/s > Steve Reid's sha1: 305MB/s So I get very different results: # TIME[s] SPEED[MB/s] Reid 2.742 222.6 linus 1.464 417 this is Intel Nehalem, but compiled for 32-bit mode (which is the more challenging one because x86-32 only has 7 general-purpose registers), and with gcc-4.4.0. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 16:22 ` Linus Torvalds @ 2009-08-17 21:43 ` Steven Noonan 0 siblings, 0 replies; 28+ messages in thread From: Steven Noonan @ 2009-08-17 21:43 UTC (permalink / raw) To: Linus Torvalds Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils, Git Mailing List On Mon, Aug 17, 2009 at 9:22 AM, Linus Torvalds<torvalds@linux-foundation.org> wrote: > > > On Mon, 17 Aug 2009, Steven Noonan wrote: >> >> Interesting. I compared Linus' implementation to the public domain one >> by Steve Reid[1] > > You _really_ need to talk about what kind of environment you have. > > There are three major issues: > - Netburst vs non-netburst > - 32-bit vs 64-bit > - compiler version Right. I'm running a Core 2 "Merom" 2.33GHz. The code was compiled for x86_64 with GCC 4.2.1. I didn't _expect_ it to compile for x86_64, but apparently the version of GCC that ships with Xcode 3.2 defaults to compiling 64-bit code on machines that are capable of running it. > > Steve Reid's code looks great, but the way it is coded, gcc makes a mess > of it, which is exactly what my SHA1 tries to avoid. > > [ In contrast, gcc does very well on just about _any_ straightforward > unrolled SHA1 C code if the target architecture is something like PPC or > ia64 that has enough registers to keep it all in registers. > > I haven't really tested other compilers - a less aggressive compiler > would actually do _better_ on SHA1, because the problem with gcc is that > it turns the whole temporary 16-entry word array into register accesses, > and tries to do register allocation on that _array_. > > That is wonderful for the above-mentioned PPC and IA64, but it makes gcc > create totally crazy code when there aren't enough registers, and then > gcc starts spilling randomly (ie it starts spilling a-e etc). This is > why the compiler and version matters so much. ] > >> (average of 5 runs) >> Linus' sha1: 283MB/s >> Steve Reid's sha1: 305MB/s > > So I get very different results: > > # TIME[s] SPEED[MB/s] > Reid 2.742 222.6 > linus 1.464 417 Added -m32: Steve Reid: 156MB/s Linus: 209MB/s So on x86, your code really kicks butt. > this is Intel Nehalem, but compiled for 32-bit mode (which is the more > challenging one because x86-32 only has 7 general-purpose registers), and > with gcc-4.4.0. > > Linus > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 15:44 ` Steven Noonan 2009-08-17 16:22 ` Linus Torvalds @ 2009-08-17 17:32 ` Giuseppe Scrivano 1 sibling, 0 replies; 28+ messages in thread From: Giuseppe Scrivano @ 2009-08-17 17:32 UTC (permalink / raw) To: Steven Noonan Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds, Git Mailing List Hi, These are the results I reported (median of 5 plus an additional not considered first run) on the Steve Reid's SHA1 implementation using the same flags to the compiler that I used for previous tests. GCC 4.3.3: real 0m2.627s GCC 4.4.1: real 0m3.742s In both cases it showed to be slower than other implementations I have already tried. Additional note: as for gnulib SHA1, GCC 4.4.1 produced slower code than GCC 4.3.3. Cheers, Giuseppe Steven Noonan <steven@uplinklabs.net> writes: > > Interesting. I compared Linus' implementation to the public domain one > by Steve Reid[1], which is used in OpenLDAP and a few other projects. > Anyone with some experience testing these kinds of things in a > statistically sound manner want to try it out? In my tests, I got > this: > > (average of 5 runs) > Linus' sha1: 283MB/s > Steve Reid's sha1: 305MB/s > > - Steven > > [1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! @ 2009-08-17 7:23 George Spelvin 2009-08-17 14:20 ` Nicolas Pitre 2009-08-17 17:06 ` Nicolas Pitre 0 siblings, 2 replies; 28+ messages in thread From: George Spelvin @ 2009-08-17 7:23 UTC (permalink / raw) To: bdonlan, johnflux, P; +Cc: art.08.09, git, linux, nico, torvalds If it helps anyone resolve license issues, here's a from-FIPS-180-2 implementation that's placed in the public domain. That should be compatible with any license. It uses Linus's and Artur's performance ideas, and some of Linus' macro ideas (in the rotate implementation), but tries to be textually different. Is there anything recognizable that anyone cares to clam copyright to? It's not quite 100% finished, as I haven't benchmarked it against Linus's code yet, but it's functionally correct. It's also clean with -W -Wall -Wextra. TODO: Check if an initial copy to w[] is faster on i386 (less register pressure). /* * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2. * * This implementation is in the public domain. Copyright abandoned. * You may do anything you like with it, including evil things. * * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1" * from the git mailing list (August, 2009). Additional optimization * ideas cribbed from * - Artur Skawina (x86, particularly P4, and much benchmarking) * - Nicilas Pitre (ARM) */ /* Cut here for sha1.h */ #include <stddef.h> /* For size_t */ #include <stdint.h> /* For uint32_t, uint64_t */ typedef struct SHA_context { uint64_t len; /* May be shrunk to uint32_t */ uint32_t iv[5]; unsigned char buf[64]; /* Must be 32-bit aligned */ } SHA_CTX; void SHA1_Init(struct SHA_context *c); void SHA1_Update(struct SHA_context *c, void const *p, size_t n); void SHA1_Final(unsigned char hash[20], struct SHA_context *c); /* End of sha1.h */ #include <string.h> /* For memcpy() */ #include <arpa/inet.h> /* For ntohl() */ static void sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks); /* Machine specific hacks */ #if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \ defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \ defined(__s390__) || defined(__s390x__) /* Unaligned access is okay */ static inline uint32_t get_be32(unsigned char const *p) { return ntohl(*(uint32_t const *)p); } static inline void put_be32(unsigned char const *p, uint32_t v) { *(uint32_t *)p = htonl(v); } #else /* Unaligned access is not okay; do conversion as byte fetches */ static inline uint32_t get_be32(unsigned char const *p) { return p[0] << 24 || p[1] << 16 | p[8] << 8 | p[3]; } static inline void put_be32(unsigned char const *p, uint32_t v) { p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; } #endif void SHA1_Init(struct SHA_context *c) { /* This is a prefix of the SHA_context structure */ static struct { uint64_t len; uint32_t iv[5]; } const iv = { 0, { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 } }; memcpy(c, &iv, sizeof iv); } void SHA1_Update(struct SHA_context *c, void const *p, size_t n) { size_t pos = c->len & 63; /* Offset into current input block */ c->len += n; /* Initial partial block (if any) */ if (pos) { size_t space = 63 - pos; if (n < space) goto end; memcpy(c->buf + pos, p, space); sha1_core(c->iv, c->buf, 1); n -= space; p = (char const *)p + space; } /* The large middle piece */ if (n >> 6) { sha1_core(c->iv, p, n >> 6); p = (char const *)p + (n & -(size_t)64); n &= 63; } pos = 0; end: /* Final partial block (may be zero size) */ memcpy(c->buf + pos, p, n); } void SHA1_Final(unsigned char hash[20], struct SHA_context *c) { size_t pos = c->len & 63; unsigned i; /* Append a single 1 bit */ c->buf[pos++] = 0x80; /* Append 0 bits until 64 bits remain in a block */ if (pos > 56) { memset(c->buf + pos, 0, 64 - pos); sha1_core(c->iv, c->buf, 1); pos = 0; } memset(c->buf + pos, 0, 56 - pos); /* Append total input length in bits */ ((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29)); ((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3); /* Final hash round */ sha1_core(c->iv, c->buf, 1); /* Copy hash result out */ for (i = 0; i < 5; i++) put_be32(hash + 4*i, c->iv[i]); } /* * Helper macros for sha1_core function. To avoid clutter, these macros * are NOT fully parenthesized if it doesn't matter to the actual use. */ #if __GNUC__ && (defined(__i386__) || defined(__x86_64__)) /* * GCC by itself only generates left rotates. Use right rotates if * possible to be kinder to dinky implementations with iterative rotate * instructions. */ #define ROT(op, x, k) \ ({ uint32_t y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; }) #define ROL(x,k) ROT("roll", x, k) #define ROR(x,k) ROT("rorl", x, k) #else /* Generic C equivalent */ #define ROT(x,l,r) ((x) << (l) | (x) >> (r)) #define ROL(x,k) ROT(x,k,32-(k)) #define ROR(x,k) ROT(x,32-(k),k) #endif /* * An important temporary array in SHA-1 is the working array W[], * which holds a scheduled key word per round. Only the last 16 words * are relevant, so only use 16 words... */ #define W(i) w[(i) & 15] /* * If you have a small register set, it helps GCC to force stores to * the w[] array to memory. Given 22 or more registers (e.g. PowerPC), * GCC can map the entire w[] array to registers and this becomes * counterproductive. * * The optimal kludge seems to differ between x86 and ARM. On the latter, * forcing a full memory barrier is required to stop GCC from splitting * live ranges for each round and generating a separate stack slot for * each. (Which produces a huge stack frame and kills performance.) */ #if defined(__i386__) || defined(__x86_64__) #define STORE(i, x) *(volatile uint32_t *)&W(i) = x #elif __GNUC__ && defined(__arm__) #define STORE(i, x) W(i) = x; __asm__("":::"memory") #else #define STORE(i, x) W(i) = x #endif /* The three round functions. F2 is also used as F4 */ #define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */ #define F2(b,c,d) (d ^ c ^ b) /* Even parity */ #define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */ /* The four round constants */ #define K1 0x5a827999 /* 2^30 * sqrt(2) */ #define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */ #define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */ #define K4 0xca62c1d6 /* 2^30 * sqrt(10) */ /* Rounds 0..15 fetch a word from the input */ #define FETCH(t,i) t = get_be32(p + 4*(i)); STORE(i,t) /* Rounds 16..79 mix previous words to get a new one */ #define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1) /* Rounds 16..76 have to store back the result */ #define CALC(t,i) MIX(t,i); STORE(i,t) /* The basic SHA-1 round */ #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \ src(t,i); \ e += t + f(b,c,d) + k + ROL(a,5); \ b = ROR(b,2) /* An aligned group of 5 rounds */ #define SHA_ROUND5(f,k,src,t,i) \ SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \ SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \ SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \ SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \ SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4) /* * The core SHA-1 transform. * * iv[5] = Current SHA-1 hash state. * p = Pointer to source data. Not necessarily aligned. * nblocks = Number of 64-byte blocks at p. Guaranteed non-zero. */ static void sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks) { uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0]; uint32_t w[16]; do { uint32_t t; SHA_ROUND5(F1, K1, FETCH, t, 0); SHA_ROUND5(F1, K1, FETCH, t, 5); SHA_ROUND5(F1, K1, FETCH, t, 10); SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15); SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16); SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17); SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18); SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19); SHA_ROUND5(F2, K2, CALC, t, 20); SHA_ROUND5(F2, K2, CALC, t, 25); SHA_ROUND5(F2, K2, CALC, t, 30); SHA_ROUND5(F2, K2, CALC, t, 35); SHA_ROUND5(F3, K3, CALC, t, 40); SHA_ROUND5(F3, K3, CALC, t, 45); SHA_ROUND5(F3, K3, CALC, t, 50); SHA_ROUND5(F3, K3, CALC, t, 55); SHA_ROUND5(F2, K4, CALC, t, 60); SHA_ROUND5(F2, K4, CALC, t, 65); SHA_ROUND5(F2, K4, CALC, t, 70); SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75); SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76); /* Last 3 rounds don't need to store mixed value */ SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77); SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78); SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79); iv[4] = e += iv[4]; iv[3] = d += iv[3]; iv[2] = c += iv[2]; iv[1] = b += iv[1]; iv[0] = a += iv[0]; } while (--nblocks); } /* Compile with -DUNITTEST for self-tests */ #if UNITTEST #include <stdio.h> /* Known answer test */ static void katest(char const *text, size_t len, unsigned char const hash[20]) { SHA_CTX c; unsigned char hash2[20]; int i; SHA1_Init(&c); SHA1_Update(&c, text, len); SHA1_Final(hash2, &c); for (i = 0; i < 20; i++) if (hash[i] != hash2[i]) goto mismatch; printf("%u-byte known answer test PASSED\n", len); return; mismatch: printf("%u-byte known answer test FAILED:\n", len); printf("Input: \"%.*s\"\n", len, text); printf("Computed:"); for (i = 0; i < 20; i++) printf(" %02x", hash2[i]); printf("\nExpected:"); for (i = 0; i < 20; i++) printf(" %02x", hash[i]); putchar('\n'); } int main(void) { /* FIPS PUB 180.1 example A.1 */ static char const text1[3] = "abc"; static unsigned char const hash1[20] = { 0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e, 0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d }; /* FIPS PUB 180.1 example A.2 */ static char const text2[56] = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"; static unsigned char const hash2[20] = { 0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae, 0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 }; katest(text1, sizeof text1, hash1); katest(text2, sizeof text2, hash2); return 0; } #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 7:23 George Spelvin @ 2009-08-17 14:20 ` Nicolas Pitre 2009-08-17 17:06 ` Nicolas Pitre 1 sibling, 0 replies; 28+ messages in thread From: Nicolas Pitre @ 2009-08-17 14:20 UTC (permalink / raw) To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, Linus Torvalds On Mon, 17 Aug 2009, George Spelvin wrote: > If it helps anyone resolve license issues, here's a from-FIPS-180-2 > implementation that's placed in the public domain. That should be > compatible with any license. > > It uses Linus's and Artur's performance ideas, and some of Linus' macro > ideas (in the rotate implementation), but tries to be textually different. > Is there anything recognizable that anyone cares to clam copyright to? > > It's not quite 100% finished, as I haven't benchmarked it against Linus's > code yet, but it's functionally correct. > > It's also clean with -W -Wall -Wextra. > > TODO: Check if an initial copy to w[] is faster on i386 (less register > pressure). > > /* > * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2. > * > * This implementation is in the public domain. Copyright abandoned. > * You may do anything you like with it, including evil things. > * > * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1" > * from the git mailing list (August, 2009). Additional optimization > * ideas cribbed from > * - Artur Skawina (x86, particularly P4, and much benchmarking) > * - Nicilas Pitre (ARM) Please be careful to spell my name correctly. Nicolas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 7:23 George Spelvin 2009-08-17 14:20 ` Nicolas Pitre @ 2009-08-17 17:06 ` Nicolas Pitre 2009-08-17 17:20 ` Paolo Bonzini 2009-08-17 18:54 ` George Spelvin 1 sibling, 2 replies; 28+ messages in thread From: Nicolas Pitre @ 2009-08-17 17:06 UTC (permalink / raw) To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, torvalds On Mon, 17 Aug 2009, George Spelvin wrote: > If it helps anyone resolve license issues, here's a from-FIPS-180-2 > implementation that's placed in the public domain. That should be > compatible with any license. > > It uses Linus's and Artur's performance ideas, and some of Linus' macro > ideas (in the rotate implementation), but tries to be textually different. > Is there anything recognizable that anyone cares to clam copyright to? I don't think this trick of making source code textually different from another work while still intimately mimicking the same structure entitles you to any copyright (or non copyright) claims over that other work. I certainly wouldn't bet any dime for this standing up in court. Otherwise anyone could grab any copyrighted source code and perform a bunch of search-and-replace ops on it, and maybe some code reordering for good measure, to be able to claim own copyright on it. It is probably much safer to simply ask the people involved to agree with your relicensing. And so far I don't see anyone with a stake in this fiercely wanting to stick to a particular license. > It's not quite 100% finished, as I haven't benchmarked it against Linus's > code yet, but it's functionally correct. > > It's also clean with -W -Wall -Wextra. Not if you try with the unaligned put_be32() as the destination pointer is marked const. As to the actual result on ARM... Well, the assembly _looks_ much worse than Linus' version. It uses a stack frame of 152 bytes instead of 64 bytes. The resulting binary is also 6868 bytes large compared to 6180 bytes. Surprisingly, the performance is not that bad (the reason for the underlined "looks" above) albeit still a bit worse, like 5% slower. I was expecting much worse than that. One possible reason for the bad assembly is probably due to the fact that gcc is not smart enough to propagate constant address offsets across different pointer types. For example, my first version of get_be32() was a macro that did this: #define SHA_SRC(t) \ ({ unsigned char *__d = (unsigned char *)&data[t]; \ (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) With such a construct, gcc would always allocate a register to hold __d and then dereference that with an offset from 0 to 3. Whereas: #define SHA_SRC(t) \ ({ unsigned char *__d = (unsigned char *)data; \ (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) does produce optimal assembly as only the register holding the data pointer is dereferenced with the absolute byte offset. I suspect your usage of inline functions has the same effect as the first SHA_SRC definition above. Also, wrt skipping the last 3 write back to the 16 word array... For all the (limited) attempts I've made so far to do that, it always ended up making things worse. I've yet to investigate why though. Nicolas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 17:06 ` Nicolas Pitre @ 2009-08-17 17:20 ` Paolo Bonzini 2009-08-17 18:54 ` George Spelvin 1 sibling, 0 replies; 28+ messages in thread From: Paolo Bonzini @ 2009-08-17 17:20 UTC (permalink / raw) To: Nicolas Pitre; +Cc: George Spelvin, bdonlan, johnflux, P, art.08.09, git > my first version of > get_be32() was a macro that did this: > > #define SHA_SRC(t) \ > ({ unsigned char *__d = (unsigned char *)&data[t]; \ > (__d[0]<< 24) | (__d[1]<< 16) | (__d[2]<< 8) | (__d[3]<< 0); }) > > With such a construct, gcc would always allocate a register to hold __d > and then dereference that with an offset from 0 to 3. Whereas: > > #define SHA_SRC(t) \ > ({ unsigned char *__d = (unsigned char *)data; \ > (__d[(t)*4 + 0]<< 24) | (__d[(t)*4 + 1]<< 16) | \ > (__d[(t)*4 + 2]<< 8) | (__d[(t)*4 + 3]<< 0); }) > > does produce optimal assembly as only the register holding the data > pointer is dereferenced with the absolute byte offset. I suspect your > usage of inline functions has the same effect as the first SHA_SRC > definition above. Yes, that's what happens. Paolo ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 17:06 ` Nicolas Pitre 2009-08-17 17:20 ` Paolo Bonzini @ 2009-08-17 18:54 ` George Spelvin 2009-08-17 19:34 ` Nicolas Pitre 1 sibling, 1 reply; 28+ messages in thread From: George Spelvin @ 2009-08-17 18:54 UTC (permalink / raw) To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds > I don't think this trick of making source code textually different from > another work while still intimately mimicking the same structure entitles > you to any copyright (or non copyright) claims over that other work. I > certainly wouldn't bet any dime for this standing up in court. <div type="legal digression"> Actually, I would. I did a lot more than text search and replace; I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright) and then merged in the *ideas* from the mailing list. (And from elsewhere; the idea of a five-round macro is from Brian Gladman.) Remember, to the extent that something is *functional*, it is not copyrightable; copyright only covers the non-functional expressive bits. The vast majority of that code is simply required by the standard, or the desired calling interface. For a large portion of the rest, remember that standard programming conventions (e.g. brace style, macro names IN CAPS, etc.) that's also non-copyrightable "scene a faire" material. It's well established that paraphrasing a recipe avoids copyright; the proportions and treatment of the ingredients is not copyrightable. For more details, see the extensive coverage of the NEC v. Intel decision (1989) regarding the firmware for NEC's 8086-clone V20 microprocessor. It was found non-infringing despite non-clean-room implementation and substantial similarities. </div> > It is probably much safer to simply ask the people involved to agree > with your relicensing. And so far I don't see anyone with a stake in > this fiercely wanting to stick to a particular license. As for politeness, that's exactly why I did post it and solicit objections. The purpose of the rewrite is to avoid having to make pessimistic assumptions about people who don't respond. I suppose I should have made that request clearer: Is there anyone who claims copyright on anything here? Or would just like credit? If so, are you willing to donate it to the public domain? I like the GPL, but some code is generally applicable enough, and unlikely to be improved on by others enough, that I'd rather just let everyone use the same version. > It's not quite 100% finished, as I haven't benchmarked it against Linus's > code yet, but it's functionally correct. > > It's also clean with -W -Wall -Wextra. > Not if you try with the unaligned put_be32() as the destination pointer > is marked const. Oops. And the "||" in get_be32() doesn't help either. There was another major bug I found, too. And I'm still trying to match "linusv" performance. > As to the actual result on ARM... Well, the assembly _looks_ much worse > than Linus' version. It uses a stack frame of 152 bytes instead of 64 > bytes. The resulting binary is also 6868 bytes large compared to 6180 > bytes. Surprisingly, the performance is not that bad (the reason for > the underlined "looks" above) albeit still a bit worse, like 5% slower. > I was expecting much worse than that. Thanks, I'll work on it. > Also, wrt skipping the last 3 write back to the 16 word array... For > all the (limited) attempts I've made so far to do that, it always ended > up making things worse. I've yet to investigate why though. Thanks. That's certainly surprising! If anyone cares, my current code (adapted to fit sha1bench) is appended. /* * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2. * * This implementation is in the public domain. Copyright abandoned. * You may do anything you like with it, including evil things. * * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1" * from the git mailing list (August, 2009). Additional optimization * ideas cribbed from * - Artur Skawina (x86, particularly P4, and much benchmarking) * - Nicolas Pitre (ARM) * * Contributors to this code are requested to donate their changes to * the public domain to avoid version skew. If you don't want to, be * sure to change the copyright waiver above. */ #include <string.h> /* For memcpy() */ #include <arpa/inet.h> /* For ntohl() */ #include "pdsha1.h" #define SHA_CTX pd_SHA_CTX #define SHA_context pd_SHA_context #define SHA1_Init pd_SHA1_Init #define SHA1_Update pd_SHA1_Update #define SHA1_Final pd_SHA1_Final static void sha1_core(uint32_t iv[5], unsigned char const *p); void SHA1_Init(struct SHA_context *c) { /* This is a prefix of the SHA_context structure */ static struct { uint64_t len; uint32_t iv[5]; } const iv = { 0, { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 } }; memcpy(c, &iv, sizeof iv); } void SHA1_Update(struct SHA_context *c, void const *p, size_t n) { size_t pos = c->len & 63; /* Offset into current input block */ size_t nblocks; c->len += n; /* Initial partial block (if any) */ if (pos) { size_t space = 63 - pos; if (n < space) goto end; memcpy(c->buf + pos, p, space); sha1_core(c->iv, c->buf); n -= space; p = (char const *)p + space; } /* The large middle piece */ nblocks = n >> 6; while (nblocks--) { sha1_core(c->iv, p); p += 64; } n &= 63; pos = 0; end: /* Final partial block (may be zero size) */ memcpy(c->buf + pos, p, n); } /* Machine specific hacks */ #if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \ defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \ defined(__s390__) || defined(__s390x__) /* Unaligned access is okay */ #define GET_BE32(p) ntohl(*(uint32_t const *)(p)) #define PUT_BE32(p,v) (*(uint32_t *)(p) = htonl(v)) #else /* * Unaligned access is not okay; do conversion as byte fetches * These macros are IN_CAPS, as they are not side-effect safe * The argument p must be an "unsigned char *" or "uint8_t *" */ #define GET_BE32(p) \ ((uint32_t)(p)[0]<<24 | (p)[1]<<16 | (p)[2]<<8 | (p)[3]) #define PUT_BE32(p,v) \ ( (p)[0] = (v)>>24, (p)[1] = (v)>>16, (p)[2] = (v)>>8, (p)[3] = (v) ) #endif void SHA1_Final(unsigned char hash[20], struct SHA_context *c) { size_t pos = c->len & 63; unsigned i; /* Append a single 1 bit */ c->buf[pos++] = 0x80; /* Append 0 bits until 64 bits remain in a block */ if (pos > 56) { memset(c->buf + pos, 0, 64 - pos); sha1_core(c->iv, c->buf); pos = 0; } memset(c->buf + pos, 0, 56 - pos); /* Append total input length in bits */ ((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29)); ((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3); /* Final hash round */ sha1_core(c->iv, c->buf); /* Copy hash result out */ for (i = 0; i < 5; i++) PUT_BE32(hash + 4*i, c->iv[i]); } /* * Helper macros for sha1_core function. To avoid clutter, these macros * are NOT fully parenthesized if it doesn't matter to the actual use. */ #if __GNUC__ && (defined(__i386__) || defined(__x86_64__)) /* * GCC by itself only generates left rotates. Use right rotates if * possible to be kinder to dinky implementations with iterative rotate * instructions. */ #define ROT(op, x, k) \ ({ uint32_t y; __asm__(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; }) #define ROL(x,k) ROT("roll", x, k) #define ROR(x,k) ROT("rorl", x, k) #else /* Generic C equivalent */ #define ROT(x,l,r) ((x) << (l) | (x) >> (r)) #define ROL(x,k) ROT(x,k,32-(k)) #define ROR(x,k) ROT(x,32-(k),k) #endif /* * An important temporary array in SHA-1 is the working array W[], * which holds a scheduled key word per round. Only the last 16 words * are relevant, so only use 16 words... */ #define W(i) w[(i) & 15] /* * If you have a small register set, it helps GCC to force stores to * the w[] array to memory. Given 22 or more registers (e.g. PowerPC), * GCC can map the entire w[] array to registers and this becomes * counterproductive. * * The optimal kludge seems to differ between x86 and ARM. On the latter, * forcing a full memory barrier is required to stop GCC from splitting * live ranges for each round and generating a separate stack slot for * each. (Which produces a huge stack frame and kills performance.) */ #if __GNUC__ && defined(__i386__) || defined(__x86_64__) #define STORE(i, x) W(i) = x; __asm__ volatile("" : "+m" (W(x))) #elif defined(__i386__) || defined(__x86_64__) #define STORE(i, x) *(volatile uint32_t *)&W(i) = x #elif __GNUC__ && defined(__arm__) #define STORE(i, x) W(i) = x; __asm__("":::"memory") #else #define STORE(i, x) W(i) = x #endif /* The three round functions. F2 is also used as F4 */ #define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */ #define F2(b,c,d) (d ^ c ^ b) /* Even parity */ #define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */ /* The four round constants */ #define K1 0x5a827999 /* 2^30 * sqrt(2) */ #define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */ #define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */ #define K4 0xca62c1d6 /* 2^30 * sqrt(10) */ /* Rounds 0..15 fetch a word from the input */ #define FETCH(t,i) t = GET_BE32(p + 4*(i)); STORE(i,t) //#define FETCH(t,i) t = W(i) /* Rounds 16..79 mix previous words to get a new one */ #define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1) /* Rounds 16..76 have to store back the result */ #define CALC(t,i) MIX(t,i); STORE(i,t) /* The basic SHA-1 round */ #if 0 #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \ src(t,i); \ e += t + f(b,c,d) + k + ROL(a,5); \ b = ROR(b,2) #elif 0 #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \ uint32_t t; \ src(t,i); \ e += f(b,c,d); \ e += t + k; \ b = ROR(b,2); \ e += ROL(a,5); } while (0) #else #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \ uint32_t t = ROL(a,5); \ e += f(b,c,d); \ e += k + t; \ src(t,i); \ b = ROR(b,2); \ e += t; } while(0) #endif /* An aligned group of 5 rounds */ #define SHA_ROUND5(f,k,src,t,i) \ SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \ SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \ SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \ SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \ SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4) /* * The core SHA-1 transform. * * iv[5] = Current SHA-1 hash state. * p = Pointer to 64 bytes of source data. Not necessarily aligned. */ static void sha1_core(uint32_t iv[5], unsigned char const *p) { //uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0]; uint32_t a = iv[0], b = iv[1], c = iv[2], d = iv[3], e = iv[4]; uint32_t w[16]; #if 0 { int i; for (i = 0; i < 16; i++) { STORE(i, get_be32(p + 4*i)); } } #endif SHA_ROUND5(F1, K1, FETCH, t, 0); SHA_ROUND5(F1, K1, FETCH, t, 5); SHA_ROUND5(F1, K1, FETCH, t, 10); SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15); SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16); SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17); SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18); SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19); SHA_ROUND5(F2, K2, CALC, t, 20); SHA_ROUND5(F2, K2, CALC, t, 25); SHA_ROUND5(F2, K2, CALC, t, 30); SHA_ROUND5(F2, K2, CALC, t, 35); SHA_ROUND5(F3, K3, CALC, t, 40); SHA_ROUND5(F3, K3, CALC, t, 45); SHA_ROUND5(F3, K3, CALC, t, 50); SHA_ROUND5(F3, K3, CALC, t, 55); SHA_ROUND5(F2, K4, CALC, t, 60); SHA_ROUND5(F2, K4, CALC, t, 65); SHA_ROUND5(F2, K4, CALC, t, 70); SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75); SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76); /* Last 3 rounds don't need to store mixed value */ SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77); SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78); SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79); iv[0] += a; iv[1] += b; iv[2] += c; iv[3] += d; iv[4] += e; } /* Compile with -DUNITTEST for self-tests */ #if UNITTEST #include <stdio.h> #include <stdlib.h> /* Known answer test */ static void katest(char const *text, size_t len, unsigned char const hash[20]) { SHA_CTX c; unsigned char hash2[20]; int i; SHA1_Init(&c); SHA1_Update(&c, text, len); SHA1_Final(hash2, &c); for (i = 0; i < 20; i++) if (hash[i] != hash2[i]) goto mismatch; printf("%u-byte known answer test PASSED\n", len); return; mismatch: printf("%u-byte known answer test FAILED:\n", len); if (len < 64) printf("Input: \"%.*s\"\n", len, text); else printf("Input: \"%.64s\"...\n", text); printf("Computed:"); for (i = 0; i < 20; i++) printf(" %02x", hash2[i]); printf("\nExpected:"); for (i = 0; i < 20; i++) printf(" %02x", hash[i]); putchar('\n'); } int main(void) { /* FIPS PUB 180.1 example A.1 */ static char const text1[3] = "abc"; static unsigned char const hash1[20] = { 0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e, 0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d }; /* FIPS PUB 180.1 example A.2 */ static char const text2[56] = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"; static unsigned char const hash2[20] = { 0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae, 0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 }; /* FIPS PUB 180.1 example A.3 */ char *text3; static unsigned char const hash3[20] = { 0x34, 0xaa, 0x97, 0x3c, 0xd4, 0xc4, 0xda, 0xa4, 0xf6, 0x1e, 0xeb, 0x2b, 0xdb, 0xad, 0x27, 0x31, 0x65, 0x34, 0x01, 0x6f }; katest(text1, sizeof text1, hash1); katest(text2, sizeof text2, hash2); text3 = malloc(1000000); memset(text3, 'a', 1000000); katest(text3, 1000000, hash3); return 0; } #endif ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 18:54 ` George Spelvin @ 2009-08-17 19:34 ` Nicolas Pitre 2009-08-17 23:12 ` George Spelvin 0 siblings, 1 reply; 28+ messages in thread From: Nicolas Pitre @ 2009-08-17 19:34 UTC (permalink / raw) To: George Spelvin; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds On Mon, 17 Aug 2009, George Spelvin wrote: > > I don't think this trick of making source code textually different from > > another work while still intimately mimicking the same structure entitles > > you to any copyright (or non copyright) claims over that other work. I > > certainly wouldn't bet any dime for this standing up in court. > > <div type="legal digression"> > Actually, I would. I did a lot more than text search and replace; > I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright) > and then merged in the *ideas* from the mailing list. > > (And from elsewhere; the idea of a five-round macro is from Brian Gladman.) > > Remember, to the extent that something is *functional*, it is not > copyrightable; copyright only covers the non-functional expressive bits. > The vast majority of that code is simply required by the standard, > or the desired calling interface. > > For a large portion of the rest, remember that standard programming > conventions (e.g. brace style, macro names IN CAPS, etc.) that's also > non-copyrightable "scene a faire" material. > > It's well established that paraphrasing a recipe avoids copyright; > the proportions and treatment of the ingredients is not copyrightable. > > For more details, see the extensive coverage of the NEC v. Intel decision > (1989) regarding the firmware for NEC's 8086-clone V20 microprocessor. > It was found non-infringing despite non-clean-room implementation and > substantial similarities. > </div> Whatever. NEC and Intel were certainly commercial competitors. They were far from being friends. So if you feel like having too many friends then just go ahead with that stance. > As for politeness, that's exactly why I did post it and solicit > objections. You said: |It uses Linus's and Artur's performance ideas, and some of Linus' macro |ideas (in the rotate implementation), but tries to be textually |different. Is there anything recognizable that anyone cares to clam |copyright to? the "try to be textually different" in order to ask for "anything recognizable that anyone cares to clam copyright to" is what I find dubious. > The purpose of the rewrite is to avoid having to make > pessimistic assumptions about people who don't respond. > > I suppose I should have made that request clearer: > Is there anyone who claims copyright on anything here? > Or would just like credit? > If so, are you willing to donate it to the public domain? I think this is much nicer to everyone involved. As far as I'm concerned, I'm OK with giving any small copyright I might have in this SHA1 implementation, if any, to the public domain. Credits are always nice. Nicolas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Linus' sha1 is much faster! 2009-08-17 19:34 ` Nicolas Pitre @ 2009-08-17 23:12 ` George Spelvin 0 siblings, 0 replies; 28+ messages in thread From: George Spelvin @ 2009-08-17 23:12 UTC (permalink / raw) To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds >> The purpose of the rewrite is to avoid having to make >> pessimistic assumptions about people who don't respond. >> >> I suppose I should have made that request clearer: >> Is there anyone who claims copyright on anything here? >> Or would just like credit? >> If so, are you willing to donate it to the public domain? > I think this is much nicer to everyone involved. > > As far as I'm concerned, I'm OK with giving any small copyright I might > have in this SHA1 implementation, if any, to the public domain. > Credits are always nice. My apologies. I read a lot of people talking about wanting the code under different licenses, and thought I'd just cut through it by providing some PD code. I didn't turn around and look at it from the point of view of the people who'd put the work into developing it. I don't mean to deny anyone credit for their work. In fact, providing more detail is on the to-do list, but I haven't waded through the mail archives and tracked down who contributed what yet. I'll work on those polish details once I have it producing the same assembly code as Linus'. There are a lot of possible highly-permissive licenses if one is wanted (zlib, MIT, CC-by), but public domain seems simpler. ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2017-04-20 21:38 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady 2009-08-15 20:02 ` Bryan Donlan 2009-08-15 20:12 ` John Tapsell 2009-08-15 20:23 ` Linus Torvalds 2009-08-15 20:54 ` Linus Torvalds 2009-08-17 1:55 ` Nicolas Pitre 2009-08-26 11:39 ` Pádraig Brady 2017-04-20 21:35 ` galt 2017-04-20 21:38 ` galt 2009-08-17 8:22 ` Andreas Ericsson 2009-08-16 0:06 ` Theodore Tso 2009-08-16 19:25 ` Giuseppe Scrivano 2009-08-16 20:10 ` Linus Torvalds 2009-08-16 22:15 ` Giuseppe Scrivano 2009-08-16 22:47 ` Linus Torvalds 2009-08-17 1:53 ` Pádraig Brady 2009-08-17 10:51 ` Giuseppe Scrivano 2009-08-17 15:44 ` Steven Noonan 2009-08-17 16:22 ` Linus Torvalds 2009-08-17 21:43 ` Steven Noonan 2009-08-17 17:32 ` Giuseppe Scrivano -- strict thread matches above, loose matches on Subject: below -- 2009-08-17 7:23 George Spelvin 2009-08-17 14:20 ` Nicolas Pitre 2009-08-17 17:06 ` Nicolas Pitre 2009-08-17 17:20 ` Paolo Bonzini 2009-08-17 18:54 ` George Spelvin 2009-08-17 19:34 ` Nicolas Pitre 2009-08-17 23:12 ` George Spelvin
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).