Linus' sha1 is much faster!

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Linus' sha1 is much faster!
@ 2009-08-14 23:25 Pádraig Brady
  2009-08-15 20:02 ` Bryan Donlan
  2009-08-16 19:25 ` Giuseppe Scrivano
  0 siblings, 2 replies; 28+ messages in thread
From: Pádraig Brady @ 2009-08-14 23:25 UTC (permalink / raw)
  To: Bug-coreutils, Linus Torvalds; +Cc: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1110 bytes --]

I've noticed before that coreutils hashing utils
were a little behind in performance, but was prompted
to look at it again when I noticed the recently
updated sha1 implementation in git:
http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu

Testing that with the attached program which I wrote
in a couple of mins to try and match sha1sum's system calls
shows that it's around 33% faster, as shown below:

$ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1

$ time ./linus-sha1 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
real    0m2.742s
user    0m2.516s
sys     0m0.206s

$ time ~/git/coreutils/src/sha1sum 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file

real    0m4.166s
user    0m3.846s
sys     0m0.298s

So, could we use that code in coreutils?
Think of all the dead fish it would save.

I've also attached a trivial block-sha1 patch which doesn't
affect performance, but does suppress a signed unsigned
comparison warning which occurs with -Wextra for example.

cheers,
Pádraig.

[-- Attachment #2: linus-sha1.c --]
[-- Type: text/x-csrc, Size: 681 bytes --]

/* gcc -O2 -Wall linus-sha1.c sha1.c -o linus-sha1 */
#include <stdio.h>
#include <stdlib.h>
#include "sha1.h"

int main(int argc, char** argv)
{
    if (argc != 2) return 1;
    const char* filename = argv[1];
    FILE *fp = fopen (filename, "r");
    if (!fp) return 1;

    #define BS 4096 /* match coreutils */

    blk_SHA_CTX ctx;
    blk_SHA1_Init(&ctx);
    size_t nr;
    char buf[BS];
    while ((nr=fread_unlocked(buf, 1, sizeof(buf), fp)))
        blk_SHA1_Update(&ctx, buf, nr);
    unsigned char hash[20];
    blk_SHA1_Final(hash, &ctx);
    int i;
    for (i=0; i<sizeof(hash); i++)
        printf("%02x",*(hash+i));
    printf("  %s\n", filename);

    return 0;
}

[-- Attachment #3: block-sha1-signed-warning.diff --]
[-- Type: text/x-patch, Size: 998 bytes --]

>From fa75e818836f763357ff9b7bbde3327e1aabbe47 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?P=C3=A1draig=20Brady?= <P@draigBrady.com>
Date: Sat, 15 Aug 2009 00:17:30 +0100
Subject: [PATCH] block-sha1: suppress signed unsigned comparison warning

* block-sha1/sha1.c: Use unsigned ints as the values
will never go negative.
---
 block-sha1/sha1.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..be763d8 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -231,13 +231,13 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 
 void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
-	int lenW = ctx->size & 63;
+	unsigned int lenW = ctx->size & 63;
 
 	ctx->size += len;
 
 	/* Read the data into W and process blocks as they get full */
 	if (lenW) {
-		int left = 64 - lenW;
+		unsigned int left = 64 - lenW;
 		if (len < left)
 			left = len;
 		memcpy(lenW + (char *)ctx->W, data, left);
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
@ 2009-08-15 20:02 ` Bryan Donlan
  2009-08-15 20:12   ` John Tapsell
  2009-08-16 19:25 ` Giuseppe Scrivano
  1 sibling, 1 reply; 28+ messages in thread
From: Bryan Donlan @ 2009-08-15 20:02 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Bug-coreutils, Linus Torvalds, Git Mailing List, Brandon Casey,
	Junio C Hamano, Nicolas Pitre

2009/8/14 Pádraig Brady <P@draigbrady.com>:
> I've noticed before that coreutils hashing utils
> were a little behind in performance, but was prompted
> to look at it again when I noticed the recently
> updated sha1 implementation in git:
> http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu
>
> Testing that with the attached program which I wrote
> in a couple of mins to try and match sha1sum's system calls
> shows that it's around 33% faster, as shown below:
>
> $ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1
>
> $ time ./linus-sha1 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
> real    0m2.742s
> user    0m2.516s
> sys     0m0.206s
>
> $ time ~/git/coreutils/src/sha1sum 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
>
> real    0m4.166s
> user    0m3.846s
> sys     0m0.298s
>
> So, could we use that code in coreutils?
> Think of all the dead fish it would save.

coreutils is licensed under GPLv3, and git under GPLv2 (only), so
you'd need permission from all contributors to the implementation in
order to relicense under GPLv3. A quick grep of the history suggests
these contributors to be:

Brandon Casey <drafnel@gmail.com>
Junio C Hamano <gitster@pobox.com>
Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Pitre <nico@cam.org>
(adding these people to the CC list)

Additionally, it was originally based on the code in
mozilla-sha1/sha1.c, but that contains a license grant allowing it to
be used under GPLv2 /or later/, so if GPLv3 relicensing is enough it
shouldn't be necessary to get in contact with the original author.
However if the FSF requires copyright assignment to accept the new
implementation, it will be necessary to track down contributors to the
original mozilla-sha1/sha1.c as well.

Note that I'm not a lawyer, so there might be other roadblocks etc to
this as well, etc :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:02 ` Bryan Donlan
@ 2009-08-15 20:12   ` John Tapsell
  2009-08-15 20:23     ` Linus Torvalds
  2009-08-16  0:06     ` Theodore Tso
  0 siblings, 2 replies; 28+ messages in thread
From: John Tapsell @ 2009-08-15 20:12 UTC (permalink / raw)
  To: Bryan Donlan
  Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
	Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre

2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> you'd need permission from all contributors to the implementation in
> order to relicense under GPLv3. A quick grep of the history suggests
> these contributors to be:


X11 also requires a fast SHA1 implementation.  It uses this to check
if two pixmaps are the same.  So it would be really nice to relicense
under a liberal enough license that xorg can use it.

John

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:12   ` John Tapsell
@ 2009-08-15 20:23     ` Linus Torvalds
  2009-08-15 20:54       ` Linus Torvalds
  2009-08-16  0:06     ` Theodore Tso
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-15 20:23 UTC (permalink / raw)
  To: John Tapsell
  Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Git Mailing List,
	Brandon Casey, Junio C Hamano, Nicolas Pitre

On Sat, 15 Aug 2009, John Tapsell wrote:

> 2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
> 
> X11 also requires a fast SHA1 implementation.  It uses this to check
> if two pixmaps are the same.  So it would be really nice to relicense
> under a liberal enough license that xorg can use it.

I'm personally ok with retaining the mozilla-sha1 license.

There's not really anything _remaining_ of the mozilla code, but hey, I 
started from it. In retrospect I probably should have started from the PPC 
asm code that already did the blocking sanely - but that's a "20/20 
hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so 
convinced that I could improve on things. So that's a kind of source for 
it, even if it's more about the motivational side than any actual 
remaining code ;)

That said, I don't know if the MPL is ok for X11. I've not looked at 
compatibility issues with MPL. For git, we could just ignore the MPL, 
since the GPLv2 was acceptable regardless of it.

			Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:23     ` Linus Torvalds
@ 2009-08-15 20:54       ` Linus Torvalds
  2009-08-17  1:55         ` Nicolas Pitre
  2009-08-17  8:22         ` Andreas Ericsson
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2009-08-15 20:54 UTC (permalink / raw)
  To: John Tapsell
  Cc: Paul Kocher, Bryan Donlan, Bug-coreutils, Brandon Casey,
	Junio C Hamano, Nicolas Pitre, Pádraig Brady,
	Git Mailing List

On Sat, 15 Aug 2009, Linus Torvalds wrote:
> 
> That said, I don't know if the MPL is ok for X11. I've not looked at 
> compatibility issues with MPL. For git, we could just ignore the MPL, 
> since the GPLv2 was acceptable regardless of it.

If MPL isn't ok for X11, then we'd need to make sure that even the 
silliest Mozilla crud has been rewritten. There really isn't much, but 
hey, the _history_ is based on the mozilla code, and who knows - the 
'blk_SHA_CTX' struct has things like the fields in the same order as the 
Mozilla equivalent, for all those historical reasons.

(Heh. Looking at that, I probably should move the 'size' field first, 
since that would have different alignment rules, and the struct would be 
more tightly packed that way, and initialize better).

Afaik, none of the actual code remains (the mozilla SHA1 thing did the 
wrong thing for performance even for just the final bytes, and did those a 
byte at a time etc, so I rewrote even the trivial SHA1_Final parts).

Of course, maybe the Mozilla people would be interested in taking my 
faster version, and say that the new-BSD license is ok, and make everybody 
happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added 
him to the Cc.

Paul, for your information, we're talking about a faster rewritten "mostly 
portable" SHA1 routines that you can find at

	http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu

(follow the "blob" pointers to see sha1.c and sha1.h). I don't know if 
you're active with Mozilla/Firefox or whether you even care, but you seem 
to be the logical choice of person to ask.

			Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:54       ` Linus Torvalds
@ 2009-08-17  1:55         ` Nicolas Pitre
  2009-08-26 11:39           ` Pádraig Brady
  2009-08-17  8:22         ` Andreas Ericsson
  1 sibling, 1 reply; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils,
	Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher

On Sat, 15 Aug 2009, Linus Torvalds wrote:

> (Heh. Looking at that, I probably should move the 'size' field first, 
> since that would have different alignment rules, and the struct would be 
> more tightly packed that way, and initialize better).

I was about to suggest (i.e. post) a patch for that.  This is indeed a 
good idea.

> Afaik, none of the actual code remains (the mozilla SHA1 thing did the 
> wrong thing for performance even for just the final bytes, and did those a 
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).

Maybe a patch adding a proper header with the actual license would be a 
good idea too.


Nicolas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17  1:55         ` Nicolas Pitre
@ 2009-08-26 11:39           ` Pádraig Brady
  2017-04-20 21:35             ` galt
  2017-04-20 21:38             ` galt
  0 siblings, 2 replies; 28+ messages in thread
From: Pádraig Brady @ 2009-08-26 11:39 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, John Tapsell, Bryan Donlan, Bug-coreutils,
	Git Mailing List, Brandon Casey, Junio C Hamano, Paul Kocher

Nicolas Pitre wrote:
> On Sat, 15 Aug 2009, Linus Torvalds wrote:
> 
>> (Heh. Looking at that, I probably should move the 'size' field first, 
>> since that would have different alignment rules, and the struct would be 
>> more tightly packed that way, and initialize better).
> 
> I was about to suggest (i.e. post) a patch for that.  This is indeed a 
> good idea.
> 
>> Afaik, none of the actual code remains (the mozilla SHA1 thing did the 
>> wrong thing for performance even for just the final bytes, and did those a 
>> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
> 
> Maybe a patch adding a proper header with the actual license would be a 
> good idea too.

So have you decided on a final licence,
and if so update the headers accordingly?

cheers!
Pádraig.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-26 11:39           ` Pádraig Brady
@ 2017-04-20 21:35             ` galt
  2017-04-20 21:38             ` galt
  1 sibling, 0 replies; 28+ messages in thread
From: galt @ 2017-04-20 21:35 UTC (permalink / raw)
  To: git

I also wanted to include Linus' sha1 in our software at work.
But the GPLv2 license was incompatible.
Too bad it is just just in the public domain.
I grabbed Steve Reid's public domain code from 1999
and ran it. It produced the same output.
I ran it on a 3GB input file, and Linus' code from 2009 takes 37 to 40
seconds.
(Just reading the file in the same 4k buffers only takes 3 seconds 
so disk reading does not dominate the time.)
When I ran Steve's old version on the same input it was taking just 36 or 37
seconds.
So it is slightly faster.
Have compilers improved?
I am using gcc 4.4.7-17.

--
View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657473.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-26 11:39           ` Pádraig Brady
  2017-04-20 21:35             ` galt
@ 2017-04-20 21:38             ` galt
  1 sibling, 0 replies; 28+ messages in thread
From: galt @ 2017-04-20 21:38 UTC (permalink / raw)
  To: git

A Phádraig, cá bhfuil tú i do chónaí?
Tá mé i gCalafoirne.



--
View this message in context: http://git.661346.n2.nabble.com/Linus-sha1-is-much-faster-tp3448007p7657474.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:54       ` Linus Torvalds
  2009-08-17  1:55         ` Nicolas Pitre
@ 2009-08-17  8:22         ` Andreas Ericsson
  1 sibling, 0 replies; 28+ messages in thread
From: Andreas Ericsson @ 2009-08-17  8:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Tapsell, Bryan Donlan, Pádraig Brady, Bug-coreutils,
	Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre,
	Paul Kocher, Paul Kocher

Linus Torvalds wrote:
> 
> On Sat, 15 Aug 2009, Linus Torvalds wrote:
>> That said, I don't know if the MPL is ok for X11. I've not looked at 
>> compatibility issues with MPL. For git, we could just ignore the MPL, 
>> since the GPLv2 was acceptable regardless of it.
> 
> If MPL isn't ok for X11, then we'd need to make sure that even the 
> silliest Mozilla crud has been rewritten. There really isn't much, but 
> hey, the _history_ is based on the mozilla code, and who knows - the 
> 'blk_SHA_CTX' struct has things like the fields in the same order as the 
> Mozilla equivalent, for all those historical reasons.
> 
> (Heh. Looking at that, I probably should move the 'size' field first, 
> since that would have different alignment rules, and the struct would be 
> more tightly packed that way, and initialize better).
> 
> Afaik, none of the actual code remains (the mozilla SHA1 thing did the 
> wrong thing for performance even for just the final bytes, and did those a 
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
> 
> Of course, maybe the Mozilla people would be interested in taking my 
> faster version, and say that the new-BSD license is ok, and make everybody 
> happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added 
> him to the Cc.
> 
> Paul, for your information, we're talking about a faster rewritten "mostly 
> portable" SHA1 routines that you can find at
> 
> 	http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu
> 
> (follow the "blob" pointers to see sha1.c and sha1.h). I don't know if 
> you're active with Mozilla/Firefox or whether you even care, but you seem 
> to be the logical choice of person to ask.
> 

I contacted Paul in february this year to get permission to use the mozilla
sha1 code for libgit2. His reply then was:
"I'm not sure which version the diffs are relative to, so I haven't reviewed them.
It's fine to distribute under BSD, GPL, or LGPL, however."

I also got explicit permission to relicense it under GPLv2 with the gcc exception.

I added the mail-address I used to contact him to CC as well. Sorry if you get
this twice, Paul.

Naturally, I'd like to use the faster version for libgit2 as well. The people
who Linus listed as contributors earlier (Brandon Casy, Linus, Junio and Nicolas
Pitre) have already consented to relicense their git contributions for libgit2
use. If anyone would like to revoke that consent for this code, speak now please,
or I'll patch it into libgit2 as well.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-15 20:12   ` John Tapsell
  2009-08-15 20:23     ` Linus Torvalds
@ 2009-08-16  0:06     ` Theodore Tso
  1 sibling, 0 replies; 28+ messages in thread
From: Theodore Tso @ 2009-08-16  0:06 UTC (permalink / raw)
  To: John Tapsell
  Cc: Bryan Donlan, Pádraig Brady, Bug-coreutils, Linus Torvalds,
	Git Mailing List, Brandon Casey, Junio C Hamano, Nicolas Pitre

On Sat, Aug 15, 2009 at 09:12:58PM +0100, John Tapsell wrote:
> 2009/8/15 Bryan Donlan <bdonlan@gmail.com>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
> 
> X11 also requires a fast SHA1 implementation.  It uses this to check
> if two pixmaps are the same.  So it would be really nice to relicense
> under a liberal enough license that xorg can use it.

If the checksum isn't being exposed in the protocol (i.e., it's just
internal to the X server), one possibility for X11 is to consider to
use the SHA-3 candidate Skein instead.  After receiving a large amount
of evaluation by cryptographic experts, it was one of the 18
algorithms (our of an original 64 entries) that have made it the 2nd
round of the NIST competition.  It's also *substantially* faster than
SHA:

    One exception to this is Skein, created by several well-known
    cryptographers and noted pundit Bruce Schneier. It was designed
    specifically to exploit all three of the Core 2 execution units
    and to run at a full 64-bits. This gives it roughly four to 10
    times the logic density of competing submissions.

    This is what I meant by the Matrix quote above. They didn't bend
    the spoon; they bent the crypto algorithm. They moved the logic
    operations around in a way that wouldn't weaken the crypto, but
    would strengthen its speed on the Intel Core 2.

    In their paper (PDF), the authors of Skein express surprise that a
    custom silicon ASIC implementation is not any faster than the
    software implementation. They shouldn't be surprised. Every time
    you can redefine a problem to run optimally in software, you will
    reach the same speeds you get with optimized ASIC hardware. The
    reason software has a reputation of being slow is because people
    don't redefine the original problem.

    http://www.darkreading.com/blog/archives/2008/11/bending_skein_c.html

For more information and some optimized implementation, see:

	http://www.skein-hash.info/

							- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
  2009-08-15 20:02 ` Bryan Donlan
@ 2009-08-16 19:25 ` Giuseppe Scrivano
  2009-08-16 20:10   ` Linus Torvalds
  2009-08-17  1:53   ` Pádraig Brady
  1 sibling, 2 replies; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-16 19:25 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Linus Torvalds, Bug-coreutils, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 575 bytes --]

Hi Pádraig,

I tried to reproduce your results but I wasn't able to do it.  The
biggest difference on a 300MB file I noticed was approximately 15% using
on both implementations -O2, and 5% using -O3.
My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
Pentium(R) D CPU 3.20GHz.

I also spent some time trying to improve the gnulib SHA1 implementation
and it seems a lookup table can improve things a bit.

Can you please try the patch that I have attached and tell me which
performance difference (if any) you get?

Thanks,
Giuseppe



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-SHA1-use-a-lookup-table-for-faster-hashing.patch --]
[-- Type: text/x-diff, Size: 8582 bytes --]

>From b975a5e0849eaa46e5cf410c5bf6e2308f044d61 Mon Sep 17 00:00:00 2001
From: Giuseppe Scrivano <gscrivano@gnu.org>
Date: Sun, 16 Aug 2009 20:53:54 +0200
Subject: [PATCH] SHA1: use a lookup table for faster hashing

* lib/sha1.c (struct sha1_pre): New member.
* lib/sha1.c (sha1_process_block): Use the lookup table to quickly find
indices to use in the current round.
---
 lib/sha1.c |  160 ++++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 92 insertions(+), 68 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9c6c7ae..ec18ba7 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -283,6 +283,32 @@ sha1_process_bytes (const void *buffer, size_t len, struct sha1_ctx *ctx)
 #define F3(B,C,D) ( ( B & C ) | ( D & ( B | C ) ) )
 #define F4(B,C,D) (B ^ C ^ D)
 
+struct lookup_t
+{
+  unsigned char l1 : 4;
+  unsigned char l2 : 4;
+  unsigned char l3 : 4;
+  unsigned char l4 : 4;
+};
+
+const static struct lookup_t
+sha1_pre[16] = {{(0 - 3) & 0x0f, (0 - 8) & 0x0f, (0 - 14) & 0x0f},
+                {(1 - 3) & 0x0f, (1 - 8) & 0x0f, (1 - 14) & 0x0f},
+                {(2 - 3) & 0x0f, (2 - 8) & 0x0f, (2 - 14) & 0x0f},
+                {(3 - 3) & 0x0f, (3 - 8) & 0x0f, (3 - 14) & 0x0f},
+                {(4 - 3) & 0x0f, (4 - 8) & 0x0f, (4 - 14) & 0x0f},
+                {(5 - 3) & 0x0f, (5 - 8) & 0x0f, (5 - 14) & 0x0f},
+                {(6 - 3) & 0x0f, (6 - 8) & 0x0f, (6 - 14) & 0x0f},
+                {(7 - 3) & 0x0f, (7 - 8) & 0x0f, (7 - 14) & 0x0f},
+                {(8 - 3) & 0x0f, (8 - 8) & 0x0f, (8 - 14) & 0x0f},
+                {(9 - 3) & 0x0f, (9 - 8) & 0x0f, (9 - 14) & 0x0f},
+                {(10 - 3) & 0x0f, (10 - 8) & 0x0f, (10 - 14) & 0x0f},
+                {(11 - 3) & 0x0f, (11 - 8) & 0x0f, (11 - 14) & 0x0f},
+                {(12 - 3) & 0x0f, (12 - 8) & 0x0f, (12 - 14) & 0x0f},
+                {(13 - 3) & 0x0f, (13 - 8) & 0x0f, (13 - 14) & 0x0f},
+                {(14 - 3) & 0x0f, (14 - 8) & 0x0f, (14 - 14) & 0x0f},
+                {(15 - 3) & 0x0f, (15 - 8) & 0x0f, (15 - 14) & 0x0f}};
+
 /* Process LEN bytes of BUFFER, accumulating context into CTX.
    It is assumed that LEN % 64 == 0.
    Most of this code comes from GnuPG's cipher/sha1.c.  */
@@ -309,9 +335,8 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
 
 #define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n))))
 
-#define M(I) ( tm =   x[I&0x0f] ^ x[(I-14)&0x0f] \
-		    ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \
-	       , (x[I&0x0f] = rol(tm, 1)) )
+#define M(I) (x[I] = rol (x[sha1_pre[I].l1] ^ x[sha1_pre[I].l2] \
+                          ^ x[sha1_pre[I].l3] ^ x[I], 1))
 
 #define R(A,B,C,D,E,F,K,M)  do { E += rol( A, 5 )     \
 				      + F( B, C, D )  \
@@ -322,7 +347,6 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
 
   while (words < endp)
     {
-      uint32_t tm;
       int t;
       for (t = 0; t < 16; t++)
 	{
@@ -346,70 +370,70 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
       R( c, d, e, a, b, F1, K1, x[13] );
       R( b, c, d, e, a, F1, K1, x[14] );
       R( a, b, c, d, e, F1, K1, x[15] );
-      R( e, a, b, c, d, F1, K1, M(16) );
-      R( d, e, a, b, c, F1, K1, M(17) );
-      R( c, d, e, a, b, F1, K1, M(18) );
-      R( b, c, d, e, a, F1, K1, M(19) );
-      R( a, b, c, d, e, F2, K2, M(20) );
-      R( e, a, b, c, d, F2, K2, M(21) );
-      R( d, e, a, b, c, F2, K2, M(22) );
-      R( c, d, e, a, b, F2, K2, M(23) );
-      R( b, c, d, e, a, F2, K2, M(24) );
-      R( a, b, c, d, e, F2, K2, M(25) );
-      R( e, a, b, c, d, F2, K2, M(26) );
-      R( d, e, a, b, c, F2, K2, M(27) );
-      R( c, d, e, a, b, F2, K2, M(28) );
-      R( b, c, d, e, a, F2, K2, M(29) );
-      R( a, b, c, d, e, F2, K2, M(30) );
-      R( e, a, b, c, d, F2, K2, M(31) );
-      R( d, e, a, b, c, F2, K2, M(32) );
-      R( c, d, e, a, b, F2, K2, M(33) );
-      R( b, c, d, e, a, F2, K2, M(34) );
-      R( a, b, c, d, e, F2, K2, M(35) );
-      R( e, a, b, c, d, F2, K2, M(36) );
-      R( d, e, a, b, c, F2, K2, M(37) );
-      R( c, d, e, a, b, F2, K2, M(38) );
-      R( b, c, d, e, a, F2, K2, M(39) );
-      R( a, b, c, d, e, F3, K3, M(40) );
-      R( e, a, b, c, d, F3, K3, M(41) );
-      R( d, e, a, b, c, F3, K3, M(42) );
-      R( c, d, e, a, b, F3, K3, M(43) );
-      R( b, c, d, e, a, F3, K3, M(44) );
-      R( a, b, c, d, e, F3, K3, M(45) );
-      R( e, a, b, c, d, F3, K3, M(46) );
-      R( d, e, a, b, c, F3, K3, M(47) );
-      R( c, d, e, a, b, F3, K3, M(48) );
-      R( b, c, d, e, a, F3, K3, M(49) );
-      R( a, b, c, d, e, F3, K3, M(50) );
-      R( e, a, b, c, d, F3, K3, M(51) );
-      R( d, e, a, b, c, F3, K3, M(52) );
-      R( c, d, e, a, b, F3, K3, M(53) );
-      R( b, c, d, e, a, F3, K3, M(54) );
-      R( a, b, c, d, e, F3, K3, M(55) );
-      R( e, a, b, c, d, F3, K3, M(56) );
-      R( d, e, a, b, c, F3, K3, M(57) );
-      R( c, d, e, a, b, F3, K3, M(58) );
-      R( b, c, d, e, a, F3, K3, M(59) );
-      R( a, b, c, d, e, F4, K4, M(60) );
-      R( e, a, b, c, d, F4, K4, M(61) );
-      R( d, e, a, b, c, F4, K4, M(62) );
-      R( c, d, e, a, b, F4, K4, M(63) );
-      R( b, c, d, e, a, F4, K4, M(64) );
-      R( a, b, c, d, e, F4, K4, M(65) );
-      R( e, a, b, c, d, F4, K4, M(66) );
-      R( d, e, a, b, c, F4, K4, M(67) );
-      R( c, d, e, a, b, F4, K4, M(68) );
-      R( b, c, d, e, a, F4, K4, M(69) );
-      R( a, b, c, d, e, F4, K4, M(70) );
-      R( e, a, b, c, d, F4, K4, M(71) );
-      R( d, e, a, b, c, F4, K4, M(72) );
-      R( c, d, e, a, b, F4, K4, M(73) );
-      R( b, c, d, e, a, F4, K4, M(74) );
-      R( a, b, c, d, e, F4, K4, M(75) );
-      R( e, a, b, c, d, F4, K4, M(76) );
-      R( d, e, a, b, c, F4, K4, M(77) );
-      R( c, d, e, a, b, F4, K4, M(78) );
-      R( b, c, d, e, a, F4, K4, M(79) );
+      R( e, a, b, c, d, F1, K1, M( 0) );
+      R( d, e, a, b, c, F1, K1, M( 1) );
+      R( c, d, e, a, b, F1, K1, M( 2) );
+      R( b, c, d, e, a, F1, K1, M( 3) );
+      R( a, b, c, d, e, F2, K2, M( 4) );
+      R( e, a, b, c, d, F2, K2, M( 5) );
+      R( d, e, a, b, c, F2, K2, M( 6) );
+      R( c, d, e, a, b, F2, K2, M( 7) );
+      R( b, c, d, e, a, F2, K2, M( 8) );
+      R( a, b, c, d, e, F2, K2, M( 9) );
+      R( e, a, b, c, d, F2, K2, M(10) );
+      R( d, e, a, b, c, F2, K2, M(11) );
+      R( c, d, e, a, b, F2, K2, M(12) );
+      R( b, c, d, e, a, F2, K2, M(13) );
+      R( a, b, c, d, e, F2, K2, M(14) );
+      R( e, a, b, c, d, F2, K2, M(15) );
+      R( d, e, a, b, c, F2, K2, M( 0) );
+      R( c, d, e, a, b, F2, K2, M( 1) );
+      R( b, c, d, e, a, F2, K2, M( 2) );
+      R( a, b, c, d, e, F2, K2, M( 3) );
+      R( e, a, b, c, d, F2, K2, M( 4) );
+      R( d, e, a, b, c, F2, K2, M( 5) );
+      R( c, d, e, a, b, F2, K2, M( 6) );
+      R( b, c, d, e, a, F2, K2, M( 7) );
+      R( a, b, c, d, e, F3, K3, M( 8) );
+      R( e, a, b, c, d, F3, K3, M( 9) );
+      R( d, e, a, b, c, F3, K3, M(10) );
+      R( c, d, e, a, b, F3, K3, M(11) );
+      R( b, c, d, e, a, F3, K3, M(12) );
+      R( a, b, c, d, e, F3, K3, M(13) );
+      R( e, a, b, c, d, F3, K3, M(14) );
+      R( d, e, a, b, c, F3, K3, M(15) );
+      R( c, d, e, a, b, F3, K3, M( 0) );
+      R( b, c, d, e, a, F3, K3, M( 1) );
+      R( a, b, c, d, e, F3, K3, M( 2) );
+      R( e, a, b, c, d, F3, K3, M( 3) );
+      R( d, e, a, b, c, F3, K3, M( 4) );
+      R( c, d, e, a, b, F3, K3, M( 5) );
+      R( b, c, d, e, a, F3, K3, M( 6) );
+      R( a, b, c, d, e, F3, K3, M( 7) );
+      R( e, a, b, c, d, F3, K3, M( 8) );
+      R( d, e, a, b, c, F3, K3, M( 9) );
+      R( c, d, e, a, b, F3, K3, M(10) );
+      R( b, c, d, e, a, F3, K3, M(11) );
+      R( a, b, c, d, e, F4, K4, M(12) );
+      R( e, a, b, c, d, F4, K4, M(13) );
+      R( d, e, a, b, c, F4, K4, M(14) );
+      R( c, d, e, a, b, F4, K4, M(15) );
+      R( b, c, d, e, a, F4, K4, M( 0) );
+      R( a, b, c, d, e, F4, K4, M( 1) );
+      R( e, a, b, c, d, F4, K4, M( 2) );
+      R( d, e, a, b, c, F4, K4, M( 3) );
+      R( c, d, e, a, b, F4, K4, M( 4) );
+      R( b, c, d, e, a, F4, K4, M( 5) );
+      R( a, b, c, d, e, F4, K4, M( 6) );
+      R( e, a, b, c, d, F4, K4, M( 7) );
+      R( d, e, a, b, c, F4, K4, M( 8) );
+      R( c, d, e, a, b, F4, K4, M( 9) );
+      R( b, c, d, e, a, F4, K4, M(10) );
+      R( a, b, c, d, e, F4, K4, M(11) );
+      R( e, a, b, c, d, F4, K4, M(12) );
+      R( d, e, a, b, c, F4, K4, M(13) );
+      R( c, d, e, a, b, F4, K4, M(14) );
+      R( b, c, d, e, a, F4, K4, M(15) );
 
       a = ctx->A += a;
       b = ctx->B += b;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-16 19:25 ` Giuseppe Scrivano
@ 2009-08-16 20:10   ` Linus Torvalds
  2009-08-16 22:15     ` Giuseppe Scrivano
  2009-08-17  1:53   ` Pádraig Brady
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-16 20:10 UTC (permalink / raw)
  To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List

On Sun, 16 Aug 2009, Giuseppe Scrivano wrote:
> 
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.

Netburst is very sensitive to random spill effects, and you can basically 
tune things by just code shuffling that just has random effects on the 
generated asm code.

> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.

I pretty much can guarantee you that it improves things only because it 
makes gcc generate crap code, which then hides some of the P4 issues.

I'd also suggest you try gcc-4.4, since that apparently fixes some of the 
oddest spill issues.

			Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-16 20:10   ` Linus Torvalds
@ 2009-08-16 22:15     ` Giuseppe Scrivano
  2009-08-16 22:47       ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-16 22:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pádraig Brady, Bug-coreutils, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I pretty much can guarantee you that it improves things only because it 
> makes gcc generate crap code, which then hides some of the P4 issues.
>
> I'd also suggest you try gcc-4.4, since that apparently fixes some of the 
> oddest spill issues.

Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
4.3 on the gnulib SHA1 implementation and my patch makes it even more!

I noticed that on my machine your implementation is ~30-40% faster using
SHA_ROT for rol/ror instructions than inline assembly, at least with the
test-case Pádraig wrote.  Am I the only one reporting it?

Cheers,
Giuseppe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-16 22:15     ` Giuseppe Scrivano
@ 2009-08-16 22:47       ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2009-08-16 22:47 UTC (permalink / raw)
  To: Giuseppe Scrivano; +Cc: Bug-coreutils, Pádraig Brady, Git Mailing List

On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
> 
> Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
> 4.3 on the gnulib SHA1 implementation and my patch makes it even more!

Check out the asm, see if you can see why. One of the most common problems 
with P4's is literally that you end up loading from the same stack slot 
that you just stored to (gcc can do some really crazy spills), and that 
causes a store buffer hazard replay.

My personal opinion is that Netburst is useless for trying to optimize C 
code for. It's just too random.

> I noticed that on my machine your implementation is ~30-40% faster using
> SHA_ROT for rol/ror instructions than inline assembly, at least with the
> test-case Pádraig wrote.  Am I the only one reporting it?

I bet it's the same thing. Small perturbations of the source causing small 
changes to register allocation and thus spilling, and then Netburst goes 
crazy one way or another. It's interestign trying to fix it, and very 
frustrating.

My workstation is a Nehalem (but Core 2 will have pretty much the same 
behavior), and it doesn't have the crazy netburst behavior. Shorter and 
simpler code generally performs better (which is _not_ true on Netburst). 

On my machine, for example, forcing gcc to do those rotates on registers 
is the difference between ~381MB/s and 415MB/s. And that's mainly because 
it makes gcc keep A-E in registers, rather than trying to cache the 
array[] references.

			Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-16 19:25 ` Giuseppe Scrivano
  2009-08-16 20:10   ` Linus Torvalds
@ 2009-08-17  1:53   ` Pádraig Brady
  2009-08-17 10:51     ` Giuseppe Scrivano
  1 sibling, 1 reply; 28+ messages in thread
From: Pádraig Brady @ 2009-08-17  1:53 UTC (permalink / raw)
  To: Giuseppe Scrivano; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List

Giuseppe Scrivano wrote:
> Hi Pádraig,
> 
> I tried to reproduce your results but I wasn't able to do it.  The
> biggest difference on a 300MB file I noticed was approximately 15% using
> on both implementations -O2, and 5% using -O3.
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.
> 
> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.
> 
> Can you please try the patch that I have attached and tell me which
> performance difference (if any) you get?

Thanks for looking at this Giuseppe
and sorry for not mentioning my GCC and CPU.

Note the binaries below is compiled with
$(rpm -q --qf="%{OPTFLAGS}\n" coreutils)
for consistency, which on my F11 machines is:

  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
  -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
  -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1

Testing on 2 machines I have here:

$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Pentium(R) M processor 1.70GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real    0m3.540s
$ time linus-sha1 sha1.test
real    0m2.319s (-34%)
$ time  giuseppe-sha1sum sha1.test
real    0m3.513s (-.8%)

$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real    0m1.857s
$ time linus-sha1 sha1.test
real    0m1.102s (-40%)
$ time giuseppe-sha1sum sha1.test
real    0m1.932s (+ 4%)

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17  1:53   ` Pádraig Brady
@ 2009-08-17 10:51     ` Giuseppe Scrivano
  2009-08-17 15:44       ` Steven Noonan
  0 siblings, 1 reply; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-17 10:51 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Bug-coreutils, Linus Torvalds, Git Mailing List

Pádraig Brady <P@draigBrady.com> writes:

>   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
>   -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
>   -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1

thanks.  I did again all tests on my machine using these same options.
I repeated each test 6 times and I took the median without consider the
first result.  Except the first run that it is not considered, I didn't
report a big variance on results of the same test.

gcc 4.3.3

gnulib sha1:            real	0m2.543s
gnulib sha1 lookup:     real	0m1.906s (-25%)
linus's sha1:           real	0m2.468s (-3%)
linus's sha1 no asm:    real	0m2.289s (-9%)

gcc 4.4.1

gnulib sha1:            real	0m3.386s
gnulib sha1 lookup:     real	0m3.110s (-8%)
linus's sha1:           real	0m1.701s (-49%)
linus's sha1 no asm:    real	0m1.284s (-62%)

I don't see such big differences in asm generated by gcc 4.4.1 and gcc
4.3.3 to explain this performance difference, what I noticed immediately
is that in the gcc-4.4 generated asm there are more "lea" instructions
(+30%), but I doubt this is the reason of these poor results.  Anyway, I
haven't yet looked much in details.

Cheers,
Giuseppe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 10:51     ` Giuseppe Scrivano
@ 2009-08-17 15:44       ` Steven Noonan
  2009-08-17 16:22         ` Linus Torvalds
  2009-08-17 17:32         ` Giuseppe Scrivano
  0 siblings, 2 replies; 28+ messages in thread
From: Steven Noonan @ 2009-08-17 15:44 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
	Git Mailing List

On Mon, Aug 17, 2009 at 3:51 AM, Giuseppe Scrivano<gscrivano@gnu.org> wrote:
> Pádraig Brady <P@draigBrady.com> writes:
>
>>   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
>>   -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
>>   -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1
>
> thanks.  I did again all tests on my machine using these same options.
> I repeated each test 6 times and I took the median without consider the
> first result.  Except the first run that it is not considered, I didn't
> report a big variance on results of the same test.
>
>
> gcc 4.3.3
>
> gnulib sha1:            real    0m2.543s
> gnulib sha1 lookup:     real    0m1.906s (-25%)
> linus's sha1:           real    0m2.468s (-3%)
> linus's sha1 no asm:    real    0m2.289s (-9%)
>
>
> gcc 4.4.1
>
> gnulib sha1:            real    0m3.386s
> gnulib sha1 lookup:     real    0m3.110s (-8%)
> linus's sha1:           real    0m1.701s (-49%)
> linus's sha1 no asm:    real    0m1.284s (-62%)
>
>
> I don't see such big differences in asm generated by gcc 4.4.1 and gcc
> 4.3.3 to explain this performance difference, what I noticed immediately
> is that in the gcc-4.4 generated asm there are more "lea" instructions
> (+30%), but I doubt this is the reason of these poor results.  Anyway, I
> haven't yet looked much in details.
>
> Cheers,
> Giuseppe

Interesting. I compared Linus' implementation to the public domain one
by Steve Reid[1], which is used in OpenLDAP and a few other projects.
Anyone with some experience testing these kinds of things in a
statistically sound manner want to try it out? In my tests, I got
this:

(average of 5 runs)
Linus' sha1: 283MB/s
Steve Reid's sha1: 305MB/s

- Steven

[1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 15:44       ` Steven Noonan
@ 2009-08-17 16:22         ` Linus Torvalds
  2009-08-17 21:43           ` Steven Noonan
  2009-08-17 17:32         ` Giuseppe Scrivano
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2009-08-17 16:22 UTC (permalink / raw)
  To: Steven Noonan
  Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils,
	Git Mailing List

On Mon, 17 Aug 2009, Steven Noonan wrote:
> 
> Interesting. I compared Linus' implementation to the public domain one
> by Steve Reid[1]

You _really_ need to talk about what kind of environment you have.

There are three major issues:
 - Netburst vs non-netburst
 - 32-bit vs 64-bit
 - compiler version

Steve Reid's code looks great, but the way it is coded, gcc makes a mess 
of it, which is exactly what my SHA1 tries to avoid.

[ In contrast, gcc does very well on just about _any_ straightforward 
  unrolled SHA1 C code if the target architecture is something like PPC or 
  ia64 that has enough registers to keep it all in registers.

  I haven't really tested other compilers - a less aggressive compiler 
  would actually do _better_ on SHA1, because the problem with gcc is that 
  it turns the whole temporary 16-entry word array into register accesses, 
  and tries to do register allocation on that _array_.

  That is wonderful for the above-mentioned PPC and IA64, but it makes gcc 
  create totally crazy code when there aren't enough registers, and then 
  gcc starts spilling randomly (ie it starts spilling a-e etc). This is 
  why the compiler and version matters so much. ]

> (average of 5 runs)
> Linus' sha1: 283MB/s
> Steve Reid's sha1: 305MB/s

So I get very different results:

	#             TIME[s] SPEED[MB/s]
	Reid            2.742       222.6
	linus           1.464         417

this is Intel Nehalem, but compiled for 32-bit mode (which is the more 
challenging one because x86-32 only has 7 general-purpose registers), and 
with gcc-4.4.0.

			Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 16:22         ` Linus Torvalds
@ 2009-08-17 21:43           ` Steven Noonan
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Noonan @ 2009-08-17 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Giuseppe Scrivano, Pádraig Brady, Bug-coreutils,
	Git Mailing List

On Mon, Aug 17, 2009 at 9:22 AM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 17 Aug 2009, Steven Noonan wrote:
>>
>> Interesting. I compared Linus' implementation to the public domain one
>> by Steve Reid[1]
>
> You _really_ need to talk about what kind of environment you have.
>
> There are three major issues:
>  - Netburst vs non-netburst
>  - 32-bit vs 64-bit
>  - compiler version

Right. I'm running a Core 2 "Merom" 2.33GHz. The code was compiled for
x86_64 with GCC 4.2.1. I didn't _expect_ it to compile for x86_64, but
apparently the version of GCC that ships with Xcode 3.2 defaults to
compiling 64-bit code on machines that are capable of running it.

>
> Steve Reid's code looks great, but the way it is coded, gcc makes a mess
> of it, which is exactly what my SHA1 tries to avoid.
>
> [ In contrast, gcc does very well on just about _any_ straightforward
>  unrolled SHA1 C code if the target architecture is something like PPC or
>  ia64 that has enough registers to keep it all in registers.
>
>  I haven't really tested other compilers - a less aggressive compiler
>  would actually do _better_ on SHA1, because the problem with gcc is that
>  it turns the whole temporary 16-entry word array into register accesses,
>  and tries to do register allocation on that _array_.
>
>  That is wonderful for the above-mentioned PPC and IA64, but it makes gcc
>  create totally crazy code when there aren't enough registers, and then
>  gcc starts spilling randomly (ie it starts spilling a-e etc). This is
>  why the compiler and version matters so much. ]
>
>> (average of 5 runs)
>> Linus' sha1: 283MB/s
>> Steve Reid's sha1: 305MB/s
>
> So I get very different results:
>
>        #             TIME[s] SPEED[MB/s]
>        Reid            2.742       222.6
>        linus           1.464         417

Added -m32:

Steve Reid: 156MB/s
Linus: 209MB/s

So on x86, your code really kicks butt.

> this is Intel Nehalem, but compiled for 32-bit mode (which is the more
> challenging one because x86-32 only has 7 general-purpose registers), and
> with gcc-4.4.0.
>
>                        Linus
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 15:44       ` Steven Noonan
  2009-08-17 16:22         ` Linus Torvalds
@ 2009-08-17 17:32         ` Giuseppe Scrivano
  1 sibling, 0 replies; 28+ messages in thread
From: Giuseppe Scrivano @ 2009-08-17 17:32 UTC (permalink / raw)
  To: Steven Noonan
  Cc: Pádraig Brady, Bug-coreutils, Linus Torvalds,
	Git Mailing List

Hi,

These are the results I reported (median of 5 plus an additional not
considered first run) on the Steve Reid's SHA1 implementation using the
same flags to the compiler that I used for previous tests.

GCC 4.3.3:  real	0m2.627s
GCC 4.4.1:  real	0m3.742s

In both cases it showed to be slower than other implementations I have
already tried.

Additional note: as for gnulib SHA1, GCC 4.4.1 produced slower code than
GCC 4.3.3.

Cheers,
Giuseppe



Steven Noonan <steven@uplinklabs.net> writes:

>
> Interesting. I compared Linus' implementation to the public domain one
> by Steve Reid[1], which is used in OpenLDAP and a few other projects.
> Anyone with some experience testing these kinds of things in a
> statistically sound manner want to try it out? In my tests, I got
> this:
>
> (average of 5 runs)
> Linus' sha1: 283MB/s
> Steve Reid's sha1: 305MB/s
>
> - Steven
>
> [1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
@ 2009-08-17  7:23 George Spelvin
  2009-08-17 14:20 ` Nicolas Pitre
  2009-08-17 17:06 ` Nicolas Pitre
  0 siblings, 2 replies; 28+ messages in thread
From: George Spelvin @ 2009-08-17  7:23 UTC (permalink / raw)
  To: bdonlan, johnflux, P; +Cc: art.08.09, git, linux, nico, torvalds

If it helps anyone resolve license issues, here's a from-FIPS-180-2
implementation that's placed in the public domain.  That should be
compatible with any license.

It uses Linus's and Artur's performance ideas, and some of Linus' macro
ideas (in the rotate implementation), but tries to be textually different.
Is there anything recognizable that anyone cares to clam copyright to?

It's not quite 100% finished, as I haven't benchmarked it against Linus's
code yet, but it's functionally correct.

It's also clean with -W -Wall -Wextra.

TODO: Check if an initial copy to w[] is faster on i386 (less register
pressure).

/*
 * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
 *
 * This implementation is in the public domain.  Copyright abandoned.
 * You may do anything you like with it, including evil things.
 *
 * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
 * from the git mailing list (August, 2009).  Additional optimization
 * ideas cribbed from
 * - Artur Skawina (x86, particularly P4, and much benchmarking)
 * - Nicilas Pitre (ARM)
 */

/* Cut here for sha1.h */
#include <stddef.h>	/* For size_t */
#include <stdint.h>	/* For uint32_t, uint64_t */

typedef struct SHA_context {
	uint64_t len;		/* May be shrunk to uint32_t */
	uint32_t iv[5];
	unsigned char buf[64];		/* Must be 32-bit aligned */
} SHA_CTX;

void SHA1_Init(struct SHA_context *c);
void SHA1_Update(struct SHA_context *c, void const *p, size_t n);
void SHA1_Final(unsigned char hash[20], struct SHA_context *c);
/* End of sha1.h */

#include <string.h>	/* For memcpy() */
#include <arpa/inet.h>	/* For ntohl() */

static void sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks);

/* Machine specific hacks */
#if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \
    defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \
    defined(__s390__) || defined(__s390x__)
/* Unaligned access is okay */
static inline uint32_t get_be32(unsigned char const *p)
{
	return ntohl(*(uint32_t const *)p);
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
	*(uint32_t *)p = htonl(v);
}

#else
/* Unaligned access is not okay; do conversion as byte fetches */
static inline uint32_t get_be32(unsigned char const *p)
{
	return p[0] << 24 || p[1] << 16 | p[8] << 8 | p[3];
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
	p[0] = v >> 24;
	p[1] = v >> 16;
	p[2] = v >> 8;
	p[3] = v;
}
#endif

void SHA1_Init(struct SHA_context *c)
{
	/* This is a prefix of the SHA_context structure */
	static struct {
		uint64_t len;
		uint32_t iv[5];
	} const iv = { 0,
		{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }
	};

	memcpy(c, &iv, sizeof iv);
}

void SHA1_Update(struct SHA_context *c, void const *p, size_t n)
{
	size_t pos = c->len & 63;	/* Offset into current input block */

	c->len += n;

	/* Initial partial block (if any) */
	if (pos) {
		size_t space = 63 - pos;
		if (n < space)
			goto end;
		memcpy(c->buf + pos, p, space);
		sha1_core(c->iv, c->buf, 1);
		n -= space;
		p = (char const *)p + space;
	}

	/* The large middle piece */
	if (n >> 6) {
		sha1_core(c->iv, p, n >> 6);
		p = (char const *)p + (n & -(size_t)64);
		n &= 63;
	}
	pos = 0;
end:
	/* Final partial block (may be zero size) */
	memcpy(c->buf + pos, p, n);
}

void SHA1_Final(unsigned char hash[20], struct SHA_context *c)
{
	size_t pos = c->len & 63;
	unsigned i;

	/* Append a single 1 bit */
	c->buf[pos++] = 0x80;

	/* Append 0 bits until 64 bits remain in a block */
	if (pos > 56) {
		memset(c->buf + pos, 0, 64 - pos);
		sha1_core(c->iv, c->buf, 1);
		pos = 0;
	}
	memset(c->buf + pos, 0, 56 - pos);

	/* Append total input length in bits */
	((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29));
	((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3);

	/* Final hash round */
	sha1_core(c->iv, c->buf, 1);

	/* Copy hash result out */
	for (i = 0; i < 5; i++)
		put_be32(hash + 4*i, c->iv[i]);
}


/*
 * Helper macros for sha1_core function.  To avoid clutter, these macros
 * are NOT fully parenthesized if it doesn't matter to the actual use.
 */

#if __GNUC__ && (defined(__i386__) || defined(__x86_64__))
/*
 * GCC by itself only generates left rotates.  Use right rotates if
 * possible to be kinder to dinky implementations with iterative rotate
 * instructions.
 */
#define ROT(op, x, k) \
	({ uint32_t y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define ROL(x,k) ROT("roll", x, k)
#define ROR(x,k) ROT("rorl", x, k)

#else
/* Generic C equivalent */
#define ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define ROL(x,k) ROT(x,k,32-(k))
#define ROR(x,k) ROT(x,32-(k),k)
#endif


/*
 * An important temporary array in SHA-1 is the working array W[],
 * which holds a scheduled key word per round.  Only the last 16 words
 * are relevant, so only use 16 words...
 */
#define W(i) w[(i) & 15]

/*
 * If you have a small register set, it helps GCC to force stores to
 * the w[] array to memory.  Given 22 or more registers (e.g. PowerPC),
 * GCC can map the entire w[] array to registers and this becomes
 * counterproductive.
 *
 * The optimal kludge seems to differ between x86 and ARM.  On the latter,
 * forcing a full memory barrier is required to stop GCC from splitting
 * live ranges for each round and generating a separate stack slot for
 * each.  (Which produces a huge stack frame and kills performance.)
 */
#if defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) *(volatile uint32_t *)&W(i) = x
#elif __GNUC__ && defined(__arm__)
#define STORE(i, x) W(i) = x; __asm__("":::"memory")
#else
#define STORE(i, x) W(i) = x
#endif


/* The three round functions.  F2 is also used as F4 */
#define F1(b,c,d) (((d ^ c) & b) ^ d)		/* Bitwise b ? c : d */
#define F2(b,c,d) (d ^ c ^ b)			/* Even parity */
#define F3(b,c,d) (d & c) + ((d ^ c) & b)	/* Majority function */

/* The four round constants */
#define K1 0x5a827999	/* 2^30 * sqrt(2) */
#define K2 0x6ed9eba1	/* 2^30 * sqrt(3) */
#define K3 0x8f1bbcdc	/* 2^30 * sqrt(5) */
#define K4 0xca62c1d6	/* 2^30 * sqrt(10) */

/* Rounds 0..15 fetch a word from the input */
#define FETCH(t,i)	t = get_be32(p + 4*(i)); STORE(i,t)
/* Rounds 16..79 mix previous words to get a new one */
#define MIX(t,i)	t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1)
/* Rounds 16..76 have to store back the result */
#define CALC(t,i)	MIX(t,i); STORE(i,t)

/* The basic SHA-1 round */
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i)	\
	src(t,i);				\
	e += t + f(b,c,d) + k + ROL(a,5);	\
	b = ROR(b,2)

/* An aligned group of 5 rounds */
#define SHA_ROUND5(f,k,src,t,i) \
	SHA_ROUND(a,b,c,d,e, f,k,src,t,i);	\
	SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1);	\
	SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2);	\
	SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3);	\
	SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4)

/*
 * The core SHA-1 transform.
 *
 * iv[5] = Current SHA-1 hash state.
 * p = Pointer to source data.  Not necessarily aligned.
 * nblocks = Number of 64-byte blocks at p.  Guaranteed non-zero.
 */
static void
sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks)
{
	uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0];
	uint32_t w[16];

	do {
		uint32_t t;

		SHA_ROUND5(F1, K1, FETCH, t,  0);
		SHA_ROUND5(F1, K1, FETCH, t,  5);
		SHA_ROUND5(F1, K1, FETCH, t, 10);
		SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15);
		SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16);
		SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17);
		SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18);
		SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19);

		SHA_ROUND5(F2, K2, CALC, t, 20);
		SHA_ROUND5(F2, K2, CALC, t, 25);
		SHA_ROUND5(F2, K2, CALC, t, 30);
		SHA_ROUND5(F2, K2, CALC, t, 35);

		SHA_ROUND5(F3, K3, CALC, t, 40);
		SHA_ROUND5(F3, K3, CALC, t, 45);
		SHA_ROUND5(F3, K3, CALC, t, 50);
		SHA_ROUND5(F3, K3, CALC, t, 55);

		SHA_ROUND5(F2, K4, CALC, t, 60);
		SHA_ROUND5(F2, K4, CALC, t, 65);
		SHA_ROUND5(F2, K4, CALC, t, 70);

		SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75);
		SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76);
		/* Last 3 rounds don't need to store mixed value */
		SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77);
		SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78);
		SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79);

		iv[4] = e += iv[4];
		iv[3] = d += iv[3];
		iv[2] = c += iv[2];
		iv[1] = b += iv[1];
		iv[0] = a += iv[0];
	} while (--nblocks);
}


/* Compile with -DUNITTEST for self-tests */
#if UNITTEST

#include <stdio.h>

/* Known answer test */
static void katest(char const *text, size_t len, unsigned char const hash[20])
{
	SHA_CTX c;
	unsigned char hash2[20];
	int i;

	SHA1_Init(&c);
	SHA1_Update(&c, text, len);
	SHA1_Final(hash2, &c);

	for (i = 0; i < 20; i++)
		if (hash[i] != hash2[i])
			goto mismatch;
	printf("%u-byte known answer test PASSED\n", len);
	return;

mismatch:
	printf("%u-byte known answer test FAILED:\n", len);
	printf("Input: \"%.*s\"\n", len, text);
	printf("Computed:");
	for (i = 0; i < 20; i++)
		printf(" %02x", hash2[i]);
	printf("\nExpected:");
	for (i = 0; i < 20; i++)
		printf(" %02x", hash[i]);
	putchar('\n');
}

int
main(void)
{
	/* FIPS PUB 180.1 example A.1 */
	static char const text1[3] = "abc";
	static unsigned char const hash1[20] = {
		0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e,
		0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d };

	/* FIPS PUB 180.1 example A.2 */
	static char const text2[56] =
		"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
	static unsigned char const hash2[20] = {
		0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae,
		0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 };

	katest(text1, sizeof text1, hash1);
	katest(text2, sizeof text2, hash2);

	return 0;
}
#endif

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17  7:23 George Spelvin
@ 2009-08-17 14:20 ` Nicolas Pitre
  2009-08-17 17:06 ` Nicolas Pitre
  1 sibling, 0 replies; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 14:20 UTC (permalink / raw)
  To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, Linus Torvalds

On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
> 
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?
> 
> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
> 
> It's also clean with -W -Wall -Wextra.
> 
> TODO: Check if an initial copy to w[] is faster on i386 (less register
> pressure).
> 
> /*
>  * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
>  *
>  * This implementation is in the public domain.  Copyright abandoned.
>  * You may do anything you like with it, including evil things.
>  *
>  * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
>  * from the git mailing list (August, 2009).  Additional optimization
>  * ideas cribbed from
>  * - Artur Skawina (x86, particularly P4, and much benchmarking)
>  * - Nicilas Pitre (ARM)

Please be careful to spell my name correctly.


Nicolas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17  7:23 George Spelvin
  2009-08-17 14:20 ` Nicolas Pitre
@ 2009-08-17 17:06 ` Nicolas Pitre
  2009-08-17 17:20   ` Paolo Bonzini
  2009-08-17 18:54   ` George Spelvin
  1 sibling, 2 replies; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 17:06 UTC (permalink / raw)
  To: George Spelvin; +Cc: bdonlan, johnflux, P, art.08.09, git, torvalds

On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
> 
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?

I don't think this trick of making source code textually different from 
another work while still intimately mimicking the same structure entitles 
you to any copyright (or non copyright) claims over that other work.  I 
certainly wouldn't bet any dime for this standing up in court.  
Otherwise anyone could grab any copyrighted source code and perform a 
bunch of search-and-replace ops on it, and maybe some code reordering 
for good measure, to be able to claim own copyright on it. It is 
probably much safer to simply ask the people involved to agree with your 
relicensing.  And so far I don't see anyone with a stake in this 
fiercely wanting to stick to a particular license.

> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
> 
> It's also clean with -W -Wall -Wextra.

Not if you try with the unaligned put_be32() as the destination pointer 
is marked const.

As to the actual result on ARM... Well, the assembly _looks_ much worse 
than Linus' version.  It uses a stack frame of 152 bytes instead of 64 
bytes.  The resulting binary is also 6868 bytes large compared to 6180 
bytes.  Surprisingly, the performance is not that bad (the reason for 
the underlined "looks" above) albeit still a bit worse, like 5% slower.  
I was expecting much worse than that.

One possible reason for the bad assembly is probably due to the fact 
that gcc is not smart enough to propagate constant address offsets 
across different pointer types.  For example, my first version of 
get_be32() was a macro that did this:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)&data[t]; \
     (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

With such a construct, gcc would always allocate a register to hold __d 
and then dereference that with an offset from 0 to 3.  Whereas:

#define SHA_SRC(t) \
   ({   unsigned char *__d = (unsigned char *)data; \
        (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
        (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

does produce optimal assembly as only the register holding the data 
pointer is dereferenced with the absolute byte offset.  I suspect your 
usage of inline functions has the same effect as the first SHA_SRC 
definition above.

Also, wrt skipping the last 3 write back to the 16 word array...  For 
all the (limited) attempts I've made so far to do that, it always ended 
up making things worse.  I've yet to investigate why though.

Nicolas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 17:06 ` Nicolas Pitre
@ 2009-08-17 17:20   ` Paolo Bonzini
  2009-08-17 18:54   ` George Spelvin
  1 sibling, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2009-08-17 17:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: George Spelvin, bdonlan, johnflux, P, art.08.09, git

> my first version of
> get_be32() was a macro that did this:
>
> #define SHA_SRC(t) \
>    ({ unsigned char *__d = (unsigned char *)&data[t]; \
>       (__d[0]<<  24) | (__d[1]<<  16) | (__d[2]<<  8) | (__d[3]<<  0); })
>
> With such a construct, gcc would always allocate a register to hold __d
> and then dereference that with an offset from 0 to 3.  Whereas:
>
> #define SHA_SRC(t) \
>     ({   unsigned char *__d = (unsigned char *)data; \
>          (__d[(t)*4 + 0]<<  24) | (__d[(t)*4 + 1]<<  16) | \
>          (__d[(t)*4 + 2]<<   8) | (__d[(t)*4 + 3]<<   0); })
>
> does produce optimal assembly as only the register holding the data
> pointer is dereferenced with the absolute byte offset.  I suspect your
> usage of inline functions has the same effect as the first SHA_SRC
> definition above.

Yes, that's what happens.

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 17:06 ` Nicolas Pitre
  2009-08-17 17:20   ` Paolo Bonzini
@ 2009-08-17 18:54   ` George Spelvin
  2009-08-17 19:34     ` Nicolas Pitre
  1 sibling, 1 reply; 28+ messages in thread
From: George Spelvin @ 2009-08-17 18:54 UTC (permalink / raw)
  To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds

> I don't think this trick of making source code textually different from 
> another work while still intimately mimicking the same structure entitles 
> you to any copyright (or non copyright) claims over that other work.  I 
> certainly wouldn't bet any dime for this standing up in court.  

<div type="legal digression">
Actually, I would.  I did a lot more than text search and replace;
I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright)
and then merged in the *ideas* from the mailing list.  

(And from elsewhere; the idea of a five-round macro is from Brian Gladman.)

Remember, to the extent that something is *functional*, it is not
copyrightable; copyright only covers the non-functional expressive bits.
The vast majority of that code is simply required by the standard,
or the desired calling interface.

For a large portion of the rest, remember that standard programming
conventions (e.g.  brace style, macro names IN CAPS, etc.) that's also
non-copyrightable "scene a faire" material.

It's well established that paraphrasing a recipe avoids copyright;
the proportions and treatment of the ingredients is not copyrightable.

For more details, see the extensive coverage of the NEC v. Intel decision
(1989) regarding the firmware for NEC's 8086-clone V20 microprocessor.
It was found non-infringing despite non-clean-room implementation and
substantial similarities.
</div>

> It is probably much safer to simply ask the people involved to agree
> with your relicensing.  And so far I don't see anyone with a stake in
> this fiercely wanting to stick to a particular license.

As for politeness, that's exactly why I did post it and solicit
objections.  The purpose of the rewrite is to avoid having to make
pessimistic assumptions about people who don't respond.

I suppose I should have made that request clearer:
Is there anyone who claims copyright on anything here?
Or would just like credit?
If so, are you willing to donate it to the public domain?

I like the GPL, but some code is generally applicable enough, and unlikely
to be improved on by others enough, that I'd rather just let everyone
use the same version.

> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
> 
> It's also clean with -W -Wall -Wextra.

> Not if you try with the unaligned put_be32() as the destination pointer 
> is marked const.

Oops.  And the "||" in get_be32() doesn't help either.
There was another major bug I found, too.

And I'm still trying to match "linusv" performance.

> As to the actual result on ARM... Well, the assembly _looks_ much worse 
> than Linus' version.  It uses a stack frame of 152 bytes instead of 64 
> bytes.  The resulting binary is also 6868 bytes large compared to 6180 
> bytes.  Surprisingly, the performance is not that bad (the reason for 
> the underlined "looks" above) albeit still a bit worse, like 5% slower.  
> I was expecting much worse than that.

Thanks, I'll work on it.

> Also, wrt skipping the last 3 write back to the 16 word array...  For 
> all the (limited) attempts I've made so far to do that, it always ended 
> up making things worse.  I've yet to investigate why though.

Thanks.  That's certainly surprising!


If anyone cares, my current code (adapted to fit sha1bench) is appended.

/*
 * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
 *
 * This implementation is in the public domain.  Copyright abandoned.
 * You may do anything you like with it, including evil things.
 *
 * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
 * from the git mailing list (August, 2009).  Additional optimization
 * ideas cribbed from
 * - Artur Skawina (x86, particularly P4, and much benchmarking)
 * - Nicolas Pitre (ARM)
 *
 * Contributors to this code are requested to donate their changes to
 * the public domain to avoid version skew.  If you don't want to, be
 * sure to change the copyright waiver above.
 */

#include <string.h>	/* For memcpy() */
#include <arpa/inet.h>	/* For ntohl() */

#include "pdsha1.h"

#define SHA_CTX pd_SHA_CTX
#define SHA_context pd_SHA_context
#define SHA1_Init pd_SHA1_Init
#define SHA1_Update pd_SHA1_Update
#define SHA1_Final pd_SHA1_Final

static void sha1_core(uint32_t iv[5], unsigned char const *p);


void SHA1_Init(struct SHA_context *c)
{
	/* This is a prefix of the SHA_context structure */
	static struct {
		uint64_t len;
		uint32_t iv[5];
	} const iv = { 0,
		{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }
	};

	memcpy(c, &iv, sizeof iv);
}

void SHA1_Update(struct SHA_context *c, void const *p, size_t n)
{
	size_t pos = c->len & 63;	/* Offset into current input block */
	size_t nblocks;

	c->len += n;

	/* Initial partial block (if any) */
	if (pos) {
		size_t space = 63 - pos;
		if (n < space)
			goto end;
		memcpy(c->buf + pos, p, space);
		sha1_core(c->iv, c->buf);
		n -= space;
		p = (char const *)p + space;
	}

	/* The large middle piece */
	nblocks = n >> 6;
	while (nblocks--) {
		sha1_core(c->iv, p);
		p += 64;
	}
	n &= 63;
	pos = 0;
end:
	/* Final partial block (may be zero size) */
	memcpy(c->buf + pos, p, n);
}

/* Machine specific hacks */
#if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \
    defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \
    defined(__s390__) || defined(__s390x__)
/* Unaligned access is okay */
#define GET_BE32(p) ntohl(*(uint32_t const *)(p))
#define PUT_BE32(p,v) (*(uint32_t *)(p) = htonl(v))

#else
/*
 * Unaligned access is not okay; do conversion as byte fetches
 * These macros are IN_CAPS, as they are not side-effect safe
 * The argument p must be an "unsigned char *" or "uint8_t *"
 */
#define GET_BE32(p) \
	((uint32_t)(p)[0]<<24 | (p)[1]<<16 | (p)[2]<<8 | (p)[3])
#define PUT_BE32(p,v) \
	( (p)[0] = (v)>>24, (p)[1] = (v)>>16, (p)[2] = (v)>>8, (p)[3] = (v) )
#endif

void SHA1_Final(unsigned char hash[20], struct SHA_context *c)
{
	size_t pos = c->len & 63;
	unsigned i;

	/* Append a single 1 bit */
	c->buf[pos++] = 0x80;

	/* Append 0 bits until 64 bits remain in a block */
	if (pos > 56) {
		memset(c->buf + pos, 0, 64 - pos);
		sha1_core(c->iv, c->buf);
		pos = 0;
	}
	memset(c->buf + pos, 0, 56 - pos);

	/* Append total input length in bits */
	((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29));
	((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3);

	/* Final hash round */
	sha1_core(c->iv, c->buf);

	/* Copy hash result out */
	for (i = 0; i < 5; i++)
		PUT_BE32(hash + 4*i, c->iv[i]);
}


/*
 * Helper macros for sha1_core function.  To avoid clutter, these macros
 * are NOT fully parenthesized if it doesn't matter to the actual use.
 */

#if __GNUC__ && (defined(__i386__) || defined(__x86_64__))
/*
 * GCC by itself only generates left rotates.  Use right rotates if
 * possible to be kinder to dinky implementations with iterative rotate
 * instructions.
 */
#define ROT(op, x, k) \
	({ uint32_t y; __asm__(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define ROL(x,k) ROT("roll", x, k)
#define ROR(x,k) ROT("rorl", x, k)

#else
/* Generic C equivalent */
#define ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define ROL(x,k) ROT(x,k,32-(k))
#define ROR(x,k) ROT(x,32-(k),k)
#endif


/*
 * An important temporary array in SHA-1 is the working array W[],
 * which holds a scheduled key word per round.  Only the last 16 words
 * are relevant, so only use 16 words...
 */
#define W(i) w[(i) & 15]

/*
 * If you have a small register set, it helps GCC to force stores to
 * the w[] array to memory.  Given 22 or more registers (e.g. PowerPC),
 * GCC can map the entire w[] array to registers and this becomes
 * counterproductive.
 *
 * The optimal kludge seems to differ between x86 and ARM.  On the latter,
 * forcing a full memory barrier is required to stop GCC from splitting
 * live ranges for each round and generating a separate stack slot for
 * each.  (Which produces a huge stack frame and kills performance.)
 */
#if __GNUC__ && defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) W(i) = x; __asm__ volatile("" : "+m" (W(x)))

#elif defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) *(volatile uint32_t *)&W(i) = x

#elif __GNUC__ && defined(__arm__)
#define STORE(i, x) W(i) = x; __asm__("":::"memory")

#else
#define STORE(i, x) W(i) = x
#endif


/* The three round functions.  F2 is also used as F4 */
#define F1(b,c,d) (((d ^ c) & b) ^ d)		/* Bitwise b ? c : d */
#define F2(b,c,d) (d ^ c ^ b)			/* Even parity */
#define F3(b,c,d) (d & c) + ((d ^ c) & b)	/* Majority function */

/* The four round constants */
#define K1 0x5a827999	/* 2^30 * sqrt(2) */
#define K2 0x6ed9eba1	/* 2^30 * sqrt(3) */
#define K3 0x8f1bbcdc	/* 2^30 * sqrt(5) */
#define K4 0xca62c1d6	/* 2^30 * sqrt(10) */

/* Rounds 0..15 fetch a word from the input */
#define FETCH(t,i)	t = GET_BE32(p + 4*(i)); STORE(i,t)
//#define FETCH(t,i)	t = W(i)
/* Rounds 16..79 mix previous words to get a new one */
#define MIX(t,i)	t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1)
/* Rounds 16..76 have to store back the result */
#define CALC(t,i)	MIX(t,i); STORE(i,t)

/* The basic SHA-1 round */
#if 0
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i)	\
	src(t,i);				\
	e += t + f(b,c,d) + k + ROL(a,5);	\
	b = ROR(b,2)
#elif 0
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do {	\
	uint32_t t;				\
	src(t,i);				\
	e += f(b,c,d);				\
	e += t + k;				\
	b = ROR(b,2);				\
	e += ROL(a,5); } while (0)
#else
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do {	\
	uint32_t t = ROL(a,5);			\
	e += f(b,c,d);				\
	e += k + t;				\
	src(t,i);				\
	b = ROR(b,2);				\
	e += t;	} while(0)			
#endif

/* An aligned group of 5 rounds */
#define SHA_ROUND5(f,k,src,t,i) \
	SHA_ROUND(a,b,c,d,e, f,k,src,t,i);	\
	SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1);	\
	SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2);	\
	SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3);	\
	SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4)

/*
 * The core SHA-1 transform.
 *
 * iv[5] = Current SHA-1 hash state.
 * p = Pointer to 64 bytes of source data.  Not necessarily aligned.
 */
static void
sha1_core(uint32_t iv[5], unsigned char const *p)
{
	//uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0];
	uint32_t a = iv[0], b = iv[1], c = iv[2], d = iv[3], e = iv[4];
	uint32_t w[16];

#if 0
	{
		int i;
		for (i = 0; i < 16; i++) {
			STORE(i, get_be32(p + 4*i));
		}
	}
#endif
	SHA_ROUND5(F1, K1, FETCH, t,  0);
	SHA_ROUND5(F1, K1, FETCH, t,  5);
	SHA_ROUND5(F1, K1, FETCH, t, 10);
	SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15);
	SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16);
	SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17);
	SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18);
	SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19);

	SHA_ROUND5(F2, K2, CALC, t, 20);
	SHA_ROUND5(F2, K2, CALC, t, 25);
	SHA_ROUND5(F2, K2, CALC, t, 30);
	SHA_ROUND5(F2, K2, CALC, t, 35);

	SHA_ROUND5(F3, K3, CALC, t, 40);
	SHA_ROUND5(F3, K3, CALC, t, 45);
	SHA_ROUND5(F3, K3, CALC, t, 50);
	SHA_ROUND5(F3, K3, CALC, t, 55);

	SHA_ROUND5(F2, K4, CALC, t, 60);
	SHA_ROUND5(F2, K4, CALC, t, 65);
	SHA_ROUND5(F2, K4, CALC, t, 70);

	SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75);
	SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76);
	/* Last 3 rounds don't need to store mixed value */
	SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77);
	SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78);
	SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79);

	iv[0] += a;
	iv[1] += b;
	iv[2] += c;
	iv[3] += d;
	iv[4] += e;
}


/* Compile with -DUNITTEST for self-tests */
#if UNITTEST

#include <stdio.h>
#include <stdlib.h>

/* Known answer test */
static void katest(char const *text, size_t len, unsigned char const hash[20])
{
	SHA_CTX c;
	unsigned char hash2[20];
	int i;

	SHA1_Init(&c);
	SHA1_Update(&c, text, len);
	SHA1_Final(hash2, &c);

	for (i = 0; i < 20; i++)
		if (hash[i] != hash2[i])
			goto mismatch;
	printf("%u-byte known answer test PASSED\n", len);
	return;

mismatch:
	printf("%u-byte known answer test FAILED:\n", len);
	if (len < 64)
		printf("Input: \"%.*s\"\n", len, text);
	else
		printf("Input: \"%.64s\"...\n", text);
	printf("Computed:");
	for (i = 0; i < 20; i++)
		printf(" %02x", hash2[i]);
	printf("\nExpected:");
	for (i = 0; i < 20; i++)
		printf(" %02x", hash[i]);
	putchar('\n');
}

int
main(void)
{
	/* FIPS PUB 180.1 example A.1 */
	static char const text1[3] = "abc";
	static unsigned char const hash1[20] = {
		0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e,
		0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d };

	/* FIPS PUB 180.1 example A.2 */
	static char const text2[56] =
		"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
	static unsigned char const hash2[20] = {
		0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae,
		0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 };

	/* FIPS PUB 180.1 example A.3 */
	char *text3;
	static unsigned char const hash3[20] = {
		0x34, 0xaa, 0x97, 0x3c, 0xd4, 0xc4, 0xda, 0xa4, 0xf6, 0x1e,
		0xeb, 0x2b, 0xdb, 0xad, 0x27, 0x31, 0x65, 0x34, 0x01, 0x6f };

	katest(text1, sizeof text1, hash1);
	katest(text2, sizeof text2, hash2);
	text3 = malloc(1000000);
	memset(text3, 'a', 1000000);
	katest(text3, 1000000, hash3);
	
	return 0;
}
#endif

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 18:54   ` George Spelvin
@ 2009-08-17 19:34     ` Nicolas Pitre
  2009-08-17 23:12       ` George Spelvin
  0 siblings, 1 reply; 28+ messages in thread
From: Nicolas Pitre @ 2009-08-17 19:34 UTC (permalink / raw)
  To: George Spelvin; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds

On Mon, 17 Aug 2009, George Spelvin wrote:

> > I don't think this trick of making source code textually different from 
> > another work while still intimately mimicking the same structure entitles 
> > you to any copyright (or non copyright) claims over that other work.  I 
> > certainly wouldn't bet any dime for this standing up in court.  
> 
> <div type="legal digression">
> Actually, I would.  I did a lot more than text search and replace;
> I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright)
> and then merged in the *ideas* from the mailing list.  
> 
> (And from elsewhere; the idea of a five-round macro is from Brian Gladman.)
> 
> Remember, to the extent that something is *functional*, it is not
> copyrightable; copyright only covers the non-functional expressive bits.
> The vast majority of that code is simply required by the standard,
> or the desired calling interface.
> 
> For a large portion of the rest, remember that standard programming
> conventions (e.g.  brace style, macro names IN CAPS, etc.) that's also
> non-copyrightable "scene a faire" material.
> 
> It's well established that paraphrasing a recipe avoids copyright;
> the proportions and treatment of the ingredients is not copyrightable.
> 
> For more details, see the extensive coverage of the NEC v. Intel decision
> (1989) regarding the firmware for NEC's 8086-clone V20 microprocessor.
> It was found non-infringing despite non-clean-room implementation and
> substantial similarities.
> </div>

Whatever.  NEC and Intel were certainly commercial competitors.  They 
were far from being friends.  So if you feel like having too many 
friends then just go ahead with that stance.

> As for politeness, that's exactly why I did post it and solicit
> objections.

You said:

|It uses Linus's and Artur's performance ideas, and some of Linus' macro 
|ideas (in the rotate implementation), but tries to be textually 
|different. Is there anything recognizable that anyone cares to clam 
|copyright to?

the "try to be textually different" in order to ask for "anything 
recognizable that anyone cares to clam copyright to" is what I find 
dubious.

> The purpose of the rewrite is to avoid having to make
> pessimistic assumptions about people who don't respond.
> 
> I suppose I should have made that request clearer:
> Is there anyone who claims copyright on anything here?
> Or would just like credit?
> If so, are you willing to donate it to the public domain?

I think this is much nicer to everyone involved.

As far as I'm concerned, I'm OK with giving any small copyright I might 
have in this SHA1 implementation, if any, to the public domain.  
Credits are always nice.


Nicolas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linus' sha1 is much faster!
  2009-08-17 19:34     ` Nicolas Pitre
@ 2009-08-17 23:12       ` George Spelvin
  0 siblings, 0 replies; 28+ messages in thread
From: George Spelvin @ 2009-08-17 23:12 UTC (permalink / raw)
  To: linux, nico; +Cc: art.08.09, bdonlan, git, johnflux, P, torvalds

>> The purpose of the rewrite is to avoid having to make
>> pessimistic assumptions about people who don't respond.
>> 
>> I suppose I should have made that request clearer:
>> Is there anyone who claims copyright on anything here?
>> Or would just like credit?
>> If so, are you willing to donate it to the public domain?

> I think this is much nicer to everyone involved.
>
> As far as I'm concerned, I'm OK with giving any small copyright I might 
> have in this SHA1 implementation, if any, to the public domain.  
> Credits are always nice.

My apologies.  I read a lot of people talking about wanting the code
under different licenses, and thought I'd just cut through it by
providing some PD code.

I didn't turn around and look at it from the point of view of the
people who'd put the work into developing it.  I don't mean to deny
anyone credit for their work.  In fact, providing more detail is on
the to-do list, but I haven't waded through the mail archives and
tracked down who contributed what yet.

I'll work on those polish details once I have it producing the same
assembly code as Linus'.

There are a lot of possible highly-permissive licenses if one is wanted
(zlib, MIT, CC-by), but public domain seems simpler.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-04-20 21:38 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-14 23:25 Linus' sha1 is much faster! Pádraig Brady
2009-08-15 20:02 ` Bryan Donlan
2009-08-15 20:12   ` John Tapsell
2009-08-15 20:23     ` Linus Torvalds
2009-08-15 20:54       ` Linus Torvalds
2009-08-17  1:55         ` Nicolas Pitre
2009-08-26 11:39           ` Pádraig Brady
2017-04-20 21:35             ` galt
2017-04-20 21:38             ` galt
2009-08-17  8:22         ` Andreas Ericsson
2009-08-16  0:06     ` Theodore Tso
2009-08-16 19:25 ` Giuseppe Scrivano
2009-08-16 20:10   ` Linus Torvalds
2009-08-16 22:15     ` Giuseppe Scrivano
2009-08-16 22:47       ` Linus Torvalds
2009-08-17  1:53   ` Pádraig Brady
2009-08-17 10:51     ` Giuseppe Scrivano
2009-08-17 15:44       ` Steven Noonan
2009-08-17 16:22         ` Linus Torvalds
2009-08-17 21:43           ` Steven Noonan
2009-08-17 17:32         ` Giuseppe Scrivano
  -- strict thread matches above, loose matches on Subject: below --
2009-08-17  7:23 George Spelvin
2009-08-17 14:20 ` Nicolas Pitre
2009-08-17 17:06 ` Nicolas Pitre
2009-08-17 17:20   ` Paolo Bonzini
2009-08-17 18:54   ` George Spelvin
2009-08-17 19:34     ` Nicolas Pitre
2009-08-17 23:12       ` George Spelvin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).