From mboxrd@z Thu Jan 1 00:00:00 1970 From: "George Spelvin" Subject: Re: Linus' sha1 is much faster! Date: 17 Aug 2009 14:54:48 -0400 Message-ID: <20090817185448.30254.qmail@science.horizon.com> References: Cc: art.08.09@gmail.com, bdonlan@gmail.com, git@vger.kernel.org, johnflux@gmail.com, P@draigBrady.com, torvalds@linux-foundation.org To: linux@horizon.com, nico@cam.org X-From: git-owner@vger.kernel.org Mon Aug 17 20:55:02 2009 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1Md7MC-0007J0-N0 for gcvg-git-2@lo.gmane.org; Mon, 17 Aug 2009 20:55:01 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751272AbZHQSyt (ORCPT ); Mon, 17 Aug 2009 14:54:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751051AbZHQSyt (ORCPT ); Mon, 17 Aug 2009 14:54:49 -0400 Received: from science.horizon.com ([71.41.210.146]:47083 "HELO science.horizon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1750924AbZHQSys (ORCPT ); Mon, 17 Aug 2009 14:54:48 -0400 Received: (qmail 30255 invoked by uid 1000); 17 Aug 2009 14:54:48 -0400 In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: > I don't think this trick of making source code textually different from > another work while still intimately mimicking the same structure entitles > you to any copyright (or non copyright) claims over that other work. I > certainly wouldn't bet any dime for this standing up in court.
Actually, I would. I did a lot more than text search and replace; I re-implemented it from FIPS 180-2 (work of U.S. government, no copyright) and then merged in the *ideas* from the mailing list. (And from elsewhere; the idea of a five-round macro is from Brian Gladman.) Remember, to the extent that something is *functional*, it is not copyrightable; copyright only covers the non-functional expressive bits. The vast majority of that code is simply required by the standard, or the desired calling interface. For a large portion of the rest, remember that standard programming conventions (e.g. brace style, macro names IN CAPS, etc.) that's also non-copyrightable "scene a faire" material. It's well established that paraphrasing a recipe avoids copyright; the proportions and treatment of the ingredients is not copyrightable. For more details, see the extensive coverage of the NEC v. Intel decision (1989) regarding the firmware for NEC's 8086-clone V20 microprocessor. It was found non-infringing despite non-clean-room implementation and substantial similarities.
> It is probably much safer to simply ask the people involved to agree > with your relicensing. And so far I don't see anyone with a stake in > this fiercely wanting to stick to a particular license. As for politeness, that's exactly why I did post it and solicit objections. The purpose of the rewrite is to avoid having to make pessimistic assumptions about people who don't respond. I suppose I should have made that request clearer: Is there anyone who claims copyright on anything here? Or would just like credit? If so, are you willing to donate it to the public domain? I like the GPL, but some code is generally applicable enough, and unlikely to be improved on by others enough, that I'd rather just let everyone use the same version. > It's not quite 100% finished, as I haven't benchmarked it against Linus's > code yet, but it's functionally correct. > > It's also clean with -W -Wall -Wextra. > Not if you try with the unaligned put_be32() as the destination pointer > is marked const. Oops. And the "||" in get_be32() doesn't help either. There was another major bug I found, too. And I'm still trying to match "linusv" performance. > As to the actual result on ARM... Well, the assembly _looks_ much worse > than Linus' version. It uses a stack frame of 152 bytes instead of 64 > bytes. The resulting binary is also 6868 bytes large compared to 6180 > bytes. Surprisingly, the performance is not that bad (the reason for > the underlined "looks" above) albeit still a bit worse, like 5% slower. > I was expecting much worse than that. Thanks, I'll work on it. > Also, wrt skipping the last 3 write back to the 16 word array... For > all the (limited) attempts I've made so far to do that, it always ended > up making things worse. I've yet to investigate why though. Thanks. That's certainly surprising! If anyone cares, my current code (adapted to fit sha1bench) is appended. /* * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2. * * This implementation is in the public domain. Copyright abandoned. * You may do anything you like with it, including evil things. * * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1" * from the git mailing list (August, 2009). Additional optimization * ideas cribbed from * - Artur Skawina (x86, particularly P4, and much benchmarking) * - Nicolas Pitre (ARM) * * Contributors to this code are requested to donate their changes to * the public domain to avoid version skew. If you don't want to, be * sure to change the copyright waiver above. */ #include /* For memcpy() */ #include /* For ntohl() */ #include "pdsha1.h" #define SHA_CTX pd_SHA_CTX #define SHA_context pd_SHA_context #define SHA1_Init pd_SHA1_Init #define SHA1_Update pd_SHA1_Update #define SHA1_Final pd_SHA1_Final static void sha1_core(uint32_t iv[5], unsigned char const *p); void SHA1_Init(struct SHA_context *c) { /* This is a prefix of the SHA_context structure */ static struct { uint64_t len; uint32_t iv[5]; } const iv = { 0, { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 } }; memcpy(c, &iv, sizeof iv); } void SHA1_Update(struct SHA_context *c, void const *p, size_t n) { size_t pos = c->len & 63; /* Offset into current input block */ size_t nblocks; c->len += n; /* Initial partial block (if any) */ if (pos) { size_t space = 63 - pos; if (n < space) goto end; memcpy(c->buf + pos, p, space); sha1_core(c->iv, c->buf); n -= space; p = (char const *)p + space; } /* The large middle piece */ nblocks = n >> 6; while (nblocks--) { sha1_core(c->iv, p); p += 64; } n &= 63; pos = 0; end: /* Final partial block (may be zero size) */ memcpy(c->buf + pos, p, n); } /* Machine specific hacks */ #if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \ defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \ defined(__s390__) || defined(__s390x__) /* Unaligned access is okay */ #define GET_BE32(p) ntohl(*(uint32_t const *)(p)) #define PUT_BE32(p,v) (*(uint32_t *)(p) = htonl(v)) #else /* * Unaligned access is not okay; do conversion as byte fetches * These macros are IN_CAPS, as they are not side-effect safe * The argument p must be an "unsigned char *" or "uint8_t *" */ #define GET_BE32(p) \ ((uint32_t)(p)[0]<<24 | (p)[1]<<16 | (p)[2]<<8 | (p)[3]) #define PUT_BE32(p,v) \ ( (p)[0] = (v)>>24, (p)[1] = (v)>>16, (p)[2] = (v)>>8, (p)[3] = (v) ) #endif void SHA1_Final(unsigned char hash[20], struct SHA_context *c) { size_t pos = c->len & 63; unsigned i; /* Append a single 1 bit */ c->buf[pos++] = 0x80; /* Append 0 bits until 64 bits remain in a block */ if (pos > 56) { memset(c->buf + pos, 0, 64 - pos); sha1_core(c->iv, c->buf); pos = 0; } memset(c->buf + pos, 0, 56 - pos); /* Append total input length in bits */ ((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29)); ((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3); /* Final hash round */ sha1_core(c->iv, c->buf); /* Copy hash result out */ for (i = 0; i < 5; i++) PUT_BE32(hash + 4*i, c->iv[i]); } /* * Helper macros for sha1_core function. To avoid clutter, these macros * are NOT fully parenthesized if it doesn't matter to the actual use. */ #if __GNUC__ && (defined(__i386__) || defined(__x86_64__)) /* * GCC by itself only generates left rotates. Use right rotates if * possible to be kinder to dinky implementations with iterative rotate * instructions. */ #define ROT(op, x, k) \ ({ uint32_t y; __asm__(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; }) #define ROL(x,k) ROT("roll", x, k) #define ROR(x,k) ROT("rorl", x, k) #else /* Generic C equivalent */ #define ROT(x,l,r) ((x) << (l) | (x) >> (r)) #define ROL(x,k) ROT(x,k,32-(k)) #define ROR(x,k) ROT(x,32-(k),k) #endif /* * An important temporary array in SHA-1 is the working array W[], * which holds a scheduled key word per round. Only the last 16 words * are relevant, so only use 16 words... */ #define W(i) w[(i) & 15] /* * If you have a small register set, it helps GCC to force stores to * the w[] array to memory. Given 22 or more registers (e.g. PowerPC), * GCC can map the entire w[] array to registers and this becomes * counterproductive. * * The optimal kludge seems to differ between x86 and ARM. On the latter, * forcing a full memory barrier is required to stop GCC from splitting * live ranges for each round and generating a separate stack slot for * each. (Which produces a huge stack frame and kills performance.) */ #if __GNUC__ && defined(__i386__) || defined(__x86_64__) #define STORE(i, x) W(i) = x; __asm__ volatile("" : "+m" (W(x))) #elif defined(__i386__) || defined(__x86_64__) #define STORE(i, x) *(volatile uint32_t *)&W(i) = x #elif __GNUC__ && defined(__arm__) #define STORE(i, x) W(i) = x; __asm__("":::"memory") #else #define STORE(i, x) W(i) = x #endif /* The three round functions. F2 is also used as F4 */ #define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */ #define F2(b,c,d) (d ^ c ^ b) /* Even parity */ #define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */ /* The four round constants */ #define K1 0x5a827999 /* 2^30 * sqrt(2) */ #define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */ #define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */ #define K4 0xca62c1d6 /* 2^30 * sqrt(10) */ /* Rounds 0..15 fetch a word from the input */ #define FETCH(t,i) t = GET_BE32(p + 4*(i)); STORE(i,t) //#define FETCH(t,i) t = W(i) /* Rounds 16..79 mix previous words to get a new one */ #define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1) /* Rounds 16..76 have to store back the result */ #define CALC(t,i) MIX(t,i); STORE(i,t) /* The basic SHA-1 round */ #if 0 #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \ src(t,i); \ e += t + f(b,c,d) + k + ROL(a,5); \ b = ROR(b,2) #elif 0 #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \ uint32_t t; \ src(t,i); \ e += f(b,c,d); \ e += t + k; \ b = ROR(b,2); \ e += ROL(a,5); } while (0) #else #define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) do { \ uint32_t t = ROL(a,5); \ e += f(b,c,d); \ e += k + t; \ src(t,i); \ b = ROR(b,2); \ e += t; } while(0) #endif /* An aligned group of 5 rounds */ #define SHA_ROUND5(f,k,src,t,i) \ SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \ SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \ SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \ SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \ SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4) /* * The core SHA-1 transform. * * iv[5] = Current SHA-1 hash state. * p = Pointer to 64 bytes of source data. Not necessarily aligned. */ static void sha1_core(uint32_t iv[5], unsigned char const *p) { //uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0]; uint32_t a = iv[0], b = iv[1], c = iv[2], d = iv[3], e = iv[4]; uint32_t w[16]; #if 0 { int i; for (i = 0; i < 16; i++) { STORE(i, get_be32(p + 4*i)); } } #endif SHA_ROUND5(F1, K1, FETCH, t, 0); SHA_ROUND5(F1, K1, FETCH, t, 5); SHA_ROUND5(F1, K1, FETCH, t, 10); SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15); SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16); SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17); SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18); SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19); SHA_ROUND5(F2, K2, CALC, t, 20); SHA_ROUND5(F2, K2, CALC, t, 25); SHA_ROUND5(F2, K2, CALC, t, 30); SHA_ROUND5(F2, K2, CALC, t, 35); SHA_ROUND5(F3, K3, CALC, t, 40); SHA_ROUND5(F3, K3, CALC, t, 45); SHA_ROUND5(F3, K3, CALC, t, 50); SHA_ROUND5(F3, K3, CALC, t, 55); SHA_ROUND5(F2, K4, CALC, t, 60); SHA_ROUND5(F2, K4, CALC, t, 65); SHA_ROUND5(F2, K4, CALC, t, 70); SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75); SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76); /* Last 3 rounds don't need to store mixed value */ SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77); SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78); SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79); iv[0] += a; iv[1] += b; iv[2] += c; iv[3] += d; iv[4] += e; } /* Compile with -DUNITTEST for self-tests */ #if UNITTEST #include #include /* Known answer test */ static void katest(char const *text, size_t len, unsigned char const hash[20]) { SHA_CTX c; unsigned char hash2[20]; int i; SHA1_Init(&c); SHA1_Update(&c, text, len); SHA1_Final(hash2, &c); for (i = 0; i < 20; i++) if (hash[i] != hash2[i]) goto mismatch; printf("%u-byte known answer test PASSED\n", len); return; mismatch: printf("%u-byte known answer test FAILED:\n", len); if (len < 64) printf("Input: \"%.*s\"\n", len, text); else printf("Input: \"%.64s\"...\n", text); printf("Computed:"); for (i = 0; i < 20; i++) printf(" %02x", hash2[i]); printf("\nExpected:"); for (i = 0; i < 20; i++) printf(" %02x", hash[i]); putchar('\n'); } int main(void) { /* FIPS PUB 180.1 example A.1 */ static char const text1[3] = "abc"; static unsigned char const hash1[20] = { 0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e, 0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d }; /* FIPS PUB 180.1 example A.2 */ static char const text2[56] = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"; static unsigned char const hash2[20] = { 0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae, 0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 }; /* FIPS PUB 180.1 example A.3 */ char *text3; static unsigned char const hash3[20] = { 0x34, 0xaa, 0x97, 0x3c, 0xd4, 0xc4, 0xda, 0xa4, 0xf6, 0x1e, 0xeb, 0x2b, 0xdb, 0xad, 0x27, 0x31, 0x65, 0x34, 0x01, 0x6f }; katest(text1, sizeof text1, hash1); katest(text2, sizeof text2, hash2); text3 = malloc(1000000); memset(text3, 'a', 1000000); katest(text3, 1000000, hash3); return 0; } #endif