From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicolas Pitre
Subject: Re: Linus' sha1 is much faster!
Date: Mon, 17 Aug 2009 13:06:45 -0400 (EDT)
Message-ID:
References: <20090817072315.4314.qmail@science.horizon.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
Cc: bdonlan@gmail.com, johnflux@gmail.com, P@draigBrady.com,
	art.08.09@gmail.com, git@vger.kernel.org,
	torvalds@linux-foundation.org
To: George Spelvin
X-From: git-owner@vger.kernel.org Mon Aug 17 19:07:17 2009
Return-path:
Envelope-to: gcvg-git-2@lo.gmane.org
Received: from vger.kernel.org ([209.132.176.167])
	by lo.gmane.org with esmtp (Exim 4.50) id 1Md5fw-0004RI-CB
	for gcvg-git-2@lo.gmane.org; Mon, 17 Aug 2009 19:07:16 +0200
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757492AbZHQRGv (ORCPT ); Mon, 17 Aug 2009 13:06:51 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1756905AbZHQRGu (ORCPT ); Mon, 17 Aug 2009 13:06:50 -0400
Received: from relais.videotron.ca ([24.201.245.36]:36102 "EHLO
	relais.videotron.ca" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755677AbZHQRGu (ORCPT );
	Mon, 17 Aug 2009 13:06:50 -0400
Received: from xanadu.home ([66.130.28.92]) by VL-MO-MR005.ip.videotron.ca
	(Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug 3 2007; 32bit))
	with ESMTP id <0KOJ00B7E5OMCQJ0@VL-MO-MR005.ip.videotron.ca>
	for git@vger.kernel.org; Mon, 17 Aug 2009 12:55:35 -0400 (EDT)
X-X-Sender: nico@xanadu.home
In-reply-to: <20090817072315.4314.qmail@science.horizon.com>
User-Agent: Alpine 2.00 (LFD 1167 2008-08-23)
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org
Archived-At:

On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
>
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to claim copyright to?

I don't think this trick of making source code textually different from
another work while still intimately mimicking the same structure entitles
you to any copyright (or non copyright) claims over that other work.  I
certainly wouldn't bet a dime on this standing up in court.  Otherwise
anyone could grab any copyrighted source code, perform a bunch of
search-and-replace ops on it, maybe some code reordering for good
measure, and then claim their own copyright on it.

It is probably much safer to simply ask the people involved to agree with
your relicensing.  And so far I don't see anyone with a stake in this
fiercely wanting to stick to a particular license.

> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.

Not if you try with the unaligned put_be32(), as the destination pointer
is marked const.

As to the actual result on ARM...  Well, the assembly _looks_ much worse
than Linus' version.  It uses a stack frame of 152 bytes instead of 64
bytes.  The resulting binary is also 6868 bytes large compared to 6180
bytes.  Surprisingly, the performance is not that bad (the reason for the
underlined "looks" above), albeit still a bit worse, like 5% slower.  I
was expecting much worse than that.

One possible reason for the bad assembly is that gcc is not smart enough
to propagate constant address offsets across different pointer types.
For example, my first version of get_be32() was a macro that did this:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)&data[t]; \
     (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

With such a construct, gcc would always allocate a register to hold __d
and then dereference that with an offset from 0 to 3.  Whereas:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)data; \
     (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
     (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

does produce optimal assembly, as only the register holding the data
pointer is dereferenced with the absolute byte offset.

I suspect your usage of inline functions has the same effect as the first
SHA_SRC definition above.

Also, wrt skipping the last 3 write-backs to the 16-word array...  In all
the (limited) attempts I've made so far to do that, it always ended up
making things worse.  I've yet to investigate why though.


Nicolas
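
For anyone wanting to experiment with the two addressing styles discussed
above, here is a minimal sketch written as plain functions rather than
statement-expression macros.  The names load_be32_per_word(),
load_be32_from_base() and store_be32() are hypothetical and come from
neither implementation in this thread.  Each byte is cast to uint32_t
before shifting, so the result does not depend on promotion to signed int.
The sketch only shows that both forms compute the same value; whether a
given gcc version generates different code for them has to be checked in
the assembly output for the target at hand.

```c
#include <stdint.h>

/* Style 1: form a byte pointer to word t, then index it with 0..3.
 * This corresponds to the first SHA_SRC variant quoted above. */
static uint32_t load_be32_per_word(const uint32_t *data, int t)
{
    const unsigned char *d = (const unsigned char *)&data[t];

    return ((uint32_t)d[0] << 24) | ((uint32_t)d[1] << 16) |
           ((uint32_t)d[2] <<  8) |  (uint32_t)d[3];
}

/* Style 2: keep a single byte pointer to the start of the block and
 * fold the word index into the byte offset (t*4 + k).
 * This corresponds to the second SHA_SRC variant quoted above. */
static uint32_t load_be32_from_base(const uint32_t *data, int t)
{
    const unsigned char *d = (const unsigned char *)data;

    return ((uint32_t)d[t * 4 + 0] << 24) | ((uint32_t)d[t * 4 + 1] << 16) |
           ((uint32_t)d[t * 4 + 2] <<  8) |  (uint32_t)d[t * 4 + 3];
}

/* Byte-wise big-endian store.  The destination is written to, so it is
 * deliberately not const-qualified. */
static void store_be32(unsigned char *dst, uint32_t v)
{
    dst[0] = (unsigned char)(v >> 24);
    dst[1] = (unsigned char)(v >> 16);
    dst[2] = (unsigned char)(v >>  8);
    dst[3] = (unsigned char)(v >>  0);
}
```

Compiling both loaders with something like "gcc -O2 -S" for the target in
question and comparing the generated loads is the simplest way to see
whether the compiler treats them differently.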