From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Kastrup
Subject: Re: [PATCH] diff-delta.c: pack the index structure
Date: Sat, 08 Sep 2007 22:50:20 +0200
Message-ID: <85hcm4ubjn.fsf@lola.goethe.zz>
References: <85fy1q11xv.fsf@lola.goethe.zz> <854pi5zh8q.fsf@lola.goethe.zz>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: git@vger.kernel.org
To: Nicolas Pitre
In-Reply-To: <854pi5zh8q.fsf@lola.goethe.zz> (David Kastrup's message of "Sat, 08 Sep 2007 10:36:05 +0200")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1.50 (gnu/linux)
Sender: git-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

--=-=-=

David Kastrup writes:

> Another option avoiding the realloc would be not to use linked lists
> at all, but just to collect all entries sequentially in "packed" form
> and sort them into order afterwards.  That's an O(n lg n) operation
> with a large n.  Even if one has a sorting algorithm with good memory
> locality, I doubt that the locality would compensate for the lg n
> factor, even when taking into account that we save ourselves the
> copying.  And that does not even take into account the possibility of
> having to cull some buckets, another O(n) operation (which amounts to
> copying everything once, too).

As an illustration, I have implemented a proof of concept.  This is a
patch (not formatted for submission) against the current master.  It
takes about a 20% performance hit on the whole git-pack-objects run, so
the diffing code itself is likely affected worse.

For the proof of concept I did not implement the hash bucket collision
culling, nor did I realloc to the possibly reduced size.  So a fairer
comparison would likely involve removing the culling code from master
(or implementing the culling in the proof of concept, but I am too lazy
for that).

I hope this addresses the not-in-place complaint.  It is conceivable
that under tight memory conditions this approach might actually work
out better.  However, the glibc qsort implementation (at least at one
time) reverts to a mergesort for large inputs, and allocates a
temporary buffer of the same size as the original data.  If that is
still the case, the memory impact would actually be worse here.  One
would need to implement a custom _real_ in-place sorter to get around
this.

All in all, I don't think this approach is worth the trouble: the
sorting overhead factor of lg n is just too much.
--=-=-=
Content-Type: text/x-patch
Content-Disposition: attachment

diff --git a/diff-delta.c b/diff-delta.c
index 0dde2f2..f6e2416 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -115,7 +115,6 @@ static const unsigned int U[256] = {
 struct index_entry {
 	const unsigned char *ptr;
 	unsigned int val;
-	struct index_entry *next;
 };
 
 struct delta_index {
@@ -126,12 +125,24 @@ struct delta_index {
 	struct index_entry *hash[FLEX_ARRAY];
 };
 
+static unsigned int cmp_hmask;
+
+static int cmp_index_entry(const void *va, const void *vb)
+{
+	const struct index_entry *a = va, *b = vb;
+	if ((a->val & cmp_hmask) != (b->val & cmp_hmask))
+		return (a->val & cmp_hmask) < (b->val & cmp_hmask) ? -1 : 1;
+	if (a->val != b->val)
+		return a->val < b->val ? -1 : 1;
+	return a->ptr < b->ptr ? -1 : 1;
+}
+
 struct delta_index * create_delta_index(const void *buf, unsigned long bufsize)
 {
-	unsigned int i, hsize, hmask, entries, prev_val, *hash_count;
+	unsigned int i, hsize, hmask, entries, prev_val, ihash;
 	const unsigned char *data, *buffer = buf;
 	struct delta_index *index;
-	struct index_entry *entry, **hash;
+	struct index_entry *entry, **hash, *entry_base;
 	void *mem;
 	unsigned long memsize;
 
@@ -149,7 +160,7 @@ struct delta_index * create_delta_index(const void *buf, unsigned long bufsize)
 	/* allocate lookup index */
 	memsize = sizeof(*index) +
-		  sizeof(*hash) * hsize +
+		  sizeof(*hash) * (hsize + 1) +
 		  sizeof(*entry) * entries;
 	mem = malloc(memsize);
 	if (!mem)
@@ -157,21 +168,14 @@ struct delta_index * create_delta_index(const void *buf, unsigned long bufsize)
 	index = mem;
 	mem = index + 1;
 	hash = mem;
-	mem = hash + hsize;
-	entry = mem;
+	mem = hash + hsize + 1;
+	entry_base = mem;
+	entry = entry_base;
 	index->memsize = memsize;
 	index->src_buf = buf;
 	index->src_size = bufsize;
 	index->hash_mask = hmask;
-	memset(hash, 0, hsize * sizeof(*hash));
-
-	/* allocate an array to count hash entries */
-	hash_count = calloc(hsize, sizeof(*hash_count));
-	if (!hash_count) {
-		free(index);
-		return NULL;
-	}
 
 	/* then populate the index */
 	prev_val = ~0;
@@ -184,29 +188,33 @@ struct delta_index * create_delta_index(const void *buf, unsigned long bufsize)
 		if (val == prev_val) {
 			/* keep the lowest of consecutive identical blocks */
 			entry[-1].ptr = data + RABIN_WINDOW;
+			--entries;
 		} else {
 			prev_val = val;
 			i = val & hmask;
 			entry->ptr = data + RABIN_WINDOW;
 			entry->val = val;
-			entry->next = hash[i];
-			hash[i] = entry++;
-			hash_count[i]++;
+			entry++;
 		}
 	}
+	cmp_hmask = hmask;
+	qsort(entry_base, entries, sizeof(struct index_entry), cmp_index_entry);
+	ihash = 0;
+
+	for (i = 0; i < entries; i++) {
+		entry = entry_base + i;
+		if ((entry->val & hmask) >= ihash) {
+			/* cull here */
+			do {
+				hash[ihash++] = entry;
+			} while ((entry->val & hmask) >= ihash);
+		}
+	}
+	while (ihash <= hsize) {
+		hash[ihash++] = &entry_base[entries];
+	}
 
-	/*
-	 * Determine a limit on the number of entries in the same hash
-	 * bucket.  This guards us against pathological data sets causing
-	 * really bad hash distribution with most entries in the same hash
-	 * bucket that would bring us to O(m*n) computing costs (m and n
-	 * corresponding to reference and target buffer sizes).
-	 *
-	 * Make sure none of the hash buckets has more entries than
-	 * we're willing to test.  Otherwise we cull the entry list
-	 * uniformly to still preserve a good repartition across
-	 * the reference buffer.
-	 */
+#if 0
 	for (i = 0; i < hsize; i++) {
 		if (hash_count[i] < HASH_LIMIT)
 			continue;
@@ -221,6 +229,7 @@ struct delta_index * create_delta_index(const void *buf, unsigned long bufsize)
 		} while(entry);
 	}
 	free(hash_count);
+#endif
 
 	return index;
 }
@@ -302,7 +311,7 @@ create_delta(const struct delta_index *index,
 		val ^= U[data[-RABIN_WINDOW]];
 		val = ((val << 8) | *data) ^ T[val >> RABIN_SHIFT];
 		i = val & index->hash_mask;
-		for (entry = index->hash[i]; entry; entry = entry->next) {
+		for (entry = index->hash[i]; entry < index->hash[i+1]; entry++) {
 			const unsigned char *ref = entry->ptr;
 			const unsigned char *src = data;
 			unsigned int ref_size = ref_top - ref;

--=-=-=

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

--=-=-=--