From mboxrd@z Thu Jan  1 00:00:00 1970
From: Geert Bosch
Subject: PATCH: New diff-delta.c implementation (updated)
Date: Thu, 27 Apr 2006 21:59:53 -0400 (EDT)
To: git@vger.kernel.org

Even though the previous version did really well on large files with many
changes, performance was lacking for the many small files with very few
changes that are so common in a VCS.

For example, it turns out that packing the 17005 objects in my git.git
repository has diff_delta processing about 240 MB worth of target data in
roughly 12s on my PowerBook. (There is even a little more source data, and
the 12s includes compression/decompression time.) So the fancy fingerprint
calculations really take too much time.

Fortunately, it turns out that of those 240 MB, 120 MB match directly at the
start or the end of the source data. After this trivial matching, most
remaining matches are quite small. The overhead of setting up buffers,
computing the longest runs of identical characters and computing 64-bit
fingerprints becomes very noticeable and cannot be regained later.

As a result, I implemented special indexing and matching routines for "small"
files. These use a fixed hash table size and a fixed index step. The
fingerprint window has been reduced to be equal to the step size, which
essentially eliminates the computation for characters leaving the window.
Finally, the fingerprint size has been reduced to 32 bits, using a polynomial
of degree 31.

The result has been only a slight increase in delta size for very large test
cases (but with better performance), and both smaller deltas and faster
execution for repacking git.git. I had trouble cloning the Linux kernel
repository, but I am now reasonably confident this will outperform the
existing algorithm pretty consistently. On PPC, the trivial matching in head
and tail, and for long matching runs, now shows up high in the profile. On
x86, byte operations are very fast, so I think things should be at least
equally good there.

Please play around with this and let me know of any results.
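To make the small-file scheme concrete, here is a minimal sketch of the
indexing idea: compute a 32-bit fingerprint over each fixed-size block of the
source and drop the block's offset into a small, fixed-size hash table. The
identifiers (index_blocks, STEP, HTAB_SIZE) and the multiplicative hash are
placeholders standing in for the table-driven Rabin reduction used in the
patch below; only the structure is meant to match.

#include <stddef.h>

#define STEP      16	/* fixed index step, as in the small-file path */
#define HTAB_SIZE 8192	/* fixed, power-of-two hash table size */

/* Index every STEP-byte block of 'src' by a 32-bit hash of the whole block.
   Because the "window" equals the step, no byte ever leaves the window, so
   there is no per-character sliding-out term to compute. */
static void index_blocks(const unsigned char *src, size_t len,
			 unsigned short htab[HTAB_SIZE])
{
	size_t j, k;

	for (j = 0; j + STEP <= len; j += STEP) {
		unsigned fp = 0;
		for (k = 0; k < STEP; k++)
			fp = fp * 0x9e3779b1u + src[j + k];	/* hash one block */
		/* Store the block offset, scaled so 16 bits reach past 64K */
		htab[fp & (HTAB_SIZE - 1)] = (unsigned short)(j >> 4);
	}
}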
-Geert

Signed-off-by: Geert Bosch

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include "delta.h"

#undef assert
#define assert(x) do { } while (0)

/*
 * MIN_HTAB_SIZE is a fixed amount to be added to the size of the hash table
 * used for indexing and must be a power of two. This allows for small files
 * to have a sparse hash table, since in that case it's cheap.
 * Hash table sizes are rounded up to a power of two to avoid integer division.
 */
#define MIN_HTAB_SIZE 8192
#define MAX_HTAB_SIZE (1024*1024*1024)

#define SMALL_HTAB_SIZE  8192
#define SMALL_INDEX_STEP 16

/*
 * Diffing files in the gigabyte range is impractical with the current
 * algorithm, so we're assuming 32-bit sizes everywhere.
 * Size leaves some room for expansion when diffing random files.
 */
#define MAX_SIZE (0x7eff0000)

/*
 * For small files, indices are represented in 16 bits.
 * Since indices are always a multiple of the index_step, they
 * can be shifted right a few bits to accommodate files larger than 64K.
 */
#define SMALL_SHIFT 4
#define MAX_SMALL_SIZE (0xff00 << SMALL_SHIFT)

#ifndef MIN
#define MIN(x,y) ((x)<(y) ? (x) : (y))
#endif
#ifndef MAX
#define MAX(x,y) ((y)>(x) ? (y) : (x))
#endif

/*
 * The copies array is the central data structure for diff generation.
 * Data statements are implicit, for ranges not covered by any copy command.
 *
 * The sum of tgt and length for each entry must be monotonically increasing,
 * and data ranges must be non-overlapping. This is accomplished by not
 * extending matches backwards during initial matching.
 *
 * Copies may have zero length, to make it quick to delete copies during
 * optimization. However, the last copy in the list must always be a
 * non-trivial copy.
 *
 * Before committing copies, an important optimization is performed: during
 * a backward pass through the copies array, each entry is extended backwards,
 * and redundant copies are eliminated.
 *
 * If each match were extended backwards on insertion, the same data could be
 * matched an arbitrary number of times, resulting in potentially quadratic
 * time behavior.
 */
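/* For instance, {src 100, tgt 0, length 20} followed by {src 400, tgt 50,
   length 30} is a valid copies array: tgt + length grows from 20 to 80, the
   target ranges 0..19 and 50..79 do not overlap, and the gap 20..49 is
   emitted as an implicit data statement. */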
typedef struct copyinfo {
	unsigned src;
	unsigned tgt;
	unsigned length;
} CopyInfo;

static CopyInfo *copies;
static int copy_count = 0;
static unsigned max_copies = 0;	/* Dynamically increased */

static unsigned *idx;
static unsigned idx_size;
static unsigned char *idx_data;
static unsigned idx_data_len;

typedef unsigned poly_t;

static void rabin_reset(void)
{
	memset(rabin_window, 0, sizeof(rabin_window));
}

static poly_t rabin_slide(poly_t fp, unsigned char m)
{
	unsigned char om;

	if (++rabin_pos == RABIN_WINDOW_SIZE)
		rabin_pos = 0;
	om = rabin_window[rabin_pos];
	fp ^= U[om];
	rabin_window[rabin_pos] = m;
	fp = ((fp << 8) | m) ^ T[fp >> RABIN_SHIFT];
	return fp;
}

static int add_copy(unsigned src, unsigned tgt, unsigned length)
{
	if (copy_count == max_copies) {
		max_copies *= 2;
		if (!max_copies) {
			max_copies = MAX_COPIES;
			copies = malloc(max_copies * sizeof(CopyInfo));
		} else
			copies = realloc(copies, max_copies * sizeof(CopyInfo));
		if (!copies)
			return 0;
	}
	copies[copy_count].src = src;
	copies[copy_count].tgt = tgt;
	copies[copy_count].length = length;
	return ++copy_count;
}

static unsigned maxofs[256];
static unsigned maxlen[256];
static unsigned maxfp[256];

static const unsigned small_idx_size = SMALL_HTAB_SIZE;
static unsigned short small_idx[SMALL_HTAB_SIZE];

static void small_init_idx(unsigned char *data, unsigned len,
			   unsigned head, unsigned tail)
{
	const unsigned index_step = SMALL_INDEX_STEP;
	unsigned j = head - head % index_step;

	if (len < index_step)
		return;
	idx_data = data;
	idx_data_len = len;
	len -= MIN(len, tail + (index_step - 1));
	memset(small_idx, 0, sizeof(small_idx));
	while (j < len) {
		poly_t fp = 0;
		do
			fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
		while (j % index_step);
		small_idx[fp % small_idx_size] = j >> SMALL_SHIFT;
	}
}

static void init_idx(unsigned char *data, unsigned len, int level,
		     unsigned head, unsigned tail)
{
	unsigned index_step = RABIN_WINDOW_SIZE / sizeof(unsigned) * sizeof(unsigned);
	unsigned j, k;
	unsigned char ch = 0;
	unsigned runlen = 0;
	poly_t fp = 0;

	/* Special case small files at low optimization levels */
	if (level <= 1 && len < MAX_SMALL_SIZE
	    && len - head - tail < (SMALL_HTAB_SIZE * SMALL_INDEX_STEP)) {
		small_init_idx(data, len, head, tail);
		return;
	}

	assert(len <= MAX_SIZE);
	assert(head < len);
	assert(level >= 0 && level <= 9);

	memset(maxofs, 0, sizeof(maxofs));
	memset(maxlen, 0, sizeof(maxlen));
	memset(maxfp, 0, sizeof(maxfp));

	/* Smaller step size for higher optimization levels.
	   The index_step must be a multiple of the word size. */
	if (level >= 1)
		index_step = MIN(index_step, 4 * sizeof(unsigned));
	if (level >= 3)
		index_step = MIN(index_step, 3 * sizeof(unsigned));
	if (level >= 4)
		index_step = MIN(index_step, 2 * sizeof(unsigned));
	if (level >= 6)
		index_step = MIN(index_step, 1 * sizeof(unsigned));
	assert(index_step && !(index_step % sizeof(unsigned)));

	/* Add a fixed amount to the hash table size, as small files will
	   benefit a lot without using significantly more memory or time. */
	idx_size = (level + 1) * ((len - head - tail) / index_step) / 2;
	idx_size = MIN(idx_size + MIN_HTAB_SIZE, MAX_HTAB_SIZE - 1);

	/* Round up to the next power of two, but limit to MAX_HTAB_SIZE. */
	{
		unsigned s = MIN_HTAB_SIZE;
		while (s < idx_size)
			s += s;
		idx_size = s;
	}

	idx_data = data;
	idx_data_len = len;
	idx = calloc(idx_size, sizeof(unsigned));

	/* It is tempting to first index higher addresses, so hashes of lower
	   addresses will get preference in the hash table.
	   However, for repetitive patterns with a period that is a divisor
	   of the fingerprint window, this may mean the match is not anchored
	   at the end. Furthermore, even when using a window length that is
	   prime, the benefits are small, and the irregularity of the first
	   matches being more important is not worth it. */
	rabin_reset();
	ch = 0;
	runlen = 0;

	if (head < RABIN_WINDOW_SIZE + index_step)
		head = 0;
	else {
		head -= head % index_step;
		for (j = head - RABIN_WINDOW_SIZE + 1; j < head; j++)
			fp = rabin_slide(fp, data[j]);
	}

	for (j = head; j + index_step < len - tail; j += index_step) {
		unsigned char pch = 0;
		unsigned hash;

		for (k = 0; k < index_step; k++) {
			pch = ch;
			ch = data[j + k];
			if (ch != pch)
				runlen = 0;
			runlen++;
			fp = rabin_slide(fp, ch);
		}

		/* See if there is a word-aligned window-sized run of equal
		   characters */
		if (runlen >= RABIN_WINDOW_SIZE + sizeof(unsigned) - 1) {
			/* Skip ahead to the end of the run */
			while (j + k < len && data[j + k] == ch) {
				k++;
				runlen++;
			}

			/* Although matches are usually anchored at the end,
			   in the case of extended runs of equal characters it
			   is better to anchor after the first
			   RABIN_WINDOW_SIZE bytes. This allows for a quick
			   skip ahead while matching such runs, avoiding
			   unneeded fingerprint calculations. Also, when
			   anchoring at the end, matches will be generated
			   after every word, because the fingerprint stays
			   constant. Even though all matches would get
			   combined during match optimization, it wastes time
			   and space. */
			if (runlen > maxlen[pch] + 4) {
				unsigned ofs;

				/* ofs points RABIN_WINDOW_SIZE bytes after
				   the start of the run, rounded up to the
				   next word */
				ofs = j + k - runlen + RABIN_WINDOW_SIZE
					+ (sizeof(unsigned) - 1);
				ofs -= ofs % sizeof(unsigned);
				maxofs[pch] = ofs;
				maxlen[pch] = runlen;
				assert(maxfp[pch] == 0 || maxfp[pch] == (unsigned)fp);
				maxfp[pch] = (unsigned)fp;
			}

			/* Keep input aligned as if no special run processing
			   had taken place */
			j += k - (k % index_step) - index_step;
			k = index_step;
		}

		/* Testing showed that avoiding collisions using secondary
		   hashing or hash chaining had little effect and is not
		   worth the time. */
		hash = ((unsigned)fp) & (idx_size - 1);
		idx[hash] = j + k;
	}

	/* Lastly, index the longest runs of equal characters found before.
	   This ensures we always match the longest such runs available. */
	for (j = 0; j < 256; j++)
		if (maxlen[j])
			idx[maxfp[j] % idx_size] = maxofs[j];
}

/* Match data against the current index and record all possible copies */
static int small_find_copies(unsigned char *data, unsigned len, unsigned head)
{
	unsigned j = head < RABIN_WINDOW_SIZE ? 0 : head - RABIN_WINDOW_SIZE;
	poly_t fp = 0;

	while (j < MAX(head, RABIN_WINDOW_SIZE) && j < len)
		fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];

	while (j < len) {
		unsigned ofs, src, tgt, runlen, maxrun;

		fp ^= U[data[j - RABIN_WINDOW_SIZE]];
		fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
		ofs = small_idx[fp & (small_idx_size - 1)] << SMALL_SHIFT;

		/* Invariant: data[0] .. data[j-1] has been processed;
		   fp is the fingerprint of the sliding window ending at j-1;
		   ofs is zero or points just past a tentative match;
		   ofs is a multiple of index_step. */
		if (!ofs)
			continue;
		runlen = 0;
		tgt = j - 4;
		src = ofs - 4;
		maxrun = MIN(idx_data_len - src, len - tgt);

		/* Hot loop */
		while (runlen < maxrun && data[tgt + runlen] == idx_data[src + runlen])
			runlen++;
		if (runlen < 4)
			continue;
		if (!add_copy(src, tgt, runlen))
			return 0;

		/* For runs extending more than RABIN_WINDOW_SIZE bytes past j,
		   skip ahead to prevent useless fingerprint computations. */
		if (tgt + runlen > j + RABIN_WINDOW_SIZE) {
			fp = 0;
			j = tgt + runlen - RABIN_WINDOW_SIZE;
			while (j < tgt + runlen)
				fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
		}

		/* Quickly scan ahead without looking for matches until the
		   end of this run */
		while (j < tgt + runlen) {
			fp ^= U[data[j - RABIN_WINDOW_SIZE]];
			fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
		}
	}
	return 1;
}

/* Match data against the current index and record all possible copies */
static int find_copies(unsigned char *data, unsigned len, unsigned head)
{
	unsigned j = head < RABIN_WINDOW_SIZE ? 0 : head - RABIN_WINDOW_SIZE;
	poly_t fp = 0;

	assert(idx_data);
	if (!idx)
		return small_find_copies(data, len, head);

	rabin_reset();
	while (j < head + RABIN_WINDOW_SIZE && j < len)
		fp = rabin_slide(fp, data[j++]);

	while (j < len) {
		unsigned ofs, src, tgt, runlen, maxrun;

		fp = rabin_slide(fp, data[j++]);
		ofs = idx[fp & (idx_size - 1)];

		/* Invariant: data[0] .. data[j-1] has been processed;
		   fp is the fingerprint of the sliding window ending at j-1;
		   ofs is zero or points just past a tentative match;
		   ofs is a multiple of index_step. */
		if (!ofs)
			continue;
		runlen = 0;
		tgt = j - 4;
		src = ofs - 4;
		maxrun = MIN(idx_data_len - src, len - tgt);

		/* Hot loop */
		while (runlen < maxrun && data[tgt + runlen] == idx_data[src + runlen])
			runlen++;
		if (runlen < 4)
			continue;
		if (!add_copy(src, tgt, runlen))
			return 0;

		/* For runs extending more than RABIN_WINDOW_SIZE bytes past j,
		   skip ahead to prevent useless fingerprint computations. */
		if (tgt + runlen > j + RABIN_WINDOW_SIZE)
			j = tgt + runlen - RABIN_WINDOW_SIZE;

		/* Quickly scan ahead without looking for matches until the
		   end of this run */
		while (j < tgt + runlen)
			fp = rabin_slide(fp, data[j++]);
	}
	return 1;
}

static unsigned header_length(unsigned srclen, unsigned tgtlen)
{
	unsigned len = 0;

	assert(srclen <= MAX_SIZE && tgtlen <= MAX_SIZE);

	/* GIT headers start with the length of the source and the target,
	   with 7 bits per byte, least significant byte first, and the high
	   bit indicating continuation. */
	do {
		len++;
		srclen >>= 7;
	} while (srclen);
	do {
		len++;
		tgtlen >>= 7;
	} while (tgtlen);
	return len;
}

static unsigned char *write_header(unsigned char *patch, unsigned srclen, unsigned tgtlen)
{
	assert(srclen <= MAX_SIZE && tgtlen <= MAX_SIZE);
	while (srclen >= 0x80) {
		*patch++ = srclen | 0x80;
		srclen >>= 7;
	}
	*patch++ = srclen;
	while (tgtlen >= 0x80) {
		*patch++ = tgtlen | 0x80;
		tgtlen >>= 7;
	}
	*patch++ = tgtlen;
	return patch;
}

static unsigned data_length(unsigned length)
{
	/* Can only include 0x7f data bytes per command */
	unsigned partial = length % 0x7f;

	assert(length > 0 && length <= MAX_SIZE);
	if (partial)
		partial++;
	return partial + (length / 0x7f) * 0x80;
}

static unsigned char *write_data(unsigned char *patch, unsigned char *data, unsigned size)
{
	assert(size > 0 && size < MAX_SIZE);

	/* The return value must be equal to patch + data_length(size).
	   This correspondence is essential for calculating the patch size. */

	/* GIT has no data commands for large data; the rest is the same as
	   GDIFF */
	do {
		unsigned s = size;
		if (s > 0x7f)
			s = 0x7f;
		*patch++ = s;
		memcpy(patch, data, s);
		data += s;
		patch += s;
		size -= s;
	} while (size);
	return patch;
}

static unsigned copy_length(unsigned offset, unsigned length)
{
	unsigned size = 0;

	assert(offset < MAX_SIZE && length < MAX_SIZE);

	/* For now we only copy a maximum of 0x10000 bytes per command.
	   Longer copies are broken into pieces of that size. */
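	/* For example, a copy of 0xc8 bytes from offset 0x1234 costs four
	   bytes: the command byte 0x93 (0x80 | 0x01 | 0x02 | 0x10), followed
	   by the offset bytes 0x34 and 0x12 and the size byte 0xc8. */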
	do {
		signed s = length;
		if (s > 0x10000)
			s = 0x10000;
		size += !!(s & 0xff) + !!(s & 0xff00);
		size += !!(offset & 0xff) + !!(offset & 0xff00)
			+ !!(offset & 0xff0000) + !!(offset & 0xff000000);
		size += 1;
		offset += s;
		length -= s;
	} while (length);
	return size;
}

static unsigned char *write_copy(unsigned char *patch, unsigned offset, unsigned size)
{
	/* The return value must be equal to patch + copy_length(offset, size).
	   This correspondence is essential for calculating the patch size. */
	do {
		unsigned char c = 0x80, *cmd = patch++;
		unsigned v, s = size;

		if (s > 0x10000)
			s = 0x10000;
		v = offset;
		if (v & 0xff) c |= 0x01, *patch++ = v;
		v >>= 8;
		if (v & 0xff) c |= 0x02, *patch++ = v;
		v >>= 8;
		if (v & 0xff) c |= 0x04, *patch++ = v;
		v >>= 8;
		if (v & 0xff) c |= 0x08, *patch++ = v;
		v = s;
		if (v & 0xff) c |= 0x10, *patch++ = v;
		v >>= 8;
		if (v & 0xff) c |= 0x20, *patch++ = v;
		*cmd = c;
		offset += s;
		size -= s;
	} while (size);
	return patch;
}

static unsigned process_copies(unsigned char *data, unsigned length, unsigned maxlen)
{
	int j;
	unsigned ptr = length;
	unsigned patch_bytes = header_length(idx_data_len, length);

	/* Work through the copies backwards, extending each one backwards. */
	for (j = copy_count - 1; j >= 0; j--) {
		CopyInfo *copy = copies + j;
		unsigned src = copy->src;
		unsigned tgt = copy->tgt;
		unsigned len = copy->length;
		int data_follows;

		if (tgt + len > ptr) {
			/* Part of the copy is already covered by a later one,
			   so shorten this copy. */
			if (ptr < tgt) {
				/* The copy completely disappeared, but guess
				   that a backward extension might still be
				   useful. This extension is non-contiguous,
				   as it is irrelevant whether the skipped
				   data would have matched or not. Be careful
				   not to extend past the beginning of the
				   source. */
				unsigned adjust = tgt - ptr;
				tgt = ptr;
				src = (src < adjust) ? 0 : src - adjust;
				copy->tgt = tgt;
				copy->src = src;
			}
			len = ptr - tgt;
		}

		while (src && tgt && idx_data[src - 1] == data[tgt - 1]) {
			src--;
			tgt--;
		}
		len += copy->tgt - tgt;
		data_follows = (tgt + len < ptr);

		/* A short copy may cost as much as 6 bytes for the copy and
		   5 more as the result of an extra data command. It's not
		   worth having extra copies just to save a byte or two.
		   Being too smart here may hurt later compression as well. */
		if (len < (data_follows ? 16 : 10))
			len = 0;

		/* Some target data is not covered by the copies; account for
		   the DATA command that will follow the copy. */
		if (len && data_follows)
			patch_bytes += data_length(ptr - (tgt + len));

		/* Everything about the copy is known and will not change.
		   Write back the new information and update the patch size
		   with the size of the copy instruction. */
		copy->length = len;
		copy->src = src;
		copy->tgt = tgt;

		if (len) {
			/* Update the patch size for the copy command */
			patch_bytes += copy_length(src, len);
			ptr = tgt;
		} else if (j == copy_count - 1) {
			/* Remove empty copies at the end of the list. */
			copy_count--;
		}
		if (patch_bytes > maxlen)
			return 0;
	}

	/* Account for data before the first copy */
	if (ptr != 0)
		patch_bytes += data_length(ptr);

	if (patch_bytes > maxlen)
		return 0;
	return patch_bytes;
}

static void *create_delta(unsigned char *data, unsigned len,
			  unsigned char *delta, unsigned delta_size)
{
	unsigned char *ptr = delta;
	unsigned offset = 0;
	int j;

	ptr = write_header(ptr, idx_data_len, len);

	for (j = 0; j < copy_count; j++) {
		CopyInfo *copy = copies + j;
		unsigned copylen = copy->length;

		if (!copylen)
			continue;
		if (copy->tgt > offset)
			ptr = write_data(ptr, data + offset, copy->tgt - offset);
		ptr = write_copy(ptr, copy->src, copylen);
		offset = copy->tgt + copylen;
	}
	if (offset < len)
		ptr = write_data(ptr, data + offset, len - offset);

	assert(ptr - delta == delta_size);
	return delta;
}

static void finalize_idx(void)
{
	if (max_copies > 8 * MAX_COPIES) {
		free(copies);
		copies = 0;
		max_copies = 0;
	}
	copy_count = 0;

	if (idx)
		free(idx);
	idx = 0;
	idx_size = 0;
	idx_data = 0;
	idx_data_len = 0;
}

static unsigned match_head(unsigned char *from, unsigned char *to, unsigned size)
{
	unsigned head = 0;
	while (head < size && from[head] == to[head])
		head++;
	return head;
}

static unsigned match_tail(unsigned char *from, unsigned char *to, unsigned size)
{
	unsigned tail = 0;
	while (tail < size && *(from - tail) == *(to - tail))
		tail++;
	return tail;
}

void *diff_delta(void *from_buf, unsigned long from_size,
		 void *to_buf, unsigned long to_size,
		 unsigned long *delta_size, unsigned long max_size)
{
	unsigned char *delta = 0;
	unsigned dsize;
	unsigned head = 0;
	unsigned tail = 0;

	assert(from_size <= MAX_SIZE && to_size <= MAX_SIZE);

	/* The following actually takes care of about half of all target data.
	   This is performance critical and may need some work. */
	head = match_head(from_buf, to_buf, MIN(from_size, to_size));
	tail = match_tail((unsigned char *)from_buf + (from_size - 1),
			  (unsigned char *)to_buf + (to_size - 1),
			  MIN(from_size, to_size - head));

	if (head <= RABIN_WINDOW_SIZE)
		head = 0;
	if (tail <= RABIN_WINDOW_SIZE)
		tail = 0;
	if (!max_size)
		max_size = from_size;

	init_idx(from_buf, from_size, 1, head, tail);

	if (head)
		add_copy(0, 0, head);
	if (head + tail + RABIN_WINDOW_SIZE < from_size) {
		if (!find_copies(to_buf, to_size - tail, head)) {
			/* Out of memory while recording copies: release the
			   index before bailing out. */
			finalize_idx();
			return 0;
		}
	}
	if (tail)
		add_copy(from_size - tail, to_size - tail, tail);

	dsize = process_copies(to_buf, to_size, max_size);
	if (dsize) {
		delta = malloc(dsize);
		if (delta)
			delta = create_delta(to_buf, to_size, delta, dsize);
	}
	finalize_idx();

	if (delta)
		*delta_size = dsize;
	return delta;
}
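To try the routine in isolation, a tiny driver along these lines should do.
The buffers and the main() wrapper below are made up for illustration; the
diff_delta() prototype matches the one in the patch above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern void *diff_delta(void *from_buf, unsigned long from_size,
			void *to_buf, unsigned long to_size,
			unsigned long *delta_size, unsigned long max_size);

int main(void)
{
	char from[] = "the quick brown fox jumps over the lazy dog\n";
	char to[]   = "the quick brown fox jumps over the lazy cat\n";
	unsigned long dsize = 0;
	void *delta = diff_delta(from, strlen(from), to, strlen(to), &dsize, 0);

	if (delta) {
		printf("delta is %lu bytes\n", dsize);
		free(delta);
	} else {
		printf("no delta produced within the size limit\n");
	}
	return 0;
}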