git@vger.kernel.org mailing list mirror (one of many)
* RFC: adding xdelta compression to git
@ 2005-05-03  3:57 Alon Ziv
  2005-05-03  4:12 ` Nicolas Pitre
  2005-05-03  4:52 ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Alon Ziv @ 2005-05-03  3:57 UTC (permalink / raw
  To: git

Looking for novel methods of wasting my time :), I am considering adding 
xdelta to git.

I have two concrete proposals, both of which (IMO) are consistent with the git 
philosophy:

1. Add a git-deltify command, which will take two trees and replace the second 
tree's blobs with delta-blobs referring to the first tree. Each delta-blob is 
self-contained; from the outside it looks like any other blob, but internally 
it contains another blob reference + an xdelta. The only function which would 
need to understand the new format would be unpack_sha1_file.
The scripting level will be in charge of deciding which trees to deltify (or 
undeltify--we could also have a "git-undeltify" command). A sane 
deltification schedule, for example, could be to always keep tagged versions 
as stand-alone objects, and deltify intermediate versions against the latest 
tag. It would also do its best to avoid delta chains (i.e. a delta referring 
to another delta).
Pros:
* Interoperates with the existing structure (including pull/push) with almost 
no changes to existing infrastructure.
Cons:
* Changes the repository format.
* Some performance impact (probably quite small).
* The same blob may have different representations in two repositories (one 
compressed, one deltified). [I am not sure this is really a bad thing...]

2. Add a completely external framework which manages a "deltas repository" of 
deltas. The shadow repository will contain delta objects between selected 
trees; again the scripts will need to populate it.
Pros:
* No changes at all to existing code.
Cons:
* Push/pull tools will need to be taught to talk with the new "deltas  
repository".
* Synchronization between the deltas repository and the real one may be lost, 
leading to odd failures.

Personally I'm rooting for #1 above... I would like to begin implementation in 
a few days, so any discussion will be useful.

	-az

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  3:57 RFC: adding xdelta compression to git Alon Ziv
@ 2005-05-03  4:12 ` Nicolas Pitre
  2005-05-03  4:52 ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03  4:12 UTC (permalink / raw
  To: Alon Ziv; +Cc: git

On Tue, 3 May 2005, Alon Ziv wrote:

> Looking for novel methods of wasting my time :), I am considering adding 
> xdelta to git.
> 
> I have two concrete proposals, both of which (IMO) are consistent with the git 
> philosophy:
> 
> 1. Add a git-deltify command, which will take two trees and replace the second 
> tree's blobs with delta-blobs referring to the first tree. Each delta-blob is 
> self-contained; from the outside it looks like any other blob, but internally 
> it contains another blob reference + an xdelta. The only function which would 
> need to understand the new format would be unpack_sha1_file.
[....]

Guess what?

That's exactly what I did, except that I used libxdiff, stripped it to 
the bare minimum and even optimized it to be as efficient (i.e. fast) as 
possible given the git environment.

I'm finalizing the code right now.


Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  3:57 RFC: adding xdelta compression to git Alon Ziv
  2005-05-03  4:12 ` Nicolas Pitre
@ 2005-05-03  4:52 ` Linus Torvalds
  2005-05-03  5:30   ` Davide Libenzi
                     ` (3 more replies)
  1 sibling, 4 replies; 32+ messages in thread
From: Linus Torvalds @ 2005-05-03  4:52 UTC (permalink / raw
  To: Alon Ziv; +Cc: git



On Tue, 3 May 2005, Alon Ziv wrote:
> 
> 1. Add a git-deltify command, which will take two trees and replace the second 
> tree's blobs with delta-blobs referring to the first tree.

If you do something like this, you want such a delta-blob to be named by 
the sha1 of the result, so that things that refer to it can transparently 
see either the original blob _or_ the "deltified" one, and will never 
care.

It seems that is your plan:

> from the outside it looks like any other blob, but internally it
> contains another blob reference + an xdelta.

Yes. git doesn't much care, as long as the objects unpack to the right 
format. That's all hidden away.

> The only function which would need to understand the new format would be
> unpack_sha1_file.

Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
something about any xdelta objects, if only because if/when things go
wrong, it's really nasty to suddenly see a million "blob" objects not work
any more, with no indication of _why_ they don't work. The core reason may
be that one original object (that just got used as a base for tons of
other objects through deltas) is corrupt or missing. And then you want to
show that _one_ object.

> Cons:
> * Changes the repository format.

It wouldn't necessarily. You should be able to do this with _zero_ changes 
to existing objects what-so-ever.

What you do is introduce an "xdelta" object, which has a reference to a 
blob object and the delta. The git object model already names all objects 
by a simple ascii name, so adding a new object type in _no_ way changes 
any existing objects.

So you can just make "unpack_sha1_file()" notice that it unpacked a xdelta 
object, and then do the proper delta application, and nobody will ever be 
the wiser.

> * Some performance impact (probably quite small).

If you limit the depth of deltas, probably not too bad.

> * The same blob may have different representations in two repositories (one
> compressed, one deltified). [I am not sure this is really a bad thing...]

THIS, I think, is the real issue. fsck-cache and pull etc., which need to
know about references to other objects, would have to be able to see the
xdelta object, so that they can build up the reference graph. So you'd
need to basically make a "raw_unpack_sha1_file()" interface (the current
regular unpack_sha1_file()) for that.

Also, the fact is, since git saves things as separate files, you'd not win
as much as you would with some other backing store. So the second step is
to start packing the objects etc. I think there is actually a very steep
complexity edge here - not because any of the individual steps necessarily
add a whole lot, but because they all lead to the "next step".

I personally clearly feel that simplicity (and the resulting robustness)
is worth a _lot_ of disk-space.

So I think that what you suggest is likely to actually be pretty easy, but 
I'm not entirely convinced it's worth the slide into complexity.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  4:52 ` Linus Torvalds
@ 2005-05-03  5:30   ` Davide Libenzi
  2005-05-03 15:52     ` C. Scott Ananian
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Davide Libenzi @ 2005-05-03  5:30 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Alon Ziv, git

On Mon, 2 May 2005, Linus Torvalds wrote:

> Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
> something about any xdelta objects, if only because if/when things go
> wrong, it's really nasty to suddenly see a million "blob" objects not work
> any more, with no indication of _why_ they don't work. The core reason may
> be that one original object (that just got used as a base for tons of
> other objects through deltas) is corrupt or missing. And then you want to
> show that _one_ object.

Linus, xdelta-based algorithms already store information regarding the 
object that originated the diff. Since they have no context (unlike 
text-based diffs) and are simply based on offset-driven copy/insert 
operations, this is a requirement. Libxdiff uses an adler32 plus the size 
of the original object, but you can get as fancy as you like in your own 
implementation. Before a delta is applied, the stored information is 
cross-checked against the input base object, and the delta patch will 
fail in the event of a mismatch. So an fsck is simply a walk backward (or 
forward, depending on your metadata model) of the whole delta chain.



- Davide


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] add the ability to create and retrieve delta objects
  2005-05-03  4:52 ` Linus Torvalds
  2005-05-03  5:30   ` Davide Libenzi
@ 2005-05-03  8:06   ` Nicolas Pitre
  2005-05-03 11:24     ` Chris Mason
                       ` (3 more replies)
  2005-05-03 12:48   ` RFC: adding xdelta compression to git Dan Holmsand
  2005-05-03 15:50   ` C. Scott Ananian
  3 siblings, 4 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03  8:06 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Alon Ziv, git

On Mon, 2 May 2005, Linus Torvalds wrote:

> If you do something like this, you want such a delta-blob to be named by 
> the sha1 of the result, so that things that refer to it can transparently 
> see either the original blob _or_ the "deltified" one, and will never 
> care.

Yep, that's what I did last weekend (and just made it actually 
work since people are getting interested).

==========

This patch adds the necessary functionality to perform delta 
compression on objects.  It adds a git-mkdelta command which can replace 
any object with its deltified version, given a reference object.

Access to a delta object will transparently fetch the reference object 
and apply the transformation.  Scripts can be used to perform any sort 
of compression policy on top of it.

The delta generator has been extracted from libxdiff and optimized for 
git usage in order to avoid as much data copying as possible, and the 
delta storage format has been modified to be even more compact.  There 
is therefore no need to rely on any external library.  The test-delta 
program can be used to test it.

The fsck tool doesn't know about delta objects and their relation to 
other objects yet.  But if one doesn't use git-mkdelta this should not 
be a problem.  Many refinements are possible, but they are better merged 
separately; loop detection and a recursion limit are two examples.

Signed-off-by: Nicolas Pitre <nico@cam.org>

--- k/delta.h
+++ l/delta.h
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
--- k/diff-delta.c
+++ l/diff-delta.c
@@ -0,0 +1,315 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 4096;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size;
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[-inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
--- k/mkdelta.c
+++ l/mkdelta.c
@@ -0,0 +1,95 @@
+#include "cache.h"
+#include "delta.h"
+
+static int write_delta_file(char *buf, unsigned long len, unsigned char *sha1_ref, unsigned char *path)
+{
+	int size;
+	char *compressed;
+	z_stream stream;
+	char hdr[50];
+	int fd, hdrlen;
+
+	/* Generate the header */
+	hdrlen = sprintf(hdr, "delta %lu", len+20)+1;
+	memcpy(hdr + hdrlen, sha1_ref, 20);
+	hdrlen += 20;
+
+	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
+	if (fd < 0)
+		return -1;
+
+	/* Set it up */
+	memset(&stream, 0, sizeof(stream));
+	deflateInit(&stream, Z_BEST_COMPRESSION);
+	size = deflateBound(&stream, len+hdrlen);
+	compressed = xmalloc(size);
+
+	/* Compress it */
+	stream.next_out = compressed;
+	stream.avail_out = size;
+
+	/* First header.. */
+	stream.next_in = hdr;
+	stream.avail_in = hdrlen;
+	while (deflate(&stream, 0) == Z_OK)
+		/* nothing */;
+
+	/* Then the data itself.. */
+	stream.next_in = buf;
+	stream.avail_in = len;
+	while (deflate(&stream, Z_FINISH) == Z_OK)
+		/* nothing */;
+	deflateEnd(&stream);
+	size = stream.total_out;
+
+	if (write(fd, compressed, size) != size)
+		die("unable to write file");
+	close(fd);
+		
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned char sha1_ref[20], sha1_trg[20];
+	char type_ref[20], type_trg[20];
+	void *buf_ref, *buf_trg, *buf_delta;
+	unsigned long size_ref, size_trg, size_delta;
+	char *filename, tmpname[100];
+
+	if (argc != 3 || get_sha1(argv[1], sha1_ref) || get_sha1(argv[2], sha1_trg))
+		usage("git-mkdelta <reference_sha1> <target_sha1>");
+
+	buf_ref = read_sha1_file(sha1_ref, type_ref, &size_ref);
+	if (!buf_ref) {
+		fprintf(stderr, "%s: unable to read reference object\n", argv[0]);
+		exit(1);
+	}
+	buf_trg = read_sha1_file(sha1_trg, type_trg, &size_trg);
+	if (!buf_trg) {
+		fprintf(stderr, "%s: unable to read target object\n", argv[0]);
+		exit(1);
+	}
+	if (strcmp(type_ref, type_trg)) {
+		fprintf(stderr, "%s: reference and target are of different type\n", argv[0]);
+		exit(2);
+	}
+	buf_delta = diff_delta(buf_ref, size_ref, buf_trg, size_trg, &size_delta);
+	if (!buf_delta) {
+		fprintf(stderr, "%s: unable to create delta\n", argv[0]);
+		exit(3);
+	}
+
+	filename = sha1_file_name(sha1_trg);
+	sprintf(tmpname, "%s.delta.tmp", filename);
+	if (write_delta_file(buf_delta, size_delta, sha1_ref, tmpname)) {
+		perror(tmpname);
+		exit(1);
+	}
+	if (rename(tmpname, filename)) {
+		perror("rename");
+		exit(1);
+	}
+
+	return 0;
+}
--- k/patch-delta.c
+++ l/patch-delta.c
@@ -0,0 +1,73 @@
+/*
+ * patch-delta.c:
+ * recreate a buffer from a source and the delta produced by diff-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include "delta.h"
+
+void *patch_delta(void *src_buf, unsigned long src_size,
+		  void *delta_buf, unsigned long delta_size,
+		  unsigned long *dst_size)
+{
+	const unsigned char *data, *top;
+	unsigned char *dst, *out;
+	int size;
+
+	/* the smallest delta size possible is 10 bytes */
+	if (delta_size < 10)
+		return NULL;
+
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	/* make sure the orig file size matches what we expect */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	if (size != src_size)
+		return NULL;
+
+	/* now the result size */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	dst = malloc(size);
+	if (!dst)
+		return NULL;
+
+	out = dst;
+	while (data < top) {
+		unsigned char cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned int cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+			memcpy(out, src_buf + cp_off, cp_size);
+			out += cp_size;
+		} else {
+			memcpy(out, data, cmd);
+			out += cmd;
+			data += cmd;
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out - dst != size) {
+		free(dst);
+		return NULL;
+	}
+
+	*dst_size = size;
+	return dst;
+}
Binary files k/test-delta and l/test-delta differ
--- k/test-delta.c
+++ l/test-delta.c
@@ -0,0 +1,79 @@
+/*
+ * test-delta.c: test code to exercise diff-delta.c and patch-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include "delta.h"
+
+static const char *usage =
+	"test-delta (-d|-p) <from_file> <data_file> <out_file>";
+
+int main(int argc, char *argv[])
+{
+	int fd;
+	struct stat st;
+	void *from_buf, *data_buf, *out_buf;
+	unsigned long from_size, data_size, out_size;
+
+	if (argc != 5 || (strcmp(argv[1], "-d") && strcmp(argv[1], "-p"))) {
+		fprintf(stderr, "Usage: %s\n", usage);
+		return 1;
+	}
+
+	fd = open(argv[2], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[2]);
+		return 1;
+	}
+	from_size = st.st_size;
+	from_buf = mmap(NULL, from_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (from_buf == MAP_FAILED) {
+		perror(argv[2]);
+		return 1;
+	}
+	close(fd);
+
+	fd = open(argv[3], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[3]);
+		return 1;
+	}
+	data_size = st.st_size;
+	data_buf = mmap(NULL, data_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (data_buf == MAP_FAILED) {
+		perror(argv[3]);
+		return 1;
+	}
+	close(fd);
+
+	if (argv[1][1] == 'd')
+		out_buf = diff_delta(from_buf, from_size,
+				     data_buf, data_size, &out_size);
+	else
+		out_buf = patch_delta(from_buf, from_size,
+				      data_buf, data_size, &out_size);
+	if (!out_buf) {
+		fprintf(stderr, "delta operation failed (returned NULL)\n");
+		return 1;
+	}
+
+	fd = open (argv[4], O_WRONLY|O_CREAT|O_TRUNC, 0666);
+	if (fd < 0 || write(fd, out_buf, out_size) != out_size) {
+		perror(argv[4]);
+		return 1;
+	}
+
+	return 0;
+}
--- k/Makefile
+++ l/Makefile
@@ -21,7 +21,7 @@ PROG=   git-update-cache git-diff-files 
 	git-check-files git-ls-tree git-merge-base git-merge-cache \
 	git-unpack-file git-export git-diff-cache git-convert-cache \
 	git-http-pull git-rpush git-rpull git-rev-list git-mktag \
-	git-diff-tree-helper git-tar-tree git-local-pull
+	git-diff-tree-helper git-tar-tree git-local-pull git-mkdelta
 
 all: $(PROG)
 
@@ -29,7 +29,7 @@ install: $(PROG) $(SCRIPTS)
 	install $(PROG) $(SCRIPTS) $(HOME)/bin/
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
-	 tag.o date.o
+	 tag.o date.o diff-delta.o patch-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h
 
@@ -63,6 +63,9 @@ $(LIB_FILE): $(LIB_OBJS)
 test-date: test-date.c date.o
 	$(CC) $(CFLAGS) -o $@ test-date.c date.o
 
+test-delta: test-delta.c diff-delta.o patch-delta.o
+	$(CC) $(CFLAGS) -o $@ $^
+
 git-%: %.c $(LIB_FILE)
 	$(CC) $(CFLAGS) -o $@ $(filter %.c,$^) $(LIBS)
 
@@ -92,6 +95,7 @@ git-rpush: rsh.c
 git-rpull: rsh.c pull.c
 git-rev-list: rev-list.c
 git-mktag: mktag.c
+git-mkdelta: mkdelta.c
 git-diff-tree-helper: diff-tree-helper.c
 git-tar-tree: tar-tree.c
 
--- k/sha1_file.c
+++ l/sha1_file.c
@@ -8,6 +8,7 @@
  */
 #include <stdarg.h>
 #include "cache.h"
+#include "delta.h"
 
 const char *sha1_file_directory = NULL;
 
@@ -186,7 +187,8 @@ void * unpack_sha1_file(void *map, unsig
 	int ret, bytes;
 	z_stream stream;
 	char buffer[8192];
-	char *buf;
+	char *buf, *delta_ref;
+	unsigned long delta_ref_sz;
 
 	/* Get the data stream */
 	memset(&stream, 0, sizeof(stream));
@@ -201,8 +203,15 @@ void * unpack_sha1_file(void *map, unsig
 		return NULL;
 	if (sscanf(buffer, "%10s %lu", type, size) != 2)
 		return NULL;
-
 	bytes = strlen(buffer) + 1;
+
+	if (!strcmp(type, "delta")) {
+		delta_ref = read_sha1_file(buffer + bytes, type, &delta_ref_sz);
+		if (!delta_ref)
+			return NULL;
+	} else
+		delta_ref = NULL;
+
 	buf = xmalloc(*size);
 
 	memcpy(buf, buffer + bytes, stream.total_out - bytes);
@@ -214,6 +223,17 @@ void * unpack_sha1_file(void *map, unsig
 			/* nothing */;
 	}
 	inflateEnd(&stream);
+
+	if (delta_ref) {
+		char *newbuf;
+		unsigned long newsize;
+		newbuf = patch_delta(delta_ref, delta_ref_sz, buf+20, *size-20, &newsize);
+		free(delta_ref);
+		free(buf);
+		buf = newbuf;
+		*size = newsize;
+	}
+
 	return buf;
 }
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
@ 2005-05-03 11:24     ` Chris Mason
  2005-05-03 12:51       ` Nicolas Pitre
                         ` (2 more replies)
  2005-05-03 14:13     ` Chris Mason
                       ` (2 subsequent siblings)
  3 siblings, 3 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-03 11:24 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I did last weekend (and just made it actually
> work since people are getting interested).
>
This looks much nicer than using zdelta, I'll try switching my packed item to 
your delta generator later this week.  Some quick and dirty space numbers to 
show why we need to pack the files together:

On the full import of all the bk->cvs changesets, the average file size 
in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.

This means that of the 2.5GB the .git directory consumes, about 1GB is taken 
up by files under 4k where deltas won't save space.  If the remaining files 
could be delta compressed down to less than 4k, they would still take up 
around 400MB on disk.

-chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  4:52 ` Linus Torvalds
  2005-05-03  5:30   ` Davide Libenzi
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
@ 2005-05-03 12:48   ` Dan Holmsand
  2005-05-03 15:50   ` C. Scott Ananian
  3 siblings, 0 replies; 32+ messages in thread
From: Dan Holmsand @ 2005-05-03 12:48 UTC (permalink / raw
  To: git

Linus Torvalds wrote:
> Also, the fact is, since git saves things as separate files, you'd not win
> as much as you would with some other backing store. So the second step is
> to start packing the objects etc. I think there is actually a very steep
> complexity edge here - not because any of the individual steps necessarily
> add a whole lot, but because they all lead to the "next step".

Actually, you can win quite a lot.

I've just been playing with storing the entire 
linux-2.4.0-to-2.6.12-rc2 patchset as xdelta patches in git. The whole 
thing ended up at 577M (instead of some 3.5G), according to du -sh 
--apparent-size. Considering that this represents some 800M of patches, 
that's not too bad.

I used a very simple scheme: I stored a delta to the previous version of 
every file if that delta was less than 20% of the new file's size 
(otherwise, the whole file was stored as usual). If the previous version 
was already in delta form, the delta was computed from that version's 
"parent". I didn't even care to look at files less than 4k in size.

In other words, I didn't have to use any delta "chains", and still got 
quite massive storage size gains. And this scheme could easily be used 
on the fly; I'm guessing that it would be performance neutral (or even a 
slight gain, since less has to be compressed and written to disk on 
commits, and there might be less to read when diffing, since two 
"delta-blobs" might share the same parent).

Or, with careful tuning, a repo might be "deltified" later on (assuming 
that delta blobs are addressed using the expanded file's hash).

/dan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 11:24     ` Chris Mason
@ 2005-05-03 12:51       ` Nicolas Pitre
  2005-05-03 15:07       ` Linus Torvalds
  2005-05-03 15:57       ` C. Scott Ananian
  2 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03 12:51 UTC (permalink / raw
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git

On Tue, 3 May 2005, Chris Mason wrote:

> This looks much nicer than using zdelta, I'll try switching my packed item to 
> your delta generator later this week.  Some quick and dirty space numbers to 
> show why we need to pack the files together:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
> 
> This means that of the 2.5GB the .git directory consumes, about 1GB is taken 
> up by files under 4k where deltas won't save space.  If the remaining files 
> could be delta compressed down to less than 4k, they would still take up 
> around 400MB on disk.

Sure.  Still, it helps with history backups and network transfer.

However, if the delta compression and packed storage can remain as 
decoupled from each other as possible, this is good for flexibility.


Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
  2005-05-03 11:24     ` Chris Mason
@ 2005-05-03 14:13     ` Chris Mason
  2005-05-03 14:24       ` Nicolas Pitre
  2005-05-03 14:48     ` Linus Torvalds
  2005-05-04 15:56     ` Chris Mason
  3 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2005-05-03 14:13 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I've done last weekend (and just made it actually
> work since people are getting interested).

Hmmm, something is strange here, am I using this wrong?

coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
Aborted

Valgrind output:

==9634== Invalid read of size 1
==9634==    at 0x1B9036F0: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906F is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F3: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036F6: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906E is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036F9: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B9036FC: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906D is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B9036FF: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08B is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid read of size 1
==9634==    at 0x1B903702: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1B90906C is not stack'd, malloc'd or (recently) free'd
==9634==
==9634== Invalid write of size 1
==9634==    at 0x1B903708: memcpy (in /usr/lib/valgrind/vgpreload_memcheck.so)
==9634==    by 0x8049142: patch_delta (patch-delta.c:59)
==9634==    by 0x80487CB: main (test-delta.c:65)
==9634==  Address 0x1BA3A08A is not stack'd, malloc'd or (recently) free'd
delta operation failed (returned NULL)
==9634==
==9634== ERROR SUMMARY: 206 errors from 13 contexts (suppressed: 0 from 0)
==9634== malloc/free: in use at exit: 0 bytes in 0 blocks.
==9634== malloc/free: 1 allocs, 1 frees, 5 bytes allocated.
==9634== For a detailed leak analysis,  rerun with: --leak-check=yes
==9634== For counts of detected errors, rerun with: -v

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 14:13     ` Chris Mason
@ 2005-05-03 14:24       ` Nicolas Pitre
  2005-05-03 14:37         ` Chris Mason
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03 14:24 UTC (permalink / raw
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> > On Mon, 2 May 2005, Linus Torvalds wrote:
> > > If you do something like this, you want such a delta-blob to be named by
> > > the sha1 of the result, so that things that refer to it can transparently
> > > see either the original blob _or_ the "deltified" one, and will never
> > > care.
> >
> > Yep, that's what I've done last weekend (and just made it actually
> > work since people are getting interested).
> 
> Hmmm, something is strange here, am I using this wrong?
> 
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> Aborted

Can you send me your foo and delta1 files?


Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 14:24       ` Nicolas Pitre
@ 2005-05-03 14:37         ` Chris Mason
  2005-05-03 15:04           ` Nicolas Pitre
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2005-05-03 14:37 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git

On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > Hmmm, something is strange here, am I using this wrong?
> >
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > Aborted
>
> Can you send me your foo and delta2 files?
>
Sorry, thought I had the whole command history in there.  I went for something 
small to start ;)

coffee:~/git/linus.orig # echo foo > foo
coffee:~/git/linus.orig # echo foo2 > foo2
coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
coffee:~/git/linus.orig # ls -la delta1
-rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
*** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
  2005-05-03 11:24     ` Chris Mason
  2005-05-03 14:13     ` Chris Mason
@ 2005-05-03 14:48     ` Linus Torvalds
  2005-05-03 15:52       ` Nicolas Pitre
  2005-05-04 15:56     ` Chris Mason
  3 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2005-05-03 14:48 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Alon Ziv, Git Mailing List



On Tue, 3 May 2005, Nicolas Pitre wrote:
> 
> Yep, that's what I've done last weekend (and just made it actually 
> work since people are getting interested).

I have to say that it looks uncommonly simple. Also, afaik, this should
still work with the current fsck, it's just that because fsck doesn't
understand the linkages, the error reporting won't be as good as it could
be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
read xxxxxx").

Now, one thing I like about this approach is that the actual delta 
_generation_ can be done off-line, and independently of anything else. 
Which means that the performance paths I care about (commit etc) are 
largely unaffected, and you can "deltify" a git archive overnight or 
something. 

In fact, it means that you might even be able to use some fairly expensive 
"search for the best blob object to delta against", including a very 
intelligent rename search (ie "oh, this is a new object, let's see if any 
of the old deleted objects generate a good delta"), and you might even go 
back more than one generation.

Hmm. How nasty are those scripts?

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 14:37         ` Chris Mason
@ 2005-05-03 15:04           ` Nicolas Pitre
  2005-05-03 16:54             ` Chris Mason
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03 15:04 UTC (permalink / raw
  To: Chris Mason; +Cc: Linus Torvalds, Alon Ziv, git

On Tue, 3 May 2005, Chris Mason wrote:

> On Tuesday 03 May 2005 10:24, Nicolas Pitre wrote:
> > On Tue, 3 May 2005, Chris Mason wrote:
> > > Hmmm, something is strange here, am I using this wrong?
> > >
> > > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
> > > Aborted
> >
> > Can you send me your foo and delta2 files?
> >
> Sorry, thought I had the whole command history in there.  I went for something 
> small to start ;)
> 
> coffee:~/git/linus.orig # echo foo > foo
> coffee:~/git/linus.orig # echo foo2 > foo2
> coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> coffee:~/git/linus.orig # ls -la delta1
> -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***

OK, doh!

--- diff-delta.c.orig	2005-05-03 11:00:39.900529634 -0400
+++ diff-delta.c	2005-05-03 11:01:03.210031176 -0400
@@ -307,7 +307,7 @@
 	}
 
 	if (inscnt)
-		out[-inscnt - 1] = inscnt;
+		out[outpos - inscnt - 1] = inscnt;
 
 	delta_cleanup(&bdf);
 	*delta_size = outpos;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 11:24     ` Chris Mason
  2005-05-03 12:51       ` Nicolas Pitre
@ 2005-05-03 15:07       ` Linus Torvalds
  2005-05-03 16:09         ` Chris Mason
  2005-05-03 15:57       ` C. Scott Ananian
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2005-05-03 15:07 UTC (permalink / raw
  To: Chris Mason; +Cc: Nicolas Pitre, Alon Ziv, git



On Tue, 3 May 2005, Chris Mason wrote:
> 
> On the full import of all the bk->cvs changesets, the average file size 
> in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.

Have you checked how many of those are blobs?

For many commits, we generate as many (or more) _tree_ objects as we 
generate blobs. 

And tree objects from the same "supertree" really are something that I
wouldn't mind packing in some way, because they tend to be very much
related (since they refer to each other). Eg the commit and the top-level
tree are almost always a pair, since you'd get a shared top-level tree
only with two commits that have the exact same content (which definitely
happens, don't get me wrong, but if we get some duplication for that case,
we'd still be winning).

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  4:52 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2005-05-03 12:48   ` RFC: adding xdelta compression to git Dan Holmsand
@ 2005-05-03 15:50   ` C. Scott Ananian
  3 siblings, 0 replies; 32+ messages in thread
From: C. Scott Ananian @ 2005-05-03 15:50 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Alon Ziv, git

On Mon, 2 May 2005, Linus Torvalds wrote:

>> * Changes the repository format.
>
> It wouldn't necessarily. You should be able to do this with _zero_ changes
> to existing objects what-so-ever.

Yes.  The 'chunking' code I posted earlier does this, etc.  It's kinda odd 
computing a SHA-1 including the 'blob <size>\0' header, even when your 
representation doesn't use this type exactly, but it's no big deal.  I'm 
still tinkering with this, btw; I can get modest improvements in 'real' disk 
space used, but nothing earth-shattering (yet).  I'll post the list of 
things I tried and how well they worked at some point, just to save people 
the effort of retrying things.

I've been working from the 'no knowledge of commit structure needed' 
perspective; I think Chris Mason has been using the structure of the 
commit object to guide delta-fication and showing more impressive 
space savings.
  --scott

HTAUTOMAT Legion of Doom payment PBPRIME insurgent shortwave AVBUSY 
Nader PBCABOOSE overthrow explosion Ortega STANDEL ECJOB Sigint FBI
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03  5:30   ` Davide Libenzi
@ 2005-05-03 15:52     ` C. Scott Ananian
  2005-05-03 17:35       ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: C. Scott Ananian @ 2005-05-03 15:52 UTC (permalink / raw
  To: Davide Libenzi; +Cc: Linus Torvalds, Alon Ziv, git

On Mon, 2 May 2005, Davide Libenzi wrote:

> On Mon, 2 May 2005, Linus Torvalds wrote:
>
>> Yes. EXCEPT for one thing. fsck. I'd _really_ like fsck to be able to know
>> something about any xdelta objects, if only because if/when things go

> Linus, xdelta-based algorithms already store information regarding the
> object that originated the diff. Since they have no context (unlike
> text-based diffs) and are simply based on offset-driven copy/insert
> operations, this is a requirement. Libxdiff uses an adler32+size of the
> original object, but you can get as fancy as you like in your own
> implementation. Before a delta is applied, the stored information is
> cross-checked with the input base object, and the patch will fail in the
> event of a mismatch. So an fsck is simply a walk backward (or forward,
> depending on your metadata model) of the whole delta chain.

Linus knows this.  His point is just to be sure you actually *code* that 
walk in fsck, and (hopefully) do so w/o complicating the fsck too much.
  --scott

supercomputer BOND quiche SYNCARP Honduras North Korea Qaddafi PANCHO 
SKILLET KUDESK non-violent protest ESQUIRE struggle Saddam Hussein
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 14:48     ` Linus Torvalds
@ 2005-05-03 15:52       ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-03 15:52 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Git Mailing List

On Tue, 3 May 2005, Linus Torvalds wrote:

> On Tue, 3 May 2005, Nicolas Pitre wrote:
> > 
> > Yep, that's what I've done last weekend (and just made it actually 
> > work since people are getting interested).
> 
> I have to say that it looks uncommonly simple. Also, afaik, this should
> still work with the current fsck, it's just that because fsck doesn't
> understand the linkages, the error reporting won't be as good as it could
> be (I'd _much_ rather see "delta failed in object xxxxx" than "unable to
> read xxxxxx").

Yep.  Let's do it in a separate patch if you please.

> Now, one thing I like about this approach is that the actual delta 
> _generation_ can be done off-line, and independently of anything else. 
> Which means that the performance paths I care about (commit etc) are 
> largely unaffected, and you can "deltify" a git archive overnight or 
> something. 

Yes.  And actually you can use any kind of delta reference topology as 
you wish.  It may start from the first object revision, with the next 
revision a delta against the first, the third a delta against the 
second, etc.  But it is much more interesting to do it the other way 
around, such that the second revision is stored as is and the first 
revision is made a delta against the second.  Then on the next 
commit the third revision is stored as is and the second rev made a 
delta against the third, and so on.  You therefore get delta compression 
at commit time with little overhead if you wish.  This approach 
has the advantage of keeping the latest object revisions quickly 
accessible, relegating the delta overhead to the old historic 
objects.

And suppose the delta chain becomes too deep for some objects and 
accessing them incurs too much overhead.  No problem: just pick an object 
in the middle of the delta chain, swap it with its original undeltafied 
version, and the delta chain is cut in two.

Etc.  It's flexible and open to any arrangement.

OK, here's a revised patch correcting the little bug found by
Chris Mason.

==========

This patch adds the necessary functionalities to perform delta
compression on objects.  It adds a git-mkdelta command which can replace
any object with its deltafied version given a reference object.

Access to a delta object will transparently fetch the reference object
and apply the transformation.  Scripts can be used to perform any sort
of compression policy on top of it.

The delta generator has been extracted from libxdiff and optimized for
git usage in order to avoid as much data copy as possible, and the delta
storage format modified to be even more compact, so there is no need to 
rely on any external library.  The test-delta program can be used to 
test it.

Many refinements are still needed, but it's better to merge them 
separately.  Loop detection and a recursion threshold are two examples.

Signed-off-by: Nicolas Pitre <nico@cam.org>

--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,7 @@ install: $(PROG) $(SCRIPTS)
 	install $(PROG) $(SCRIPTS) $(HOME)/bin/
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
-	 tag.o date.o
+	 tag.o date.o diff-delta.o patch-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h
 
@@ -63,6 +63,9 @@ $(LIB_FILE): $(LIB_OBJS)
 test-date: test-date.c date.o
 	$(CC) $(CFLAGS) -o $@ test-date.c date.o
 
+test-delta: test-delta.c diff-delta.o patch-delta.o
+	$(CC) $(CFLAGS) -o $@ $^
+
 git-%: %.c $(LIB_FILE)
 	$(CC) $(CFLAGS) -o $@ $(filter %.c,$^) $(LIBS)
 
@@ -92,6 +95,7 @@ git-rpush: rsh.c
 git-rpull: rsh.c pull.c
 git-rev-list: rev-list.c
 git-mktag: mktag.c
+git-mkdelta: mkdelta.c
 git-diff-tree-helper: diff-tree-helper.c
 git-tar-tree: tar-tree.c
 git-write-blob: write-blob.c
Created: delta.h (mode:100644)
--- /dev/null
+++ b/delta.h
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
Created: diff-delta.c (mode:100644)
--- /dev/null
+++ b/diff-delta.c
@@ -0,0 +1,315 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 4096;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size;
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[outpos - inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
Created: patch-delta.c (mode:100644)
--- /dev/null
+++ b/patch-delta.c
@@ -0,0 +1,73 @@
+/*
+ * patch-delta.c:
+ * recreate a buffer from a source and the delta produced by diff-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include "delta.h"
+
+void *patch_delta(void *src_buf, unsigned long src_size,
+		  void *delta_buf, unsigned long delta_size,
+		  unsigned long *dst_size)
+{
+	const unsigned char *data, *top;
+	unsigned char *dst, *out;
+	int size;
+
+	/* the smallest delta size possible is 10 bytes */
+	if (delta_size < 10)
+		return NULL;
+
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	/* make sure the orig file size matches what we expect */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	if (size != src_size)
+		return NULL;
+
+	/* now the result size */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	dst = malloc(size);
+	if (!dst)
+		return NULL;
+
+	out = dst;
+	while (data < top) {
+		unsigned char cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned int cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+			memcpy(out, src_buf + cp_off, cp_size);
+			out += cp_size;
+		} else {
+			memcpy(out, data, cmd);
+			out += cmd;
+			data += cmd;
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out - dst != size) {
+		free(dst);
+		return NULL;
+	}
+
+	*dst_size = size;
+	return dst;
+}
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -8,6 +8,7 @@
  */
 #include <stdarg.h>
 #include "cache.h"
+#include "delta.h"
 
 const char *sha1_file_directory = NULL;
 
@@ -186,7 +187,8 @@ void * unpack_sha1_file(void *map, unsig
 	int ret, bytes;
 	z_stream stream;
 	char buffer[8192];
-	char *buf;
+	char *buf, *delta_ref;
+	unsigned long delta_ref_sz;
 
 	/* Get the data stream */
 	memset(&stream, 0, sizeof(stream));
@@ -201,8 +203,15 @@ void * unpack_sha1_file(void *map, unsig
 		return NULL;
 	if (sscanf(buffer, "%10s %lu", type, size) != 2)
 		return NULL;
-
 	bytes = strlen(buffer) + 1;
+
+	if (!strcmp(type, "delta")) {
+		delta_ref = read_sha1_file(buffer + bytes, type, &delta_ref_sz);
+		if (!delta_ref)
+			return NULL;
+	} else
+		delta_ref = NULL;
+
 	buf = xmalloc(*size);
 
 	memcpy(buf, buffer + bytes, stream.total_out - bytes);
@@ -214,6 +223,17 @@ void * unpack_sha1_file(void *map, unsig
 			/* nothing */;
 	}
 	inflateEnd(&stream);
+
+	if (delta_ref) {
+		char *newbuf;
+		unsigned long newsize;
+		newbuf = patch_delta(delta_ref, delta_ref_sz, buf+20, *size-20, &newsize);
+		free(delta_ref);
+		free(buf);
+		buf = newbuf;
+		*size = newsize;
+	}
+
 	return buf;
 }
 
Created: test-delta.c (mode:100644)
--- /dev/null
+++ b/test-delta.c
@@ -0,0 +1,79 @@
+/*
+ * test-delta.c: test code to exercise diff-delta.c and patch-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include "delta.h"
+
+static const char *usage =
+	"test-delta (-d|-p) <from_file> <data_file> <out_file>";
+
+int main(int argc, char *argv[])
+{
+	int fd;
+	struct stat st;
+	void *from_buf, *data_buf, *out_buf;
+	unsigned long from_size, data_size, out_size;
+
+	if (argc != 5 || (strcmp(argv[1], "-d") && strcmp(argv[1], "-p"))) {
+		fprintf(stderr, "Usage: %s\n", usage);
+		return 1;
+	}
+
+	fd = open(argv[2], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[2]);
+		return 1;
+	}
+	from_size = st.st_size;
+	from_buf = mmap(NULL, from_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (from_buf == MAP_FAILED) {
+		perror(argv[2]);
+		return 1;
+	}
+	close(fd);
+
+	fd = open(argv[3], O_RDONLY);
+	if (fd < 0 || fstat(fd, &st)) {
+		perror(argv[3]);
+		return 1;
+	}
+	data_size = st.st_size;
+	data_buf = mmap(NULL, data_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (data_buf == MAP_FAILED) {
+		perror(argv[3]);
+		return 1;
+	}
+	close(fd);
+
+	if (argv[1][1] == 'd')
+		out_buf = diff_delta(from_buf, from_size,
+				     data_buf, data_size, &out_size);
+	else
+		out_buf = patch_delta(from_buf, from_size,
+				      data_buf, data_size, &out_size);
+	if (!out_buf) {
+		fprintf(stderr, "delta operation failed (returned NULL)\n");
+		return 1;
+	}
+
+	fd = open (argv[4], O_WRONLY|O_CREAT|O_TRUNC, 0666);
+	if (fd < 0 || write(fd, out_buf, out_size) != out_size) {
+		perror(argv[4]);
+		return 1;
+	}
+
+	return 0;
+}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 11:24     ` Chris Mason
  2005-05-03 12:51       ` Nicolas Pitre
  2005-05-03 15:07       ` Linus Torvalds
@ 2005-05-03 15:57       ` C. Scott Ananian
  2005-05-03 16:35         ` Chris Mason
  2 siblings, 1 reply; 32+ messages in thread
From: C. Scott Ananian @ 2005-05-03 15:57 UTC (permalink / raw
  To: Chris Mason; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

On Tue, 3 May 2005, Chris Mason wrote:

> your delta generator later this week.  Some quick and dirty space numbers to
> show why we need to pack the files together:

Are you accurately accounting for the cost of the extra hard/soft links 
your scheme requires?  Ie the directories get larger, lookups take 
slightly longer, etc.  Also access to a given file takes longer, and the 
deltas are referring to *other* packed files which *also* take longer to 
decompress and access...

How much better does delta-fication do, compared to just packing?
  --scott

NSA FJDEFLECT radar WASHTUB justice LCFLUTTER KUCLUB PBHISTORY Ft. Bragg 
ammunition immediate ESMERALDITE DC terrorist C4 SLBM affinity group
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 15:07       ` Linus Torvalds
@ 2005-05-03 16:09         ` Chris Mason
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-03 16:09 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Nicolas Pitre, Alon Ziv, git

On Tuesday 03 May 2005 11:07, Linus Torvalds wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > On the full import of all the bk->cvs changesets, the average file size
> > in .git is 4074 bytes.  73% of the files are 4096 bytes or smaller.
>
> Have you checked how many of those are blobs?
>
I've got cg-admin-lsobj running (effectively find .git -type f | xargs 
cat-file); it is taking a looong time, but the ratios seem to stay pretty 
constant as it makes progress:

total:  186863
blob:   93688   (6.6 per commit)
commit: 14172
tree:   79003   (5.5 per commit)

> For many commits, we generate as many (or more) _tree_ objects as we
> generate blobs.
>
> And tree objects from the same "supertree" really is something that I
> wouldn't mind packing some way, because they really tend to be very much
> related (since they refer to each other). Eg the commit and the top-level
> tree are almost always a pair, since you'd get a shared top-level tree
> only with two commits that have the exact same content (which definitely
> happens, don't get me wrong, but if we get some duplication for that case,
> we'd still be winning).
>

The packed item patch wouldn't duplicate info in this case.  When it initially 
creates the packed buffer (before compression), it checks for an existing 
file with the same sha1 and returns if one is found.  This is to preserve the 
optimizations for the write_tree case, where it frequently tries to create 
files that already exist.

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 15:57       ` C. Scott Ananian
@ 2005-05-03 16:35         ` Chris Mason
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-03 16:35 UTC (permalink / raw
  To: C. Scott Ananian; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

On Tuesday 03 May 2005 11:57, C. Scott Ananian wrote:
> On Tue, 3 May 2005, Chris Mason wrote:
> > your delta generator later this week.  Some quick and dirty space numbers
> > to show why we need to pack the files together:
>
> Are you accurately accounting for the cost of the extra hard/soft links
> your scheme requires?  Ie the directories get larger, lookups take
> slightly longer, etc.  Also access to a given file takes longer, and the
> deltas are referring to *other* packed files which *also* take longer to
> decompress and access...

My patch doesn't create any extra directory entries because the directory 
entry for the packed file is unlinked after all the hard links are made.  Even 
if I kept the packed file's directory entry, I'd be adding one directory entry 
and saving an average of 6-7 inodes per commit.

>
> How much better does delta-fication do, compared to just packing?

The best case for just packing is to pack the blobs, trees and commits all 
into one object.  Doing all three brought the tree down from 2.5GB to 1.57GB.  

The delta patch does pack trees together, but not into the same file as the 
blobs, and commits are not packed at all.  This is just because it is a pain 
to carry those changes around; it'll be easy to do later.

With the delta patch, the tree is around 900MB, I estimate packing the commits 
and trees into the blob files would save another 200MB.

Because space savings are so tightly coupled with packing ratios, a script to 
repack blobs, trees and commits from multiple commits will give much better 
compression.  Right now the patch does not delta trees or commits, but it 
might make sense to delta the trees via the repacking script.

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03 15:04           ` Nicolas Pitre
@ 2005-05-03 16:54             ` Chris Mason
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-03 16:54 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git

On Tuesday 03 May 2005 11:04, Nicolas Pitre wrote:
> On Tue, 3 May 2005, Chris Mason wrote:

> > coffee:~/git/linus.orig # echo foo > foo
> > coffee:~/git/linus.orig # echo foo2 > foo2
> > coffee:~/git/linus.orig # ./test-delta -d foo foo2 delta1
> > coffee:~/git/linus.orig # ls -la delta1
> > -rw-r--r--  1 root root 14 2005-05-03 10:36 delta1
> > coffee:~/git/linus.orig # ./test-delta -p foo delta1 out
> > *** glibc detected *** free(): invalid next size (fast): 0x0804b008 ***
>
> OK, doh!

Thanks, this one works ;)  I'll kick off a run with this replacing zdelta; it 
should be around 3 hours.  For my small tree run with 300 patches, it's faster 
than zdelta with about the same space savings.

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03 15:52     ` C. Scott Ananian
@ 2005-05-03 17:35       ` Linus Torvalds
  2005-05-03 18:10         ` Davide Libenzi
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2005-05-03 17:35 UTC (permalink / raw
  To: C. Scott Ananian; +Cc: Davide Libenzi, Alon Ziv, git



On Tue, 3 May 2005, C. Scott Ananian wrote:
> 
> Linus knows this.  His point is just to be sure you actually *code* that 
> walk in fsck, and (hopefully) do so w/o complicating the fsck too much.

Indeed. It's also a performance issue.

If you do xdelta objects, and don't tell fsck about it, then fsck will 
just check every object as a blob. Why is that bad?

Think about it: let's say that you have a series of xdelta objects, and a 
fsck that is xdelta-unaware. It will unpack each object independently, 
which means that it will keep on doing the same early xdelta work over and 
over and over again. Instead of just applying them in order, and checking 
the sha1 of the result at each point.

Now, you probably want to limit the length of the chains to some fairly 
small number anyway, so maybe that's not a big deal. Who knows. And I'm 
actually still so anal that I don't think I'd use this for _my_ tree, just 
because I'm a worry-wart (and I still think disk is incredibly cheap ;)

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: RFC: adding xdelta compression to git
  2005-05-03 17:35       ` Linus Torvalds
@ 2005-05-03 18:10         ` Davide Libenzi
  0 siblings, 0 replies; 32+ messages in thread
From: Davide Libenzi @ 2005-05-03 18:10 UTC (permalink / raw
  To: Linus Torvalds; +Cc: C. Scott Ananian, Alon Ziv, git

On Tue, 3 May 2005, Linus Torvalds wrote:

> On Tue, 3 May 2005, C. Scott Ananian wrote:
> > 
> > Linus knows this.  His point is just to be sure you actually *code* that 
> > walk in fsck, and (hopefully) do so w/o complicating the fsck too much.
> 
> Indeed. It's also a performance issue.
> 
> If you do xdelta objects, and don't tell fsck about it, then fsck will 
> just check every object as a blob. Why is that bad?
> 
> Think about it: let's say that you have a series of xdelta objects, and a 
> fsck that is xdelta-unaware. It will unpack each object independently, 
> which means that it will keep on doing the same early xdelta work over and 
> over and over again. Instead of just applying them in order, and checking 
> the sha1 of the result at each point.
> 
> Now, you probably want to limit the length of the chains to some fairly 
> small number anyway, so maybe that's not a big deal. Who knows. And I'm 
> actually still so anal that I don't think I'd use this for _my_ tree, just 
> because I'm a worry-wart (and I still think disk is incredibly cheap ;)

If you use a "full tip" metadata format with reverse deltas, you drop a 
"full" version "time to time" along the chain, and you keep a small index 
file, you have:

1) No matter how big the xdelta collection object becomes, you are only 
   touching very limited regions of it (thanks to the small index file, 
   which can be less than 20+8 bytes per entry in the xdelta blob)

2) Checkout happens w/out even doing xpatching (since the tip is full)

3) Checkins require only one xdelta operation (since the tip is full), 
   and zero if it is time to store a full version along the chain (I 
   usually drop one every 10-16 xdeltas, depending on the progressive size 
   of the delta operations)

4) Worst case performance in reconstructing histories is bounded by the 
   longest xdelta chain (10-16)

In some way I tend to agree (strangely ;) with you about the disk-cheap 
mantra, but network bandwidth matters IMO. So, if you do not want (being a 
real worry-wart) to use xdelta leverage on the FS trees, you can have way 
smarter network protocols using xdelta plus the knowledge of the git 
history structure. The rsync algo uses xdelta, but the poor guy is not 
able to leverage the knowledge of the history that only git knows. 
So, if Larry and Greg share a common object A, and Larry changes A and makes 
a new git object B, rsync will transfer the whole object B, because it 
does not have any idea of the git structure. Git, though, has this 
knowledge, and it can say to the remote fetcher: Look, I have this new 
thing called B, that is basically your thing A plus this very small xdelta 
(B-A). And typical xdelta diffs are really small (1/7 to 1/10 of classical 
'diff -u' ones).




- Davide


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
                       ` (2 preceding siblings ...)
  2005-05-03 14:48     ` Linus Torvalds
@ 2005-05-04 15:56     ` Chris Mason
  2005-05-04 16:12       ` C. Scott Ananian
  2005-05-04 21:47       ` Geert Bosch
  3 siblings, 2 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-04 15:56 UTC (permalink / raw
  To: Nicolas Pitre; +Cc: Linus Torvalds, Alon Ziv, git

[-- Attachment #1: Type: text/plain, Size: 2655 bytes --]

On Tuesday 03 May 2005 04:06, Nicolas Pitre wrote:
> On Mon, 2 May 2005, Linus Torvalds wrote:
> > If you do something like this, you want such a delta-blob to be named by
> > the sha1 of the result, so that things that refer to it can transparently
> > see either the original blob _or_ the "deltified" one, and will never
> > care.
>
> Yep, that's what I've done last weekend (and just made it actually
> work since people are getting interested).
>

My first run didn't go well: diff_delta generates an invalid delta when passed 
a buffer of length 0.  I really should not have been calling it this way, but 
it should do a quick check and return an error instead of something 
invalid ;)

I did two additional runs.  In the first, I fixed the delta chain length at 1, 
as in the zdelta patch.  In this mode, if it tried to diff against a delta, it 
would find the delta's parent and diff against that instead.  Even though 
zdelta applied patches at the same speed as xdiff(1), zdelta used 
significantly more cpu (53m vs 40m).

The next run was with the patch I've attached below; it allows chains up to 16 
deltas in length.

                   git     zdelta   xdiff(1)   xdiff(16)
apply              150m    117m     117m       104m
checkout           4m30s   3m41s    4m43s      7m11s
checkout (hot)     56s     12s      14s        16s
space usage        2.5G    1G       1.2G       800m

The longer delta chains trigger more random io on checkout, negating the speed 
improvements from the packed item patch.  The hot cache times show that xdiff 
isn't using a huge amount of cpu to patch things in, and so there's room for 
smarter packing and regenerating deltas in order to keep checkout times low.  
This patch still doesn't pack commits and trees in with the blob files, and 
it doesn't delta trees, and so I expect better space/speed numbers in later 
revs.

I won't be able to work on this until next week, but here's my plan:

1) update to current git.  My patch is from before the safe file generation 
changes.

2) change update-cache and write-tree so that packing/deltas are off by 
default.  Add --packed and --delta options to both.

3) create a git-pack tool that can pack/unpack existing changesets, trees and 
files, optionally adding/removing deltas.

My current code should preserve the delta object header used by Nicolas, and 
it removes all knowledge of deltas from the packed item headers.  This is not 
quite as efficient, but the resulting code is much cleaner.  I haven't tried, 
but it should be able to read a file created by his mkdelta.c.

-chris

[-- Attachment #2: delta-tree-3.diff --]
[-- Type: text/x-diff, Size: 32177 bytes --]

diff -urN --exclude .git linus.diff/cache.h linus.mine/cache.h
--- linus.diff/cache.h	2005-05-04 09:54:49.154735216 -0400
+++ linus.mine/cache.h	2005-05-04 08:47:28.618990016 -0400
@@ -64,6 +64,16 @@
 	char name[0];
 };
 
+struct packed_item {
+	/* length of compressed data */
+	unsigned long len;
+	struct packed_item *next;
+	/* sha1 of uncompressed data */
+	char sha1[20];
+	/* compressed data */
+	char *data;
+};
+
 #define CE_NAMEMASK  (0x0fff)
 #define CE_STAGEMASK (0x3000)
 #define CE_STAGESHIFT 12
@@ -119,8 +129,9 @@
 
 /* Read and unpack a sha1 file into memory, write memory to a sha1 file */
 extern void * map_sha1_file(const unsigned char *sha1, unsigned long *size);
-extern void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size);
+extern void * unpack_sha1_file(const unsigned char *sha1, void *map, unsigned long mapsize, char *type, unsigned long *size, int *chain);
 extern void * read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size);
+extern void * read_sha1_delta_ref(const unsigned char *sha1, char *type, unsigned long *size, int *chain);
 extern int write_sha1_file(char *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 
 extern int check_sha1_signature(unsigned char *sha1, void *buf, unsigned long size, const char *type);
@@ -135,6 +146,10 @@
 /* Convert to/from hex/sha1 representation */
 extern int get_sha1_hex(const char *hex, unsigned char *sha1);
 extern char *sha1_to_hex(const unsigned char *sha1);	/* static buffer result! */
+extern int pack_sha1_buffer(void *buf, unsigned long buf_len, char *type,
+                            unsigned char *returnsha1, unsigned char *refsha1, 
+			    struct packed_item **);
+int write_packed_buffer(struct packed_item *head);
 
 /* General helper functions */
 extern void usage(const char *err);
diff -urN --exclude .git linus.diff/delta.h linus.mine/delta.h
--- linus.diff/delta.h	1969-12-31 19:00:00.000000000 -0500
+++ linus.mine/delta.h	2005-05-03 08:22:32.000000000 -0400
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
diff -urN --exclude .git linus.diff/diff-delta.c linus.mine/diff-delta.c
--- linus.diff/diff-delta.c	1969-12-31 19:00:00.000000000 -0500
+++ linus.mine/diff-delta.c	2005-05-03 12:40:58.000000000 -0400
@@ -0,0 +1,315 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 4096;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size; from_size >>= 8;
+	out[outpos++] = from_size;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size; to_size >>= 8;
+	out[outpos++] = to_size;
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[outpos - inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
diff -urN --exclude .git linus.diff/fsck-cache.c linus.mine/fsck-cache.c
--- linus.diff/fsck-cache.c	2005-05-04 09:54:49.158734608 -0400
+++ linus.mine/fsck-cache.c	2005-05-04 08:44:49.778137496 -0400
@@ -142,7 +142,7 @@
 		if (map) {
 			char type[100];
 			unsigned long size;
-			void *buffer = unpack_sha1_file(map, mapsize, type, &size);
+			void *buffer = unpack_sha1_file(sha1, map, mapsize, type, &size, NULL);
 			if (!buffer)
 				return -1;
 			if (check_sha1_signature(sha1, buffer, size, type) < 0)
diff -urN --exclude .git linus.diff/git-mktag.c linus.mine/git-mktag.c
--- linus.diff/git-mktag.c	2005-05-04 09:54:49.158734608 -0400
+++ linus.mine/git-mktag.c	2005-05-04 08:45:04.763859320 -0400
@@ -31,7 +31,7 @@
 	if (map) {
 		char type[100];
 		unsigned long size;
-		void *buffer = unpack_sha1_file(map, mapsize, type, &size);
+		void *buffer = unpack_sha1_file(sha1, map, mapsize, type, &size, NULL);
 
 		if (buffer) {
 			if (!strcmp(type, expected_type))
diff -urN --exclude .git linus.diff/Makefile linus.mine/Makefile
--- linus.diff/Makefile	2005-05-04 09:54:49.153735368 -0400
+++ linus.mine/Makefile	2005-05-03 16:28:15.000000000 -0400
@@ -25,7 +25,8 @@
 install: $(PROG) $(SCRIPTS)
 	install $(PROG) $(SCRIPTS) $(HOME)/bin/
 
-LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o
+LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o diff-delta.o \
+	 patch-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h
 
diff -urN --exclude .git linus.diff/patch-delta.c linus.mine/patch-delta.c
--- linus.diff/patch-delta.c	1969-12-31 19:00:00.000000000 -0500
+++ linus.mine/patch-delta.c	2005-05-04 09:57:53.520707328 -0400
@@ -0,0 +1,80 @@
+/*
+ * patch-delta.c:
+ * recreate a buffer from a source and the delta produced by diff-delta.c
+ *
+ * (C) 2005 Nicolas Pitre <nico@cam.org>
+ *
+ * This code is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include "delta.h"
+
+void *patch_delta(void *src_buf, unsigned long src_size,
+		  void *delta_buf, unsigned long delta_size,
+		  unsigned long *dst_size)
+{
+	const unsigned char *data, *top;
+	unsigned char *dst, *out;
+	int size;
+
+	/* the smallest delta size possible is 10 bytes */
+	if (delta_size < 10) {
+		return NULL;
+	}
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	/* make sure the orig file size matches what we expect */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	if (size != src_size) {
+		return NULL;
+	}
+	/* now the result size */
+	size = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
+	data += 4;
+	dst = malloc(size);
+	if (!dst) {
+		return NULL;
+	}
+	out = dst;
+	while (data < top) {
+		unsigned char cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned int cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+			memcpy(out, src_buf + cp_off, cp_size);
+			out += cp_size;
+			if (out > dst + size) {
+				*(char *)0 = 0;
+			}
+		} else {
+			memcpy(out, data, cmd);
+			out += cmd;
+			data += cmd;
+			if (out > dst + size) {
+				*(char *)0 = 0;
+			}
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out - dst != size) {
+		free(dst);
+		return NULL;
+	}
+
+	*dst_size = size;
+	return dst;
+}
diff -urN --exclude .git linus.diff/sha1_file.c linus.mine/sha1_file.c
--- linus.diff/sha1_file.c	2005-05-04 09:54:49.165733544 -0400
+++ linus.mine/sha1_file.c	2005-05-04 10:42:07.002316808 -0400
@@ -8,6 +8,7 @@
  */
 #include <stdarg.h>
 #include "cache.h"
+#include "delta.h"
 
 const char *sha1_file_directory = NULL;
 
@@ -139,39 +140,117 @@
 	return map;
 }
 
-void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size)
+/*
+ * Looks through buf for the header entry corresponding to sha1.  Returns
+ * 0 if an entry is found and sets offset to the offset of the packed item
+ * in the file.  The offset is relative to the start of the packed items,
+ * so you have to add in the length of the header before using it.
+ * -1 is returned if the sha1 could not be found.
+ */
+static int find_packed_header(const unsigned char *sha1, char *buf, unsigned long buf_len, unsigned long *offset)
+{
+	char *p;
+	p = buf;
+
+	*offset = 0;
+	while(p < buf + buf_len) {
+		unsigned long item_len;
+		unsigned char item_sha[20];
+		memcpy(item_sha, p, 20);
+		sscanf(p + 20, "%lu", &item_len);
+		p += 20 + strlen(p + 20) + 1;
+		if (memcmp(item_sha, sha1, 20) == 0)
+			return 0;
+		*offset += item_len;
+	}
+	return -1;
+}
+
+
+/*
+ * uncompresses a data segment without any extra delta/packed processing
+ */
+static void * _unpack_sha1_file(z_stream *stream, void *map, 
+                                unsigned long mapsize, char *type, 
+				unsigned long *size)
 {
 	int ret, bytes;
-	z_stream stream;
 	char buffer[8192];
 	char *buf;
 
 	/* Get the data stream */
-	memset(&stream, 0, sizeof(stream));
-	stream.next_in = map;
-	stream.avail_in = mapsize;
-	stream.next_out = buffer;
-	stream.avail_out = sizeof(buffer);
-
-	inflateInit(&stream);
-	ret = inflate(&stream, 0);
-	if (ret < Z_OK)
+	memset(stream, 0, sizeof(*stream));
+	stream->next_in = map;
+	stream->avail_in = mapsize;
+	stream->next_out = buffer;
+	stream->avail_out = sizeof(buffer);
+
+	inflateInit(stream);
+	ret = inflate(stream, 0);
+	if (ret < Z_OK) {
 		return NULL;
-	if (sscanf(buffer, "%10s %lu", type, size) != 2)
+	}
+	if (sscanf(buffer, "%10s %lu", type, size) != 2) {
 		return NULL;
-
+	}
 	bytes = strlen(buffer) + 1;
 	buf = xmalloc(*size);
 
-	memcpy(buf, buffer + bytes, stream.total_out - bytes);
-	bytes = stream.total_out - bytes;
+	memcpy(buf, buffer + bytes, stream->total_out - bytes);
+	bytes = stream->total_out - bytes;
 	if (bytes < *size && ret == Z_OK) {
-		stream.next_out = buf + bytes;
-		stream.avail_out = *size - bytes;
-		while (inflate(&stream, Z_FINISH) == Z_OK)
+		stream->next_out = buf + bytes;
+		stream->avail_out = *size - bytes;
+		while (inflate(stream, Z_FINISH) == Z_OK)
 			/* nothing */;
 	}
-	inflateEnd(&stream);
+	inflateEnd(stream);
+	return buf;
+}
+
+void * unpack_sha1_file(const unsigned char *sha1, void *map, 
+			unsigned long mapsize, char *type, unsigned long *size, 
+			int *chain)
+{
+	z_stream stream;
+	char *buf;
+	unsigned long offset;
+	unsigned long header_len;
+	buf = _unpack_sha1_file(&stream, map, mapsize, type, size);
+	if (!buf)
+		return buf;
+	if (!strcmp(type, "delta")) {
+		char *delta_ref;
+		unsigned long delta_size;
+		char *newbuf;
+		unsigned long newsize;
+		if (chain)
+			*chain += 1;
+		delta_ref = read_sha1_delta_ref(buf, type, &delta_size, chain);
+		if (!delta_ref) {
+			free(buf);
+			return NULL;
+		}
+		newbuf = patch_delta(delta_ref, delta_size, buf+20, *size-20, &newsize);
+		free(buf);
+		free(delta_ref);
+		*size = newsize;
+		return newbuf;
+
+	} else if (!strcmp(type, "packed")) {
+		if (!sha1) {
+			free(buf);
+			return NULL;
+		}
+		header_len = *size;
+		if (find_packed_header(sha1, buf, header_len, &offset)) {
+			free(buf);
+			return NULL;
+		}
+		offset += stream.total_in;
+		free(buf);
+		buf = unpack_sha1_file(sha1, map+offset, mapsize-offset, type, size, chain);
+	}
 	return buf;
 }
 
@@ -182,7 +261,25 @@
 
 	map = map_sha1_file(sha1, &mapsize);
 	if (map) {
-		buf = unpack_sha1_file(map, mapsize, type, size);
+		buf = unpack_sha1_file(sha1, map, mapsize, type, size, NULL);
+		munmap(map, mapsize);
+		return buf;
+	}
+	return NULL;
+}
+
+/*
+ * the same as read_sha1_file except chain is used to count the length
+ * of any delta chains hit while unpacking
+ */
+void * read_sha1_delta_ref(const unsigned char *sha1, char *type, unsigned long *size, int *chain)
+{
+	unsigned long mapsize;
+	void *map, *buf;
+
+	map = map_sha1_file(sha1, &mapsize);
+	if (map) {
+		buf = unpack_sha1_file(sha1, map, mapsize, type, size, chain);
 		munmap(map, mapsize);
 		return buf;
 	}
@@ -413,3 +510,306 @@
 		return 1;
 	return 0;
 }
+
+static void *pack_buffer(void *buf, unsigned long buf_len, char *metadata, int metadata_size, unsigned long *compsize)
+{
+	char *compressed;
+	z_stream stream;
+	unsigned long size;
+
+	/* Set it up */
+	memset(&stream, 0, sizeof(stream));
+	size = deflateBound(&stream, buf_len + metadata_size);
+	compressed = xmalloc(size);
+
+	/*
+	 * Compress the metadata header (type + size) first
+	 */
+	stream.next_in = metadata;
+	stream.avail_in = metadata_size;
+	stream.next_out = compressed;
+	stream.avail_out = size;
+	deflateInit(&stream, Z_BEST_COMPRESSION);
+	while (deflate(&stream, 0) == Z_OK)
+		/* nothing */;
+
+	stream.next_in = buf;
+	stream.avail_in = buf_len;
+	/* Compress it */
+	while (deflate(&stream, Z_FINISH) == Z_OK)
+		/* nothing */;
+	deflateEnd(&stream);
+	size = stream.total_out;
+	*compsize = size;
+	return compressed;
+}
+
+/*
+ * generates a delta for buf against refsha1 and returns a compressed buffer
+ * with the results.  NULL is returned on error, or when the delta could
+ * not be done.  This might happen if the delta is larger than either the
+ * refsha1 or the buffer, or the delta chain is too long.
+ */
+static void *pack_delta_buffer(void *buf, unsigned long buf_len, char *metadata, int metadata_size, unsigned long *compsize, unsigned char *sha1, unsigned char *refsha1)
+{
+	char *compressed;
+	char *refbuffer = NULL;
+	char reftype[20];
+	unsigned long refsize = 0;
+	char *delta;
+	unsigned long delta_size;
+	char *lmetadata = xmalloc(220);
+	unsigned long lmetadata_size;
+	int chain_length = 0;
+
+	if (buf_len == 0)
+		return NULL;
+	refbuffer = read_sha1_delta_ref(refsha1, reftype, &refsize, &chain_length);
+
+	/* note, we could just continue without the delta here */
+	if (!refbuffer)
+		return NULL;
+
+	if (chain_length > 16) {
+		free(refbuffer);
+		return NULL;
+	}
+	delta = diff_delta(refbuffer, refsize, buf, buf_len, &delta_size);
+	free(refbuffer);
+	if (!delta)
+		return NULL;
+	if (delta_size > refsize || delta_size > buf_len) {
+		free(delta);
+		return NULL;
+	}
+	if (delta_size < 10) {
+		free(delta);
+		return NULL;
+	}
+	lmetadata_size = 1 + sprintf(lmetadata, "%s %lu", "delta", delta_size + 20);
+	memcpy(lmetadata + lmetadata_size, refsha1, 20);
+	lmetadata_size += 20;
+	compressed = pack_buffer(delta, delta_size, lmetadata, lmetadata_size, compsize);
+	free(lmetadata);
+	free(delta);
+	return compressed;
+}
+
+/*
+ * returns a newly malloc'd packed item with a compressed buffer for buf.  
+ * If refsha1 is non-null, attempts a delta against it.  The sha1 of buf 
+ * is returned via returnsha1.
+ */
+int pack_sha1_buffer(void *buf, unsigned long buf_len, char *type,
+		     unsigned char *returnsha1,
+		     unsigned char *refsha1,
+		     struct packed_item **packed_item)
+{
+	unsigned char sha1[20];
+	SHA_CTX c;
+	char *filename;
+	struct stat st;
+	char *compressed = NULL;
+	unsigned long size;
+	struct packed_item *item;
+	char *metadata = xmalloc(200);
+	int metadata_size;
+
+	*packed_item = NULL;
+
+	metadata_size = 1 + sprintf(metadata, "%s %lu", type, buf_len);
+
+	/* Sha1.. */
+	SHA1_Init(&c);
+	SHA1_Update(&c, metadata, metadata_size);
+	SHA1_Update(&c, buf, buf_len);
+	SHA1_Final(sha1, &c);
+
+	if (returnsha1)
+		memcpy(returnsha1, sha1, 20);
+
+	filename = sha1_file_name(sha1);
+	if (stat(filename, &st) == 0)
+		goto out;
+
+	if (refsha1) {
+		compressed = pack_delta_buffer(buf, buf_len, metadata, 
+		                               metadata_size, &size, sha1, 
+					       refsha1);
+	}
+	if (!compressed) {
+		compressed = pack_buffer(buf, buf_len, metadata, 
+		                         metadata_size, &size);
+	}
+	free(metadata);
+	if (!compressed) {
+		return -1;
+	}
+	item = xmalloc(sizeof(struct packed_item));
+	memcpy(item->sha1, sha1, 20);
+	item->len = size;
+	item->next = NULL;
+	item->data = compressed;
+	*packed_item = item;
+out:
+	return 0;
+}
+
+static char *create_packed_header(struct packed_item *head, unsigned long *size)
+{
+	char *metadata = NULL;
+	int metadata_size = 0;
+	int entry_size = 0;
+
+	*size = 0;
+
+	while (head) {
+		char *p;
+		char *new_metadata = realloc(metadata, metadata_size + 220);
+		if (!new_metadata) {
+			free(metadata);
+			return NULL;
+		}
+		metadata = new_metadata;
+		p = metadata+metadata_size;
+		memcpy(p, head->sha1, 20);
+		p += 20;
+		entry_size = 1 + sprintf(p, "%lu", head->len);
+		metadata_size += entry_size + 20;
+		head = head->next;
+	}
+	*size = metadata_size;
+	return metadata;
+}
+
+#define WRITE_BUFFER_SIZE 8192
+static char write_buffer[WRITE_BUFFER_SIZE];
+static unsigned long write_buffer_len;
+
+static int c_write(int fd, void *data, unsigned int len)
+{
+	while (len) {
+		unsigned int buffered = write_buffer_len;
+		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
+		if (partial > len)
+			partial = len;
+		memcpy(write_buffer + buffered, data, partial);
+		buffered += partial;
+		if (buffered == WRITE_BUFFER_SIZE) {
+			if (write(fd, write_buffer, WRITE_BUFFER_SIZE) != WRITE_BUFFER_SIZE)
+				return -1;
+			buffered = 0;
+		}
+		write_buffer_len = buffered;
+		len -= partial;
+		data += partial;
+ 	}
+ 	return 0;
+}
+
+static int c_flush(int fd)
+{
+	if (write_buffer_len) {
+		int left = write_buffer_len;
+		if (write(fd, write_buffer, left) != left)
+			return -1;
+		write_buffer_len = 0;
+	}
+	return 0;
+}
+
+/*
+ * creates a new packed file for all the items in head.  Hard links are
+ * made from the sha1 of each item back to the packed file, and then
+ * the packed file is unlinked.
+ */
+int write_packed_buffer(struct packed_item *head)
+{
+	unsigned char sha1[20];
+	SHA_CTX c;
+	char *filename;
+	char *metadata = xmalloc(200);
+	char *header;
+	int metadata_size;
+	int fd;
+	int ret = 0;
+	unsigned long header_len;
+	struct packed_item *item;
+	char *compressed;
+	z_stream stream;
+	unsigned long size;
+
+	header = create_packed_header(head, &header_len);
+	metadata_size = 1+sprintf(metadata, "packed %lu", header_len);
+	/* 
+	 * the header contains the sha1 of each item, so we only sha1 the
+	 * header
+	 */ 
+	SHA1_Init(&c);
+	SHA1_Update(&c, metadata, metadata_size);
+	SHA1_Update(&c, header, header_len);
+	SHA1_Final(sha1, &c);
+
+	filename = strdup(sha1_file_name(sha1));
+	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
+	if (fd < 0) {
+		/* add collision check! */
+		if (errno != EEXIST) {
+			ret = -errno;
+		}
+		goto out_nofile;
+	}
+	/* compress just the header info */
+	memset(&stream, 0, sizeof(stream));
+	deflateInit(&stream, Z_BEST_COMPRESSION);
+	size = deflateBound(&stream, header_len + metadata_size);
+	compressed = xmalloc(size);
+
+	stream.next_in = metadata;
+	stream.avail_in = metadata_size;
+	stream.next_out = compressed;
+	stream.avail_out = size;
+	while (deflate(&stream, 0) == Z_OK)
+		/* nothing */;
+	stream.next_in = header;
+	stream.avail_in = header_len;
+	while (deflate(&stream, Z_FINISH) == Z_OK)
+		/* nothing */;
+	deflateEnd(&stream);
+	size = stream.total_out;
+
+	if (c_write(fd, compressed, size)) {
+		free(compressed);
+		ret = -EIO;
+		goto out;
+	}
+	free(compressed);
+
+	item = head;
+	while(item) {
+		if (c_write(fd, item->data, item->len)) {
+			ret = -EIO;
+			goto out;
+		}
+		item = item->next;
+	}
+	if (c_flush(fd)) {
+		ret = -EIO;
+		goto out;
+	}
+	item = head;
+	while(item) {
+		char *item_file;
+		struct packed_item *next = item->next;
+		item_file = sha1_file_name(item->sha1);
+		if (link(filename, item_file) && errno != EEXIST) {
+			ret = -errno;
+			break;
+		}
+		free(item->data);
+		free(item);
+		item = next;
+	}
+	unlink(filename);
+out:
+	close(fd);
+out_nofile:
+	free(header);
+	free(metadata);
+	free(filename);
+	return ret;
+}
diff -urN --exclude .git linus.diff/update-cache.c linus.mine/update-cache.c
--- linus.diff/update-cache.c	2005-05-04 09:54:49.167733240 -0400
+++ linus.mine/update-cache.c	2005-05-02 20:51:32.000000000 -0400
@@ -31,55 +31,39 @@
 	return (unsigned long)ptr > (unsigned long)-1000L;
 }
 
-static int index_fd(unsigned char *sha1, int fd, struct stat *st)
+static int index_fd(unsigned char *sha1, unsigned char *refsha1, int fd, struct stat *st, struct packed_item **head, struct packed_item **tail, unsigned long *packed_size, int *packed_nr)
 {
-	z_stream stream;
 	unsigned long size = st->st_size;
-	int max_out_bytes = size + 200;
-	void *out = xmalloc(max_out_bytes);
-	void *metadata = xmalloc(200);
-	int metadata_size;
 	void *in;
-	SHA_CTX c;
+	int ret;
+	struct packed_item *new_item;
 
 	in = "";
 	if (size)
 		in = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	if (!out || (int)(long)in == -1)
+	if ((int)(long)in == -1) {
 		return -1;
-
-	metadata_size = 1+sprintf(metadata, "blob %lu", size);
-
-	SHA1_Init(&c);
-	SHA1_Update(&c, metadata, metadata_size);
-	SHA1_Update(&c, in, size);
-	SHA1_Final(sha1, &c);
-
-	memset(&stream, 0, sizeof(stream));
-	deflateInit(&stream, Z_BEST_COMPRESSION);
-
-	/*
-	 * ASCII size + nul byte
-	 */	
-	stream.next_in = metadata;
-	stream.avail_in = metadata_size;
-	stream.next_out = out;
-	stream.avail_out = max_out_bytes;
-	while (deflate(&stream, 0) == Z_OK)
-		/* nothing */;
-
-	/*
-	 * File content
-	 */
-	stream.next_in = in;
-	stream.avail_in = size;
-	while (deflate(&stream, Z_FINISH) == Z_OK)
-		/*nothing */;
-
-	deflateEnd(&stream);
-	
-	return write_sha1_buffer(sha1, out, stream.total_out);
+	}
+	ret = pack_sha1_buffer(in, size, "blob", sha1, refsha1, &new_item);
+	if (new_item) {
+		if (*tail)
+			(*tail)->next = new_item;
+		*tail = new_item;
+		if (!*head)
+			*head = new_item;
+		*packed_size += new_item->len;
+		(*packed_nr)++;
+		if (*packed_size > (512 * 1024) || *packed_nr > 1024) {
+			write_packed_buffer(*head);
+			*head = NULL;
+			*tail = NULL;
+			*packed_size = 0;
+			*packed_nr = 0;
+		}
+	}
+	munmap(in, size);
+	return ret;
 }
 
 /*
@@ -102,12 +86,14 @@
 	ce->ce_size = htonl(st->st_size);
 }
 
-static int add_file_to_cache(char *path)
+static int add_file_to_cache(char *path, struct packed_item **packed_head, struct packed_item **packed_tail, unsigned long *packed_size, int *packed_nr)
 {
 	int size, namelen;
 	struct cache_entry *ce;
 	struct stat st;
 	int fd;
+	int pos;
+	unsigned char *refsha1 = NULL;
 
 	fd = open(path, O_RDONLY);
 	if (fd < 0) {
@@ -129,8 +115,12 @@
 	fill_stat_cache_info(ce, &st);
 	ce->ce_mode = create_ce_mode(st.st_mode);
 	ce->ce_flags = htons(namelen);
+	pos = cache_name_pos(ce->name, namelen);
+	if (pos >= 0)
+		refsha1 = active_cache[pos]->sha1;
 
-	if (index_fd(ce->sha1, fd, &st) < 0)
+	if (index_fd(ce->sha1, refsha1, fd, &st, packed_head, 
+		     packed_tail, packed_size, packed_nr) < 0)
 		return -1;
 
 	return add_cache_entry(ce, allow_add);
@@ -311,6 +301,10 @@
 	int allow_options = 1;
 	static char lockfile[MAXPATHLEN+1];
 	const char *indexfile = get_index_file();
+	struct packed_item *packed_head = NULL;
+	struct packed_item *packed_tail = NULL;
+	unsigned long packed_size = 0;
+	int packed_nr = 0;
 
 	snprintf(lockfile, sizeof(lockfile), "%s.lock", indexfile);
 
@@ -362,8 +356,13 @@
 			fprintf(stderr, "Ignoring path %s\n", argv[i]);
 			continue;
 		}
-		if (add_file_to_cache(path))
+		if (add_file_to_cache(path, &packed_head, &packed_tail, &packed_size, &packed_nr))
 			die("Unable to add %s to database", path);
+
+	}
+	if (packed_head) {
+		if (write_packed_buffer(packed_head))
+			die("write packed buffer failed");
 	}
 	if (write_cache(newfd, active_cache, active_nr) || rename(lockfile, indexfile))
 		die("Unable to write new cachefile");
diff -urN --exclude .git linus.diff/write-tree.c linus.mine/write-tree.c
--- linus.diff/write-tree.c	2005-05-04 09:54:49.167733240 -0400
+++ linus.mine/write-tree.c	2005-05-02 20:51:32.000000000 -0400
@@ -5,24 +5,13 @@
  */
 #include "cache.h"
 
-static int check_valid_sha1(unsigned char *sha1)
-{
-	char *filename = sha1_file_name(sha1);
-	int ret;
-
-	/* If we were anal, we'd check that the sha1 of the contents actually matches */
-	ret = access(filename, R_OK);
-	if (ret)
-		perror(filename);
-	return ret;
-}
-
-static int write_tree(struct cache_entry **cachep, int maxentries, const char *base, int baselen, unsigned char *returnsha1)
+static int write_tree(struct cache_entry **cachep, int maxentries, const char *base, int baselen, unsigned char *returnsha1, struct packed_item **head)
 {
 	unsigned char subdir_sha1[20];
 	unsigned long size, offset;
 	char *buffer;
 	int nr;
+	struct packed_item *item;
 
 	/* Guess at some random initial size */
 	size = 8192;
@@ -50,7 +39,7 @@
 		if (dirname) {
 			int subdir_written;
 
-			subdir_written = write_tree(cachep + nr, maxentries - nr, pathname, dirname-pathname+1, subdir_sha1);
+			subdir_written = write_tree(cachep + nr, maxentries - nr, pathname, dirname-pathname+1, subdir_sha1, head);
 			nr += subdir_written;
 
 			/* Now we need to write out the directory entry into this tree.. */
@@ -62,9 +51,6 @@
 			sha1 = subdir_sha1;
 		}
 
-		if (check_valid_sha1(sha1) < 0)
-			exit(1);
-
 		entrylen = pathlen - baselen;
 		if (offset + entrylen + 100 > size) {
 			size = alloc_nr(offset + entrylen + 100);
@@ -77,7 +63,11 @@
 		nr++;
 	} while (nr < maxentries);
 
-	write_sha1_file(buffer, offset, "tree", returnsha1);
+	pack_sha1_buffer(buffer, offset, "tree", returnsha1, NULL, &item);
+	if (item) {
+		item->next = *head;
+		*head = item;
+	}
 	free(buffer);
 	return nr;
 }
@@ -87,6 +77,7 @@
 	int i, unmerged;
 	int entries = read_cache();
 	unsigned char sha1[20];
+	struct packed_item *head = NULL;
 
 	if (entries <= 0)
 		die("write-tree: no cache contents to write");
@@ -107,8 +98,12 @@
 		die("write-tree: not able to write tree");
 
 	/* Ok, write it out */
-	if (write_tree(active_cache, entries, "", 0, sha1) != entries)
+	if (write_tree(active_cache, entries, "", 0, sha1, &head) != entries)
 		die("write-tree: internal error");
+	if (head) {
+		if (write_packed_buffer(head))
+			die("write_packed_buffer error");
+	}
 	printf("%s\n", sha1_to_hex(sha1));
 	return 0;
 }

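[Editorial note: the patch above gives an inflated "delta" object a simple layout: the usual ASCII type/size header, then 20 raw bytes naming the reference object, then the delta payload; this is what unpack_sha1_file reads at buf and buf+20. A minimal stand-alone parser for that body, with struct and function names of my own choosing, might look like:]

```c
#include <string.h>

/* Split the body of an inflated "delta" object (everything after the
 * "delta <size>\0" header) into its two parts, mirroring what
 * unpack_sha1_file does with buf and buf+20 in the patch above.
 * Names here are illustrative, not from the patch itself. */
struct delta_body {
	unsigned char ref_sha1[20];	/* binary sha1 of the base object */
	const unsigned char *payload;	/* the delta data itself */
	unsigned long payload_size;
};

static int parse_delta_body(const unsigned char *buf, unsigned long size,
			    struct delta_body *out)
{
	if (size < 20)
		return -1;		/* too short to even hold a sha1 */
	memcpy(out->ref_sha1, buf, 20);
	out->payload = buf + 20;
	out->payload_size = size - 20;
	return 0;
}
```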
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 15:56     ` Chris Mason
@ 2005-05-04 16:12       ` C. Scott Ananian
  2005-05-04 17:44         ` Chris Mason
  2005-05-04 21:47       ` Geert Bosch
  1 sibling, 1 reply; 32+ messages in thread
From: C. Scott Ananian @ 2005-05-04 16:12 UTC (permalink / raw
  To: Chris Mason; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

On Wed, 4 May 2005, Chris Mason wrote:

> 3) create a git-pack tool that can pack/unpack existing changesets,trees and
> files, optionally adding/removing deltas.

A 'git-pull' tool might be more use.  I can imagine Linus maintaining his 
local tree uncompressed, but the 'kernel.org' tree set up to 
git-pull-delta from him every hour or whatever, so that the 
network-accessible version is always network-efficient.  'git-pack'
would then simplify to a git-pull-delta from an existing local repository.

Ideally, you'd also be able to git-pull from a network packed repository 
and (transparently) unpack and undelta-fy the pulled files as they're 
added to your local repo.  This would keep Linus from accidentally getting 
packed files in his tree when he pulled from a maintainer's 
packed/delta-ed network-accessible tree.

I'd also be interested in seeing the speed/space numbers for some other 
delta chain lengths between 1 and 16.  Maybe some intermediate point is 
optimal.  [Also, limiting delta chains to a certain number of other 
*packed* objects -- instead of just 'objects' -- might be an improvement. 
Right now you're packing entire commits together, right?  Maybe defining a 
delta chain as 'N other commits max' might improve i/o performance, as 
you'd just have to keep N other unpacked files around, instead of an 
arbitrary number.]
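[Editorial note: the tradeoff being probed here can be put in back-of-the-envelope form: reading an object at delta depth d means inflating d+1 objects and applying d patches, so worst-case read cost grows linearly with the cap while the space win flattens out. A toy model, with made-up cost constants rather than measurements:]

```c
/* Toy cost model for the chain-length discussion above: with a depth
 * cap of max_depth, the worst-case read inflates max_depth + 1 objects
 * and applies max_depth patches.  Constants are illustrative only. */
static double worst_case_read_cost(int max_depth,
				   double inflate_cost, double patch_cost)
{
	return (max_depth + 1) * inflate_cost + max_depth * patch_cost;
}
```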
  --scott

SSBN 743 BOND ESGAIN SUMAC ZPSECANT MHCHAOS Castro Flintlock payment 
anthrax SCRANTON PLO MKNAOMI DNC AVBLIMP RUFUS Secretary AK-47 Noriega
                          ( http://cscott.net/ )


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 16:12       ` C. Scott Ananian
@ 2005-05-04 17:44         ` Chris Mason
  2005-05-04 22:03           ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2005-05-04 17:44 UTC (permalink / raw
  To: C. Scott Ananian; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

On Wednesday 04 May 2005 12:12, C. Scott Ananian wrote:
> On Wed, 4 May 2005, Chris Mason wrote:
> > 3) create a git-pack tool that can pack/unpack existing changesets,trees
> > and files, optionally adding/removing deltas.
>
> A 'git-pull' tool might be more use.  I can imagine Linus maintaining his
> local tree uncompressed, but the 'kernel.org' tree set up to
> git-pull-delta from him every hour or whatever, so that the
> network-accessible version is always network-efficient.  'git-pack'
> would then simplify to a git-pull-delta from an existing local repository.

I had pictured the opposite, after the regular pull you would run git-pack to 
redelta/pack things to your local settings.  There's a ton of things to try 
in git pull for delta transmission, and I'd rather keep it separate from the 
local repository packing for now.

Later on if both tools are working properly we could look at combining them.

>
> Ideally, you'd also be able to git-pull from a network packed repository
> and (transparently) unpack and undelta-fy the pulled files as they're
> added to your local repo.  This would keep Linus from accidentally getting
> packed files in his tree when he pulled from a maintainer's
> packed/delta-ed network-accessible tree.
>
Yes, if Linus does take the patches, it's really important for people to be 
able to easily continue without deltas/packing if they want.

> I'd also be interested in seeing the speed/space numbers for some other
> delta chain lengths between 1 and 16.  Maybe some intermediate point is
> optimal.  [Also, limiting delta chains to a certain number of other
> *packed* objects -- instead of just 'objects' -- might be an improvement.
> Right now you're packing entire commits together, right?  Maybe defining a
> delta chain as 'N other commits max' might improve i/o performance, as
> you'd just have to keep N other unpacked files around, instead of an
> arbitrary number.]

It's easy to do additional runs, but I would rather do this kind of fine 
tuning with the git-pack tool.  That way we could experiment with packing 
multiple commits into a single file, deltas on tree objects, or even 
combining commits based on which files they modify.

-chris


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 15:56     ` Chris Mason
  2005-05-04 16:12       ` C. Scott Ananian
@ 2005-05-04 21:47       ` Geert Bosch
  2005-05-04 22:34         ` Chris Mason
  1 sibling, 1 reply; 32+ messages in thread
From: Geert Bosch @ 2005-05-04 21:47 UTC (permalink / raw
  To: Chris Mason; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

From your tests it would seem that the zdelta version is the only one
to provide a uniform improvement over plain git. As it also seems the
simplest approach, I wonder why the consensus is that using xdiff
would be better?

   -Geert

On May 4, 2005, at 11:56, Chris Mason wrote:
> I did two additional runs, first where I fixed the delta chain length at 1
> as in the zdelta patch.  In this mode, if it tried to diff against a delta
> it would find the delta's parent and diff against that instead.  Even
> though zdelta had the same speeds for applying patches as xdiff(1), zdelta
> used significantly more cpu (53m vs 40m).
>
> The next run was with the patch I've attached below, it allows chains up
> to 16 deltas in length.
>
>                     git      zdelta    xdiff(1)   xdiff(16)
> apply               150m     117m      117m       104m
> checkout            4m30s    3m41s     4m43s      7m11s
> checkout (hot)      56s      12s       14s        16s
> space usage         2.5G     1G        1.2G       800m
>



* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 17:44         ` Chris Mason
@ 2005-05-04 22:03           ` Linus Torvalds
  2005-05-04 22:43             ` Chris Mason
  2005-05-05  3:25             ` Nicolas Pitre
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2005-05-04 22:03 UTC (permalink / raw
  To: Chris Mason; +Cc: C. Scott Ananian, Nicolas Pitre, Alon Ziv, git



On Wed, 4 May 2005, Chris Mason wrote:
>
> Yes, if Linus does take the patches, it's really important for people to be 
> able to easily continue without deltas/packing if they want.

I'll happily take the patch and just not use the delta packing myself (at 
least until I trust it). But before I take the patch I want to make sure 
that people agree on it, and that it's been tested well enough that it 
won't cause people to corrupt their repositories.

For example, I do _not_ want to be in the situation SVN is in, where if
you corrupt your SVN database, you're totally screwed. There's a real
advantage to not having fancy data structures or complicated consistency
rules.

			Linus


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 21:47       ` Geert Bosch
@ 2005-05-04 22:34         ` Chris Mason
  2005-05-05  3:10           ` Nicolas Pitre
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2005-05-04 22:34 UTC (permalink / raw
  To: Geert Bosch; +Cc: Nicolas Pitre, Linus Torvalds, Alon Ziv, git

On Wednesday 04 May 2005 17:47, Geert Bosch wrote:
>  From your tests it would seem that the zdelta version is the only one
> to provide a uniform improvement over plain git. As it also seems the
> simplest approach, I wonder why the consensus is that using xdiff
> would be better?

zdelta seems to be a research project.  It does compress better than the xdiff 
lib, but the speed improvements against xdiff(1) are probably because the 
resulting tree is smaller.  I favor the xdiff code because it's so much 
smaller, and seems easier for us to maintain.

For performance, there's still quite a bit of tuning that can be done in terms 
of when and how we delta.

-chris


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 22:03           ` Linus Torvalds
@ 2005-05-04 22:43             ` Chris Mason
  2005-05-05  3:25             ` Nicolas Pitre
  1 sibling, 0 replies; 32+ messages in thread
From: Chris Mason @ 2005-05-04 22:43 UTC (permalink / raw
  To: Linus Torvalds; +Cc: C. Scott Ananian, Nicolas Pitre, Alon Ziv, git

On Wednesday 04 May 2005 18:03, Linus Torvalds wrote:
> On Wed, 4 May 2005, Chris Mason wrote:
> > Yes, if Linus does take the patches, it's really important for people to
> > be able to easily continue without deltas/packing if they want.
>
> I'll happily take the patch and just not use the delta packing myself (at
> least until I trust it). But before I take the patch I want to make sure
> that people agree on it, and that it's been tested well enough that it
> won't cause people to corrupt their repositories.
>
> For example, I do _not_ want to be in the situation SVN is in, where if
> you corrupt your SVN database, you're totally screwed. There's a real
> advantage to not having fancy data structures or complicated consistency
> rules.

Fair enough ;)  I'm pretty flexible about most of the details, so I'm hoping 
it won't be hard to get a consensus.  The current code might be a little too 
simple in format (in the packed and delta headers), but so far it seems 
sufficient for git's needs.

The git-pack utility would probably be the best way to get other people to 
experiment and make suggestions, so I'll start there.

-chris


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 22:34         ` Chris Mason
@ 2005-05-05  3:10           ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-05  3:10 UTC (permalink / raw
  To: Chris Mason; +Cc: Geert Bosch, Linus Torvalds, Alon Ziv, git

On Wed, 4 May 2005, Chris Mason wrote:

> On Wednesday 04 May 2005 17:47, Geert Bosch wrote:
> >  From your tests it would seem that the zdelta version is the only one
> > to provide a uniform improvement over plain git. As it also seems the
> > simplest approach, I wonder why the consensus is that using xdiff
> > would be better?
> 
> zdelta seems to be a research project.  It does compress better than the xdiff 
> lib, but the speed improvements against xdiff(1) are probably because the 
> resulting tree is smaller.  I favor the xdiff code because it's so much 
> smaller, and seems easier for us to maintain.

Yep.  And compression can be improved without changing the decompressor, 
since the decompressor only replays what the compressor found to be 
redundant.  That redundancy searching can probably be improved wrt the 
current code.  And frankly, considering it takes about 300 lines of code 
to create a delta and 60 lines to expand it, this is hard to beat 
maintenance-wise.
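[Editorial note: the "replay" nature of delta expansion is easy to see in miniature. The sketch below uses a toy opcode encoding of my own, with one-byte offsets and lengths, not git's actual delta format, but the expander has the same shape: a copy op replays a range the compressor found in the reference, an insert op carries literal bytes.]

```c
#include <string.h>

/* Toy delta replay loop.  Encoding (illustrative only): an op byte
 * with the high bit set is a copy, followed by a one-byte offset and
 * one-byte length into the reference buffer; otherwise the op byte is
 * a literal count followed by that many bytes to insert verbatim. */
static unsigned long replay_delta(const unsigned char *ref,
				  const unsigned char *delta,
				  unsigned long delta_len,
				  unsigned char *out)
{
	unsigned long i = 0, o = 0;

	while (i < delta_len) {
		unsigned char op = delta[i++];
		if (op & 0x80) {		/* copy from reference */
			unsigned char off = delta[i++];
			unsigned char len = delta[i++];
			memcpy(out + o, ref + off, len);
			o += len;
		} else {			/* insert op literal bytes */
			memcpy(out + o, delta + i, op);
			i += op;
			o += op;
		}
	}
	return o;
}
```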

> For performance, there's still quite a bit of tuning that can be done in terms 
> of when and how we delta.

Indeed.


Nicolas


* Re: [PATCH] add the ability to create and retrieve delta objects
  2005-05-04 22:03           ` Linus Torvalds
  2005-05-04 22:43             ` Chris Mason
@ 2005-05-05  3:25             ` Nicolas Pitre
  1 sibling, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2005-05-05  3:25 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Chris Mason, C. Scott Ananian, Alon Ziv, git

On Wed, 4 May 2005, Linus Torvalds wrote:

> I'll happily take the patch and just not use the delta packing myself (at 
> least until I trust it). But before I take the patch I want to make sure 
> that people agree on it, and that it's been tested well enough that it 
> won't cause people to corrupt their repositories.

To that effect I'm adding knowledge of delta objects to fsck-cache, to 
verify that delta chains are fully reachable and that they all expand to 
the expected data.  This way it would be a good test to completely 
deltafy a whole repository, run fsck-cache to ensure everything is fine, 
and undeltafy it all to see if things are still sane.

> For example, I do _not_ want to be in the situation SVN is in, where if
> you corrupt your SVN database, you're totally screwed. There's a real
> advantage to not having fancy data structures or complicated consistency
> rules.

With deltas it is still a bit more risky by design, since many objects 
end up depending on a single one.  You lose that top object and the 
whole delta chain is gone.  But having the choice to use them or not is 
what makes the whole system flexible and suited to anyone's balance 
between robustness and disk space.  Converting back and forth is 
certainly not a problem with the git model.

And if you deltafy things such that the objects in your head tree are 
always at the top of a delta chain, then you're not badly affected if 
some intermediate delta objects are corrupted, since they are part of 
old trees only.
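[Editorial note: this orientation amounts to a base-selection rule: the newest revision stays whole, and every older one points at a newer neighbour. A purely illustrative sketch, with names of my own:]

```c
/* Sketch of the delta orientation described above.  revs are ordered
 * newest first; base[i] receives the index of the object revision i
 * should be deltified against, or -1 for the stand-alone head.  This
 * keeps head-tree objects at the top of every chain. */
static void pick_delta_bases(int nr_revs, int *base)
{
	int i;

	base[0] = -1;			/* head stays a whole object */
	for (i = 1; i < nr_revs; i++)
		base[i] = i - 1;	/* older rev points at a newer one */
}
```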


Nicolas


end of thread, other threads:[~2005-05-05  3:21 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-03  3:57 RFC: adding xdelta compression to git Alon Ziv
2005-05-03  4:12 ` Nicolas Pitre
2005-05-03  4:52 ` Linus Torvalds
2005-05-03  5:30   ` Davide Libenzi
2005-05-03 15:52     ` C. Scott Ananian
2005-05-03 17:35       ` Linus Torvalds
2005-05-03 18:10         ` Davide Libenzi
2005-05-03  8:06   ` [PATCH] add the ability to create and retrieve delta objects Nicolas Pitre
2005-05-03 11:24     ` Chris Mason
2005-05-03 12:51       ` Nicolas Pitre
2005-05-03 15:07       ` Linus Torvalds
2005-05-03 16:09         ` Chris Mason
2005-05-03 15:57       ` C. Scott Ananian
2005-05-03 16:35         ` Chris Mason
2005-05-03 14:13     ` Chris Mason
2005-05-03 14:24       ` Nicolas Pitre
2005-05-03 14:37         ` Chris Mason
2005-05-03 15:04           ` Nicolas Pitre
2005-05-03 16:54             ` Chris Mason
2005-05-03 14:48     ` Linus Torvalds
2005-05-03 15:52       ` Nicolas Pitre
2005-05-04 15:56     ` Chris Mason
2005-05-04 16:12       ` C. Scott Ananian
2005-05-04 17:44         ` Chris Mason
2005-05-04 22:03           ` Linus Torvalds
2005-05-04 22:43             ` Chris Mason
2005-05-05  3:25             ` Nicolas Pitre
2005-05-04 21:47       ` Geert Bosch
2005-05-04 22:34         ` Chris Mason
2005-05-05  3:10           ` Nicolas Pitre
2005-05-03 12:48   ` RFC: adding xdelta compression to git Dan Holmsand
2005-05-03 15:50   ` C. Scott Ananian
