git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/1] Diff-helper update
@ 2005-05-18  6:28 Junio C Hamano
  2005-05-18 15:41 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2005-05-18  6:28 UTC (permalink / raw)
  To: torvalds; +Cc: git, pasky

This is just a cover letter but the next patch implements the
rename detection I told you about.

The output from the patched version is attached at the end of
this message as a demonstration.

My plan is to get the documentation and the framework in place
with this patch first.  The basic strategy is to hold created
and deleted files while we parse the incoming diff-tree output,
and match them up at the end, looking for usefully similar pair.

The similarity evaluator included in this round of patch detects
exact renames only, which is not very useful in practice, but
that would be improved in the later round.  It will probably be
done with the same deltify code Nico is using.

$ git-diff-tree -r \
    13ab4462d2aefb252d7c916bd537151856b7c967 \
    99665af5c0be0fe4319b39183e84917993153576 | ./git-diff-helper -r
diff -git a/Documentation/diff-format.txt b/Documentation/diff-format.txt
--- a/Documentation/diff-format.txt
+++ b/Documentation/diff-format.txt
@@ -45,7 +45,7 @@ with a '-p' option, they do not produce 
 instead they produce a patch file.
 ...
diff -git a/diff.h b/diff.h
--- a/diff.h
+++ b/diff.h
@@ -17,7 +17,7 @@ extern void diff_change(unsigned mode1, 
 
 extern void diff_unmerge(const char *path);
 
-/* These are for diff-tree-helper */
+/* These are for diff-helper */
 
 struct diff_spec {
 	unsigned char blob_sha1[20];
diff -git a/diff-tree-helper.c b/diff-helper.c
rename old diff-tree-helper.c
rename new diff-helper.c
diff -git a/Documentation/git-diff-tree-helper.txt b/Documentation/git-diff-helper.txt
rename old Documentation/git-diff-tree-helper.txt
rename new Documentation/git-diff-helper.txt



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18  6:28 [PATCH 0/1] Diff-helper update Junio C Hamano
@ 2005-05-18 15:41 ` Linus Torvalds
  2005-05-18 17:58   ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2005-05-18 15:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, pasky



On Tue, 17 May 2005, Junio C Hamano wrote:
>
> This is just a cover letter but the next patch implements the
> rename detection I told you about.

Ok. I now have one more worry: git-diff-tree with "--stdin".

I actually find that thing to be surpremely useful, and I use it for not 
just my silly git-whatchanged script, but also to make release notes. I 
just do

	git-rev-tree HEAD ^LAST_RELEASE | cut -d' ' -f2 | git-diff-tree -v -s --stdin

and that's wonderful. And sometimes I use "-p" instead of "-s", because 
I'm not making release notes, but because I'm doing a "whatchanged" with 
full diffs.

In other words, try something like this on the kernel tree:

	git-whatchanged -p include/asm-i386 arch/i386 | less -S

and stare in wonder at just how _useful_ this simple thing is.

Maybe it's just that I've not used all that many different SCM systems,
and I've certainly not used all the features of the ones I _have_ used, so
maybe I never knew, but dammit, I've never seen anybody else do something
quite that useful (at least doing it fast enough that it _remains_
useful).

(Replace "arch/i386" with "drivers/usb" or "fs/ext3" or whatever,
depending on just what your area of interest happens to be).

In other words, I'm very happy with git.

However, git-diff-helper doesn't understand these things, and the builtin
diff doesn't do the rename thing. Yet it would be very very useful to do.

Now, if you do just a

	git-whatchanged include/asm-i386 arch/i386

(or you can even use "-z" if you want to), it turns out hat git-diff-tree 
actually does output perfectly usable material. Each "set of diffs" is 
clearly separated, and there is no ambiguos material: all lines are 
either:
 - empty or start with a whitespace
 - start with "diff-tree ", "Author: " or "Date: "
 - are valid input for diff-helper

So what I'd suggest is one (or both) of two possibilities:
 - make the internal diff logic also able to do the same rename handling 
   as the external diff-helper. This may or may not be complex, I've not 
   looked at it.
 - change diff-helper subtly: instead of printing "cannot parse %s", any
   nonrecognized line would be a "ignore this line, but process all
   pending potential renames".

The above would mean that I could either just do

	git-whatchanged include/asm-i386 arch/i386 | git-diff-helper

or continue to use

	git-whatchanged -p include/asm-i386 arch/i386

and still get the "nice" output (and it's a _feature_ that a rename within
the arch/i386 then shows up as a rename, but a rename that crosses the
boundary shows up as a "create" or "delete" event).

Comments? 

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 15:41 ` Linus Torvalds
@ 2005-05-18 17:58   ` Junio C Hamano
  2005-05-18 18:14     ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2005-05-18 17:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, pasky

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> However, git-diff-helper doesn't understand these things, and the builtin
LT> diff doesn't do the rename thing. Yet it would be very very useful to do.

It is unclear what you meant by "these things" in "doesn't
understand these things", and what you meant by "it" in "it
would be very very useful to do."  Could you explain?

About the built-in diff not doing the rename , I have a bit
longer term (knowing _my_ timescale I'd imagine you would
understand that is not that long ;-) plan to have -p option for
diff-* family to use the same rename detection logic that I
added to diff-helper in the patch you are commenting on.  It
involves slight change to callers (three diff-* family main
programs) to add a call to tell the diff driver "I've given you
all the diffs, now go look for renames") at the end, and the
rest is changes to what diff.c does internally.

LT> So what I'd suggest is one (or both) of two possibilities:
LT>  - make the internal diff logic also able to do the same rename handling 
LT>    as the external diff-helper. This may or may not be complex, I've not 
LT>    looked at it.

Yes that is part of the plan.  I wanted to do things in these
steps:

  - Put rename detect in helper so screwups there would not
    impact the diff-* family's built-in output, with the initial
    dumb rename detection.

  - Improve rename detection still keeping the logic and
    machinery in diff-helper only.  I expect a heuristics
    similar to the one you posted on the deltification thread
    would work nicely here as well.

  - Straighten out the GIT_EXTERNAL_DIFF interface so that it
    can also express renames (the patch I sent currently punts
    there).  They will get eighth argument (rename destination)
    only when they are being fed a rename patch.
    git-apply-patch-script needs to be adjusted for this change.

  - Change diff-helper not to do the rename detection itself,
    but clean it up so it uses the same diff_addremove(),
    diff_change(), and diff_unmerge() interface the diff-*
    family use.  Change the implementation of these three
    functions so that they do not directly call
    run_external_diff() but pool changes for later matching when
    rename detection is in effect.  Add diff_finished() which
    would flush the rename candidate pools, and call that at the
    end of program from three diff-* family and diff-helper.
    The rename detection logic in diff-helper will be moved to
    this "inspect the pooled rename candidates, match them up
    and flush" part.

The patch I sent is the first step in the above sequence.

LT>  - change diff-helper subtly: instead of printing "cannot parse %s", any
LT>    nonrecognized line would be a "ignore this line, but process all
LT>    pending potential renames".

Once the built-in diff driver is straightened out the way I
outlined above, this change may turn out to be unnecessary, I
need to look at the whatchanged output and think a bit more
about this later today.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 17:58   ` Junio C Hamano
@ 2005-05-18 18:14     ` Linus Torvalds
  2005-05-18 18:38       ` Linus Torvalds
  2005-05-19  2:05       ` [preview] diff-helper rename detection Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2005-05-18 18:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, pasky



On Wed, 18 May 2005, Junio C Hamano wrote:
> >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
> 
> LT> However, git-diff-helper doesn't understand these things, and the builtin
> LT> diff doesn't do the rename thing. Yet it would be very very useful to do.
> 
> It is unclear what you meant by "these things" in "doesn't
> understand these things", and what you meant by "it" in "it
> would be very very useful to do."  Could you explain?

"These things" being the extra output from "diff-tree" that is not a "diff 
line".

If diff-helper just passes the lines it doesn't understand through
unmodified (_after_ having handled any pending rename logic), it will 
automatically do the right thing.

> About the built-in diff not doing the rename , I have a bit
> longer term (knowing _my_ timescale I'd imagine you would
> understand that is not that long ;-) plan to have -p option for
> diff-* family to use the same rename detection logic that I
> added to diff-helper in the patch you are commenting on.

Goodie. I was hoping that was the case, but felt that the diff-helper 
thing should be pretty easy to do.

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 18:14     ` Linus Torvalds
@ 2005-05-18 18:38       ` Linus Torvalds
  2005-05-18 18:52         ` Thomas Glanzmann
  2005-05-18 20:30         ` Junio C Hamano
  2005-05-19  2:05       ` [preview] diff-helper rename detection Junio C Hamano
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2005-05-18 18:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, pasky



On Wed, 18 May 2005, Linus Torvalds wrote:
> 
> If diff-helper just passes the lines it doesn't understand through
> unmodified (_after_ having handled any pending rename logic), it will 
> automatically do the right thing.

I took the liberty of doing just that. The only subtle issue was that 
the strbuf functions would consider an empty line to be EOF, which looked 
wrong and unintentional. Fixing that made the actual diff-helper changes 
totally trivial, and I can now do

	git-rev-list HEAD | git-diff-tree -r -v --stdin | ./git-diff-helper -r | less -S

and it does the right thing for me.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 18:38       ` Linus Torvalds
@ 2005-05-18 18:52         ` Thomas Glanzmann
  2005-05-18 20:30         ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Thomas Glanzmann @ 2005-05-18 18:52 UTC (permalink / raw)
  To: git

Hello,

> 	git-rev-list HEAD | git-diff-tree -r -v --stdin | ./git-diff-helper -r | less -S

nice one!

	Thomas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 18:38       ` Linus Torvalds
  2005-05-18 18:52         ` Thomas Glanzmann
@ 2005-05-18 20:30         ` Junio C Hamano
  2005-05-18 20:39           ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2005-05-18 20:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, pasky

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> I took the liberty of doing just that. The only subtle issue was that 
LT> the strbuf functions would consider an empty line to be EOF, which looked 
LT> wrong and unintentional. Fixing that made the actual diff-helper changes 
LT> totally trivial, and I can now do
LT> 	git-rev-list HEAD | git-diff-tree -r -v --stdin | ./git-diff-helper -r | less -S
LT> and it does the right thing for me.

Thanks for fixing up strbuf.

@@ -136,8 +268,12 @@ int main(int ac, const char **av) {
 		if (sb.eof)
 			break;
 		status = parse_diff_raw_output(sb.buf, av+1, ac-1, reverse);
-		if (status)
-			fprintf(stderr, "cannot parse %s\n", sb.buf);
+		if (status) {
+			flush_renames(av+1, ac-1, reverse);
+			printf("%s%c", sb.buf, line_termination);
+		}
 	}
+
+	flush_renames(av+1, ac-1, reverse);
 	return 0;
 }

I suspect doing something like this might be saner instead,
assuming non raw-diffs come at the end.  

		if (status)
			break;
	}
	flush_renames(av+1, ac-1, reverse);
	if (!sb.eof) {
        	spit out what we have in sb.eof, sendfile ;-) the
                rest of the input to the output.
	}
	return 0;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 20:30         ` Junio C Hamano
@ 2005-05-18 20:39           ` Linus Torvalds
  2005-05-18 23:54             ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2005-05-18 20:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, pasky



On Wed, 18 May 2005, Junio C Hamano wrote:
> 
> I suspect doing something like this might be saner instead,
> assuming non raw-diffs come at the end.  

It won't ever trigger, since we only exit the loop once we see EOF.

So the non-diffs at the end will trigger by the exact same "oh, we didn't 
recognize it" logic.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 20:39           ` Linus Torvalds
@ 2005-05-18 23:54             ` Junio C Hamano
  2005-05-19 11:11               ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2005-05-18 23:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, pasky

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On Wed, 18 May 2005, Junio C Hamano wrote:
>> 
>> I suspect doing something like this might be saner instead,
>> assuming non raw-diffs come at the end.  

LT> It won't ever trigger, since we only exit the loop once we see EOF.

I was not talking about correctness, but the readability of the
code.  Breaking out from the loop to process raw-diff and
switching to straight copy would make our intent more explicit.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [preview] diff-helper rename detection.
  2005-05-18 18:14     ` Linus Torvalds
  2005-05-18 18:38       ` Linus Torvalds
@ 2005-05-19  2:05       ` Junio C Hamano
  2005-05-19  3:01         ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2005-05-19  2:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

>> About the built-in diff not doing the rename , I have a bit
>> longer term (knowing _my_ timescale I'd imagine you would
>> understand that is not that long ;-) plan to have -p option for
>> diff-* family to use the same rename detection logic that I
>> added to diff-helper in the patch you are commenting on.

LT> Goodie. I was hoping that was the case, but felt that the diff-helper 
LT> thing should be pretty easy to do.

This is not yet a request for inclusion but is a preview
requesting for testing and comments, especially from Linus (for
general usability comments and suggestions for cut-off-points in
heuristics) and Nico (in case I had blatantly misused the
diff-delta interface).

I stole diff-delta part from the delta support patch (but I
dropped the part that actually touches sha1_file part, hence
this does not introduce the deltified object store), and used it
as an "extent-of-damage estimator".  The rename detector logic
in the previous round was detecting exact renames only (and did
not even look at the filesystem so the raw-diff could not come
from diff-files or diff-cache without --cached), but this round
it actually checks the content.  The interim heuristics is:

  - If exact match in SHA1, or exact match comparing the files,
    then that is a rename (trivial);

  - If size changed more than 20% that cannot be a rename;

  - If delta produced by the deltification stuff between two is
    more than 20% of the original, that cannot be a rename;

  - Otherwise pick the deleted-created file pair that has the
    smallest delta between them.

I've only very lightly tested it but it seems to do the right
thing.  I fed soemthing like this when I had heavily modified
diff-helper.c (diff-helpee.c was taken from the git-ls-files
output for diff-helper.c from the cache) and lightly modified
diff.c (diffo.c was from the cache), and diffo.c -> diff.c were
reported as rename, while helpee -> helper was not.

-100644	blob	cde27275fad8103084d7ed2d08d246ba4ce6eb9c	Makefile
+100644	blob	cde27275fad8103084d7ed2d08d246ba4ce6eb9c	GNUMakefile
-100644	blob	2877ddc4df85179c03cdcc8c7a66831951ec5d97	diff-helpee.c
+100644	blob	0000000000000000000000000000000000000000	diff-helper.c
-100644	blob	74004e5a3f9fa491b30ab3d5f231826593e4eae4	diffo.c
+100644	blob	0000000000000000000000000000000000000000	diff.c

Not-signed-off-yet-by: Junio C Hamano <junkio@cox.net>
---
# - linus: merge-base: use the new lookup_commit_reference() helper function
# + 9: diff-helper detects renames with Nico's delta stuff.
diff -git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -36,7 +36,7 @@ install: $(PROG) $(SCRIPTS)
 	$(INSTALL) $(PROG) $(SCRIPTS) $(dest)$(bin)
 
 LIB_OBJS=read-cache.o sha1_file.o usage.o object.o commit.o tree.o blob.o \
-	 tag.o date.o
+	 tag.o date.o diff-delta.o
 LIB_FILE=libgit.a
 LIB_H=cache.h object.h blob.h tree.h commit.h tag.h
 
diff -git a/delta.h b/delta.h
new file mode 100644
--- /dev/null
+++ b/delta.h
@@ -0,0 +1,6 @@
+extern void *diff_delta(void *from_buf, unsigned long from_size,
+			void *to_buf, unsigned long to_size,
+		        unsigned long *delta_size);
+extern void *patch_delta(void *src_buf, unsigned long src_size,
+			 void *delta_buf, unsigned long delta_size,
+			 unsigned long *dst_size);
diff -git a/diff-delta.c b/diff-delta.c
new file mode 100644
--- /dev/null
+++ b/diff-delta.c
@@ -0,0 +1,330 @@
+/*
+ * diff-delta.c: generate a delta between two buffers
+ *
+ *  Many parts of this file have been lifted from LibXDiff version 0.10.
+ *  http://www.xmailserver.org/xdiff-lib.html
+ *
+ *  LibXDiff was written by Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2003	Davide Libenzi
+ *
+ *  Many mods for GIT usage by Nicolas Pitre <nico@cam.org>, (C) 2005.
+ *
+ *  This file is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ */
+
+#include <stdlib.h>
+#include "delta.h"
+
+
+/* block size: min = 16, max = 64k, power of 2 */
+#define BLK_SIZE 16
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+#define GR_PRIME 0x9e370001
+#define HASH(v, b) (((unsigned int)(v) * GR_PRIME) >> (32 - (b)))
+	
+/* largest prime smaller than 65536 */
+#define BASE 65521
+
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+#define NMAX 5552
+
+#define DO1(buf, i)  { s1 += buf[i]; s2 += s1; }
+#define DO2(buf, i)  DO1(buf, i); DO1(buf, i + 1);
+#define DO4(buf, i)  DO2(buf, i); DO2(buf, i + 2);
+#define DO8(buf, i)  DO4(buf, i); DO4(buf, i + 4);
+#define DO16(buf)    DO8(buf, 0); DO8(buf, 8);
+
+static unsigned int adler32(unsigned int adler, const unsigned char *buf, int len)
+{
+	int k;
+	unsigned int s1 = adler & 0xffff;
+	unsigned int s2 = adler >> 16;
+
+	while (len > 0) {
+		k = MIN(len, NMAX);
+		len -= k;
+		while (k >= 16) {
+			DO16(buf);
+			buf += 16;
+			k -= 16;
+		}
+		if (k != 0)
+			do {
+				s1 += *buf++;
+				s2 += s1;
+			} while (--k);
+		s1 %= BASE;
+		s2 %= BASE;
+	}
+
+	return (s2 << 16) | s1;
+}
+
+static unsigned int hashbits(unsigned int size)
+{
+	unsigned int val = 1, bits = 0;
+	while (val < size && bits < 32) {
+		val <<= 1;
+	       	bits++;
+	}
+	return bits ? bits: 1;
+}
+
+typedef struct s_chanode {
+	struct s_chanode *next;
+	int icurr;
+} chanode_t;
+
+typedef struct s_chastore {
+	chanode_t *head, *tail;
+	int isize, nsize;
+	chanode_t *ancur;
+	chanode_t *sncur;
+	int scurr;
+} chastore_t;
+
+static void cha_init(chastore_t *cha, int isize, int icount)
+{
+	cha->head = cha->tail = NULL;
+	cha->isize = isize;
+	cha->nsize = icount * isize;
+	cha->ancur = cha->sncur = NULL;
+	cha->scurr = 0;
+}
+
+static void *cha_alloc(chastore_t *cha)
+{
+	chanode_t *ancur;
+	void *data;
+
+	ancur = cha->ancur;
+	if (!ancur || ancur->icurr == cha->nsize) {
+		ancur = malloc(sizeof(chanode_t) + cha->nsize);
+		if (!ancur)
+			return NULL;
+		ancur->icurr = 0;
+		ancur->next = NULL;
+		if (cha->tail)
+			cha->tail->next = ancur;
+		if (!cha->head)
+			cha->head = ancur;
+		cha->tail = ancur;
+		cha->ancur = ancur;
+	}
+
+	data = (void *)ancur + sizeof(chanode_t) + ancur->icurr;
+	ancur->icurr += cha->isize;
+	return data;
+}
+
+static void cha_free(chastore_t *cha)
+{
+	chanode_t *cur = cha->head;
+	while (cur) {
+		chanode_t *tmp = cur;
+		cur = cur->next;
+		free(tmp);
+	}
+}
+
+typedef struct s_bdrecord {
+	struct s_bdrecord *next;
+	unsigned int fp;
+	const unsigned char *ptr;
+} bdrecord_t;
+
+typedef struct s_bdfile {
+	const unsigned char *data, *top;
+	chastore_t cha;
+	unsigned int fphbits;
+	bdrecord_t **fphash;
+} bdfile_t;
+
+static int delta_prepare(const unsigned char *buf, int bufsize, bdfile_t *bdf)
+{
+	unsigned int fphbits;
+	int i, hsize;
+	const unsigned char *base, *data, *top;
+	bdrecord_t *brec;
+	bdrecord_t **fphash;
+
+	fphbits = hashbits(bufsize / BLK_SIZE + 1);
+	hsize = 1 << fphbits;
+	fphash = malloc(hsize * sizeof(bdrecord_t *));
+	if (!fphash)
+		return -1;
+	for (i = 0; i < hsize; i++)
+		fphash[i] = NULL;
+	cha_init(&bdf->cha, sizeof(bdrecord_t), hsize / 4 + 1);
+
+	bdf->data = data = base = buf;
+	bdf->top = top = buf + bufsize;
+	data += (bufsize / BLK_SIZE) * BLK_SIZE;
+	if (data == top)
+		data -= BLK_SIZE;
+
+	for ( ; data >= base; data -= BLK_SIZE) {
+		brec = cha_alloc(&bdf->cha);
+		if (!brec) {
+			cha_free(&bdf->cha);
+			free(fphash);
+			return -1;
+		}
+		brec->fp = adler32(0, data, MIN(BLK_SIZE, top - data));
+		brec->ptr = data;
+		i = HASH(brec->fp, fphbits);
+		brec->next = fphash[i];
+		fphash[i] = brec;
+	}
+
+	bdf->fphbits = fphbits;
+	bdf->fphash = fphash;
+
+	return 0;
+}
+
+static void delta_cleanup(bdfile_t *bdf)
+{
+	free(bdf->fphash);
+	cha_free(&bdf->cha);
+}
+
+#define COPYOP_SIZE(o, s) \
+    (!!(o & 0xff) + !!(o & 0xff00) + !!(o & 0xff0000) + !!(o & 0xff000000) + \
+     !!(s & 0xff) + !!(s & 0xff00) + 1)
+
+void *diff_delta(void *from_buf, unsigned long from_size,
+		 void *to_buf, unsigned long to_size,
+		 unsigned long *delta_size)
+{
+	int i, outpos, outsize, inscnt, csize, msize, moff;
+	unsigned int fp;
+	const unsigned char *data, *top, *ptr1, *ptr2;
+	unsigned char *out, *orig;
+	bdrecord_t *brec;
+	bdfile_t bdf;
+
+	if (!from_size || !to_size || delta_prepare(from_buf, from_size, &bdf))
+		return NULL;
+	
+	outpos = 0;
+	outsize = 8192;
+	out = malloc(outsize);
+	if (!out) {
+		delta_cleanup(&bdf);
+		return NULL;
+	}
+
+	data = to_buf;
+	top = to_buf + to_size;
+
+	/* store reference buffer size */
+	orig = out + outpos++;
+	*orig = i = 0;
+	do {
+		if (from_size & 0xff) {
+			*orig |= (1 << i);
+			out[outpos++] = from_size;
+		}
+		i++;
+		from_size >>= 8;
+	} while (from_size);
+
+	/* store target buffer size */
+	orig = out + outpos++;
+	*orig = i = 0;
+	do {
+		if (to_size & 0xff) {
+			*orig |= (1 << i);
+			out[outpos++] = to_size;
+		}
+		i++;
+		to_size >>= 8;
+	} while (to_size);
+
+	inscnt = 0;
+	moff = 0;
+	while (data < top) {
+		msize = 0;
+		fp = adler32(0, data, MIN(top - data, BLK_SIZE));
+		i = HASH(fp, bdf.fphbits);
+		for (brec = bdf.fphash[i]; brec; brec = brec->next) {
+			if (brec->fp == fp) {
+				csize = bdf.top - brec->ptr;
+				if (csize > top - data)
+					csize = top - data;
+				for (ptr1 = brec->ptr, ptr2 = data; 
+				     csize && *ptr1 == *ptr2;
+				     csize--, ptr1++, ptr2++);
+
+				csize = ptr1 - brec->ptr;
+				if (csize > msize) {
+					moff = brec->ptr - bdf.data;
+					msize = csize;
+					if (msize >= 0x10000) {
+						msize = 0x10000;
+						break;
+					}
+				}
+			}
+		}
+
+		if (!msize || msize < COPYOP_SIZE(moff, msize)) {
+			if (!inscnt)
+				outpos++;
+			out[outpos++] = *data++;
+			inscnt++;
+			if (inscnt == 0x7f) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+		} else {
+			if (inscnt) {
+				out[outpos - inscnt - 1] = inscnt;
+				inscnt = 0;
+			}
+
+			data += msize;
+			orig = out + outpos++;
+			i = 0x80;
+
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+			moff >>= 8;
+			if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
+			msize >>= 8;
+			if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
+
+			*orig = i;
+		}
+
+		/* next time around the largest possible output is 1 + 4 + 3 */
+		if (outpos > outsize - 8) {
+			void *tmp = out;
+			outsize = outsize * 3 / 2;
+			out = realloc(out, outsize);
+			if (!out) {
+				free(tmp);
+				delta_cleanup(&bdf);
+				return NULL;
+			}
+		}
+	}
+
+	if (inscnt)
+		out[outpos - inscnt - 1] = inscnt;
+
+	delta_cleanup(&bdf);
+	*delta_size = outpos;
+	return out;
+}
diff -git a/diff-helper.c b/diff-helper.c
--- a/diff-helper.c
+++ b/diff-helper.c
@@ -5,6 +5,7 @@
 #include "cache.h"
 #include "strbuf.h"
 #include "diff.h"
+#include "delta.h"
 
 static int matches_pathspec(const char *name, const char **spec, int cnt)
 {
@@ -31,8 +32,14 @@ static int detect_rename = 0;
 
 static struct diff_spec_hold {
 	struct diff_spec_hold *next;
-	struct diff_spec_hold *matched;
-	struct diff_spec old, new;
+	struct diff_spec old;
+	struct diff_spec new;
+	unsigned long size;
+	int flags;
+#define MATCHED 1
+#define SHOULD_FREE 2
+#define SHOULD_MUNMAP 4
+	void *data;
 	char path[1];
 } *createdfile, *deletedfile;
 
@@ -47,101 +54,199 @@ static void hold_spec(const char *path,
 	*list = elem;
 	elem->old = *old;
 	elem->new = *new;
-	elem->matched = 0;
+	elem->size = 0;
+	elem->data = NULL;
+	elem->flags = 0;
+}
+
+/* diff_spec sha1_valid flag tells us if mode is to be trusted
+ * and if blob_sha1 is not null_sha1 then that is also to be
+ * trusted.  Here we need to know if we need to look at the file
+ * system.
+ */
+static int look_at_fs(struct diff_spec *s)
+{
+	static const unsigned char null_sha1[20] = { 0, };
+	return (!s->sha1_valid ||
+		!memcmp(null_sha1, s->blob_sha1, sizeof(null_sha1)));
+}
+
+static int populate_data(struct diff_spec_hold *s, int use_old)
+{
+	char type[20];
+	struct diff_spec *u = use_old ? &(s->old) : &(s->new);
+
+	if (!look_at_fs(u)) {
+		s->data = read_sha1_file(u->blob_sha1, type, &s->size);
+		s->flags |= SHOULD_FREE;
+	}
+	else {
+		struct stat st;
+		int fd;
+		fd = open(s->path, O_RDONLY);
+		if (fd < 0)
+			return -1;
+		if (fstat(fd, &st)) {
+			close(fd);
+			return -1;
+		}
+		s->size = st.st_size;
+		s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
+		close(fd);
+		if (!s->size)
+			s->data = "";
+		else
+			s->flags |= SHOULD_MUNMAP;
+	}
+	return 0;
+}
+
+static void free_data(struct diff_spec_hold *s)
+{
+	if (s->flags & SHOULD_FREE)
+		free(s->data);
+	else if (s->flags & SHOULD_MUNMAP)
+		munmap(s->data, s->size);
+	s->flags &= ~(SHOULD_FREE|SHOULD_MUNMAP);
+}
+
+static void flush_rest(struct diff_spec_hold *elem,
+		       const char **spec, int cnt, int reverse)
+{
+	for ( ; elem ; elem = elem->next) {
+		free_data(elem);
+		if (elem->flags & MATCHED)
+			continue;
+		if (!cnt ||
+		    matches_pathspec(elem->path, spec, cnt)) {
+			if (reverse)
+				run_external_diff(elem->path, NULL,
+						  &elem->new, &elem->old);
+			else
+				run_external_diff(elem->path, NULL,
+						  &elem->old, &elem->new);
+		}
+	}
 }
 
-#define MINIMUM_SCORE 7000
-int estimate_similarity(struct diff_spec *one, struct diff_spec *two)
+#define EXACT_MATCH  10000
+#define MINIMUM_SCORE 5000
+int estimate_similarity(struct diff_spec_hold *src,
+			struct diff_spec_hold *dst,
+			int expect)
 {
-	/* Return how similar they are, representing the score as an
-	 * integer between 0 and 10000.
+	/* src points at a deleted file (src->old) and dst points at
+	 * a created file (dst->new).  They may be quite similar in which
+	 * case we want to say src->old is renamed to dst->new.
 	 *
-	 * This version is very dumb and detects exact matches only.
-	 * Wnen Nico's delta stuff gets in, I'll use the delta
-	 * algorithm to estimate the similarity score in core.
+	 * Compare them and return how similar they are, representing
+	 * the score as an integer between 0 and 10000.  10000 is
+	 * reserved for the case where they match exactly.
 	 */
+	void *delta;
+	unsigned long delta_size;
 
-	if (one->sha1_valid && two->sha1_valid &&
-	    !memcmp(one->blob_sha1, two->blob_sha1, 20))
+	if (!look_at_fs(&(src->old)) && !look_at_fs(&(dst->new)) &&
+	    !memcmp(src->old.blob_sha1, dst->new.blob_sha1, 20))
 		return 10000;
-	return 0;
+	if (populate_data(src, 1) || populate_data(dst, 0))
+		/* this is an error but will be caught downstream */
+		return 0;
+	if (src->size == dst->size &&
+	    !memcmp(src->data, dst->data, src->size))
+		return 10000;
+
+	if (expect == EXACT_MATCH)
+		return 0;
+
+	delta_size = ((src->size < dst->size) ?
+		      (dst->size - src->size) : (src->size - dst->size));
+
+	/* We would not consider rename followed by more than
+	 * 20% edits; that is, delta_size must be smaller than
+	 * (src->size + dst->size)/2 * 0.2, which means...
+	 */
+	if ((src->size + dst->size) < delta_size * 10)
+		return 0;
+
+	delta = diff_delta(src->data, src->size,
+			   dst->data, dst->size,
+			   &delta_size);
+	free(delta);
+
+	/* This "delta" is really xdiff with adler32 and all the
+	 * overheads but it is a quick and dirty approximation.
+	 * Again, we say there should be less than 20% edit.
+	 */
+	if ((src->size + dst->size) < delta_size * 10)
+		return 0;
+
+	/* Now we will give some score to it.  Let's say 20% edit gets
+	 * 5000 points and 0% edit gets 9000 points.  That is, every
+	 * 1/20000 edit gets 1 point penalty.  The amount of penalty is:
+	 *
+	 * (delta_size * 2 / (src->size + dst->size)) * 20000
+	 *
+	 */
+	return 9000 - (40000 * delta_size / (src->size+dst->size));
 }
 
-static void flush_renames(const char **spec, int cnt, int reverse)
+static void flush_rename_pairs(int floor_score,
+			       const char **spec, int cnt, int reverse)
 {
-	struct diff_spec_hold *rename_src, *rename_dst, *elem;
-	struct diff_spec_hold *leftover = NULL;
+	struct diff_spec_hold *src, *dst, *elem;
 	int score, best_score;
 
-	while (createdfile) {
-		rename_dst = createdfile;
-		createdfile = rename_dst->next;
-		best_score = MINIMUM_SCORE;
-		rename_src = NULL;
-		for (elem = deletedfile;
-		     elem;
-		     elem = elem->next) {
-			if (elem->matched)
+	for (dst = createdfile; dst; dst = dst->next) {
+		if (dst->flags & MATCHED)
+			continue;
+
+		best_score = floor_score;
+		src = NULL;
+		for (elem = deletedfile; elem; elem = elem->next) {
+			if (elem->flags & MATCHED)
 				continue;
-			score = estimate_similarity(&elem->old,
-						    &rename_dst->new);
-			if (best_score < score) {
-				rename_src = elem;
+			score = estimate_similarity(elem, dst, best_score);
+			if (best_score <= score) {
+				src = elem;
 				best_score = score;
 			}
 		}
-		if (rename_src) {
-			rename_src->matched = rename_dst;
-			rename_dst->matched = rename_src;
+		if (src) {
+			src->flags |= MATCHED;
+			dst->flags |= MATCHED;
+			free_data(src);
+			free_data(dst);
 
 			if (!cnt ||
-			    matches_pathspec(rename_src->path, spec, cnt) ||
-			    matches_pathspec(rename_dst->path, spec, cnt)) {
+			    matches_pathspec(src->path, spec, cnt) ||
+			    matches_pathspec(dst->path, spec, cnt)) {
 				if (reverse)
-					run_external_diff(rename_dst->path,
-							  rename_src->path,
-							  &rename_dst->new,
-							  &rename_src->old);
+					run_external_diff(dst->path,
+							  src->path,
+							  &dst->new,
+							  &src->old);
 				else
-					run_external_diff(rename_src->path,
-							  rename_dst->path,
-							  &rename_src->old,
-							  &rename_dst->new);
+					run_external_diff(src->path,
+							  dst->path,
+							  &src->old,
+							  &dst->new);
 			}
 		}
-		else {
-			rename_dst->next = leftover;
-			leftover = rename_dst;
-		}
 	}
+}
 
-	/* unmatched deletes */
-	for (elem = deletedfile; elem; elem = elem->next) {
-		if (elem->matched)
-			continue;
-		if (!cnt ||
-		    matches_pathspec(elem->path, spec, cnt)) {
-			if (reverse)
-				run_external_diff(elem->path, NULL,
-						  &elem->new, &elem->old);
-			else
-				run_external_diff(elem->path, NULL,
-						  &elem->old, &elem->new);
-		}
-	}
+static void flush_renames(const char **spec, int cnt, int reverse)
+{
+	/* We do this in two steps; first we pick up the exact matches
+	 * only and then closest match
+	 */
 
-	/* unmatched creates */
-	for (elem = leftover; elem; elem = elem->next) {
-		if (!cnt ||
-		    matches_pathspec(elem->path, spec, cnt)) {
-			if (reverse)
-				run_external_diff(elem->path, NULL,
-						  &elem->new, &elem->old);
-			else
-				run_external_diff(elem->path, NULL,
-						  &elem->old, &elem->new);
-		}
-	}
+	flush_rename_pairs(EXACT_MATCH, spec, cnt, reverse);
+	flush_rename_pairs(MINIMUM_SCORE, spec, cnt, reverse);
+
+	flush_rest(createdfile, spec, cnt, reverse);
+	flush_rest(deletedfile, spec, cnt, reverse);
 }
 
 static int parse_oneside_change(const char *cp, struct diff_spec *one,
diff -git a/diff.c b/diff.c
--- a/diff.c
+++ b/diff.c
@@ -329,7 +329,7 @@ void run_external_diff(const char *name,
 		die("unable to fork");
 	if (!pid) {
 		const char *pgm = external_diff();
-		/* not passing rename patch to external ones */
+		/* NEEDSWORK: not passing rename patch to external ones */
 		if (!other && pgm) {
 			if (one && two)
 				execlp(pgm, pgm,


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [preview] diff-helper rename detection.
  2005-05-19  2:05       ` [preview] diff-helper rename detection Junio C Hamano
@ 2005-05-19  3:01         ` Linus Torvalds
  2005-05-19  3:08           ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2005-05-19  3:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List



On Wed, 18 May 2005, Junio C Hamano wrote:
> 
> This is not yet a request for inclusion but is a preview
> requesting for testing and comments, especially from Linus (for
> general usability comments and suggestions for cut-off-points in
> heuristics) and Nico (in case I had blatantly misused the
> diff-delta interface).

I think this is great, although I also bet that we'll tweak the heuristics
later.

I suspect that one tweak may be to try to find the "best" rename first,
rather than look at files in the order they were discovered (ie you'd
create a _matrix_ of the delta scores, rather than walk through the newly
created files in order). That would then potentially allow for a more
relaxed definitions of "rename", without any possibility of losing sight
of a better rename due to finding a bad one first.

But I think the simple heurstic is probably fine as a first cut.

So my only nit so far is that you declare "patch_delta()", even though you
don't actually have it. Just remove it.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [preview] diff-helper rename detection.
  2005-05-19  3:01         ` Linus Torvalds
@ 2005-05-19  3:08           ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2005-05-19  3:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> I suspect that one tweak may be to try to find the "best" rename first,
LT> rather than look at files in the order they were discovered (ie you'd
LT> create a _matrix_ of the delta scores, rather than walk through the newly
LT> created files in order). That would then potentially allow for a more
LT> relaxed definitions of "rename", without any possibility of losing sight
LT> of a better rename due to finding a bad one first.

I thought about that but I opted for simplicity of first picking
the exact matches.  I was unsure about the usefulness of relaxed
matching to begin with.  But I now realize that to see how useful
that would be we need to experiment it and a matrix approach
would become necessity for that.

LT> So my only nit so far is that you declare "patch_delta()",
LT> even though you don't actually have it. Just remove it.

I'd rather keep that file as is for later merge with Nico.  It
came from the delitification patch and I tried to keep the new
files as they are.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] Diff-helper update
  2005-05-18 23:54             ` Junio C Hamano
@ 2005-05-19 11:11               ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2005-05-19 11:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, pasky

>>>>> "JCH" == Junio C Hamano <junkio@cox.net> writes:

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
LT> On Wed, 18 May 2005, Junio C Hamano wrote:
>>> 
>>> I suspect doing something like this might be saner instead,
>>> assuming non raw-diffs come at the end.  

LT> It won't ever trigger, since we only exit the loop once we see EOF.

JCH> I was not talking about correctness, but the readability of the
JCH> code.  Breaking out from the loop to process raw-diff and
JCH> switching to straight copy would make our intent more explicit.

I was completely mistaken about what you were talking about.
Your output (helper's input) is (<non-diff material>* <raw-diff
material>*)*, so of course we should not break out of the main
loop.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2005-05-19 11:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-18  6:28 [PATCH 0/1] Diff-helper update Junio C Hamano
2005-05-18 15:41 ` Linus Torvalds
2005-05-18 17:58   ` Junio C Hamano
2005-05-18 18:14     ` Linus Torvalds
2005-05-18 18:38       ` Linus Torvalds
2005-05-18 18:52         ` Thomas Glanzmann
2005-05-18 20:30         ` Junio C Hamano
2005-05-18 20:39           ` Linus Torvalds
2005-05-18 23:54             ` Junio C Hamano
2005-05-19 11:11               ` Junio C Hamano
2005-05-19  2:05       ` [preview] diff-helper rename detection Junio C Hamano
2005-05-19  3:01         ` Linus Torvalds
2005-05-19  3:08           ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).